What is the best way to split a string into an array of Unicode characters in PHP?

Question

In PHP, what is the best way to split a string into an array of Unicode characters? What if the input is not necessarily UTF-8? I want to know whether the set of Unicode characters in an input string is a subset of another set of Unicode characters. Why not run straight for the mb_ family of functions, as the firs…

Accepted Answer

You could use the 'u' modifier with PCRE regex; see Pattern Modifiers (quoting):

"u (PCRE_UTF8): This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5."

For instance, considering this code:

    header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
    $str = "abc 文字化け, efg";
    $results = array();
    preg_match_all('/./', $str, $results);
    var_dump($results[0]);

You'll get an unusable result:

    array
      0 => string 'a' (length=1)
      1 => string 'b' (length=1)
      2 => string 'c' (length=1)
      3 => string ' ' (length=1)
      4 => string '�' (length=1)
      5 => string '�' (length=1)
      6 => string '�' (length=1)
      7 => string '�' (length=1)
      8 => string '�' (length=1)
      9 => string '�' (length=1)
      10 => string '�' (length=1)
      11 => string '�' (length=1)
      12 => string '�' (length=1)
      13 => string '�' (length=1)
      14 => string '�' (length=1)
      15 => string '�' (length=1)
      16 => string ',' (length=1)
      17 => string ' ' (length=1)
      18 => string 'e' (length=1)
      19 => string 'f' (length=1)
      20 => string 'g' (length=1)

But, with this code:

    header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
    $str = "abc 文字化け, efg";
    $results = array();
    preg_match_all('/./u', $str, $results);
    var_dump($results[0]);

(Notice the 'u' at the end of the regex.)

You get what you want:

    array
      0 => string 'a' (length=1)
      1 => string 'b' (length=1)
      2 => string 'c' (length=1)
      3 => string ' ' (length=1)
      4 => string '文' (length=3)
      5 => string '字' (length=3)
      6 => string '化' (length=3)
      7 => string 'け' (length=3)
      8 => string ',' (length=1)
      9 => string ' ' (length=1)
      10 => string 'e' (length=1)
      11 => string 'f' (length=1)
      12 => string 'g' (length=1)

Hope this helps 🙂
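Since the question is really about a subset check, here is a rough sketch of how the /u approach might be applied to it. The helper names utf8_chars and uses_only_chars are made up for illustration, and the nullable return types assume PHP 7.1 or later. It relies on preg_split returning false when the subject is not valid UTF-8 under the 'u' modifier, so input in some other encoding is reported as null rather than silently mis-split; converting such input to UTF-8 first is left out here.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of code points.
// Returns null when the input is not valid UTF-8 (preg_split fails under /u).
function utf8_chars(string $str): ?array
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? null : $chars;
}

// Hypothetical helper: true if every character of $input also occurs in
// $allowed, false if not, null if either argument is not valid UTF-8.
function uses_only_chars(string $input, string $allowed): ?bool
{
    $in = utf8_chars($input);
    $ok = utf8_chars($allowed);
    if ($in === null || $ok === null) {
        return null;
    }
    // array_diff keeps the characters of $in that are missing from $ok.
    return array_diff($in, $ok) === [];
}

var_dump(uses_only_chars("abc 文字", "abc 文字化け, efg")); // bool(true)
var_dump(uses_only_chars("abc 漢", "abc 文字化け, efg"));  // bool(false)
var_dump(uses_only_chars("\xE9t\xE9", "abc"));            // NULL (not valid UTF-8)
```

Note that this compares code points, not user-perceived characters, so a base letter followed by a combining accent counts as two entries.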

Answer