In PHP, what is the best way to split a string into an array of Unicode characters? What if the input is not necessarily UTF-8?
I want to know whether the set of Unicode characters in an input string is a subset of another set of Unicode characters.
Why not run straight for the mb_ family of functions, as the first couple of answers didn’t?
Answer
You could use the ‘u’ modifier with PCRE regular expressions; see Pattern Modifiers (quoting):
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
For instance, consider this code:
    header('Content-type: text/html; charset=UTF-8'); // So the browser doesn't make our lives harder
    $str = "abc 文字化け, efg";
    $results = array();
    preg_match_all('/./', $str, $results); // no 'u' modifier: '.' matches one byte
    var_dump($results[0]);
You’ll get an unusable result:
    array
      0 => string 'a' (length=1)
      1 => string 'b' (length=1)
      2 => string 'c' (length=1)
      3 => string ' ' (length=1)
      4 => string '�' (length=1)
      5 => string '�' (length=1)
      6 => string '�' (length=1)
      7 => string '�' (length=1)
      8 => string '�' (length=1)
      9 => string '�' (length=1)
      10 => string '�' (length=1)
      11 => string '�' (length=1)
      12 => string '�' (length=1)
      13 => string '�' (length=1)
      14 => string '�' (length=1)
      15 => string '�' (length=1)
      16 => string ',' (length=1)
      17 => string ' ' (length=1)
      18 => string 'e' (length=1)
      19 => string 'f' (length=1)
      20 => string 'g' (length=1)
But with this code:
    header('Content-type: text/html; charset=UTF-8'); // So the browser doesn't make our lives harder
    $str = "abc 文字化け, efg";
    $results = array();
    preg_match_all('/./u', $str, $results); // 'u' modifier: '.' matches one UTF-8 character
    var_dump($results[0]);
(Notice the ‘u’ at the end of the regex)
You get what you want:
    array
      0 => string 'a' (length=1)
      1 => string 'b' (length=1)
      2 => string 'c' (length=1)
      3 => string ' ' (length=1)
      4 => string '文' (length=3)
      5 => string '字' (length=3)
      6 => string '化' (length=3)
      7 => string 'け' (length=3)
      8 => string ',' (length=1)
      9 => string ' ' (length=1)
      10 => string 'e' (length=1)
      11 => string 'f' (length=1)
      12 => string 'g' (length=1)
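To tie this back to the question (input that may not be UTF-8, and the subset check), here is a minimal sketch rather than a definitive implementation. The helper names `split_unicode_chars` and `is_char_subset` are made up for illustration, the caller is assumed to know the source encoding, and the mbstring extension is assumed to be available for the conversion step. It uses `preg_split` with an empty pattern instead of `preg_match_all`; the ‘u’ modifier does the same job in both.

```php
<?php
// Hypothetical helper (not from the answer above): split a string into
// Unicode characters. $encoding is the encoding the caller believes the
// input uses; it is an assumption of this sketch that the caller knows it.
function split_unicode_chars(string $input, string $encoding = 'UTF-8'): array
{
    // The 'u' modifier only understands UTF-8, so convert anything else first.
    $utf8 = ($encoding === 'UTF-8')
        ? $input
        : mb_convert_encoding($input, 'UTF-8', $encoding);

    // An empty pattern with 'u' splits between every character;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at both ends.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);

    // With 'u', preg_split() returns false if the subject is not valid UTF-8.
    if ($chars === false) {
        throw new InvalidArgumentException('Input could not be read as UTF-8');
    }
    return $chars;
}

// Hypothetical helper: true if every character of $candidate also occurs
// in $allowed (both assumed to be UTF-8 here).
function is_char_subset(string $candidate, string $allowed): bool
{
    $extra = array_diff(
        split_unicode_chars($candidate),
        split_unicode_chars($allowed)
    );
    return count($extra) === 0;
}

var_dump(is_char_subset('文字', 'abc 文字化け')); // bool(true)
var_dump(is_char_subset('文x', 'abc 文字化け'));  // bool(false)
```

Note that this splits on code points, so a base letter followed by a combining mark comes back as two array elements.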
Hope this helps 🙂