Multibyte trim in PHP?

Question

Apparently there's no mb_trim in the mb_* family, so I'm trying to implement one on my own. I recently found this regex in a comment on php.net: /(^\s+)|(\s+$)/u So, I'd implement it in the …

Accepted Answer

The standard trim function trims a handful of space and space-like characters. These are all ASCII characters, that is, specific single bytes in the range 0000 0000 to 0010 0000 (0x00 to 0x20).

Proper UTF-8 input never contains multi-byte characters made up of bytes of the form 0xxx xxxx. Every byte of a multi-byte character in proper UTF-8 has the form 1xxx xxxx.

This means that in a proper UTF-8 sequence, bytes of the form 0xxx xxxx can only be single-byte characters. PHP's trim function will therefore never trim away "half a character", assuming you have a proper UTF-8 sequence. (Be very, very careful about improper UTF-8 sequences.)

Without the /u modifier, \s in a regular expression will mostly match the same characters as trim.

The preg functions with the /u modifier only work on UTF-8 encoded patterns and subjects, and with it \s also matches the UTF-8 encoded non-breaking space (NBSP, U+00A0). This behaviour with non-breaking spaces is the only advantage of using it.

If you want to trim space characters in other, non-ASCII-compatible encodings, neither method will work.

In other words, if you're trying to trim the usual spaces from a string in an ASCII-compatible encoding, just use trim. When using /\s/u, be careful about what NBSP means in your text.

Take care:

    $s1 = html_entity_decode(" Hello &#160; "); // the NBSP
    $s2 = " 𩸽 exotic test ホ 𩸽 ";

    echo "\nCORRECT trim: [". trim($s1) ."], [". trim($s2) ."]";
    echo "\nSAME: [". trim($s1) ."] == [". preg_replace('/^\s+|\s+$/', '', $s1) ."]";
    echo "\nBUT: [". trim($s1) ."] != [". preg_replace('/^\s+|\s+$/u', '', $s1) ."]";

    // DANGER! trim() treats its character list as single bytes, so this is not UTF-8 safe.
    echo "\nUNSAFE trim: [". trim($s2, '𩸽 ') ."]";
    echo "\nSAFE ONLY WITH preg: [".
        preg_replace('/^[𩸽\s]+|[𩸽\s]+$/u', '', $s2) ."]";
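To tie the points above together, here is a minimal sketch of a Unicode-aware trim built on preg_replace with the /u modifier. The function name utf8_trim is hypothetical (it is not a PHP built-in), and rejecting malformed input with an exception is an assumption about how you would want improper UTF-8 handled. The last lines show why a multi-byte character in trim's character list is risky.

```php
<?php
/**
 * Hypothetical helper, not a PHP built-in: trims whitespace, including
 * Unicode spaces such as NBSP (U+00A0), from both ends of a UTF-8 string.
 */
function utf8_trim(string $s): string
{
    // Assumption: malformed UTF-8 is treated as a caller error.
    // preg_replace() with /u would otherwise return null for such input.
    if (!mb_check_encoding($s, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // With /u, \s also matches Unicode whitespace, not just ASCII whitespace.
    $result = preg_replace('/^\s+|\s+$/u', '', $s);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return $result;
}

// trim() leaves the NBSP in place, utf8_trim() removes it.
var_dump(trim(" Hello \u{00A0} ") === "Hello \u{00A0}"); // bool(true)
var_dump(utf8_trim(" Hello \u{00A0} ") === "Hello");     // bool(true)

// trim() compares single bytes: "ホ" (E3 83 9B) and "ポ" (E3 83 9D) share
// their first two bytes, so trimming "ポ" cuts "ホ" in half.
var_dump(mb_check_encoding(trim("ホ", "ポ"), 'UTF-8'));   // bool(false)
```

If NBSP carries meaning in your text (for example, as deliberate padding), plain trim may be the better choice, as noted above.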

Answer