Apparently there’s no mb_trim in the mb_* family, so I’m trying to implement one of my own.
I recently found this regex in a comment on php.net:
/(^\s+)|(\s+$)/u
So, I’d implement it in the following way:
function multibyte_trim($str)
{
    // Fall back to a regex when mb_trim() is not available.
    if (!function_exists("mb_trim") || !extension_loaded("mbstring")) {
        return preg_replace("/(^\s+)|(\s+$)/u", "", $str);
    } else {
        return mb_trim($str);
    }
}
The regex seems correct to me, but I’m a complete noob with regular expressions. Will this effectively remove any Unicode whitespace at the beginning/end of a string?
Answer
The standard trim function trims a handful of space and space-like characters. These are defined as ASCII characters, which means certain specific bytes in the range 0000 0000 to 0111 1111 (by default the space, tab, newline, carriage return, vertical tab and NUL bytes).
Proper UTF-8 input will never contain multi-byte characters that are made up of bytes of the form 0xxx xxxx. All the bytes in proper UTF-8 multi-byte characters have the form 1xxx xxxx.
This means that in a proper UTF-8 sequence, bytes of the form 0xxx xxxx can only refer to single-byte characters. PHP’s trim function, with its default character list, will therefore never trim away “half a character”, assuming you have a proper UTF-8 sequence. (Be very, very careful about improper UTF-8 sequences.)
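To make the byte-level argument concrete, here is a minimal sketch. It assumes PHP 7 or later (for the "\u{...}" escape) and uses ホ (U+30DB) purely as a sample multi-byte character; the variable names are only for illustration.

```php
<?php
// ホ (U+30DB) is encoded in UTF-8 as three bytes: E3 83 9B.
$char = "\u{30DB}";

// Check that every byte of the multi-byte character has its high bit set,
// i.e. has the form 1xxx xxxx and so cannot be an ASCII byte.
$allHighBit = true;
foreach (str_split($char) as $byte) {
    if (ord($byte) < 0x80) {
        $allHighBit = false;
    }
}

// trim() with its default character list only strips ASCII bytes,
// so the multi-byte character comes through untouched.
$trimmed = trim(" \t" . $char . " \n");

echo bin2hex($char), "\n";      // e3839b
var_dump($allHighBit);          // bool(true)
var_dump($trimmed === $char);   // bool(true)
```

This only holds for the default character list. As soon as you pass your own multi-byte characters as the second argument, trim treats them as individual bytes.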
The \s in ASCII regular expressions will mostly match the same characters as trim.
The preg functions with the /u modifier only work on UTF-8 encoded patterns and strings, and /\s/u also matches UTF-8’s NBSP (the non-breaking space, U+00A0). This behaviour with non-breaking spaces is the only advantage to using it.
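A small sketch of that difference, assuming PHP 7 or later and the default "C" locale (under some single-byte locales the non-/u pattern may behave differently). The variable names are made up for the example.

```php
<?php
// NBSP (U+00A0) is the two bytes C2 A0 in UTF-8.
$nbsp = "\u{00A0}";
$text = $nbsp . "Hello" . $nbsp;

// trim() does not know about NBSP, so nothing is removed.
$withTrim = trim($text);

// Without /u the pattern works byte by byte and also leaves the NBSP alone.
$withAscii = preg_replace('/^\s+|\s+$/', '', $text);

// With /u the subject is read as UTF-8 and \s matches the NBSP.
$withUtf8 = preg_replace('/^\s+|\s+$/u', '', $text);

var_dump($withTrim === $text);
var_dump($withAscii === $text);
var_dump($withUtf8 === "Hello");
```

Whether stripping the NBSP is what you want depends on your text: it is often put there on purpose.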
If you want to replace space characters in other, non-ASCII-compatible encodings, neither method will work.
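As an illustration, here is what happens with UTF-16LE, where every ASCII character is followed by a 0x00 byte. This is only a sketch: it assumes the iconv extension is available, and the convert, trim, convert-back workaround is my suggestion rather than a built-in feature.

```php
<?php
// " a " in UTF-16LE is the byte sequence 20 00 61 00 20 00.
$utf16 = iconv('UTF-8', 'UTF-16LE', ' a ');

// trim() strips the 0x20 and 0x00 bytes independently of each other,
// which leaves a single byte: half of a UTF-16 code unit.
$broken = trim($utf16);

// One possible workaround: convert to UTF-8, trim there, convert back.
$utf8  = iconv('UTF-16LE', 'UTF-8', $utf16);
$fixed = iconv('UTF-8', 'UTF-16LE', trim($utf8));

echo bin2hex($broken), "\n"; // 61
echo bin2hex($fixed), "\n";  // 6100
```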
In other words, if you’re trying to trim usual spaces in an ASCII-compatible string, just use trim. When using /\s/u, be careful with the meaning of NBSP for your text.
Take care:
$s1 = html_entity_decode(" Hello &nbsp; "); // the NBSP
$s2 = " 🐛 exotic test ホ 🐛 "; // 🐛 stands in for any 4-byte character; it shares its last byte (0x9B) with ホ

echo "\nCORRECT trim: [". trim($s1) ."], [". trim($s2) ."]";
echo "\nSAME: [". trim($s1) ."] == [". preg_replace('/^\s+|\s+$/','',$s1) ."]";
echo "\nBUT: [". trim($s1) ."] != [". preg_replace('/^\s+|\s+$/u','',$s1) ."]";

echo "\n!INCORRECT trim: [". trim($s2,'🐛 ') ."]"; // DANGER! not UTF8 safe!
echo "\nSAFE ONLY WITH preg: [". preg_replace('/^[🐛\s]+|[🐛\s]+$/u', '', $s2) ."]";
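If you do need the preg variant regularly, it can be wrapped in a small helper. The following is a sketch only: utf8_trim is a hypothetical name (it is not part of PHP or mbstring), it assumes PHP 7 or later, and returning the input unchanged on invalid UTF-8 is just one possible choice.

```php
<?php
/**
 * Hypothetical helper: trims Unicode whitespace, plus any extra characters
 * given in $extraChars, from both ends of a UTF-8 string.
 */
function utf8_trim(string $str, string $extraChars = ''): string
{
    // preg_quote() escapes characters such as ], ^ and - so the extra
    // characters are safe to embed inside a character class.
    $class = '\s' . preg_quote($extraChars, '/');

    $result = preg_replace('/^[' . $class . ']+|[' . $class . ']+$/u', '', $str);

    // With /u, preg_replace() returns null when the subject is not valid
    // UTF-8. In that case this sketch hands the input back untouched.
    return $result ?? $str;
}

$a = utf8_trim("\u{00A0} Hello \u{00A0}");
$b = utf8_trim(" \u{1F41B} exotic test \u{30DB} \u{1F41B} ", "\u{1F41B}");
$c = utf8_trim("\xFF abc ");

var_dump($a === "Hello");
var_dump($b === "exotic test \u{30DB}");
var_dump($c === "\xFF abc ");
```

Unlike trim with a custom character list, the character class here is matched per code point, so a multi-byte character in $extraChars never takes a single byte out of a neighbouring character.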