I know, that PHP PCRE functions treat strings as byte sequences, so many sites suggest to use /u
modifier for handling input and regex as UTF-8.
But, do I really need this always? My tests show, that this flag makes no difference, when I don’t use escape sequences or dot or something like this.
For example
preg_match('/^[da-f]{40}$/', $string);
to check if string has format of a SHA1 hash
preg_replace('/[^a-zA-Z0-9]/', $spacer, $string);
to replace every char that is non-ASCII letter or number
preg_replace('/^+((.*))$/', '1', $string);
for getting inner content of +(XYZ)
These regex contain only single byte ASCII symbols, so it should work on every input, regardless of encoding, shouldn’t it? Note that third regex uses dot operator, but as I cut off some ASCII chars at beginning and end of string, this should work on UTF-8 also, correct?
Cannot anyone tell me, if I’m overlooking something?
Advertisement
Answer
There is no problem with the first expression. The characters being quantified are explicitly single-byte, and cannot occur in a UTF-8 multibyte sequence.
The second expression may give you more spacers than you expect; for example:
echo preg_replace('/[^a-zA-Z0-9]/', "0", "ð©"); // => 0000
The third expression also does not pose a problem, as the repeated character is limited by parentheses (which is ASCII-safe).
This is more dangerous:
echo preg_replace('/^(.)/', "0", "ð©"); // => 0???
Typically, without knowing more about how UTF-8 works, it may be tricky to predict which regexps are safe, and which are not, so using /u
for all text that might contain a character above U+007F is the best practice.