When do I need the u modifier in PHP regex?

Question

I know that PHP's PCRE functions treat strings as byte sequences, so many sites suggest using the /u modifier to handle both the input and the regex as UTF-8. But do I really always need it? My tests show that this flag makes no difference when I don't use escape sequences, the dot, or anything like that. For example preg_match('/^[\da-f]{40}$/', $string); to check

Accepted Answer

There is no problem with the first expression. The characters being quantified are explicitly single-byte, and cannot occur in a UTF-8 multibyte sequence.

The second expression may give you more spacers than you expect; for example:

    echo preg_replace('/[^a-zA-Z0-9]/', "0", "💩");
    // => 0000

The third expression also does not pose a problem, as the repeated character is limited by parentheses (which is ASCII-safe).

This is more dangerous:

    echo preg_replace('/^(.)/', "0", "💩");
    // => 0???

Typically, without knowing more about how UTF-8 works, it may be tricky to predict which regexps are safe and which are not, so using /u for all text that might contain a character above U+007F is the best practice.
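To make the difference concrete, here is a small sketch that runs the same patterns with and without /u. The variable names are mine, and the sample character is an assumption: any four-byte UTF-8 character should behave the same way. I picked U+1F4A9 and wrote it as raw bytes so the result does not depend on the encoding of the source file.

```php
<?php
// Hypothetical sample: U+1F4A9 encoded as its four UTF-8 bytes.
$sample = "\xF0\x9F\x92\xA9";

// Without /u the negated class is tested against each byte, so all four bytes are replaced.
$bytewise = preg_replace('/[^a-zA-Z0-9]/', '0', $sample);

// With /u the subject is decoded as UTF-8, so the one code point is replaced once.
$unicode = preg_replace('/[^a-zA-Z0-9]/u', '0', $sample);

// Without /u the dot consumes only the lead byte and leaves three
// orphaned continuation bytes behind (invalid UTF-8).
$brokenDot = preg_replace('/^(.)/', '0', $sample);

// With /u the dot consumes the whole character.
$safeDot = preg_replace('/^(.)/u', '0', $sample);

// An ASCII-only pattern gives the same answer either way on valid input.
$hash = sha1('example');
$plainMatch = preg_match('/^[\da-f]{40}$/', $hash);
$unicodeMatch = preg_match('/^[\da-f]{40}$/u', $hash);

var_dump($bytewise, $unicode, bin2hex($brokenDot), $safeDot, $plainMatch, $unicodeMatch);
```

One thing to keep in mind if you add /u everywhere: when the subject is not valid UTF-8, the preg_* functions report an error instead of matching (preg_match returns false, preg_replace returns null), so check the return value or preg_last_error() when the input comes from outside.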

Answer