
When do I need the u modifier in PHP regex?

I know that PHP's PCRE functions treat strings as byte sequences, so many sites suggest using the /u modifier to handle both the input and the regex as UTF-8.

But do I really always need it? My tests show that this flag makes no difference when I don't use escape sequences, the dot, or anything like that.

For example

preg_match('/^[\da-f]{40}$/', $string); to check whether the string has the format of a SHA1 hash

preg_replace('/[^a-zA-Z0-9]/', $spacer, $string); to replace every character that is not an ASCII letter or digit

preg_replace('/^\+\((.*)\)$/', '\1', $string); for getting the inner content of +(XYZ)

These regexes contain only single-byte ASCII symbols, so they should work on any input, regardless of encoding, shouldn't they? Note that the third regex uses the dot operator, but since I cut off some ASCII characters at the beginning and end of the string, this should work on UTF-8 as well, correct?

Can anyone tell me if I'm overlooking something?

Answer

There is no problem with the first expression. The characters being quantified are explicitly single-byte, and cannot occur in a UTF-8 multibyte sequence.
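The reason is that every byte of a multibyte UTF-8 sequence has its high bit set (value 0x80 or above), while ASCII characters are all below 0x80. A small sketch to see this for yourself; the helper name utf8ByteValues is made up for illustration:

```php
<?php
// Hypothetical helper: return the raw byte values of a string.
function utf8ByteValues(string $s): array
{
    // unpack('C*') yields unsigned byte values with 1-based keys;
    // array_values() renumbers them from 0.
    return array_values(unpack('C*', $s));
}

$bytes = utf8ByteValues("\u{1F4A9}"); // the pile-of-poo character, U+1F4A9

// Print the bytes in hex.
echo implode(' ', array_map('dechex', $bytes)), "\n";
// f0 9f 92 a9
```

None of these four bytes falls in the ASCII range, so a class like [\da-f] can never match part of the character.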

The second expression may give you more spacers than you expect, because without /u each byte of a multibyte character is replaced separately; for example:

echo preg_replace('/[^a-zA-Z0-9]/', "0", "💩");
// => 0000
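For comparison, here is a minimal sketch of the same replacement with and without /u. The variable names are just for illustration, and the character is written as an escape so the snippet does not depend on the file's encoding:

```php
<?php
$input = "\u{1F4A9}"; // one character, four bytes in UTF-8

// Byte mode: each of the four bytes is "not an ASCII letter or digit".
$withoutU = preg_replace('/[^a-zA-Z0-9]/', '0', $input);

// UTF-8 mode: the four bytes are treated as a single character.
$withU = preg_replace('/[^a-zA-Z0-9]/u', '0', $input);

var_dump($withoutU); // string(4) "0000"
var_dump($withU);    // string(1) "0"
```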

The third expression also does not pose a problem: the dot run is enclosed by literal ASCII parentheses, so the captured bytes always begin and end on character boundaries.

This is more dangerous:

echo preg_replace('/^(.)/', "0", "💩");
// => 0???
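Here the dot consumes only the first byte of the character and leaves three stray continuation bytes behind, so the result is no longer valid UTF-8. A short sketch of the difference, using the common trick of matching an empty pattern with /u to test whether a string is valid UTF-8 (variable names are illustrative):

```php
<?php
$input = "\u{1F4A9}";

$broken = preg_replace('/^(.)/', '0', $input);  // "0" plus three leftover bytes
$intact = preg_replace('/^(.)/u', '0', $input); // "0"

// With /u, preg_match() fails (returns false) on a subject that is not valid UTF-8.
var_dump(preg_match('//u', $broken)); // bool(false)
var_dump(preg_match('//u', $intact)); // int(1)
```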

In general, without knowing more about how UTF-8 works, it can be tricky to predict which regexps are safe and which are not, so using /u for all text that might contain a character above U+007F is the best practice.
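One thing to keep in mind if you do add /u everywhere: the preg functions then reject subjects that are not valid UTF-8, and preg_replace() returns null in that case instead of a string. A minimal sketch of a wrapper that accounts for this; the function name replaceNonAlnum and the choice to throw an exception are assumptions for illustration, not something from the question:

```php
<?php
// Hypothetical helper: replace every character that is not an ASCII
// letter or digit, treating the input as UTF-8.
function replaceNonAlnum(string $input, string $spacer): string
{
    $result = preg_replace('/[^a-zA-Z0-9]/u', $spacer, $input);

    if ($result === null) {
        // preg_replace() returns null on error; with /u the usual cause
        // is a subject that is not valid UTF-8.
        $reason = preg_last_error() === PREG_BAD_UTF8_ERROR
            ? 'input is not valid UTF-8'
            : 'PCRE error code ' . preg_last_error();
        throw new InvalidArgumentException($reason);
    }

    return $result;
}

echo replaceNonAlnum("a\u{1F4A9}b", '-'), "\n"; // a-b
```

Whether to throw, fall back to byte mode, or sanitize the input first depends on where your strings come from.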

User contributions licensed under: CC BY-SA