Weird PHP Regex Preg_Match Bug?

My PHP version is PHP 7.2.24-0ubuntu0.18.04.7 (cli). However, the problem seems to occur with every version I've tested.

I’ve encountered a very weird bug when using preg_match. Anyone know a fix?

The first section of code here works, the second one doesn’t. But the regex itself is valid. For some reason the something_happened word is causing it to fail.

$one = ' (branch|leaf)';
echo "ONE:\n";
preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $one, $matches, PREG_OFFSET_CAPTURE);
print_r($matches); // this works

$two = 'something_happened (branch|leaf)';
echo "\nTWO:\n";
preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $two, $matches2, PREG_OFFSET_CAPTURE);
print_r($matches2); // this doesn't work

It seems somehow related to the word something_happened. If I change this word it works.

The regex is matching 2 or more type names separated by | that may or may not be surrounded in (), and each type name may or may not be preceded by any number of [] (or [some number] or [!some number]) and *.

Try it and see for yourself! Please let me know if you know how to fix it!

Answer

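The problem lies in the (?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+ group: the + quantifier is applied to a group whose leading parts are all optional and which ends in \w*, so a long run of word characters such as something_happened can be divided between iterations of the group in a huge number of ways. When the subsequent patterns (the | separator and the next type name) fail to match, the engine backtracks through all of those ways before giving up, and that is too many options to try.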
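To confirm that this is what is happening, you can check the return value of preg_match and then call preg_last_error(). Below is a minimal sketch, assuming the pattern from the question with its backslashes intact; pcre_error_name is a hypothetical helper written for this example, not a built-in. Which error you get (or whether you get one at all) depends on your PHP version, on whether PCRE JIT is enabled, and on pcre.backtrack_limit, so treat the output as a diagnostic and not a guarantee.

```php
<?php
// Hypothetical helper (not part of PHP): map a preg_last_error() code to its constant name.
function pcre_error_name(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? "unknown error ($code)";
}

// The original pattern from the question (no possessive quantifier, no atomic group).
$pattern = '/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/';
$two = 'something_happened (branch|leaf)';

$result = preg_match($pattern, $two, $matches, PREG_OFFSET_CAPTURE);

if ($result === false) {
    // false means the engine gave up; it is different from 0 ("no match").
    echo 'preg_match failed: ', pcre_error_name(preg_last_error()), "\n";
} else {
    echo "preg_match returned $result\n";
}
```

A strict comparison with false matters here, because a plain if (!$result) cannot tell an aborted match from a subject that simply does not match.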

In PHP, you can work around the problem by using either of the following:

  1. Possessive quantifier:

'/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)++(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/'

Note the ++ at the end of the group mentioned.

  2. Atomic group:

'/(?:\( ?)?((?>(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/'

See this regex demo. Note the (?>...) syntax.
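As a quick check, here is a small sketch that runs the possessive-quantifier variant against both strings from the question. The $inputs and $found names are only for illustration.

```php
<?php
// Possessive variant: note the ++ after the first type-name group.
$pattern = '/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)++(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/';

$inputs = [' (branch|leaf)', 'something_happened (branch|leaf)'];
$found  = [];

foreach ($inputs as $input) {
    if (preg_match($pattern, $input, $m, PREG_OFFSET_CAPTURE) === 1) {
        // With PREG_OFFSET_CAPTURE, $m[1] is [captured text, byte offset].
        $found[$input] = $m[1];
        printf("%s: group 1 = %s at offset %d\n", var_export($input, true), $m[1][0], $m[1][1]);
    } else {
        printf("%s: no match (error code %d)\n", var_export($input, true), preg_last_error());
    }
}
```

Both inputs should now capture branch|leaf in group 1, at offset 2 for the first string and offset 20 for the second. Once the engine has consumed something_happened as one type name, it is not allowed to give characters back and retry the same name split differently.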

Also, note how the regex is formatted in the linked demo: the x (extended) flag lets you break the regex across several lines and format it, which makes an issue like this much easier to track down. You have to escape all literal whitespace and # characters, but that is a minor inconvenience when debugging long patterns like this one.
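For illustration, this is one way the possessive variant might look with the x flag. The layout and comments are my own sketch, not the formatting from the demo. The literal spaces are written as "\ " because unescaped whitespace is ignored in extended mode, and the comments deliberately avoid the / delimiter and the ' quote character, since either one would end the pattern or the PHP string early.

```php
<?php
$pattern = '/
    (?: \( \ ? )?                          # optional opening parenthesis, then optional space
    (                                      # group 1: the whole list of type names
        (?:
            (?: \** \[ (?: !? \d+ )? \] )* # array prefixes such as [] or [3] or [!3], each may follow stars
            \**                            # any remaining stars
            [A-Za-z_] \w*                  # the type name itself
        )++                                # possessive: a matched name is never split again
        (?:
            \ ? \| \ ?                     # pipe separator with optional spaces around it
            (?: \** \[ (?: !? \d+ )? \] )*
            \**
            [A-Za-z_] \w*
        )+
    )
    (?: \ ? \) )?                          # optional space, then optional closing parenthesis
/x';

$two = 'something_happened (branch|leaf)';
$result = preg_match($pattern, $two, $m, PREG_OFFSET_CAPTURE);

print_r($m);
```

Laid out this way, the + applied to a group that ends in \w* is much easier to spot than in the one-line version.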

User contributions licensed under: CC BY-SA