My PHP version is PHP 7.2.24-0ubuntu0.18.04.7 (cli). However, it looks like this problem occurs with all versions I’ve tested.
I’ve encountered a very weird bug when using preg_match. Does anyone know a fix?
The first section of code here works; the second one doesn’t. The regex itself is valid, but for some reason the word something_happened is causing it to fail.
    $one = ' (branch|leaf)';
    echo "ONE:\n";
    preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $one, $matches, PREG_OFFSET_CAPTURE);
    print_r($matches); // this works

    $two = 'something_happened (branch|leaf)';
    echo "\nTWO:\n";
    preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $two, $matches2, PREG_OFFSET_CAPTURE);
    print_r($matches2); // this doesn't work
It seems somehow related to the word something_happened. If I change this word, it works.
The regex matches two or more type names separated by |, which may or may not be surrounded by (), and each type name may or may not be preceded by any number of [] (or [some number] or [!some number]) and *.
Try it and see for yourself! Please let me know if you know how to fix it!
Answer
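The problem lies in the (?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+ group: the + quantifier is applied to a group that consists mostly of optional patterns and ends in \w*, so a word like something_happened can be split across the repetitions in a huge number of ways. When the patterns that follow fail to match, the engine backtracks through all of those splits, which is too many options to try before it can move on.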
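A minimal sketch of how the failure can be observed, assuming the pattern above with its escaping restored. The two ini_set() calls are assumptions added only so the outcome does not depend on the local php.ini; they are not part of the original code. The point is that preg_match() returns false on an error, which is easy to mistake for "no match".

```php
<?php
// Assumption: JIT is switched off and the backtrack limit is lowered so that
// the failure is reproducible regardless of the local configuration.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '10000');

$pattern = '/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/';
$two = 'something_happened (branch|leaf)';

$result = preg_match($pattern, $two, $matches, PREG_OFFSET_CAPTURE);

if ($result === false) {
    // false means the engine gave up; 0 would mean "no match"
    $hitLimit = preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR;
    echo $hitLimit ? "backtrack limit reached\n" : "other PCRE error\n";
} else {
    print_r($matches);
}
```

Checking the return value with === false and then calling preg_last_error() is usually the quickest way to tell a backtracking problem apart from a pattern that simply does not match.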
In PHP, you can work around the problem by using either of the following:
1. Possessive quantifier:
'/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)++(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/'
Note the ++ at the end of the group mentioned.
2. Atomic group:
'/(?:\( ?)?((?>(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/'
Note the (?>...) syntax. See this regex demo.
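Both workarounds can be tried side by side with a short script. This is only a sketch: the $name helper variable and the $results array are introduced here for illustration, and the patterns assume the escaping shown above.

```php
<?php
// One repetition of the type-name pattern, shared by both variants
$name = '(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*';

// Variant 1: possessive quantifier (++) on the first group
$possessive = '/(?:\( ?)?((?:' . $name . ')++(?: ?\| ?' . $name . ')+)(?: ?\))?/';
// Variant 2: atomic group (?>...) around each repetition
$atomic = '/(?:\( ?)?((?>' . $name . ')+(?: ?\| ?' . $name . ')+)(?: ?\))?/';

$two = 'something_happened (branch|leaf)';
$results = [];

foreach (['possessive' => $possessive, 'atomic' => $atomic] as $label => $pattern) {
    if (preg_match($pattern, $two, $matches, PREG_OFFSET_CAPTURE) === 1) {
        // $matches[0] holds the full match text and its byte offset
        $results[$label] = $matches[0];
        echo $label, ': ', $matches[0][0], ' at offset ', $matches[0][1], "\n";
    } else {
        echo $label, ": no match or error\n";
    }
}
```

In both variants the engine is not allowed to give characters back once a type name has been consumed, so the leading word fails quickly and the search moves on to the parenthesised part.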
Also, note how the regex is formatted in the demo: it is very convenient to use the x (extended) flag to break the regex into several lines and format it, which makes the issue easier to track down. With this flag you have to escape all literal whitespace and # chars, but that is a minor inconvenience when it comes to debugging long patterns like this.
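As an illustration of the x flag, here is one possible layout of the possessive variant. The line breaks and comments are my own choice, not the formatting from the demo. Note that every literal space is written as "\ ", and that the comments avoid the / delimiter and the single quote, since either would end the pattern or the PHP string early.

```php
<?php
$pattern = '/
    (?: \( \ ? )?                         # optional opening parenthesis and space
    (
        (?:                               # first type name
            (?: \** \[ (?: !? \d+ )? \] )*    # any number of [], [n] or [!n]
            \** [A-Za-z_] \w*
        )++                               # possessive: no backtracking into the name
        (?:                               # one or more further alternatives
            \ ? \| \ ?                    # separator with optional spaces
            (?: \** \[ (?: !? \d+ )? \] )*
            \** [A-Za-z_] \w*
        )+
    )
    (?: \ ? \) )?                         # optional space and closing parenthesis
/x';

$two = 'something_happened (branch|leaf)';
$result = preg_match($pattern, $two, $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
```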