I would like to use a regular expression to validate user input. I want to allow any combination of letters, numbers, spaces, commas, apostrophes, periods, exclamation marks, and question marks, but I also want to limit the input to 4000 characters. I have come up with the following regular expression to achieve this: /^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i.
However, when I attempt to use this regular expression to test a subject in PHP with preg_match(), I am given a warning: PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37
and the subject fails to be tested.
I find this strange because if I use an infinite quantifier, the test passes just fine (I demonstrate this situation below).
Why is limiting the repetition to 4000 a problem, but infinite repetition not?
regex-test.php:
<?php

$infinite = "/^([a-z]|[0-9]| |,|'|\.|!|\?)*$/i"; // Allows infinite repetition
$fourk = "/^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i"; // Limits repetition to 4000

$string = "I like apples.";

if ( preg_match($infinite, $string) ){
    echo "Passed infinite repetition. \n";
}

if ( preg_match($fourk, $string) ){
    echo "Passed maximum repetition of 4000. \n";
}

?>
echoes:
Passed infinite repetition
PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37 in regex-test.php on line 16
Answer
The error is due to PCRE's LINK_SIZE: offset values are stored in two bytes by default, which limits the compiled pattern size to around 64K. This is expected behavior, explained below, and it is caused neither by a limit on repetition nor by how the pattern is interpreted when compiled.
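As a side note, when compilation fails preg_match() returns false rather than 0, so the plain if in the question cannot tell "did not match" from "could not compile". A minimal sketch of a check that separates the two; pattern_compiles() is a hypothetical helper name, not part of PHP:

```php
<?php
// Hypothetical helper: reports whether PCRE accepts (can compile) a pattern.
function pattern_compiles(string $pattern): bool
{
    // preg_match() returns false on failure; @ silences the warning
    // so the caller can decide how to report it.
    return @preg_match($pattern, '') !== false;
}

$fourk = "/^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i";

if (!pattern_compiles($fourk)) {
    echo "Pattern could not be compiled.\n";
}
```

On a build with the default 2-byte link size, the pattern from the question is expected to take the failure branch, but the result depends on how PCRE was compiled.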
In this case
As Alan Moore pointed out in his answer, all characters should be in the same character class. I’m more drastic, so allow me to say that pattern is so wrong it makes me cringe.
- No offense, most of us tried that once too. It’s just an attempt to underline that such constructs should never be used.
There are 3 common pitfalls here in (x|y|z){1,4000}:
- Capturing subpatterns should only be used when needed (to store a specific part of the matched text, in order to extract that value or to use it in a backreference). For all other use cases, stick to non-capturing groups or atomic groups. They perform better and save memory.
- Capturing subpatterns should not be repeated because the last repetition overwrites the captured text. (OK, it could be used, but only in very particular cases.)
- Alternation (with the |s) adds backtracking states. It’s good practice to reduce them as much as you can.
In this case, the regex /^[ !',.0-9?A-Z]{1,4000}$/i would match exactly the same, not only avoiding the error but also providing better performance.
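A minimal sketch of how the single character class could be used for the validation in the question. The function name is_valid_input() is made up for illustration; the pattern is the one suggested above:

```php
<?php
// Hypothetical helper: true when $input holds 1 to 4000 allowed characters.
function is_valid_input(string $input): bool
{
    // One character class, no capturing group, no alternation.
    // The /i flag makes A-Z cover lowercase letters as well.
    // Comparing with === 1 treats both "no match" (0) and
    // "error" (false) as invalid input.
    return preg_match("/^[ !',.0-9?A-Z]{1,4000}$/i", $input) === 1;
}

var_dump(is_valid_input("I like apples."));
var_dump(is_valid_input("a < b"));
```

Note that without the D modifier, $ also matches just before a trailing newline, so an input ending in a single "\n" would still pass; whether that matters depends on the application.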
LINK_SIZE
From “Handling Very Large Patterns” in the pcrebuild man page:
Within a compiled pattern, offset values are used to point from one part to another (for example, from an opening parenthesis to an alternation metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values are used for these offsets, leading to a maximum size for a compiled pattern of around 64K.
That means the compiled pattern stores an offset value for every subpattern in the alternation, for every repetition of the group. In this case the offsets leave no memory for the rest of the compiled pattern.
This is more clearly expressed in a comment in pcre_internal.h from the PHP dist:
PCRE keeps offsets in its compiled code as 2-byte quantities (always stored in big-endian order) by default. These are used, for example, to link from the start of a subpattern to its alternatives and its end. The use of 2 bytes per offset limits the size of the compiled regex to around 64K, which is big enough for almost everybody.
Using pcretest, I get the following information:
PCRE version 8.37 2015-04-28

/^([a-z]|[0-9]| |,|'|\.|!|\?){1,575}$/i
Failed: regular expression is too large at offset 36

/^([a-z]|[0-9]| |,|'|\.|!|\?){1,574}$/i
Memory allocation (code space): 65432
- There’s a Windows version of pcretest you can download from RexEgg.com.
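The pcretest figures give a rough feel for why 4000 repetitions cannot fit. This is a back-of-the-envelope sketch that assumes the compiled size grows roughly linearly with the upper bound of the quantifier; that linearity is my assumption, not something the PCRE documentation states, and the variable names are only illustrative:

```php
<?php
$codeSpace   = 65432;      // bytes reported by pcretest for {1,574}
$repetitions = 574;        // largest upper bound that still compiled
$limit       = 64 * 1024;  // approximate ceiling with 2-byte offsets

// Approximate cost of one copy of the group in the compiled pattern.
$bytesPerRepetition = $codeSpace / $repetitions;

// Rough size the pattern would need with an upper bound of 4000.
$estimateFor4000 = $bytesPerRepetition * 4000;

printf("about %d bytes per repetition\n", round($bytesPerRepetition));
printf("about %d bytes for {1,4000}, ceiling is about %d\n", $estimateFor4000, $limit);
```

Under that assumption each repetition costs roughly 114 bytes, so {1,4000} would need several times the available space.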
Regarding other size limitations in PCRE, you can check this post of mine.
Overriding the default LINK_SIZE in PHP
If we had a true reason to use a huge pattern, and this pattern could not be simplified any further by any means, the link size could be increased. However, you can only achieve this by recompiling PHP yourself (therefore, your code won’t be portable from then on). It should be the last resort, when there’s no other choice.
Also commented in pcre_internal.h:
The macros are controlled by the value of LINK_SIZE. This defaults to 2 in the config.h file, but can be overridden by using -D on the command line. This is automated on Unix systems via the “configure” command.
PCRE link size can be configured to 3 or 4:
./configure -DLINK_SIZE=4
But keep in mind that longer offsets require additional data, and it will slow down all calls to preg_* functions.
In case of compiling PHP on your own, see Installation on Unix systems or Build your own PHP on Windows.