
Java Regex vs. PHP, Dangling meta character ‘?’

I’m tagging this with PHP even though it’s a Java question. The regex is copied from a PHP source so I’m hoping some PHPers can help with the question.

I decided to build a simple spam filter, just for fun, and I copied the spam blocklist from MediaWiki: https://meta.wikimedia.org/wiki/Spam_blacklist

Mostly this seems to work, but a few of the patterns fail with a syntax error. I don’t know if this is a typo or if PHP uses a different syntax than Java. Can anyone help me fix these regexes so that they compile?

Here are the problems:

JavaScript

Here’s the code that compiles them, in case you’re interested. I don’t think it makes a difference though.

JavaScript


Answer

Your downloaded copy of https://meta.wikimedia.org/wiki/Spam_blacklist (blacklist.txt) is corrupt. The dangling question marks are really non-ASCII characters that were replaced with “?”, e.g. \bfacebo(?:o[ob]|?o)k\.com\b is actually \bfacebo(?:o[ob]|ıo)k\.com\b. Note the dotless “ı”.
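To see the failure in isolation, here is a minimal sketch that compiles both variants of that one entry. The class and method names are made up for illustration, and the `\b` and `\.` escapes in the strings are an assumption about how the entry looks in the raw list.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class DanglingDemo {
    /** Returns null if the regex compiles, otherwise the syntax error description. */
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Corrupted copy: "?" directly follows "|", so it has nothing to quantify.
        String corrupted = "\\bfacebo(?:o[ob]|?o)k\\.com\\b";
        // Intact copy: the same position holds U+0131, the dotless "ı".
        String intact = "\\bfacebo(?:o[ob]|\u0131o)k\\.com\\b";

        System.out.println("corrupted: " + compileError(corrupted));
        System.out.println("intact:    " + compileError(intact));
    }
}
```

The corrupted string is rejected because a quantifier cannot start an alternative; the intact one is an ordinary pattern.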

Download https://meta.wikimedia.org/wiki/Spam_blacklist?action=raw and make sure you read it as UTF-8.
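A rough sketch of reading the raw file with an explicit charset follows. The class name, the method names and the file name passed on the command line are hypothetical, and treating everything after “#” as a comment is an assumption about the list format that you should check against the page itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class BlocklistReader {
    /** Drops a trailing "#" comment and surrounding whitespace (assumed convention). */
    static String stripComment(String line) {
        int hash = line.indexOf('#');
        return (hash >= 0 ? line.substring(0, hash) : line).trim();
    }

    /** Reads the file as UTF-8 and returns the non-empty entries. */
    static List<String> readEntries(Path file) throws IOException {
        List<String> entries = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String entry = stripComment(line);
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** Shows how a lossy charset turns an unmappable character into "?". */
    static String lossyRoundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // U+0131 has no ISO-8859-1 encoding, so it is replaced during encoding.
        System.out.println(lossyRoundTrip("\u0131"));
        if (args.length > 0) {
            readEntries(Paths.get(args[0])).forEach(System.out::println);
        }
    }
}
```

Passing the charset explicitly avoids depending on the platform default, which is one plausible way the “?” characters got into the file in the first place.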

You may also want to pass the Unicode flag to the regular expressions. Also take into account that:

What is referred to here as regular expressions are not proper regular expressions, but rather subpatterns that are inserted into a hard-coded regular expression. i.e. the subpattern Foo from above would create a regular expression like /^Foo$/usi.

(see https://www.mediawiki.org/wiki/Extension:TitleBlacklist#Block_list).
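In Java, the quoted /^Foo$/usi form could be approximated as sketched below. The flag mapping is my reading of the PCRE modifiers, not something the extension documents for Java: “s” corresponds to DOTALL and “i” to CASE_INSENSITIVE, while “u” has no direct counterpart because Java strings are already Unicode, so UNICODE_CASE is added to make case-insensitivity apply beyond ASCII. The class name and the extra non-capturing group are illustrative choices.

```java
import java.util.regex.Pattern;

class SubpatternWrapper {
    // Approximate Java counterpart of the PCRE modifiers "usi" (an assumption).
    static final int FLAGS =
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.DOTALL;

    /** Inserts the subpattern into an anchored regex, like /^Foo$/usi. */
    static Pattern wrap(String subpattern) {
        // The (?:...) group keeps a top-level "|" in the subpattern inside the anchors.
        return Pattern.compile("^(?:" + subpattern + ")$", FLAGS);
    }

    public static void main(String[] args) {
        Pattern p = wrap("Foo");
        System.out.println(p.matcher("fOO").find());
        System.out.println(p.matcher("Foobar").find());
    }
}
```

Without the group, a subpattern such as a|b would become ^a|b$, which anchors each alternative on one side only.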

User contributions licensed under: CC BY-SA