I’m tagging this with PHP even though it’s a Java question. The regex is copied from a PHP source so I’m hoping some PHPers can help with the question.
I decided to build a simple spam filter, just for fun, and I copied the spam blocklist from MediaWiki: https://meta.wikimedia.org/wiki/Spam_blacklist
Mostly this seems to work, but a few of the patterns fail with a syntax error. I don’t know if this is a typo or if PHP uses a different regex syntax than Java. Can anyone help me fix these regexes so that they compile?
Here are the problems:
java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 17
bfacebo(?:o[ob]|?o)k.comb
^
java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 5
b????.tkb
^
java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 0
??.xsl.ptb
^
java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 4
b????.shopb
^
java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 4
b???.??b
^
Here’s the code that compiles them, in case you’re interested. I don’t think it makes a difference though.
private static synchronized void init() throws IOException {
    // Already initialized and not yet garbage collected: nothing to do.
    if( blackListPatterns.get() != null ) return;

    // Read every non-blank, non-comment line of the bundled list as UTF-8.
    ArrayList<String> blacklist = new ArrayList<>( 12000 );
    InputStream blacklistfile = SpamBlackList.class.getResourceAsStream( "blacklist.txt" );
    try( BufferedReader buf = new BufferedReader( new InputStreamReader( blacklistfile, "UTF-8" ) ) ) {
        for( String line; (line = buf.readLine()) != null; )
            if( !line.isBlank() && line.trim().charAt(0) != '#' )
                blacklist.add( line );
    }

    // Compile each entry; report the ones that are not valid Java regexes.
    ArrayList<Pattern> tempPatterns = new ArrayList<>( blacklist.size() );
    for( String pat : blacklist )
        try {
            tempPatterns.add( Pattern.compile( pat ) );
        } catch ( java.util.regex.PatternSyntaxException ex ) {
            System.err.println( ex ); // should log this, low level like FINER
        }
    blackListPatterns = new WeakReference<>( tempPatterns );
}

private static volatile WeakReference<List<Pattern>> blackListPatterns = new WeakReference<>( null );
Answer
Your downloaded copy of https://meta.wikimedia.org/wiki/Spam_blacklist (blacklist.txt) is corrupt. The dangling question marks stand in for non-ASCII characters that were lost when the file was saved. For example, bfacebo(?:o[ob]|?o)k.comb is actually bfacebo(?:o[ob]|ıo)k.comb. Note the dotless “ı”.
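To see how this kind of damage can produce exactly that exception, here is a minimal sketch. It assumes the original entry had its backslashes in place (they do not show in the output quoted above) and it simulates the lossy save by encoding to US-ASCII, which is only one of several ways a file could end up like this. The class and method names are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class DanglingQuestionMarkDemo {

    // Simulates a lossy save: characters that US-ASCII cannot represent become '?'.
    static String roundTripThroughAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    // Returns true if the string compiles as a java.util.regex pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed original entry with backslashes restored; the escape is U+0131, the dotless i.
        String original = "\\bfacebo(?:o[ob]|\u0131o)k\\.com\\b";
        String damaged = roundTripThroughAscii(original);

        System.out.println(damaged);
        System.out.println("original compiles: " + compiles(original));
        System.out.println("damaged compiles:  " + compiles(damaged));
    }
}
```

In the damaged string the '?' directly follows the '|', so there is nothing for it to quantify, which is what "dangling meta character" means.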
Download https://meta.wikimedia.org/wiki/Spam_blacklist?action=raw and take into account that it is UTF-8.
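One way to fetch the raw list without any charset conversion along the way is to save the response bytes unchanged and decode them as UTF-8 only when reading. The following is a rough sketch using the JDK's java.net.http client (Java 11+). The class name, the target file name and the status check are assumptions for the example, and the server may expect additional request headers that are not shown here.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class BlacklistDownload {

    // Saves the response body byte for byte, so no charset conversion can damage it.
    static Path download(String url, Path target) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        HttpResponse<Path> response =
                client.send(request, HttpResponse.BodyHandlers.ofFile(target));
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    // Reads the saved file back, decoding it explicitly as UTF-8.
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        Path file = download(
                "https://meta.wikimedia.org/wiki/Spam_blacklist?action=raw",
                Path.of("blacklist.txt"));
        System.out.println(readLines(file).size() + " lines read");
    }
}
```

The code in the question already decodes the resource as UTF-8, so once the file itself is intact no change should be needed on the reading side.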
You may also want to compile the regular expressions with the Unicode flag. Also take into account that:
What is referred to here as regular expressions are not proper regular expressions, but rather subpatterns that are inserted into a hard-coded regular expression. i.e. the subpattern Foo from above would create a regular expression like /^Foo$/usi.
(see https://www.mediawiki.org/wiki/Extension:TitleBlacklist#Block_list).
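A rough Java counterpart of the /^Foo$/usi wrapping described in that passage could look like the sketch below. The flag mapping is an approximation (i as CASE_INSENSITIVE, u as UNICODE_CASE, s as DOTALL), the class and method names are invented for the example, and whether anchoring with ^ and $ suits a URL spam filter is a separate decision; for finding an entry anywhere inside a URL you would compile the subpattern without anchors and use find() instead.

```java
import java.util.regex.Pattern;

public class SubpatternCompiler {

    // Approximate Java counterparts of the PHP modifiers in /usi:
    // i -> CASE_INSENSITIVE, u -> UNICODE_CASE, s -> DOTALL.
    static final int FLAGS =
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.DOTALL;

    // Wraps a list entry in ^(?:...)$ so the entry has to match the whole input.
    static Pattern compileAnchored(String subpattern) {
        return Pattern.compile("^(?:" + subpattern + ")$", FLAGS);
    }

    public static void main(String[] args) {
        // Sample entry based on the one discussed above; the escape is U+0131, the dotless i.
        Pattern p = compileAnchored("facebo(?:o[ob]|\u0131o)k\\.com");

        System.out.println(p.matcher("FACEBOOOK.COM").matches());         // true: case-insensitive
        System.out.println(p.matcher("facebook.com").matches());          // false: the entry needs three o's
        System.out.println(p.matcher("faceboook.com.example").matches()); // false: anchored at both ends
    }
}
```

The non-capturing group around the subpattern keeps an entry containing a top-level | from escaping the anchors.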