The preg_match must match any of the words in the $string variable (as long as they are at least 3 chars long) with any of the words in the $forbidden array, but here’s the issue:

If the $string contains the word mamíferos (with an accented character) instead of mamiferos, it should also be a match. The same applies if acompañar is in the forbidden array, but the user types acompanar instead (without the accent).
$forbidden = array('mamiferos', 'acompañar');
$string = 'los mamíferos corren libres y quieren acompanar a su madre';

if (preg_match('/\b(?:' . implode('|', $forbidden) . '){3,}/i', $string)) {
    echo 'match!';
} else {
    echo 'nope...';
}
Answer
I suggest a solution based on removing all combining Unicode characters from both the filtered string and the forbidden words. It requires the intl extension (sudo apt install php7.4-intl && sudo phpenmod intl). First, it decomposes the Unicode string into base characters and combining modifiers; second, it removes all modifiers (\p{M}):
<?php
$string = 'los mamíferos corren libres y quieren acompanar a su madre';
$forbidden = ['mamiferos', 'acompañar'];

// Decompose to NFD, then drop every combining mark (\p{M}).
function strip(string $accented): string
{
    $decomposed = Normalizer::normalize($accented, Normalizer::FORM_D);
    return preg_replace('/\p{M}/u', '', $decomposed);
}

// True if any forbidden word occurs in $string, ignoring case and accents.
function filter(string $string, array $words): bool
{
    $regex = '/\b(?:' . implode('|', $words) . ')/iu';
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match(strip($regex), strip($string)) === 1;
}

echo (filter($string, $forbidden) ? 'match!' : 'nope...') . "\n";
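To illustrate why the decomposition step comes first, here is a minimal sketch that does not need intl: it writes the two forms of í by hand (the variable names are only for illustration). \p{M} matches separate combining marks only, so a precomposed í passes through unchanged, while the decomposed form loses its accent.

```php
<?php
$precomposed = "\u{00ED}";   // í as a single code point (NFC form)
$decomposed  = "i\u{0301}";  // i followed by COMBINING ACUTE ACCENT (NFD form)

// The precomposed letter contains no combining mark, so nothing is removed.
var_dump(preg_replace('/\p{M}/u', '', $precomposed) === 'i'); // bool(false)

// After decomposition the accent is a separate mark and gets stripped.
var_dump(preg_replace('/\p{M}/u', '', $decomposed) === 'i');  // bool(true)
```

This is why the answer's strip() function normalizes to FORM_D before running the replacement.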
By the way, I don’t understand the meaning of {3,} in your regular expression, so I removed it from mine. If you think it will match a string containing three or more forbidden words, you are mistaken: a quantifier on the group requires the forbidden words to follow each other immediately, with nothing in between.
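A small sketch of that behaviour, using made-up placeholder words (foo, bar) rather than the real forbidden list:

```php
<?php
// Hypothetical sample pattern, chosen only to show how {3,} applies to the group.
$pattern = '/\b(?:foo|bar){3,}/i';

// Three alternatives back to back: the group repeats three times, so it matches.
var_dump(preg_match($pattern, 'foobarfoo') === 1);   // bool(true)

// Three words separated by spaces: the group never repeats consecutively.
var_dump(preg_match($pattern, 'foo bar foo') === 1); // bool(false)
```

So {3,} counts consecutive repetitions of the whole group. It does not count the characters in a word, and it does not count forbidden words scattered through the string.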
Further reading: https://www.php.net/manual/en/class.normalizer.