I’m looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my web application uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP’s json_encode() and overall seems like a bad idea to have around.
The W3C I18N FAQ on multilingual forms says: “If non-UTF-8 data is received, an error message should be sent back.”
- How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
- How do you present the error in a helpful way to the user?
- How do you temporarily store and display bad form data so the user doesn’t lose all their text? Strip bad characters? Use a replacement character, and how?
- For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as-is in the database and do something (what?) before json_encode()?
I’m very familiar with the mbstring extension and am not asking “how does UTF-8 work in PHP?”. I’d like advice from people with real-world experience on how they’ve handled this.
As part of the solution, I’d really like to see a fast method to convert invalid characters to U+FFFD.
Answer
The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data in that encoding. Crappy form submission bots are a good example…
I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions (these only convert between ISO-8859-1 and UTF-8, and are deprecated as of PHP 8.2). If you use iconv(), you also have the option to transliterate bad characters.
Here is an example using iconv():
    $str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
    $str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
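Since the question asks for a way to convert invalid characters to U+FFFD rather than drop them, here is a minimal sketch using mbstring instead of iconv(). It assumes the mbstring extension is loaded; to_valid_utf8() is a hypothetical helper name, not a built-in. How many replacement characters are emitted for a longer run of bad bytes may differ between PHP versions, so treat the exact output as unspecified.

```php
<?php
// Sketch: replace invalid UTF-8 byte sequences with U+FFFD instead of dropping them.
// to_valid_utf8() is a hypothetical helper name (an assumption, not part of PHP).
function to_valid_utf8(string $str): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already valid, nothing to do
    }
    $previous = mb_substitute_character(); // remember the global setting
    mb_substitute_character(0xFFFD);       // substitute with U+FFFD
    $fixed = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);    // restore the global setting
    return $fixed;
}

$fixed = to_valid_utf8("abc\xFFdef");      // \xFF can never appear in valid UTF-8
var_dump(mb_check_encoding($fixed, 'UTF-8')); // bool(true)
```

The mb_check_encoding() call up front keeps the common case (valid input) cheap and leaves valid strings untouched.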
If you want to display an error message to your users, I’d probably do it globally rather than for each received value. Something like this would probably do just fine:
    function utf8_clean($str)
    {
        return iconv('UTF-8', 'UTF-8//IGNORE', $str);
    }

    $clean_GET = array_map('utf8_clean', $_GET);

    if (serialize($_GET) != serialize($clean_GET)) {
        $_GET = $clean_GET;
        $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
    }

    // $_GET is clean!
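Request arrays can be nested (for example with foo[]=... parameters), in which case a flat array_map() would pass an array to the cleaning function. A rough sketch of a recursive variant follows; utf8_clean_recursive() and the $dirty flag are hypothetical names, and the stripping is done with mbstring here (an assumption that the extension is available) rather than iconv(). Array keys are not checked.

```php
<?php
// Sketch: strip invalid UTF-8 from every string in a possibly nested array.
// utf8_clean_recursive() is a hypothetical name; $dirty is set to true when
// at least one value had to be changed. Array keys are left untouched.
function utf8_clean_recursive(array $input, bool &$dirty = false): array
{
    array_walk_recursive($input, function (&$value) use (&$dirty) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $dirty = true;
            $previous = mb_substitute_character();
            mb_substitute_character('none'); // drop invalid bytes entirely
            $value = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
            mb_substitute_character($previous);
        }
    });
    return $input;
}

// Demo data standing in for $_GET.
$request = ['name' => 'ok', 'tags' => ['fine', "bad\xFFbyte"]];
$dirty = false;
$request = utf8_clean_recursive($request, $dirty);
if ($dirty) {
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}
```

Tracking a flag avoids serializing both arrays just to compare them, at the cost of a slightly more involved callback.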
You may also want to normalize new lines and strip (non-)visible control chars, like this:
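    function Clean($string, $control = true)
    {
        $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

        if ($control === true) {
            // Strip every character in the Unicode "Other" (C) category,
            // which includes tabs and newlines.
            return preg_replace('~\p{C}+~u', '', $string);
        }

        // Normalize \r\n and \r to \n, then strip "Other" characters except tab and newline.
        return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
    }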
Code to convert from UTF-8 to Unicode code points:
    function Codepoint($char)
    {
        $result = null;
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint)) {
            $result = sprintf('U+%04X', $codepoint[1]);
        }

        return $result;
    }

    echo Codepoint('à'); // U+00E0
    echo Codepoint('ひ'); // U+3072
It is probably faster than any other alternative, though I haven’t tested it extensively.
Example:
    $string = "\xEF\xBF\xBEhello world\xEF\xBF\xBD"; // U+FFFE, "hello world", U+FFFD

    echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);
    // U+FFFEhello worldU+FFFD

    function Bad_Codepoint($string)
    {
        $result = array();

        // preg_replace_callback() passes an array of matches.
        foreach ((array) $string as $char) {
            $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

            if (is_array($codepoint) && array_key_exists(1, $codepoint)) {
                $result[] = sprintf('U+%04X', $codepoint[1]);
            }
        }

        return implode('', $result);
    }
This may be what you were looking for.
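For the json_encode() errors mentioned in the question, one option for data already in the database is to leave it as-is and substitute at output time. A minimal sketch, assuming PHP 7.2 or later (where the JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE flags exist); the sample data is made up for illustration:

```php
<?php
// Sketch: let json_encode() substitute U+FFFD for invalid bytes at output time.
// Requires PHP 7.2+. The sample value below is made up for illustration.
$data = ['comment' => "caf\xE9 latte"]; // \xE9 is Latin-1 "é", invalid as UTF-8 here

$strict  = json_encode($data);                                // fails on invalid UTF-8
$lenient = json_encode($data, JSON_INVALID_UTF8_SUBSTITUTE);  // bad bytes become U+FFFD

var_dump($strict); // bool(false)
echo $lenient, "\n";
```

This only papers over the problem when encoding; the stored bytes remain invalid, so it may still be worth cleaning the data at the source.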