How to handle user input of invalid UTF-8 characters

Question

I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users. Even though my web application uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around. W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be

Accepted Answer

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data in that encoding. Crappy form submission bots are a good example...

I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv(), you also have the option to transliterate bad characters.

Here is an example using iconv():

    // Drop invalid byte sequences
    $str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
    // Try to transliterate characters that cannot be represented
    $str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);

If you want to display an error message to your users, I'd probably do this in a global way rather than for each value received. Something like this would probably do just fine:

    function utf8_clean($str)
    {
        return iconv('UTF-8', 'UTF-8//IGNORE', $str);
    }

    $clean_GET = array_map('utf8_clean', $_GET);

    // If cleaning changed anything, the input contained invalid UTF-8
    if (serialize($_GET) != serialize($clean_GET))
    {
        $_GET = $clean_GET;
        $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
    }

    // $_GET is clean!

You may also want to normalize new lines and strip (non-)visible control chars, like this:

    function Clean($string, $control = true)
    {
        $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

        if ($control === true)
        {
            // Strip every character in the Unicode "Other" (C) category
            return preg_replace('~\p{C}+~u', '', $string);
        }

        // Normalize \r\n and \r to \n, then strip control chars except tab and newline
        return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
    }

Code to convert from UTF-8 to Unicode code points:

    function Codepoint($char)
    {
        $result = null;
        // UCS-4BE is a fixed 4 bytes per character, so unpack('N') yields the code point
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result = sprintf('U+%04X', $codepoint[1]);
        }

        return $result;
    }

    echo Codepoint('à'); // U+00E0
    echo Codepoint('ひ'); // U+3072

It is probably faster than any other alternative, but I haven't tested it extensively.

Example:

    $string = 'hello world�';

    // U+FFFEhello worldU+FFFD
    echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

    function Bad_Codepoint($string)
    {
        $result = array();

        // $string is the array of matches passed in by preg_replace_callback()
        foreach ((array) $string as $char)
        {
            $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

            if (is_array($codepoint) && array_key_exists(1, $codepoint))
            {
                $result[] = sprintf('U+%04X', $codepoint[1]);
            }
        }

        return implode('', $result);
    }

This may be what you were looking for.
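
A note on the "clean globally and warn" approach above: array_map() over $_GET hands nested arrays (for example from fields named tags[]) straight to the callback, which iconv() cannot handle, so a recursive walk is one way to cover them. The sketch below is only an illustration under some assumptions: it swaps iconv() for mbstring's mb_check_encoding() and mb_scrub() (which needs PHP 7.2 or later), and that swap is a substitution, not the method used in this answer. The helper name utf8_scrub_recursive and the sample $input array are hypothetical.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Assumes the mbstring extension is loaded and PHP >= 7.2 (for mb_scrub()).

/**
 * Recursively replace invalid UTF-8 sequences in strings and nested arrays.
 * Sets $changed to true if any string had to be modified.
 * Array keys are left untouched in this sketch.
 */
function utf8_scrub_recursive($value, bool &$changed)
{
    if (is_array($value)) {
        $clean = [];
        foreach ($value as $key => $item) {
            $clean[$key] = utf8_scrub_recursive($item, $changed);
        }
        return $clean;
    }

    if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
        $changed = true;
        // mb_scrub() substitutes invalid byte sequences (typically with "?")
        return mb_scrub($value, 'UTF-8');
    }

    // Valid strings and non-string values pass through unchanged
    return $value;
}

// Sample input standing in for $_GET: one valid value, one with a stray 0xFF byte
$input = ['name' => "caf\xC3\xA9", 'tags' => ['ok', "bad\xFFbyte"]];

$changed = false;
$clean = utf8_scrub_recursive($input, $changed);

if ($changed) {
    $error_msg = 'Your data is not valid UTF-8 and has been cleaned.';
}
```

Because the helper reports whether anything changed through the by-reference flag, the serialize() comparison is not needed here, and the cleaned array should be safe to pass to json_encode().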

Answer