How to handle user input of invalid UTF-8 characters

I’m looking for a general strategy/advice on how to handle invalid UTF-8 input from users.

Even though my web application uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP’s json_encode() and overall seems like a bad idea to have around.

The W3C I18N FAQ on multilingual forms says: “If non-UTF-8 data is received, an error message should be sent back.”

  • How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
  • How do you present the error in a helpful way to the user?
  • How do you temporarily store and display bad form data so the user doesn’t lose all their text? Strip bad characters? Use a replacement character, and how?
  • For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as-is in the database and do something (what?) before json_encode()?

I’m very familiar with the mbstring extension and am not asking “how does UTF-8 work in PHP?”. I’d like advice from people with real-world experience on how they’ve handled this.

As part of the solution, I’d really like to see a fast method to convert invalid characters to U+FFFD.

Answer

The accept-charset="UTF-8" attribute is only a guideline for browsers, and nothing forces a client to submit data in that encoding. Badly written form submission bots are a good example.

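I usually drop bad characters, either with iconv() or with the less reliable utf8_encode() / utf8_decode() functions (these only convert between ISO-8859-1 and UTF-8, and they are deprecated as of PHP 8.2). If you use iconv(), you also have the option to transliterate characters that cannot be represented.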

Here is an example using iconv():

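As a minimal sketch, assuming the iconv and mbstring extensions are both loaded: the `//IGNORE` suffix asks iconv to skip bytes it cannot convert (`//TRANSLIT` would instead ask it to approximate characters the target charset lacks). The helper name `utf8_drop_invalid()` is hypothetical, and the mbstring fallback is an assumption added because some iconv builds return false on invalid input instead of skipping it.

```php
<?php
// Hypothetical helper: returns $s with invalid UTF-8 byte sequences removed.
function utf8_drop_invalid(string $s): string
{
    // //IGNORE asks iconv to skip bytes that are not valid UTF-8.
    // The @ silences the notice that some builds raise on invalid input.
    $out = @iconv('UTF-8', 'UTF-8//IGNORE', $s);
    if ($out !== false) {
        return $out;
    }

    // Fallback (assumes mbstring): convert UTF-8 to UTF-8 with the
    // substitute character set to 'none', so invalid bytes are dropped.
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $out = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous); // restore the global setting

    return $out;
}

var_dump(utf8_drop_invalid("a\xFFb"));
```

The byte 0xFF can never appear in well-formed UTF-8, so it makes a convenient test input.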

If you want to display an error message to your users, I’d probably do it once per request instead of once per received value. Something like this would probably do just fine:

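One way to sketch such a global check, assuming mbstring is available: walk the whole input array once, replace anything invalid with U+FFFD so the user's text is kept, and raise a single flag for the request. The function `scrub_input()` and the variable names are hypothetical, and array keys are left unchecked here.

```php
<?php
// Use U+FFFD as the replacement for anything mbstring cannot decode.
mb_substitute_character(0xFFFD);

// Hypothetical helper: scrub every string in a (possibly nested) input
// array and set $changed to true if any value had to be repaired.
function scrub_input(array $input, bool &$changed = false): array
{
    foreach ($input as $key => $value) {
        if (is_array($value)) {
            $input[$key] = scrub_input($value, $changed);
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            // UTF-8 to UTF-8 conversion substitutes the invalid bytes.
            $input[$key] = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
            $changed = true;
        }
    }
    return $input;
}

// Run once, early in the request, for each input source you accept.
$hadInvalidInput = false;
$_POST = scrub_input($_POST, $hadInvalidInput);

$errorMessage = $hadInvalidInput
    ? 'Some characters could not be read and were replaced. Please check your text.'
    : null;
```

Because the repaired values go back into `$_POST`, the form can be redisplayed with the user's text intact and the replacement characters visible.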

You may also want to normalize line endings and strip control characters (most of which are invisible), like this:

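A minimal sketch of that cleanup, assuming the input has already been made valid UTF-8 (with the `u` modifier, preg_replace() returns null on malformed input). The name `normalize_text()` is hypothetical. It removes only the `Cc` (control) category; the broader `\p{C}` would also strip format characters such as the zero-width joiner, which some scripts and emoji sequences need.

```php
<?php
// Hypothetical helper: normalize line endings and strip control characters.
// Assumes $s is already valid UTF-8.
function normalize_text(string $s): string
{
    // Convert CRLF and lone CR to LF.
    $s = preg_replace('~\r\n?~', "\n", $s);

    // Remove control characters (category Cc) except tab and newline.
    // [^\P{Cc}\t\n] reads as "in Cc, and neither tab nor newline".
    $clean = preg_replace('~[^\P{Cc}\t\n]+~u', '', $s);
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    return $clean;
}

var_dump(normalize_text("line one\r\nline two\x00\x07"));
```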

Code to convert from UTF-8 to Unicode code points:

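A sketch of one way to do this, assuming the iconv extension is loaded and the input is valid UTF-8: convert to UCS-4BE, where every code point occupies four big-endian bytes, and unpack those as 32-bit integers. The function name `utf8_code_points()` and the `U+XXXX` output format are hypothetical choices.

```php
<?php
// Hypothetical helper: list the code points of a valid UTF-8 string
// as "U+XXXX" labels.
function utf8_code_points(string $s): array
{
    if ($s === '') {
        return [];
    }

    // UCS-4BE stores each code point as 4 big-endian bytes.
    $ucs4 = @iconv('UTF-8', 'UCS-4BE', $s);
    if ($ucs4 === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    $labels = [];
    // 'N*' reads unsigned 32-bit big-endian integers until the data ends.
    foreach (unpack('N*', $ucs4) as $codePoint) {
        $labels[] = sprintf('U+%04X', $codePoint);
    }
    return $labels;
}

print_r(utf8_code_points("a\u{E0}\u{20AC}"));
```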

It is probably faster than any other alternative, but I haven’t tested it extensively.

Example:

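As a hedged sketch of the U+FFFD conversion asked about in the question, assuming mbstring is available: check validity first so well-formed strings take a fast path, and only convert the ones that fail. The name `utf8_replace_invalid()` is hypothetical, and no speed claim is made for it.

```php
<?php
// Hypothetical helper: replace each invalid UTF-8 sequence with U+FFFD.
function utf8_replace_invalid(string $s): string
{
    // Fast path: most input is already valid and is returned untouched.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }

    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);
    // UTF-8 to UTF-8 conversion swaps undecodable bytes for the substitute.
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous); // restore the global setting

    return $clean;
}

$raw = "a\xFFb";                         // 0xFF is never valid in UTF-8
var_dump(json_encode($raw));             // fails on the malformed input
var_dump(json_encode(utf8_replace_invalid($raw)));
```

For data that only needs fixing on its way into json_encode(), PHP 7.2 and later also accept the `JSON_INVALID_UTF8_SUBSTITUTE` and `JSON_INVALID_UTF8_IGNORE` flags, which avoids touching the stored value at all.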

This may be what you were looking for.

User contributions licensed under: CC BY-SA