Only today I realized that I was missing this in my PHP scripts:
mysql_set_charset('utf8');
All my tables are InnoDB, collation “utf8_unicode_ci”, and all my VARCHAR columns are “utf8_unicode_ci” as well. I have mb_internal_encoding('UTF-8');
on my PHP scripts, and all my PHP files are encoded as UTF-8.
So, until now, every time I “INSERT” something with diacritics, example:
mysql_query('INSERT INTO `table` SET `name`="Jáuò Iñe"');
The ‘name’ contents would be, in this case: Jáuò Iñe
.
Since I fixed the charset between PHP and MySQL, new INSERTs are now storing correctly. However, I want to fix all the older rows that are “messed” at the moment. I tried many things already, but it always breaks the strings on the first “illegal” character. Here is my current code:
$m = mysql_real_escape_string('¿<?php echo "¬<b>'PHP á (á)ţăriîş </b>"; ?> ă-ţi abcdd;//;ñç´พดแทฝใจคçăâξβψδπλξξςαยนñ ;'); mysql_set_charset('utf8'); mysql_query('INSERT INTO `table` SET `name`="'.$m.'"'); mysql_set_charset('latin1'); mysql_query('INSERT INTO `table` SET `name`="'.$m.'"'); mysql_set_charset('utf8'); $result = mysql_iquery('SELECT * FROM `table`'); while ($row = mysql_fetch_assoc($result)) { $message = $row['name']; $message = mb_convert_encoding($message, 'ISO-8859-15', 'UTF-8'); //$message = iconv("UTF-8", "ISO-8859-1//IGNORE", $message); mysql_iquery('UPDATE `table` SET `name`="'.mysql_real_escape_string($message).'" WHERE `a1`="'.$row['a1'].'"'); }
It “UPDATE”s with the expected characters, except that the string gets truncated after the character “ă”. I mean, that character and following chars are not included on the string.
Also, testing with the “iconv()” (that is commented on the code) does the same, even with //IGNORE and //TRANSLIT
I also tested several charsets, between ISO-8859-1 and ISO-8859-15.
Advertisement
Answer
From what you describe, it seems you have UTF-8 data that was originally stored as Latin-1 and then not converted correctly to UTF-8. The data is recoverable; you’ll need a MySQL function like
convert(cast(convert(name using latin1) as binary) using utf8)
It’s possible that you may need to omit the inner conversion, depending on how the data was altered during the encoding conversion.