
CodeIgniter 4 localization issue with UTF-8 characters

The context

I am trying to lemmatize some texts, and I figured I could use CI4 localization for this. I created language files in app/Language/ro-RO that “translate” each word to its linguistic root.

The language files are encoded in UTF-8 (I checked it on the server with file -i command). PHP has UTF-8 as default charset. Apache has an AddDefaultCharset UTF-8 setting.

The header of each page has a proper declaration header('Content-Type: text/html; charset=UTF-8');. In CI I configured the App.php with public $charset = 'UTF-8' and public $defaultLocale = 'ro-RO';. In the header of each page served I also put the commands:

JavaScript

The issue

Whenever a label contains the Romanian diacritics ș or ț, the lang() function cannot find the translation. By contrast, it works fine with the other Romanian diacritics (ă, î and â). Apparently ș and ț are in the Latin Extended-B block, while ă is in Latin Extended-A and î and â are in Latin-1 Supplement.

Curiously, mb_ord() does not return an integer value for any of these diacritics. I wrote a small function that takes each word and displays it letter by letter along with the character code. You can see the result ($chunks is an array containing the words; the clean_character function just checks whether mb_ord returns an integer):

JavaScript
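A minimal sketch of this kind of per-character dump, not the asker's original function, might look like the following. The helper name dump_code_points is hypothetical. It assumes PHP 7.4 or later with the mbstring extension, where mb_ord() is documented to return an int for a valid character and false on failure, so passing the encoding explicitly helps rule out a charset mismatch.

```php
<?php
// Hypothetical helper (not from the original post): list each character
// of a UTF-8 word together with its Unicode code point.
function dump_code_points(string $word): array
{
    $rows = [];
    foreach (mb_str_split($word, 1, 'UTF-8') as $char) {
        // mb_ord() returns an int for a valid character, false on failure.
        $code = mb_ord($char, 'UTF-8');
        $rows[] = $code === false
            ? $char . ' => (invalid)'
            : sprintf('%s => U+%04X', $char, $code);
    }
    return $rows;
}

// The two words below look almost identical on screen.
print_r(dump_code_points("\u{0219}i")); // first letter is U+0219 (comma below)
print_r(dump_code_points("\u{015F}i")); // first letter is U+015F (cedilla)
```

Printing code points rather than the glyphs themselves makes it possible to tell apart characters that look nearly the same.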

I walked the internet up and down, but couldn’t find an explanation for this. I ran out of ideas.

Any thoughts?


Answer

Reading a lot of articles about Unicode and Romanian diacritics, I learned a bit of history. It seems that Microsoft made an error in the early days of Romanian charsets and misrepresented ș and ț as ş and ţ. You probably don’t notice a difference, but there is one: the first pair (U+0219 and U+021B) has a comma underneath s and t, while the second pair (U+015F and U+0163) has a cedilla. It is barely noticeable to the human eye, but computers are more perceptive :-). Unicode was the first standard to correct this mistake, but the fix created a world of problems, as a lot of existing data already used the “wrong” characters. This made text searching a lot more complicated for the Romanian language.
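To see why a lookup fails, here is a small sketch. It assumes the translation lookup ultimately comes down to an exact string match on an array key. The $translations array and the sample word are made up for illustration and are not taken from the actual language files.

```php
<?php
// Hypothetical language-file entry: the key was typed with the cedilla
// form of the letter (U+015F), as a legacy editor might produce.
$translations = ["ora\u{015F}ului" => "ora\u{015F}"];

$withComma   = "ora\u{0219}ului"; // s with comma below (U+0219)
$withCedilla = "ora\u{015F}ului"; // s with cedilla (U+015F)

var_dump($withComma === $withCedilla);        // bool(false)
var_dump(isset($translations[$withComma]));   // bool(false)
var_dump(isset($translations[$withCedilla])); // bool(true)

// The UTF-8 byte sequences differ as well.
var_dump(bin2hex("\u{0219}")); // string(4) "c899"
var_dump(bin2hex("\u{015F}")); // string(4) "c59f"
```

Both strings are valid UTF-8, so no encoding check will flag them. They are simply different characters.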

That was actually my problem too. The language files were properly encoded in UTF-8, but they were most probably first created in ISO-8859-2 and so contained the “wrong” (cedilla) characters for ș and ț. I just needed to replace every “wrong” character with the “right” one.
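The replacement can be sketched roughly as below. The constant and function names are hypothetical, and the mapping covers only the four cedilla code points and their comma-below counterparts (upper and lower case).

```php
<?php
// Cedilla code points mapped to their comma-below counterparts.
const CEDILLA_TO_COMMA = [
    "\u{015E}" => "\u{0218}", // capital S: cedilla -> comma below
    "\u{015F}" => "\u{0219}", // small s:   cedilla -> comma below
    "\u{0162}" => "\u{021A}", // capital T: cedilla -> comma below
    "\u{0163}" => "\u{021B}", // small t:   cedilla -> comma below
];

// Hypothetical helper: normalise Romanian text to the comma-below forms.
function fix_romanian_diacritics(string $text): string
{
    // strtr() with an array replaces every occurrence of each key;
    // working on bytes is safe here because the input is valid UTF-8.
    return strtr($text, CEDILLA_TO_COMMA);
}

// Example: both cedilla letters in this phrase are converted.
echo fix_romanian_diacritics("\u{015F}i \u{0163}ar\u{0103}"), PHP_EOL;
```

To fix a language file in place, the same function could be applied to the result of file_get_contents() and written back with file_put_contents(), ideally after making a backup. Note that the cedilla forms of s are still correct in other languages such as Turkish, so the replacement should only be run on Romanian text.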

I documented this here hoping that it will help someone in the future, as noticing the tiny visual difference between the two representations of ș and ț is not a piece of cake. It cost me a day of frustration.

User contributions licensed under: CC BY-SA