CodeIgniter 4 localization issue with UTF-8 characters

Question

The context: I am trying to lemmatize some texts, and I figured out that I can use CI4 localization for this. Basically, I created some files in AppLanguagesro-RO and I am "translating" the words to their linguistic root. The language files are encoded in UTF-8 (I checked this on the server with the file -i command). PHP has

Accepted Answer

After reading a lot of articles about Unicode and Romanian diacritics, I learned a bit of history. It seems that Microsoft made an error in the early days of Romanian charsets and represented ș and ț as ş and ţ. You probably don't notice a difference, but there is one: the first pair (U+0219 and U+021B) has a comma below the s and t, while the second pair (U+015F and U+0163) has a cedilla. It is barely noticeable to the human eye, but computers are more perceptive :-). Unicode was the first standard to correct this mistake, but the fix created a world of problems, because a lot of existing data already used the "wrong" characters. This made text searching a lot more complicated for the Romanian language.

That was actually my problem too. The language files were properly encoded in UTF-8, but they had most probably been created with ISO-8859-2, so they contained the "wrong" characters for ș and ț. I just needed to replace every "wrong" character with the "right" one.

I documented this here hoping that it will help someone in the future, as noticing the tiny visual difference between the two representations of ș and ț is not a piece of cake. It cost me a day of frustration.
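For anyone who wants to automate the replacement, here is a minimal sketch in Python. It is not part of the original fix, and the names `CEDILLA_TO_COMMA`, `fix_romanian_diacritics`, and `find_cedilla_forms` are made up for illustration. It assumes the text is already decoded as UTF-8 and only handles the four precomposed cedilla letters, not an s or t followed by a combining cedilla. The code points are written as escapes on purpose, since the glyphs are so hard to tell apart in an editor.

```python
# Map the legacy cedilla letters to the comma-below letters.
# Escapes are used because the glyphs look almost identical on screen.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015F": "\u0219",  # s with cedilla (lowercase) -> s with comma below
    "\u015E": "\u0218",  # S with cedilla (uppercase) -> S with comma below
    "\u0163": "\u021B",  # t with cedilla (lowercase) -> t with comma below
    "\u0162": "\u021A",  # T with cedilla (uppercase) -> T with comma below
})


def fix_romanian_diacritics(text: str) -> str:
    """Return text with cedilla s/t replaced by their comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)


def find_cedilla_forms(text: str) -> list:
    """List (index, character, code point) for every cedilla s/t in text."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in CEDILLA_TO_COMMA
    ]
```

Running `find_cedilla_forms` on the contents of a language file first shows whether it contains any of the cedilla forms at all. `fix_romanian_diacritics` then returns the corrected text, which can be written back to the file as UTF-8.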

