The context
I am trying to lemmatize some texts and figured out that I can use CI4 localization for this. Basically, I created some files in App\Languages\ro-RO and I am “translating” the words to their linguistic root.
The language files are encoded in UTF-8 (I checked it on the server with the file -i command). PHP has UTF-8 as default charset. Apache has an AddDefaultCharset UTF-8 setting.
Each page sends the header header('Content-Type: text/html; charset=UTF-8'); and in CI I configured App.php with public $charset = 'UTF-8'; and public $defaultLocale = 'ro-RO'; At the top of each page served I also run these commands:
mb_internal_encoding('UTF-8'); mb_http_output('UTF-8'); mb_http_input('UTF-8'); mb_regex_encoding('UTF-8');
The issue
Whenever a label contains the Romanian diacritics ș or ț, the lang() function cannot find the translation. By contrast, it works fine with the other Romanian diacritics (ă, î and â). Apparently ș and ț are part of the Latin Extended-B block, while ă is in Latin Extended-A and î and â are in Latin-1 Supplement.
Curiously, mb_ord() does not return an integer value for any of these diacritics. I made a small function to take each word and display it letter by letter along with the char code. You can see the result below ($chunks is an array containing the words; the clean_character function just checks whether mb_ord returns an integer):
private function displayTextInfo( $chunks ) {
    for ($i=0; $i<count($chunks); $i++):
        echo $chunks[$i] . ' - ';
        for ($j=0; $j<strlen($chunks[$i]); $j++):
            $char = substr($chunks[$i], $j, 1);
            if ( $this->clean_character($char) ) {
                echo $char . '(' . mb_ord( $char, 'UTF-8' ) . ') ';
            } else {
                echo $char . '(???)';
            }
        endfor;
        echo '<br>';
    endfor;
}

formaţiune - f(102) o(111) r(114) m(109) a(97) �(???)�(???)i(105) u(117) n(110) e(101)
depăşit - d(100) e(101) p(112) �(???)�(???)�(???)�(???)i(105) t(116)
I walked the internet up and down, but couldn’t find an explanation for this. I ran out of ideas.
Any thoughts?
Answer
Reading a lot of articles related to Unicode and Romanian diacritics, I learned a bit of history. It seems that Microsoft made an error in the early days of Romanian charsets and represented ș (U+0219) and ț (U+021B) as ş (U+015F) and ţ (U+0163). You probably don’t notice a difference, but there is one: the first pair has a comma underneath s and t, while the second pair has a cedilla. It is barely noticeable to the human eye, but computers are more perceptive :-). Unicode was the first standard to correct this mistake, but the fix created a world of problems, because a lot of existing data was already using the “wrong” characters. This made text searching a lot more complicated for the Romanian language.
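Since the two pairs look nearly identical on screen, printing the code points is a more reliable way to tell them apart. The following is only a minimal sketch: the helper name codePoints is hypothetical, and it assumes PHP 7.4 or later for mb_str_split. It splits the string into characters rather than bytes, which also matters for the letter-by-letter dump in the question: strlen() and substr() walk the string byte by byte, and a single byte of a multi-byte character is not valid UTF-8 on its own, so mb_ord() cannot return a code point for it.

```php
<?php
// Hypothetical helper: pair every character of a UTF-8 string with its code point.
function codePoints(string $text): array
{
    $result = [];
    // mb_str_split() (PHP 7.4+) splits on characters, not on bytes.
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $result[] = sprintf('%s(U+%04X)', $char, mb_ord($char, 'UTF-8'));
    }
    return $result;
}

// Comma-below s, cedilla s, comma-below t, cedilla t, written as escapes
// so that the source file itself cannot mix them up.
echo implode(' ', codePoints("\u{0219}\u{015F}\u{021B}\u{0163}")), PHP_EOL;
```

Running the same helper on a word taken from a language file and on the same word taken from the input text should show quickly whether the two sides use different variants.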
That was actually my problem too. The language files were properly encoded in UTF-8, but their content had most probably been created with ISO-8859-2, so they contained the “wrong” characters for ș and ț. I just needed to do a complete replace of the “wrong” characters with the “right” ones.
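For anyone who prefers to do the replacement in code rather than in an editor, here is a rough sketch of how it could look. The constant and function names are made up for illustration, and the uppercase pairs are included on the assumption that they may also occur in the data. Whether to run this once over the language files or over every incoming string before the lookup depends on your setup.

```php
<?php
// Cedilla variants mapped to the comma-below variants.
const CEDILLA_TO_COMMA = [
    "\u{015F}" => "\u{0219}", // s with cedilla -> s with comma below
    "\u{015E}" => "\u{0218}", // S with cedilla -> S with comma below
    "\u{0163}" => "\u{021B}", // t with cedilla -> t with comma below
    "\u{0162}" => "\u{021A}", // T with cedilla -> T with comma below
];

// Hypothetical helper: rewrite cedilla forms to comma-below forms.
function normalizeRomanianDiacritics(string $text): string
{
    // strtr() with an array replaces whole byte sequences, so it is safe
    // to use on UTF-8 strings without any mb_* function.
    return strtr($text, CEDILLA_TO_COMMA);
}

// "formaţiune" typed with the cedilla variant of t.
echo normalizeRomanianDiacritics("forma\u{0163}iune"), PHP_EOL;
```

The other diacritics (ă, î, â) are left untouched, since they have only one representation in both encodings.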
I documented this here hoping that it will help someone in the future, as noticing the tiny visual difference between the two representations of ș and ț is not a piece of cake. It cost me a day of frustration.