CodeIgniter 4 localization issue with UTF-8 characters

The context

I am trying to lemmatize some texts, and I figured out that I can use CI4 localization for this. Basically, I created some files in app/Language/ro-RO and I am “translating” the words to their linguistic root.

The language files are encoded in UTF-8 (I checked this on the server with the file -i command). PHP has UTF-8 as its default charset. Apache has an AddDefaultCharset UTF-8 setting.

Each page sends a proper header('Content-Type: text/html; charset=UTF-8'); declaration. In CI4 I configured App.php with public $charset = 'UTF-8' and public $defaultLocale = 'ro-RO';. At the top of each page served I also run:

mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_regex_encoding('UTF-8');

The issue

Whenever a label contains the Romanian diacritics ș or ț, the lang() function cannot find the translation. By contrast, it works fine with the other Romanian diacritics (ă, î and â). Apparently ș and ț are in the Latin Extended-B block, while ă is in Latin Extended-A and î and â are in Latin-1 Supplement.

Curiously, mb_ord() does not return an integer value for any of these diacritics. I wrote a small function that prints each word letter by letter along with its character code. The result is shown below ($chunks is an array containing the words; clean_character() just checks whether mb_ord() returns an integer):

  private function displayTextInfo( $chunks ) {
    for ($i=0; $i<count($chunks); $i++):
      echo $chunks[$i] . ' - ';
      for ($j=0; $j<strlen($chunks[$i]); $j++):
        $char = substr($chunks[$i], $j, 1);
        if ( $this->clean_character($char) ) {
          echo $char . '(' . mb_ord( $char, 'UTF-8' ) . ') ';
        } else {
          echo $char . '(???)';
        }
      endfor;
      echo '<br>';
    endfor;
  }

formaţiune - f(102) o(111) r(114) m(109) a(97) �(???)�(???)i(105) u(117) n(110) e(101)
depăşit - d(100) e(101) p(112) �(???)�(???)�(???)�(???)i(105) t(116)

I walked the internet up and down, but couldn’t find an explanation for this. I ran out of ideas.

Any thoughts?

Answer

After reading a lot of articles about Unicode and Romanian diacritics, I learned a bit of history. It seems that Microsoft made an error in the early days of Romanian charsets and represented ș and ț as ş and ţ. You probably don’t notice a difference, but there is one: the first pair (U+0219, U+021B) has a comma below the s and t, while the second pair (U+015F, U+0163) has a cedilla. It is barely noticeable to the human eye, but computers are more perceptive :-). Unicode was the first standard to correct this mistake, but that created a world of problems, because a lot of existing data already used the “wrong” characters. This made text searching a lot more complicated for the Romanian language.
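To make the difference visible, it helps to print code points rather than glyphs. Below is a minimal sketch; describeCodePoints is a hypothetical helper name, and it assumes PHP 7.4 or later for mb_str_split(). It splits on characters instead of bytes, which is likely why the strlen()/substr() loop in the question printed (???) for every diacritic: each of those letters takes two bytes in UTF-8, and mb_ord() cannot decode half a character.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: list each character of a UTF-8 string
 * together with its Unicode code point.
 * mb_str_split() splits on characters, not on bytes.
 */
function describeCodePoints(string $word): string
{
    $parts = [];
    foreach (mb_str_split($word, 1, 'UTF-8') as $char) {
        $parts[] = sprintf('%s(U+%04X)', $char, mb_ord($char, 'UTF-8'));
    }
    return implode(' ', $parts);
}

// Same word, written once with s-cedilla and once with s-comma-below.
echo describeCodePoints("dep\u{0103}\u{015F}it"), PHP_EOL;
echo describeCodePoints("dep\u{0103}\u{0219}it"), PHP_EOL;
```

The two words look the same on screen, but the fifth character prints as U+015F (s with cedilla) in the first line and U+0219 (s with comma below) in the second.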

That was actually my problem too. The language files were properly encoded in UTF-8, but they were most probably created from ISO-8859-2 content, which only has the “wrong” (cedilla) characters for ș and ț. Since lang() matches keys exactly, a lookup only succeeds when both sides use the same code points. I just needed to replace all the “wrong” characters with the “right” ones.
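The replacement can be done with a small mapping. This is only a sketch, assuming all text is UTF-8 and that the comma-below forms are the ones wanted everywhere; normalizeRomanianDiacritics is a hypothetical name, not something CodeIgniter provides.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: replace the cedilla forms of s and t with the
 * comma-below forms used in correct Romanian text.
 * strtr() works on bytes, but replacing whole UTF-8 sequences is safe.
 */
function normalizeRomanianDiacritics(string $text): string
{
    return strtr($text, [
        "\u{015E}" => "\u{0218}", // S with cedilla -> S with comma below
        "\u{015F}" => "\u{0219}", // s with cedilla -> s with comma below
        "\u{0162}" => "\u{021A}", // T with cedilla -> T with comma below
        "\u{0163}" => "\u{021B}", // t with cedilla -> t with comma below
    ]);
}

echo normalizeRomanianDiacritics("forma\u{0163}iune"), PHP_EOL;
```

Running this once over the contents of the language files, and over each word before it is passed to lang(), keeps both sides of the lookup on the same code points.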

I documented this here hoping that it will help someone in the future, as noticing the tiny visual difference between the two representations of ș and ț is not a piece of cake. It cost me a day of frustration.

User contributions licensed under: CC BY-SA