
Encoding issue with PHP while writing in a .csv file

I’m working with a PHP array that contains values parsed from an earlier scraping step (using Simple HTML DOM Parser). I can print / echo the values of this array, which contain special characters such as é, à, è, without trouble. The problem is the following:

When I use fwrite to save the values in a .csv file, some characters are not saved correctly. For example, Székesfehérvár displays correctly in my PHP view in HTML, but is saved as Sz&#233;kesfeh&#233;rv&#225;r in the .csv file generated by the PHP script below.

Here is what I’ve already checked or tried in the PHP script:

  • The page I’m scraping seems to be UTF-8 encoded
  • My PHP script is also declared as UTF-8 in the header
  • I’ve tried many iconv and mb_encode calls in different places in the code
  • NOTE: when I console.log my PHP array in JS, using json_encode, the characters are also broken. Could this be linked to the original encoding of the page I’m scraping?

Here’s the part of the script that writes the values to a .csv file:

<?php 

$data = array(
            array("item1", "item2"), 
            array("item1", "item2"),
            array("item1", "item2"),
            array("item1", "item2")
            // ...
);

//filename
$filename = 'myFileName.csv';

$string_txt = ""; //declares the content of the .csv as a string

foreach($data as $line) {
    //builds a new line of the .csv
    $line_txt = "";
    foreach($line as $item) {
        //each line of the .csv holds the values of the php subarray, tab separated
        $line_txt .= $item . "\t";
    }

    //PHP endline constant, indicates the next line of the .csv
    $line_txt .= PHP_EOL;

    //add the line to the string which is the global content of the .csv
    $string_txt .= $line_txt;
}

//writing the string in a .csv file 
$file = fopen($filename, 'w+');
fwrite($file, $string_txt);
fclose($file);

I’m currently stuck because I can’t save values with accented characters correctly.


Answer

The solution (provided by @misorude) :

When scraping HTML content from web pages, what your debug output displays can differ from what the script actually scraped: the browser decodes HTML entities when it renders the page, so a string that still contains entities looks correct on screen. I had to use html_entity_decode so that PHP works with the real characters rather than the entity-encoded HTML I had scraped.
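
As a rough illustration of that difference, here is a minimal sketch. It assumes the scraped string contains numeric entities; the `$scraped` value is a made-up sample, not output from the actual scraper, and the flags passed to html_entity_decode are one reasonable choice rather than the only one.

```php
<?php
// Hypothetical sample: what the scraper may really hold (numeric HTML entities).
$scraped = "Sz&#233;kesfeh&#233;rv&#225;r";

// Turn the entities into real UTF-8 characters.
$decoded = html_entity_decode($scraped, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Escaping the raw value makes the entities visible in an HTML debug view,
// instead of letting the browser silently render them as accented letters.
echo htmlspecialchars($scraped, ENT_QUOTES, 'UTF-8'), PHP_EOL;

// The decoded value no longer contains any entity markup.
echo $decoded, PHP_EOL;
```

Echoing the raw value through htmlspecialchars is a quick way to see what the string really contains when the debug output is viewed in a browser.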

To check that the values were retrieved correctly before storing them somewhere, you can console.log them in JS and see whether they come through intact:

PHP

//decoding the numeric HTML entities that represent "Sóstói Stadion"
$b = html_entity_decode("S&#243;st&#243;i Stadion"); 

Javascript (to test):

<script>
var b = <?php echo json_encode($b) ;?>;

//prints "Sóstói Stadion" correctly
console.log(b); 
</script>
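
Applied to the .csv export from the question, the same idea could look like the sketch below. It is only an assumption about how the pieces fit together: the `$data` rows are invented samples, and fputcsv is swapped in for the manual string building, so this is not the asker's exact script.

```php
<?php
// Hypothetical rows, as they might come out of the scraper (entities still encoded).
$data = [
    ["S&#243;st&#243;i Stadion", "Sz&#233;kesfeh&#233;rv&#225;r"],
];

$filename = 'myFileName.csv';
$file = fopen($filename, 'w');

foreach ($data as $line) {
    // Decode every cell to real UTF-8 characters before it is written.
    $decoded = array_map(
        fn($item) => html_entity_decode($item, ENT_QUOTES | ENT_HTML5, 'UTF-8'),
        $line
    );

    // Tab-separated output; fputcsv handles quoting and the line ending.
    // The empty escape argument (PHP 7.4+) turns off PHP's non-standard escaping.
    fputcsv($file, $decoded, "\t", '"', '');
}

fclose($file);
```

Note that fputcsv wraps fields containing spaces or tabs in double quotes, which the hand-built string in the question did not do.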
User contributions licensed under: CC BY-SA