I’m working with a PHP array which contains some values parsed from a previous scraping process (using Simple HTML DOM Parser). I can print / echo the values of this array, which contains special characters (é, à, è, etc.), without any trouble. The problem is the following:
When I’m using fwrite to save the values in a .csv file, some characters are not saved correctly. For example, Székesfehérvár is displayed correctly in my PHP view in HTML, but is saved as Sz&#233;kesfeh&#233;rv&#225;r in the .csv file which I generate with the PHP script below.
I’ve already set up several things in the PHP script:
- The page I’m scraping seems to be UTF-8 encoded
- My PHP script is also declared as UTF-8 in the header
- I’ve tried a lot of iconv and mb_encode methods in different places in the code
NOTE that when I console.log my PHP array in JS, using json_encode, the characters are also broken. Maybe this is linked to the original encoding of the page I’m scraping?
Here’s the part of the script that writes the values to a .csv file:
<?php
$data = array(
    array("item1", "item2"),
    array("item1", "item2"),
    array("item1", "item2"),
    array("item1", "item2")
    // ...
);
//filename
$filename = 'myFileName.csv';
//declares the content of the .csv as a string
$string_txt = "";
foreach ($data as $line) {
    //writes a new line of the .csv
    $line_txt = "";
    foreach ($line as $item) {
        //each line of the .csv equals the values of the php subarray, tab separated
        $line_txt .= $item . "\t";
    }
    //PHP endline constant, indicates the next line of the .csv
    $line_txt .= PHP_EOL;
    //add the line to the string which is the global content of the .csv
    $string_txt .= $line_txt;
}
//writing the string in a .csv file
$file = fopen($filename, 'w+');
fwrite($file, $string_txt);
fclose($file);
I am currently stuck because I can’t save values with accented characters correctly.
Answer
The solution (provided by @misorude):
When scraping HTML content from web pages, there is a difference between what is displayed in your debug output and what the script has really scraped. The browser renders HTML entities as characters, so the debug view looks fine even though the string still contains the entities. I had to use html_entity_decode so that PHP works with the true value of the HTML I had scraped, and not the browser’s interpretation of it.
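The gap between the debug view and the real string can be made visible with a small check. This is only a sketch: the value of `$scraped` is a hypothetical stand-in for what the scraper might return, assuming the page encodes accented letters as numeric entities, and the variable names are illustrative.

```php
<?php
// Hypothetical scraped value (assumption: the page source uses numeric entities).
$scraped = 'Sz&#233;kesfeh&#233;rv&#225;r';

// Echoed into an HTML page as-is, the browser renders the entities,
// so the value looks correct. Escaping it first shows what PHP really holds.
$raw = htmlspecialchars($scraped, ENT_QUOTES, 'UTF-8');

// Turn the entities into real UTF-8 characters.
$decoded = html_entity_decode($scraped, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $raw, PHP_EOL;      // the entities stay visible when viewed in a browser
echo $decoded, PHP_EOL;  // the accented characters themselves
```

If the escaped output shows `&#...;` sequences in the browser, the string holds entities rather than a wrong character encoding, which would explain why iconv-style conversions have no effect on it.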
To check that the values have been retrieved correctly before storing them somewhere, you can console.log them in JS:
PHP
//decoding the numeric HTML entities which represent "Sóstói Stadion"
$b = html_entity_decode("S&#243;st&#243;i Stadion");
JavaScript (to test):
<script>
var b = <?php echo json_encode($b); ?>;
//prints "Sóstói Stadion" correctly
console.log(b);
</script>
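Applied to the original problem, the same decoding step can run on every cell before the file is written. The following is a minimal sketch under stated assumptions: the rows in `$data` are hypothetical sample values, the output path is made up for the example, and `$decode` is an illustrative helper rather than anything from the original script.

```php
<?php
// Hypothetical scraped rows; the entity-encoded strings stand in for real scraper output.
$data = [
    ['Sz&#233;kesfeh&#233;rv&#225;r', 'S&#243;st&#243;i Stadion'],
    ['item1', 'item2'],
];

// Illustrative helper: decode one cell into real UTF-8 characters.
$decode = static function (string $cell): string {
    return html_entity_decode($cell, ENT_QUOTES | ENT_HTML5, 'UTF-8');
};

// Build the tab-separated content, one line per sub-array.
$contents = '';
foreach ($data as $row) {
    $contents .= implode("\t", array_map($decode, $row)) . PHP_EOL;
}

// Hypothetical output path in the system temp directory.
$filename = sys_get_temp_dir() . '/decoded_example.csv';
file_put_contents($filename, $contents);
```

Decoding at the point of writing keeps the scraped array untouched, so the raw values remain available for debugging.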