This is the URL that I want to parse: http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0
I use simple_html_dom.php but it can’t read the HTML because the HTML is encoded.
So I think I need to parse the page source as it is actually served online. Is there any way I can parse this website?
The source code looks like this:
<html>
<body>
<table class="table1">
<tbody>
<tr><th>***title</th>
<th class='ltr'>***99/2/24 12:10</th>
</tr>
<tr><td colspan="2">***message text here<hr /></td></tr>
</tbody>
</table>
</body>
</html>
My code:
<?php
require_once('simple_html_dom.php');
$url = "http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0";
$html = file_get_html($url);
foreach($html->find('th') as $element)
echo $element->src . '<br>';
?>
Answer
The issue, as you pointed out, is the encoding: the response is gzip-encoded. You can work around that by setting the cURL option CURLOPT_ENCODING. The php-curl documentation describes it as follows:
The contents of the “Accept-Encoding: ” header. This enables decoding of the response. Supported encodings are “identity”, “deflate”, and “gzip”. If an empty string, “”, is set, a header containing all supported encoding types is sent.
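To illustrate why the parser could not read the page, here is a minimal sketch, assuming the server replies with a gzip-compressed body as described above. The sample markup is hypothetical and stands in for the real page; gzencode() and gzdecode() come from PHP's zlib extension and only simulate what the server and cURL do.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<table class="table1"><tr><th>title</th></tr></table>';

// Simulate a body sent with "Content-Encoding: gzip".
$compressed = gzencode($html);

// The raw bytes begin with the gzip magic number 0x1f 0x8b, not with "<",
// so an HTML parser fed these bytes finds no tags at all.
$isGzip = strncmp($compressed, "\x1f\x8b", 2) === 0;

// Decompressing restores the markup; CURLOPT_ENCODING asks cURL to do this step.
$decoded = gzdecode($compressed);

var_dump($isGzip);
var_dump($decoded === $html);
```

If file_get_html() receives the compressed bytes unchanged, it has nothing recognizable to parse, which would match the behavior described in the question.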
Use the following php-curl code to fetch the decoded response HTML:
<?php
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => "http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0",
    // Return the body as a string instead of printing it directly.
    CURLOPT_RETURNTRANSFER => true,
    // Send "Accept-Encoding: gzip" and let cURL decompress the response.
    CURLOPT_ENCODING => "gzip",
    CURLOPT_MAXREDIRS => 10,
    // 0 means no timeout.
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => "GET",
));
$response = curl_exec($curl);
curl_close($curl);
echo $response;
?>
You can then pass the response HTML in $response to simple_html_dom.php, for example with str_get_html($response), to parse the DOM tree.
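As a rough sketch of the parsing step, the example below extracts the text of every th element from an already decoded response. To keep it self-contained it uses PHP's built-in DOMDocument rather than simple_html_dom.php, which is a substitution and not the approach from the question; with simple_html_dom the equivalent would be str_get_html($response) and reading $element->plaintext. Note that th elements have no src attribute, so the text content is read instead. The $response value here is a hypothetical stand-in modeled on the source shown in the question.

```php
<?php
// Hypothetical stand-in for the decoded $response returned by curl_exec().
$response = <<<HTML
<html><body>
<table class="table1"><tbody>
<tr><th>title</th><th class='ltr'>99/2/24 12:10</th></tr>
<tr><td colspan="2">message text here<hr /></td></tr>
</tbody></table>
</body></html>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them,
// since real-world markup is often not perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($response);
libxml_clear_errors();

// Gather the trimmed text of each <th> cell.
$headers = [];
foreach ($doc->getElementsByTagName('th') as $th) {
    $headers[] = trim($th->textContent);
}

print_r($headers);
```

In a real run, $response would come from the cURL call instead of the inline string.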
Here’s a working version of the code. http://phpfiddle.org/main/code/gb66-3kzq