This is the URL that I want to parse: http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0
I use simple_html_dom.php but it can’t read the HTML because the HTML is encoded.
So I think I need to fetch the live page source and parse that. Is there any way I can parse this website?
The source code looks like this:
<html>
<body>
<table class="table1">
<tbody>
<tr><th>***title</th>
<th class='ltr'>***99/2/24 12:10</th>
</tr>
<tr><td colspan="2">***message text here<hr /></td></tr>
</tbody>
</table>
</body>
</html>
My code:
<?php
require_once('simple_html_dom.php');
$url = "http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0";
$html = file_get_html($url);
// Print the text of every header cell (a <th> has no src attribute).
foreach ($html->find('th') as $element) {
    echo $element->plaintext . '<br>';
}
?>
Answer
The issue, as you pointed out, is the encoding: the response is gzip-encoded. You can set the CURLOPT_ENCODING option in cURL to work around that. Here is what it does, according to the php-curl documentation:
The contents of the “Accept-Encoding: ” header. This enables decoding of the response. Supported encodings are “identity”, “deflate”, and “gzip”. If an empty string, “”, is set, a header containing all supported encoding types is sent.
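To illustrate why the undecoded body is unreadable to an HTML parser, here is a minimal sketch that compresses a small string the way a gzip-encoding server would and then decodes it again. It assumes PHP's zlib extension is available; the sample string and variable names are made up for illustration, and this is not what cURL does internally, only the same kind of decoding.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<th>title</th>';

// gzencode() produces gzip-format data, like a body sent with Content-Encoding: gzip.
$compressed = gzencode($html);

// gzip data starts with the two magic bytes 0x1f 0x8b, so it is binary, not HTML.
$startsWithGzipMagic = strncmp($compressed, "\x1f\x8b", 2) === 0;

// gzdecode() reverses the compression and gives back the original markup.
$decoded = gzdecode($compressed);

var_dump($startsWithGzipMagic);
var_dump($decoded === $html);
```

With CURLOPT_ENCODING set, cURL handles this decoding step for you, so the parser receives plain HTML.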
Use the following php-curl code to get the response HTML:
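<?php
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => "http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0",
    CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
    CURLOPT_ENCODING => "gzip",       // send Accept-Encoding: gzip and decode the response
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,             // 0 means no timeout
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => "GET",
));
$response = curl_exec($curl);
if ($response === false) {
    // curl_exec() returns false on failure; report the reason and stop
    die('cURL error: ' . curl_error($curl));
}
curl_close($curl);
echo $response;
?>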
Then you can pass the response HTML, $response, directly to simple_html_dom.php to parse the DOM tree.
Here’s a working version of the code. http://phpfiddle.org/main/code/gb66-3kzq
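For the parsing step, where the decoded HTML is handed to simple_html_dom.php, a minimal sketch might look like the following. It assumes simple_html_dom.php sits next to the script, as in the question, and uses the library's str_get_html() to parse a string. The helper name headerCellTexts and the $sample string are hypothetical, introduced here only for illustration; the sample markup is copied from the question and stands in for $response.

```php
<?php
require_once('simple_html_dom.php'); // assumed to be in the same directory

// Hypothetical helper: returns the text of every <th> in an HTML string.
function headerCellTexts($htmlString) {
    $dom = str_get_html($htmlString);
    if ($dom === false) {
        // str_get_html() can return false, for example on empty input
        return array();
    }
    $texts = array();
    foreach ($dom->find('th') as $th) {
        $texts[] = trim($th->plaintext);
    }
    return $texts;
}

// Sample markup from the question; in practice pass $response instead.
$sample = '<table class="table1"><tbody><tr><th>***title</th>'
        . "<th class='ltr'>***99/2/24 12:10</th></tr></tbody></table>";
print_r(headerCellTexts($sample));
```

Reading plaintext rather than src matters here because a th element carries text, not a src attribute.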