This is the URL that I want to parse: http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0
I use simple_html_dom.php, but it can't read the HTML because the response is encoded.
So I think I need to parse the live page source. Is there any way I can parse this website?
The source code looks like this:
<html>
<body>
<table class="table1">
  <tbody>
    <tr>
      <th>***title</th>
      <th class='ltr'>***99/2/24 12:10</th>
    </tr>
    <tr><td colspan="2">***message text here<hr /></td></tr>
  </tbody>
</table>
</body>
My code:
<?php
require_once('simple_html_dom.php');

$url = "http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0";
$html = file_get_html($url);

// <th> elements have no src attribute, so print their text instead.
foreach ($html->find('th') as $element) {
    echo $element->plaintext . '<br>';
}
?>
Answer
The issue, as you pointed out, is the encoding: the response is gzip-encoded. You can set the CURLOPT_ENCODING option in curl to work around that. Here is what it does, according to the php-curl documentation:
The contents of the “Accept-Encoding: ” header. This enables decoding of the response. Supported encodings are “identity”, “deflate”, and “gzip”. If an empty string, “”, is set, a header containing all supported encoding types is sent.
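To make the encoding problem concrete, here is a minimal sketch of what a gzip-compressed body looks like to a parser and how decoding restores it. The sample HTML string and the looksGzipped() helper are assumptions made up for illustration; gzencode() and gzdecode() come from PHP's zlib extension. With CURLOPT_ENCODING set, curl performs the decoding step for you.

```php
<?php
// Hypothetical sample body, standing in for the real server response.
$original = '<html><body><table class="table1"></table></body></html>';

// Roughly what arrives when a server answers with "Content-Encoding: gzip".
$compressed = gzencode($original);

// Hypothetical helper: a gzip stream starts with the magic bytes 0x1f 0x8b.
function looksGzipped(string $body): bool
{
    return strncmp($body, "\x1f\x8b", 2) === 0;
}

// Decode only when the body is actually gzip data.
$decoded = looksGzipped($compressed) ? gzdecode($compressed) : $compressed;

var_dump(looksGzipped($compressed)); // the compressed body is binary, not HTML
var_dump($decoded === $original);    // decoding gives back the original markup
```

An HTML parser that is handed $compressed instead of $decoded sees binary data, which would explain why no elements are found.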
Use the following php-curl code to fetch the decoded response HTML:
<?php
$curl = curl_init();

curl_setopt_array($curl, array(
    CURLOPT_URL => "http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0",
    CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
    CURLOPT_ENCODING => "gzip",       // send Accept-Encoding: gzip and decode the response
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,             // 0 means no timeout
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => "GET",
));

$response = curl_exec($curl);
curl_close($curl);

echo $response;
?>
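Then you can pass the response HTML in $response to simple_html_dom.php to parse the DOM tree. Since $response is a string rather than a URL, load it with str_get_html($response) instead of file_get_html($url).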
Here’s a working version of the code. http://phpfiddle.org/main/code/gb66-3kzq
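As a sketch of the parsing step on an already-decoded string, the example below extracts the header cells from the sample markup in the question. It deliberately swaps in PHP's built-in DOMDocument so that it runs without any extra files; this is not the simple_html_dom API. The function name extractHeaderCells() and the $sample string are assumptions for illustration only.

```php
<?php
// Hypothetical helper: return the trimmed text of every <th> in an HTML string.
function extractHeaderCells(string $html): array
{
    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $cells = array();
    foreach ($doc->getElementsByTagName('th') as $th) {
        $cells[] = trim($th->textContent);
    }
    return $cells;
}

// Sample markup from the question, standing in for the decoded $response.
$sample = '<html> <body> <table class="table1"> <tbody>'
    . '<tr><th>***title</th> <th class=\'ltr\'>***99/2/24 12:10</th> </tr>'
    . '<tr><td colspan="2">***message text here<hr /></td></tr>'
    . '</tbody> </table> </body>';

$cells = extractHeaderCells($sample);

foreach ($cells as $cell) {
    echo $cell . "\n";
}
```

With simple_html_dom, the same loop would call find('th') on the object returned by str_get_html($response). Note that DOMDocument::loadHTML() assumes ISO-8859-1 when the markup declares no charset, so a page with non-ASCII text may need a charset hint before loading.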