How to parse the HTML DOM of a URL with PHP?

This is the URL that I want to parse: http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0

I use simple_html_dom.php but it can’t read the HTML because the HTML is encoded.

So I think I need to work from the live page's source. Is there any way I can parse this website?

The source code looks like this:

<html>
  <body>
    <table class="table1">
      <tbody>
        <tr>
          <th>***title</th>
          <th class='ltr'>***99/2/24 12:10</th>
        </tr>
        <tr><td colspan="2">***message text here<hr /></td></tr>
      </tbody>
    </table>
  </body>
</html>

My code:

<?php
 require_once('simple_html_dom.php');
 $url = "http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0";
 $html = file_get_html($url);
 // th elements have no src attribute, so print their text content instead
 foreach($html->find('th') as $element)
   echo $element->plaintext . '<br>';
?>

Answer

The issue, as you pointed out, is the encoding: the response is gzip-encoded. You can set the cURL option CURLOPT_ENCODING to work around that. The PHP cURL documentation describes it as follows:

The contents of the “Accept-Encoding: ” header. This enables decoding of the response. Supported encodings are “identity”, “deflate”, and “gzip”. If an empty string, “”, is set, a header containing all supported encoding types is sent.
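
To see why a parser cannot read the body before it is decoded, here is a minimal sketch that assumes PHP's zlib extension is available. The $sample markup is a made-up stand-in for the real page, and gzencode() only imitates what a server sends with Content-Encoding: gzip.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$sample = '<table class="table1"><tr><th>title</th></tr></table>';

// Imitate a response body sent with Content-Encoding: gzip.
$compressed = gzencode($sample);

// Gzip streams start with the magic bytes 0x1f 0x8b, so this is binary data, not HTML text.
$looksGzipped = strncmp($compressed, "\x1f\x8b", 2) === 0;

// Decoding restores the original markup, which an HTML parser can then read.
$decoded = gzdecode($compressed);

var_dump($looksGzipped);        // bool(true)
var_dump($decoded === $sample); // bool(true)
```

With CURLOPT_ENCODING set, cURL performs this decoding step itself before handing back the response.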

Use the following PHP cURL code to fetch the decoded response HTML:

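<?php

$curl = curl_init();

curl_setopt_array($curl, array(
  CURLOPT_URL => "http://www.tsetmc.com/Loader.aspx?ParTree=151313&Flow=0",
  CURLOPT_RETURNTRANSFER => true,   // return the body as a string instead of printing it
  CURLOPT_ENCODING => "gzip",       // send Accept-Encoding: gzip and decode the response
  CURLOPT_MAXREDIRS => 10,
  CURLOPT_TIMEOUT => 0,             // 0 means no time limit
  CURLOPT_FOLLOWLOCATION => true,
  CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
  CURLOPT_CUSTOMREQUEST => "GET",
));

$response = curl_exec($curl);

if ($response === false) {
  // curl_exec() returns false on failure; report the reason
  echo 'cURL error: ' . curl_error($curl);
} else {
  echo $response;
}

curl_close($curl);
?>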

Then you can pass the response HTML in $response to simple_html_dom (for example via its str_get_html() function, which parses a string rather than fetching a URL) to parse the DOM tree.
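
As a rough illustration of that parsing step, the sketch below uses PHP's bundled DOMDocument class rather than simple_html_dom, so that it runs without any extra files. With simple_html_dom the equivalent entry point would be str_get_html($response). The $response string here is sample markup modelled on the question, not the site's real output.

```php
<?php
// Stand-in for the decoded $response from the cURL call; sample markup only.
$response = '<html><body><table class="table1"><tbody>'
          . '<tr><th>title</th><th class="ltr">99/2/24 12:10</th></tr>'
          . '<tr><td colspan="2">message text here<hr /></td></tr>'
          . '</tbody></table></body></html>';

// Real pages are rarely valid HTML; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($response);
libxml_clear_errors();

// Gather the text of every <th> cell.
$headers = array();
foreach ($doc->getElementsByTagName('th') as $th) {
    $headers[] = trim($th->textContent);
}

echo implode("\n", $headers), "\n";
```

Note that loadHTML() falls back to ISO-8859-1 when the markup does not declare a charset, so a page with non-ASCII text may need its encoding declared before parsing.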

Here’s a working version of the code. http://phpfiddle.org/main/code/gb66-3kzq

User contributions licensed under: CC BY-SA