XPath does not retrieve some content

I'm a newbie trying to code a crawler to gather some stats from a forum.

Here is my code:

<?php

$ch = curl_init();
$timeout = 0; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, 'http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);


$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($file_contents);

$xpath = new DOMXPath($dom);
$posts = $xpath->query("//div[@class='who-post']/a");
$dates = $xpath->query("//div[@class='date-post']");
$contents = $xpath->query("//div[@class='message  text-enrichi-fmobile  text-crop-fmobile']/p");



$i = 0;
foreach ($posts as $post) {

    $nodes = $post->childNodes;

    foreach ($nodes as $node) {
    $value = trim($node->nodeValue);

      $tab[$i]['author'] = $value;
      $i++;


    }

}

$i = 0;

foreach ($dates as $date) {

    $nodes = $date->childNodes;
    foreach ($nodes as $node) {
      $value = trim($node->nodeValue);

      $tab[$i]['date'] = $value;
      $i++;
    }

}

$i = 0;

foreach ($contents as $content) {

    $nodes = $content->childNodes;
    foreach ($nodes as $node) {
      $value = $node->nodeValue;

      echo $value;

        $tab[$i]['content'] = trim($value);
        $i++;


    }

}

?>
<h1>Participants</h1>
<pre>
<?php 
print_r($tab);
?>
</pre>

As you can see, the code does not retrieve some of the content. For example, I'm trying to retrieve the content from: http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm

The second post is a picture, and my code does not work for it.

Also, I suspect I made some mistakes; I find my code ugly.

Can you help me, please?

Answer

You could simply select the posts first, then extract each piece of data separately using:

DOMXPath::evaluate combined with normalize-space to retrieve plain text,
DOMXPath::query combined with DOMDocument::saveHTML to retrieve the message paragraphs as HTML.

Code:

$xpath = new DOMXPath($dom);
$postsElements = $xpath->query('//*[@class="post"]');

$posts = [];
foreach ($postsElements as $postElement) {
  $author = $xpath->evaluate('normalize-space(.//*[@class="who-post"])', $postElement);
  $date = $xpath->evaluate('normalize-space(.//*[@class="date-post"])', $postElement);

  $message = '';
  foreach ($xpath->query('.//*[contains(@class, "message")]/p', $postElement) as $messageParagraphElement) {
    $message .= $dom->saveHTML($messageParagraphElement);
  }

  $posts[] = (object)compact('author', 'date', 'message');
}

print_r($posts);
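
To try the approach without hitting the forum, here is a minimal self-contained sketch. The markup in `$html` is a made-up sample that only mimics the class names used above; the real page may be structured differently. It also shows one likely reason a picture-only post yields nothing when you read text values: the paragraph has no text content, while `saveHTML` still returns its `<img>` markup.

```php
<?php
// Hypothetical sample markup (an assumption, not the real forum page).
$html = <<<HTML
<div class="post">
  <div class="who-post"><a href="#">  alice </a></div>
  <div class="date-post"> 1 Jan 2020 </div>
  <div class="message text-enrichi-fmobile"><p>Hello <b>world</b></p></div>
</div>
<div class="post">
  <div class="who-post"><a href="#">bob</a></div>
  <div class="date-post">2 Jan 2020</div>
  <div class="message text-enrichi-fmobile"><p><img src="pic.png" alt="pic"></p></div>
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true); // silence warnings about imperfect HTML
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$posts = [];
foreach ($xpath->query('//*[@class="post"]') as $postElement) {
    $message = ''; // serialized HTML of the message paragraphs
    $text = '';    // plain text of the same paragraphs, for comparison
    foreach ($xpath->query('.//*[contains(@class, "message")]/p', $postElement) as $p) {
        $message .= $dom->saveHTML($p);
        $text .= trim($p->textContent);
    }

    $posts[] = [
        // normalize-space() trims and collapses whitespace inside XPath itself
        'author'  => $xpath->evaluate('normalize-space(.//*[@class="who-post"])', $postElement),
        'date'    => $xpath->evaluate('normalize-space(.//*[@class="date-post"])', $postElement),
        'message' => $message,
        'text'    => $text,
    ];
}

print_r($posts);
```

In this sample the second post has an empty `text` entry but its `message` entry still contains the `<img>` element, which is why serializing the paragraph is the safer choice for posts that may contain pictures.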

Unrelated note: scraping a website’s HTML is not illegal in itself, but you should refrain from displaying their data on your own app/website without their consent. Also, this might break at any time if they decide to change their HTML structure or CSS class names.
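
On the fragility point, part of the risk comes from how the class attribute is tested. As a rough sketch (the `hasClass` helper and the sample markup are hypothetical, introduced only for illustration), a common XPath 1.0 idiom matches a single class token instead of the whole attribute string, so the query keeps working if the site adds extra classes to an element:

```php
<?php
// Hypothetical helper: builds an XPath predicate that matches one class token.
function hasClass(string $class): string
{
    return sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
}

$dom = new DOMDocument;
libxml_use_internal_errors(true);
// Made-up sample markup, not taken from the forum.
$dom->loadHTML('<div class="post highlighted">a</div><div class="post">b</div><div class="repost">c</div>');
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact attribute match: misses "post highlighted".
$exact = $xpath->query('//*[@class="post"]')->length;

// Token match: finds both elements carrying the "post" class.
$token = $xpath->query('//*[' . hasClass('post') . ']')->length;

// Plain substring match: also catches the unrelated "repost".
$substring = $xpath->query('//*[contains(@class, "post")]')->length;

printf("exact=%d token=%d substring=%d\n", $exact, $token, $substring);
// prints "exact=1 token=2 substring=3"
```

This does not protect against class names being renamed or the structure changing; it only avoids breaking when additional classes appear on the same element.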
