PHP Goutte Web Scraping

I want to scrape this:

<a class="pdt_title"> 
  Japan Sun Apple - Fuji
  <span class="pdt_Tweight">2 per pack</span>
</a>

This is my code:

use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://www.fairprice.com.sg/searchterm/apple');
foreach ($crawler->filter('a.pdt_title') as $node) {
    print $node->nodeValue . "\n";
}

I only want to scrape the text inside the “a” tag, without the text inside the “span” tag. How can I get only the text inside the “a” tag?

Answer

Looking at the HTML markup, the text you want is the first child of the anchor. Since each $node is an instance of DOMElement, you can use ->firstChild to reach that text node, then read its ->nodeValue:

foreach ($crawler->filter('a.pdt_title') as $node) {
    echo $node->firstChild->nodeValue . "\n";
}
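
To try the ->firstChild approach without requesting the live page, here is a minimal sketch that uses PHP's built-in DOMDocument on the markup from the question. It is not Goutte itself, and the variable names ($html, $doc, $titles) are only illustrative. Because the text node keeps the line breaks and indentation around it, the sketch also trims the value:

```php
<?php
// Sample markup copied from the question; no network request is made.
$html = '<a class="pdt_title">
  Japan Sun Apple - Fuji
  <span class="pdt_Tweight">2 per pack</span>
</a>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$titles = [];
foreach ($doc->getElementsByTagName('a') as $node) {
    // firstChild is the text node that comes before the <span>.
    // trim() removes the surrounding newlines and indentation.
    $titles[] = trim($node->firstChild->nodeValue);
}

print_r($titles);
```

With Goutte, the same trim() call can be applied to $node->firstChild->nodeValue inside the loop above.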

Another alternative is to use XPath via ->filterXpath(), which is covered in the docs:

foreach ($crawler->filterXpath('//a[@class="pdt_title"]/text()') as $text) {
    echo $text->nodeValue, "\n";
}

Related docs:

https://symfony.com/doc/current/components/dom_crawler.html

The XPath query targets the anchor with that class and then selects its text nodes.

Or, as a one-liner that returns an array of the extracted texts:

$output = $crawler->filterXpath('//a[@class="pdt_title"]/text()')->extract(array('_text'));
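
One thing to watch with text(): it selects every direct text child of the anchor, so the result may also include whitespace-only nodes that sit between tags. The following is a rough sketch of the same query run through PHP's built-in DOMXPath on the markup from the question (again not Goutte; $html, $doc, $xpath and $texts are illustrative names), trimming each value and skipping the empty ones:

```php
<?php
// Sample markup copied from the question; no network request is made.
$html = '<a class="pdt_title">
  Japan Sun Apple - Fuji
  <span class="pdt_Tweight">2 per pack</span>
</a>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$texts = [];
foreach ($xpath->query('//a[@class="pdt_title"]/text()') as $textNode) {
    // Trim each text node and keep only the ones that still have content.
    $value = trim($textNode->nodeValue);
    if ($value !== '') {
        $texts[] = $value;
    }
}

print_r($texts);
```

The same trim-and-filter step can be applied to the $output array returned by ->extract().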

Related DOM Docs:

http://php.net/manual/en/class.domelement.php
http://php.net/manual/en/class.domnode.php

User contributions licensed under: CC BY-SA