I want to scrape this:
<a class="pdt_title"> Japan Sun Apple - Fuji <span class="pdt_Tweight">2 per pack</span> </a>
This is my code:
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://www.fairprice.com.sg/searchterm/apple');

foreach ($crawler->filter('a.pdt_title') as $node) {
    print $node->nodeValue . "\n";
}

I only want the text directly inside the “a” tag, without the text inside the “span” tag. How can I get only the text inside the “a” tag?
Answer
Looking at the HTML markup, the text you want is the first child of the anchor. Since each $node is an instance of DOMElement, you can use ->firstChild to reach that text node and then read its ->nodeValue:

foreach ($crawler->filter('a.pdt_title') as $node) {
    echo $node->firstChild->nodeValue . "\n";
}
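To try the ->firstChild idea without hitting the site, here is a minimal sketch that uses PHP's built-in DOMDocument instead of Goutte. The markup is the snippet from the question; the trim() call is my own addition, on the assumption that you don't want the spaces around the title.

```php
<?php
// Sample markup copied from the question; no network request is made.
$html = '<a class="pdt_title"> Japan Sun Apple - Fuji <span class="pdt_Tweight">2 per pack</span> </a>';

$doc = new DOMDocument();
$doc->loadHTML($html);

foreach ($doc->getElementsByTagName('a') as $node) {
    // firstChild is the text node that precedes the <span>.
    // trim() is an assumption: it strips the surrounding whitespace.
    $title = trim($node->firstChild->nodeValue);
    echo $title . "\n";
}
```

Note that this relies on the text coming before the span. If the markup ever starts with another element, ->firstChild would no longer be the text node.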
Another option is XPath, via ->filterXpath(), which is covered in the docs:

foreach ($crawler->filterXpath('//a[@class="pdt_title"]/text()') as $text) {
    echo $text->nodeValue, "\n";
}
Related docs:
https://symfony.com/doc/current/components/dom_crawler.html
The XPath query selects each anchor whose class attribute is exactly pdt_title and then its direct text-node children, so the text inside the span is not included.
Or, as a one-liner that extracts the texts into an array:
$output = $crawler->filterXpath('//a[@class="pdt_title"]/text()')->extract(array('_text'));
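One thing to watch for with text(): it matches every text-node child of the anchor, so the whitespace between the closing span and the closing anchor may come back as an extra, blank entry. A rough sketch of one way around that, again with the built-in DOMXPath rather than Goutte, is to add a [normalize-space()] predicate. The predicate and the trim() call are my additions, not part of the original query.

```php
<?php
// Sample markup copied from the question; no network request is made.
$html = '<a class="pdt_title"> Japan Sun Apple - Fuji <span class="pdt_Tweight">2 per pack</span> </a>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// [normalize-space()] keeps only text nodes that contain non-whitespace characters.
$nodes = $xpath->query('//a[@class="pdt_title"]/text()[normalize-space()]');

$titles = [];
foreach ($nodes as $text) {
    // trim() removes the leading and trailing spaces around the title.
    $titles[] = trim($text->nodeValue);
}

print_r($titles);
```

The same XPath expression should also be usable as the argument to ->filterXpath() in the crawler examples, though I have only sketched it against plain DOMXPath here.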
Related DOM Docs:
http://php.net/manual/en/class.domelement.php
http://php.net/manual/en/class.domnode.php