Using php-spider, is there a standard XPath that might discover the URIs on most web sites?

I am using the wonderful script entitled php-spider with the goal of scraping the Title, Desc, H1, H2, H3, and H4 from a few web sites. As part of configuring the script, it is necessary to set an XPathExpressionDiscoverer to instruct the script how to find additional hyperlinks on each page for crawling. I assume this refers to the standard XPath query language.

My goal is to find an XPathExpressionDiscoverer that will generally work for most web sites (rather than requiring me to customize it for each site).

Here is what I have tried:

I noticed the example provided by the author uses a very specific XPathExpressionDiscoverer to crawl the given example site:

// The URI we want to start crawling with
$seed = 'http://dmoztools.net/Computers/Internet/';

// We add a URI discoverer. Without it, the spider wouldn't get past the seed resource.
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//*[@id='cat-list-content-2']/div/a"));

Since my goal is simply to discover any hyperlinks on the page, I tried expanding the XPath to something more general ("//a") as shown below:

// We add a URI discoverer. Without it, the spider wouldn't get past the seed resource.
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//a"));

While this new XPath successfully crawls the example site (dmoztools.net), it does not seem to work for other sites I try (below). It simply crawls the seed page but fails to discover or crawl additional URIs on the page (even though they contain `<a href>` elements that should match the XPath).

Example A: https://www.petco.com/shop/en/petcostore/category/fish

Example B: https://www.thetruthaboutcars.com/

Do you happen to see where I am going wrong? Thank you!


Answer

The example code contains this line:

$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('http')));

That should be:

$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('http', 'https')));

Note the addition of https as an allowed scheme. Without it, only URLs with the http scheme are followed, and the web sites you give as examples are served over https.
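Putting it together, a minimal configuration might look like the sketch below. It assumes php-spider is installed via Composer; the class paths follow the library's README but may differ between versions, and the `AllowedHostsFilter` line is an optional extra to keep the crawl on the seed's host:

```php
<?php
require 'vendor/autoload.php';

use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use VDB\Spider\Filter\Prefetch\AllowedSchemeFilter;
use VDB\Spider\Filter\Prefetch\AllowedHostsFilter;

// The URI we want to start crawling with
$seed = 'https://www.thetruthaboutcars.com/';
$spider = new Spider($seed);

// Generic discoverer: follow every <a> element on the page
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer('//a'));

// Allow BOTH http and https; with only 'http', all https links are
// silently filtered out and the crawl stops at the seed page
$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('http', 'https')));

// Optional: stay on the seed's host instead of following external links
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter(array($seed)));

$spider->crawl();
```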

BTW, when I tested this, I discovered a bug where URLs without a path and without a trailing slash would sometimes cause a failure. I have added a fix for that bug in version 0.4.4; please upgrade.

User contributions licensed under: CC BY-SA