Using php-spider, is there a standard XPath expression that can discover the URIs on most websites?

I am using the wonderful php-spider script to scrape the title, description, and H1 through H4 headings from a few websites. As part of configuring the script, I need to set an ‘XpathExpressionDiscoverer’ that tells the script how to find additional hyperlinks on each page to crawl. I assume this refers to the standard XPath query language.

My goal is to find a single XPath expression for the XpathExpressionDiscoverer that works for most websites, rather than having to customize it for each site.

Here is what I have tried:

I noticed the example provided by the author uses a very specific XpathExpressionDiscoverer to crawl the given example site:

JavaScript

Since my goal is simply to discover any hyperlinks on the page, I tried expanding the XPath to something more general (“//a”) as shown below:

JavaScript

While this new XPath successfully crawls the example site (dmoztools.net), it does not work for the other sites I have tried (below). The script crawls the seed page but fails to discover or crawl any additional URIs on it, even though the page contains `<a href>` tags that should match the XPath.
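To make concrete what a general expression such as “//a” selects, here is a minimal sketch in Python using only the standard library. It is not php-spider code: the sample markup, the `discover_links` helper, and the `example.com` base URL are all assumptions made for illustration, and the sample is deliberately well-formed because `xml.etree.ElementTree` cannot parse typical real-world HTML.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample page (real HTML usually needs a lenient parser).
SAMPLE_PAGE = """<html><body>
  <div id="nav"><a href="/shop/fish">Fish</a></div>
  <p><a href="https://example.com/about">About</a> <a name="top">no href</a></p>
</body></html>"""


def discover_links(markup, base_url):
    """Return absolute URLs for every <a> element that carries an href."""
    root = ET.fromstring(markup)
    # ".//a[@href]" is ElementTree's spelling of the XPath "//a[@href]":
    # any <a> at any depth, restricted to those with an href attribute.
    anchors = root.findall(".//a[@href]")
    # Resolve relative hrefs against the page URL so they can be crawled.
    return [urljoin(base_url, a.get("href")) for a in anchors]


links = discover_links(SAMPLE_PAGE, "https://example.com/")
for link in links:
    print(link)
```

The expression matches anchors wherever they sit in the page, so nothing about it is tied to one site's layout. Adding the `[@href]` predicate simply skips anchors that have no link target.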

Example A: https://www.petco.com/shop/en/petcostore/category/fish

Example B: https://www.thetruthaboutcars.com/

Do you happen to see where I am going wrong? Thank you!

Answer

The example code contains this line:

JavaScript

That should be:

JavaScript

Note the addition of https as an allowed scheme. Without it, only URLs with the http scheme are allowed, and both websites you give as examples use https.
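The effect of an allowed-scheme filter can be sketched as follows. This is an illustration in Python, not php-spider's actual filter: `filter_by_scheme` is a hypothetical helper, and the `http://dmoztools.net/` URL assumes that the example site is served over plain http.

```python
from urllib.parse import urlsplit


def filter_by_scheme(urls, allowed_schemes):
    """Keep only the URLs whose scheme is in allowed_schemes (hypothetical helper)."""
    allowed = {scheme.lower() for scheme in allowed_schemes}
    return [url for url in urls if urlsplit(url).scheme in allowed]


discovered = [
    "http://dmoztools.net/",  # assumed to be http for this illustration
    "https://www.petco.com/shop/en/petcostore/category/fish",
    "https://www.thetruthaboutcars.com/",
]

# With only "http" allowed, every https link is dropped before it is crawled.
http_only = filter_by_scheme(discovered, ["http"])

# Allowing both schemes keeps all three links.
http_and_https = filter_by_scheme(discovered, ["http", "https"])

print(http_only)
print(http_and_https)
```

This mirrors the symptom in the question: link discovery succeeds, but a filter applied afterwards discards every https URL, so only the seed page is crawled.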

BTW, while testing this I found a bug: URLs with no path and no trailing slash would sometimes cause a failure. I have fixed that bug in version 0.4.4. Please upgrade.
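For readers who cannot upgrade right away, one common workaround for this class of problem is to give path-less seed URLs an explicit "/" before handing them to the crawler. The sketch below shows the idea in Python; `ensure_path` is a hypothetical helper and is not the fix that went into version 0.4.4.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_path(url):
    """Give a URL that has a host but no path an explicit "/" path."""
    parts = urlsplit(url)
    if parts.netloc and parts.path == "":
        # SplitResult is a namedtuple, so _replace returns a modified copy.
        parts = parts._replace(path="/")
    return urlunsplit(parts)


normalized = ensure_path("https://www.thetruthaboutcars.com")
unchanged = ensure_path("https://www.petco.com/shop/en/petcostore/category/fish")
print(normalized)
print(unchanged)
```

URLs that already have a path pass through untouched, so the helper can be applied to every seed URL.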

User contributions licensed under: CC BY-SA