How to get specific content from a webpage using cURL in PHP?

I want to get the author and co-author names of the most recent publication from an author's Google Scholar page. I am trying to do this with cURL in PHP, but the div that holds the names has no ID and shares its class name with several other divs, so I cannot single it out when scraping.

Answer

You can parse the HTML with DOMDocument and DOMXPath, then iterate over the articles with an XPath query.
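A minimal sketch of that first step, run against a small inline HTML sample instead of the live page. The class names used here (`gsc_a_tr` for an article row, `gsc_a_at` for the title link, `gs_gray` for the author line) are assumptions about the profile page markup and may not match what the site serves today, so check them against the actual page source.

```php
<?php
// Inline sample standing in for the fetched page. The class names
// (gsc_a_tr, gsc_a_at, gs_gray) are assumptions about the page markup.
$html = <<<'HTML'
<table>
  <tr class="gsc_a_tr">
    <td><a class="gsc_a_at">First paper</a>
        <div class="gs_gray">A Author, B Coauthor</div>
        <div class="gs_gray">Some Journal, 2020</div></td>
  </tr>
  <tr class="gsc_a_tr">
    <td><a class="gsc_a_at">Second paper</a>
        <div class="gs_gray">A Author, C Coauthor</div>
        <div class="gs_gray">Other Journal, 2018</div></td>
  </tr>
</table>
HTML;

$doc = new DOMDocument();
// @ suppresses the warnings that real-world, non-well-formed HTML triggers.
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select every row whose class attribute contains the token gsc_a_tr.
$articles = $xpath->query(
    '//tr[contains(concat(" ", normalize-space(@class), " "), " gsc_a_tr ")]'
);

// Collect the title of each article row.
$titles = [];
foreach ($articles as $article) {
    $titles[] = trim($xpath->query('.//a', $article)->item(0)->textContent);
}
print_r($titles);
```

Matching on a class token with `contains(concat(...))` instead of `@class="gsc_a_tr"` keeps the query working if the row carries additional classes.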

Once you have an article, you can get its authors with a second XPath query that uses the article as the reference (context) node.
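A sketch of a relative query, again on inline sample HTML. The markup and the class names `gsc_a_tr` and `gs_gray` are assumptions, as is the idea that the first `gs_gray` div in a row holds the comma-separated author list.

```php
<?php
// Sample row; the structure and class names are assumed, not confirmed.
$html = '<table>'
      . '<tr class="gsc_a_tr"><td><a class="gsc_a_at">First paper</a>'
      . '<div class="gs_gray">A Author, B Coauthor</div>'
      . '<div class="gs_gray">Some Journal, 2020</div></td></tr>'
      . '</table>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Take the first article row as the reference node.
$article = $xpath->query('//tr[@class="gsc_a_tr"]')->item(0);

// The leading "./" makes the query relative to $article, which is passed
// as the context node (second argument). [1] keeps the first gs_gray div.
$authorsNode = $xpath->query('./td/div[@class="gs_gray"][1]', $article)->item(0);

// Split the comma-separated list into individual names.
$authors = array_map('trim', explode(',', $authorsNode->textContent));
print_r($authors);
```

Without the context node, or with a query that starts with `//`, the expression is evaluated against the whole document and returns the author lines of every article instead of just this one.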

As an optimization note, curl with CURLOPT_ENCODING is usually faster than file_get_contents(). curl can request and decode gzip-compressed responses, which file_get_contents() does not do on its own. curl also stops reading once it has received the number of bytes announced in the Content-Length header, while file_get_contents() keeps reading until the remote server closes the socket, which on some websites makes it significantly slower than curl.
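One way the fetch could look with compression enabled. `fetch_html` is a hypothetical helper name, and the timeout and redirect settings are illustrative choices, not part of the original answer.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the response
// body, or null if the transfer fails.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_ENCODING       => '',   // empty string: offer every encoding this cURL build supports
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,   // seconds; an arbitrary choice
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

The returned string can be passed straight to `DOMDocument::loadHTML()`. cURL decodes the compressed response before returning it, so no manual decompression is needed.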

User contributions licensed under: CC BY-SA