I'm trying to scrape this page: http://www.asx.com.au/asx/statistics/todayAnns.do
My code doesn't seem to get the page's full HTML, and it behaves very strangely.
I've also tried Simple HTML DOM, but nothing works.
$base = "http://www.asx.com.au/asx/statistics/todayAnns.do";

$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$str = curl_exec($curl);
curl_close($curl);

echo htmlspecialchars($str);
This outputs mostly JavaScript, and I can't get the actual page content. My goal is to scrape the table in the middle of that page.
Answer
If you don’t need the most recent data then you can use the cached version of the page from Google.
<?php

use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');

date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler(
    'http://webcache.googleusercontent.com/search?q=cache:http://www.asx.com.au/asx/statistics/todayAnns.do&num=1&strip=0&vwsrc=0'
);

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//div[@class="page"]//table');
$configuration->setRowXPath('.//tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Headline',
                'xpath' => './/td[3]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Published',
                'xpath' => './/td[1]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Pages',
                'xpath' => './/td[4]',
            ]
        ),
        new \Scraper\Structure\AnchorField(
            [
                'name'               => 'Link',
                'xpath'              => './/td[5]/a',
                'convertRelativeUrl' => false,
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Code',
                'xpath' => './/text()',
            ]
        ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);
I was able to get the following data using the above code.
Array
(
    [0] => Array
        (
            [Code] => ASX
            [hash] => 6e16c02b10a10baf739c2613bc87f906
        )

    [1] => Array
        (
            [Headline] => Initial Director's Interest Notice
            [Published] => 10:57 AM
            [Pages] => 1
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868833
            [Code] => STO
            [hash] => aa2ea9b1b9b0bc843a4ac41e647319b4
        )

    [2] => Array
        (
            [Headline] => Becoming a substantial holder
            [Published] => 10:53 AM
            [Pages] => 2
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868832
            [Code] => AKG
            [hash] => f8ff8dfde597a0fc68284b8957f38758
        )

    [3] => Array
        (
            [Headline] => LBT Investor Conference Call Business Update
            [Published] => 10:53 AM
            [Pages] => 9
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868831
            [Code] => LBT
            [hash] => cc78f327f2b421f46036de0fce270a6d
        )
    ...
Disclaimer: I used the https://github.com/rajanrx/php-scrape framework, and I am an author of that library. You can also grab the data with plain cURL and the XPath expressions listed above. I hope this helps 🙂
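For anyone who would rather not pull in a library, here is a minimal sketch of the plain-PHP route, using the built-in DOM extension and the same XPath expressions as the configuration above. The function name `extractAnnouncements` and the shape of the returned array are my own assumptions for illustration, not part of php-scrape. It also leaves out the `Code` field, because that one relies on the library's handling of `.//text()`.

```php
<?php
// Sketch only: extractAnnouncements() is a hypothetical helper, not a library API.
// Column positions (td[1], td[3], td[4], td[5]/a) are taken from the answer above
// and assume the page layout has not changed.

function extractAnnouncements(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse (also avoids loadHTML() failing on empty input)
    }

    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them;
    // real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows  = [];

    // Same target and row selection as setTargetXPath() / setRowXPath() above.
    foreach ($xpath->query('//div[@class="page"]//table//tr') as $tr) {
        $cells = $xpath->query('./td', $tr);
        if ($cells->length < 5) {
            continue; // skip header rows or rows without the expected columns
        }

        $link = $xpath->query('./td[5]/a', $tr)->item(0);

        $rows[] = [
            'Published' => trim($cells->item(0)->textContent),
            'Headline'  => trim($cells->item(2)->textContent),
            'Pages'     => trim($cells->item(3)->textContent),
            'Link'      => $link instanceof DOMElement
                ? $link->getAttribute('href')
                : null,
        ];
    }

    return $rows;
}
```

To use it, fetch the HTML with cURL as in the question (pointing `CURLOPT_URL` at the Google cache URL from the answer, for example) and call `print_r(extractAnnouncements($str));`. Whether the cache URL still serves the page is something you would need to check yourself.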