Scrape HTML Page that redirects to itself using Curl PHP

Question

I'm trying to scrape this page: http://www.asx.com.au/asx/statistics/todayAnns.do but my code can't get the page's full HTML, and it behaves very strangely. I've tried Simple HTML DOM, but nothing works. The response is mostly JavaScript, and I can't get the actual page content. My goal is to scrape the table in the middle of that page.

Accepted Answer

If you don't need the most recent data, you can use the cached version of the page from Google.

```php
<?php

use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler(
    'http://webcache.googleusercontent.com/search?q=cache:http://www.asx.com.au/asx/statistics/todayAnns.do&num=1&strip=0&vwsrc=0'
);

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//div[@class="page"]//table');
$configuration->setRowXPath('.//tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Headline',
                'xpath' => './/td[3]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Published',
                'xpath' => './/td[1]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Pages',
                'xpath' => './/td[4]',
            ]
        ),
        new \Scraper\Structure\AnchorField(
            [
                'name'               => 'Link',
                'xpath'              => './/td[5]/a',
                'convertRelativeUrl' => false,
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Code',
                'xpath' => './/text()',
            ]
        ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);
```

I was able to get the following data using the above code:
```
Array
(
    [0] => Array
        (
            [Code] => ASX
            [hash] => 6e16c02b10a10baf739c2613bc87f906
        )

    [1] => Array
        (
            [Headline] => Initial Director's Interest Notice
            [Published] => 10:57 AM
            [Pages] => 1
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868833
            [Code] => STO
            [hash] => aa2ea9b1b9b0bc843a4ac41e647319b4
        )

    [2] => Array
        (
            [Headline] => Becoming a substantial holder
            [Published] => 10:53 AM
            [Pages] => 2
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868832
            [Code] => AKG
            [hash] => f8ff8dfde597a0fc68284b8957f38758
        )

    [3] => Array
        (
            [Headline] => LBT Investor Conference Call Business Update
            [Published] => 10:53 AM
            [Pages] => 9
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868831
            [Code] => LBT
            [hash] => cc78f327f2b421f46036de0fce270a6d
        )
...
```

Disclaimer: I used the https://github.com/rajanrx/php-scrape framework, and I am an author of that library. You can also grab the data using plain cURL and the XPath expressions listed above. I hope this is helpful 🙂
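For readers who prefer not to pull in a library, the plain cURL route mentioned above might look roughly like the sketch below. It is not the library's implementation: the function names `fetchHtml` and `extractAnnouncements`, the user-agent string, and the redirect and timeout limits are all made up for illustration, and only the XPath expressions are taken from the configuration in the answer. The `Code` field is left out because its `.//text()` expression depends on how the library picks a node.

```php
<?php

/**
 * Download a page with cURL (hypothetical helper, not part of php-scrape).
 */
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,    // assumed limit, stops endless redirect loops
        CURLOPT_COOKIEFILE     => '',   // empty string turns on in-memory cookie handling
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-scraper)', // assumed value
        CURLOPT_TIMEOUT        => 30,   // assumed limit, in seconds
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    return $html;
}

/**
 * Apply the XPath expressions from the answer to an HTML string
 * (hypothetical helper, not part of php-scrape).
 */
function extractAnnouncements(string $html): array
{
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows  = [];

    // Target XPath and row XPath from the answer, combined into one query.
    foreach ($xpath->query('//div[@class="page"]//table//tr') as $tr) {
        // Skip rows without a third cell, such as header rows.
        if ($xpath->query('.//td[3]', $tr)->length === 0) {
            continue;
        }

        $rows[] = [
            'Headline'  => $xpath->evaluate('normalize-space(.//td[3])', $tr),
            'Published' => $xpath->evaluate('normalize-space(.//td[1])', $tr),
            'Pages'     => $xpath->evaluate('normalize-space(.//td[4])', $tr),
            'Link'      => $xpath->evaluate('string(.//td[5]/a/@href)', $tr),
        ];
    }

    return $rows;
}
```

Passing the cache URL from the answer to `fetchHtml()` and feeding the result to `extractAnnouncements()` should give rows shaped like the output above, minus the `Code` and `hash` keys. Keeping the download and the parsing in separate functions means the XPath logic can be checked against a saved copy of the page without making a request.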
