Scrape HTML Page that redirects to itself using Curl PHP

So I’m trying to scrape this page: http://www.asx.com.au/asx/statistics/todayAnns.do

My code can’t seem to get the page’s full HTML, and it behaves very weirdly.

I’ve also tried Simple HTML DOM, but nothing works.

    $base = "http://www.asx.com.au/asx/statistics/todayAnns.do";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_URL, $base);
    curl_setopt($curl, CURLOPT_REFERER, $base);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $str = curl_exec($curl);
    curl_close($curl);
    echo htmlspecialchars($str);

This outputs mostly JavaScript, and I can’t get the actual page content. My goal is to scrape the table in the middle of that page.


Answer

If you don’t need the most recent data, you can use Google’s cached version of the page.

<?php

use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler(
    'http://webcache.googleusercontent.com/search?q=cache:http://www.asx.com.au/asx/statistics/todayAnns.do&num=1&strip=0&vwsrc=0'
);

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//div[@class="page"]//table');
$configuration->setRowXPath('.//tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Headline',
                'xpath' => './/td[3]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Published',
                'xpath' => './/td[1]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Pages',
                'xpath' => './/td[4]',
            ]
        ),
        new \Scraper\Structure\AnchorField(
            [
                'name'               => 'Link',
                'xpath'              => './/td[5]/a',
                'convertRelativeUrl' => false,
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Code',
                'xpath' => './/text()',
            ]
        ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);

I was able to get the following data using the code above.

Array
(
    [0] => Array
        (
            [Code] => ASX
            [hash] => 6e16c02b10a10baf739c2613bc87f906
        )

    [1] => Array
        (
            [Headline] => Initial Director's Interest Notice
            [Published] => 10:57 AM
            [Pages] => 1
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868833
            [Code] => STO
            [hash] => aa2ea9b1b9b0bc843a4ac41e647319b4
        )

    [2] => Array
        (
            [Headline] => Becoming a substantial holder
            [Published] => 10:53 AM
            [Pages] => 2
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868832
            [Code] => AKG
            [hash] => f8ff8dfde597a0fc68284b8957f38758
        )

    [3] => Array
        (
            [Headline] => LBT Investor Conference Call Business Update
            [Published] => 10:53 AM
            [Pages] => 9
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868831
            [Code] => LBT
            [hash] => cc78f327f2b421f46036de0fce270a6d
        )

...

Disclaimer: I used the https://github.com/rajanrx/php-scrape framework, and I am an author of that library. You can also grab the data with plain cURL, using the XPath expressions listed above. I hope this is helpful 🙂
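
For the plain cURL route, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath with the same XPath expressions as above. The function names `fetchHtml` and `extractRows`, the user-agent string, and the redirect and cookie settings are my own assumptions, not something taken from the library. The `Code` field is left out, and the live page may still return a JavaScript challenge instead of the table, in which case the cached URL is the one to fetch.

```php
<?php

/**
 * Fetch a URL with cURL, following redirects and keeping cookies between hops.
 * Hypothetical helper: the option values below are assumptions, tune as needed.
 * Returns the response body, or null on failure.
 */
function fetchHtml(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow the redirect instead of stopping at it
        CURLOPT_MAXREDIRS      => 5,    // avoid looping forever on a self-redirect
        CURLOPT_COOKIEFILE     => '',   // empty string turns on in-memory cookie handling
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-scraper)',
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($curl); // the handle is released when the function returns

    return is_string($body) ? $body : null;
}

/**
 * Pull the announcement rows out of the HTML using the XPath expressions
 * from the configuration above (td[1], td[3], td[4], td[5]/a).
 */
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows  = [];

    foreach ($xpath->query('//div[@class="page"]//table//tr') as $tr) {
        // Return the trimmed text of the first node matching $expr inside this row.
        $text = function (string $expr) use ($xpath, $tr): string {
            $node = $xpath->query($expr, $tr)->item(0);
            return $node ? trim($node->textContent) : '';
        };

        $headline = $text('.//td[3]');
        if ($headline === '') {
            continue; // header rows and empty rows have no third <td>
        }

        $anchor = $xpath->query('.//td[5]/a', $tr)->item(0);

        $rows[] = [
            'Headline'  => $headline,
            'Published' => $text('.//td[1]'),
            'Pages'     => $text('.//td[4]'),
            'Link'      => $anchor instanceof DOMElement ? $anchor->getAttribute('href') : '',
        ];
    }

    return $rows;
}

// Usage (makes a network request, so it is left commented out):
// $html = fetchHtml('http://www.asx.com.au/asx/statistics/todayAnns.do');
// if ($html !== null) {
//     print_r(extractRows($html));
// }
```

Keeping the fetch and the parsing in separate functions means the XPath part can be checked against a saved copy of the page without hitting the site each time.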

User contributions licensed under: CC BY-SA