
How to crawl a page in PHP?

I get the error: “error code: 1020”. The page I’m trying to crawl for form data is: https://v2.gcchmc.org/medical-status-search/.

This is my code:

$initial = file_get_contents('https://v2.gcchmc.org/medical-status-search/');

$check = preg_replace('/.+?input type="hidden" name="csrfmiddlewaretoken" value="(.+?)".*/sim', '$1', $initial);

print $check;

Can you help me find what is wrong with this code?


Answer

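The site is protected by Cloudflare, and "error code: 1020" is Cloudflare's access-denied response. The check only passes in a client that has JavaScript enabled, so a plain HTTP request such as file_get_contents, or anything else run from the command line without a browser, is not going to work. You can, however, automate a real browser with Puppeteer, which is also available in PHP through PuPHPeteer. Note that you have to disable headless mode to make it work.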
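To confirm that Cloudflare, and not your regex, is what stops the request, you can inspect the status and body that come back. The following is only a minimal sketch: the helper names looksLikeCloudflareBlock and statusFromStatusLine are made up for illustration, and it assumes the block arrives with an error status and a body containing "error code: 10xx", as in the question.

```php
<?php

/**
 * Hypothetical helper: guess whether a response is a Cloudflare block page.
 * Assumption: the block is served with an error status (>= 400) and the body
 * contains "error code: 1xxx", like the "error code: 1020" in the question.
 */
function looksLikeCloudflareBlock(int $status, string $body): bool
{
    if ($status < 400) {
        return false;
    }
    return preg_match('/error code:\s*1\d{3}\b/i', $body) === 1;
}

/**
 * Hypothetical helper: pull the numeric status out of a status line such as
 * "HTTP/1.1 403 Forbidden". Returns 0 when the line cannot be parsed.
 */
function statusFromStatusLine(string $statusLine): int
{
    return preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m) === 1 ? (int) $m[1] : 0;
}
```

To use it with file_get_contents, pass a stream context created with stream_context_create(['http' => ['ignore_errors' => true]]) so the body of an error response is returned instead of false, then feed $http_response_header[0] to statusFromStatusLine and the body to looksLikeCloudflareBlock.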

Installation

composer require nesk/puphpeteer
npm install @nesk/puphpeteer

The script (test.php)

<?php

use Nesk\Puphpeteer\Puppeteer;

require_once __DIR__ . "/vendor/autoload.php";

// Extract the CSRF token from the hidden form field; returns null if it is not found.
function getToken($content)
{
    if (preg_match('/input type="hidden" name="csrfmiddlewaretoken" value="(.+?)"/i', $content, $matches)) {
        return $matches[1];
    }
    return null;
}

$puppeteer = new Puppeteer;
$browser = $puppeteer->launch(['headless'=>false]);

/**
 * @var \Nesk\Puphpeteer\Resources\Page $page
 */
$page = $browser->newPage();
$page->goto('https://v2.gcchmc.org/medical-status-search/');

var_dump(getToken($page->content()));

$browser->close();

You probably don't need the csrfmiddlewaretoken when running the script like this, since the browser handles the form for you, but you can take it further from here if you choose to use the token.
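If you do keep the token, note that the regex in getToken depends on the attributes appearing in exactly that order. A more tolerant option is to parse the HTML with PHP's bundled DOM extension. This is a sketch only: extractCsrfToken is a hypothetical name, and it assumes the DOM extension is enabled and that the field is still called csrfmiddlewaretoken.

```php
<?php

/**
 * Hypothetical alternative to getToken(): read the hidden field with the DOM
 * extension instead of a regex. Returns null when the field is absent.
 */
function extractCsrfToken(string $html): ?string
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select the value attribute of the input, whatever the attribute order.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//input[@name="csrfmiddlewaretoken"]/@value');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return $nodes->item(0)->nodeValue;
}
```

In the script above it would be called as extractCsrfToken($page->content()) in place of getToken($page->content()).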

User contributions licensed under: CC BY-SA