I want to scrape a website and am using Guzzle 7.4 and the Symfony DomCrawler component.
I successfully retrieved the HTML, but the website uses a CDN to host some resources, and they do not load because the Referer header is not sent with the requests for those resources.
Below is the code that retrieves the HTML:
<?php
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

// URL of the page to scrape
$url = 'scrapingdomain.com';
$headers = [
    'referer' => 'examplescrapingdomain.com'
];
$client = new GuzzleHttp\Client([
    'headers' => $headers
]);

// Fetch the page and read the response body as a string
$response = $client->request('GET', $url);
$html = (string) $response->getBody();
$crawler = new Crawler($html);
echo $html;
?>
If I request the CDN resource directly and set the Referer header, I get a 200 response.
The code for that is below:
<?php
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

// URL of the CDN-hosted resource
$url = 'examplecdnresource.com/Images.png';
$headers = [
    'referer' => 'examplescrapingdomain.com'
];
$client = new GuzzleHttp\Client([
    'headers' => $headers
]);

// Fetch the resource and read the response body as a string
$response = $client->request('GET', $url);
$html = (string) $response->getBody();
$crawler = new Crawler($html);
echo $html;
?>
I want to scrape scrapingdomain.com, fetch its resources, and download the CDN-hosted images it references.
Answer
All I needed to do to get the CDN-hosted content referenced in the scraped HTML was to download it with the file_get_contents function, using a stream context that sets the Referer header, rather than going through Guzzle, since what I was fetching were CSS and image files.
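A minimal sketch of that approach, assuming the CDN only checks the Referer header. The helper names makeRefererContext and downloadWithReferer are hypothetical and not part of PHP or Guzzle; only stream_context_create and file_get_contents are built-in functions.

```php
<?php
// Sketch only: the two helper names below are hypothetical.

/**
 * Build a stream context that sends a Referer header.
 * The 'http' context options are also used for https:// URLs.
 */
function makeRefererContext(string $referer)
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "Referer: {$referer}\r\n",
        ],
    ]);
}

/**
 * Download a single resource with the Referer header set.
 * Returns the body as a string, or false if the request fails.
 */
function downloadWithReferer(string $url, string $referer)
{
    return file_get_contents($url, false, makeRefererContext($referer));
}
```

With the placeholder domains from the question (and assuming they are served over https), a call could look like `$image = downloadWithReferer('https://examplecdnresource.com/Images.png', 'https://examplescrapingdomain.com');`, followed by `file_put_contents('Images.png', $image);` once `$image !== false` has been checked.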