
How to set the Referer header in Guzzle and get CDN content

I want to scrape a website using Guzzle 7.4 and the Symfony DomCrawler component.

I successfully retrieved the HTML, but the website uses a CDN to host some of its resources, and they are not loading because the Referer header is not sent with the requests for those resources.

Below is the code that retrieves the HTML:

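<?php

require "vendor/autoload.php";

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

// URL of the page to scrape (placeholder domain; an absolute URL with a scheme is expected)
$url = 'https://scrapingdomain.com';
$headers = [
    'Referer' => 'https://examplescrapingdomain.com'
];

// Headers set here are sent with every request made through this client
$client = new Client([
    'headers' => $headers
]);

// Fetch the page and read the response body as a string
$response = $client->request('GET', $url);
$html = (string) $response->getBody();
$crawler = new Crawler($html);

echo $html;

?>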

If I request the CDN resource directly and set the Referer header, I get a 200 response.

Code below:

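<?php

require "vendor/autoload.php";

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

// URL of the CDN-hosted resource (placeholder domain; an absolute URL with a scheme is expected)
$url = 'https://examplecdnresource.com/Images.png';
$headers = [
    'Referer' => 'https://examplescrapingdomain.com'
];

// Headers set here are sent with every request made through this client
$client = new Client([
    'headers' => $headers
]);

// Fetch the resource; for an image the body holds raw bytes, not HTML
$response = $client->request('GET', $url);
$html = (string) $response->getBody();
$crawler = new Crawler($html);

echo $html;

?>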

I want to fetch the page from scrapingdomain.com together with its resources, and download the CDN-hosted images it references.


Answer

All I needed to do to get the CDN-hosted content referenced in the scraped HTML was to use the file_get_contents function with a stream context that sets the Referer header, and download the data that way rather than through Guzzle, since I was fetching CSS and image files.
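
A minimal sketch of that approach, not the exact code I used. The helper names buildRefererContext and downloadWithReferer are made up for illustration, the domains are the placeholders from the question, and it assumes allow_url_fopen is enabled (plus the openssl extension for https URLs):

```php
<?php

// Hypothetical helper: builds a stream context that sends a Referer header
// with any http/https request made through it.
function buildRefererContext(string $referer)
{
    return stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "Referer: {$referer}\r\n",
            'timeout' => 10, // seconds; arbitrary value chosen for this sketch
        ],
    ]);
}

// Hypothetical helper: downloads one resource and saves it to a local file.
// Returns false if the download or the write fails.
function downloadWithReferer(string $resourceUrl, string $referer, string $target): bool
{
    $data = @file_get_contents($resourceUrl, false, buildRefererContext($referer));
    if ($data === false) {
        return false;
    }

    return file_put_contents($target, $data) !== false;
}

// Example call (placeholder URLs from the question):
// downloadWithReferer(
//     'https://examplecdnresource.com/Images.png',
//     'https://examplescrapingdomain.com',
//     'Images.png'
// );
```

The same context can be reused for every CSS or image URL found in the scraped HTML, so the Referer only has to be set in one place.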

User contributions licensed under: CC BY-SA