I am developing an application which crawls data from another website. That website is protected by a login, but I have an account there. My application should login to that website and return the content of the protected web page. I managed to get this to work in Python using the requests package.
Now I want to accomplish the same thing in PHP using cURL. So far I haven't been able to make it work, and I would like your help.
Before you can log in, the website requires a verification token. So you first have to obtain the token, and only then log in. Here is my (working!) Python code:
    import requests

    url = "https://www.mywebsite.com/login.php"
    headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36"}
    s = requests.Session()

    # Get token
    r1 = s.get(url, headers=headers)
    cacheToken = ExtractTokenFromText(r1.text)  # some function defined by me

    # Login
    data = {'username': 'myusername', 'password': 'mypassword', '__RequestVerificationToken': cacheToken}
    r2 = s.post(url, headers=headers, data=data)
    my_content = r2.text
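The `ExtractTokenFromText` helper is not shown in the question. The following is only a sketch of what such a helper might look like, assuming the token sits in a hidden `<input>` whose name is `__RequestVerificationToken` (the field name comes from the question; the `TokenParser` class, the `extract_token_from_text` function and the sample HTML are hypothetical). It uses only the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class TokenParser(HTMLParser):
    """Collects the value of the first <input> whose name matches."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "input" and self.token is None:
            attributes = dict(attrs)
            if attributes.get("name") == self.field_name:
                self.token = attributes.get("value")


def extract_token_from_text(html, field_name="__RequestVerificationToken"):
    """Return the hidden field's value, or None if the field is absent."""
    parser = TokenParser(field_name)
    parser.feed(html)
    return parser.token


# Hypothetical login form, standing in for r1.text.
sample = '<form><input type="hidden" name="__RequestVerificationToken" value="abc123" /></form>'
print(extract_token_from_text(sample))
```

Returning `None` when the field is missing lets the caller detect a changed login page instead of posting an empty token.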
Now I try to implement the same functionality in PHP using the cURL library. My PHP code is:
    $url = "https://www.mywebsite.com/login.php";
    $ch = curl_init();

    // Get token
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
    $r1 = curl_exec($ch);

    // Called by cURL once for every response header line
    function curlResponseHeaderCallback($ch, $headerLine) {
        global $cache_token;
        $cache_token = ExtractTokenFromHeader($headerLine); // some function defined by me
        return strlen($headerLine); // Needed by curl
    }

    // Login
    $post_data = array('username' => $myusername, 'password' => $mypassword, '__RequestVerificationToken' => $cache_token);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_data));
    $r2 = curl_exec($ch);
    $my_content = $r2;
The PHP file correctly receives the $cache_token, so the GET request is executed correctly. Unfortunately, the POST request is not working, because the PHP file gives the following error message:
“Your antiforgery token is invalid.” with an HTTP 400 Bad Request.
I tried many things to fix the problem, but none of them work:
- Adding a user agent: curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36");
- Adding curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); to the second request.
- Adding curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile); with one of the following three options:
    $cookiefile = getcwd() . '/cookie.txt';
    $cookiefile = __DIR__ . '/cookie.txt';
    $cookiefile = 'cookie.txt';
- Adding curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefile); (as above)
- Adding curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
- Enabling error reporting to find the problem
- Using their API directly (it only works if you pay them)
I would like to stress that the Python version works, but the PHP version doesn’t. (So I’m sure there are no missing or hidden parameters, captchas to handle, etc.)
My question is similar to this and this question, and to a lesser extent to this and this question, but their solutions either don’t work for me or there are no answers at all…
Answer
So, thanks to @Steven Penny and this great YouTube video I finally managed to get it working. Key differences:
- Completely different way of getting the token from the first GET: not using CURLOPT_HEADERFUNCTION, but parsing it directly from the response body in $r1
- Added curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile); for both calls
- Added curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); for the second call only
My final script:
    $url = "https://www.mywebsite.com/login.php";
    $cookiefile = 'cookie.txt';
    $ch = curl_init();

    // Get token
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);
    $r1 = curl_exec($ch);

    // Parse the token out of the login form in the response body
    $dom = new DOMDocument;
    $dom->loadHTML($r1);
    $tags = $dom->getElementsByTagName('input');
    $token = '';
    for ($i = 0; $i < $tags->length; $i++) {
        $grab = $tags->item($i);
        if ($grab->getAttribute('name') === '__RequestVerificationToken') {
            $token = $grab->getAttribute('value');
        }
    }

    // Login
    $post_data = array('username' => $myusername, 'password' => $mypassword, '__RequestVerificationToken' => $token);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_data));
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $r2 = curl_exec($ch);
    $my_content = $r2;
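A likely reason the cookie handling matters, assuming the site pairs the form token with a cookie set on the first GET (antiforgery schemes commonly work this way, but the real site's behaviour is not confirmed here): the POST is only accepted when both halves arrive together. The sketch below imitates that with a throwaway local server and Python's standard library. The `LoginHandler` class, the `login` function, and the token and cookie values are all hypothetical stand-ins, not the real site's.

```python
import http.cookiejar
import threading
import urllib.error
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

TOKEN = "form-token"        # hypothetical value embedded in the login form
COOKIE = "af=cookie-token"  # hypothetical cookie paired with that token


class LoginHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The GET hands out both halves: a cookie and a form token.
        body = TOKEN.encode()
        self.send_response(200)
        self.send_header("Set-Cookie", COOKIE)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Accept the login only if the token AND the cookie come back.
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        ok = (form.get("token") == [TOKEN]
              and COOKIE in self.headers.get("Cookie", ""))
        body = b"ok" if ok else b"invalid token"
        self.send_response(200 if ok else 400)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def login(url, with_cookies):
    """GET the token, POST it back, and return the POST's status code."""
    handlers = [urllib.request.ProxyHandler({})]  # ignore any system proxy
    if with_cookies:
        jar = http.cookiejar.CookieJar()
        handlers.append(urllib.request.HTTPCookieProcessor(jar))
    opener = urllib.request.build_opener(*handlers)
    with opener.open(url) as response:
        token = response.read().decode()
    data = urllib.parse.urlencode({"token": token}).encode()
    try:
        with opener.open(url, data=data) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


server = HTTPServer(("127.0.0.1", 0), LoginHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/login" % server.server_port

status_without = login(url, with_cookies=False)
status_with = login(url, with_cookies=True)
server.shutdown()
server.server_close()
print(status_without, status_with)
```

Under these assumptions the cookieless attempt is rejected with 400 while the attempt with a cookie jar gets 200, which mirrors the difference between the failing and working PHP scripts. `requests.Session` in the Python version keeps cookies automatically, which would explain why it worked without any extra setup.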