What is a correct, up-to-date example of using curl_multi? I use the code below, but it often fails to get the content (it returns an empty result), and I don't know how to retrieve the correct response/error:
```php
public function multi_curl($urls)
{
    $AllResults = [];
    $mch = curl_multi_init();
    $handlesArray = [];
    $curl_conn_timeout = 3 * 60;  // max 3 minutes
    $curl_max_timeout  = 30 * 60; // max 30 minutes
    foreach ($urls as $key => $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_HEADER, false);
        // timeouts: https://thisinterestsme.com/php-setting-curl-timeout/ and https://stackoverflow.com/a/15982505/2377343
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $curl_conn_timeout);
        curl_setopt($ch, CURLOPT_TIMEOUT, $curl_max_timeout);
        if (defined('CURLOPT_TCP_FASTOPEN')) {
            curl_setopt($ch, CURLOPT_TCP_FASTOPEN, 1);
        }
        curl_setopt($ch, CURLOPT_ENCODING, ""); // empty to autodetect | gzip,deflate
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_URL, $url);
        $handlesArray[$key] = $ch;
        curl_multi_add_handle($mch, $handlesArray[$key]);
    }

    // other approaches are deprecated! https://stackoverflow.com/questions/58971677/
    do {
        $execReturnValue = curl_multi_exec($mch, $runningHandlesAmount);
        usleep(100); // pause 100 microseconds to avoid spinning at full speed
    } while ($runningHandlesAmount > 0);

    // collect results
    foreach ($urls as $key => $url) {
        $AllResults[$key]['url'] = $url;
        $handle = $handlesArray[$key];
        // check for errors
        $curlError = curl_error($handle);
        if ($curlError != "") {
            $AllResults[$key]['error']    = $curlError;
            $AllResults[$key]['response'] = false;
        } else {
            $AllResults[$key]['error']    = false;
            $AllResults[$key]['response'] = curl_multi_getcontent($handle);
        }
        curl_multi_remove_handle($mch, $handle);
        curl_close($handle);
    }
    curl_multi_close($mch);
    return $AllResults;
}
```
and executing:
```php
$urls = [
    'https://baconipsum.com/api/?type=meat-and-filler',
    'https://baconipsum.com/api/?type=all-meat&paras=2',
];
$results = $helpers->multi_curl($urls);
```
Is there something that can be changed to get better results?
Update: I've found that this repository also mentions the lack of documentation about the best use case for multi-curl and provides its own approach. However, I'm asking this on SO to get other competent answers too.
Answer
> I use the below code
That code has several issues:
- it has NO connection cap: if you try to open 1 million URLs simultaneously, it will try to create 1 million TCP connections at once (many websites will block you for what looks like a TCP DDoS at around 100 simultaneous connections!)
- it doesn't even verify that it was able to create the curl easy handles (which it definitely won't be able to do if it gets too many URLs, see the first issue)
- it sleeps for 100 microseconds, which may be 100 microseconds longer than required; it should use curl_multi_select() to let the OS report exactly when data has arrived or been sent, instead of waiting a fixed 100 µs (see the sketch after this list)
- it doesn't check for transfer errors reported through the multi interface (the curl_multi_exec() return value and curl_multi_info_read() results are never inspected)
- (optimization nitpick) it doesn't fetch any worker's data until every single worker has finished; an optimized implementation would drain completed workers while the still-working workers keep transferring
- (optimization nitpick) it doesn't re-use curl handles
- (optimization nitpick) it doesn't remove completed workers from the multi list until every single worker has finished, which costs extra CPU on every curl_multi_exec() call (curl_multi_exec() has to iterate over the finished workers that are still in the list)
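For illustration, here is a minimal sketch (not the full fix) of the select-based waiting and eager draining the points above describe; it assumes `$mch` is a curl_multi handle that already has easy handles added to it:

```php
// Minimal sketch, assuming $mch is an already-populated curl_multi handle.
do {
    // let curl move data on every connection that is ready right now
    $status = curl_multi_exec($mch, $stillRunning);

    // wait (up to 1s) until the OS signals socket activity, instead of usleep()
    if ($stillRunning > 0 && curl_multi_select($mch, 1.0) === -1) {
        usleep(250); // rare select() failure: back off briefly instead of spinning
    }

    // drain transfers that have already finished, while others keep running
    while (false !== ($info = curl_multi_info_read($mch))) {
        if ($info['msg'] !== CURLMSG_DONE) {
            continue;
        }
        $ch = $info['handle'];
        if ($info['result'] === CURLE_OK) {
            $body = curl_multi_getcontent($ch); // success: the response body
        } else {
            $error = curl_strerror($info['result']); // per-transfer error
        }
        curl_multi_remove_handle($mch, $ch); // remove ASAP, don't wait for the rest
    }
} while ($stillRunning > 0 && $status === CURLM_OK);
```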
This implementation should be significantly faster: it has a configurable limit on max simultaneous connections, re-uses curl handles, removes completed workers as soon as possible, detects curl_multi errors, etc.:
```php
/**
 * fetch all urls in parallel,
 * warning: all urls must be unique..
 *
 * @param array $urls_unique
 *            urls to fetch
 * @param int $max_connections
 *            (optional, default 100) max simultaneous connections
 *            (some websites will auto-ban you for "ddosing" if you send too many requests simultaneously,
 *            and some wifi routers will get unstable on too many connections..)
 * @param array|null $additional_curlopts
 *            (optional) set additional curl options here, each curl handle will get these options
 * @throws RuntimeException on curl_multi errors
 * @throws RuntimeException on curl_init() / curl_setopt() errors
 * @return array(url=>response,url2=>response2,...)
 */
function curl_fetch_multi_2(array $urls_unique, int $max_connections = 100, ?array $additional_curlopts = null)
{
    // $urls_unique = array_unique($urls_unique);
    $ret = array();
    $mh = curl_multi_init();
    // $workers format: [handle_id] => url
    $workers = array();
    $max_connections = min($max_connections, count($urls_unique));
    $unemployed_workers = array();
    for ($i = 0; $i < $max_connections; ++$i) {
        $unemployed_worker = curl_init();
        if (! $unemployed_worker) {
            throw new RuntimeException("failed creating unemployed worker #" . $i);
        }
        $unemployed_workers[] = $unemployed_worker;
    }
    unset($i, $unemployed_worker);
    // curl handles are resources on PHP 7 and CurlHandle objects on PHP 8+,
    // so derive a stable array key that works on both
    $handle_id = function ($ch): int {
        return is_object($ch) ? spl_object_id($ch) : (int) $ch;
    };
    $work = function () use (&$workers, &$unemployed_workers, &$mh, &$ret, $handle_id): void {
        assert(count($workers) > 0, "work() called with 0 workers!!");
        $still_running = null;
        for (;;) {
            do {
                $err = curl_multi_exec($mh, $still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($err !== CURLM_OK) {
                $errinfo = [
                    "multi_exec_return" => $err,
                    "curl_multi_errno" => curl_multi_errno($mh),
                    "curl_multi_strerror" => curl_multi_strerror($err)
                ];
                $errstr = "curl_multi_exec error: " . str_replace(["\r", "\n"], "", var_export($errinfo, true));
                throw new RuntimeException($errstr);
            }
            if ($still_running < count($workers)) {
                // some workers have finished downloading, process them
                // echo "processing!";
                break;
            } else {
                // no workers finished yet, sleep-wait for workers to finish downloading.
                // echo "select()ing!";
                curl_multi_select($mh, 1);
                // sleep(1);
            }
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                // no idea what this is, it's not the message we're looking for though, ignore it.
                continue;
            }
            // $info['result'] is a per-transfer CURLE_* code, not a CURLM_* code
            if ($info['result'] !== CURLE_OK) {
                $errinfo = [
                    "effective_url" => curl_getinfo($info['handle'], CURLINFO_EFFECTIVE_URL),
                    "curl_errno" => curl_errno($info['handle']),
                    "curl_error" => curl_error($info['handle']),
                    "curl_multi_errno" => curl_multi_errno($mh),
                    "curl_multi_strerror" => curl_multi_strerror(curl_multi_errno($mh))
                ];
                $errstr = "curl_multi worker error: " . str_replace(["\r", "\n"], "", var_export($errinfo, true));
                throw new RuntimeException($errstr);
            }
            $ch = $info['handle'];
            $ch_index = $handle_id($ch);
            $url = $workers[$ch_index];
            $ret[$url] = curl_multi_getcontent($ch);
            unset($workers[$ch_index]);
            curl_multi_remove_handle($mh, $ch);
            $unemployed_workers[] = $ch;
        }
    };
    $opts = array(
        CURLOPT_URL => '',
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_ENCODING => ''
    );
    if (! empty($additional_curlopts)) {
        // i would have used array_merge(), but it does scary stuff with integer keys.. foreach() is easier to reason about
        foreach ($additional_curlopts as $key => $val) {
            $opts[$key] = $val;
        }
    }
    foreach ($urls_unique as $url) {
        while (empty($unemployed_workers)) {
            $work();
        }
        $new_worker = array_pop($unemployed_workers);
        $opts[CURLOPT_URL] = $url;
        if (! curl_setopt_array($new_worker, $opts)) {
            $errstr = "curl_setopt_array failed: " . curl_errno($new_worker) . ": " . curl_error($new_worker) . " " . var_export($opts, true);
            throw new RuntimeException($errstr);
        }
        $workers[$handle_id($new_worker)] = $url;
        curl_multi_add_handle($mh, $new_worker);
    }
    while (count($workers) > 0) {
        $work();
    }
    foreach ($unemployed_workers as $unemployed_worker) {
        curl_close($unemployed_worker);
    }
    curl_multi_close($mh);
    return $ret;
}
```
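Example usage with the URLs from the question (just an illustration; any list of unique URLs works):

```php
$urls = [
    'https://baconipsum.com/api/?type=meat-and-filler',
    'https://baconipsum.com/api/?type=all-meat&paras=2',
];
$responses = curl_fetch_multi_2($urls, 50); // cap at 50 simultaneous connections
foreach ($responses as $url => $body) {
    echo $url, ' => ', strlen($body), " bytes\n";
}
```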