php - Get specific content with cURL on all links in a page (like a spider)
I'm coding a little app that starts from a URL and collects the links on that specific page. Next, it goes to each link and scrapes its contents to show specific content (numbers with 10 or more characters). My code retrieves a blank page. What is wrong?
// fetch the start page and extract all <a href="..."> links
$url = 'http://xxx.xxx';
$original_file = file_get_contents($url);
$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
$links = $matches[1];
//print_r($links);

$count = count($links);
for ($i = 0; $i < $count; $i++) {
    // fetch each linked page with cURL
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, $links[$i]);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1');
    $query = curl_exec($curl_handle);
    curl_close($curl_handle);

    // collect numbers starting with 3 and print those with 10 or more digits
    preg_match_all('/\b3\d+/', $query, $matches2);
    $numbers = $matches2[0];
    $n = 0;
    foreach ($numbers as $value) {
        if (strlen((string)$value) >= 10)
            echo '<br><br>[' . $n++ . ']' . $value;
    }
}
Issue #1: The HTML can contain relative URLs, so the links you pick up may look like /home/test.php without the base http://www.example.com/. Before requesting them with cURL, print the extracted links to the screen or browser and check what they actually are. For example:
<a href="/home/test.php">link</a>
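A minimal sketch of how relative links could be resolved before the cURL requests, assuming $base holds the site root and the page does not use a <base> tag (both of these are assumptions, not part of your original code):

// hypothetical helper: prefix root-relative links with the base URL
$base = 'http://www.example.com';   // assumed site root
function make_absolute($link, $base) {
    // already absolute -> leave untouched
    if (preg_match('#^https?://#i', $link)) {
        return $link;
    }
    // root-relative link such as /home/test.php
    return rtrim($base, '/') . '/' . ltrim($link, '/');
}

foreach ($links as $i => $link) {
    $links[$i] = make_absolute($link, $base);
}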
Issue #2: A CURLOPT_CONNECTTIMEOUT of 2 seconds can prove too short for you. Try increasing the value:
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 10);
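It may also help to see why a request came back empty. A small sketch using the standard curl_error()/curl_errno() functions, assuming the handle is kept open until after the check:

$query = curl_exec($curl_handle);
if ($query === false) {
    // with CURLOPT_RETURNTRANSFER set, curl_exec() returns false on failure
    echo 'cURL error for ' . $links[$i] . ': ' . curl_error($curl_handle) . '<br>';
}
curl_close($curl_handle);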
If the problem still persists, please show a sample page link and a sample internal link that returns the blank response.