Python - Get Google search pages from specific dates
I'm trying to scrape Google for specific date ranges: the year 2002, 2004, and so on. I can't use pygoogle, xgoogle, or google search, since they have no option to specify the period I'm searching for. So I worked out the query URL that does that. However, when I run my script, Google sends me the same results no matter which search page I'm on.
This is the code:
import time
import urllib2
import re
import random

# Define the search term.
agent = 'pt+e+pmdb'

# Define the request headers.
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

# Counter incremented for each link found.
contador = 0
# Dictionary where the links are stored.
links2002 = {}
# Number of pages to search through.
npages = 50

# Start the routine.
for i in range(1, npages, 1):
    tempurl2002 = ('https://www.google.com/search?q=' + str(agent)
                   + '&hl=pt-br&biw=1137&bih=1354&sa=x&ei=er8ru8hteiqhkqeeuocicg&ved=0cboqpwuobjgu&source=lnt'
                   + '&tbs=cdr%3a1%2ccd_min%3a01%2f01%2f2002%2ccd_max%3a31%2f12%2f2002&tbm='
                   + '#filter=0&hl=pt-br&q=' + str(agent) + '&start=' + str(i * 10)
                   + '&tbs=cdr:1,cd_min:01/01/2002,cd_max:31/12/2002')
    # Build the request for this URL.
    req = urllib2.Request(tempurl2002, headers=hdr)
    # Run the search.
    searchresults = urllib2.urlopen(req)
    # Get the search data.
    page = searchresults.read()
    # Define a random pause.
    wt = random.uniform(10, 30)
    # Pause to keep Google from blocking the script.
    time.sleep(wt)
    # Get the links.
    links = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', page)
    # Store the results.
    for url in links:
        contador = contador + 1
        links2002[contador] = url
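One thing that may explain the repeated results: in the URL built in the loop, `start` and the second `tbs` appear only after a `#`. That part of a URL is the fragment, which an HTTP client normally does not send to the server as query parameters, so Google may never see the page offset. A minimal check, written for Python 3 (`urllib.parse` rather than `urllib2`) and using a shortened, illustrative URL of the same shape rather than the exact one above:

```python
from urllib.parse import urlsplit

# Shortened, hypothetical URL with the same shape as the one in the loop:
# the paging parameter `start` appears only after the '#'.
url = ("https://www.google.com/search?q=pt+e+pmdb&tbm="
       "#filter=0&q=pt+e+pmdb&start=20"
       "&tbs=cdr:1,cd_min:01/01/2002,cd_max:31/12/2002")

parts = urlsplit(url)

# The query is what the server receives as parameters.
print("query:   ", parts.query)
# The fragment stays on the client side.
print("fragment:", parts.fragment)
```

If `start` shows up only in the fragment, every iteration is effectively requesting the same first page.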
Does anyone know how to do this right? Is there a clever way to get Google search results for specific dates?
Best, Julio.
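A possible direction, offered as a sketch and not a confirmed fix: build the URL so that the date filter and the page offset are both real query parameters, with no `#` in it. The parameter names (`q`, `hl`, `filter`, `start`, `tbs`) and the `cdr:1,cd_min:...,cd_max:...` date format are taken from the URL in the question; the helper name `build_search_url` is hypothetical, and whether Google honors these parameters for scripted requests is an assumption. The code is Python 3 and only builds URLs; it does not send any request.

```python
from urllib.parse import urlencode


def build_search_url(term, year, page, results_per_page=10):
    """Build a search URL with the date range and offset in the query string.

    Hypothetical helper: parameter names are copied from the question's URL.
    """
    params = {
        "q": term,
        "hl": "pt-BR",
        "filter": "0",
        # Offset of the first result on the requested page (0, 10, 20, ...).
        "start": page * results_per_page,
        # Custom date range, dd/mm/yyyy as in the question's URL.
        "tbs": "cdr:1,cd_min:01/01/{0},cd_max:31/12/{0}".format(year),
    }
    # urlencode escapes the values and joins them with '&'.
    return "https://www.google.com/search?" + urlencode(params)


# Print the URLs for the first three result pages of 2002.
for page in range(3):
    print(build_search_url("pt e pmdb", 2002, page))
```

Each URL differs only in `start`, so the server can tell the pages apart. The same headers and random pauses from the original script would still apply when actually requesting these URLs.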