python - Get Google search pages from specific dates

May 15, 2011

I'm trying to scrape Google for specific date ranges: year 2002, 2004, and so on. I can't use pygoogle, xgoogle, or google search, since they do not have an option to specify the period I'm searching for. So I found a query that does that, but when I run my script, Google sends me the same results no matter which search page I am on.

This is the code:

import time
import urllib2
import re
import random

# Define the search term.
agent = 'pt+e+pmdb'

# Define the request headers.
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

# Counter incremented for every link found.
contador = 0
# Dictionary where the links are stored.
links2002 = {}
# Number of pages to search through.
npages = 50

# Start the routine.
for i in range(1, npages, 1):
    # URL used for the request.
    tempurl2002 = 'https://www.google.com/search?q='+str(agent)+'&hl=pt-br&biw=1137&bih=1354&sa=x&ei=er8ru8hteiqhkqeeuocicg&ved=0cboqpwuobjgu&source=lnt&tbs=cdr%3a1%2ccd_min%3a01%2f01%2f2002%2ccd_max%3a31%2f12%2f2002&tbm=#filter=0&hl=pt-br&q='+str(agent)+'&start='+str(i*10)+'&tbs=cdr:1,cd_min:01/01/2002,cd_max:31/12/2002'
    req = urllib2.Request(tempurl2002, headers=hdr)
    # Run the search.
    searchresults = urllib2.urlopen(req)
    # Get the search data.
    page = searchresults.read()
    # Define a random pause for the algorithm.
    wt = random.uniform(10, 30)
    # Pause the algorithm to prevent Google from stopping it.
    time.sleep(wt)
    # Get the links.
    links = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', page)
    # Store the results.
    for url in links:
        contador = contador+1
        links2002[contador] = url
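One possible explanation for the repeated results is that the `start` value in the URL above sits after a `#`. Everything after `#` is a URL fragment, which a browser handles locally and which is not part of the query the server receives. The sketch below checks this on a shortened copy of the URL (the `biw`, `bih`, `sa`, `ei` and `ved` parameters are left out for brevity). It is written for Python 3's `urllib.parse`, unlike the Python 2 script above, and the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the URL from the script, for page i = 2 (start=20).
url = ("https://www.google.com/search?q=pt+e+pmdb&hl=pt-br&source=lnt"
       "&tbs=cdr%3a1%2ccd_min%3a01%2f01%2f2002%2ccd_max%3a31%2f12%2f2002&tbm="
       "#filter=0&hl=pt-br&q=pt+e+pmdb&start=20"
       "&tbs=cdr:1,cd_min:01/01/2002,cd_max:31/12/2002")

parts = urlsplit(url)

# Parameters in the query string (the part a server can see).
query_params = parse_qs(parts.query)
# Parameters that ended up in the fragment (after '#').
fragment_params = parse_qs(parts.fragment)

print("start in query:   ", "start" in query_params)
print("start in fragment:", "start" in fragment_params)
```

If `start` only shows up in the fragment, every iteration of the loop asks for the same first page, which would match the behaviour described.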

Does anyone know how to do this right? Is there a clever way to get Google search results for specific dates?

Best, Julio.
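As a rough sketch of one way to approach this, the URL could be built so that the date filter and the page offset both live in the query string, with no `#` fragment. The `tbs=cdr:1,cd_min:...,cd_max:...` values and the `dd/mm/yyyy` date order are copied from the script in the question. Whether Google still honours them for scripted requests is an assumption, and `build_dated_search_url` is a hypothetical helper name. The example uses Python 3; in Python 2 the same encoding function is `urllib.urlencode`.

```python
from urllib.parse import urlencode

def build_dated_search_url(term, year, page, results_per_page=10):
    """Build a search URL restricted to one calendar year (hypothetical helper).

    The 'tbs' date-range syntax is taken from the question's URL; it is not
    a documented API and may change.
    """
    tbs = "cdr:1,cd_min:01/01/{0},cd_max:31/12/{0}".format(year)
    params = [
        ("q", term),                         # plain text; urlencode escapes it
        ("hl", "pt-BR"),
        ("tbs", tbs),
        ("start", page * results_per_page),  # result offset for this page
    ]
    return "https://www.google.com/search?" + urlencode(params)

# Each page now gets a distinct 'start' value inside the query string.
for page in range(3):
    print(build_dated_search_url("pt e pmdb", 2002, page))
```

Passing the search term as plain text (`"pt e pmdb"`) and letting `urlencode` do the escaping avoids encoding the `+` signs a second time.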
