python - Scraping poorly structured HTML
I have a website that I want to scrape using Scrapy. It has the HTML structure shown at the bottom of this post (titled HTML). I want to be able to extract the information contained in the first <td class="small-txt dkgrey-txt rightinfotd">, i.e., the one that contains the <span property=""> tags. I am using the following code snippet to try to grab the data:
    listings = selector.css("div.whenwherecontent")
    for listing in listings:
        for body in listing.css('td.small-txt.dkgrey-txt.rightinfotd'):
            pass  # extract the fields from body here

However, since there are multiple <td> tags with the same class of td.small-txt.dkgrey-txt.rightinfotd (see the admission and tickets data at the bottom of the HTML code), I am getting duplicate results. How can I restrict the for loop to the <td> tag with the correct data to avoid this problem?
HTML

    <div class="whenwherecontent">
      <table width="100%" cellpadding="0" cellspacing="5">
        <tr>
          <td class="small-txt medgrey-txt leftlabeltd"> </td>
          <td class="small-txt dkgrey-txt rightinfotd">
            <span property="v:name"> sound academy </span>
            <span property="v:street-address"> 11 polson </span>
            <span property="v:locality"> toronto </span>
            <span property="v:postal-code"> m5a 1a4 </span>
            <span property="v:tel" style="white-space: nowrap;"> 416-461-3625 </span>
            info@sound-academy.com
            <a href="http://sound-academy.com" style="font-weight:900">
              <span property="v:url"> sound-academy.com </span>
            </a>
          </td><
        </tr>
        <tr>
          <td class="small-txt medgrey-txt leftlabeltd"> admission </td>
          <td class="small-txt dkgrey-txt rightinfotd"> $39.50-$55 </td>
        </tr>
        <tr>
          <td class="small-txt medgrey-txt leftlabeltd"> tickets @ </td>
          <td class="small-txt dkgrey-txt rightinfotd"> ln, rt, ss </td>
        </tr>
        <tr>
          <td class="small-txt medgrey-txt leftlabeltd"> when </td>
          <td class="rightinfotd">
            <div class="small-txt dkgrey-txt">
              <span property="v:datestart" content="2014-03-24"> mar 24 </span>
              <span property="v:datestart" content="2014-03-25"> mar 25 </span>
            </div>
          </td>
        </tr>
    </div>
If you want to restrict the selection to the td in the first tr, you can use the :nth-child() pseudo-class:

    listing.css('tr:nth-child(1) td.small-txt.dkgrey-txt.rightinfotd')

or equivalently:

    listing.css('tr:first-child td.small-txt.dkgrey-txt.rightinfotd')

CSS selectors can be quite helpful and are easier to maintain. In some cases, though, XPath may be the only way to achieve a specific selection. In your case, selecting the td that contains <span property="v:name"> can be done like this:

    listing.xpath('.//td[ span[ @property="v:name" ] ]')