python - Scraping poorly structured HTML


I have a website that I want to scrape using Scrapy. It has the HTML structure shown at the bottom of this post (titled HTML). I want to extract the information contained in the first <td class="small-txt dkgrey-txt rightinfotd">, i.e., the one that contains the <span property=""> tags. I am using the following code snippet to try to grab the data:

    listings = selector.css("div.whenwherecontent")
    for listing in listings:
        for body in listing.css('td.small-txt.dkgrey-txt.rightinfotd'):

However, since there are multiple <td> tags with the same class, td.small-txt.dkgrey-txt.rightinfotd (see the admission and tickets data at the bottom of the HTML code), I am getting duplicate results. How can I restrict the for loop to the <td> tag with the correct data to avoid this problem?

HTML

<div class="whenwherecontent">
    <table width="100%" cellpadding="0" cellspacing="5">
        <tr>
            <td class="small-txt medgrey-txt leftlabeltd">
            </td>
            <td class="small-txt dkgrey-txt rightinfotd">
                <span property="v:name">
                    sound academy
                </span>
                <span property="v:street-address">
                    11 polson
                </span>
                <span property="v:locality">
                    toronto
                </span>
                <span property="v:postal-code">
                    m5a 1a4
                </span>
                <span property="v:tel" style="white-space: nowrap;">
                    416-461-3625
                </span>
                info@sound-academy.com
                <a href="http://sound-academy.com" style="font-weight:900">
                    <span property="v:url">
                        sound-academy.com
                    </span>
                </a>
            </td><
        </tr>
        <tr>
            <td class="small-txt medgrey-txt leftlabeltd">
                admission
            </td>
            <td class="small-txt dkgrey-txt rightinfotd">
                $39.50-$55
            </td>
        </tr>
        <tr>
            <td class="small-txt medgrey-txt leftlabeltd">
                tickets @
            </td>
            <td class="small-txt dkgrey-txt rightinfotd">
                ln, rt, ss
            </td>
        </tr>
        <tr>
            <td class="small-txt medgrey-txt leftlabeltd">
                when
            </td>
            <td class="rightinfotd">
                <div class="small-txt dkgrey-txt">
                    <span property="v:datestart" content="2014-03-24">
                        mar&nbsp;24
                    </span>
                    <span property="v:datestart" content="2014-03-25">
                        mar&nbsp;25
                    </span>
                </div>
            </td>
        </tr>
</div>

If you want to restrict the match to the td in the first tr, you can use the :nth-child() pseudo-class:

listing.css('tr:nth-child(1) td.small-txt.dkgrey-txt.rightinfotd') 

Or, equivalently:

listing.css('tr:first-child td.small-txt.dkgrey-txt.rightinfotd') 

CSS selectors can be quite helpful and are easier to maintain. In some cases, though, XPath may be the only way to achieve a specific selection. In this case, selecting the td that contains a <span property="v:name"> can be done like this:

listing.xpath('.//td[ span[ @property="v:name" ] ]') 
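To see why matching on the class alone returns duplicates, and how the two restrictions above narrow it down, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, and it works on a trimmed, well-formed copy of the table (the SAMPLE string and the span_fields helper are made up for illustration). ElementTree supports only a subset of XPath and has no nested predicates, so the "td that contains the v:name span" idea is written as "parent of the v:name span" here. Scrapy's selectors are built on lxml, where the nested-predicate form in the answer works as written.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the table from the question (illustrative only).
# Entities such as &nbsp; are left out because ElementTree parses strict XML.
SAMPLE = """
<div class="whenwherecontent">
  <table>
    <tr>
      <td class="small-txt medgrey-txt leftlabeltd"></td>
      <td class="small-txt dkgrey-txt rightinfotd">
        <span property="v:name">sound academy</span>
        <span property="v:locality">toronto</span>
      </td>
    </tr>
    <tr>
      <td class="small-txt medgrey-txt leftlabeltd">admission</td>
      <td class="small-txt dkgrey-txt rightinfotd">$39.50-$55</td>
    </tr>
    <tr>
      <td class="small-txt medgrey-txt leftlabeltd">tickets @</td>
      <td class="small-txt dkgrey-txt rightinfotd">ln, rt, ss</td>
    </tr>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)
CLASS = "small-txt dkgrey-txt rightinfotd"

# Matching on the class alone returns one cell per row, hence the duplicates.
by_class = root.findall(f".//td[@class='{CLASS}']")

# Restricting to the first row keeps only the venue cell.
first_row = root.findall(f".//tr[1]/td[@class='{CLASS}']")

# Taking the parent of the v:name span finds the same cell by its content.
by_span = root.findall(".//span[@property='v:name']/..")


def span_fields(td):
    """Map each child span's property attribute to its stripped text."""
    return {s.get("property"): s.text.strip() for s in td.findall("span")}


print(len(by_class), len(first_row), len(by_span))
print(span_fields(by_span[0]))
```

The position-based selection is shorter but breaks if the row order changes; the content-based one keeps working as long as the venue cell still carries the v:name span.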
