Python Scrapy: scraping table columns and rows

2014-01-13 22:13:29-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/> {'company': [u"\nDomino's Pizza\n"], 'industry': [u"\nDomino's Pizza\n"], 'person': [u"\nDomino's Pizza\n"], 'url': [u'/cio100/2013/dominos-pizza/']}

2014-01-13 22:13:29-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/> {'company': [u'\nColin Rees\n'], 'industry': [u'\nColin Rees\n'], 'person': [u'\nColin Rees\n'], 'url': [u'/cio100/2013/dominos-pizza/']}

2014-01-13 22:16:46-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/> {'company': [u'\nRetail\n'], 'industry': [u'\nRetail\n'], 'person': [u'\nRetail\n'], 'url': [u'/cio100/2013/dominos-pizza/']}

1 answer

User

#1 · Posted 2024-10-01 11:32:34

As far as the XPath goes, consider doing the following:

$ scrapy shell http://www.cio.co.uk/cio100/2013/cio/
...
>>> for tr in sel.xpath('//table[@class="bgWhite listTable"]/tr'):
...     item = Cio100Item()
...     item['company'] = tr.xpath('td[2]//a/text()').extract()[0].strip()
...     item['person'] = tr.xpath('td[3]//a/text()').extract()[0].strip()
...     item['industry'] = tr.xpath('td[4]//a/text()').extract()[0].strip()
...     item['url'] = tr.xpath('td[4]//a/@href').extract()[0].strip()
...     print item
... 
{'company': u'LOCOG',
 'industry': u'Leisure and entertainment',
 'person': u'Gerry Pennell',
 'url': u'/cio100/2013/locog/'}
{'company': u'Laterooms.com',
 'industry': u'Leisure and entertainment',
 'person': u'Adam Gerrard',
 'url': u'/cio100/2013/lateroomscom/'}
{'company': u'Vodafone',
 'industry': u'Communications and IT services',
 'person': u'Albert Hitchcock',
 'url': u'/cio100/2013/vodafone/'}
...

Beyond that, you are better off yielding the items one at a time rather than accumulating them in a list.
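
To illustrate the yield-per-item pattern, here is a minimal sketch that runs with the standard library alone. It is not the asker's spider: xml.etree.ElementTree stands in for Scrapy's selectors, and the SAMPLE markup and the parse_rows name are made up for this example. In a real spider the same loop would presumably sit in the parse callback and iterate over the rows selected from the response.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, loosely modelled on the table in the answer.
SAMPLE = """
<table class="bgWhite listTable">
  <tr>
    <td>1</td>
    <td><a href="/cio100/2013/locog/">LOCOG</a></td>
    <td><a href="/cio100/2013/locog/">Gerry Pennell</a></td>
    <td><a href="/cio100/2013/locog/">Leisure and entertainment</a></td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href="/cio100/2013/vodafone/">Vodafone</a></td>
    <td><a href="/cio100/2013/vodafone/">Albert Hitchcock</a></td>
    <td><a href="/cio100/2013/vodafone/">Communications and IT services</a></td>
  </tr>
</table>
"""


def parse_rows(table):
    """Yield one dict per table row instead of building a list first."""
    for tr in table.findall("tr"):
        # Each path is relative to the current row, as in the answer's loop.
        company = tr.find("td[2]//a")
        person = tr.find("td[3]//a")
        industry = tr.find("td[4]//a")
        if company is None or person is None or industry is None:
            continue  # skip header rows or rows without links
        yield {
            "company": company.text.strip(),
            "person": person.text.strip(),
            "industry": industry.text.strip(),
            "url": industry.get("href"),
        }


# The generator produces items lazily, one row at a time.
for item in parse_rows(ET.fromstring(SAMPLE)):
    print(item["company"], "|", item["person"], "|", item["url"])
```

Because parse_rows is a generator, each item can be handed on as soon as its row has been read, and nothing has to be held in memory until the whole table is processed.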
