Extracting data from detail pages with Scrapy

from scrapy.item import Item, Field

class AgencyItem(Item):
    Phone = Field()

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from agentquery.items import AgencyItem

class AgencySpider(CrawlSpider):
    name = "agency"
    allowed_domains = ["authoradvance.com"]
    start_urls = ["http://www.authoradvance.com/agencies/"]
    rules = (Rule(SgmlLinkExtractor(allow=[r'agencies/*$']), callback='parse_item'),)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//div[@class='section-content']")
        items = []
        for site in sites:
            item = AgencyItem()
            item['Phone'] = site.select('div[@class="phone"]/text()').extract()
            items.append(item)
        return items

1 answer

User

#1 · Posted 2024-09-30 03:22:06

Only one link on the page matches your regex (agencies/*$):

stav@maia:~$ scrapy shell http://www.authoradvance.com/agencies/
2013-04-24 13:14:13-0500 [scrapy] INFO: Scrapy 0.17.0 started (bot: scrapybot)

>>> SgmlLinkExtractor(allow=[r'agencies/*$']).extract_links(response)
[Link(url='http://www.authoradvance.com/agencies', text=u'Agencies', fragment='', nofollow=False)]

That is just a link to the page itself, and that page has no div with the section-content class.

So the loop never iterates, and nothing is ever appended to items.
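
To make the empty-result behavior concrete, here is a minimal sketch that mimics parse_item outside Scrapy. It swaps in the standard library's xml.etree.ElementTree for Scrapy's selector, and both markup snippets (including the phone number) are hypothetical stand-ins, not the real pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: a listing page with no "section-content"
# div, and a detail page that has one.
LISTING_PAGE = (
    "<html><body>"
    "<div class='nav'><a href='/agencies'>Agencies</a></div>"
    "</body></html>"
)
DETAIL_PAGE = (
    "<html><body>"
    "<div class='section-content'><div class='phone'>555-0100</div></div>"
    "</body></html>"
)


def parse_item(markup):
    """Mimic the spider's parse_item: build one item per section-content div."""
    root = ET.fromstring(markup)
    items = []
    # If nothing matches, the loop body never runs and items stays empty.
    for site in root.findall(".//div[@class='section-content']"):
        phones = [p.text for p in site.findall("div[@class='phone']")]
        items.append({"Phone": phones})
    return items


if __name__ == "__main__":
    print(parse_item(LISTING_PAGE))
    print(parse_item(DETAIL_PAGE))
```

The listing page gives an empty list because the selector matches nothing there. Only a page that actually contains the div produces an item.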

So change the regex to /agencies/.+:

>>> len(SgmlLinkExtractor(allow=[r'/agencies/.+']).extract_links(response))
20
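
The difference between the two patterns can be checked with plain re, without Scrapy. This sketch assumes the link extractor applies its allow patterns with search semantics (an assumption here, not something the session above shows). The two URLs are the ones that appear in the shell session.

```python
import re

# URLs taken from the shell session.
INDEX_URL = "http://www.authoradvance.com/agencies"
DETAIL_URL = "http://www.authoradvance.com/agencies/agency-group"

# "agencies", then zero or more "/", then end of URL: only the index page.
OLD_PATTERN = re.compile(r"agencies/*$")
# "/agencies/" followed by at least one more character: the detail pages.
NEW_PATTERN = re.compile(r"/agencies/.+")


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in (INDEX_URL, DETAIL_URL):
        print(url, matches(OLD_PATTERN, url), matches(NEW_PATTERN, url))
```

In agencies/*$ the * quantifies the preceding slash, not "any characters", so the pattern cannot reach past the end of the word "agencies" into a detail-page path.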

>>> fetch('http://www.authoradvance.com/agencies/agency-group')
2013-04-24 13:25:02-0500 [default] DEBUG: Crawled (200) <GET http://www.authoradvance.com/agencies/agency-group> (referer: None)

>>> hxs.select("//div[@class='section-content']")
[<HtmlXPathSelector xpath="//div[@class='section-content']" data=u'<div
class="section-content">\n\t      <di'>, <HtmlXPathSelector xpath="//div
[@class='section-content']" data=u'<div class="section-content"><div class='>]
