Extracting data from detail pages with Scrapy

Posted 2024-09-30 03:22:06


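I am trying to scrape the agencies' phone numbers from this website:

List view: http://www.authoradvance.com/agencies/

Detail view: http://www.authoradvance.com/agencies/b-personal-management/

The phone numbers only appear on the detail pages.

So, is it possible to crawl the site, follow URLs like the detail-view URL above, and scrape the phone numbers?

I tried it with this code: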

from scrapy.item import Item, Field

class AgencyItem(Item):
    Phone = Field()

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from agentquery.items import AgencyItem


class AgencySpider(CrawlSpider):
   name = "agency"
   allowed_domains = ["authoradvance.com"]
   start_urls = ["http://www.authoradvance.com/agencies/"]
   rules = (Rule(SgmlLinkExtractor(allow=[r'agencies/*$']), callback='parse_item'),)

   def parse_item(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select("//div[@class='section-content']")
       items = []
       for site in sites:
           item = AgencyItem()
           item['Phone'] = site.select('div[@class="phone"]/text()').extract()
           items.append(item)
       return(items)

I then ran "scrapy crawl agency -o items.csv -t csv", and the result was 0 pages crawled.

What is wrong? Thanks in advance for your help!


1 Answer

Only one link on the page matches your regex (agencies/*$):

stav@maia:~$ scrapy shell http://www.authoradvance.com/agencies/
2013-04-24 13:14:13-0500 [scrapy] INFO: Scrapy 0.17.0 started (bot: scrapybot)

>>> SgmlLinkExtractor(allow=[r'agencies/*$']).extract_links(response)
[Link(url='http://www.authoradvance.com/agencies', text=u'Agencies', fragment='', nofollow=False)]

That is just a link to the listing page itself, and the listing page has no div with the class section-content.

So the loop never iterates, and nothing is ever appended to items.
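
The effect can be reproduced without Scrapy. The sketch below is only an illustration: it uses the standard library's ElementTree in place of HtmlXPathSelector, and both markup strings are made-up, simplified stand-ins, not the real page source. It shows that the same loop returns an empty list when no section-content div is present.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (assumption): not copied from the real site.
LISTING_PAGE = (
    "<html><body><ul>"
    "<li><a href='/agencies/agency-group'>Agency Group</a></li>"
    "</ul></body></html>"
)
DETAIL_PAGE = (
    "<html><body><div class='section-content'>"
    "<div class='phone'>555-0100</div>"
    "</div></body></html>"
)


def extract_phones(markup):
    """Mirror the loop in parse_item: one item per section-content div."""
    root = ET.fromstring(markup)
    items = []
    for site in root.findall(".//div[@class='section-content']"):
        phones = [d.text for d in site.findall("div[@class='phone']")]
        items.append({"Phone": phones})
    return items


listing_items = extract_phones(LISTING_PAGE)  # no matching div, loop body never runs
detail_items = extract_phones(DETAIL_PAGE)    # one matching div, one item
```

With no matching div the function returns an empty list, which is why the crawl produced no items even though the callback ran without errors.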

So change the regex to /agencies/.+

>>> len(SgmlLinkExtractor(allow=[r'/agencies/.+']).extract_links(response))
20
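
The difference between the two patterns can be checked with the re module alone. This is a minimal sketch that assumes the link extractor's allow patterns are applied as a regex search against each URL; the URL list is just the four addresses mentioned in this thread.

```python
import re

OLD_PATTERN = re.compile(r"agencies/*$")   # "agencies", zero or more slashes, end of URL
NEW_PATTERN = re.compile(r"/agencies/.+")  # "/agencies/" followed by at least one character

urls = [
    "http://www.authoradvance.com/agencies/",
    "http://www.authoradvance.com/agencies",
    "http://www.authoradvance.com/agencies/b-personal-management/",
    "http://www.authoradvance.com/agencies/agency-group",
]

# The old pattern only accepts the listing page; the new one only the detail pages.
old_hits = [u for u in urls if OLD_PATTERN.search(u)]
new_hits = [u for u in urls if NEW_PATTERN.search(u)]
```

In the old pattern the `*` applies to the slash, not to "anything", so it can never match a URL that has an agency slug after /agencies/.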

>>> fetch('http://www.authoradvance.com/agencies/agency-group')
2013-04-24 13:25:02-0500 [default] DEBUG: Crawled (200) <GET http://www.authoradvance.com/agencies/agency-group> (referer: None)

>>> hxs.select("//div[@class='section-content']")
[<HtmlXPathSelector xpath="//div[@class='section-content']" data=u'<div class="section-content">\n\t      <di'>,
 <HtmlXPathSelector xpath="//div[@class='section-content']" data=u'<div class="section-content"><div class='>]
