Scraping the pages behind URLs with Scrapy

Posted 2024-10-01 09:15:35


I'm trying to scrape craigslist with Scrapy and have successfully obtained the URLs, but now I want to extract data from the pages behind those URLs. Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist.items import CraigslistItem

class craigslist_spider(BaseSpider):
    name = "craigslist_unique"
    allowed_domains = ["craiglist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
        "http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
    "http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
    ]


def parse(self, response):
   hxs = HtmlXPathSelector(response)
   sites = hxs.select("//span[@class='pl']")
   items = []
   for site in sites:
       item = CraigslistItem()
       item['title'] = site.select('a/text()').extract()
       item['link'] = site.select('a/@href').extract()
   #item['desc'] = site.select('text()').extract()
       items.append(item)
   hxs = HtmlXPathSelector(response)
   #print title, link        
   return items

I'm new to Scrapy and can't figure out how to actually follow a URL (href) and scrape data from the page at that URL, and how to do that for all the URLs.


2 Answers

The parse method receives the response for each of the start_urls, one at a time.

If you only want to extract information from the start_urls responses, then your code is fine. But your parse method has to be inside your craigslist_spider class, not outside it.

def parse(self, response):
   hxs = HtmlXPathSelector(response)
   sites = hxs.select("//span[@class='pl']")
   items = []
   for site in sites:
       item = CraigslistItem()
       item['title'] = site.select('a/text()').extract()
       item['link'] = site.select('a/@href').extract()
       items.append(item)
   #print title, link
   return items

What if you want half of the information from the start URL and the other half from the anchors found in the start_urls responses?


You just need to yield a Request from the parse method and use the Request's meta to pass the item along.

Then, in anchor_page, pull old_item out of the meta, add the new values to it, and simply yield it.
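A minimal sketch of that pattern, staying with the legacy BaseSpider/HtmlXPathSelector API used above. The callback name anchor_page and the meta key old_item follow the wording of this answer; the Request construction, the urljoin call and the desc field are illustrative assumptions (desc would also have to be declared as a Field on CraigslistItem). Note too that allowed_domains in the question reads "craiglist.org" (missing an s), which would make the offsite middleware drop these follow-up requests:

from urlparse import urljoin            # Python 2, same era as BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for site in hxs.select("//span[@class='pl']"):
        item = CraigslistItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        # follow the listing link and carry the half-filled item along in meta
        url = urljoin(response.url, item['link'][0])
        yield Request(url, meta={'old_item': item}, callback=self.anchor_page)

def anchor_page(self, response):
    hxs = HtmlXPathSelector(response)
    old_item = response.meta['old_item']    # the item sent from parse
    # hypothetical extra field scraped from the listing page itself
    old_item['desc'] = hxs.select("//section[@id='postingbody']//text()").extract()
    yield old_item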

There is a problem with your XPaths: they should be relative. Here is the code:

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class CraigslistItem(Item):
    title = Field()
    link = Field()


class CraigslistSpider(BaseSpider):
    name = "craigslist_unique"
    allowed_domains = ["craiglist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
        "http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
        "http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//span[@class='pl']")
        items = []
        for site in sites:
            item = CraigslistItem()
            item['title'] = site.select('.//a/text()').extract()[0]  # './/' keeps the XPath relative to the current <span>
            item['link'] = site.select('.//a/@href').extract()[0]
            items.append(item)
        return items

If you run it via:

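(a likely invocation, assuming the old -t jsonlines feed option, which writes one JSON object per line:)

scrapy crawl craigslist_unique -o output.json -t jsonlines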

you will see the following in output.json:

{"link": "/sby/sof/3824966457.html", "title": "HR Admin/Tech Recruiter"}
{"link": "/eby/sof/3824932209.html", "title": "Entry Level Web Developer"}
{"link": "/sfc/sof/3824500262.html", "title": "Sr. Ruby on Rails Contractor @ Funded Startup"}
...

Hope that helps.
