How do I pass a single link in a nested URL scrape?


I have a problem scraping a subpage through links that I collect on the main page.

Each comic has its own page, so I try to open every item's page and scrape the price.

This is the spider:

class PaniniSpider(scrapy.Spider):
    name = "spiderP"
    start_urls = ["http://comics.panini.it/store/pub_ita_it/magazines.html"]

    def parse(self, response):
        # Get all the <a> tags
        for sel in response.xpath("//div[@class='list-group']//h3/a"):
            l = ItemLoader(item=ComicscraperItem(), selector=sel)
            l.add_xpath('title', './text()')
            l.add_xpath('link', './@href')

            request = scrapy.Request(sel.xpath('./@href').extract_first(), callback=self.parse_isbn, dont_filter=True)
            request.meta['l'] = l
            yield request

    def parse_isbn(self, response):
        l = response.meta['l']
        l.add_xpath('price', "//p[@class='special-price']//span/text()")
        return l.load_item()

The problem is with the links; the output looks like:

{"title": "Spider-Man 14", "link": ["http://comics.panini.it/store/pub_ita_it/mmmsm014isbn-it-marvel-masterworks-spider-man-marvel-masterworks-spider.html"], "price": ["\n                    \u20ac\u00a022,50                ", "\n                    \u20ac\u00a076,50                ", "\n                    \u20ac\u00a022,50                ", "\n                    \u20ac\u00a022,50                ", "\n                    \u20ac\u00a022,50                ", "\n                    \u20ac\u00a018,00
{"title": "Avenger di John Byrne", "link": ["http://comics.panini.it/store/pub_ita_it/momae005isbn-it-omnibus-avengers-epic-collecti-marvel-omnibus-avengers-by.html"], "price": ["\n                    \u20ac\u00a022,50                ", "\n                    \u20ac\u00a076,50                ", "\n                    \u20ac\u00a022,50  

In short, each request seems to carry the whole list of links rather than a single item's link, so the price is not unique to the item but comes back as a list.

How can I pass only the relevant item's link and store the price for each item?


2 Answers

I see two approaches:

Get it on the subpage using response.xpath:

def parse_isbn(self, response):
    l = response.meta['l']

    # this XPath runs against the comic's own page, so it matches only that item's price
    price = response.xpath("//p[@class='special-price']//span/text()").get()
    l.add_value('price', price)

    return l.load_item()
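As an aside, Scrapy 1.7+ offers cb_kwargs for passing data to a callback, which makes the hand-off explicit in the callback's signature instead of going through request.meta. A minimal sketch of the same flow, reusing the imports and ComicscraperItem from the question:

def parse(self, response):
    for sel in response.xpath("//div[@class='list-group']//h3/a"):
        l = ItemLoader(item=ComicscraperItem(), selector=sel)
        l.add_xpath('title', './text()')
        l.add_xpath('link', './@href')
        # cb_kwargs delivers the loader to parse_isbn as a named argument
        yield scrapy.Request(sel.xpath('./@href').get(),
                             callback=self.parse_isbn,
                             dont_filter=True,
                             cb_kwargs={'l': l})

def parse_isbn(self, response, l):
    l.add_value('price', response.xpath("//p[@class='special-price']//span/text()").get())
    return l.load_item()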

Or get all the needed information on the main page, where each product div holds the title, the link and the price:

for sel in response.xpath('//div[@id="products-list"]/div'):
    l = ItemLoader(item=ComicscraperItem(), selector=sel)
    l.add_xpath('title', './/h3/a/text()')
    l.add_xpath('link', './/h3/a/@href')
    l.add_xpath('price', './/p[@class="special-price"]//span/text()')
    yield l.load_item()

Then you don't need parse_isbn at all, and the spider makes one request for the whole listing instead of an extra request per comic.


For testing I used a standalone script that you can put in a single file and run without creating a project.

It gets the prices correctly.

import scrapy

def clean(text):
    text = text.replace('\xa0', ' ')
    text = text.strip().split('\n')
    text = ' '.join(x.strip() for x in text)
    return text

class PaniniSpider(scrapy.Spider):

    name = "spiderP"
    start_urls = ["http://comics.panini.it/store/pub_ita_it/magazines.html"]

    def parse(self, response):
        for sel in response.xpath('//div[@id="products-list"]/div'):
            yield {
                'title': clean(sel.xpath('.//h3/a/text()').get()),
                'link':  clean(sel.xpath('.//h3/a/@href').get()),
                'price': clean(sel.xpath('.//p[@class="special-price"]//span/text()').get()),
            }     

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',     # one of: csv, json, xml
    'FEED_URI': 'output.csv',
})
c.crawl(PaniniSpider)
c.start()
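For reference, clean() only normalizes the scraped whitespace and non-breaking spaces; fed one of the raw price strings from the output above, it returns a tidy value:

print(clean('\n                    \u20ac\u00a022,50                '))  # -> '€ 22,50'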

EDIT: if you do have to load the other pages, then use add_value together with response.xpath(...).get() instead of add_xpath:

def parse_isbn(self, response):
    l = response.meta['l']

    l.add_value('price', response.xpath("//p[@class='special-price']//span/text()").get())

    return l.load_item() 
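Because .get() returns only the first match (or None when nothing matches), price is stored as a single string for the comic being parsed instead of a list of every price on the page.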

Full example:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose

def clean(text):
    text = text.replace('\xa0', ' ')
    text = text.strip().split('\n')
    text = ' '.join(x.strip() for x in text)
    return text

class ComicscraperItem(scrapy.Item):
    title = scrapy.Field(input_processor=MapCompose(clean))
    link = scrapy.Field()
    price = scrapy.Field(input_processor=MapCompose(clean))

class PaniniSpider(scrapy.Spider):

    name = "spiderP"
    start_urls = ["http://comics.panini.it/store/pub_ita_it/magazines.html"]

    def parse(self, response):
        # Get all the <a> tags
        for sel in response.xpath("//div[@class='list-group']//h3/a"):
            l = ItemLoader(item=ComicscraperItem(), selector=sel)
            l.add_xpath('title', './text()')
            l.add_xpath('link', './@href')

            request = scrapy.Request(sel.xpath('./@href').extract_first(), callback=self.parse_isbn, dont_filter=True)
            request.meta['l'] = l
            yield request

    def parse_isbn(self, response):
        l = response.meta['l']
        l.add_value('price', response.xpath("//p[@class='special-price']//span/text()").get())
        return l.load_item()   

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',     # one of: csv, json, xml
    'FEED_URI': 'output.csv',
})
c.crawl(PaniniSpider)
c.start()
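One caveat: on recent Scrapy releases (2.1+) the FEED_FORMAT/FEED_URI pair is deprecated in favour of the FEEDS setting (and the loader processors have moved to the itemloaders package). A sketch of the updated bootstrap, assuming Scrapy >= 2.1:

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # FEEDS replaces FEED_FORMAT/FEED_URI in Scrapy >= 2.1
    'FEEDS': {'output.csv': {'format': 'csv'}},
})
c.crawl(PaniniSpider)
c.start()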

Create your item loader by subclassing scrapy's ItemLoader and applying default_output_processor = TakeFirst().

For example:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class DefaultItemLoader(ItemLoader):
    link_output_processor = TakeFirst()
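As written, link_output_processor applies TakeFirst() to the link field only, which addresses the "link is a list" part of the question; setting default_output_processor = TakeFirst() would collapse every field the same way. A minimal sketch of the loader in use, reusing ComicscraperItem and the XPaths from the question:

def parse(self, response):
    for sel in response.xpath("//div[@class='list-group']//h3/a"):
        l = DefaultItemLoader(item=ComicscraperItem(), selector=sel)
        l.add_xpath('title', './text()')
        l.add_xpath('link', './@href')  # TakeFirst() stores one string, not ['http://...']
        yield l.load_item()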

You can also refer to my project: https://github.com/yashpokar/amazon-crawler/blob/master/amazon/loaders.py
