How to run a spider on multiple URLs while concatenating a string to the URLs

Posted 2024-09-30 10:27:39


I want a spider to run on multiple URLs. However, I want to take input from the user, concatenate it onto my original URLs, and then have the spider crawl them. Here is what I did for one of the URLs:

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "gaming"

    def start_requests(self):
        # Ask the user for a search term and append it to the base URL
        product = input("Enter the item you are looking for: ")
        yield scrapy.Request(
            url=f'https://www.czone.com.pk/search.aspx?kw={product}',
            callback=self.parse
        )

    def parse(self, response):
        pass
The code above works perfectly well for a single URL. One way to handle multiple URLs is to pass a list as the start URLs, but even then the spider returns an error:

[scrapy.core.engine] ERROR: Error while obtaining start requests
ValueError: Missing scheme in request url: h

Please help.
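
For reference, this particular error usually means start_urls was assigned a bare string instead of a list: Scrapy iterates over start_urls, so a string gets iterated character by character, and the first "URL" it sees is just 'h'. A minimal reproduction, with hypothetical spider names:

import scrapy


class BrokenSpider(scrapy.Spider):
    name = "broken"
    # BUG: a bare string -- Scrapy iterates it character by character,
    # so the first "URL" it tries to fetch is just 'h'
    start_urls = 'https://www.czone.com.pk/search.aspx?kw=laptop'

    def parse(self, response):
        pass


class FixedSpider(scrapy.Spider):
    name = "fixed"
    # FIX: wrap the URL(s) in a list
    start_urls = ['https://www.czone.com.pk/search.aspx?kw=laptop']

    def parse(self, response):
        pass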


Tags: user, name, self, url, parse, def, error, product
2 Answers

Check this code:

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "gaming"

    def __init__(self, product='', **kwargs):
        # 'product' arrives from the command line via -a product=...
        self.start_urls = [
            f'https://www.czone.com.pk/search.aspx?kw={product}',
            f'https://pcfanatics.pk/search?type=product&q={product}',
            f'https://gtstore.pk/searchresults.php?inputString={product}',
        ]
        super().__init__(**kwargs)

    def start_requests(self):
        for s_url in self.start_urls:
            yield scrapy.Request(
                url=s_url,
                callback=self.parse,
            )

    def parse(self, response):
        print(self.name)
        # ... do parse things ...
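
One refinement worth adding (my suggestion, not part of the answer above): if the search term can contain spaces or other special characters, it should be URL-encoded before being interpolated into the query string. A minimal sketch using the standard library:

from urllib.parse import quote_plus

product = quote_plus('gaming laptop')   # -> 'gaming+laptop'
url = f'https://www.czone.com.pk/search.aspx?kw={product}'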

The correct way to get input into a Scrapy spider is to pass it at runtime with the -a option. For example, to run this spider you would use one of:

scrapy crawl gaming -a product='foo'

scrapy runspider <spider_filename> -a product='foo'
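
If you need several search terms in one run, one pattern (my sketch, not part of the answer above; the spider name is hypothetical) is to accept a comma-separated value via -a and split it in __init__:

import scrapy


class MultiProductsSpider(scrapy.Spider):
    name = "gaming_multi"

    def __init__(self, products='', **kwargs):
        # e.g. scrapy crawl gaming_multi -a products='laptop,desktop,cameras'
        terms = [p.strip() for p in products.split(',') if p.strip()]
        self.start_urls = [
            f'https://www.czone.com.pk/search.aspx?kw={term}'
            for term in terms
        ]
        super().__init__(**kwargs)

    def parse(self, response):
        pass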

The URL error is probably due to bad formatting; using

            f'https://www.czone.com.pk/search.aspx?kw={product}',
            f'https://pcfanatics.pk/search?type=product&q={product}',
            f'https://gtstore.pk/searchresults.php?inputString={product}',

gives me no problems at all.

Based on your question, the solution is as follows.

My code:

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "games"

    # Prompt the user for three search terms; the string passed to
    # input() is displayed as the prompt
    product = input("laptop")
    product2 = input("desktop")
    product3 = input("cameras")

    def start_requests(self):
        urls = [
            f'https://www.czone.com.pk/search.aspx?kw={self.product}',
            f'https://www.czone.com.pk/search.aspx?kw={self.product2}',
            f'https://www.czone.com.pk/search.aspx?kw={self.product3}',
        ]
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        pass
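
One caveat worth noting (my observation, not part of the answer): because the input() calls sit in the class body, they run as soon as the module is imported, before the crawl even starts. If that is undesirable, the prompts can be moved into start_requests; a sketch with a hypothetical spider name:

import scrapy


class PromptedProductsSpider(scrapy.Spider):
    name = "games_prompted"

    def start_requests(self):
        # Prompt only when the crawl actually starts, not at import time
        for prompt in ("laptop", "desktop", "cameras"):
            product = input(prompt)
            yield scrapy.Request(
                url=f'https://www.czone.com.pk/search.aspx?kw={product}',
                callback=self.parse
            )

    def parse(self, response):
        pass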
    

And the same thing with an alternative approach:

Code:

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "games2"
    # Note: the list passed to input() is only displayed as the prompt;
    # input() still returns the single line the user types
    product = input(["laptop", "desktop", "cameras"])

    def start_requests(self):
        yield scrapy.Request(
            url=f'https://www.czone.com.pk/search.aspx?kw={self.product}',
            callback=self.parse
        )

    def parse(self, response):
        pass
    

Output:

laptop
desktop
cameras
['laptop', 'desktop', 'cameras']

2021-08-12 16:53:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.czone.com.pk/search.aspx?kw=> (referer: None)
2021-08-12 16:53:39 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-12 16:53:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 312,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 19982,
 'downloader/response_count': 1,
 'downloader/response_status_count/200
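
Note that kw= in the crawled URL is empty: input() uses its argument only as a prompt (here it prints the list), and returns whatever line the user actually types, which in this run was nothing. A quick illustration in the interpreter:

>>> product = input(["laptop", "desktop", "cameras"])
['laptop', 'desktop', 'cameras']
>>> product   # the user just pressed Enter
''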
