Keeping data streams separate with a single Scrapy spider

2 Answers

User

#1 · Edited 2024-06-02 09:42:05

In your spider, yield your items like this:

data = {'categories': {}, 'contracts':{}, 'goods':{}, 'services':{}, 'construction':{} }

Here each entry holds a Python dictionary.

Then create an item pipeline and, inside it, do something like this:

if 'categories' in item:
   categories = item['categories']
   # and then process categories, save into DB maybe

if 'contracts' in item:
   contracts = item['contracts']
   # and then process contracts, save into DB maybe

# ... and likewise for 'goods', 'services' and 'construction'
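
A slightly fuller sketch of such a pipeline, assuming the spider yields plain dicts with the keys shown above. The class name `SplitStreamsPipeline` and the in-memory `stores` attribute are made up for illustration; a real pipeline would write each section to its own table or file instead. Because the item shown above carries every key with an empty dict by default, this version tests for a non-empty value rather than only for the key's presence.

```python
class SplitStreamsPipeline:
    """Route each section of a combined item to its own store."""

    # Section names taken from the item layout shown above.
    SECTIONS = ("categories", "contracts", "goods", "services", "construction")

    def __init__(self):
        # Hypothetical in-memory stores, one list per data stream.
        self.stores = {name: [] for name in self.SECTIONS}

    def process_item(self, item, spider):
        for name in self.SECTIONS:
            section = item.get(name)
            if section:  # skip sections that are missing or empty
                self.stores[name].append(section)
        return item  # pass the item on to any later pipeline
```

To take effect, the class has to be listed in the project's `ITEM_PIPELINES` setting, under whatever module path your project uses.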

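User

#2 · Edited 2024-06-02 09:42:05

I suggest approaching this problem from a different angle. In Scrapy, you can pass arguments to a spider from the command line with the -a option, like this: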

scrapy crawl CanCrawler -a contract=goods

You only need to accept the corresponding variable in the class initializer:

import scrapy

class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'
    def __init__(self, contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.contract = contract  # keep the -a value so the callbacks can use it
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...

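You could also add more than one argument, so that the spider starts from a site's home page and uses the arguments to reach whatever data you need. For example, for the site https://buyandsell.gc.ca/procurement-data/search/site, you could have two command-line arguments: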

    scrapy crawl CanCrawler -a procure=ContractHistory -a contract=goods

So you end up with:

class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'
    def __init__(self, procure='', contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        # keep both -a values so the callbacks can use them
        self.procure = procure
        self.contract = contract
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...

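Then, depending on the arguments you pass, you can have the crawler select the matching options on the site to get the data you want to crawl. Please also see here. I hope this helps!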
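
Values passed with -a reach the spider as strings, so it can help to validate them before the crawl starts. Below is a minimal sketch of that step in plain Python. The helper names `normalize_contract` and `wanted` are made up for illustration, the allowed values are borrowed from the item keys in the first answer, and treating an empty argument as "no filter" is an assumption rather than anything the site or Scrapy prescribes.

```python
# Contract types borrowed from the item layout in the first answer (an assumption).
ALLOWED_CONTRACTS = {"goods", "services", "construction"}


def normalize_contract(contract):
    """Validate the string passed with `-a contract=...`.

    Returns the cleaned value, or None when no value was given.
    """
    value = (contract or "").strip().lower()
    if not value:
        return None  # assumed meaning: no filter, keep every contract type
    if value not in ALLOWED_CONTRACTS:
        raise ValueError(f"unknown contract type: {contract!r}")
    return value


def wanted(record_type, contract_filter):
    """Return True if a scraped record should be kept under the filter."""
    return contract_filter is None or record_type == contract_filter
```

Inside the initializer you could then write `self.contract = normalize_contract(contract)` and call `wanted(...)` from the parse callback before yielding a record.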
