How do I write a Scrapy spider whose start URLs are the output of a previous spider?

Posted 2024-10-02 10:27:24


I wrote a sitemap spider like this:

from scrapy.spiders import SitemapSpider


class filmnetmapSpider(SitemapSpider):
    name = "filmnetmapSpider"
    sitemap_urls = ['http://filmnet.ir/sitemap.xml']
    # Send every sitemap URL containing /series/ to parse_item.
    sitemap_rules = [
        ('/series/', 'parse_item')
    ]

    def parse_item(self, response):
        videoid = response.xpath('/loc/text()').extract()

and extracted all the URLs from it.

I want to write another Scrapy spider whose start_urls are the output of the previous spider (the SitemapSpider).

How can I do that?


2 answers

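Assuming the first spider's output is saved as a CSV file with one URL per line, the code below reads that file line by line and uses each URL as a start URL to scrape with XPath.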

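from urllib.parse import urlparse

import scrapy


class Stage2Spider(scrapy.Spider):
    name = 'stage2'
    allowed_domains = []
    start_urls = []
    # Read the URLs collected by the first spider, one per line.
    with open('collecturls.csv', 'r') as read_urls:
        for url in read_urls:
            url = url.strip()
            if not url:
                continue
            # allowed_domains expects bare host names, not full URLs.
            allowed_domains.append(urlparse(url).netloc)
            start_urls.append(url)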
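
If the CSV comes from Scrapy's CSV feed export (for example `scrapy crawl filmnetmapSpider -o collecturls.csv`), it starts with a header row by default, so reading it as raw lines would treat the header as a URL. The sketch below is one way to handle that with the standard `csv` module. It assumes the first spider yields items with a `url` field; the helper name `load_start_urls` and the column name are illustrative, not part of the original answer.

```python
import csv
from urllib.parse import urlparse


def load_start_urls(csv_path, column="url"):
    """Return (start_urls, allowed_domains) read from a CSV with a header row.

    `column` is an assumed field name; use whatever field the first spider yields.
    """
    start_urls = []
    allowed_domains = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get(column) or "").strip()
            if not url or url in start_urls:
                continue  # skip blank cells and duplicate URLs
            start_urls.append(url)
            host = urlparse(url).hostname
            if host and host not in allowed_domains:
                allowed_domains.append(host)
    return start_urls, allowed_domains
```

Inside the spider class body, `start_urls, allowed_domains = load_start_urls('collecturls.csv')` would then replace the manual loop.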

Hope this helps.

You need some kind of database or file to store the results of one spider and read them back in the other.

from scrapy import Request, Spider


class FirstSpider(Spider):
    """First spider crawls something and stores URLs in a file, one URL per line"""
    name = 'first'
    start_urls = ['someurl']
    storage_file = 'urls.txt'

    def parse(self, response):
        # Resolve relative hrefs so the second spider gets absolute URLs.
        urls = [response.urljoin(href) for href in response.xpath('//a/@href').extract()]
        with open(self.storage_file, 'a') as f:
            f.write('\n'.join(urls) + '\n')


class SecondSpider(Spider):
    """Second spider opens this file and crawls every line in it"""
    name = 'second'

    def start_requests(self):
        with open(FirstSpider.storage_file) as file_lines:
            for line in file_lines:
                if not line.strip():  # skip empty lines
                    continue
                yield Request(line.strip())
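
The answer mentions a database as an alternative to a file. A minimal sketch of that option with the standard `sqlite3` module might look like the following. The table name `urls` and the helpers `save_urls` and `load_urls` are hypothetical names chosen for illustration; the primary key makes repeated inserts of the same URL a no-op, which the append-only text file does not do.

```python
import sqlite3


def save_urls(db_path, urls):
    """Insert URLs into a `urls` table, silently ignoring duplicates."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")
            conn.executemany(
                "INSERT OR IGNORE INTO urls (url) VALUES (?)",
                [(u,) for u in urls],
            )
    finally:
        conn.close()


def load_urls(db_path):
    """Return every stored URL in insertion order."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT url FROM urls ORDER BY rowid").fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Under these assumptions, the first spider's `parse` would call `save_urls` with the URLs it extracted, and the second spider's `start_requests` would yield one `Request` per entry returned by `load_urls`.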
