Counting duplicate URLs when crawling a website with Scrapy

Published 2024-09-29 21:40:21


How do I find the duplicate URLs on a website? The Scrapy framework does not crawl duplicate URLs by default; I just need to find out which URLs are duplicated, and how many times each occurs.

I tried to do this by counting the duplicate URLs in the close-spider function, but after some digging I realized that nothing can be yielded from that function.
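
Since the close hook runs after the engine has stopped collecting items, a common workaround is to accumulate counts while parsing and report them from closed() through the log (or crawler stats) instead of yielding. A minimal sketch, assuming a hypothetical spider name and the blog URL used in the answer below:

import scrapy
from collections import Counter


class DupCountSpider(scrapy.Spider):
    # Spider name and start URL are illustrative placeholders
    name = 'dupcount'
    start_urls = ['https://blog.scrapinghub.com']

    url_counts = Counter()

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            # Count every occurrence of the (absolute) link
            self.url_counts[response.urljoin(href)] += 1
            # Follow links; the dupefilter still prevents re-crawling pages
            yield response.follow(href, callback=self.parse)

    def closed(self, reason):
        # closed() cannot yield items, but it can log the totals
        for url, count in self.url_counts.most_common():
            if count > 1:
                self.logger.info('%s seen %d times', url, count)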


2 Answers

This Scrapy Documentation may help you get started, and the code below may be useful.

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Tally how many times each href appears on this page
        links_count = {}
        for link in response.css('a').xpath('@href').extract():
            if link in links_count:
                links_count[link] += 1
            else:
                links_count[link] = 1
        # Yield the per-page tally as a single item
        yield links_count

Run the spider, e.g.:

$ scrapy runspider blogspider.py

Result:

{' https://wordpress.org/': 1, 'https://github.com/scrapinghub': 1, 'https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples/#comments': 1, 'https://www.instagram.com/scrapinghub/': 1, 'https://blog.scrapinghub.com/2016/11/10/how-you-can-use-web-data-to-accelerate-your-startup/#comments': 1, 'https://blog.scrapinghub.com/2017/07/07/scraping-the-steam-game-store-with-scrapy/': 4, 'https://scrapinghub.com/': 1, 'https://blog.scrapinghub.com/2017/11/05/a-faster-updated-scrapinghub/#comments': 1, 'https://www.youtube.com/channel/UCYb6YWTBfD0EB53shkN_6vA': 1, 'https://blog.scrapinghub.com/2017/11/05/a-faster-updated-scrapinghub/': 4, 'https://www.facebook.com/ScrapingHub/': 1, 'https://blog.scrapinghub.com/2016/11/10/how-you-can-use-web-data-to-accelerate-your-startup/': 3, 'https://blog.scrapinghub.com/author/andre-perunicic/': 1, 'http://blog.scrapinghub.com/rss': 1, 'https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/': 1, 'https://blog.scrapinghub.com/2016/11/24/how-to-build-your-own-price-monitoring-tool/': 4, 'https://blog.scrapinghub.com/page/2/': 1, 'https://scrapinghub.com/data-on-demand': 1, 'https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/': 1, 'https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition/': 1, 'https://blog.scrapinghub.com/author/kmike84/': 1, 'https://blog.scrapinghub.com/author/cchaynessh/': 3, 'https://blog.scrapinghub.com/2016/11/24/how-to-build-your-own-price-monitoring-tool/#comments': 1, 'https://blog.scrapinghub.com/about/': 1, 'https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016/': 1, 'https://www.linkedin.com/company/scrapinghub': 1, 'https://blog.scrapinghub.com/2017/06/19/do-androids-dream-of-electric-sheep/#respond': 1, 'https://blog.scrapinghub.com/author/valdir/': 3, 'https://plus.google.com/+Scrapinghub': 1, 'https://blog.scrapinghub.com/author/scott/': 2, 'https://scrapinghub.com/data-services/': 1, 'https://blog.scrapinghub.com/': 2, 'https://blog.scrapinghub.com/2017/04/19/deploy-your-scrapy-spiders-from-github/': 4, 'https://blog.scrapinghub.com/2017/01/01/looking-back-at-2016/': 3, 'https://blog.scrapinghub.com/2017/12/31/looking-back-at-2017/#comments': 1, 'https://blog.scrapinghub.com/2016/12/15/how-to-increase-sales-with-online-reputation-management/#comments': 1, 'https://twitter.com/scrapinghub': 1, 'https://blog.scrapinghub.com/2016/12/15/how-to-increase-sales-with-online-reputation-management/': 3, 'https://blog.scrapinghub.com/2017/06/19/do-androids-dream-of-electric-sheep/': 4, 'https://blog.scrapinghub.com/2017/01/01/looking-back-at-2016/#comments': 1, 'https://blog.scrapinghub.com/2017/12/31/looking-back-at-2017/': 4, 'https://wordpress.org/themes/nisarg/': 1, 'https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples/': 3}
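
Note that each yielded dict only counts links within a single page. If the items are exported (e.g. with -o links.json), the per-page counts can be merged after the crawl; a minimal sketch, assuming that hypothetical output file name:

import json
from collections import Counter

# Merge the per-page counts exported by the spider above,
# assuming items were written with: scrapy runspider blogspider.py -o links.json
total = Counter()
with open('links.json') as f:
    for page_counts in json.load(f):
        total.update(page_counts)

for url, count in total.most_common():
    if count > 1:
        print(url, count)  # URLs appearing more than once across the crawl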

If you look at the source code of RFPDupeFilter here, you can see that it keeps a count of the filtered requests.

If you override the log() method in a subclass, you can get per-URL results with minimal effort.

Something as simple as this should do the trick, or you may want to refine it further (make sure the DUPEFILTER_CLASS setting points to your subclass):

from scrapy.dupefilters import RFPDupeFilter


class URLStatsRFPDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        # Keep the default behaviour (debug logging and the global
        # 'dupefilter/filtered' stat), then add a per-URL counter
        super().log(request, spider)
        spider.crawler.stats.inc_value(
            'dupefilter/filtered/{}'.format(request.url),
            spider=spider
        )
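
Scrapy only uses a custom dupefilter if DUPEFILTER_CLASS points at it. A minimal settings sketch, where the module path is a hypothetical placeholder for wherever the subclass is defined:

# settings.py
# 'myproject.dupefilters' is a placeholder module path
DUPEFILTER_CLASS = 'myproject.dupefilters.URLStatsRFPDupeFilter'

The per-URL counts then show up in the stats dump at the end of the crawl, under keys of the form dupefilter/filtered/<url>, next to the built-in dupefilter/filtered total.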
