How to make sure every URL gets parsed in my Scrapy spider

Posted 2024-06-26 00:01:47


I'm trying to crawl every page of the recipe listing on a food blog, grab the recipe URLs from each page, and write them all to a .txt file. My code currently runs, but it only produces output for the first URL in the urls list built inside start_requests.

I added a .log() call to check that urls really does contain all of the URLs I'm trying to scrape, and when I run Scrapy from the command prompt I get the following confirmation that they are there:

2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=1
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=2
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=3
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=4
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=5
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=6
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=7
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=8
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=9
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=10
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=11
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=12
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=13
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=14
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=15
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=16
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=17
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=18
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=19
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=20

and so on.

My current code:

import scrapy
from bs4 import BeautifulSoup


class QuotesSpider(scrapy.Spider):
    name = "recipes"

    def start_requests(self):
        urls = []
        for i in range (1, 60):
            curr_url = "https://pinchofyum.com/recipes?fwp_paged=%s" % i
            self.log(curr_url)
            urls.append(curr_url)
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        soup = BeautifulSoup(response.body, "html.parser")
        page_links = soup.find_all(class_="post-summary")    
        for link in page_links:
            with open("links.txt", "a") as f:
                f.write(link.a["href"] + "\n")

When I run this, the following output is written to links.txt:

https://pinchofyum.com/5-minute-vegan-yogurt
https://pinchofyum.com/red-curry-noodles
https://pinchofyum.com/15-minute-meal-prep-cauliflower-fried-rice-with-crispy-tofu
https://pinchofyum.com/5-ingredient-vegan-vodka-pasta
https://pinchofyum.com/lentil-greek-salads-with-dill-sauce
https://pinchofyum.com/coconut-oil-granola-remix
https://pinchofyum.com/quinoa-crunch-salad-with-peanut-dressing
https://pinchofyum.com/15-minute-meal-prep-cilantro-lime-chicken-and-lentils
https://pinchofyum.com/instant-pot-sweet-potato-tortilla-soup
https://pinchofyum.com/garlic-butter-baked-penne
https://pinchofyum.com/15-minute-meal-prep-creole-chicken-and-sausage
https://pinchofyum.com/lemon-chicken-soup-with-orzo
https://pinchofyum.com/brussels-sprouts-tacos
https://pinchofyum.com/14-must-bake-holiday-cookie-recipes
https://pinchofyum.com/how-to-cook-chicken

The links here are correct, but there should be more than 50 pages' worth of them.

Any suggestions? What am I missing?


Tags: https, debug, self, com, url, with, links, urls
1 Answer
User
#1 · Posted 2024-06-26 00:01:47

My understanding is that you want to make sure every page in urls is scraped successfully and actually contains links. If so, see the code below.

import scrapy
from scrapy import signals


class QuotesSpider(scrapy.Spider):
    name = "recipes"
    urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # scrapy.xlib.pydispatch was removed from recent Scrapy releases;
        # connect the spider_closed signal through the crawler instead
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def start_requests(self):
        for i in range(1, 60):
            curr_url = "https://pinchofyum.com/recipes?fwp_paged=%s" % i
            self.log(curr_url)
            self.urls.append(curr_url)
            yield scrapy.Request(url=curr_url, callback=self.parse)

    def parse(self, response):
        page_links = response.css(".post-summary")
        if len(page_links) > 0:
            # remove from urls to confirm that this page has been parsed
            self.urls.remove(response.url)
            for link in page_links:
                with open("links.txt", "a") as f:
                    f.write(link.css("a::attr(href)").get() + "\n")

    def spider_closed(self, spider):
        self.log("Following URLs were not parsed: %s" % self.urls)

What it does is append every URL that is going to be scraped to self.urls; once a URL has been scraped and links were found on it, it is removed from self.urls.

Note that there is also another method, spider_closed, which only runs when the scraper has finished, so it will print the URLs that were never scraped or that contained no links.
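
As a side note, Scrapy spiders also provide a built-in closed(reason) hook that the framework calls automatically when the spider finishes, which avoids wiring up the signal by hand. Below is a rough sketch of the same bookkeeping idea using that shortcut and a set instead of a list (the names RecipeLinksSpider, recipe_links, and pending_urls are made up for illustration):

import scrapy


class RecipeLinksSpider(scrapy.Spider):
    # Hypothetical variant: track pending pages in a set and use the built-in
    # closed() shortcut instead of connecting the spider_closed signal manually
    name = "recipe_links"

    def start_requests(self):
        self.pending_urls = set()
        for i in range(1, 60):
            url = "https://pinchofyum.com/recipes?fwp_paged=%s" % i
            self.pending_urls.add(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hrefs = response.css(".post-summary a::attr(href)").getall()
        if hrefs:
            # discard() does nothing if the URL has already been removed
            self.pending_urls.discard(response.url)
        for href in hrefs:
            yield {"url": href}

    def closed(self, reason):
        # called automatically by Scrapy when the spider finishes
        self.log("Pages that never produced links: %s" % self.pending_urls)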

Also, why use BeautifulSoup at all? Just use Scrapy's own Selector class.
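
For reference, here is a rough sketch of what the parse method from the question could look like with Scrapy selectors only, dropping BeautifulSoup entirely (the .post-summary CSS class is taken from the question's code, so treat it as an assumption about the page markup):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "recipes"

    def start_requests(self):
        for i in range(1, 60):
            url = "https://pinchofyum.com/recipes?fwp_paged=%s" % i
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for summary in response.css(".post-summary"):
            # first <a> href inside each post summary, like link.a["href"] did with BeautifulSoup
            href = summary.css("a::attr(href)").get()
            if href:
                with open("links.txt", "a") as f:
                    f.write(href + "\n")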
