I'm trying to scrape every page of the recipe listing on a food blog, grab the recipe URLs on each page, and write them all to a single .txt file. My code currently runs without errors, but it only scrapes the first URL listed in urls inside the start_requests method.
I've added a .log() call to check that urls really does contain all the correct URLs I'm trying to fetch, and when I run Scrapy from the command prompt I get the following confirmation that they are there:
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=1
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=2
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=3
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=4
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=5
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=6
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=7
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=8
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=9
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=10
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=11
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=12
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=13
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=14
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=15
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=16
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=17
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=18
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=19
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=20
...and so on.
My current code:
import scrapy
from bs4 import BeautifulSoup

class QuotesSpider(scrapy.Spider):
    name = "recipes"

    def start_requests(self):
        urls = []
        for i in range(1, 60):
            curr_url = "https://pinchofyum.com/recipes?fwp_paged=%s" % i
            self.log(curr_url)
            urls.append(curr_url)
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        soup = BeautifulSoup(response.body, "html.parser")
        page_links = soup.find_all(class_="post-summary")
        for link in page_links:
            with open("links.txt", "a") as f:
                f.write(link.a["href"] + "\n")
When I run this, the following output is written to links.txt:
https://pinchofyum.com/5-minute-vegan-yogurt
https://pinchofyum.com/red-curry-noodles
https://pinchofyum.com/15-minute-meal-prep-cauliflower-fried-rice-with-crispy-tofu
https://pinchofyum.com/5-ingredient-vegan-vodka-pasta
https://pinchofyum.com/lentil-greek-salads-with-dill-sauce
https://pinchofyum.com/coconut-oil-granola-remix
https://pinchofyum.com/quinoa-crunch-salad-with-peanut-dressing
https://pinchofyum.com/15-minute-meal-prep-cilantro-lime-chicken-and-lentils
https://pinchofyum.com/instant-pot-sweet-potato-tortilla-soup
https://pinchofyum.com/garlic-butter-baked-penne
https://pinchofyum.com/15-minute-meal-prep-creole-chicken-and-sausage
https://pinchofyum.com/lemon-chicken-soup-with-orzo
https://pinchofyum.com/brussels-sprouts-tacos
https://pinchofyum.com/14-must-bake-holiday-cookie-recipes
https://pinchofyum.com/how-to-cook-chicken
These links are correct, but there should be more than 50 pages' worth of them.
Any suggestions? What am I missing?
My understanding is that you want to make sure every page in urls is actually scraped and that it contains links. If so, see the code below. What it does is append all the URLs to be scraped to self.urls, and remove each URL from self.urls once it has been scraped and found to contain links. Note that there is also a method called spider_closed which runs only when the scraper finishes, so it will print any URLs that were never scraped or that contained no links.
Also, why use BeautifulSoup at all? Just use Scrapy's own Selector class.
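The code block this answer refers to is not preserved on the page. What follows is a minimal sketch of the approach it describes, assuming the spider_closed handler is connected via the standard from_crawler signal hookup and that Scrapy's CSS selectors (using the question's post-summary class) replace BeautifulSoup. The spider name, the self.urls bookkeeping, and the selector expressions are illustrative, not the answerer's exact code.

import scrapy
from scrapy import signals

class RecipesSpider(scrapy.Spider):
    name = "recipes"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Standard Scrapy pattern for connecting a signal handler:
        # spider_closed() is called once the crawl finishes.
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def start_requests(self):
        # Track every listing page we intend to scrape.
        self.urls = ["https://pinchofyum.com/recipes?fwp_paged=%s" % i
                     for i in range(1, 60)]
        # Iterate a copy so removals in parse() cannot disturb this loop.
        for url in list(self.urls):
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrapy selectors instead of BeautifulSoup: take the first <a>
        # inside each .post-summary block (mirrors link.a["href"]).
        links = []
        for card in response.css(".post-summary"):
            href = card.css("a::attr(href)").get()
            if href:
                links.append(href)
        if links:
            # This page was scraped and contained links, so stop tracking it.
            # Assumes response.url equals the requested URL (no redirect).
            if response.url in self.urls:
                self.urls.remove(response.url)
            with open("links.txt", "a") as f:
                for href in links:
                    f.write(href + "\n")

    def spider_closed(self, spider):
        # Anything still in self.urls was never scraped successfully
        # or yielded no links.
        self.log("Pages not scraped or without links: %s" % self.urls)

The point of the sketch is the bookkeeping: every request URL starts in self.urls, successful pages remove themselves, and whatever remains when the spider closes tells you which of the 59 listing pages never produced links.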