Bypassing a popup window with Scrapy (ice cream site)

Posted 2024-09-30 20:19:56


I am trying to scrape ice-cream data from https://threetwinsicecream.com/products/ice-cream/. It looks like a very simple site, yet I cannot get my spider to work, and I suspect it is because a (JavaScript) popup window is blocking my access. A condensed version of my scraper code is attached below:

import scrapy


class NutritionSpider(scrapy.Spider):
    name = 'nutrition'
    allowed_domains = ['threetwinsicecream.com']
    start_urls = ['http://threetwinsicecream.com/']

    def parse(self, response):
        products = response.xpath("//div[@id='pints']/div[2]/div")
        for product in products:
            name = product.xpath(".//a/p/text()").extract_first()
            link = product.xpath(".//a/@href").extract_first()

            yield scrapy.Request(
                url=link,
                callback=self.parse_products,
                meta={
                    "name": name,
                    "link": link
                }
            )

    def parse_products(self, response):
        name = response.meta["name"]
        link = response.meta["link"]

        serving_size = response.xpath("//div[@id='nutritionFacts']/ul/li[1]/text()").extract_first() 

        calories = response.xpath("//div[@id='nutritionFacts']/ul/li[2]/span/text()").extract_first()

        yield {
            "Name": name,
            "Link": link,
            "Serving Size": serving_size,
            "Calories": calories
        }

I worked out a workaround, but it requires manually writing out the links for every flavor, as shown below. I also tried disabling JavaScript on the site, but that did not seem to work either:

    def parse(self, response):
        urls = [
            "https://threetwinsicecream.com/products/ice-cream/madagascar-vanilla/",
            "https://threetwinsicecream.com/products/ice-cream/sea-salted-caramel/",
            ...
        ]

        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse_products
            )

    def parse_products(self, response):
        pass

Is there a way to get past the popup using Scrapy, or do I have to use another tool such as Selenium? Thanks for your help.


1 Answer

Answer #1, posted 2024-09-30 20:19:56

The spider you posted works, at least on my machine. The only thing I had to change was start_urls, from 'http://threetwinsicecream.com/' to 'https://threetwinsicecream.com/products/ice-cream/'.

When you run into problems like this, you can use Scrapy's open_in_browser function, which shows you in a browser exactly what Scrapy sees. It is documented here.
