I am trying to scrape ice-cream data from https://threetwinsicecream.com/products/ice-cream/. It looks like a very simple site, yet I cannot get my spider to work; I suspect a (JavaScript) popup is blocking my access. A condensed version of my scraper code is attached below:
```python
import scrapy


class NutritionSpider(scrapy.Spider):
    name = 'nutrition'
    allowed_domains = ['threetwinsicecream.com']
    start_urls = ['http://threetwinsicecream.com/']

    def parse(self, response):
        products = response.xpath("//div[@id='pints']/div[2]/div")
        for product in products:
            name = product.xpath(".//a/p/text()").extract_first()
            link = product.xpath(".//a/@href").extract_first()
            yield scrapy.Request(
                url=link,
                callback=self.parse_products,
                meta={
                    "name": name,
                    "link": link
                }
            )

    def parse_products(self, response):
        name = response.meta["name"]
        link = response.meta["link"]
        serving_size = response.xpath("//div[@id='nutritionFacts']/ul/li[1]/text()").extract_first()
        calories = response.xpath("//div[@id='nutritionFacts']/ul/li[2]/span/text()").extract_first()
        yield {
            "Name": name,
            "Link": link,
            "Serving Size": serving_size,
            "Calories": calories
        }
```
I came up with a workaround, but it requires manually writing out the links to all the ice-cream flavors, as shown below. I have also tried disabling JavaScript on the site, but that did not seem to work either:
```python
def parse(self, response):
    urls = [
        "https://threetwinsicecream.com/products/ice-cream/madagascar-vanilla/",
        "https://threetwinsicecream.com/products/ice-cream/sea-salted-caramel/",
        ...
    ]
    for url in urls:
        yield scrapy.Request(
            url=url,
            callback=self.parse_products
        )

def parse_products(self, response):
    pass
```
Is there a way to bypass the popup using Scrapy, or do I have to use another tool such as Selenium? Thanks for your help.
The spider you posted works fine, at least on my machine. The only thing I had to change was

    start_urls = ['http://threetwinsicecream.com/']

to

    start_urls = ['https://threetwinsicecream.com/products/ice-cream/']
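One more thing worth checking while you are at it: if the `href` attributes on the listing page ever come back as relative paths, `scrapy.Request` will refuse them, and you would want `response.urljoin(link)` before yielding the request. A minimal stdlib sketch of what that resolution does (the relative `href` here is a hypothetical example; Scrapy's `Response.urljoin` is essentially a wrapper over `urllib.parse.urljoin`):

```python
from urllib.parse import urljoin

# Listing page the spider starts from
base = "https://threetwinsicecream.com/products/ice-cream/"

# Hypothetical relative href as it might appear in the listing markup
href = "madagascar-vanilla/"

# response.urljoin(href) in Scrapy resolves relative links like this
print(urljoin(base, href))
# -> https://threetwinsicecream.com/products/ice-cream/madagascar-vanilla/

# Already-absolute hrefs pass through unchanged
print(urljoin(base, "https://threetwinsicecream.com/products/ice-cream/sea-salted-caramel/"))
```

If the links are already absolute (as in your manual workaround list), this is a no-op, so it is safe to apply unconditionally.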
When you run into problems like this, you can use Scrapy's `open_in_browser` function to see the page exactly as Scrapy sees it. It is covered in the Scrapy documentation under "Debugging Spiders".
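A quick sketch of how you would use it (the `try/except` is only there so the snippet doesn't crash where Scrapy isn't installed):

```python
# open_in_browser dumps the exact HTML body Scrapy downloaded to a temp
# file and opens it in your default browser (it also injects a <base> tag
# so styles and images resolve). Scrapy itself never executes JavaScript,
# so this shows the markup your XPaths actually run against.
try:
    from scrapy.utils.response import open_in_browser
except ImportError:  # hedge: Scrapy may not be installed where this runs
    open_in_browser = None

# Inside a spider callback you would call it like this:
#
#   def parse(self, response):
#       open_in_browser(response)  # inspect the downloaded page
#       ...
```

If the product grid is visible in that dump, the popup is purely cosmetic browser-side JavaScript and your XPaths should work without Selenium.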