Saving the "start URL" and storing it correctly in the data frame

Posted 2024-05-21 05:27:23


I'm using Scrapy to scrape some website data, but I can't get the data into the right shape.

This is the output of my code (see the code below):

On the command line:

scrapy crawl myspider -o items.csv

Output:

asin_product                         product_name
ProductA,,,ProductB,,,ProductC,,,    BrandA,,,BrandB,,,BrandC,,,
ProductA,,,ProductD,,,ProductE,,,    BrandA,,,BrandB,,,BrandA,,,

# Note that each row represents one start_url and that the ',,,'
# (three commas) separate the individual values.

Desired output:

scrapy crawl myspider -o items.csv

Start_URL     asin_product      product_name 
URL1           ProductA           BrandA
URL1           ProductB           BrandB
URL1           ProductC           BrandC
URL2           ProductA           BrandA
URL2           ProductD           BrandB
URL2           ProductE           BrandA

The code I'm using in Scrapy:

import scrapy
from amazon.items import AmazonItem

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    # Use working product URLs below
    start_urls = [
        "https://www.amazon.com/s?k=shoes&ref=nb_sb_noss_2",    # URL 1
        "https://www.amazon.com/s?k=computer&ref=nb_sb_noss_2"  # URL 2
    ]

    def parse(self, response):
        items = AmazonItem()

        title = response.xpath('//*[@class="a-size-base-plus a-color-base a-text-normal"]/text()').extract()
        asin = response.xpath('//*[@class="a-link-normal"]/@href').extract()

        # Note that I divided the products with ',,,' to make it easy to
        # separate them. I am aware that this is not the best approach.
        items['product_name'] = ',,,'.join(title).strip()
        items['asin_product'] = ',,,'.join(asin).strip()

        yield items

2 Answers
  1. Make the start URL available in the parse method

Instead of using start_urls, you can yield your initial requests from a method named start_requests (see https://docs.scrapy.org/en/latest/intro/tutorial.html?highlight=start_requests#our-first-spider).

For each request, you can pass the start URL along as metadata. That metadata is then available in your parse method (see https://docs.scrapy.org/en/latest/topics/request-response.html?highlight=meta#scrapy.http.Request.meta).

from scrapy import Request

def start_requests(self):
    urls = [...]  # this is equal to your start_urls
    for start_url in urls:
        yield Request(url=start_url, meta={"start_url": start_url})

def parse(self, response):
    start_url = response.meta["start_url"]
  2. Yield multiple items, one per product

Instead of concatenating the titles and brands, you can yield multiple items from parse. For the example below I assume the lists title and asin have the same length.

# Use distinct loop variable names so the lists aren't shadowed inside the loop.
for product_title, product_asin in zip(title, asin):
    item = AmazonItem()
    item['product_name'] = product_title
    item['asin_product'] = product_asin
    yield item
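Putting the two steps together, a minimal sketch of the full spider could look like the following. Note that it assumes AmazonItem also defines a start_url field, which is not shown in the question:

import scrapy
from scrapy import Request
from amazon.items import AmazonItem


class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    def start_requests(self):
        urls = [
            "https://www.amazon.com/s?k=shoes&ref=nb_sb_noss_2",
            "https://www.amazon.com/s?k=computer&ref=nb_sb_noss_2",
        ]
        for start_url in urls:
            # Pass the start URL along so parse() can attach it to every item.
            yield Request(url=start_url, meta={"start_url": start_url})

    def parse(self, response):
        titles = response.xpath('//*[@class="a-size-base-plus a-color-base a-text-normal"]/text()').extract()
        asins = response.xpath('//*[@class="a-link-normal"]/@href').extract()

        for product_title, product_asin in zip(titles, asins):
            item = AmazonItem()
            # 'start_url' is an assumed extra field on AmazonItem (not shown in the question).
            item['start_url'] = response.meta["start_url"]
            item['product_name'] = product_title.strip()
            item['asin_product'] = product_asin.strip()
            yield item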

PS: You should take a look at Amazon's robots.txt. They may not allow you to scrape their site and could ban your IP (https://www.amazon.de/robots.txt).

First of all, it is recommended to use CSS when querying by class.
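The reason is that an XPath test like the one in the question compares @class against one exact string, so it breaks as soon as the classes are reordered or another one is added, whereas a CSS selector matches each class on its own. A quick comparison:

# Brittle: only matches when the class attribute is exactly this string,
# with these classes in this order and spacing.
response.xpath('//*[@class="a-size-base-plus a-color-base a-text-normal"]')

# Robust: matches any element carrying all three classes,
# regardless of their order or any extra classes.
response.css('.a-size-base-plus.a-color-base.a-text-normal')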

Now back to your code:

The product name is inside the a tag (together with the product URL). So you can iterate over the links and store both the URL and the title.

<a class="a-link-normal a-text-normal" href="/adidas-Mens-Lite-Racer-Running/dp/B071P19D3X/ref=sr_1_3?keywords=shoes&amp;qid=1554132536&amp;s=gateway&amp;sr=8-3">
    <span class="a-size-base-plus a-color-base a-text-normal">Adidas masculina Lite Racer byd tênis de corrida</span>
</a>   

You need to create one AmazonItem object per row of the csv file.

def parse(self, response):

    # You need to improve this css selector because there are links which
    # are not a product, this is why I am checking if title is None and continuing.
    for product in response.css('a.a-link-normal.a-text-normal'):
        # product is a selector
        title = product.css('span.a-size-base-plus.a-color-base.a-text-normal::text').get()
        if not title:
            continue
        # The selector is already the a tag, so we only need to extract its href attribute value.
        asin = product.xpath('./@href').get()

        item = AmazonItem()
        item['product_name'] = title.strip()
        item['asin_product'] = asin.strip()

        yield item
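If you also store the start URL on each item (as in the first answer) and want the CSV columns in the order shown in your desired output, Scrapy's FEED_EXPORT_FIELDS setting controls the column order (the start_url field name here is an assumption carried over from the sketch above):

# settings.py
FEED_EXPORT_FIELDS = ["start_url", "asin_product", "product_name"]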
