Error when using XPath to scrape a specific part of an Amazon table

Posted 2024-10-02 14:26:18


I'm trying to scrape some data from an Amazon product page. My code is fairly simple:

from requests_html import HTMLSession


def getAsinData(asin, proxy):
    url = 'https://www.amazon.com/dp/' + asin

    s = HTMLSession()
    # requests selects a proxy by URL scheme, so the keys are 'http' and 'https'
    r = s.get(url, proxies={'http': proxy, 'https': proxy}, timeout=2)
    # Run the page's JavaScript in headless Chromium before querying it
    r.html.render(sleep=1)

    product = {
        'title': r.html.xpath('//*[@id="productTitle"]', first=True).text,
        'price': r.html.xpath('//*[@id="priceblock_ourprice"]', first=True).text,
        'details': r.html.xpath('/html/body/div[2]/div[2]/div[8]/div[22]/div/div/div/div[1]/div/div/table/tbody/tr[6]/td/span/span[1]', first=True).text
    }

    print(product)
    return product

The problem comes up when I try to scrape this particular part of the page: [image: Amazon Item Rank]

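Since I'm using XPath, I expected to be able to copy and paste the element's location, but that doesn't seem to work because the element has no ID or text to anchor on. It returns the following error: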

File "c:\scrape.py", line 37, in getAsinData        
    'details': r.html.xpath('/html/body/div[2]/div[2]/div[8]/div[22]/div/div/div/div[1]/div/div/table/tbody/tr[6]/td/span/span[1]', first=True).text
AttributeError: 'NoneType' object has no attribute 'text'

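So I walked up the HTML tree until I found a div with the ID prodDetails, which I used to grab the entire table visible in the image. The problem is that this obviously returns the whole table: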

'details': 'Product information\nPackage Dimensions\n18.7 x 4.7 x 4.5 inches\nItem Weight\n2.05 pounds\nManufacture/2 Inch Thick Kitchen Floor Mats, Holiday Gather 18" x 30"', 'price': '$26.99', 'details': 'Product information\nPackage Dimensions\n17.1 x 4.r\nSoHome\nASIN\nB089HPSSZF\nCustomer Reviews\n/* * Fix for UDP-1061. Average customer reviews hn/* * Fix for UDP-1061. Average customer reviews has a small extra line on hover * https://omni-grok.amazon.com/xref/src/appgroup/websiteTemplas a small extra line on hover * https://omni-grok.amazon.com/xref/src/appgroup/websiteTemplates */ .noUnderline a:hover { text-decoration: none; }\n4.7 out of 5 stars 44 ratings P.when(\'A\', \'ready\').execute(function(A) { A.declarativ/retail/SoftlinesDetailPageAssets/udp-intl-lock/src/legacy.css?indexName=WebsiteTemplates#40 */ w.ue) { ue.count("acrLinkClickCount", (ue.count("acrLinkClickCount") || 0) + 1); } }); }); P.when(\'A\', \'cf\').execute(function(A) { A.decla
.noUnderline a:hover { text-decoration: none; }\n4.2 out of 5 stars 23 ratings P.when(\'A\', \'r{ if(window.ue) { ue.count("acrStarsLinkWithPopoverClickCount", (ue.count("acrStarsLinkWithPopoverClickCount") || 0) + 1); } }); });\n\n4.7 oueady\').execute(function(A) { A.declarative(\'acrLink-click-metrics\', \'click\', { "allowLinkDe#144 in Floor Comfort Mats\n\nDate First Available\nJune 1, 2020\nWarranty & Support\nProduct Warranty: For warranty information about this prfault" : true }, function(event){ if(window.ue) { ue.count("acrLinkClickCount", (ue.count("acrLinkClickCount") || 0) + 1); } }); }); P.when(\'A\', \'cf\').execute(function(A) { A.declarative(\/2 Inch Thick Kitchen Floor Mats, Damask Grey 18" x 30"', 'price': '$26.99', 'details': 'Product information\nPackage Dimensions\n18.7 x 4.7 x'acrStarsLink-click-metrics\', \'click\', { "allowLinkDefault" : true }, function(event){ if(winn/* * Fix for UDP-1061. Average customer reviews has a small extra line on hover * https://omni-grok.amazon.com/xref/src/appgroup/websiteTempldow.ue) { ue.count("acrStarsLinkWithPopoverClickCount", (ue.count("acrStarsLinkWithPopoverClickC */ .noUnderline a:hover { text-decoration: none; }\n4.2 out of 5 stars 23 ratings P.when(\'A\', \'ready\').execute(function(A) { A.declarativount") || 0) + 1); } }); });\n\n4.2 out of 5 stars\nBest Sellers Rank\n#114,617 in Kitchen & Dinw.ue) { ue.count("acrLinkClickCount", (ue.count("acrLinkClickCount") || 0) + 1); } }); }); P.when(\'A\', \'cf\').execute(function(A) { A.declaing (See Top 100 in Kitchen & Dining)\n#304 in Floor Comfort Mats\n\nDate First Available\nJuly { if(window.ue) { ue.count("acrStarsLinkWithPopoverClickCount", (ue.count("acrStarsLinkWithPopoverClickCount") || 0) + 1); } }); });\n\n4.2 ou
24, 2020\nWarranty & Support\nProduct Warranty: For warranty information about this product, plen#304 in Floor Comfort Mats\n\nDate First Available\nJuly 24, 2020\nWarranty & Support\nProduct Warranty: For warranty information about this ase click here\nFeedback\nWould you like to tell us about a lower price?'}

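So I figure there must be a way to write an XPath that targets only the rank value; otherwise I'll have to parse the whole table somehow and pull out what I need. I'd appreciate any advice on how to target the rank specifically, or help working out how to parse the rank out of the whole table.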


1 Answer
User
#1 · Posted 2024-10-02 14:26:18

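It looks like your function is designed to work for any ASIN, but one problem I foresee is that Amazon's detail pages are inconsistent. The HTML hierarchy of the product details section changes depending on the layout.

However, assuming you only care about this particular ASIN (B084BN9PWN), why not iterate over the tags that hold the Best Sellers Rank information you want?

For example: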

# In requests_html, r.html.xpath() returns a list of elements
# (.get() and .getall() belong to Scrapy selectors, not to this library)
rank_spans = r.html.xpath('//*[@id="productDetails_detailBullets_sections1"]/tbody/tr[6]/td/span/span')
for span in rank_spans:
    browse_rank_string = span.text
    print(browse_rank_string)

I haven't tested this, so the XPath may be slightly off.
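
If the XPath route proves brittle, the other option raised in the question, parsing the rank out of the whole table's text, might look roughly like the sketch below. It assumes the text keeps the shape seen in the question's output: a "Best Sellers Rank" label followed by one "#N in Category" entry per line, ending at a blank line. The function name `parse_best_sellers_rank` and the sample string are illustrative, not part of any library.

```python
import re

# Matches entries such as "#114,617 in Kitchen & Dining"; the capture stops
# before any "(See Top 100 in ...)" suffix or the end of the line.
RANK_PATTERN = re.compile(r"#([\d,]+) in ([^\n(]+)")


def parse_best_sellers_rank(details_text):
    """Return (rank, category) pairs found after the 'Best Sellers Rank' label."""
    # Discard everything before the label so other '#' strings are ignored.
    _, label, tail = details_text.partition("Best Sellers Rank")
    if not label:
        return []  # label absent: this layout has no rank section
    # Assumption: the rank section ends at the first blank line.
    section = tail.lstrip().split("\n\n", 1)[0]
    return [
        (int(rank.replace(",", "")), category.strip())
        for rank, category in RANK_PATTERN.findall(section)
    ]


# Sample text modeled on the output shown in the question.
sample = (
    "Customer Reviews\n4.2 out of 5 stars\n"
    "Best Sellers Rank\n"
    "#114,617 in Kitchen & Dining (See Top 100 in Kitchen & Dining)\n"
    "#304 in Floor Comfort Mats\n\n"
    "Date First Available\nJuly 24, 2020"
)
print(parse_best_sellers_rank(sample))
```

In the question's function, `details_text` would be the `.text` of the prodDetails div. The embedded CSS and script fragments in that text come before the label, so the partition step should skip them, though that is worth checking against real pages.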

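I use a paid proxy service called ProxyCrawl, which offers an Amazon scraper alongside its basic proxy service and does the scraping for you. You simply request the data you need for an ASIN. That's another option to consider, especially if you're already paying for proxies. Alternatively, you could map out a series of XPath attempts, one for each detail-page variant.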
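
A series of XPath attempts could be sketched as follows. This is a minimal, untested-against-Amazon illustration: `first_match_text` and `RANK_XPATHS` are hypothetical names, both expressions are guesses about the page layout, and `FakeHTML` is only a stand-in for `r.html` so the example runs without a network request. It relies on the `first=True` form of `xpath()` returning `None` when nothing matches, which is the behaviour behind the AttributeError in the question.

```python
def first_match_text(html, xpaths):
    """Try each XPath in order; return the text of the first match, else None."""
    for expression in xpaths:
        element = html.xpath(expression, first=True)
        if element is not None:
            return element.text
    return None


# Candidate expressions, one per known layout. Both are assumptions.
RANK_XPATHS = [
    '//*[@id="productDetails_detailBullets_sections1"]/tbody/tr[6]/td/span',
    '//th[contains(normalize-space(.), "Best Sellers Rank")]/following-sibling::td',
]


class FakeElement:
    """Stand-in for a matched element; only the .text attribute is modeled."""

    def __init__(self, text):
        self.text = text


class FakeHTML:
    """Stand-in for r.html: maps an XPath string to an element or None."""

    def __init__(self, matches):
        self._matches = matches

    def xpath(self, expression, first=False):
        return self._matches.get(expression)


# Simulate a page where only the second layout is present.
page = FakeHTML({RANK_XPATHS[1]: FakeElement("#304 in Floor Comfort Mats")})
print(first_match_text(page, RANK_XPATHS))
```

With a real response you would pass `r.html` instead of `page`. Returning `None` rather than calling `.text` on a missing element lets the caller decide how to handle a layout that none of the expressions cover.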
