Error when scraping a specific part of an Amazon table with XPath

from requests_html import HTMLSession

def getAsinData(asin, proxy):
    url = 'https://www.amazon.com/dp/' + asin
    s = HTMLSession()
    r = s.get(url, proxies={'http://': proxy, 'https://': proxy}, timeout=2)
    r.html.render(sleep=1)
    product = {
        'title': r.html.xpath('//*[@id="productTitle"]', first=True).text,
        'price': r.html.xpath('//*[@id="priceblock_ourprice"]', first=True).text,
        'details': r.html.xpath('/html/body/div[2]/div[2]/div[8]/div[22]/div/div/div/div[1]/div/div/table/tbody/tr[6]/td/span/span[1]', first=True).text
    }
    print(product)
    return(product)

  File "c:\scrape.py", line 37, in getAsinData
    'details': r.html.xpath('/html/body/div[2]/div[2]/div[8]/div[22]/div/div/div/div[1]/div/div/table/tbody/tr[6]/td/span/span[1]', first=True).text
AttributeError: 'NoneType' object has no attribute 'text'

'details': 'Product information\nPackage Dimensions\n18.7 x 4.7 x 4.5 inches\nItem Weight\n2.05 pounds\nManufacture/2 Inch Thick Kitchen Floor Mats, Holiday Gather 18" x 30"', 'price': '$26.99', 'details': 'Product information\nPackage Dimensions\n17.1 x 4.r\nSoHome\nASIN\nB089HPSSZF\nCustomer Reviews\n/* * Fix for UDP-1061. Average customer reviews hn/* * Fix for UDP-1061. Average customer reviews has a small extra line on hover * https://omni-grok.amazon.com/xref/src/appgroup/websiteTemplas a small extra line on hover * https://omni-grok.amazon.com/xref/src/appgroup/websiteTemplates */ .noUnderline a:hover { text-decoration: none; }\n4.7 out of 5 stars 44 ratings P.when(\'A\', \'ready\').execute(function(A) { A.declarativ/retail/SoftlinesDetailPageAssets/udp-intl-lock/src/legacy.css?indexName=WebsiteTemplates#40 */ w.ue) { ue.count("acrLinkClickCount", (ue.count("acrLinkClickCount") || 0) + 1); } }); }); P.when(\'A\', \'cf\').execute(function(A) { A.decla .noUnderline a:hover { text-decoration: none; }\n4.2 out of 5 stars 23 ratings P.when(\'A\', \'r{ if(window.ue) { ue.count("acrStarsLinkWithPopoverClickCount", (ue.count("acrStarsLinkWithPopoverClickCount") || 0) + 1); } }); });\n\n4.7 oueady\').execute(function(A) { A.declarative(\'acrLink-click-metrics\', \'click\', { "allowLinkDe#144 in Floor Comfort Mats\n\nDate First Available\nJune 1, 2020\nWarranty & Support\nProduct Warranty: For warranty information about this prfault" : true }, function(event){ if(window.ue) { ue.count("acrLinkClickCount", (ue.count("acrLinkClickCount") || 0) + 1); } }); }); P.when(\'A\', \'cf\').execute(function(A) { A.declarative(\/2 Inch Thick Kitchen Floor Mats, Damask Grey 18" x 30"', 'price': '$26.99', 'details': 'Product information\nPackage Dimensions\n18.7 x 4.7 x'acrStarsLink-click-metrics\', \'click\', { "allowLinkDefault" : true }, function(event){ if(winn/* * Fix for UDP-1061. 
Average customer reviews has a small extra line on hover * https://omni-grok.amazon.com/xref/src/appgroup/websiteTempldow.ue) { ue.count("acrStarsLinkWithPopoverClickCount", (ue.count("acrStarsLinkWithPopoverClickC */ .noUnderline a:hover { text-decoration: none; }\n4.2 out of 5 stars 23 ratings P.when(\'A\', \'ready\').execute(function(A) { A.declarativount") || 0) + 1); } }); });\n\n4.2 out of 5 stars\nBest Sellers Rank\n#114,617 in Kitchen & Dinw.ue) { ue.count("acrLinkClickCount", (ue.count("acrLinkClickCount") || 0) + 1); } }); }); P.when(\'A\', \'cf\').execute(function(A) { A.declaing (See Top 100 in Kitchen & Dining)\n#304 in Floor Comfort Mats\n\nDate First Available\nJuly { if(window.ue) { ue.count("acrStarsLinkWithPopoverClickCount", (ue.count("acrStarsLinkWithPopoverClickCount") || 0) + 1); } }); });\n\n4.2 ou 24, 2020\nWarranty & Support\nProduct Warranty: For warranty information about this product, plen#304 in Floor Comfort Mats\n\nDate First Available\nJuly 24, 2020\nWarranty & Support\nProduct Warranty: For warranty information about this ase click here\nFeedback\nWould you like to tell us about a lower price?'}

1 answer

User

#1 · Posted 2024-10-02 14:26:18

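It looks like your function is meant to work for any ASIN, but one problem I foresee is that Amazon's detail pages are not consistent. The HTML hierarchy of the product details section changes depending on the page layout, so an absolute path that matches one product can return None on another.

However, assuming you only care about this particular ASIN (B084BN9PWN), why not iterate over the elements that hold the Best Sellers Rank information you want?

For example: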

# requests_html's xpath() returns a plain list of Element objects, so iterate over it directly
# (.get() and .getall() belong to Scrapy selectors, not requests_html)
rank_spans = r.html.xpath('//*[@id="productDetails_detailBullets_sections1"]/tbody/tr[6]/td/span/span')
for span in rank_spans:
    print(span.text)

I haven't tested this, so the XPath may be slightly off.
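Another option, since a broader XPath already returns the full text of the table (as in the output shown in the question), is to pull the ranks out of that text with a regular expression instead of addressing rows by position. This is a different technique from the XPath loop above, offered only as a sketch: the function name `parse_best_sellers_rank` is made up for illustration, and the sample string is modeled on the question's output rather than fetched from a live page.

```python
import re

# Matches entries such as "#304 in Floor Comfort Mats" or
# "#114,617 in Kitchen & Dining (See Top 100 in Kitchen & Dining)".
# The category stops at a newline or an opening parenthesis.
RANK_PATTERN = re.compile(r"#([\d,]+) in ([^\n(]+)")


def parse_best_sellers_rank(details_text):
    """Return a list of (rank, category) pairs found in the table text."""
    return [
        (int(rank.replace(",", "")), category.strip())
        for rank, category in RANK_PATTERN.findall(details_text)
    ]


# Sample text modeled on the 'details' value printed in the question.
sample = (
    "Best Sellers Rank\n"
    "#114,617 in Kitchen & Dining (See Top 100 in Kitchen & Dining)\n"
    "#304 in Floor Comfort Mats\n\n"
    "Date First Available\nJuly 24, 2020"
)

print(parse_best_sellers_rank(sample))
```

This does not depend on the rank sitting in `tr[6]`, which may help when rows move around between layouts, but it does assume the rank text keeps the "#N in Category" shape.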

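I use a paid proxy service called ProxyCrawl which, alongside the basic proxy service, offers an Amazon scraper that does the scraping for you. You simply request the data you need for an ASIN. That is another option to consider, especially if you are already paying for proxies. Alternatively, you could map out a series of XPaths to try, one for each detail-page variant.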
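The idea of trying one XPath per layout variant could be sketched roughly as follows. The helper `first_matching_text` and the `DETAIL_XPATHS` list are hypothetical names; only the first XPath comes from this answer, and the second is an assumed placeholder that you would replace with whatever you actually observe on other layouts. The helper also avoids the `'NoneType' object has no attribute 'text'` error by checking each result before reading `.text`.

```python
# Hypothetical candidates, one per page layout you have observed.
# Only the first id appears in the answer above; the second is an assumption.
DETAIL_XPATHS = [
    '//*[@id="productDetails_detailBullets_sections1"]',
    '//*[@id="detailBullets_feature_div"]',
]


def first_matching_text(find_first, xpaths):
    """Try each XPath in order and return the text of the first match.

    find_first is any callable that takes an XPath string and returns an
    element with a .text attribute, or None when nothing matches.
    Returns None if no candidate matches.
    """
    for xpath in xpaths:
        element = find_first(xpath)
        if element is not None:
            return element.text
    return None


# With requests_html, the call might look like this (not executed here):
#   details = first_matching_text(
#       lambda xp: r.html.xpath(xp, first=True), DETAIL_XPATHS
#   )
```

Passing the lookup in as a callable keeps the helper independent of the scraping library, so the same fallback logic can be exercised without hitting Amazon.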
