Scraping all products in a category from Amazon with BeautifulSoup

2 answers

User

Answer 1 · edited 2024-09-30 05:19:19

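If you open the page in your browser's inspector (for example Chrome DevTools), you will find the href pointing to the desired URL on the link that wraps each product image. For example, the href for the first Samsung TV listed is at the following XPath: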

/html/body/div[1]/div[2]/div[2]/div[1]/div[3]/div[2]/div[2]/ul/li[1]/span/div/a

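From here, you need to find a way to generalize the search so that it matches every product rather than only the first one.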
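One way to generalize is to drop the positional index `li[1]` so that every list item matches. BeautifulSoup does not evaluate XPath, so the sketch below uses a CSS selector that mirrors the tail of the XPath above. The markup in `SAMPLE_HTML`, the `/dp/...` hrefs, and the helper name `product_links` are hypothetical stand-ins for illustration, not Amazon's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the category page.
SAMPLE_HTML = """
<ul>
  <li><span><div><a href="/dp/AAA"><img src="a.jpg"></a></div></span></li>
  <li><span><div><a href="/dp/BBB"><img src="b.jpg"></a></div></span></li>
  <li><span><div><a href="/dp/CCC"><img src="c.jpg"></a></div></span></li>
</ul>
"""


def product_links(html):
    """Return the href of every product link (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Same tail as the XPath (ul/li[1]/span/div/a) but without the [1],
    # so every <li> matches instead of only the first.
    return [a["href"] for a in soup.select("ul > li > span > div > a[href]")]


links = product_links(SAMPLE_HTML)
print(links)
```

On the real page the selector would probably need a more specific anchor, such as a class or id on the list container, since an absolute path from `/html/body` tends to break when the layout changes.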

User

Answer 2 · edited 2024-09-30 05:19:19

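You need a selector that targets every img whose src ends in .jpg, while excluding a few other matches that appear earlier in the page. Scoping the match to .a-row and using :not to drop the images inside the first .bxc-grid__row does this. Finally, wrap the results in a set to remove duplicates.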

import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint

# Send a browser-like User-Agent header with the request
r = requests.get('https://www.amazon.es/b/ref=sv_ap_arrow_ce_4_1_1_1?node=934359031', headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
# .jpg images inside .a-row, minus those in the first .bxc-grid__row; set() drops duplicate URLs
images = set(i['src'] for i in soup.select('.a-row img[src$=jpg]:not(.bxc-grid__row:nth-child(1) img[src$=jpg])'))
pprint(images)
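To see what the selector keeps and drops without requesting the live page, here is a minimal offline sketch that runs the same selector against a small hand-written fragment. The markup and file names in `SAMPLE_HTML` are assumptions made up for illustration; only the selector string comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first grid row holds a banner, the second holds products.
SAMPLE_HTML = """
<div class="a-row">
  <div class="bxc-grid__row"><img src="banner.jpg"></div>
  <div class="bxc-grid__row">
    <img src="tv1.jpg"><img src="tv1.jpg"><img src="icon.png">
  </div>
</div>
"""

SELECTOR = ".a-row img[src$=jpg]:not(.bxc-grid__row:nth-child(1) img[src$=jpg])"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# banner.jpg is excluded by :not (it sits in the first .bxc-grid__row),
# icon.png is excluded by [src$=jpg], and the set collapses the repeated tv1.jpg.
images = {img["src"] for img in soup.select(SELECTOR)}
print(images)
```

Running the selector on a saved copy of the page first is a cheap way to confirm the exclusion still matches before scraping, since class names on the live site may change.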
