I'm working on a course project, but the data I pulled from Amazon is missing product names, prices, and categories. Since I don't have an AWS account for the API, I decided to look up this information from the ASINs (product IDs) I already have. I'm still fairly new to web scraping (e.g., XML/HTML structure). The scraping part of the code was adapted from a working forum-scraping project, but it doesn't work here.
I also tried BeautifulSoup, which I found in a similar Amazon project, but that didn't work either. Since Selenium is more versatile, I'd really prefer to learn it this way. Here is the code, with the non-functional XPaths marked:
from selenium import webdriver
from random import randint
from time import sleep  # was missing from the original imports

asin_set = ['0151004714', '0380709473', '0511189877', '0528881469', '0545105668',
            '0557348153', '0594033926', '0594296420', '0594450268', '0594451647',
            '0594459451', '0594481902', '059449771X']

driver = webdriver.Chrome()
list_of_dicts = []
print('This is gonna be LEGEN... wait for it:')
for i in asin_set[:5]:
    url = f'https://www.amazon.com/gp/product/{i}'
    driver.get(url)
    product_info = {}
    product_info['asin'] = i
    try:
        name = driver.find_elements_by_xpath('//*[@id="' + x + '"]') #<---
        product_info['name'] = name.text('productTitle') #<---
    except:
        product_info['name'] = 0
    try:
        price = driver.find_elements_by_xpath('//*[@id="' + x + '"]') #<---
        product_info['price'] = price.text #<---
    except:
        product_info['price'] = 0
    try:
        category = driver.find_elements_by_xpath('//*[@id="' + x + '"]/ul/li[5]/span/a') #<---
        product_info['category'] = category.get_attribute('wayfinding-breadcrumbs_feature_div') #<---
    except:
        product_info['category'] = 0
    list_of_dicts.append(product_info)  # Append this page's scrape to the list
    print(str(len(list_of_dicts)) + ' . ', end='')  # Print the running count of scrapes
    sleep(randint(1, 2))  # Sleep 1 or 2 seconds between scrapes
print('DARY!')
The cell runs fine and the browser opens each page, but nothing is being accessed or stored correctly. The result I get in the list is:
[{'asin': '0151004714', 'name': 0, 'price': 0, 'category': 0},
{'asin': '0380709473', 'name': 0, 'price': 0, 'category': 0},
{'asin': '0511189877', 'name': 0, 'price': 0, 'category': 0},
{'asin': '0528881469', 'name': 0, 'price': 0, 'category': 0},
{'asin': '0545105668', 'name': 0, 'price': 0, 'category': 0}]
Answer: Use WebDriverWait() and wait for visibility_of_element_located() on the elements, instead of sleep(). Console output:

This is gonna be LEGEN... wait for it: 1 . 2 . 3 . 4 . 5 . 6 . 7 . 8 . 9 . 10 . 11 . 12 . 13 . DARY!