Scraping an eBay featured collection for product page links

Posted 2024-05-20 20:21:33


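I'm trying to build a web scraper with Python and BeautifulSoup that goes into an eBay featured collection and retrieves the URLs of all the products in that collection (most collections have 17 products, though some have a few more or fewer). Here is the URL of the collection I'm trying to scrape in my code: http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018

Here is my code so far: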

import requests
from bs4 import BeautifulSoup

url = 'http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

product_links = []

item_thumb = soup.find_all('div', attrs={'class':'itemThumb'})
for link in item_thumb:
    product_links.append(link.find('a').get('href'))

print(product_links)

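This scraper should append 17 links to the list product_links. However, it only partly works. Specifically, it grabs only the first 12 product links every time and leaves the remaining 5 untouched, even though all 17 links sit inside the same HTML tags and attributes. Looking more closely at the page's HTML, the only difference I found is that the first 12 links and the last 5 are separated by a block of XML script in the page source.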

What does this script do? Could it be the reason my scraper ignores the last 5 links? Is there a way to work around it and grab the last five as well?


1 answer
User
Answer 1 · posted 2024-05-20 20:21:33

The last few are generated by an ajax request to http://www.ebay.com/cln/_ajax/2/ebayhomeeditor/324079803018.


That URL is built from ebayhomeeditor and what is presumably some product id, 324079803018, both of which appear in the original URL of the page you visit.
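The URL pattern described above can be derived from the collection URL in code. This is a minimal sketch, assuming the page path always has the form /cln/<user>/<title>/<id>; the helper name build_ajax_url is hypothetical and not part of any eBay API.

```python
from urllib.parse import urlsplit


def build_ajax_url(collection_url):
    """Derive the ajax endpoint from a collection page URL (hypothetical helper)."""
    parts = urlsplit(collection_url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumed path layout: ['cln', <user>, <title>, <id>]
    if len(segments) < 4 or segments[0] != "cln":
        raise ValueError("unexpected collection URL: %r" % collection_url)
    user, collection_id = segments[1], segments[-1]
    return "%s://%s/cln/_ajax/2/%s/%s" % (
        parts.scheme, parts.netloc, user, collection_id)


page = "http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018"
ajax_url = build_ajax_url(page)
print(ajax_url)
```

The helper raises ValueError instead of guessing when the path does not match the assumed layout, since other collection URLs may be shaped differently.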

The only parameter required to get data back is itemsPerPage, but you can play with the other parameters and see what effect they have.

import requests
from bs4 import BeautifulSoup

params = {"itemsPerPage": "10"}
soup = BeautifulSoup(requests.get("http://www.ebay.com/cln/_ajax/2/ebayhomeeditor/324079803018", params=params).content, "html.parser")
print([a["href"] for a in soup.select("div.itemThumb div.itemImg.image.lazy-image a[href]")])

This gives you the item URLs served by the ajax endpoint.

So putting it all together gets you all the URLs:

In [23]: params = {"itemsPerPage": "10"}

In [24]: with requests.Session() as s:
   ....:         soup1 = BeautifulSoup(s.get('http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018').content,
   ....:                               "html.parser")
   ....:         main_urls = [a["href"] for a in soup1.select("div.itemThumb div.itemImg.image.lazy-image a[href]")]
   ....:         soup2 = BeautifulSoup(s.get("http://www.ebay.com/cln/_ajax/2/ebayhomeeditor/324079803018", params=params).content,
   ....:                               "html.parser")
   ....:         print(len(main_urls))
   ....:         main_urls.extend(a["href"] for a in soup2.select("div.itemThumb div.itemImg.image.lazy-image a[href]"))
   ....:         print(main_urls)
   ....:         print(len(main_urls))
   ....:     
12
['http://www.ebay.com/itm/archi-desk-accessories-pen-cup-designed-by-hsunli-huang-for-moma/262435041373?hash=item3d1a58f05d', 'http://www.ebay.com/itm/moorea-seal-violet-light-crane-scissors/201600302323?hash=item2ef0507cf3', 'http://www.ebay.com/itm/kikkerland-photo-holder-with-6-magnetic-wooden-clothespin-mh69-cable-47-long/361394782932?hash=item5424cec2d4', 'http://www.ebay.com/itm/authentic-22-design-studio-merge-concrete-pen-holder-desk-office-pencil/331846509549?hash=item4d4397e3ed', 'http://www.ebay.com/itm/supergal-bookend-by-artori-design-ad103-metal-black/272273290322?hash=item3f64c0b452', 'http://www.ebay.com/itm/elago-p2-stand-for-ipad-tablet-pcchampagne-gold/191527567203?hash=item2c97eebf63', 'http://www.ebay.com/itm/this-is-ground-mouse-pad-pro-ruler-100-authentic-natural-retail-100/201628986934?hash=item2ef2062e36', 'http://www.ebay.com/itm/hot-fuut-foot-rest-hammock-under-desk-office-footrest-mini-stand-hanging-swing/152166878943?hash=item236dda4edf', 'http://www.ebay.com/itm/unido-silver-white-black-led-desk-office-lamp-adjustable-neck-brightness-level/351654910666?hash=item51e0441aca', 'http://www.ebay.com/itm/in-house-black-desk-office-organizer-paper-clips-memo-notes-monkey-business/201645856763?hash=item2ef30797fb', 'http://www.ebay.com/itm/rifle-paper-co-2017-maps-desk-calendar-illustrated-worldwide-cities/262547131670?hash=item3d21074d16', 'http://www.ebay.com/itm/muji-erasable-pen-black/262272348079?hash=item3d10a66faf', 'http://www.ebay.com/itm/rifle-paper-co-2017-maps-desk-calendar-illustrated-worldwide-cities/262547131670?hash=item3d21074d16', 'http://www.ebay.com/itm/muji-erasable-pen-black/262272348079?hash=item3d10a66faf', 'http://www.ebay.com/itm/yamazaki-home-tower-book-end-white-stationary-holder-desktop-organizing-steel/171836462366?hash=item280240551e', 'http://www.ebay.com/itm/tetris-constructible-interlocking-desk-lamp-neon-light-nightlight-by-paladone/221571335719?hash=item3396ae4627', 
'http://www.ebay.com/itm/iphone-docking-station-dock-native-union-new-in-box/222202878086?hash=item33bc52d886', 'http://www.ebay.com/itm/turnkey-pencil-sharpener-silver-office-home-school-desk-gift-peleg-design/201461359979?hash=item2ee808656b', 'http://www.ebay.com/itm/himori-weekly-times-desk-notepad-desktop-weekly-scheduler-30-weeks-planner/271985620013?hash=item3f539b342d']
19


The returned content overlaps a little, so either store main_urls in a set from the start or call set on the list:

In [25]: len(set(main_urls))
Out[25]: 17
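A set drops the duplicates but also loses the page order of the links. If the order matters, one possible alternative is to deduplicate through dictionary keys. This is only a sketch; dedupe_keep_order is a hypothetical helper name and the list below holds placeholder strings, not real item links.

```python
def dedupe_keep_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    # dict keys preserve insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(urls))


# Placeholder values standing in for main_urls
urls = ["a", "b", "c", "b", "c", "d"]
unique = dedupe_keep_order(urls)
print(unique)
```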

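I'm not sure why that happens and haven't really tried to figure it out. If it bothers you, you could parse "totalItems:17" from the source returned by the ajax call, subtract the length of main_urls after the first call from that total, and set {"itemsPerPage": str(int(parsedtotal) - len(main_urls))}, but I wouldn't worry about it too much.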
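The totalItems idea could look roughly like the sketch below. The exact way totalItems appears in the ajax response is an assumption here (the regular expression tolerates optional quotes and whitespace), and remaining_items is a hypothetical helper name.

```python
import re


def remaining_items(ajax_source, already_fetched):
    """Return how many items are still missing, or None if no total is found."""
    # Assumed format: totalItems:17, possibly with quotes or spaces around it
    match = re.search(r'"?totalItems"?\s*:\s*"?(\d+)', ajax_source)
    if match is None:
        return None
    # Never return a negative count if more links were fetched than expected
    return max(int(match.group(1)) - already_fetched, 0)


sample = "{totalItems:17}"  # made-up fragment for illustration only
left = remaining_items(sample, 12)
params = {"itemsPerPage": str(left)}
print(params)
```

Returning None when the pattern is missing lets the caller fall back to a fixed itemsPerPage and deduplicate afterwards.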
