Python BeautifulSoup: scraping multiple products from an online store

Posted 2024-10-03 09:16:21


Suppose I want to crawl this site: https://www.alibaba.com/consumer-electronics/action-sports-camera/p44_p201340102?spm=a2700.8293689.HomeLeftCategory.d201340102.2f9a67afhxyQdZ

Is it possible to open the first product, scrape for example the title, price and image, then go back to the overview page and do the same for the next product, until all products have been covered?


2 Answers

The idea is quite simple. The product links on the page are only loaded as you scroll down, so you have to use Selenium to scroll to the end of the page. Once you have scrolled to the bottom, grab the site's HTML with driver.page_source and parse it with BeautifulSoup to extract all the links. Here is how you can do it:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://www.alibaba.com/consumer-electronics/action-sports-camera/p44_p201340102?spm=a2700.8293689.HomeLeftCategory.d201340102.2f9a67afhxyQdZ')

# Keep scrolling to the bottom until the page height stops growing,
# i.e. until no more products are lazy-loaded.
lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
match = False
while not match:
    lastCount = lenOfPage
    time.sleep(1)
    lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    if lastCount == lenOfPage:
        match = True

time.sleep(3)
html = driver.page_source  # the fully rendered HTML
driver.close()

soup = BeautifulSoup(html, 'html5lib')

# Each product card is a <div class="grid-col-item">; the product link
# is the first <a> inside its first child <div>.
div_tags = soup.find_all('div', class_="grid-col-item")

links = []
for div in div_tags:
    links.append(div.div.a['href'])

print(links)

Output:

['//www.alibaba.com/product-detail/2020-Full-HD-4k-1080P-go_62556989288.html', '//www.alibaba.com/product-detail/Followsun-50-in-1-Accessories-for_62065838705.html', '//www.alibaba.com/product-detail/Factory-lowest-Price-720p-action-camera_60828536337.html', '//www.alibaba.com/product-detail/New-Product-2-0-Inch-Ltps_62394746927.html', '//www.alibaba.com/product-detail/Waterproof-full-hd-1080p-720p-sport_1600084796811.html', '//www.alibaba.com/product-detail/2020-Full-HD-1080P-Go-pro_62555774741.html', '//www.alibaba.com/product-detail/A7-Action-Camera-4k-HD720P-Sports_62255736516.html', '//www.alibaba.com/product-detail/Sports-Camera-4K-Action-Camera-Ultra_62504138600.html', '//www.alibaba.com/product-detail/2016-Hot-sale-Xiaomi-Yi-Action_60434045578.html' ... '//www.alibaba.com/product-detail/Promotion-item-wide-angle-action-camera_60819668707.html']
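
To then open each of those product pages and pull the title, price and image, as the question asks, one possibility is to loop over links and parse every detail page in turn. The following is only a minimal sketch: the selectors used below are assumptions, the real tag and class names on the Alibaba detail pages have to be checked in the browser's dev tools, and Selenium may again be needed if those pages render their content with JavaScript.

from bs4 import BeautifulSoup
import requests

for link in links:
    # The collected hrefs are protocol-relative ('//www.alibaba.com/...'),
    # so prepend a scheme before requesting them.
    url = 'https:' + link if link.startswith('//') else link
    detail = BeautifulSoup(requests.get(url).text, 'html5lib')

    # Hypothetical selectors for illustration only; adjust them to the
    # real markup of the detail pages.
    title_tag = detail.find('h1')
    price_tag = detail.find(class_='price')
    image_tag = detail.find('img')

    print(title_tag.get_text(strip=True) if title_tag else None,
          price_tag.get_text(strip=True) if price_tag else None,
          image_tag.get('src') if image_tag else None)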

Edit:

Here is the code for the actual site you want to scrape:

from bs4 import BeautifulSoup
import requests

# This page does not need Selenium: the product list is in the static HTML.
r = requests.get('https://video.xortec.de/search?sSearch=hikvision&p=1&o=1&n=24').text

soup = BeautifulSoup(r, 'html5lib')

# Every product link is an <a> tag carrying the classes "product title".
a_tags = soup.find_all('a', class_="product title")

links = []
for a in a_tags:
    links.append(a['href'])

print(links)

Output:

['https://video.xortec.de/hikvision-ds-2df4220-dx-w/316l', 'https://video.xortec.de/hikvision-ds-2td2137-35/py', 'https://video.xortec.de/hikvision-ds-2td2137-25/py', 'https://video.xortec.de/hikvision-ds-2td2137-15/py', 'https://video.xortec.de/hikvision-ds-2td2137-10/py', 'https://video.xortec.de/hikvision-ds-2td2137-7/py', 'https://video.xortec.de/hikvision-ds-2td2137-4/py', 'https://video.xortec.de/hikvision-ds-2td2137-4/v1', 'https://video.xortec.de/hikvision-ds-2df8c842ixs-ael-t2', 'https://video.xortec.de/hikvision-ds-2df8a442ixs-af/sp-t2', 'https://video.xortec.de/hikvision-ds-2de5432iw-ae-e', 'https://video.xortec.de/hikvision-ds-2de5425w-ae-e', 'https://video.xortec.de/hikvision-ds-2de5425iw-ae-e', 'https://video.xortec.de/hikvision-ds-2de5330w-ae-e', 'https://video.xortec.de/hikvision-ds-2de5232w-ae-e', 'https://video.xortec.de/hikvision-ds-2de5232iw-ae-e', 'https://video.xortec.de/hikvision-ds-2de5225w-ae-e', 'https://video.xortec.de/hikvision-ds-2de5225iw-ae-e', 'https://video.xortec.de/hikvision-ds-2de4425w-de-e', 'https://video.xortec.de/hikvision-ds-2de4225w-de-e', 'https://video.xortec.de/hikvision-ds-2de4215w-de-e']

Here is my code, for better visibility:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://video.xortec.de/search?sSearch=hikvision&p=1&o=1&n=24")
soup = BeautifulSoup(r.text, "html.parser")

# Each product card is a <div class="product detail-btn">; its first <a> is the detail link.
products = soup.find_all('div', class_="product detail-btn")

links = []
for product in products:
    links.append(product.a['href'])

print(links)

How can I now go through that list to scrape the articles? It looks like my real site is much simpler than my example site.
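
One way to continue from that list is to loop over it, request each detail page, and pick the wanted fields out of the returned HTML. A minimal sketch follows; the tag and class names used below are only guesses for illustration and have to be verified against the real page source of a detail page.

import requests
from bs4 import BeautifulSoup

for link in links:
    detail = BeautifulSoup(requests.get(link).text, "html.parser")

    # Hypothetical selectors: inspect a real detail page to find the actual ones.
    title = detail.find('h1', class_='product--title')
    price = detail.find('span', class_='price--content')

    print(title.get_text(strip=True) if title else 'title not found',
          price.get_text(strip=True) if price else 'price not found')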
