Published 2024-10-03 09:16:21
Asked by a user:
Suppose I want to crawl this site: https://www.alibaba.com/consumer-electronics/action-sports-camera/p44_p201340102?spm=a2700.8293689.HomeLeftCategory.d201340102.2f9a67afhxyQdZ
Is it possible to open the first product and scrape, for example, its title, price, and image, then go back to the overview page and do the same for the next product, until all products are covered?
The idea is quite simple. The links on this page only load as you scroll down, so you have to use `selenium` to scroll to the bottom of the page. Once you have scrolled to the end, grab the site's HTML with `driver.page_source` and parse it with `BeautifulSoup` to extract all the links. Here is how:
```python
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://www.alibaba.com/consumer-electronics/action-sports-camera/p44_p201340102?spm=a2700.8293689.HomeLeftCategory.d201340102.2f9a67afhxyQdZ')

# Keep scrolling until the page height stops growing,
# i.e. all lazy-loaded products have appeared
lenOfPage = driver.execute_script(
    "window.scrollTo(0, document.body.scrollHeight);"
    "var lenOfPage=document.body.scrollHeight;return lenOfPage;")
match = False
while not match:
    lastCount = lenOfPage
    time.sleep(1)
    lenOfPage = driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
        "var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    if lastCount == lenOfPage:
        match = True

time.sleep(3)
html = driver.page_source
driver.close()

soup = BeautifulSoup(html, 'html5lib')
div_tags = soup.find_all('div', class_="grid-col-item")
links = [div.div.a['href'] for div in div_tags]
print(links)
```
Output:
['//www.alibaba.com/product-detail/2020-Full-HD-4k-1080P-go_62556989288.html', '//www.alibaba.com/product-detail/Followsun-50-in-1-Accessories-for_62065838705.html', '//www.alibaba.com/product-detail/Factory-lowest-Price-720p-action-camera_60828536337.html', '//www.alibaba.com/product-detail/New-Product-2-0-Inch-Ltps_62394746927.html', '//www.alibaba.com/product-detail/Waterproof-full-hd-1080p-720p-sport_1600084796811.html', '//www.alibaba.com/product-detail/2020-Full-HD-1080P-Go-pro_62555774741.html', '//www.alibaba.com/product-detail/A7-Action-Camera-4k-HD720P-Sports_62255736516.html', '//www.alibaba.com/product-detail/Sports-Camera-4K-Action-Camera-Ultra_62504138600.html', '//www.alibaba.com/product-detail/2016-Hot-sale-Xiaomi-Yi-Action_60434045578.html' ... '//www.alibaba.com/product-detail/Promotion-item-wide-angle-action-camera_60819668707.html']
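Note that the scraped hrefs are protocol-relative (they start with `//`), so a scheme has to be prepended before `requests.get` can fetch them. A minimal sketch using only the standard library:

```python
from urllib.parse import urljoin

def absolutize(link, scheme="https"):
    # The scraped hrefs are protocol-relative ("//host/path");
    # prepend a scheme so they become fetchable absolute URLs.
    return urljoin(scheme + ":", link)

print(absolutize('//www.alibaba.com/product-detail/2020-Full-HD-4k-1080P-go_62556989288.html'))
# https://www.alibaba.com/product-detail/2020-Full-HD-4k-1080P-go_62556989288.html

# Each absolute URL could then be fetched and parsed in turn, e.g.:
# for link in links:
#     r = requests.get(absolutize(link))
#     ... parse title/price/image out of r.text with BeautifulSoup
```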
Edit:
Here is the code for the actual site you want to scrape:
```python
from bs4 import BeautifulSoup
import requests

# The original URL ended in a stray encoded quote (%22); it is dropped here
r = requests.get('https://video.xortec.de/search?sSearch=hikvision&p=1&o=1&n=24').text
soup = BeautifulSoup(r, 'html5lib')
a_tags = soup.find_all('a', class_="product title")
links = [a['href'] for a in a_tags]
print(links)
```
['https://video.xortec.de/hikvision-ds-2df4220-dx-w/316l', 'https://video.xortec.de/hikvision-ds-2td2137-35/py', 'https://video.xortec.de/hikvision-ds-2td2137-25/py', 'https://video.xortec.de/hikvision-ds-2td2137-15/py', 'https://video.xortec.de/hikvision-ds-2td2137-10/py', 'https://video.xortec.de/hikvision-ds-2td2137-7/py', 'https://video.xortec.de/hikvision-ds-2td2137-4/py', 'https://video.xortec.de/hikvision-ds-2td2137-4/v1', 'https://video.xortec.de/hikvision-ds-2df8c842ixs-ael-t2', 'https://video.xortec.de/hikvision-ds-2df8a442ixs-af/sp-t2', 'https://video.xortec.de/hikvision-ds-2de5432iw-ae-e', 'https://video.xortec.de/hikvision-ds-2de5425w-ae-e', 'https://video.xortec.de/hikvision-ds-2de5425iw-ae-e', 'https://video.xortec.de/hikvision-ds-2de5330w-ae-e', 'https://video.xortec.de/hikvision-ds-2de5232w-ae-e', 'https://video.xortec.de/hikvision-ds-2de5232iw-ae-e', 'https://video.xortec.de/hikvision-ds-2de5225w-ae-e', 'https://video.xortec.de/hikvision-ds-2de5225iw-ae-e', 'https://video.xortec.de/hikvision-ds-2de4425w-de-e', 'https://video.xortec.de/hikvision-ds-2de4225w-de-e', 'https://video.xortec.de/hikvision-ds-2de4215w-de-e']
Here is my code, for better visibility:
```python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://video.xortec.de/search?sSearch=hikvision&p=1&o=1&n=24")
soup = BeautifulSoup(r.text, "html.parser")
products = soup.find_all('div', class_="product detail-btn")
links = [product.a['href'] for product in products]
print(links)
```
How do I now go through that list to scrape the individual articles? It looks like my real site is much simpler than my example site.
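One way to walk the list is to request each collected link in turn and parse out the fields you need. A minimal sketch — the tag and class names inside `parse_article` are hypothetical placeholders, not verified against the live video.xortec.de product pages, so inspect a real page and adjust them:

```python
import time
import requests
from bs4 import BeautifulSoup

def parse_article(html):
    """Extract title and price from a product page.
    The selectors below are assumptions for illustration only --
    check the actual page markup and adapt the tag/class names."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="product--title")
    price = soup.find("span", class_="price--content")
    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }

def scrape_all(links, delay=1.0):
    # Visit every product link gathered from the overview page
    results = []
    for url in links:
        r = requests.get(url)
        results.append(parse_article(r.text))
        time.sleep(delay)  # be polite to the server between requests
    return results

# results = scrape_all(links)
# print(results)
```

Since `requests.get` already returns each product page directly, there is no need to "go back" to the overview page as a browser would; the overview is only fetched once to collect the links.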