Web scraping with Python and BeautifulSoup: what does the BeautifulSoup function save?

    from bs4 import BeautifulSoup
    import urllib.request
    import re

    url = "https://www.tipico.de/de/live-wetten/"

    try:
        page = urllib.request.urlopen(url)
    except Exception as e:
        print(f"An error occurred: {e}")
        raise  # without a response there is nothing to parse

    soup = BeautifulSoup(page, 'html.parser')
    regex = re.compile('c_but_base c_but')
    content_lis = soup.find_all('button', attrs={'class': regex})
    print(content_lis)

    from bs4 import BeautifulSoup
    import urllib.request
    import re

    url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"

    try:
        page = urllib.request.urlopen(url)
    except Exception as e:
        print(f"An error occurred: {e}")
        raise  # without a response there is nothing to parse

    soup = BeautifulSoup(page, 'html.parser')
    regex = re.compile('ui-touchlink-needsclick price odd-price')
    content_lis = soup.find_all('button', attrs={'class': regex})
    print(content_lis)

1 answer

User
#1 · Posted on 2024-06-15 00:20:54

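This happens because the site uses JavaScript to render these details, and BeautifulSoup by itself does not execute JavaScript.
First check whether the elements you want to scrape are present in the page source. If they are, you can scrape almost anything. In your case, the button/span tags are not in the page source, which means they are hidden or loaded by a script.
There are no <button> tags in the page source.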
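The check described above can be sketched without touching the network. This is a minimal illustration, not the answerer's code: `TagCounter`, `count_tag`, and `SAMPLE_SOURCE` are hypothetical names, and the sample HTML is an invented stand-in for a page whose buttons are created by a script at runtime. It uses only the standard library's `html.parser`, which, like BeautifulSoup, sees only the static markup.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each tag name opens in a piece of static HTML."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html, tag_name):
    """Return the number of <tag_name> start tags present in the raw HTML."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.get(tag_name.lower(), 0)


# Hypothetical page source: the button is created by a script in the browser,
# so it never appears as a tag in the HTML the server sends.
SAMPLE_SOURCE = """
<html><body>
  <div id="app"></div>
  <script>
    var b = document.createElement("button");
    b.textContent = "1,85";
    document.getElementById("app").appendChild(b);
  </script>
</body></html>
"""

if __name__ == "__main__":
    print(count_tag(SAMPLE_SOURCE, "button"))  # prints 0
```

If the same count on the real page source is also zero, a parser working on the downloaded HTML has nothing to find, whatever class pattern is passed to `find_all`.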
So I suggest Selenium as a solution, and I tried a basic scrape of the site with it.
Here is the code I used:
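    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    option = webdriver.ChromeOptions()
    option.add_argument('--headless')  # run Chrome without opening a window
    option.binary_location = r'Your chrome.exe file path'

    # Selenium 4 takes the driver path via Service and locators via By
    browser = webdriver.Chrome(
        service=Service(executable_path=r'Your chromedriver.exe file path'),
        options=option,
    )
    browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")

    # Print the text of every <span> on the rendered page
    span_tags = browser.find_elements(By.TAG_NAME, 'span')
    for span_tag in span_tags:
        print(span_tag.text)

    browser.quit()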
The output contains some junk data, but it should give you an idea of what you need and what you don't.
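One way to separate the useful values from the junk is to filter the span texts by shape. The sketch below is only an illustration under a stated assumption: that the odds appear as decimals such as "1,85" or "2.10". The names `ODDS_PATTERN` and `extract_odds` and the sample list are hypothetical, and the real page may format its values differently.

```python
import re

# Assumption: odds are shown as decimals such as "1,85" or "2.10".
ODDS_PATTERN = re.compile(r"\d+[.,]\d{1,2}")


def extract_odds(span_texts):
    """Keep only strings that look like decimal odds and convert them to floats."""
    odds = []
    for text in span_texts:
        candidate = text.strip()
        if ODDS_PATTERN.fullmatch(candidate):
            # Normalise the decimal comma before converting
            odds.append(float(candidate.replace(",", ".")))
    return odds


if __name__ == "__main__":
    # Hypothetical span texts, standing in for the span_tag.text values
    sample = ["", "Football", "1,85", " 3.40 ", "Live", "12", "2,1"]
    print(extract_odds(sample))  # prints [1.85, 3.4, 2.1]
```

In the Selenium loop, the list passed in would be built from `span_tag.text` for each span instead of the hard-coded sample.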
