BeautifulSoup doesn't find all div tags

Posted 2024-09-30 20:26:09

I've started a private project: web scraping with Python and BeautifulSoup in Visual Studio Code (1.41.0).

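I was able to scrape another site that has the same structure as my "problem site". But now BeautifulSoup doesn't find all the div tags: each page should have 20, and I only get 3. I've searched Stack Overflow but haven't found a solution (or evidently didn't understand it).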

Website: https://www.comparis.ch/gesundheit/arzt/pathologie

The HTML structure I'm interested in looks like this:

[Three screenshots of the page's HTML structure; images not available]

I get all the <div class="css-15dj4ut"></div> elements inside <div class="css-fh99y9 excbu0j0">...</div>, but none of those inside <div class="css-roynbj excbu0j0"></div>. Any idea why?

Iterating over each URL to visit each page:

for i in range(0, endIndex):
    try:
        if i == 0:
            urls.append(basicUrl)
            page = urllib.request.urlopen(urls[i])
            soup = BeautifulSoup(page, 'html.parser')

            getSurgeonName(soup)

        else:
            urls.append(basicUrl + urlAddon + str(i + 1))
            page = urllib.request.urlopen(urls[i])
            soup = BeautifulSoup(page, 'html.parser')

            getSurgeonName(soup)

    except:
        print("An URL request error occured.")

Function, version 1:

def getSurgeonName(soup):
    # gets just first 3 surgeons of site
    docName = re.compile('css-15dj4ut')
    docNameTags = soup.find_all('div', attrs={'class': docName})
    for a in docNameTags:
        docNameList.append(a.getText())

Function, version 2:

def getSurgeonName(soup):

    parentClass = re.compile('css-fh99y9 excbu0j0')
    parentItems = soup.find_all('div', attrs={'class': parentClass})

    for parent in parentItems:
        children = parent.findChildren('div', {"class": "css-15dj4ut"})
        docNameList.append(children[0].getText())

    parentClass = re.compile('css-roynbj excbu0j0')
    parentItems = soup.find_all('div', attrs={'class': parentClass})

    for parent in parentItems:
        children = parent.findChildren('div', {'class': 'css-15dj4ut'})
        docNameList.append(children[0].getText())

1 answer
User
#1 · Posted 2024-09-30 20:26:09

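Actually, the data you want is loaded dynamically by JavaScript after the page loads, so neither requests nor urllib will see it: they fetch the HTML but do not execute JavaScript. However, I was able to locate a script tag that holds the data as a JSON string, which can then be loaded with json.loads.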

From there you can parse whatever you need :)

import requests
from bs4 import BeautifulSoup
import json

r = requests.get("https://www.comparis.ch/gesundheit/arzt/pathologie")
soup = BeautifulSoup(r.content, 'html.parser')
# the page data is embedded as JSON text inside <script id="__NEXT_DATA__">
script = soup.find("script", {'id': '__NEXT_DATA__'}).text

data = json.loads(script)

print(data.keys())  # top-level keys of the JSON dict

dumper = json.dumps(data, indent=4)

print(dumper)  # to see it in human-readable format
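
The extraction step can also be tried offline, without hitting the site. The following is only a minimal sketch: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages, and both the NextDataExtractor class and the SAMPLE_HTML string are made up for illustration (the real page source is far larger and may be laid out differently).

```python
import json
from html.parser import HTMLParser


class NextDataExtractor(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # script content may arrive in several pieces, so accumulate them
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed JSON from the script tag, or None if it is absent."""
    parser = NextDataExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    return json.loads(raw) if raw.strip() else None


# Made-up, heavily shortened stand-in for the real page source
SAMPLE_HTML = """<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"doctorResults": {"doctorModels": [{"name": "Dr. A"}, {"name": "Dr. B"}]}}}}
</script>
</body></html>"""

sample_data = extract_next_data(SAMPLE_HTML)
sample_names = [
    d["name"]
    for d in sample_data["props"]["pageProps"]["doctorResults"]["doctorModels"]
]
print(sample_names)
```

Returning None when the script tag is missing makes it easy to tell "the page no longer embeds its data this way" apart from a JSON parsing error.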

For example, to print each doctor's name:

for item in data['props']['pageProps']['doctorResults']['doctorModels']:
    print(item['name'])

Output:

Mohamed Abdou
Dr. med. Heiner Adams
Dr. med. Franziska Aebersold
Prof. Dr. med. Adriano Aguzzi
Dr. med. Maria Ammann
Prosper Anani
Dr. med. Max Arnaboldi
Dr. med. Walter Arnold
Dr. med. Irena Baltisser
Dr. med. Fridolin Bannwart
Dr. med. Yara Banz
Dr. med. André Barghorn
Dr. Jessica Barizzi
Prof. Dr. med. Daniel Baumhoer
Audrey Baur Chaubert
Dr. med. Christian Georg Bayerl
Dr. med. Marc Beer
Dr. med. Sabina Berezowska
Dr. med. Steffen Bergelt
Dr. med. Barbara Elisabeth Berger-Denzler
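
Indexing the nested dict directly raises a KeyError as soon as the site renames or drops one of those keys. A more defensive lookup could look roughly like this; dig and doctor_names are hypothetical helper names, and the sample dict is invented to mirror the key path used above.

```python
def dig(obj, *keys, default=None):
    """Walk nested dicts key by key; return default if any level is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def doctor_names(data):
    """Return the 'name' of every doctor entry, skipping malformed entries."""
    models = dig(data, "props", "pageProps", "doctorResults", "doctorModels")
    if not isinstance(models, list):
        return []
    return [m["name"] for m in models if isinstance(m, dict) and "name" in m]


# Invented sample that mirrors the key path from the answer
sample = {
    "props": {
        "pageProps": {
            "doctorResults": {
                "doctorModels": [{"name": "Dr. A"}, {"id": 7}, {"name": "Dr. B"}]
            }
        }
    }
}

print(doctor_names(sample))         # ['Dr. A', 'Dr. B']
print(doctor_names({"props": {}}))  # []
```

An empty list then signals that the expected structure was not found, so the scraper can log the page and move on instead of crashing mid-run.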
