Confused by BeautifulSoup.find?

Published 2024-09-30 18:17:40


I'm trying to scrape the universities attended by attorneys at a particular law firm, but I can't figure out how to grab the links for both universities listed on this page: https://www.wlrk.com/attorney/hahn/. As shown in the first image, the two universities this attorney attended are each tagged as an 'li'.

When I run the code below, I only get the HTML up to the end of the first 'li' tag (as shown in the second image), not the second li section, so I only get the first university, 'Carleton College':

import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
personal_soup = soup(res.content, "html.parser")
education = personal_soup.find("div", {'class': 'attorney--education'})
education.li.a.text  # 'Carleton College' -- only the first school comes back

[images: html code snippet / output]


2 Answers

bs only picks up the first li element; I'm not sure why. If you want to try lxml instead, here is one way:

import requests
from lxml import html


url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})

# lxml tolerates the stray closing </a> tags, so every <li> is kept
tree = html.fromstring(res.content)
education = tree.xpath("//div[@class='attorney--education']//li/a/text()")

print(education)

Output:

['Carleton College', 'New York University School of Law']

Change your parser. I would use select and point directly at the a elements; lxml is more forgiving and can handle the stray closing </a> tags that shouldn't be there. Also, find only returns the first match, whereas find_all returns all of them.

For example:

<a href="/attorneys/?asf_ugs=257">Carleton College</a></a>

Stray end tag a.

From line 231, column 127; to line 231, column 130

ollege</a></a>, 2013

Stray end tag a.

From line 231, column 239; to line 231, column 242

of Law</a></a>, J.D.

source

import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
# the lxml parser copes with the stray </a> tags, so both <li> entries survive
personal_soup = soup(res.content, "lxml")
educations = [a.text for a in personal_soup.select('.attorney--education a')]
print(educations)
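
To illustrate the find vs find_all point above, here is a minimal sketch, assuming the same page and the 'attorney--education' class from the question: once the markup is parsed with the lxml parser, find still returns only the first match, while find_all returns both li entries.

import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})

# lxml parser, so the stray </a> tags don't truncate the list
personal_soup = soup(res.content, "lxml")
education_div = personal_soup.find("div", {'class': 'attorney--education'})

first_li = education_div.find("li")        # only the first <li>
all_li = education_div.find_all("li")      # every <li> in the education block

print(first_li.a.text)                     # 'Carleton College'
print([li.a.text for li in all_li])        # both schools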
