Confused by BeautifulSoup.find?

Published 2024-09-30 18:17:40


I'm trying to scrape the universities attended by attorneys at a particular law firm, but I can't figure out how to grab the links for both universities listed on this page: https://www.wlrk.com/attorney/hahn/. As shown in the first image, the two universities this attorney attended are each tagged as an 'li'.

When I run the code below, I only get the HTML up to the end of the first 'li' tag (as shown in the second image), not the second li section, so I only get the first university, 'Carleton College':

import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
personal_soup = soup(res.content, "html.parser")
education = personal_soup.find("div", {'class': 'attorney--education'})
education.li.a.text  # 'Carleton College' -- only the first school comes back

[images: html code snippet / output]


2 Answers

bs only picks up the first li element; I'm not sure why. If you want to try lxml instead, here is one way:

import requests
from lxml import html


url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})

# lxml tolerates the stray closing </a> tags, so every <li> is kept
tree = html.fromstring(res.content)
education = tree.xpath("//div[@class='attorney--education']//li/a/text()")

print(education)

Output:

['Carleton College', 'New York University School of Law']

Change your parser. I would use select and point directly at the a elements; lxml is more forgiving and can handle the stray closing </a> tags that shouldn't be there. Also, find only returns the first match, whereas find_all returns all of them.

For example:

<a href="/attorneys/?asf_ugs=257">Carleton College</a></a>

Stray end tag a.

From line 231, column 127; to line 231, column 130

ollege</a></a>, 2013

Stray end tag a.

From line 231, column 239; to line 231, column 242

of Law</a></a>, J.D.

source

import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
# the lxml parser copes with the stray </a> tags, so both <li> entries survive
personal_soup = soup(res.content, "lxml")
educations = [a.text for a in personal_soup.select('.attorney--education a')]
print(educations)
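
To illustrate the find vs find_all point above, here is a minimal sketch, assuming the same page and the 'attorney--education' class from the question: once the markup is parsed with the lxml parser, find still returns only the first match, while find_all returns both li entries.

import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})

# lxml parser, so the stray </a> tags don't truncate the list
personal_soup = soup(res.content, "lxml")
education_div = personal_soup.find("div", {'class': 'attorney--education'})

first_li = education_div.find("li")        # only the first <li>
all_li = education_div.find_all("li")      # every <li> in the education block

print(first_li.a.text)                     # 'Carleton College'
print([li.a.text for li in all_li])        # both schools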
