Question about parsing HTML with BeautifulSoup

Posted 2024-09-30 01:29:04


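I've just started learning to parse HTML with BeautifulSoup in Python, and I have a very simple, silly question. For some reason I just can't get Text 1 on its own out of the HTML below (stored in a container):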

....
<div class="listA">
<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>
</div>
...
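from bs4 import BeautifulSoup

# driver is defined elsewhere (not shown in the question)
soup = BeautifulSoup(driver.page_source, 'html.parser')
containers = soup.find_all("div", {"class": "listA"})
datas = []
for data in containers:
    textspan = data.find("span")  # the outer <span>
    datas.append(textspan.text)   # .text joins all nested strings with no separator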

The output is: Text 1Text 2Text 3

Any suggestions on how to separate them? Thanks, much appreciated.


2 answers

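Another solution uses SimplifiedDoc from the simplified_scrapy package. According to this answer, it does not depend on third-party libraries and is lighter and faster, which makes it a good fit for beginners. The original answer linked to more examples.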

from simplified_scrapy.simplified_doc import SimplifiedDoc

html = '''
<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>
'''
doc = SimplifiedDoc(html)
span = doc.span      # Get the outermost span
first = span.span    # Get the first span in span
print(first.text)
second = span.b      # Get the first b in span
print(second.text)
third = second.next  # Get the element after the first b
print(third.text)

Result:

Text 1
Text 2
Text 3
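
For comparison, the same step-by-step navigation can be sketched with BeautifulSoup alone. This is a minimal sketch, not part of the original answer, and it assumes the markup has exactly the shape shown in the question (one outer span holding one inner span followed by two b tags). The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = "<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>"
soup = BeautifulSoup(html, "html.parser")

outer = soup.span                      # first <span> in the document (the outer one)
first = outer.span                     # first <span> nested inside the outer one
second = outer.b                       # first <b> inside the outer span
third = second.find_next_sibling("b")  # next <b> at the same nesting level

parts = [first.get_text(), second.get_text(), third.get_text()]
print(parts)
```

Attribute access such as `outer.span` is shorthand for `outer.find("span")`, so it returns None when the tag is missing; real pages may need a check for that.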

If you only want Text 1, use this code:

import bs4

content = "<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>"
soup = bs4.BeautifulSoup(content, 'html.parser')

# soup('span') is shorthand for soup.find_all('span') and will give you
# [<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>, <span>Text 1</span>]
spans = soup('span')

for e in spans:
    # keep only spans that contain no nested span (here, the innermost one)
    if not e('span'):
        print(e.text)

Output:

Text 1
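
If the goal is to keep all three texts but separated, BeautifulSoup's `get_text` accepts a separator, and `stripped_strings` yields each piece individually. The sketch below is not from either answer; it assumes the HTML from the question as a literal string (instead of `driver.page_source`) and uses " | " as an arbitrary separator.

```python
from bs4 import BeautifulSoup

html = """
<div class="listA">
<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

datas = []
for container in soup.find_all("div", class_="listA"):
    outer = container.find("span")
    # get_text joins every text node under the tag, placing the separator between them
    joined = outer.get_text(separator=" | ", strip=True)
    # stripped_strings yields each non-empty text node with whitespace stripped
    pieces = list(outer.stripped_strings)
    datas.append((joined, pieces))

print(datas)
```

Here `pieces[0]` is the first text on its own, while `joined` keeps everything in one delimited string; which one is more convenient depends on how the data is used afterwards.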
