Using BeautifulSoup4 to get strings with/without <p> tags inside a <div> section

3 Answers

User

Answer 1 · Edited 2024-09-27 23:27:25

The following approach seems to work:

import bs4

soup = bs4.BeautifulSoup("""
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
""", "lxml")

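main = soup.find('div', id='essay')
# .children yields the bare text nodes and the <p> tags in document order
for child in main.children:
    # .string is the text of a text node, or of a tag with a single text child
    print(child.string)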
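The loop above also prints the whitespace-only text nodes between the elements. As a minimal sketch of one way to tidy that up (the `paragraphs` list and the `wrapped` flag are illustrative names, not part of the original answer, and `html.parser` is used here only to avoid the `lxml` dependency), each child can be stripped and tagged with whether it came from a `<p>` element:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", id="essay")

# Collect (wrapped_in_p, text) pairs in document order.
paragraphs = []
for child in main.children:
    if isinstance(child, Tag):
        wrapped = child.name == "p"
        text = child.get_text(strip=True)
    else:
        # Bare text node directly inside the <div>
        wrapped = False
        text = str(child).strip()
    if text:  # skip whitespace-only nodes
        paragraphs.append((wrapped, text))

print(paragraphs)
```

With this input the list should hold the four paragraphs in order, with the flag set only for paragraphs 2 and 4.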

User

Answer 2 · Edited 2024-09-27 23:27:25

BeautifulSoup4 also has a recursive mode, which is enabled by default.

from bs4 import BeautifulSoup
html = """
<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# recursive=True is already the default; it is spelled out here for clarity
r = soup.find('div', id="essay", recursive=True).text
print(r)

This works fine for me. Try updating beautifulsoup4 with pip.
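The `.text` result keeps the original indentation and blank lines. If cleaner output is wanted, a sketch along these lines may help; it relies on `get_text(separator=..., strip=True)` and `stripped_strings`, and the variable names `essay`, `clean_text`, and `pieces` are only illustrative:

```python
from bs4 import BeautifulSoup

html = """
<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
essay = soup.find("div", id="essay")

# strip=True trims every text node and drops the ones that end up empty;
# the separator is placed between the remaining pieces.
clean_text = essay.get_text(separator="\n", strip=True)
print(clean_text)

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(essay.stripped_strings)
print(pieces)
```

Both forms walk all descendants of the `<div>`, so text inside and outside the `<p>` tags comes back in document order.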

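User

Answer 3 · Edited 2024-09-27 23:27:25

If the two lists are the same length, it may be easier to interleave them than to write Beautiful Soup code that works around the original format: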

from itertools import chain

list_a = ['this is paragraph1', 'this is paragraph3']
list_b = ['this is paragraph2', 'this is paragraph4']

print(list(chain.from_iterable(zip(list_a, list_b))))


# ['this is paragraph1', 'this is paragraph2', 'this is paragraph3', 'this is paragraph4']
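`zip` stops at the shorter list, so the snippet above silently drops items when the lengths differ. A possible variant for that case, assuming a trailing unwrapped paragraph (the fifth item and the `_MISSING` sentinel are made up for illustration), uses `itertools.zip_longest`:

```python
from itertools import chain, zip_longest

list_a = ['this is paragraph1', 'this is paragraph3', 'this is paragraph5']
list_b = ['this is paragraph2', 'this is paragraph4']

# A unique sentinel marks the padding so real values (even None) are kept.
_MISSING = object()

interleaved = [
    item
    for item in chain.from_iterable(zip_longest(list_a, list_b, fillvalue=_MISSING))
    if item is not _MISSING
]
print(interleaved)
```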

More information: Interleaving Lists in Python
