Which library can scrape text from HTML by heading and paragraph tags, and how?

1 answer

User

#1 · Posted 2024-05-08 17:12:46

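You can do this with BeautifulSoup by walking the tree one heading level at a time (<h1>, <h2>, ...) and collecting every <p> tag that follows each heading: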

html = '''
<h1>House rule</h1>
    <h2>Rule 1</h2>
        <p>A</p>
        <p>B</p>
    <h2>Rule 2</h2>
        <h3>Rule 2.1</h3>
            <p>C</p>
        <h3>Rule 2.2</h3>
            <p>D</p>'''

from bs4 import BeautifulSoup

# The "lxml" parser requires the lxml package to be installed.
soup = BeautifulSoup(html, "lxml")

counter = 1
all_leafs = []
while True:
    # Look up headings level by level: h1, h2, h3, ...
    htag = 'h%d' % counter
    hgroups = soup.find_all(htag)
    print(htag, len(hgroups))
    counter += 1
    # Stop at the first heading level that does not occur.
    if len(hgroups) == 0:
        break
    for hgroup in hgroups:
        # Pair the heading with every <p> that follows it in the document.
        for p in hgroup.find_all_next('p'):
            all_leafs.append((hgroup.get_text(), p.get_text()))
print(all_leafs)

Output:

h1 1
h2 2
h3 2
h4 0
[('House rule', 'A'), ('House rule', 'B'), ('House rule', 'C'), ('House rule', 'D'), ('Rule 1', 'A'), ('Rule 1', 'B'), ('Rule 1', 'C'), ('Rule 1', 'D'), ('Rule 2', 'C'), ('Rule 2', 'D'), ('Rule 2.1', 'C'), ('Rule 2.1', 'D'), ('Rule 2.2', 'D')]
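
Note that in the output above each heading is paired with every paragraph that comes after it, including paragraphs in later sections (for example ('Rule 1', 'C')). If you would rather attach each paragraph only to the headings it actually sits under, one option is to keep a stack of open headings. The sketch below is not the approach from the answer: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and the names HeadingPathParser, paragraphs_with_heading_path and sample_html are made up for illustration. It assumes headings and paragraphs are not nested inside each other.

```python
from html.parser import HTMLParser


class HeadingPathParser(HTMLParser):
    """Pair each <p> text with the headings that are in scope above it."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (level, title) of the headings currently in scope
        self.pairs = []    # (tuple of heading titles, paragraph text)
        self._tag = None   # tag whose text is being collected, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        is_heading = len(tag) == 2 and tag[0] == "h" and tag[1] in "123456"
        if is_heading or tag == "p":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "p":
            path = tuple(title for _, title in self.stack)
            self.pairs.append((path, text))
        else:
            level = int(tag[1])
            # A new heading closes every heading at the same or a deeper level.
            while self.stack and self.stack[-1][0] >= level:
                self.stack.pop()
            self.stack.append((level, text))
        self._tag = None


def paragraphs_with_heading_path(html_text):
    """Return a list of (heading path, paragraph text) pairs."""
    parser = HeadingPathParser()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


sample_html = '''
<h1>House rule</h1>
    <h2>Rule 1</h2>
        <p>A</p>
        <p>B</p>
    <h2>Rule 2</h2>
        <h3>Rule 2.1</h3>
            <p>C</p>
        <h3>Rule 2.2</h3>
            <p>D</p>'''

for path, text in paragraphs_with_heading_path(sample_html):
    print(" > ".join(path), "->", text)
```

With the sample document this prints:

House rule > Rule 1 -> A
House rule > Rule 1 -> B
House rule > Rule 2 > Rule 2.1 -> C
House rule > Rule 2 > Rule 2.2 -> D

The same stack idea should carry over to BeautifulSoup if you iterate over the headings and paragraphs in document order.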
