Extract all text after a specific tag from HTML?

Posted 2024-09-30 06:19:52

I want to extract the text of an HTML file that comes after the second occurrence of a specific tag.

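I have already tried regex and bs4, but I can't tell what is going wrong. The regex only ever returns the match itself, not the rest of the HTML file, and bs4 doesn't work for me because I don't know how to tell it to continue to the end of the file.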

Simplified:

<html>
    <veryspecific tag>
       abc
    </veryspecific tag>

    <stuff that comes before>
    </stuff that comes before>
    <...

       <veryspecific tag>
       abc
       </veryspecific tag>

       <other tags that come after>
       something
       </other tags that come after>
    </...>

    <other tags that come after2>
    something
    </other tags that come after2>
</html>

# I tried splitting the text so that I could take the last part, which should run
# from the latest occurrence to the end of the file, but it did not work:

htmltxt.split(r'abc.*$')


# I also tried to get the last tag and loop with "while" between the two tags to collect the text:

last_tag = html_parsed.find_all('a')[-1]

while specific_tag != last_tag:
   text = ...
   specific_tag = specific_tag.next

I have found the tag I need and can extract it, but I also need the rest of the file. Is there a simple way to do this?
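A note on the `split` attempt above: `str.split` treats its argument as a literal separator, not a regular expression, so `htmltxt.split(r'abc.*$')` finds nothing to split on and returns the whole string in a one-element list. If a plain-string approach is acceptable, the following is a minimal sketch, assuming the marker is a literal piece of text such as a closing tag. The function name `after_nth_occurrence` and the sample string are made up for illustration and are not from the question.

```python
def after_nth_occurrence(text, marker, n=2):
    """Return everything after the n-th literal occurrence of marker, or None."""
    # maxsplit=n yields at most n + 1 pieces; the last piece is the untouched tail.
    parts = text.split(marker, n)
    if len(parts) <= n:
        return None  # marker occurs fewer than n times
    return parts[n]


# Hypothetical sample data, much simpler than the markup in the question.
htmltxt = "<a>abc</a><x>before</x><a>abc</a><y>after</y>"
tail = after_nth_occurrence(htmltxt, "</a>")
print(tail)  # <y>after</y>
```

This keeps the tail as raw markup, so it would still need to be parsed if only the text is wanted.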

1 answer

#1 · Posted 2024-09-30 06:19:52

Here is a suggestion using BeautifulSoup:

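from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'html.parser')  # assumption: htmltxt holds the HTML string; the parser choice is illustrative
# Second <veryspecific> element: the first match, then the next one after it
mark = soup.find('veryspecific').find_next('veryspecific')
# Every tag that follows the mark in document order
all_other_tags = mark.find_all_next(name=True)

print(''.join(i.text for i in all_other_tags))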

It gives me this output:

       something

    something
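One thing to watch with `find_all_next(name=True)`: it returns nested tags as well as their parents, so when the tags after the mark contain other tags, joining `.text` repeats the inner text. A possible variation, sketched here under assumptions, walks the text nodes after the mark instead, so each piece of text appears once. The sample markup, its tag names, and the helper `text_after_nth` are hypothetical stand-ins, not the asker's real document.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample; tag names are simplified stand-ins for the question's markup.
HTML = """
<html>
  <veryspecific>abc</veryspecific>
  <before>ignored</before>
  <wrapper>
    <veryspecific>abc</veryspecific>
    <after><b>first</b> tail</after>
  </wrapper>
  <after2>second</after2>
</html>
"""


def text_after_nth(html, tag_name, n=2):
    """Return the text following the n-th <tag_name> element, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    marks = soup.find_all(tag_name)
    if len(marks) < n:
        return None
    mark = marks[n - 1]
    parts = []
    # next_elements walks the rest of the document in order, beginning with
    # the mark's own children, so nodes inside the mark are skipped.
    for node in mark.next_elements:
        if not isinstance(node, NavigableString):
            continue
        if any(parent is mark for parent in node.parents):
            continue
        parts.append(str(node))
    return "".join(parts)


result = text_after_nth(HTML, "veryspecific")
print(" ".join(result.split()))  # first tail second
```

The identity check (`parent is mark`) is used rather than `==` because BeautifulSoup compares tags by content, and both `veryspecific` elements in the sample have identical content.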
