Splitting HTML on a given character

Posted 2024-10-01 00:15:57


I am using Beautiful Soup to read the HTML of a web page.

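import urllib.request
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed value; the post does not show how headers is defined
req = urllib.request.Request('https://en.wikipedia.org/wiki/Barack_Obama', headers=headers)
html = urllib.request.urlopen(req)  # pass req; the original passed the undefined name reqx
page = BeautifulSoup(html, 'html.parser')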

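I want to split the HTML on periods, with one condition: it must not split when the period is inside a tag other than the p tag. For example, if the HTML is: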

<p>In June 2015, the Court ruled 6–3 in <i><a href="/wiki/King_v._Burwell" 
title="King v. Burwell">King v. Burwell</a></i> that subsidies to help individuals 
and families purchase health insurance were authorized for those doing so on both 
the federal exchange and state exchanges, not only those purchasing plans 
"established by the State", as the statute reads.</p>

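I don't mind splitting on periods that sit directly in the p tag, but I do not want to split on periods inside the a tag or any other tag. Converting the HTML to a string and then splitting obviously won't work. The main reason I don't want to use Beautiful Soup's get_text() method and split on that is that I want the split to happen on the original HTML. Does Beautiful Soup have a built-in split feature where I can check whether it is splitting in the right tag? Or is there another way? Thanks in advance :)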

So the output I need is the code split into 2 parts:

<p>In June 2015, the Court ruled 6–3 in <i><a href="/wiki/King_v._Burwell" 
title="King v. Burwell">King v


 . Burwell</a></i> that subsidies to help individuals and families purchase health insurance were authorized for those doing so on both the federal exchange and state exchanges, not only those purchasing plans "established by the State", as the statute reads.</p>
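
One possible way to get a tag-aware split is sketched below. It is an illustration only, not a built-in Beautiful Soup feature: it swaps in the standard library's html.parser module and keeps a stack of open tags, so a period is used as a split point only when the text it appears in sits directly inside a p tag. The names PeriodSplitter and split_html_on_periods are hypothetical, the list of void elements is abbreviated, and the period itself is dropped, as with str.split.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag (abbreviated list, an assumption).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PeriodSplitter(HTMLParser):
    """Hypothetical helper: split HTML on periods that sit directly in <p>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; verbatim.
        super().__init__(convert_charrefs=False)
        self.stack = []     # names of the currently open tags
        self.pieces = [""]  # HTML fragments built so far

    def _emit(self, raw):
        self.pieces[-1] += raw

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)
        self._emit(self.get_starttag_text())  # raw tag text, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())  # self-closing: nothing to push

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag.
            while self.stack.pop() != tag:
                pass
        self._emit(f"</{tag}>")

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_data(self, data):
        if self.stack and self.stack[-1] == "p":
            # Text directly inside <p>: every period starts a new fragment.
            first, *rest = data.split(".")
            self._emit(first)
            self.pieces.extend(rest)
        else:
            # Text inside <a>, <i>, ...: keep periods untouched.
            self._emit(data)


def split_html_on_periods(html):
    parser = PeriodSplitter()
    parser.feed(html)
    parser.close()
    return parser.pieces


sample = '<p>One <a href="a.b" title="x. y">x.y</a> two. Three.</p>'
for piece in split_html_on_periods(sample):
    print(repr(piece))
```

Note that the fragments are not well-formed HTML on their own (the opening p tag lands in the first piece and the closing one in the last), which matches the kind of raw split shown above. With Beautiful Soup itself, a similar effect could probably be had by looping over the children of each p element and splitting only the plain-text children, but that variant is not shown here.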
