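<p>This can be done with BeautifulSoup by using <code>extract()</code> to remove the unwanted <code><p></code> elements, and then using <code>new_tag()</code> to create a new <code><p></code> tag containing the text of all the removed elements. For example:</p>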
<pre><code>html = """<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler1</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
<p class="calibre5" id="calibre_pb_62">Note for Tyler2</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body></html>"""
from bs4 import BeautifulSoup
from itertools import groupby
import re

soup = BeautifulSoup(html, "html.parser")

# Group consecutive <p> tags by their first class name
for level, group in groupby(soup.find_all("p", class_=re.compile(r"calibre\d")), lambda x: x["class"][0]):
    if level == "calibre1":
        calibre1 = list(group)
        # Build one new <p> holding the text of the whole run
        p_new = soup.new_tag('p', attrs={"class" : "calibre1"})
        p_new.string = ' '.join(p.get_text(strip=True) for p in calibre1)
        calibre1[0].insert_before(p_new)
        # Remove the original tags now that their text has been merged
        for p in calibre1:
            p.extract()

print(soup.prettify())
</code></pre>
<p>This produces the following HTML:</p>
<pre class="lang-html prettyprint-override"><code><?xml version='1.0' encoding='Latin1'?>
<html lang="" xml:lang="" xmlns="http://www.w3.org/1999/xhtml">
 <body class="calibre">
  <p class="calibre5" id="calibre_pb_62">
   Note for Tyler1
  </p>
  <p class="calibre1">
   In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip. 642
  </p>
  <p class="calibre5" id="calibre_pb_62">
   Note for Tyler2
  </p>
  <p class="calibre1">
   In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip. 642
  </p>
 </body>
</html>
</code></pre>
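<p>It works by finding runs of consecutive <code>calibre1</code> tags. For each run, it first joins the text of every tag in the run and inserts a new tag holding that text before the first tag of the run. It then removes all of the old tags.</p>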
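<p>The run detection relies on <code>itertools.groupby</code> only grouping <em>adjacent</em> items that share a key, so the same class name can start a new group each time it reappears. A minimal sketch of that behaviour, using a made-up list of class names in place of the real tags:</p>
<pre><code>from itertools import groupby

# Hypothetical class names, in document order, standing in for the tags
classes = ["calibre5", "calibre1", "calibre1", "calibre1",
           "calibre5", "calibre1", "calibre1"]

# Each group is one run of identical, adjacent values
runs = [(key, len(list(group))) for key, group in groupby(classes)]
print(runs)
# [('calibre5', 1), ('calibre1', 3), ('calibre5', 1), ('calibre1', 2)]
</code></pre>
<p>Note that the two <code>calibre1</code> runs stay separate, which is why each note ends up with its own merged paragraph.</p>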
<p>For more complex scenarios in your EPUB files you will probably need to modify the logic, but this should help you get started.</p>
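<p>As one possible starting point for adapting it, the same logic could be wrapped in a function that takes the class name as a parameter. This is only a sketch: <code>merge_runs</code> is a hypothetical helper name, and it assumes that paragraphs are not nested and that a run is broken only by a <code><p></code> with a different first class (other elements between paragraphs do not break a run).</p>
<pre><code>from itertools import groupby
from bs4 import BeautifulSoup


def merge_runs(soup, class_name="calibre1"):
    """Merge each run of consecutive <p class=class_name> tags into one tag.

    Hypothetical helper, not part of BeautifulSoup.
    """
    def first_class(tag):
        # Tags without a class attribute get None as their key
        return (tag.get("class") or [None])[0]

    # find_all returns a list, so editing the tree while looping is safe
    for key, group in groupby(soup.find_all("p"), key=first_class):
        if key != class_name:
            continue
        run = list(group)
        # Store the class as a list, matching how parsed tags hold it
        merged = soup.new_tag("p", attrs={"class": [class_name]})
        merged.string = " ".join(p.get_text(strip=True) for p in run)
        run[0].insert_before(merged)
        for p in run:
            p.extract()
    return soup


# Small made-up document to try it on
sample = ('<body><p class="a">x</p><p class="b">one</p>'
          '<p class="b">two</p><p class="a">y</p><p class="b">three</p></body>')
demo = merge_runs(BeautifulSoup(sample, "html.parser"), "b")
merged_texts = [p.get_text() for p in demo.find_all("p")]
print(merged_texts)
# ['x', 'one two', 'y', 'three']
</code></pre>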