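<p>I figured it out. I replaced the regex with BeautifulSoup to simplify the parsing, and I sort the versions by the length of the text between their div tags, so that a shorter marked-up term that is a substring of a longer one does not cause problems during replacement.</p>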
<p>Using the same sample:</p>
<pre class="lang-none prettyprint-override"><code>Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely
Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely
Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely
<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely
</code></pre>
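<p>To make the ordering concrete, here is a minimal sketch that extracts the div text from each sample line and sorts it longest first. The names <code>versions</code> and <code>div_text</code> are hypothetical and only serve this illustration; it assumes Python 3 with the <code>bs4</code> package installed.</p>

```python
from bs4 import BeautifulSoup

# The four sample versions from above, one string per line.
versions = [
    'Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely',
    'Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely',
    'Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely',
    '<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely',
]


def div_text(version):
    """Return the text inside the first div of a version (hypothetical helper)."""
    return BeautifulSoup(version, "html.parser").find("div").text


# Longest div text first; sorted() is stable, so ties keep their input order.
ordered = sorted((div_text(v) for v in versions), key=len, reverse=True)
print(ordered)
```

<p>Because "managed" is a substring of "managed websites", the longer term has to be handled first; the sort puts it at the front.</p>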
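<p>The lines are held in a list, which is then sorted by the length of the text between each line's div tags, using BeautifulSoup to extract that text. The code is as follows:</p>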
<pre class="lang-py prettyprint-override"><code>from bs4 import BeautifulSoup


def __merge_strings(final_str, version):
    """Wrap version's text inside final_str with version's div markup."""
    soup = BeautifulSoup(final_str, "html.parser")
    for fixed_div in soup.find_all("div"):
        # Merge as soon as an existing div differs from this version's text.
        if fixed_div.text != version.text:
            # str() replaces Python 2's unicode(); it serializes the tag.
            return final_str.replace(version.text, str(version))
    return final_str


# found_terms starts out as the list of marked-up lines shown above.
found_terms = (
    (i, BeautifulSoup(i, "html.parser").find("div"))
    for i in found_terms
)  # pairs of (version, its div tag)
found_terms = sorted(
    found_terms, key=lambda x: len(x[-1].text), reverse=True
)  # longest div text first, to avoid issues with substrings
current_div = found_terms[0][0]  # version with the longest div text
for i in range(1, len(found_terms)):  # range() replaces Python 2's xrange()
    current_div = __merge_strings(current_div, found_terms[i][-1])
</code></pre>
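<p>For reference, the same idea can be packaged as one self-contained function and run on the sample. This is only a sketch under a few assumptions: <code>merge_versions</code> and <code>sample</code> are hypothetical names, it targets Python 3, and it skips a version only when some existing div already has exactly the same text, which is a slight simplification of the loop above.</p>

```python
from bs4 import BeautifulSoup


def merge_versions(versions):
    """Merge single-div versions of one sentence into one string (sketch)."""
    # Pair each version with its parsed div tag.
    pairs = [(v, BeautifulSoup(v, "html.parser").find("div")) for v in versions]
    # Longest div text first, so shorter substrings are merged afterwards.
    pairs.sort(key=lambda p: len(p[1].text), reverse=True)

    merged = pairs[0][0]  # start from the version with the longest div text
    for _, div in pairs[1:]:
        existing = BeautifulSoup(merged, "html.parser").find_all("div")
        # Only wrap the text if no existing div already covers exactly it.
        if all(d.text != div.text for d in existing):
            merged = merged.replace(div.text, str(div))
    return merged


sample = [
    'Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely',
    'Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely',
    'Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely',
    '<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely',
]
result = merge_versions(sample)
print(result)
```

<p>In the merged string, the shorter "managed" div ends up nested inside the longer "managed websites" div, and the visible text of the sentence is unchanged. Note that <code>str.replace</code> rewrites every occurrence of the term, so a term that appears more than once in the sentence would be wrapped each time.</p>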