Merging multiple strings with the same content but overlapping HTML markup in Python

Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely
Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely
Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely
<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely

import re

div_str = "<div style=.*</div>"  # the div tags
div_text_str = "(?<=(>)).*(?=(</div>))"  # the content inside the div tags

# compile the regexes
div_regex = re.compile(div_str)
div_text_regex = re.compile(div_text_str)


def merge_strings(str1, str2):
    # grab the div tag off the first version
    div = div_regex.search(str1).group()
    # grab the contents of that div tag
    div_text = div_text_regex.search(div).group()
    # find the div content in the second version, then substitute
    # with the div tag
    return re.sub(div_text, div, str2)

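2 answers

User

#1 · edited 2024-09-21 01:18:51

This is not a proper answer.

I will mention that parsing HTML with regex usually makes life unnecessarily difficult. It is better to use a parser such as BeautifulSoup, lxml or scrapy.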
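
To illustrate why the regex route gets awkward, here is a minimal sketch that runs the question's div pattern against a string that already carries two div tags. The `merged` string is a hypothetical intermediate result built from two of the sample tags; it does not appear in the question.

```python
import re

# Same pattern as in the question: the greedy ".*" runs to the LAST "</div>".
div_regex = re.compile("<div style=.*</div>")

# Hypothetical intermediate result that already contains two div tags.
merged = ('Created and <div style="font-size: 1">managed</div> websites '
          'for clients to <div style="font-size: 3">communicate</div> securely')

# The greedy match swallows both tags and the plain text between them.
greedy_match = div_regex.search(merged).group()
print(greedy_match)

# A non-greedy variant stops at the first closing tag, but it would still
# cut a nested div short at the inner "</div>".
lazy_match = re.search("<div style=.*?</div>", merged).group()
print(lazy_match)
```

Note also that the question's `re.sub(div_text, div, str2)` treats the div text as a pattern, so any regex metacharacters in the text would need `re.escape` first.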

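Recovering the text from each of the lines you gave as examples is easy. I assume each one is part of a larger construct, so I have wrapped each of them in a div.

Here I use BeautifulSoup to get the text from each of your lines.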

>>> import bs4
>>> for line in open('temp.htm').readlines():
...     line = line.strip()
...     print(line)
...     soup = bs4.BeautifulSoup(line, 'lxml')
...     soup.find('div').text
...     
<div>Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely</div>
'Created and managed websites for clients to communicate securely'
<div>Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely</div>
'Created and managed websites for clients to communicate securely'
<div>Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely</div>
'Created and managed websites for clients to communicate securely'
<div><div style="font-size: 4">Created</div> and managed websites for clients to communicate securely</div>
'Created and managed websites for clients to communicate securely'
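
For readers without BeautifulSoup installed, the same check can be sketched with the standard library's `html.parser` instead. This swaps in a different parser from the one used above; the `TextOnly` class and the `plain_text` helper are hypothetical names introduced only for this illustration. It confirms that all four sample versions reduce to one identical plain-text string.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(markup):
    """Return the markup with all tags stripped."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


versions = [
    'Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely',
    'Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely',
    'Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely',
    '<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely',
]

# A set collapses duplicates, so one element means every version agrees.
texts = {plain_text(v) for v in versions}
print(texts)
```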

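Unfortunately, I do not understand how the input lines are supposed to map to the output HTML in general.

User

#2 · edited 2024-09-21 01:18:51

I figured it out. I replaced the regex with BeautifulSoup to simplify parsing, and I sort the versions by the length of the text between their div tags to avoid any problems when looking for substrings.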
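
The substring problem that the sorting avoids can be seen in a small sketch. It uses plain `str.replace` on the untagged sentence rather than the BeautifulSoup-based code, purely to isolate the ordering effect; the variable names are made up for this example.

```python
base = "Created and managed websites for clients to communicate securely"
short_div = '<div style="font-size: 1">managed</div>'
long_div = '<div style="font-size: 2">managed websites</div>'

# Shortest first: wrapping "managed" splits the phrase "managed websites",
# so the second replace finds nothing and the second style is silently lost.
short_first = base.replace("managed", short_div).replace("managed websites", long_div)

# Longest first: "managed" is still intact inside the longer div,
# so the shorter tag ends up nested within it.
long_first = base.replace("managed websites", long_div).replace("managed", short_div)

print(short_first)
print(long_first)
```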

Using the same sample:

Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely
Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely
Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely
<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely

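The lines are held in a list, which is then sorted by the length of the text between each line's div tags, using BeautifulSoup. The code is as follows: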

from bs4 import BeautifulSoup


def __merge_strings(final_str, version):
    # version is the div Tag taken from another version of the string
    soup = BeautifulSoup(final_str, "html.parser")

    for fixed_div in soup.find_all("div"):
        if fixed_div.text != version.text:
            # wrap the matching plain text in the new div tag
            # (str replaces Python 2's unicode)
            return final_str.replace(
                version.text, str(version)
            )

    return final_str

# found_terms starts out as the list of version strings shown above
found_terms = (
    (i, BeautifulSoup(i, "html.parser").find("div"))
    for i in found_terms
)  # pairs of each version and its div tag
found_terms = sorted(
    found_terms, key=lambda x: len(x[-1].text), reverse=True
)  # sort on the length of the div text to avoid issues with substrings

current_div = found_terms[0][0]  # version with the largest div text
for i in range(1, len(found_terms)):  # range replaces Python 2's xrange
    current_div = __merge_strings(current_div, found_terms[i][-1])
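
As a rough end-to-end check of the idea, here is a standard-library sketch of the same longest-first strategy, run on the four sample lines. It is not the code above: it uses `html.parser` instead of BeautifulSoup, and `FirstDiv`, `first_div` and `merge_versions` are hypothetical names. It assumes each version holds exactly one non-nested div and that the div text occurs once in the sentence and never inside the markup itself.

```python
from html.parser import HTMLParser


class FirstDiv(HTMLParser):
    """Records the start tag and inner text of the first <div>, plus all text."""

    def __init__(self):
        super().__init__()
        self.start_tag = None  # e.g. '<div style="font-size: 1">'
        self.div_text = ""
        self.all_text = ""
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and self.start_tag is None:
            self.start_tag = self.get_starttag_text()
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._inside = False

    def handle_data(self, data):
        self.all_text += data
        if self._inside:
            self.div_text += data


def first_div(markup):
    parser = FirstDiv()
    parser.feed(markup)
    parser.close()
    return parser.start_tag, parser.div_text, parser.all_text


def merge_versions(versions):
    parsed = [first_div(v) for v in versions]
    merged = parsed[0][2]  # the plain text shared by every version
    # Longest div text first, so shorter spans nest inside longer ones.
    for start_tag, div_text, _ in sorted(
        parsed, key=lambda p: len(p[1]), reverse=True
    ):
        merged = merged.replace(div_text, start_tag + div_text + "</div>", 1)
    return merged


versions = [
    'Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely',
    'Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely',
    'Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely',
    '<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely',
]

result = merge_versions(versions)
print(result)
```

Because the replacement is a plain substring match, spans that partially overlap without one containing the other (or words that repeat in the sentence) would need a position-based approach instead.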
