Replacing strings with placeholders and restoring them after a function call

Published 2024-09-27 09:33:00


Given a string and a list of substrings that should be replaced with placeholders, e.g.

import re
from copy import copy 

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

The first goal is to replace the substrings from phrases in original_text with indexed placeholders, e.g.

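backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = text.replace(phrase, "MWEPHRASE{}".format(i))
print(text)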

[out]:

Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen

Then some functions will manipulate the text with the placeholders in place, e.g.

cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)

which outputs:

MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2

The last step is to do the replacement in reverse and put the original phrases back:

' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])

[out]:

"'s_morgen ik 's-Hertogenbosch depository_financial_institution"

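The problems are:

  1. If the list of substrings in phrases is huge, the first replacement and the final back-placement take a very long time.

Is there a way to do the replacement/back-placement with a regex?

  2. Using re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) is not much help, especially when a phrase matches only part of a word rather than a full word.

E.g.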

phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

We get an awkward output:

Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen

I've tried using '\b{}\b'.format(phrase), but that doesn't work for phrases that start with punctuation, i.e.

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen

Is there some way to denote a word boundary for the phrases in the re.sub regex pattern?
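One possible way to express such a boundary, sketched here under the assumption that a phrase should only match when it is not glued to a word character on either side, is to swap `\b` for the lookarounds `(?<!\w)` and `(?!\w)` and to escape each phrase with `re.escape`. Joining all phrases into one alternation also means the text is scanned once rather than once per phrase. The helper name `tokenize_phrases` is made up for this illustration.

```python
import re

def tokenize_phrases(text, phrases):
    """Replace each phrase with MWEPHRASE<i> in a single regex pass (illustrative helper)."""
    # Map each phrase to its placeholder and keep the reverse map for back-placement.
    phrase_to_token = {p: "MWEPHRASE{}".format(i) for i, p in enumerate(phrases)}
    token_to_phrase = {t: p.replace(" ", "_") for p, t in phrase_to_token.items()}
    # Longest phrases first, so a short phrase cannot pre-empt a longer one at the same position.
    alternation = "|".join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))
    # (?<!\w) and (?!\w) assert "no word character next to the match" without consuming
    # anything, so they also work when the phrase starts with punctuation such as "'".
    pattern = re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))
    return pattern.sub(lambda m: phrase_to_token[m.group(0)], text), token_to_phrase

phrases = ["org", "'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
text, backplacement = tokenize_phrases(original_text, phrases)
print(text)  # Something, MWEPHRASE1, ik MWEPHRASE2 im das MWEPHRASE3 gehen
```

Note that "org" (index 0) no longer fires inside "morgen", because the character before it is a word character.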


3 answers

Instead of using re.sub you can split it!

def do_something_with_str(string):
    # do something with string here.
    # for example let's wrap the string with "@" symbol if it's not empty
    return f"@{string}" if string else string


def get_replaced_list(string, words):
    result = [(string, True), ]

    # we take each word we want to replace
    for w in words:

        new_result = []

        # Getting each word in old result
        for r in result:

            # Now we split every string in results using our word.
            split_list = list((x, True) for x in r[0].split(w)) if r[1] else list([r, ])

            # If we replace successfully - add all the strings
            if len(split_list) > 1:

                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])

                # Add to new result
                new_result.extend(sub_result)

            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result


if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])

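Here are the values of the variables from the example above:

initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False),
                 ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False),
                 ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False),
                 ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False),
                 ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'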

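As you can see, the list contains tuples. Each tuple holds two values: a string and a boolean that says whether it is text or a replaced value (True when it is text). Once you have the replaced list, you can modify it as in the example, checking whether each item is a text value (if x[1] == True). Hope that helps!

P.S. String formatting like f"some string here {some_variable_here}" requires Python 3.6.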
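The same split-and-keep idea can probably be written more compactly with `re.split`, which keeps the separators in its result when the pattern is wrapped in a capturing group. The sketch below is only an illustration; `split_keep` and `wrap` are hypothetical names, and a shorter input string is used to keep the result easy to check by hand.

```python
import re

def split_keep(string, words):
    """Split `string` on any of `words`, returning (piece, is_text) tuples."""
    # A capturing group makes re.split return the separators as well,
    # so pieces alternate: text, separator, text, separator, ..., text.
    pattern = "(" + "|".join(re.escape(w) for w in words) + ")"
    pieces = re.split(pattern, string)
    # Even indexes are the text between separators, odd indexes are the separators.
    return [(piece, i % 2 == 0) for i, piece in enumerate(pieces)]

def wrap(string):
    # Same toy transformation as above: prefix non-empty text with "@".
    return "@" + string if string else string

parts = split_keep("acbbcbbca", ("a", "c"))
final_string = "".join(wrap(p) if is_text else p for p, is_text in parts)
print(final_string)  # ac@bbc@bbca
```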

Here's a strategy you could use:

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

# need this module for the reduce function
import functools as fn

#convert phrases into a dictionary of numbered placeholders (tokens)
tokens = { kw:"MWEPHRASE%s"%i for i,kw in enumerate(phrases) }

#replace embedded phrases with their respective token
tokenized = fn.reduce(lambda s,kw: tokens[kw].join(s.split(kw)), phrases, original_text)

#Apply text cleaning logic on the tokenized text
#(this assumes the placeholders are left untouched,
#although it's ok to move them around)
#cleanUpfunction stands for whatever cleaning function you use; it is not defined here
cleaned_text = cleanUpfunction(tokenized)

#reverse the token dictionary (to map numbered placeholders back to original phrases)
unTokens = {v:k for k,v in tokens.items() }

#rebuild phrases with original text associated to each token (placeholder)
#(iterate over the placeholders, since unTokens is keyed by placeholder, not by phrase)
final_text = fn.reduce(lambda s,tok: unTokens[tok].join(s.split(tok)), unTokens, cleaned_text)
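One pitfall worth checking with this kind of sequential back-replacement: once there are more than ten phrases, MWEPHRASE1 is a prefix of MWEPHRASE10, so the order in which placeholders are restored matters. The sketch below uses twelve made-up phrases to show the problem and one possible fix (restoring the longest placeholders first); `restore` is a hypothetical helper, not part of the answer above.

```python
import functools as fn

# Twelve made-up phrases, so that both MWEPHRASE1 and MWEPHRASE10 exist.
phrases = ["zero", "one", "two", "three", "four", "five",
           "six", "seven", "eight", "nine", "ten", "eleven"]
unTokens = {"MWEPHRASE{}".format(i): p for i, p in enumerate(phrases)}

cleaned_text = "MWEPHRASE1 and MWEPHRASE10"

def restore(text, ordered_tokens):
    # Split/join replacement, applied one placeholder at a time in the given order.
    return fn.reduce(lambda s, tok: unTokens[tok].join(s.split(tok)), ordered_tokens, text)

# Insertion order: MWEPHRASE1 is restored first and also eats the start of MWEPHRASE10.
naive = restore(cleaned_text, unTokens)

# Longest placeholders first: MWEPHRASE10 is restored before MWEPHRASE1 is tried.
safe = restore(cleaned_text, sorted(unTokens, key=len, reverse=True))

print(naive)  # one and one0
print(safe)   # one and ten
```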

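I think there are two keys to using regex for this task:

  1. Use custom boundaries, capture them, and substitute them back along with the phrase.

  2. Use a function to process the substitution match in both directions.

Below is an implementation that uses this approach. I've tweaked your text slightly to repeat one of the phrases.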

import re
from copy import copy 

original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen"
text = copy(original_text)

#
# The phrases of interest
#
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]

#
# Create the mapping dictionaries
#
phrase_to_mwe = {}
mwe_to_phrase = {}

#
# Build the mappings
#
for i, phrase in enumerate(phrases):

    mwephrase                = "MWEPHRASE{}".format(i)
    mwe_to_phrase[mwephrase] = phrase.replace(' ', '_')
    phrase_to_mwe[phrase]    = mwephrase

#
# Regex match handlers
#
def handle_forward(match):

    b1     = match.group(1)
    phrase = match.group(2)
    b2     = match.group(3)

    return b1 + phrase_to_mwe[phrase] + b2


def handle_backward(match):

    return mwe_to_phrase[match.group(1)]

#
# With each phrase passed through re.escape, the forward regex is equivalent to:
#
#    (^|[ ])('s morgen|'s-Hertogenbosch|depository financial institution)([, ]|$)
#
# which captures three components:
#
#    (1) Front boundary
#    (2) Phrase
#    (3) Back boundary
#
# Anchors allow matching at the beginning and end of the text. Additional boundary characters can be
# added as necessary, e.g. to allow semicolons after a phrase, we could update the back boundary to:
#
#    ([,; ]|$)
#
regex_forward  = re.compile(r'(^|[ ])(' + '|'.join(re.escape(p) for p in phrases) + r')([, ]|$)')
regex_backward = re.compile(r'(MWEPHRASE\d+)')

#
# Pretend we cleaned the text in the middle
#
cleaned = 'MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2 MWEPHRASE0'

#
# Do the translations
#
text1 = regex_forward .sub(handle_forward,  text)
text2 = regex_backward.sub(handle_backward, cleaned)

print('original: {}'.format(original_text))
print('text1   : {}'.format(text1))
print('text2   : {}'.format(text2))

Running this produces:

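original: Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen
text1   : Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen MWEPHRASE0
text2   : 's_morgen ik 's-Hertogenbosch depository_financial_institution 's_morgen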
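One caveat, as far as I can tell: because the forward pattern consumes its boundary characters, two phrases separated by a single space cannot both match, since the space that ends the first match is no longer available to start the second. A possible variant, sketched below, turns the boundaries into lookarounds so they are checked but not consumed. The boundary sets mirror the ones above, and the variable names are illustrative only.

```python
import re

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
phrase_to_mwe = {p: "MWEPHRASE{}".format(i) for i, p in enumerate(phrases)}
alternation = "|".join(re.escape(p) for p in phrases)

# Boundaries captured and consumed, as in the answer above.
consuming = re.compile(r"(^|[ ])(" + alternation + r")([, ]|$)")
# Same boundaries as zero-width assertions: checked, but left in the text.
lookaround = re.compile(r"(?:(?<=[ ])|^)(" + alternation + r")(?=[, ]|$)")

# Two phrases separated by a single space.
text = "'s morgen 's-Hertogenbosch"

with_consuming = consuming.sub(
    lambda m: m.group(1) + phrase_to_mwe[m.group(2)] + m.group(3), text)
with_lookaround = lookaround.sub(lambda m: phrase_to_mwe[m.group(1)], text)

print(with_consuming)   # MWEPHRASE0 's-Hertogenbosch
print(with_lookaround)  # MWEPHRASE0 MWEPHRASE1
```

With lookarounds the handler no longer needs to re-emit the boundaries, since they were never part of the match.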
