Python snippet for managing a regex replacement index map?

Posted 2024-09-28 21:25:07


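For a text-processing task I need to apply multiple regex substitutions (i.e. re.sub). There are several regex patterns, each with custom replacement parameters. The result needs to be the original text, the text with the substitutions applied, and a mapping of tuples that identifies the start and end indices of each replaced string in the source text and the corresponding indices in the result text.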

For example, below is sample code containing the input text and an array of 3 modifier tuples.

text = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
'''

modifiers = [
    (
        r'([\w]+\.?)\s+(\d{1,2})\w{2},\s+(\d{4})', 
        { 1:lambda x:month(x), 2:lambda x:num2text(x), 3:lambda x:num2text(x) }
    ),
    (
        r' (\d) ', 
        { 1:lambda x:num2text(x) }
    ),
    (
        r'(culpa)', 
        { 1: 'culpae' }
    )
] 

Example of the output index map:

[((7, 11), (7, 30)), ((12, 14), (31, 35)), ((20, 22), (41, 51)), ((23, 28), (52, 57)),...]
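To make the shape of this map concrete, here is a minimal sketch for a single pattern. It assumes the same modifier format as above (a dict from capture-group number to a callable or a fixed string) and that the replaced groups do not nest. `sub_groups_with_spans` is a hypothetical helper name, not something provided by `re`.

```python
import re


def sub_groups_with_spans(pattern, group_repls, text):
    """Replace selected capture groups; return (new_text, span_map).

    span_map holds ((src_start, src_end), (out_start, out_end)) pairs.
    Hypothetical helper: assumes replaced groups do not overlap.
    """
    out = []        # pieces of the output text
    out_len = 0     # length of the output built so far
    cursor = 0      # source position already copied to the output
    span_map = []
    for m in re.finditer(pattern, text):
        for g in sorted(group_repls):
            start, end = m.span(g)
            if start == -1 or start < cursor:
                continue  # group did not match, or overlaps earlier work
            repl = group_repls[g]
            new = repl(m.group(g)) if callable(repl) else repl
            # Copy the untouched source text that precedes this group.
            out.append(text[cursor:start])
            out_len += start - cursor
            out.append(new)
            span_map.append(((start, end), (out_len, out_len + len(new))))
            out_len += len(new)
            cursor = end
    out.append(text[cursor:])
    return ''.join(out), span_map


new_text, span_map = sub_groups_with_spans(
    r'(\w+\.?)\s+(\d{1,2})\w{2},\s+(\d{4})',
    {1: lambda s: 'April', 3: 'two thousand nine'},
    'On Apr. 6th, 2009 we met',
)
print(new_text)
print(span_map)
```

Because the output is built left to right in one pass, each output offset already accounts for every replacement before it.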

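I have already written a complex function that tries to handle all the corner cases of index shifting that occur during replacement, but it has already taken too much time.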

Perhaps a solution for this task already exists?

Here is a demo of the current state. The word-conversion (expansion/normalization) functions are intentionally simplified to fixed-value dict mappings.

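The ultimate goal is to build a text dataset generator. The dataset needs two text parts: one with numbers, abbreviations, and other expandable strings, and another fully expanded into its complete textual representation (e.g. 3 -> three, apr. -> april, etc.). It also needs an offset map that links parts of the unexpanded text to the corresponding parts of the expanded text.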
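As a rough illustration of what one record in such a dataset might look like, here is a minimal sketch. The `Sample` class and its field names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int]


@dataclass
class Sample:
    """One dataset record (hypothetical layout, for illustration only)."""
    raw: str                              # text with digits/abbreviations
    expanded: str                         # fully spelled-out text
    alignment: List[Tuple[Span, Span]]    # (raw span, expanded span) pairs

    def pairs(self):
        """Return the (raw piece, expanded piece) strings for each link."""
        return [
            (self.raw[a:b], self.expanded[c:d])
            for (a, b), (c, d) in self.alignment
        ]


sample = Sample(
    raw='ex 3 ea',
    expanded='ex three ea',
    alignment=[((3, 4), (3, 8))],
)
print(sample.pairs())
```

Slicing both texts with the stored spans is a cheap sanity check that the offset map is still consistent after all modifiers have run.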

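One common case that my implementation already handles is when there are at least two modifiers, A and B, and they have to process text like "text text a text b text a text b". After the first modifier makes its pass, the output span of the second "a" replacement becomes incorrect, because modifier B comes in and alters the output text before that second "a".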
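One way to sidestep this interleaving problem, sketched here under the assumption that all modifiers can be matched against the original text independently, is to collect every match in source coordinates first and then build the output in a single left-to-right pass. `apply_all` is a hypothetical name; overlapping matches are resolved by simply keeping the one that starts first.

```python
import re


def apply_all(text, modifiers):
    """Apply (pattern, replacement) pairs; return (new_text, span_map).

    replacement is a str or a callable taking the match object.
    Sketch only: every pattern is matched against the ORIGINAL text.
    """
    # 1. Collect every candidate edit in source coordinates.
    edits = []
    for priority, (pattern, repl) in enumerate(modifiers):
        for m in re.finditer(pattern, text):
            new = repl(m) if callable(repl) else repl
            edits.append((m.start(), priority, m.end(), new))
    # 2. Order by source position; earlier modifiers win ties.
    edits.sort()
    # 3. Build the output once, so offsets reflect all earlier edits.
    out, span_map = [], []
    cursor = 0    # source position already copied
    out_len = 0   # length of the output so far
    for start, _priority, end, new in edits:
        if start < cursor:
            continue  # overlaps an edit that was already applied
        out.append(text[cursor:start])
        out_len += start - cursor
        out.append(new)
        span_map.append(((start, end), (out_len, out_len + len(new))))
        out_len += len(new)
        cursor = end
    out.append(text[cursor:])
    return ''.join(out), span_map


result, mapping = apply_all(
    'x a y B z a w B',
    [(r'a', 'AAAA'), (r'B', 'bb')],
)
print(result)
print(mapping)
```

The trade-off is that a later modifier never sees the output of an earlier one, so this sketch does not cover chained replacements.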

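It also partially handles the case where a subsequent modifier replaces the output of the first modifier's replacement, and works out the original source span position.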
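For the chained case, a span found in an intermediate text has to be translated back to source coordinates. The sketch below shows one possible rule, which is an assumption rather than the behavior of any existing library: a position inside replaced text snaps to the edge of the whole source span, and a position in untouched text is shifted by the accumulated length difference. `to_source_span` is a hypothetical name.

```python
def to_source_span(span_map, start, end):
    """Map the half-open range [start, end) of the output text to the source.

    span_map: list of ((src_start, src_end), (out_start, out_end)),
    sorted by position and non-overlapping.
    """
    def map_pos(pos, is_end):
        shift = 0  # output offset minus source offset after earlier edits
        for (s0, s1), (o0, o1) in span_map:
            if pos < o0 or (is_end and pos == o0):
                break  # pos lies before this replacement
            if pos < o1 or (is_end and pos == o1):
                # Inside replaced text: snap to the edge of the source span.
                return s1 if is_end else s0
            shift = o1 - s1
        return pos - shift

    return map_pos(start, False), map_pos(end, True)


# Map for source 'x a y B z' -> output 'x AAAA y bb z'.
span_map = [((2, 3), (2, 6)), ((6, 7), (9, 11))]
print(to_source_span(span_map, 3, 5))    # inside 'AAAA'
print(to_source_span(span_map, 7, 8))    # the untouched 'y'
print(to_source_span(span_map, 4, 10))   # straddles both replacements
```

Snapping to whole source spans loses sub-span precision, which seems acceptable when a replacement such as "Apr." to "April" has no meaningful character-level alignment.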

Update

I am writing a Python package called re-map. spaCy, as mentioned here, is another option to consider.


2 Answers

The code sample below uses re, datetime, and the third-party package inflect to process the text modifiers.

The code returns the modified text along with the positions of the modified words.

PS: You need to explain more about what you are trying to do. Otherwise, you can take this code and modify it to fit your needs.

Install inflect: pip install inflect

Sample code:

import re
from datetime import datetime
import inflect

ENGINE = inflect.engine()


def num2words(num):
    """Number to Words using inflect package"""
    return ENGINE.number_to_words(num)


def pretty_format_date(pattern, date_found, text):
    """Pretty format dates"""
    _month, _day, _year = date_found.groups()
    month = datetime.strptime('{day}/{month}/{year}'.format(
        day=_day, month=_month.strip('.'), year=_year
    ), '%d/%b/%Y').strftime('%B')
    day, year = num2words(_day), num2words(_year)
    date = '{month} {day}, {year} '.format(month=month, day=day, year=year)
    begin, end = date_found.span()
    _text = re.sub(pattern, date, text[begin:end])
    text = text[:begin] + _text + text[end:]
    return text, begin, end


def format_date(pattern, text):
    """Format given string into date"""
    spans = []
    # Bounding the loop by the number of matches prevents an infinite loop
    # if the text is malformed or the regex is bad
    for _ in re.findall(pattern, text):
        date_found = re.search(pattern, text)
        if not date_found:
            break
        try:
            text, begin, end = pretty_format_date(pattern, date_found, text)
            spans.append([begin, end])
        except Exception:
            # Leave the text unmodified if the date cannot be parsed
            pass

    return text, spans


def number_to_words(pattern, text):
    """Number to Words with spans"""
    spans = []
    # Bounding the loop by the number of matches prevents an infinite loop
    # if the text is malformed or the regex is bad
    for _ in re.findall(pattern, text):
        number_found = re.search(pattern, text)
        if not number_found:
            break
        # Take the first capture group as a string rather than the groups() tuple
        _number = number_found.group(1)
        number = num2words(_number)
        begin, end = number_found.span()
        spans.append([begin, end])
        _text = re.sub(pattern, number, text[begin:end])
        text = text[:begin] + ' {} '.format(_text) + text[end:]
    return text, spans



def custom_func(pattern, text, output):
    """Custom function"""
    spans = []
    for _ in re.findall(pattern, text):
        _found = re.search(pattern, text)
        begin, end = _found.span()
        spans.append([begin, end])
        _text = re.sub(pattern, output, text[begin:end])
        text = text[:begin] + ' {} '.format(_text) + text[end:]
    return text, spans


text = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
'''

modifiers = [
    (
        r'([\w]+\.?)\s+(\d{1,2})\w{2},\s+(\d{4})',
        format_date
    ),
    (
        r' (\d) ',
        number_to_words
    ),
    (
        r'( \bculpa\b)',  # Word boundaries (\b) match the exact word only
        'culpae'
    )
]

for regex, func in modifiers:
    if not isinstance(func, str):
        print('\n{} {} {}'.format('#' * 20, func.__name__, '#' * 20))
        _text, spans = func(regex, text)
    else:
        print('\n{} {} {}'.format('#' * 20, func, '#' * 20))
        _text, spans = custom_func(regex, text, func)
    print(_text, spans)

Output:

#################### format_date ####################

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On April six, two thousand and nine  Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
 [[128, 142]]

#################### number_to_words ####################

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex five ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt six mollit anim id est laborum.
 [[231, 234], [463, 466]]

#################### culpae ####################

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpae  minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpae  qui officia deserunt 6 mollit anim id est laborum.
 [[150, 156], [435, 441]]

Demo on Replit

I wrote a Python library, re-map, to solve the problem described.

Here is a demo.
