
Published 2024-09-28 21:25:07


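For a text-processing task I need to apply multiple regex substitutions (i.e. re.sub). There are several regex patterns, each with custom replacement parameters. The result needs to be the original text, the text with replacements, and a map of tuples identifying the start and end indices of each replaced string in the source text and the corresponding indices in the result text.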

For example, below is sample code with an input text and an array of 3 modifier tuples.

text = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
'''

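modifiers = [
    # (regex, {group number: replacement}); two of the regexes are missing from the source
    (..., {1: lambda x: month(x), 2: lambda x: num2text(x), 3: lambda x: num2text(x)}),
    (r' (\d) ', {1: lambda x: num2text(x)}),
    (..., {1: 'culpae'}),
]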


[((7, 11), (7, 30)), ((12, 14), (31, 35)), ((20, 22), (41, 51)), ((23, 28), (52, 57)),...]
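The mapping format above pairs a span in the source text with a span in the result text. As a rough sketch of how such pairs could be produced for a single pattern, the hypothetical helper `sub_groups_with_spans` below (not part of the question's code) replaces selected capture groups and records both spans. It assumes the groups are not nested and appear in the text in numerical order, and that a callable replacement receives the matched group text.

```python
import re

def sub_groups_with_spans(pattern, group_repls, text):
    """Replace selected capture groups and record (source_span, result_span) pairs.

    group_repls maps a group number to a replacement string, or to a
    callable that takes the matched group text and returns a string.
    Assumption: groups are not nested and occur in numerical order.
    """
    out = []       # pieces of the result text
    mapping = []   # [((src_start, src_end), (dst_start, dst_end)), ...]
    src_pos = 0    # how much of the source has been copied so far
    dst_pos = 0    # current length of the result text

    for match in re.finditer(pattern, text):
        for group in sorted(group_repls):
            start, end = match.span(group)
            if start == -1:  # group did not participate in this match
                continue
            repl = group_repls[group]
            new = repl(match.group(group)) if callable(repl) else repl
            # Copy the untouched text before this group, then the replacement.
            out.append(text[src_pos:start])
            dst_pos += start - src_pos
            out.append(new)
            mapping.append(((start, end), (dst_pos, dst_pos + len(new))))
            dst_pos += len(new)
            src_pos = end

    out.append(text[src_pos:])
    return "".join(out), mapping

# Example with made-up input: replace group 1 via a callable, group 2 via a string.
result, mapping = sub_groups_with_spans(
    r"(\d) (apples)", {1: lambda d: "five", 2: "pears"}, "eat 5 apples now"
)
```

Because the result is assembled left to right from the untouched source, every recorded result span stays valid once the function returns.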



Here is a demo of the current state. The word-expansion (normalization) functions are deliberately simplified to fixed-value dict mappings.


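A common case my implementation has had to deal with involves at least two modifiers, A and B, applied to text like "text text A text B text A text B". The first modifier substitutes all of its matches at once, so the output span recorded for the second "A" replacement becomes incorrect: modifier B then comes in and changes the output text before that second "A".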
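One possible way around this shifting problem, sketched here under assumptions rather than taken from the question, is to find every match against the original text first and then build the result in a single left-to-right pass. The function name `apply_modifiers` and the `(pattern, replacement)` pair format are hypothetical; replacement strings are used literally (no backreference expansion), a callable replacement receives the match object, and a match that overlaps an earlier replacement is skipped.

```python
import re

def apply_modifiers(modifiers, text):
    """Apply (pattern, replacement) pairs in one pass over the original text.

    Returns the result text and a list of
    ((src_start, src_end), (dst_start, dst_end)) pairs.
    """
    # 1. Find every match against the ORIGINAL text, so no pattern ever
    #    sees (or shifts) another pattern's output.
    found = []
    for order, (pattern, repl) in enumerate(modifiers):
        for match in re.finditer(pattern, text):
            new = repl(match) if callable(repl) else repl
            found.append((match.start(), order, match.end(), new))

    # 2. Walk the matches left to right, building the result and the mapping.
    found.sort()
    out, mapping = [], []
    src_pos = dst_pos = 0
    for start, _order, end, new in found:
        if start < src_pos:  # overlaps an earlier replacement: skip it
            continue
        out.append(text[src_pos:start])
        dst_pos += start - src_pos
        out.append(new)
        mapping.append(((start, end), (dst_pos, dst_pos + len(new))))
        dst_pos += len(new)
        src_pos = end
    out.append(text[src_pos:])
    return "".join(out), mapping

# Example with made-up input mirroring the "A ... B ... A ... B" case.
result, mapping = apply_modifiers(
    [(r"A", "alpha"), (r"B", "beta")], "x A y B z A w B"
)
```

The trade-off is that this differs from applying the modifiers one after another: a later modifier cannot match text produced by an earlier one, which may or may not be the behaviour wanted.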



I am writing a Python package called re-map. spaCy has also been suggested as an option to consider.





Install inflect: pip install inflect


import re
from datetime import datetime
import inflect

ENGINE = inflect.engine()

def num2words(num):
    """Number to Words using inflect package"""
    return ENGINE.number_to_words(num)

def pretty_format_date(pattern, date_found, text):
    """Pretty format dates"""
    _month, _day, _year = date_found.groups()
    month = datetime.strptime('{day}/{month}/{year}'.format(
        day=_day, month=_month.strip('.'), year=_year
    ), '%d/%b/%Y').strftime('%B')
    day, year = num2words(_day), num2words(_year)
    date = '{month} {day}, {year} '.format(month=month, day=day, year=year)
    begin, end = date_found.span()
    _text = re.sub(pattern, date, text[begin:end])
    text = text[:begin] + _text + text[end:]
    return text, begin, end

def format_date(pattern, text):
    """Format given string into date"""
    spans = []
    # A bounded for loop (rather than while) guards against looping forever
    # on malformed text or a bad regex
    for _ in re.findall(pattern, text):
        date_found = re.search(pattern, text)
        if not date_found:
            break
        try:
            text, begin, end = pretty_format_date(pattern, date_found, text)
            spans.append([begin, end])
        except Exception:
            # Leave the text unchanged if the date cannot be parsed
            pass

    return text, spans

def number_to_words(pattern, text):
    """Number to words with spans"""
    spans = []
    # A bounded for loop (rather than while) guards against looping forever
    # on malformed text or a bad regex
    for _ in re.findall(pattern, text):
        number_found = re.search(pattern, text)
        if not number_found:
            break
        # group(1) is the digit itself; groups() would return a tuple
        number = num2words(number_found.group(1))
        begin, end = number_found.span()
        spans.append([begin, end])
        _text = re.sub(pattern, number, text[begin:end])
        text = text[:begin] + ' {} '.format(_text) + text[end:]
    return text, spans

def custom_func(pattern, text, output):
    """Replace each match of pattern with the fixed string output, recording spans"""
    spans = []
    for _ in re.findall(pattern, text):
        _found = re.search(pattern, text)
        begin, end = _found.span()
        spans.append([begin, end])
        _text = re.sub(pattern, output, text[begin:end])
        text = text[:begin] + ' {} '.format(_text) + text[end:]
    return text, spans

text = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
'''

modifiers = [
    (..., format_date),  # date regex missing from the source
    (r' (\d) ', number_to_words),
    (r'( \bculpa\b)', 'culpae'),  # Better using this pattern to catch the exact word
]

for regex, func in modifiers:
    if not isinstance(func, str):
        # A callable modifier does its own matching and replacement
        print('\n{} {} {}'.format('#' * 20, func.__name__, '#' * 20))
        _text, spans = func(regex, text)
    else:
        # A plain string is substituted for every match of the pattern
        print('\n{} {} {}'.format('#' * 20, func, '#' * 20))
        _text, spans = custom_func(regex, text, func)
    print(_text, spans)


#################### format_date ####################

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On April six, two thousand and nine  Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
 [[128, 142]]

#################### number_to_words ####################

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex five ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt six mollit anim id est laborum.
 [[231, 234], [463, 466]]

#################### culpae ####################

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpae  minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpae  qui officia deserunt 6 mollit anim id est laborum.
 [[150, 156], [435, 441]]



Here is a demo.
