Regex for managing escaped characters in items such as string literals

Posted 2024-09-28 20:48:30


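I would like to be able to match a string literal with the option of escaped quotes. For instance, I'd like to be able to search "this is a 'test with escaped\' values' ok" and have it properly recognize the backslash as an escape character. I've tried solutions like the following: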

import re
regexc = re.compile(r"\'(.*?)(?<!\\)\'")
match = regexc.search(r""" Example: 'Foo \' Bar'  End. """)
print(match.groups())
# I want ("Foo \' Bar") to be printed above

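Looking at this, there is a simple problem: the escape character being used, "\", can't itself be escaped. I can't figure out how to do that. I wanted a solution like the following, but negative lookbehind assertions need to be fixed length: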

# ...
re.compile(r"\'(.*?)(?<!\\(\\\\)*)\'")
# ...
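Both points can be checked quickly. The following is a minimal sketch (the sample string and variable names are made up for illustration): the first pattern runs past the real closing quote when the literal ends with an escaped backslash, and Python's re module rejects the variable-length lookbehind at compile time.

```python
import re

# Illustrative input: the first literal ends with an escaped backslash.
sample = r"'Foo \\' End 'x'"

naive = re.compile(r"\'(.*?)(?<!\\)\'")
naive_capture = naive.search(sample).group(1)
print(naive_capture)  # the capture runs past the real closing quote

# The variable-length lookbehind is rejected by the re module.
try:
    re.compile(r"\'(.*?)(?<!\\(\\\\)*)\'")
    lookbehind_compiles = True
except re.error as exc:
    lookbehind_compiles = False
    print("re.error:", exc)
```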

Are any regex gurus able to tackle this problem? Thanks.


3 Answers

I think this will work:

import re
regexc = re.compile(r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'")

def check(test, base, target):
    match = regexc.search(base)
    assert match is not None, test+": regex didn't match for "+base
    assert match.group(1) == target, test+": "+target+" not found in "+base
    print("test %s passed" % test)

check("Empty","''","")
check("single escape1", r""" Example: 'Foo \' Bar'  End. """,r"Foo \' Bar")
check("single escape2", r"""'\''""",r"\'")
check("double escape",r""" Example2: 'Foo \\' End. """,r"Foo \\")
check("First quote escaped",r"not matched\''a'","a")
check("First quote escaped beginning",r"\''a'","a")

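The regex r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'" only matches forward over the things we want inside a string:

  1. Characters that aren't a backslash or a quote.
  2. An escaped quote.
  3. An escaped backslash.

EDIT:

Added extra regex at the front to check for an escaped first quote.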
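The effect of that leading guard can be seen on the "First quote escaped" test case. This is a small sketch; the unguarded pattern and the variable names are introduced here only for comparison and are not part of the answer.

```python
import re

# Pattern from the answer, with the guard in front of the opening quote.
guarded = re.compile(r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'")
# Same pattern without the guard (hypothetical, for comparison only).
unguarded = re.compile(r"'(([^\\']|\\'|\\\\)*)'")

text = r"not matched\''a'"

guarded_capture = guarded.search(text).group(1)
unguarded_capture = unguarded.search(text).group(1)

print(repr(guarded_capture))    # skips the escaped quote before opening
print(repr(unguarded_capture))  # opens at the escaped quote instead
```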

re_single_quote = r"'[^'\\]*(?:\\.[^'\\]*)*'"

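First, note that MizardX's answer is 100% accurate. I'd just like to add some recommendations regarding efficiency. Second, I'd like to point out that this problem was solved and optimized long ago. See: Mastering Regular Expressions (3rd Edition), which covers this specific problem in great detail (highly recommended).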

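First, let's look at the sub-expression that matches a single-quoted string which may contain escaped single quotes. If you are going to allow escaped single quotes, you had better at least allow escaped escapes as well (which is what Douglas Leeder's answer does). But as long as you're at it, it's just as easy to allow escaped anything-else. Given these requirements, MizardX is the only one who got the expression right. Here it is in both short and long format (I've taken the liberty of writing it in VERBOSE mode with lots of descriptive comments, which you should always do for non-trivial regexes):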

# MizardX's correct regex to match single quoted string:
re_sq_short = r"'((?:\\.|[^\\'])*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # Capture group $1: Contents.
      (?:       # Group for contents alternatives
        \\.     # Either escaped anything
      | [^\\']  # or one non-quote, non-escape.
      )*        # Zero or more contents alternatives.
    )           # End $1: Contents.
    '
    """

This works and correctly matches all of the following test strings:

test01 = r"out1 'escaped-escape:        \\ ' out2"
test02 = r"out1 'escaped-quote:         \' ' out2"
test03 = r"out1 'escaped-anything:      \X ' out2"
test04 = r"out1 'two escaped escapes: \\\\ ' out2"
test05 = r"out1 'escaped-quote at end:   \'' out2"
test06 = r"out1 'escaped-escape at end:  \\' out2"
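As a quick check, the sketch below runs both forms over these six strings. Note that the long form only behaves as intended when compiled with re.VERBOSE. The list and variable names are illustrative, and the expected capture is assumed to be everything between "out1 '" and "' out2".

```python
import re

re_sq_short = r"'((?:\\.|[^\\'])*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # Capture group 1: contents
      (?:
        \\.     # Either escaped anything
      | [^\\']  # or one non-quote, non-escape
      )*
    )
    '           # Literal closing quote
    """

tests = [
    r"out1 'escaped-escape:        \\ ' out2",
    r"out1 'escaped-quote:         \' ' out2",
    r"out1 'escaped-anything:      \X ' out2",
    r"out1 'two escaped escapes: \\\\ ' out2",
    r"out1 'escaped-quote at end:   \'' out2",
    r"out1 'escaped-escape at end:  \\' out2",
]

short_re = re.compile(re_sq_short)
long_re = re.compile(re_sq_long, re.VERBOSE)  # VERBOSE is required here

short_caps = [short_re.search(t).group(1) for t in tests]
long_caps = [long_re.search(t).group(1) for t in tests]
# Assumed expectation: the text between "out1 '" and "' out2".
expected = [t[6:-6] for t in tests]

for cap in short_caps:
    print(repr(cap))
```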

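OK, now let's begin to improve on this. First, the order of the alternatives makes a difference, and one should always put the most likely alternative first. In this case, non-escaped characters are more likely than escaped ones, so reversing the order improves the regex's efficiency slightly, like so: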

# Better regex to match single quoted string:
re_sq_short = r"'((?:[^\\']|\\.)*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # $1: Contents.
      (?:       # Group for contents alternatives
        [^\\']  # Either a non-quote, non-escape,
      | \\.     # or an escaped anything.
      )*        # Zero or more contents alternatives.
    )           # End $1: Contents.
    '
    """

"Unrolling the loop":

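This is a little better, but it can be improved further (significantly) by applying Jeffrey Friedl's "unrolling the loop" efficiency technique (from Mastering Regular Expressions, 3rd Edition). The regex above is not optimal because it must painstakingly apply the star quantifier to a non-capturing group of two alternatives, each of which consumes only one or two characters at a time. This alternation can be eliminated entirely by recognizing that a similar pattern is repeated over and over, and that an equivalent expression can be crafted to do the same thing without alternation. Here is an optimized expression that matches a single-quoted string and captures its contents into group $1: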

# Better regex to match single quoted string:
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
re_sq_long = r"""
    '            # Literal opening quote
    (            # $1: Contents.
      [^'\\]*    # {normal*} Zero or more non-', non-escapes.
      (?:        # Group for {(special normal*)*} construct.
        \\.      # {special} Escaped anything.
        [^'\\]*  # More {normal*}.
      )*         # Finish up {(special normal*)*} construct.
    )            # End $1: Contents.
    '
    """

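This expression gobbles up all non-quote, non-backslash characters (the vast majority of most strings) in one "gulp", which drastically reduces the amount of work the regex engine must perform. How much better, you ask? Well, I entered each of the regexes presented in this question into RegexBuddy and measured how many steps the regex engine took to complete a match on the following string (which all solutions match correctly):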

'This is an example string which contains one \'internally quoted\' string.'

Here are the benchmark results for the test string above:

r"""
AUTHOR            SINGLE-QUOTE REGEX   STEPS TO: MATCH  NON-MATCH
Evan Fosmark      '(.*?)(?<!\\)'                  374     376
Douglas Leeder    '(([^\\']|\\'|\\\\)*)'          154     444
cletus/PEZ        '((?:\\'|[^'])*)(?<!\\)'        223     527
MizardX           '((?:\\.|[^\\'])*)'             221     369
MizardX(improved) '((?:[^\\']|\\.)*)'             153     369
Jeffrey Friedl    '([^\\']*(?:\\.[^\\']*)*)'       13      19
"""

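The step counts are the number of steps the RegexBuddy debugger needs to match the test string. The "NON-MATCH" column is the number of steps needed to declare match failure when the closing quote is removed from the test string. As you can see, the difference is significant for both the matching and the non-matching case. Note also that these efficiency improvements only apply to NFA engines that use backtracking (i.e. Perl, PHP, Java, Python, Javascript, .NET, Ruby and most others). A DFA engine will not see any performance boost from this technique (see: Regular Expression Matching Can Be Simple And Fast).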
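For a rough cross-check in Python itself rather than RegexBuddy, the sketch below times the alternation and unrolled forms with timeit on the same test string. Wall-clock time is only a loose proxy for step counts, the numbers vary by machine and Python version, and the dictionary names and repetition count are arbitrary choices made for illustration.

```python
import re
import timeit

text = r"'This is an example string which contains one \'internally quoted\' string.'"
no_close = text[:-1]  # same string with the closing quote removed

patterns = {
    "alternation": re.compile(r"'((?:[^\\']|\\.)*)'"),
    "unrolled": re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'"),
}

# Both forms should capture the same contents and both should fail on no_close.
captures = {name: p.search(text).group(1) for name, p in patterns.items()}
misses = {name: p.search(no_close) for name, p in patterns.items()}

for name, p in patterns.items():
    t_match = timeit.timeit(lambda: p.search(text), number=20000)
    t_miss = timeit.timeit(lambda: p.search(no_close), number=20000)
    print(f"{name:12s} match {t_match:.4f}s  non-match {t_miss:.4f}s")
```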

On to the complete solution:

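The goal of the original question (as I interpret it) is to pick out single-quoted sub-strings (which may contain escaped quotes) from a larger string. If it is known that the text outside the quoted sub-strings will never contain escaped single quotes, the regex above will do the job. However, correctly matching single-quoted sub-strings within a sea of text swimming with escaped quotes, escaped escapes and escaped anything-elses (which is my interpretation of what the author is after) would seem to require parsing from the beginning of the string. That is what I originally thought, but it does not: it can be achieved using MizardX's very clever (?<!\\)(?:\\\\)* expression. Here are some test strings to exercise the various solutions: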

test01 = r"out1 'escaped-escape:        \\ ' out2"
test02 = r"out1 'escaped-quote:         \' ' out2"
test03 = r"out1 'escaped-anything:      \X ' out2"
test04 = r"out1 'two escaped escapes: \\\\ ' out2"
test05 = r"out1 'escaped-quote at end:   \'' out2"
test06 = r"out1 'escaped-escape at end:  \\' out2"
test07 = r"out1           'str1' out2 'str2' out2"
test08 = r"out1 \'        'str1' out2 'str2' out2"
test09 = r"out1 \\\'      'str1' out2 'str2' out2"
test10 = r"out1 \\        'str1' out2 'str2' out2"
test11 = r"out1 \\\\      'str1' out2 'str2' out2"
test12 = r"out1         \\'str1' out2 'str2' out2"
test13 = r"out1       \\\\'str1' out2 'str2' out2"
test14 = r"out1           'str1''str2''str3' out2"

Given this test data, let's see how the various solutions fare ('p' == pass, 'XX' == fail):

r"""
AUTHOR/REGEX     01  02  03  04  05  06  07  08  09  10  11  12  13  14
Douglas Leeder    p   p  XX   p   p   p   p   p   p   p   p  XX  XX  XX
  r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'"
cletus/PEZ        p   p   p   p   p  XX   p   p   p   p   p  XX  XX  XX
  r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'"
MizardX           p   p   p   p   p   p   p   p   p   p   p   p   p   p
  r"(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'"
ridgerunner       p   p   p   p   p   p   p   p   p   p   p   p   p   p
  r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'"
"""

A working test script:

import re
data_list = [
    r"out1 'escaped-escape:        \\ ' out2",
    r"out1 'escaped-quote:         \' ' out2",
    r"out1 'escaped-anything:      \X ' out2",
    r"out1 'two escaped escapes: \\\\ ' out2",
    r"out1 'escaped-quote at end:   \'' out2",
    r"out1 'escaped-escape at end:  \\' out2",
    r"out1           'str1' out2 'str2' out2",
    r"out1 \'        'str1' out2 'str2' out2",
    r"out1 \\\'      'str1' out2 'str2' out2",
    r"out1 \\        'str1' out2 'str2' out2",
    r"out1 \\\\      'str1' out2 'str2' out2",
    r"out1         \\'str1' out2 'str2' out2",
    r"out1       \\\\'str1' out2 'str2' out2",
    r"out1           'str1''str2''str3' out2",
    ]

regex = re.compile(
    r"""(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'""",
    re.DOTALL)

data_cnt = 0
for data in data_list:
    data_cnt += 1
    print ("\nData string %d" % (data_cnt))
    m_cnt = 0
    for match in regex.finditer(data):
        m_cnt += 1
        if (match.group(1)):
            print("  quoted sub-string%3d = \"%s\"" %
                (m_cnt, match.group(1)))
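To spot-check one column of the pass/fail table, the sketch below runs test12 (a quoted string preceded by an escaped backslash) through Douglas Leeder's and MizardX's patterns. The table's exact pass criterion is not stated; this assumes a pass means extracting exactly str1 and str2. Variable names are illustrative.

```python
import re

test12 = r"out1         \\'str1' out2 'str2' out2"

leeder = re.compile(r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'")
mizardx = re.compile(r"(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'")

# Collect the contents captured by each pattern across the whole string.
leeder_found = [m.group(1) for m in leeder.finditer(test12)]
mizardx_found = [m.group(1) for m in mizardx.finditer(test12)]

print("Leeder: ", leeder_found)   # picks up the text between the two literals
print("MizardX:", mizardx_found)  # picks up both quoted sub-strings
```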

Phew!

p.s. Thanks to MizardX for the very cool (?<!\\)(?:\\\\)* expression. Learn something new every day!

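Douglas Leeder's pattern ((?:^|[^\\])'(([^\\']|\\'|\\\\)*)') will fail to match "test 'test \x3F test' test" and "test \\'test' test". (A string containing an escape other than a quote or backslash, and a string preceded by an escaped backslash.)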

cletus' pattern ((?<!\\)'((?:\\'|[^'])*)(?<!\\)') will fail to match "test 'test\\' test". (A string ending with an escaped backslash.)
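These three failures are easy to reproduce. A minimal sketch (variable names are illustrative); each search returns None because no match is found:

```python
import re

leeder = re.compile(r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'")
cletus = re.compile(r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'")

# Escape other than quote or backslash inside the literal.
leeder_hex = leeder.search(r"test 'test \x3F test' test")
# Literal preceded by an escaped backslash.
leeder_prefixed = leeder.search(r"test \\'test' test")
# Literal ending with an escaped backslash.
cletus_trailing = cletus.search(r"test 'test\\' test")

print(leeder_hex, leeder_prefixed, cletus_trailing)
```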

My proposal for single-quoted strings is this:

(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'

For both single-quoted and double-quoted strings, you could use this:

(?<!\\)(?:\\\\)*("|')((?:\\.|(?!\1)[^\\])*)\1
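A short sketch of how the combined pattern behaves: group 1 holds the opening quote character and group 2 the contents, and the backreference \1 ensures the closing quote matches the opening one. The two sample strings are made up for illustration.

```python
import re

quoted = re.compile(r"""(?<!\\)(?:\\\\)*("|')((?:\\.|(?!\1)[^\\])*)\1""")

# Double-quoted literal containing an apostrophe and escaped double quotes.
m_double = quoted.search(r'''say "it's \"fine\"" now''')
# Single-quoted literal containing unescaped double quotes.
m_single = quoted.search(r"""x 'a "b" c' y""")

print(m_double.group(1), m_double.group(2))
print(m_single.group(1), m_single.group(2))
```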

Test run using Python:

Douglas Leeder's test cases:
"''" matched successfully: ""
" Example: 'Foo \' Bar'  End. " matched successfully: "Foo \' Bar"
"'\''" matched successfully: "\'"
" Example2: 'Foo \\' End. " matched successfully: "Foo \\"
"not matched\''a'" matched successfully: "a"
"\''a'" matched successfully: "a"

cletus' test cases:
"'testing 123'" matched successfully: "testing 123"
"'testing 123\\'" matched successfully: "testing 123\\"
"'testing 123" didn't match, as expected.
"blah 'testing 123" didn't match, as expected.
"blah 'testing 123'" matched successfully: "testing 123"
"blah 'testing 123' foo" matched successfully: "testing 123"
"this 'is a \' test'" matched successfully: "is a \' test"
"another \' test 'testing \' 123' \' blah" matched successfully: "testing \' 123"

MizardX's test cases:
"test 'test \x3F test' test" matched successfully: "test \x3F test"
"test \\'test' test" matched successfully: "test"
"test 'test\\' test" matched successfully: "test\\"
