Regex for matching anything between quote combinations

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    token_specification = [
        ('BOTH', r'([\'"]{3}).*?\2'),  # for both triple-single quotes and triple-double quotes
        ('SINGLE', r"('''.*?''')"),    # triple-single quotes
        ('DOUBLE', r'(""".*?""")'),    # triple-double quotes
        # regexes which match OK
        ('COM', r'#.*'),
        ('NEWLINE', r'\n'),            # Line endings
        ('SKIP', r'[ \t]+'),           # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
    test_regexes = ['COM', 'BOTH', 'SINGLE', 'DOUBLE']
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        elif kind == 'MISMATCH':
            pass
        else:
            if kind in test_regexes:
                print(kind, value)
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

f = r'C:\path_to_python_file_with_examples_to_match'
with open(f) as sfile:
    content = sfile.read()

for t in tokenize(content):
    pass  # print(t)

1 answer

Answer #1 · posted 2024-06-30 16:23:55

You are probably missing the flag that makes . match newlines as well:

re.finditer(tok_regex, code, flags = re.DOTALL)
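To make the effect of the flag concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration, and the pattern is used on its own, so the backreference is \1 rather than the \2 needed inside the combined tokenizer regex.

```python
import re

# Hypothetical sample: a triple-quoted string spanning two lines.
text = '"""first line\nsecond line"""'
pattern = r'([\'"]{3}).*?\1'

# Without DOTALL, "." cannot cross the newline, so nothing matches.
plain_match = re.search(pattern, text)
print(plain_match)

# With DOTALL, "." also matches "\n", so the whole literal is matched.
dotall_match = re.search(pattern, text, flags=re.DOTALL)
print(dotall_match.group(0) == text)
```

This is why the multi-line docstrings in the question are never reported as BOTH when no flag is passed.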

In this case the output is

(output listing not preserved in the source page)

COM now matches far too much, because . now swallows everything up to the end of the file. If we modify the COM pattern slightly to make it less greedy

('COM', r'#.*?$'),  # presumed pattern; the original snippet was lost in extraction

we can now use re.MULTILINE so that it matches less

re.finditer(tok_regex, code, flags = re.DOTALL | re.MULTILINE)

The output is now

('BOTH', '"""\n    This class holds lhghdhdf hgh dhghd hdfh ghd fh.\n    """')
('COM', '# sdasda fad fhs ghf dfh')
('BOTH', '\'\'\'blah qsdkfjqsv,;sv\n                   vq\xc3\xb9lvnq\xc3\xb9v \n                   dqvnq\n                   vq\n                   v\n\nblah blah\'8&^"\'\'\'')
('BOTH', '\'\'\'blah blah\n     blah\n    \'8&^"\'\'\'')
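The combination of the two flags can be tried on a small input. The sketch below is only an illustration: the sample text, the reduced specification and the OTHER catch-all are hypothetical, and the lazy COM pattern `#.*?$` is an assumption about what a "less greedy" comment pattern looks like, not something confirmed by the answer.

```python
import re

# Hypothetical input: two comments around a multi-line docstring.
sample = 'x = 1  # first comment\n"""doc\nstring"""\ny = 2  # second comment\n'

# Reduced, illustrative specification (assumed COM pattern).
spec = [
    ('BOTH', r'([\'"]{3}).*?\2'),  # \2: the inner group is group 2 of the joined regex
    ('COM', r'#.*?$'),             # lazy match up to the end of the current line
    ('OTHER', r'.'),               # catch-all for every remaining character
]
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in spec)

# DOTALL lets "." cross newlines; MULTILINE lets "$" match at each line end.
flags = re.DOTALL | re.MULTILINE
found = [
    (mo.lastgroup, mo.group(mo.lastgroup))
    for mo in re.finditer(tok_regex, sample, flags)
    if mo.lastgroup != 'OTHER'
]
for kind, value in found:
    print(kind, repr(value))
```

With MULTILINE, `$` matches just before every newline, so the lazy `.*?` stops at the end of the comment's own line even though DOTALL would otherwise let it run on.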

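If you really do not want to use flags, there is a "hack" that avoids . altogether, since that metacharacter matches almost everything except newlines. You can build a character class that matches everything except one symbol that is very unlikely to appear in the files you want to parse. For example, you can use the character with ASCII code 0. The regex for that character is \x00, and the corresponding class [^\x00] matches every symbol (even newlines) except the one with ASCII code 0 (which is why it is a hack: the class still excludes one character). You need to keep the initial regex for COM, and for BOTH it would be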

('BOTH',      r'([\'"]{3})[^\x00]*?\2')
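A quick way to check the flag-free variant is shown below. This is a sketch on a made-up sample string, with the pattern used standalone (hence \1 instead of \2). The `[\s\S]` class in the second pattern is not from the answer; it is a commonly used alternative that matches any character at all without flags.

```python
import re

# Hypothetical sample: a triple-single-quoted string spanning two lines.
text = "'''one\ntwo'''"

# Negated class from the answer: anything except the NUL character.
no_flag_pattern = r'([\'"]{3})[^\x00]*?\1'
# Alternative (assumption, not from the answer): whitespace or non-whitespace.
alt_pattern = r'([\'"]{3})[\s\S]*?\1'

hack_match = re.search(no_flag_pattern, text)
alt_match = re.search(alt_pattern, text)
print(hack_match.group(0) == text)
print(alt_match.group(0) == text)
```

Both patterns cross the newline without re.DOTALL, because a character class is not affected by that flag.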

I strongly recommend using an online tool that explains regexes, such as regex101.

For more complicated quote-matching cases you need to write a parser. See, for example, "Can the csv format be defined by a regex?" and "When you should NOT use Regular Expressions?".
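If the files being scanned are valid Python source, as the path in the question suggests, one option is to reuse an existing parser instead of writing one: the standard library's tokenize module. The sketch below is only one possible approach, not the answer's method. The helper name strings_and_comments and the sample source are hypothetical.

```python
import io
import tokenize

# Hypothetical sample source with a comment and two triple-quoted strings.
source = (
    'x = 1  # note\n'
    's = """multi\nline"""\n'
    "t = '''it's \"quoted\"'''\n"
)

def strings_and_comments(code):
    """Yield (kind, text) for string literals and comments in Python source."""
    readline = io.StringIO(code).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.STRING:
            yield ('STRING', tok.string)
        elif tok.type == tokenize.COMMENT:
            yield ('COM', tok.string)

results = list(strings_and_comments(source))
for kind, value in results:
    print(kind, repr(value))
```

The tokenizer already handles nested and mixed quote characters inside a literal, which is where a single regex tends to become fragile. It raises an error on input that is not tokenizable as Python, so it only fits this specific use case.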
