Finding patterns in a text file using words from several lists?

import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']

# Three words joined by "and"/"or", with or without a comma before the conjunction
noun_list_pattern1 = (r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b'
                      r'|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b'
                      r'|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b'
                      r'|\b\w+\b,\s\b\w+\b\sor\s\b\w+\b')

with open('test_sentence.txt', 'r') as input_f:
    read_input = input_f.read()

word = re.findall(noun_list_pattern1, read_input)
for w in word:
    print(w)

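2 answers

User

Answer 1 · edited 2024-05-19 22:11:45

Actually, you don't necessarily need a regular expression, since there are several ways to do this using your original lists.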

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

# This assumes the file has been read into a list of newline-delimited lines called `rawlines`
for line in rawlines:
    # Substring test: collect every noun and conjunction that appears in the line
    matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
    if len(matches) == 4:
        for match in matches:
            print(match)

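The match count is compared against 4 because 4 is the correct number of matches. (Note that the same count could also come from repeated nouns or conjunctions.)

Edit:

This version prints the matching line as well as the matched words. It also fixes a possible multi-word match problem: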

[The code block for this version was not preserved.]
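The code for the edited version is not available here. As a rough sketch of the idea it describes (print the matching line along with the matched words, and compare whole words rather than substrings), something like the following could work. The sample `rawlines` and the helper name `find_matches` are assumptions made for illustration, not the answerer's original code.

```python
import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

# Hypothetical sample input; in practice these lines would come from the file.
rawlines = [
    'bacon, cheese and eggs',
    'the dog ate my homework',
    'milk, bacon, or eggs',
]

def find_matches(line):
    """Return the nouns and conjunctions that occur in `line` as whole words."""
    words = re.findall(r'\w+', line.lower())
    return [w for w in words if w in noun_list or w in conjunctions]

for line in rawlines:
    matches = find_matches(line)
    # Same threshold of 4 as in the first version.
    if len(matches) == 4:
        print(line)
        for match in matches:
            print('    ' + match)
```

Splitting the line into words first means that "or" inside "forlorn" or "and" inside "sandwich" no longer counts as a match, which the plain `in` substring test would accept.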

However, if that doesn't suit you, you can always build the regex as follows (using the itertools module):

import itertools

# The number of nouns per permutation is 3 (as your examples show)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
    matches = list(nouns)
    matches.append(conj)
    # matches[:2] holds the first 2 nouns, matches[-1] is the conjunction, and
    # matches[2:-1] is the noun just before it in the list (with more than 3 nouns,
    # this would be everything between the 2nd noun and the conjunction).
    regex_string = r',\s'.join(matches[:2]) + r'\s' + matches[-1] + r'\s' + r',\s'.join(matches[2:-1])
    print(regex_string)
    # ... do regex-related matching here

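The caveat with this method is that it is pure brute force: it generates every possible combination (read: permutation) of the two lists, and each one can then be tested against every line. It is therefore very slow, but for lines of the form shown in your examples (no comma before the conjunction) it will produce exact matches.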
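To make the cost concrete: 6 nouns taken 3 at a time, in order, give 6 × 5 × 4 = 120 permutations, and 2 conjunctions bring that to 240 patterns to try per line. Below is a minimal sketch of the matching step, assuming the "no comma before the conjunction" form. The helper name `build_patterns` and the sample lines are illustrative and not part of the original answer.

```python
import itertools
import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

def build_patterns(nouns, conjs, count=3):
    """Compile one pattern per (noun permutation, conjunction) pair."""
    patterns = []
    for perm, conj in itertools.product(itertools.permutations(nouns, count), conjs):
        # e.g. bacon,\scheese\sand\seggs  (no comma before the conjunction)
        body = r',\s'.join(perm[:-1]) + r'\s' + conj + r'\s' + perm[-1]
        patterns.append(re.compile(r'\b' + body + r'\b'))
    return patterns

patterns = build_patterns(noun_list, conjunctions)
print(len(patterns))  # number of patterns tried per line

# Hypothetical sample lines
for line in ['bacon, cheese and eggs', 'milk, cheese, or eggs']:
    hit = None
    for p in patterns:
        hit = p.search(line)
        if hit:
            break
    print(line, '->', hit.group(0) if hit else None)
```

The second sample line has a comma before "or", so none of these patterns match it. Covering that form as well would double the number of patterns.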

Adapt as needed.

User

Answer 2 · edited 2024-05-19 22:11:45

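Break your problem down. First, you need a pattern that matches the words in your list and nothing else. You can do this with the alternation operator | and the literal words. For example, red|green|blue will match "red", "green", or "blue", but not any other word. Join your noun list with that character, then add word-boundary metacharacters and parentheses to group the alternatives: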

noun_patt = r'\b(' + '|'.join(nouns) + r')\b'

Do the same for the list of conjunctions:

conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'

The overall match you want is "one or more noun_patt matches, each optionally followed by a comma, then a conj_patt match, then one more noun_patt match". That is easy to express as a regex:

patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)
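Putting the pieces together, the assembled pattern can be checked against a few sample lines. This is only a sketch: the list contents come from the question, while the name `conjunctions` and the sample lines are assumptions. The `re.escape` calls are an addition here, a guard in case a list entry ever contains regex metacharacters.

```python
import re

nouns = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']  # assumed name for the conjunction list

noun_patt = r'\b(' + '|'.join(re.escape(n) for n in nouns) + r')\b'
conj_patt = r'\b(' + '|'.join(re.escape(c) for c in conjunctions) + r')\b'
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)

# Hypothetical sample lines
samples = [
    'I bought bacon, cheese and eggs today',
    'milk, bacon, or eggs',
    'bacon and sandwich',
]
for line in samples:
    m = re.search(patt, line)
    print(m.group(0) if m else 'no match: ' + line)
```

The last sample fails because "sandwich" is not in the noun list, even though it contains the letters "and".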

You don't really want re.findall() here, but re.search(), since you only expect one match per line:

>>> for line in lines:
...     print(re.search(patt, line).group(0))
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
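One concrete reason re.findall() is awkward with this pattern: when a pattern contains capturing groups, findall returns the captured groups rather than the full matched text. The sketch below rebuilds the pattern so that it is self-contained. The name `conjunctions` and the sample line are assumptions.

```python
import re

nouns = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']  # assumed name for the conjunction list
noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)

line = 'bacon, cheese and eggs'

# findall yields one tuple of captured groups per match, not the matched text
groups = re.findall(patt, line)
print(groups)

# search returns a match object (or None); group(0) is the full matched text
m = re.search(patt, line)
print(m.group(0) if m else None)
```

Checking for None before calling group(0) also avoids an AttributeError on lines that contain no match.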

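Note that you are getting close to (if not already hitting) the limits of what regular expressions can do for parsing English. For anything more complex than this, you will want to look into actual parsing, perhaps with NLTK.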
