How to use regex in Python to extract patterns from a file

Published 2024-10-01 02:33:50


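I have an input file like the one shown below. I need to extract the entries that start with nsubj, rcmod, ccomp, or acomp and write them to two output files, as shown below. I am new to Python and don't know how to use regex for this.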

Input file

nsubj(believe-4, i-1)
aux(believe-4, ca-2)
neg(believe-4, n't-3)
root(ROOT-0, believe-4)
acomp(believe-4, @mistamau-5)
aux(know-8, does-6)
neg(know-8, n't-7)
ccomp(@mistamau-5, know-8)
dobj(is-12, who-9)
amod(tatum-11, channing-10)
nsubj(is-12, tatum-11)
ccomp(know-8, is-12)
root(ROOT-0, What-1)
cop(What-1, is-2)
amod(people-4, worse-3)
xsubj(hear-9, I-5)
aux(talking-7, am-6)
rcmod(people-4, talking-7)
xcomp(talking-7, hear-9)
dobj(hear-9, me-10)
advmod(poorly-12, very-11)

Output file 1

nsubj(believe-4, i-1)
nsubj(is-12, tatum-11)
acomp(believe-4, @mistamau-5)
rcmod(people-4, talking-7)
ccomp(know-8, is-12)
ccomp(@mistamau-5, know-8)

Output file 2

believe, i
is, tatum
believe, @mistamau
people, talking
know, is
@mistamau, know

2 answers
import re

regex = re.compile(r"""
    ^          # Start of line (re.M modifier set!)
    (          # Start of capturing group 1:
     (?:nsubj|rcmod|ccomp|acomp) # Match one of these
     \(        # Match (
     ([^-]*)   # Match and capture in group 2 any no. of non-dash characters
     -\d+,[ ]  # Match a dash and a number, a comma and a space
     ([^-]*)   # Match and capture in group 3 any no. of non-dash characters
     -\d+      # Match a dash and a number
     \)        # Match )
    )          # End of group 1""", re.M|re.X)

This should work, if I have understood your requirements correctly.

When applied to the whole file (s = myfile.read()), it gives the following result:

>>> regex.findall(s)
[('nsubj(believe-4, i-1)', 'believe', 'i'), 
 ('acomp(believe-4, @mistamau-5)', 'believe', '@mistamau'), 
 ('ccomp(@mistamau-5, know-8)', '@mistamau', 'know'), 
 ('nsubj(is-12, tatum-11)', 'is', 'tatum'), 
 ('ccomp(know-8, is-12)', 'know', 'is'), 
 ('rcmod(people-4, talking-7)', 'people', 'talking')]
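The question also asks for two output files, which the answer above does not show. A minimal sketch of that step, assuming the same pattern written on one line: group 1 goes to the first file and groups 2 and 3 go to the second. The helper names `extract` and `write_outputs` and the file paths passed to them are hypothetical, not part of the original answer, and the lines are written in the order they appear in the input.

```python
import re

# Same pattern as above, written on one line (no re.X needed).
PATTERN = re.compile(
    r"^((?:nsubj|rcmod|ccomp|acomp)\(([^-]*)-\d+, ([^-]*)-\d+\))",
    re.M,
)


def extract(text):
    """Return (full_lines, word_pairs) for every matching dependency line."""
    matches = PATTERN.findall(text)
    full_lines = [full for full, _, _ in matches]
    word_pairs = [f"{head}, {dep}" for _, head, dep in matches]
    return full_lines, word_pairs


def write_outputs(in_path, out_path_1, out_path_2):
    """Read in_path and write the two result files (paths are up to the caller)."""
    with open(in_path, encoding="utf-8") as f:
        full_lines, word_pairs = extract(f.read())
    with open(out_path_1, "w", encoding="utf-8") as f:
        f.write("\n".join(full_lines) + "\n")
    with open(out_path_2, "w", encoding="utf-8") as f:
        f.write("\n".join(word_pairs) + "\n")
```

Calling `write_outputs("input.txt", "output_1.txt", "output_2.txt")` (hypothetical file names) would produce both files. Because the pattern uses `[^-]*`, a word that itself contains a dash would not be matched.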

Here is a program that reads lines from stdin and prints "matched" or "not matched", depending on whether the line starts with "Big" or "Daddy".

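import re
import sys

# Match text that starts with "Big" or "Daddy", followed by lowercase letters.
prog = re.compile(r'(?:Big|Daddy)[a-z]*')
for line in sys.stdin:        # read line by line until EOF
    if prog.match(line):      # match() only matches at the start of the line
        print('matched')
    else:
        print('not matched')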

Just replace the regex pattern with your own and read the input from your file instead of stdin, and you should be all set.
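As a rough sketch of that adaptation, assuming the four relation names from the question and a line-by-line read of the file: the names `matching_lines` and `print_matches` and the path argument are hypothetical and only illustrate the idea.

```python
import re

# Pattern swapped in for the question: a line must start with one of the
# four relation names followed by an opening parenthesis.
prog = re.compile(r"(?:nsubj|rcmod|ccomp|acomp)\(")


def matching_lines(lines):
    """Yield each line (without its newline) that starts with a wanted relation."""
    for line in lines:
        if prog.match(line):  # match() anchors at the start of the line
            yield line.rstrip("\n")


def print_matches(path):
    """Print the matching lines of the file at `path` (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        for line in matching_lines(f):
            print(line)
```

Since `match()` only matches at the beginning of a line, entries such as `xsubj(...)` or `xcomp(...)` are skipped even though they contain similar text.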
