Python 2.4.3
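I need to read through some files (they can be as large as 10 GB). What I need is to walk through the file until a line matches a pattern, then print that line and every line after it until a line matches another pattern. At that point, keep reading through the file until the first pattern matches again.
For example, the file contains: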
---- Alpha ---- Zeta
...(text lines)
---- Bravo ---- Delta
...(text lines)
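and so on.
If a line matches ---- Alpha ---- Zeta, it should print ---- Alpha ---- Zeta and every line after it until it hits ---- Bravo ---- Delta (or any header other than ---- Alpha ---- Zeta); it should then keep reading until it matches ---- Alpha ---- Zeta again.
The code below matches what I'm looking for, but it prints only the matching line, not the text that follows it.
Any idea what I'm doing wrong here?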
import re
fh = open('text.txt', 'r')
re1='(-)' # Any Single Character 1
re2='(-)' # Any Single Character 2
re3='(-)' # Any Single Character 3
re4='(-)' # Any Single Character 4
re5='( )' # White Space 1
re6='(Alpha)' # Word 1
re6a='((?:[a-z][a-z]+))' # Word 1 alternate
re7='( )' # White Space 2
re8='(-)' # Any Single Character 5
re9='(-)' # Any Single Character 6
re10='(-)' # Any Single Character 7
re11='(-)' # Any Single Character 8
re12='(\\s+)' # White Space 3
re13='(Zeta)' # Word 2
re13a='((?:[a-z][a-z]+))' # Word 2 alternate
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10+re11+re12+re13,re.IGNORECASE|re.DOTALL)
rga = re.compile(re1+re2+re3+re4+re5+re6a+re7+re8+re9+re10+re11+re12+re13a,re.IGNORECASE|re.DOTALL)
for line in fh:
    if re.match(rg, line):
        print line
        fh.next()
        while not re.match(rga, line):
            print fh.next()
fh.close()
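A note on why the loop above prints only the header: the `while` tests `line`, which is still the header that just matched `rg`. Since `rga` matches every header, including that one, the loop body never runs. The bare `fh.next()` also discards the first line of the section. What follows is a minimal sketch of one way to restructure it, not the only fix. The names `START_RE`, `HEADER_RE` and `wanted_lines` are made up for illustration, and the two patterns are meant as single-string equivalents of `rg` and `rga` with the unused capture groups dropped.

```python
import re

# Single-pattern equivalents of the rg / rga expressions built above
# (capture groups dropped because nothing reads them).
START_RE = re.compile(r'-{4} Alpha -{4}\s+Zeta', re.IGNORECASE)
HEADER_RE = re.compile(r'-{4} [a-z][a-z]+ -{4}\s+[a-z][a-z]+', re.IGNORECASE)


def wanted_lines(lines, start_re=START_RE, header_re=HEADER_RE):
    """Yield each wanted header and the lines after it, up to the next header."""
    printing = False
    for line in lines:
        if header_re.match(line):
            # Every header re-decides the state, so two wanted
            # sections in a row are both kept.
            printing = start_re.match(line) is not None
        if printing:
            yield line
```

Usage would be along the lines of `for line in wanted_lines(open('text.txt')): sys.stdout.write(line)`. Writing with `sys.stdout.write` rather than `print line` avoids the doubled newline, since each line read from the file already ends with one.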
And my sample text file:
---- Pappa ---- Oscar
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris eleifend imperdiet
lacus quis imperdiet. Nulla erat neque, laoreet vel fermentum a, dapibus in sem.
Maecenas elementum nisi nec neque pellentesque ac rutrum urna cursus. Nam non purus
sit amet dolor fringilla venenatis. Integer augue neque, scelerisque ac dictum at,
venenatis elementum libero. Etiam nec ante in augue porttitor laoreet. Aenean ultrices
pellentesque erat, id porta nulla vehicula id. Cras eu ante nec diam dapibus hendrerit
in ac diam. Vivamus velit erat, tincidunt id tempus vitae, tempor vel leo. Donec
aliquam nibh mi, non dignissim justo.
---- Alpha ---- Zeta
Sed molestie tincidunt euismod. Morbi ultrices diam a nibh varius congue. Nulla velit
erat, luctus ac ornare vitae, pharetra quis felis. Sed diam orci, accumsan eget
commodo eu, posuere sed mi. Phasellus non leo erat. Mauris turpis ipsum, mollis sed
ismod nec, aliquam non quam. Vestibulum sem eros, euismod ut pharetra sit amet,
dignissim eget leo.
---- Charley ---- Oscar
Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
Aliquam commodo, metus at vulputate hendrerit, dui justo tempor dui, at posuere
ante vitae lorem. Fusce rutrum nibh a erat condimentum laoreet. Nullam eu hendrerit
sapien. Suspendisse id lobortis urna. Maecenas ut suscipit nisi. Proin et metus at
urna euismod sollicitudin eu at mi. Aliquam ac egestas magna. Quisque ac vestibulum
lectus. Duis ac libero magna, et volutpat odio. Cras mollis tincidunt nibh vel rutrum.
Curabitur fringilla, ante eget scelerisque rhoncus, libero nisl porta leo, ac
vulputate mi erat vitae felis. Praesent auctor fringilla rutrum. Aenean sapien ligula,
imperdiet sodales ullamcorper ut, vulputate at enim.
---- Bravo ---- Delta
Donec cursus tincidunt pellentesque. Maecenas neque nisi, dignissim ac aliquet ac,
vestibulum ut tortor. Pellentesque habitant morbi tristique senectus et netus et
malesuada fames ac turpis egestas. Aenean ullamcorper dapibus accumsan. Aenean eros
tortor, ultrices at adipiscing sed, lobortis nec dolor. Fusce eros ligula, posuere
quis porta nec, rhoncus et leo. Curabitur turpis nunc, accumsan posuere pulvinar eget,
sollicitudin eget ipsum. Sed a nibh ac est porta sollicitudin. Pellentesque ut urna ut
risus pharetra mollis tincidunt sit amet sapien. Sed semper sollicitudin eros quis
pellentesque. Curabitur ac metus lorem, ac malesuada ipsum. Nulla turpis erat, congue
eu gravida nec, egestas id nisi. Praesent tellus ligula, pretium vitae ullamcorper
vitae, gravida eu ipsum. Cras sed erat ligula.
---- Alpha ---- Zeta
Cras id condimentum lectus. Sed sit amet odio eros, ut mollis sapien. Etiam varius
tincidunt quam nec mattis. Nunc eu varius magna. Maecenas id ante nisl. Cras sed augue
ipsum, non mollis velit. Fusce eu urna id justo sagittis laoreet non id urna. Nullam
venenatis tincidunt gravida. Proin mattis est sit amet dolor malesuada sagittis.
Curabitur in lacus rhoncus mi posuere ullamcorper. Phasellus eget odio libero, ut
lacinia orci. Pellentesque iaculis, ligula at varius vulputate, arcu leo dignissim
massa, non adipiscing lectus magna nec dolor. Quisque in libero nec orci vestibulum
dapibus. Nulla turpis massa, varius quis gravida eu, bibendum et nisl. Fusce tincidunt
laoreet elit, sed egestas diam pharetra eget. Maecenas lacus velit, egestas nec tempor
eget, hendrerit et massa.
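+++++++++++++++++++++++++++++++ UPDATE +++++++++++++++++++++++++++++++++++++++++
The code below works: it matches a header-type line, prints that line and every line after it until the next header-type pattern (i.e. one that doesn't match), then skips ahead until the next matching header.
The only problem is that it's really slow. It takes about a minute to get through 10 million lines.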
import re
re1='(-)' # Any Single Character 1
re2='(-)' # Any Single Character 2
re3='(-)' # Any Single Character 3
re4='(-)' # Any Single Character 4
re5='( )' # White Space 1
re6='(Alpha)' # Word 1
re6a='((?:[a-z][a-z]+))' # Word 1 alternate
re7='( )' # White Space 2
re8='(-)' # Any Single Character 5
re9='(-)' # Any Single Character 6
re10='(-)' # Any Single Character 7
re11='(-)' # Any Single Character 8
re12='(\\s+)' # White Space 3
re13='(Zeta)' # Word 2
re13a='((?:[a-z][a-z]+))' # Word 2 alternate
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10+re11+re12+re13,re.IGNORECASE|re.DOTALL)
rga = re.compile(re1+re2+re3+re4+re5+re6a+re7+re8+re9+re10+re11+re12+re13a,re.IGNORECASE|re.DOTALL)
linestop = 0
fh = open('test.txt', 'r')
for line in fh:
    if linestop == 0:
        if re.match(rg, line):
            print line
            linestop = 1
    else:
        if re.match(rga, line):
            linestop = 0
        else:
            print line
fh.close()
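Most lines in the sample file are body text, so one possible speed-up is to avoid running a regex on them at all. The sketch below does a cheap `startswith` test first and only calls the regex on lines that look like headers. It rests on an assumption the question does not state: that only header lines begin with four dashes. `HEADER_PREFIX`, `is_start` and `copy_sections` are hypothetical names, and how much time this saves on a real 10-million-line file would need to be measured.

```python
import re

# Assumption: only header lines start with four dashes.
HEADER_PREFIX = '----'
# Bound match method of a compiled pattern; same expression as rg above,
# written as one string without the capture groups.
is_start = re.compile(r'-{4} Alpha -{4}\s+Zeta', re.IGNORECASE).match


def copy_sections(lines, write):
    """Pass wanted lines to write() and return how many were written."""
    printing = False
    written = 0
    for line in lines:
        # Cheap string test first; the regex only runs on header lines.
        if line.startswith(HEADER_PREFIX):
            printing = is_start(line) is not None
        if printing:
            write(line)
            written += 1
    return written
```

A call might look like `copy_sections(open('test.txt'), sys.stdout.write)`. Calling the bound `match` method directly also skips the per-call overhead of the module-level `re.match(rg, line)` form.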
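I think it would speed things up considerably if I added a grep step first, i.e. run grep and then run the regex script above.
I have os.system working fine; what I can't figure out is how to pass the regex match via popen.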
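For passing a pattern to grep without going through a shell, one option is `subprocess.Popen`, which was added in Python 2.4. The sketch below is a guess at the shape of such a call, not the approach the poster finally used. It assumes `grep` is on the PATH, and `grep_lines` is a hypothetical helper name. The pattern is handed over as a list element after `-e`, so it needs no shell quoting even though it starts with dashes.

```python
import subprocess


def grep_lines(pattern, path, extra_args=()):
    """Run grep on path and yield its output lines one at a time."""
    # -e marks the next argument as the pattern, so a pattern that
    # begins with '-' is not mistaken for an option.
    cmd = ['grep'] + list(extra_args) + ['-e', pattern, path]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            universal_newlines=True)
    for line in proc.stdout:
        yield line
    # Reached only when the caller consumes every line.
    proc.stdout.close()
    proc.wait()
```

For instance, `grep_lines('^---- Alpha', 'test.txt', ['-n'])` would yield the matching header lines prefixed with their line numbers, which a second pass could then use to decide where to start reading.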
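**** FINAL UPDATE
I'm calling this done. What I ended up doing was:
The end result was going from about 65 seconds to read a 10-million-line file (printing out the necessary items) down to about 3.5 seconds. I really wish I could figure out how to call grep some way other than os.system, but maybe that isn't implemented very well in Python 2.4.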