<p>Splitting and checking the length may still be faster than a regex:</p>
<pre><code>for line in f:
    spl = line.split("|", 2)
    if len(spl) > 2:
        print(spl[1])
        ....
</code></pre>
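<p>To see why the length check is enough, here is a small sketch of what <code>split("|", 2)</code> returns for lines with and without two pipes. The sample lines and the helper name <code>second_field</code> are made up for illustration and are not part of the original data:</p>

```python
# Hypothetical sample lines, not taken from the original data.
lines = ["a|b|c|d\n", "a|b\n", "no pipes here\n"]


def second_field(line):
    """Return the text between the first two pipes, or None if there are fewer than two."""
    spl = line.split("|", 2)  # maxsplit=2 yields at most three pieces
    return spl[1] if len(spl) > 2 else None


results = [second_field(line) for line in lines]
print(results)
```

<p>Only a line containing at least two pipes produces three pieces, so <code>len(spl) > 2</code> plays the role of the regex match test.</p>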
<p>It is worth timing both approaches on lines that match and on lines that do not.</p>
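<p>A minimal way to run such a comparison yourself is sketched below. The regex is an assumption (the pattern under discussion is not shown in this answer), as are the sample strings and function names, so treat the numbers it prints as specific to your machine and data:</p>

```python
import re
import timeit

# Assumed pattern: capture the text between the first two pipes.
pattern = re.compile(r"\|([^|]*)\|")


def with_split(line):
    spl = line.split("|", 2)
    return spl[1] if len(spl) > 2 else None


def with_regex(line):
    m = pattern.search(line)
    return m.group(1) if m else None


# Hypothetical sample lines: one that matches and one that does not.
match = "foo|bar|baz|qux"
no_match = "foo bar baz qux"

for label, line in (("match", match), ("no match", no_match)):
    t_split = timeit.timeit(lambda: with_split(line), number=100000)
    t_regex = timeit.timeit(lambda: with_regex(line), number=100000)
    print("%-9s split: %.4fs  regex: %.4fs" % (label, t_split, t_regex))
```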
<p>You can shave off a little more by creating a local reference to <code>str.split</code>:</p>
<pre><code>_spl = str.split  # local name avoids an attribute lookup on every iteration
for line in f:
    spl = _spl(line, "|", 2)
    if len(spl) > 2:
        .....
</code></pre>
<p>Since every line always has the same number of pipes, you can also let <code>csv.reader</code> do the splitting:</p>
<pre><code>import csv

def main(argv):
    seen = set()  # only use if you actually need a set of all names
    with open("test.txt", 'r') as infile:
        r = csv.reader(infile, delimiter="|")
        for row in r:
            v = row[1]
            if v:
                filename = "bby_" + v + ".dat"
                existingFile = open(filename, 'a')
                # csv.reader yields a list, so rebuild the line before writing
                existingFile.write("|".join(row) + "\n")
                existingFile.close()
                seen.add(v)
            else:
                print("Empty")
</code></pre>
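<p>The if/else looks redundant, since you append to the file either way. If you have some other reason to keep a set of the <code>row[1]</code> values, you can simply add to the set on every iteration; unless you actually need a set of all the names, I would remove it from the code.</p>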
<p>Applying the same logic with <code>split</code>:</p>
<pre><code>def main(argv):
    seen = set()
    with open("test.txt", 'r') as infile:
        _spl = str.split
        for row in infile:
            v = _spl(row, "|", 2)[1]
            if v:
                filename = "bby_" + v + ".dat"
                existingFile = open(filename, 'a')
                existingFile.write(row)  # row still ends with its newline
                existingFile.close()
                seen.add(v)
            else:
                print("Empty")
</code></pre>
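<p>What causes most of the overhead is the constant opening and writing of files, but unless you can hold all the lines in memory there is no simple way around it.</p>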
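<p>If at least part of the data does fit in memory, one possible middle ground is to buffer lines per name and append each group in a single write. This is only a sketch under that assumption: the function name <code>split_by_name</code> and the <code>prefix</code> and <code>max_buffered</code> parameters are hypothetical, and it relies on every line containing at least one pipe, as stated above:</p>

```python
from collections import defaultdict


def split_by_name(in_path, prefix="bby_", max_buffered=100000):
    """Group lines by their second field and append each group in one write."""
    buffers = defaultdict(list)
    buffered = 0

    def flush():
        # One open/close per distinct name instead of one per line.
        for name, rows in buffers.items():
            with open(prefix + name + ".dat", "a") as out:
                out.writelines(rows)
        buffers.clear()

    with open(in_path) as infile:
        for row in infile:
            v = row.split("|", 2)[1]
            if not v:
                continue  # skip lines whose second field is empty
            buffers[v].append(row)
            buffered += 1
            if buffered >= max_buffered:
                # Cap memory use by writing out what has been collected so far.
                flush()
                buffered = 0
    flush()
```

<p>Calling <code>split_by_name("test.txt")</code> would write the same <code>bby_*.dat</code> files as the loop above, and a larger <code>max_buffered</code> trades memory for fewer file opens.</p>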
<p>As far as reading goes, on a file with 10 million lines, a plain <code>split</code> with a maxsplit of 2 outperforms the csv reader:</p>
<pre><code>In [15]: with open("in.txt") as f:
   ....:     print(sum(1 for _ in f))
   ....:
10000000

In [16]: paste
def main(argv):
    with open(argv, 'r') as infile:
        for row in infile:
            v = row.split("|", 2)[1]
            if v:
                pass
## -- End pasted text --

In [17]: paste
def main_r(argv):
    with open(argv, 'r') as infile:
        r = csv.reader(infile, delimiter="|")
        for row in r:
            if row[1]:
                pass
## -- End pasted text --

In [18]: timeit main("in.txt")
1 loops, best of 3: 3.85 s per loop

In [19]: timeit main_r("in.txt")
1 loops, best of 3: 6.62 s per loop
</code></pre>