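<p>I need to read in a large csv data file, but it is riddled with newline characters and generally quite chaotic. So I am processing it manually, but I am running into a strange slowdown that seems to depend on which characters appear in the file.</p>
<p>While trying to reproduce the problem by randomly generating a similar-looking csv file, I found that the problem may lie in the <code>count</code> function.</p>
<p>Consider this example, which creates a large file of chaotic random data, reads the file back in, and then counts separators to reassemble the lines into column data.</p>
<p>Note that in the first run I use only <code>string.ascii_letters</code> for the random data; in the second run I use characters from <code>string.printable</code>.</p>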
<pre><code>import os
import random as rd
import string
import time

# Function to create random data in a specific pattern with separator ";":
def createRandomString(num,io,fullLength):
    lineFull = ''
    nl = True
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(7))
    for i in range(num):
        if i == 0:
            line = 'Start;'
        else:
            line = ''
        bb = rd.choice([True,True,False])
        if bb:
            line = line+'\"\";'
        else:
            if rd.random() < 0.999:
                line = line+randstr
            else:
                line = line+rd.randint(10,100)*randstr
        if nl and i != num-1:
            line = line+';\n'
            nl = False
        elif rd.random() < 0.04 and i != num-1:
            line = line+';\n'
            if rd.random() < 0.01:
                add = rd.randint(1,10)*'\n'
                line = line+add
        else:
            line = line+';'
        lineFull = lineFull+line
    return lineFull+'\n'

# Create file with random data:
outputFolder = "C:\\DataDir\\Output\\"
numberOfCols = 38
fullLength = 10000
testLines = [createRandomString(numberOfCols,i,fullLength) for i in range(fullLength)]
with open(outputFolder+"TestFile.txt",'w') as tf:
    tf.writelines(testLines)

# Read in file:
with open(outputFolder+"TestFile.txt",'r') as ff:
    lines = []
    for line in ff.readlines():
        lines.append(unicode(line.rstrip('\n')))

# Restore columns by counting the separator:
linesT = ''
lines2 = []
time0 = time.time()
for i in range(len(lines)):
    linesT = linesT + lines[i]
    count = linesT.count(';')
    if count == numberOfCols:
        lines2.append(linesT)
        linesT = ''
    if i%1000 == 0:
        print time.time()-time0
        time0 = time.time()
print time.time()-time0
</code></pre>
<p>The print statements output:</p>
<pre><code>0.0
0.0019998550415
0.00100016593933
0.000999927520752
0.000999927520752
0.000999927520752
0.000999927520752
0.00100016593933
0.0019998550415
0.000999927520752
0.00100016593933
0.0019998550415
0.00100016593933
0.000999927520752
0.00200009346008
0.000999927520752
0.000999927520752
0.00200009346008
0.000999927520752
0.000999927520752
0.00200009346008
0.000999927520752
0.00100016593933
0.000999927520752
0.00200009346008
0.000999927520752
</code></pre>
<p>Consistently fast performance.</p>
<p>Now I change the third line in <code>createRandomString</code> to <code>randstr = ''.join(rd.choice(string.printable) for _ in range(7))</code>, and my output becomes:</p>
<pre><code>0.0
0.0759999752045
0.273000001907
0.519999980927
0.716000080109
0.919999837875
1.11500000954
1.25199985504
1.51200008392
1.72199988365
1.8820002079
2.07999992371
2.21499991417
2.37400007248
2.64800000191
2.81900000572
3.04500007629
3.20299983025
3.55500006676
3.6930000782
3.79499983788
4.13900017738
4.19899988174
4.58700013161
4.81799983978
4.92000007629
5.2009999752
5.40199995041
5.48399996758
5.70299983025
5.92300009727
6.01099991798
6.44200015068
6.58999991417
3.99399995804
</code></pre>
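<p>Not only is the performance very slow, it also keeps getting slower over time.</p>
<p>The only difference is the range of characters the random data is drawn from.</p>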
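<p>One way to narrow this down is to check whether the accumulation buffer is ever reset. The sketch below is a hypothetical helper, not part of the code above: the names <code>regroup</code>, <code>n_cols</code> and <code>leftover</code> are made up for illustration. It keeps the same <code>count == 38</code> test but tracks the separator count incrementally, and it reports how much text is still sitting in the buffer at the end. It rests on an assumption that has not been verified here: if a random field can itself contain <code>;</code>, the count may jump past the target without ever equalling it, so the buffer would keep growing and each iteration would get more expensive.</p>

```python
def regroup(lines, sep=';', n_cols=38):
    """Join physical lines until the buffer holds exactly n_cols separators.

    Returns (records, leftover), where leftover is the number of characters
    still in the buffer at the end. A large leftover suggests the count
    skipped past n_cols and the buffer was never reset (assumption to check).
    """
    records = []
    parts = []   # pieces of the current record, joined only when complete
    count = 0    # running separator count, so old text is never rescanned
    for line in lines:
        parts.append(line)
        count += line.count(sep)
        if count == n_cols:
            records.append(''.join(parts))
            parts = []
            count = 0
    leftover = sum(len(p) for p in parts)
    return records, leftover


if __name__ == '__main__':
    # Well-behaved input: the count reaches exactly 3 and the buffer resets.
    print(regroup(['a;b;', 'c;'], n_cols=3))
    # A stray separator makes the count go 2 -> 4, so nothing is ever emitted.
    print(regroup(['a;b;', 'c;;d'], n_cols=3))
```

<p>If <code>leftover</code> turns out to be large for the <code>string.printable</code> file and zero for the <code>string.ascii_letters</code> file, that would point at the reset condition rather than at <code>count</code> itself.</p>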
<p>The complete set of characters that occur in my real data is:</p>
<pre><code>charSet = [' ','"','&',"'",'(',')','*','+',',','-','.','/','0','1','2','3','4','5','6',
'7','8','9',':',';','<','=','>','A','B','C','D','E','F','G','H','I','J','K',
'L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','\\','_','`','a',
'b','d','e','g','h','i','l','m','n','o','r','s','t','x']
</code></pre>
<p>Let's do some benchmarks on the <code>count</code> function:</p>
<pre><code>import random as rd
import string

rd.seed()

# Each test builds a 10000-character random string and counts one character.
def Test0():
    randstr = ''.join(rd.choice(string.digits) for _ in range(10000))
    randstr.count('7')

def Test1():
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(10000))
    randstr.count('a')

def Test2():
    randstr = ''.join(rd.choice(string.printable) for _ in range(10000))
    randstr.count(';')

# charSet is the list of characters defined above.
def Test3():
    randstr = ''.join(rd.choice(charSet) for _ in range(10000))
    randstr.count(';')
</code></pre>
<p>I test only digits, only letters, the printable characters, and the character set from my data.</p>
<p>Results of <code>%timeit</code>:</p>
<pre><code>%timeit(Test0())
100 loops, best of 3: 9.27 ms per loop
%timeit(Test1())
100 loops, best of 3: 9.12 ms per loop
%timeit(Test2())
100 loops, best of 3: 9.94 ms per loop
%timeit(Test3())
100 loops, best of 3: 8.31 ms per loop
</code></pre>
<p>Performance is uniform and shows no problem with any particular character set.</p>
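<p>One caveat about the benchmark: each <code>Test</code> function also builds the 10000-character string inside the timed call, and that generation step probably dominates the measurement. A minimal sketch that times only <code>count</code> on a pre-built string could look like the following. The helper names <code>make_text</code> and <code>time_count</code> are hypothetical, and it uses the standard <code>timeit</code> module instead of the IPython <code>%timeit</code> magic.</p>

```python
import random
import string
import timeit


def make_text(alphabet, n=10000):
    """Build a random string of n characters drawn from alphabet."""
    return ''.join(random.choice(alphabet) for _ in range(n))


def time_count(text, char, repeats=1000):
    """Average seconds per text.count(char) call, string generation excluded."""
    total = timeit.timeit(lambda: text.count(char), number=repeats)
    return total / repeats


if __name__ == '__main__':
    cases = [
        ('digits', string.digits, '7'),
        ('letters', string.ascii_letters, 'a'),
        ('printable', string.printable, ';'),
    ]
    for name, alphabet, char in cases:
        text = make_text(alphabet)  # built once, outside the timed call
        print('%s: %.3g s per count' % (name, time_count(text, char)))
```

<p>If the per-call times stay similar across alphabets here as well, the character set alone is unlikely to explain the slowdown, and the length of the string being counted becomes the more interesting variable.</p>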
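<p>I also tested whether concatenating strings with <code>+</code> would cause the slowdown, but that does not seem to be the case either.</p>
<p>Can anyone explain this or give me some hints?</p>
<p>EDIT: Using Python 2.7.12</p>
<p>EDIT 2: In my original data the following happens:</p>
<p>The file has around 550000 lines, which are frequently broken by random newline characters, yet each record is always defined by 38 <code>";"</code> delimiters. Up to roughly line 300000 the performance is fast; from that line on it suddenly starts getting slower and slower. I am investigating this further with the new clues.</p>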