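<p>I'm trying to iterate over every feature in one file (one per line) and, for each feature, find all matching entries in a second file based on one column of that line. I have this solution, which does what I want on small files but is very slow on large ones (my files have 20,000,000 lines). <a href="https://gist.github.com/ethanagbaker/dc7bc62a413cbbc32bbf944125845afd" rel="nofollow">Here's a sample of the two input files.</a></p>
<p>My (slow) code:</p>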
<pre><code>FEATUREFILE = 'S2_STARRseq_rep1_vsControl_peaks.bed'
CONSERVATIONFILEDIR = './conservation/'
with open(str(FEATUREFILE),'r') as peakFile, open('featureConservation.td',"w+") as outfile:
    for line in peakFile.readlines():
        chrom = line.split('\t')[0]
        startPos = int(line.split('\t')[1])
        endPos = int(line.split('\t')[2])
        peakName = line.split('\t')[3]
        enrichVal = float(line.split('\t')[4])
        #Reject negative peak starts, if they exist (sometimes this can happen w/ MACS)
        if startPos > 0:
            with open(str(CONSERVATIONFILEDIR) + str(chrom)+'.bed','r') as conservationFile:
                cumulConserv = 0.
                n = 0
                for conservLine in conservationFile.readlines():
                    position = int(conservLine.split('\t')[1])
                    conservScore = float(conservLine.split('\t')[3])
                    if position >= startPos and position <= endPos:
                        cumulConserv += conservScore
                        n+=1
            featureConservation = cumulConserv/(n)
            outfile.write(str(chrom) + '\t' + str(startPos) + '\t' + str(endPos) + '\t' + str(peakName) + '\t' + str(enrichVal) + '\t' + str(featureConservation) + '\n')
</code></pre>
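<p>The code above re-reads an entire per-chromosome conservation file for every peak, which is probably where most of the time goes. Below is a minimal sketch of one possible alternative, not a tested solution: load each chromosome file once, keep sorted positions plus prefix sums of the scores, and use binary search to get each peak's mean. It assumes the same column layout as the code above (position in column 2, score in column 4) and that one chromosome's data fits in memory. The function names <code>load_conservation</code>, <code>mean_conservation</code>, and <code>annotate_peaks</code> are hypothetical, and peaks with no covered positions are written as <code>NA</code> instead of dividing by zero.</p>

```python
import bisect
import os
from itertools import accumulate


def load_conservation(path):
    """Read one per-chromosome file once.

    Returns sorted positions and a prefix-sum list of scores, where
    prefix[i] is the sum of the first i scores.
    Assumes column 2 is the position and column 4 is the score.
    """
    rows = []
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip('\n').split('\t')
            rows.append((int(fields[1]), float(fields[3])))
    rows.sort()  # harmless if the file is already sorted by position
    positions = [pos for pos, _ in rows]
    prefix = [0.0] + list(accumulate(score for _, score in rows))
    return positions, prefix


def mean_conservation(positions, prefix, start, end):
    """Mean score over positions in [start, end], or None if there are none."""
    lo = bisect.bisect_left(positions, start)
    hi = bisect.bisect_right(positions, end)
    n = hi - lo
    if n == 0:
        return None
    return (prefix[hi] - prefix[lo]) / n


def annotate_peaks(feature_path, conservation_dir, out_path):
    """Write each peak line with its mean conservation score appended."""
    cache = {}  # chrom -> (positions, prefix), loaded on first use
    with open(feature_path) as peaks, open(out_path, 'w') as out:
        for line in peaks:
            fields = line.rstrip('\n').split('\t')
            chrom, name = fields[0], fields[3]
            start, end = int(fields[1]), int(fields[2])
            enrich = float(fields[4])
            if start <= 0:  # same filter as the original code
                continue
            if chrom not in cache:
                cache[chrom] = load_conservation(
                    os.path.join(conservation_dir, chrom + '.bed'))
            mean = mean_conservation(*cache[chrom], start, end)
            value = 'NA' if mean is None else str(mean)
            out.write('\t'.join(
                [chrom, str(start), str(end), name, str(enrich), value]) + '\n')
```

<p>With the file names from the code above, this would be called as <code>annotate_peaks('S2_STARRseq_rep1_vsControl_peaks.bed', './conservation/', 'featureConservation.td')</code>. Each lookup is then two binary searches rather than a full file scan. Plain Python lists of 20,000,000 entries use a lot of memory, so a more compact container may be needed for the largest files.</p>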