Is there a faster way to find matching features in two arrays (Python)?

FEATUREFILE = 'S2_STARRseq_rep1_vsControl_peaks.bed'
CONSERVATIONFILEDIR = './conservation/'

with open(str(FEATUREFILE),'r') as peakFile, open('featureConservation.td',"w+") as outfile:
    for line in peakFile.readlines():
        chrom = line.split('\t')[0]
        startPos = int(line.split('\t')[1])
        endPos = int(line.split('\t')[2])
        peakName = line.split('\t')[3]
        enrichVal = float(line.split('\t')[4])
        #Reject negative peak starts, if they exist (sometimes this can happen w/ MACS)
        if startPos > 0:
            with open(str(CONSERVATIONFILEDIR) + str(chrom)+'.bed','r') as conservationFile:
                cumulConserv = 0.
                n = 0
                for conservLine in conservationFile.readlines():
                    position = int(conservLine.split('\t')[1])
                    conservScore = float(conservLine.split('\t')[3])
                    if position >= startPos and position <= endPos:
                        cumulConserv += conservScore
                        n+=1
            featureConservation = cumulConserv/(n)
            outfile.write(str(chrom) + '\t' + str(startPos) + '\t' + str(endPos) + '\t' + str(peakName) + '\t' + str(enrichVal) + '\t' + str(featureConservation) + '\n')

3 Answers

User

Answer 1 · edited 2024-10-01 09:42:08

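First, every time you read a line from peakFile you iterate over the entire conservationFile. Inserting a break after n+=1 inside the if statement should help, assuming there is only one match per peak.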
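A minimal sketch of that early exit, under an assumption the question does not state: each per-chromosome conservation file is sorted by position. With sorted input the scan can stop at the first position past the peak end instead of after the first match, so a peak that covers several positions is still averaged over all of them. The function name `mean_conservation` and the sample rows are hypothetical.

```python
def mean_conservation(conserv_lines, start_pos, end_pos):
    """Average the scores of positions inside [start_pos, end_pos].

    Assumes conserv_lines is sorted by position (second column), so the
    scan can stop as soon as a position lies past end_pos.
    """
    total = 0.0
    n = 0
    for line in conserv_lines:
        fields = line.rstrip('\n').split('\t')
        position = int(fields[1])
        if position > end_pos:
            break  # sorted input: no later line can match
        if position >= start_pos:
            total += float(fields[3])
            n += 1
    # Return None instead of dividing by zero when nothing matched
    return total / n if n else None


# Hypothetical rows in the 4-column layout the question reads
sample_lines = [
    "chr2L\t5\t6\t0.5\n",
    "chr2L\t10\t11\t1.0\n",
    "chr2L\t12\t13\t0.0\n",
    "chr2L\t50\t51\t0.9\n",
]
print(mean_conservation(sample_lines, 10, 20))
```

The function accepts any iterable of lines, so an open file object can be passed directly, which also avoids loading the whole file with readlines().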

Another option is to try mmap, which may help with buffering.
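For illustration, this is roughly what reading a conservation file through Python's standard `mmap` module could look like. The helper name `iter_mmap_lines` is hypothetical, and whether it is actually faster than plain buffered reads depends on the files and the system. It does not change the fact that the file is rescanned for every peak.

```python
import mmap


def iter_mmap_lines(path):
    """Yield decoded lines from a memory-mapped, read-only file."""
    with open(path, 'rb') as handle:
        # Length 0 maps the whole file; mmap raises ValueError on an empty file
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # readline() returns b'' at end of file, which ends the iteration
            for raw in iter(mapped.readline, b''):
                yield raw.decode('ascii').rstrip('\n')
```

Each yielded line can then be split on tabs exactly as in the question's inner loop.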

User

Answer 2 · edited 2024-10-01 09:42:08

Bedtools is designed for exactly this, in particular its intersect command:

http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
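As a rough sketch of how this could be driven from Python, the code below calls `bedtools intersect` with `-wa -wb`, which writes each peak row followed by each overlapping conservation row, and then averages the scores per peak. Several things here are assumptions rather than part of the answer: bedtools is installed and on PATH, a single conservation file is passed as `-b`, peak names are unique, and the helper names are hypothetical. Note that bedtools decides overlap on the BED intervals, which may not coincide exactly with the inclusive position check in the question.

```python
import shutil
import subprocess
from collections import defaultdict


def build_intersect_command(peaks_path, conservation_path):
    """Build a bedtools intersect call pairing each peak with overlapping rows."""
    return ['bedtools', 'intersect',
            '-a', peaks_path,         # 5-column peak file
            '-b', conservation_path,  # 4-column conservation file
            '-wa', '-wb']             # write the A row, then the B row


def mean_score_per_peak(intersect_output):
    """Average the conservation score (last column) per peak name (4th column)."""
    totals = defaultdict(lambda: [0.0, 0])
    for line in intersect_output.splitlines():
        if not line.strip():
            continue
        fields = line.split('\t')
        entry = totals[fields[3]]
        entry[0] += float(fields[-1])
        entry[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


def conservation_by_peak(peaks_path, conservation_path):
    """Run bedtools (if available) and return {peak name: mean score}."""
    if shutil.which('bedtools') is None:
        raise RuntimeError('bedtools is not on PATH')
    result = subprocess.run(
        build_intersect_command(peaks_path, conservation_path),
        capture_output=True, text=True, check=True)
    return mean_score_per_peak(result.stdout)
```

Peaks with no overlapping conservation row do not appear in the `-wa -wb` output, so they are simply absent from the returned dictionary.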

User

Answer 3 · edited 2024-10-01 09:42:08

For me, the best solution turned out to be rewriting the code above with pandas. Here is what worked for me on some very large files:

from __future__ import division
import pandas as pd

FEATUREFILE = 'S2_STARRseq_rep1_vsControl_peaks.bed'
CONSERVATIONFILEDIR = './conservation/'

peakDF = pd.read_csv(str(FEATUREFILE), sep='\t', header=None, names=['chrom','start','end','name','enrichmentVal'])
#Reject negative peak starts, if they exist (sometimes this can happen w/ MACS)
peakDF = peakDF[peakDF.start > 0].reset_index(drop=True)
peakDF['conservation'] = 1.0 #placeholder

chromNames = peakDF.chrom.unique()

for chromosome in chromNames:
    chromSubset = peakDF[peakDF.chrom == str(chromosome)]
    #Load each chromosome's conservation file once, not once per peak
    chromDF = pd.read_csv(str(CONSERVATIONFILEDIR) + str(chromosome)+'.bed', sep='\t', header=None, names=['chrom','start','end','conserveScore'])

    #This loop must run inside the chromosome loop so that each peak is
    #compared against its own chromosome's scores
    for idx in chromSubset.index:
        peakStart = chromSubset.at[idx, 'start']
        peakEnd = chromSubset.at[idx, 'end']
        featureSubset = chromDF[(chromDF.start >= peakStart) & (chromDF.start < peakEnd)]
        #Sum of scores inside the peak divided by the peak length
        featureConservation = float(featureSubset.conserveScore.sum()) / (peakEnd - peakStart)
        #.at replaces set_value, which newer pandas versions no longer provide
        peakDF.at[idx, 'conservation'] = featureConservation

peakDF.to_csv("featureConservation.td", sep='\t')
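The per-peak loop above still filters the whole conservation table once for every peak. One possible way to vectorize that step, sketched here and not part of the original answer, is to sort the conservation rows by start, build a prefix sum of the scores, and locate each peak's range with `numpy.searchsorted`. The function name `add_conservation` is hypothetical, and the sketch assumes one chromosome's table fits in memory and uses the same column names as the code above.

```python
import numpy as np
import pandas as pd


def add_conservation(peaks, conservation):
    """Return a copy of peaks with a 'conservation' column for one chromosome.

    Mirrors the loop above: the sum of scores whose start lies in
    [start, end), divided by the peak length (end - start).
    """
    cons = conservation.sort_values('start')
    positions = cons['start'].to_numpy()
    scores = cons['conserveScore'].to_numpy(dtype=float)
    # prefix[k] is the sum of the first k scores, so a range sum is a difference
    prefix = np.concatenate(([0.0], np.cumsum(scores)))
    # First row with start >= peak start, and first row with start >= peak end
    lo = np.searchsorted(positions, peaks['start'].to_numpy(), side='left')
    hi = np.searchsorted(positions, peaks['end'].to_numpy(), side='left')
    out = peaks.copy()
    lengths = (out['end'] - out['start']).to_numpy()
    out['conservation'] = (prefix[hi] - prefix[lo]) / lengths
    return out
```

It would be called once per chromosome with `chromSubset` and `chromDF`, and the results concatenated before writing the output file.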
