我有两个制表符分隔的文件。每个都有两列,一列是染色体,一列是染色体。我想确定file2中位于file1中位置的指定窗口范围内的位置数,然后签入下一个窗口,以此类推
如果这是文件1的第一行:
1 10500
这是文件2中的:
1 10177
1 10352
1 10616
1 11008
1 11012
1 13110
1 13116
1 13118
1 13273
1 13550
如果窗口大小为1000,则文件2中的以下坐标位于该窗口内:
1 10177
1 10352
1 10616
第二个窗口的大小应为2000,因此这些坐标位于该窗口内:
1 10177
1 10352
1 10616
1 11008
1 11012
我的文件是由染色体分裂,所以这需要考虑到也
我首先创建以下函数:
#Function counts number of variants in a window of specified size around a specified
#genomic position
def vCount(nsPos, df, windowSize):
#If the window minimum is below 0, set to 0
if(nsPos - (windowSize/2) < 0):
low = 0
else:
low = nsPos - (windowSize/2)
high = high = nsPos + (windowSize/2)
#calculate the length (i.e. the number) of variants in the subsetted data.
diversity = len(df[(df['POS'] >= low) & (df['POS'] <= high)])
return(diversity)
这将计算文件1中单个坐标的数字。但是,我要做的是计算每个坐标的数字,然后计算平均值。为此,我使用以下函数,它使用上述函数:
#Function to apply vCount function across the entire positional column of df
def windowAvg(NSdf, variantsDF, window):
x=NSdf.POS.apply(vCount, df=variantsDF, windowSize=window)
return(x)
这也很好,但正如我上面提到的,这需要在多个基因组窗口中完成,然后必须绘制平均值。因此,我创建了一个使用循环的函数,然后绘制结果:
def loopNplot(NSdf, variantsDF, window):
#store window
windows = list(range(1,101))
#store diversity
diversity = []
#Loop through windows appending to the list.
for i in range(window, 101000, window):
#Create a list to hold counts for each chromosome (which we then take the mean of)
tempList = []
#Loop through chromosomes
for j in NSdf['CHROM']:
#Subset NS polymorphisms
subDF = vtable.loc[(vtable['CHROM'] == j)].copy()
#Subset all variants
subVar = all_variants.loc[(all_variants['CHROM'] == j)].copy()
x = windowAvg(subDF, subVar, i)
#convert series to list
x = x.tolist()
#extend templist to include x
tempList.extend(x)
#Append mean of tempList - counts of diversity - to list.
diversity.append(sum(tempList)/len(tempList))
#Copy diversity
divCopy = list(diversity)
#Add a new first index of 0
diversity = [0] + diversity
#Remove last index
diversity = diversity[:-1]
#x - diversity to give number of variants in each window
div = list(map(operator.sub, divCopy, diversity))
plt.scatter(windows, div)
plt.show()
这里的问题是file1有11609行,file2有13758644行。运行上面的函数是非常缓慢的,我想知道是否有任何方法来优化或更改脚本,使事情更有效
如果Python不是实现这一点的最佳方法,我非常乐意使用其他命令行工具
目前没有回答
相关问题 更多 >
编程相关推荐