Python:计算指定窗口中的coorindate数

2024-10-01 19:20:04 发布

您现在位置:Python中文网/ 问答频道 /正文

我有两个制表符分隔的文件。每个都有两列,一列是染色体,一列是染色体。我想确定file2中位于file1中位置的指定窗口范围内的位置数,然后签入下一个窗口,以此类推

如果这是文件1的第一行:

1 10500

这是文件2中的:

1 10177
1 10352
1 10616
1 11008
1 11012
1 13110
1 13116
1 13118
1 13273
1 13550

如果窗口大小为1000,则文件2中的以下坐标位于该窗口内:

1 10177
1 10352
1 10616

第二个窗口的大小应为2000,因此这些坐标位于该窗口内:

1 10177
1 10352
1 10616
1 11008
1 11012

我的文件是由染色体分裂,所以这需要考虑到也

我首先创建以下函数:

#Function counts number of variants in a window of specified size around a specified
#genomic position
def vCount(nsPos, df, windowSize):
    #If the window minimum is below 0, set to 0
    if(nsPos - (windowSize/2) < 0): 
        low = 0
    else:
        low = nsPos - (windowSize/2)

    high = high = nsPos + (windowSize/2)
    #calculate the length (i.e. the number) of variants in the subsetted data.
    diversity = len(df[(df['POS'] >= low) & (df['POS'] <= high)])

    return(diversity)

这将计算文件1中单个坐标的数字。但是,我要做的是计算每个坐标的数字,然后计算平均值。为此,我使用以下函数,它使用上述函数:

#Function to apply vCount function across the entire positional column of df
def windowAvg(NSdf, variantsDF, window):
    x=NSdf.POS.apply(vCount, df=variantsDF, windowSize=window)
    return(x)

这也很好,但正如我上面提到的,这需要在多个基因组窗口中完成,然后必须绘制平均值。因此,我创建了一个使用循环的函数,然后绘制结果:

def loopNplot(NSdf, variantsDF, window):
    #store window
    windows = list(range(1,101))
    #store diversity
    diversity = []

    #Loop through windows appending to the list.
    for i in range(window, 101000, window):

        #Create a list to hold counts for each chromosome (which we then take the mean of)
        tempList = []

        #Loop through chromosomes
        for j in NSdf['CHROM']:
            #Subset NS polymorphisms
            subDF = vtable.loc[(vtable['CHROM'] == j)].copy()
            #Subset all variants
            subVar = all_variants.loc[(all_variants['CHROM'] == j)].copy()

            x = windowAvg(subDF, subVar, i)
            #convert series to list
            x = x.tolist()
            #extend templist to include x
            tempList.extend(x)

        #Append mean of tempList - counts of diversity - to list.
        diversity.append(sum(tempList)/len(tempList))


    #Copy diversity
    divCopy = list(diversity)
    #Add a new first index of 0
    diversity = [0] + diversity
    #Remove last index
    diversity = diversity[:-1]
    #x - diversity to give number of variants in each window
    div = list(map(operator.sub, divCopy, diversity))

    plt.scatter(windows, div)

    plt.show()

这里的问题是file1有11609行,file2有13758644行。运行上面的函数是非常缓慢的,我想知道是否有任何方法来优化或更改脚本,使事情更有效

如果Python不是实现这一点的最佳方法,我非常乐意使用其他命令行工具


Tags: 文件oftheto函数indfwindow

热门问题