Python:计算指定窗口中的coorindate数

2024-10-01 19:20:04 发布

您现在位置：Python中文网/ 问答频道 /正文

8186

网友

男 | 程序猿一只，喜欢编程写python代码。

我有两个制表符分隔的文件。每个都有两列，一列是染色体，一列是染色体。我想确定file2中位于file1中位置的指定窗口范围内的位置数，然后签入下一个窗口，以此类推

如果这是文件1的第一行：

1 10500

这是文件2中的：

如果窗口大小为1000，则文件2中的以下坐标位于该窗口内：

1 10177
1 10352
1 10616

第二个窗口的大小应为2000，因此这些坐标位于该窗口内：

我的文件是由染色体分裂，所以这需要考虑到也

我首先创建以下函数：

#Function counts number of variants in a window of specified size around a specified
#genomic position
def vCount(nsPos, df, windowSize):
    #If the window minimum is below 0, set to 0
    if(nsPos - (windowSize/2) < 0): 
        low = 0
    else:
        low = nsPos - (windowSize/2)

    high = high = nsPos + (windowSize/2)
    #calculate the length (i.e. the number) of variants in the subsetted data.
    diversity = len(df[(df['POS'] >= low) & (df['POS'] <= high)])

    return(diversity)

这将计算文件1中单个坐标的数字。但是，我要做的是计算每个坐标的数字，然后计算平均值。为此，我使用以下函数，它使用上述函数：

#Function to apply vCount function across the entire positional column of df
def windowAvg(NSdf, variantsDF, window):
    x=NSdf.POS.apply(vCount, df=variantsDF, windowSize=window)
    return(x)

这也很好，但正如我上面提到的，这需要在多个基因组窗口中完成，然后必须绘制平均值。因此，我创建了一个使用循环的函数，然后绘制结果：

def loopNplot(NSdf, variantsDF, window):
    #store window
    windows = list(range(1,101))
    #store diversity
    diversity = []

    #Loop through windows appending to the list.
    for i in range(window, 101000, window):

        #Create a list to hold counts for each chromosome (which we then take the mean of)
        tempList = []

        #Loop through chromosomes
        for j in NSdf['CHROM']:
            #Subset NS polymorphisms
            subDF = vtable.loc[(vtable['CHROM'] == j)].copy()
            #Subset all variants
            subVar = all_variants.loc[(all_variants['CHROM'] == j)].copy()

            x = windowAvg(subDF, subVar, i)
            #convert series to list
            x = x.tolist()
            #extend templist to include x
            tempList.extend(x)

        #Append mean of tempList - counts of diversity - to list.
        diversity.append(sum(tempList)/len(tempList))


    #Copy diversity
    divCopy = list(diversity)
    #Add a new first index of 0
    diversity = [0] + diversity
    #Remove last index
    diversity = diversity[:-1]
    #x - diversity to give number of variants in each window
    div = list(map(operator.sub, divCopy, diversity))

    plt.scatter(windows, div)

    plt.show()

这里的问题是file1有11609行，file2有13758644行。运行上面的函数是非常缓慢的，我想知道是否有任何方法来优化或更改脚本，使事情更有效

如果Python不是实现这一点的最佳方法，我非常乐意使用其他命令行工具

Tags：文件 of the to 函数 in df window

0条回答

目前没有回答

Python:计算指定窗口中的coorindate数

相关问题更多 >

编程相关推荐

热门问题

热门文章

Python:计算指定窗口中的coorindate数

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >