How to fill in missing sequence rows in a TSV file

Posted 2024-10-04 09:30:38


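I'm still a beginner, so apologies if this question has an obvious answer, and apologies for the messy code, but my files have tens of thousands of lines. I'm sliding over my files with a specific window-based technique, so I need to make sure every window is present. However, some of my input files are missing certain lines, so I'm trying to write Python code that adds those lines, with the information I want, to make the files complete. The code looks like this: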

#!/usr/bin/env python

outfile = open ("missing_test.txt", "w")

with open("add_missing.txt", "r") as file:
    last_line = 0   #This is where it starts for bin 1
    lines = []
    header_line = next(file)
    outfile.write(header_line)
    CHROM = 'BABA_1'
    for line in file:     #go through every line to check its existence and rewrite to new file
        nums = line.split("\t")
        num1 = nums[0]        #no integer because this is a string: name individual
        num2 = int(nums[1])   #integer for window
        num3 = int(nums[2])   #integer for coverage (here always 10000 to meet threshold)
        num4 = int(nums[3])   #integer for SNP count   
        if num1 == CHROM:     #
            while num2 != last_line + 10000:
                #A line is missing, so a new line is added with 0 SNPs:
                NUM2 = last_line + 10000   # New window, the one that was missing
                NUM4 = 0   #0 SNPs found
                #lines.append((num1, NUM2, num3, NUM4))
                OUTLINE = "%s\t%s\t%s\t%s" % (num1, NUM2, num3, NUM4) #write new line to outfile       
                outfile.write(OUTLINE + "\n")
                last_line += 10000
            lines.append((num1,num2,num3,num4))
            last_line += 10000    #also add 10000 here otherwise the while loop makes no sense
            outline = "%s\t%s\t%s\t%s" % (num1, num2, num3, num4)
            outfile.write(outline + "\n")   #write all existing lines to outfile

        else:
            CHROM = num1
            last_line = 0

outfile.close()        

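This works fine as long as the first window of the first "CHROM" equals 0, but that is not always the case. When it isn't, the loop runs forever. For example, the input and desired output look like this: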

Input:

indiv   window  coverage    SNP
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  80000   10000   1
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 80000   10000   9

Desired output:

indiv   window  coverage    SNP
BABA_1  10000   10000   0
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  40000   10000   0
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  70000   10000   0
BABA_1  80000   10000   1
BABA_10 10000   10000   0
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 40000   10000   0
BABA_10 50000   10000   0
BABA_10 60000   10000   0
BABA_10 70000   10000   0
BABA_10 80000   10000   9

I've been struggling to get this while loop to work without running forever, but I really don't see my mistake. Can anyone tell me how to fix this?

Any help is much appreciated, thanks in advance!


2 Answers

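You could use the following approach, which first builds a list of empty entries, then assigns any existing entries into it, and finally writes them out as rows: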

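import csv
import itertools

# Python 3: open csv files in text mode with newline='' (not 'rb'/'wb')
with open('add_missing.txt', newline='') as f_input, open('missing_test.txt', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter='\t', skipinitialspace=True)
    csv_output = csv.writer(f_output, delimiter='\t')
    csv_output.writerow(next(csv_input))    # copy the header row

    # Group consecutive rows by individual (first column)
    for k, g in itertools.groupby(csv_input, lambda x: x[0]):
        # Placeholder rows for windows 10000..80000 with 0 SNPs
        empty = [[k, x * 10000, 10000, 0] for x in range(1, 9)]
        for row in g:
            # Integer division: list indices must be ints in Python 3
            empty[int(row[1]) // 10000 - 1] = row

        csv_output.writerows(empty)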

This gives you:

indiv   window  coverage    SNP
BABA_1  10000   10000   0
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  40000   10000   0
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  70000   10000   0
BABA_1  80000   10000   1
BABA_10 10000   10000   0
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 40000   10000   0
BABA_10 50000   10000   0
BABA_10 60000   10000   0
BABA_10 70000   10000   0
BABA_10 80000   10000   9
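
The code above assumes exactly eight windows per individual (`range(1, 9)`). If the number of windows varies, one possible variation is to fill up to the largest window actually seen for each individual. The following is only a sketch under some assumptions: the function name `fill_missing_windows` and its `step` and `coverage` parameters are made up for illustration, rows for one individual are assumed to be adjacent, and every window is assumed to be an exact multiple of `step` (other values would be dropped).

```python
import csv
import io
import itertools


def fill_missing_windows(rows, step=10000, coverage=10000):
    """Yield one row per window, from `step` up to each individual's largest window.

    Hypothetical helper: `rows` is an iterable of [indiv, window, coverage, snp]
    lists of strings, already grouped by individual.
    """
    for indiv, group in itertools.groupby(rows, key=lambda r: r[0]):
        # Index the existing rows of this individual by window number
        seen = {int(r[1]): r for r in group}
        for window in range(step, max(seen) + step, step):
            # Keep the existing row, or emit a placeholder with 0 SNPs
            yield seen.get(window, [indiv, str(window), str(coverage), "0"])


# Small in-memory sample instead of a real file
sample = (
    "indiv\twindow\tcoverage\tSNP\n"
    "BABA_1\t20000\t10000\t7\n"
    "BABA_1\t40000\t10000\t2\n"
    "BABA_10\t30000\t10000\t9\n"
)
reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)
filled = list(fill_missing_windows(reader))

print("\t".join(header))
for row in filled:
    print("\t".join(row))
```

To use it on the real files, the `io.StringIO` object could be replaced with the opened input file and the rows passed to a `csv.writer`, as in the code above.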

Try something along these lines:

#!/usr/bin/python

outfile = open ("missing_test.txt", "w")

def write_line(indiv, window, coverage, snp):
    outline = "%s\t%s\t%s\t%s\n" % (indiv, window, coverage, snp)
    outfile.write(outline)

with open("add_missing.txt", "r") as file:
    lines = file.readlines()
    write_line(*lines.pop(0).rstrip().split("\t"))
    first_line = lines[0].split("\t")
    last_indiv = first_line[0]
    last_window = int(first_line[1])

    for line in lines:
        indiv, window, coverage, snp = line.split("\t")
        window = int(window)
        coverage = int(coverage)
        snp = int(snp)

        if indiv == last_indiv:
            # If the current window is higher than expected,
            # insert a line with the missing window.
            # Repeat until we get to the expected window.
            while window > last_window + 10000:
                write_line(indiv, last_window + 10000, coverage, 0)
                last_window += 10000
            last_window = window
        else:
            last_indiv = indiv
            last_window = window
        write_line(indiv, window, coverage, snp)

outfile.close()

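What it does not include is any expectation that a particular window number comes first for a given indiv, because you didn't define that behavior, and your comments on it are rather confusing.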


missing_test.txt after running this script:

indiv   window  coverage    SNP
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  40000   10000   0
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  70000   10000   0
BABA_1  80000   10000   1
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 40000   10000   0
BABA_10 50000   10000   0
BABA_10 60000   10000   0
BABA_10 70000   10000   0
BABA_10 80000   10000   9
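
If every individual should start at window 10000, as in the desired output from the question, one way to get there is to reset the last seen window to 0 whenever the individual changes. The sketch below is not the code from either answer; `fill_gaps` and its `step` parameter are hypothetical names, and it assumes tab-separated lines with four columns and a header line. It compares with `<` instead of `!=`, so the inner loop always terminates even when a window is out of order or not a multiple of `step`.

```python
def fill_gaps(lines, step=10000):
    """Yield TSV lines, inserting zero-SNP lines for every skipped window.

    Hypothetical helper: `lines` is an iterable of text lines whose first
    line is the header.
    """
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return  # empty input, nothing to do
    yield header

    last_indiv = None
    last_window = 0
    for line in lines:
        indiv, window, coverage, snp = line.rstrip("\n").split("\t")
        window = int(window)
        if indiv != last_indiv:
            # New individual: restart so that windows are filled from `step`
            last_indiv, last_window = indiv, 0
        # Emit a placeholder for each window between the last one and this one
        while last_window + step < window:
            last_window += step
            yield "%s\t%d\t%s\t0\n" % (indiv, last_window, coverage)
        last_window = window
        yield "%s\t%d\t%s\t%s\n" % (indiv, window, coverage, snp)


# The input from the question, as in-memory lines
sample = [
    "indiv\twindow\tcoverage\tSNP\n",
    "BABA_1\t20000\t10000\t7\n",
    "BABA_1\t30000\t10000\t1\n",
    "BABA_1\t50000\t10000\t2\n",
    "BABA_1\t60000\t10000\t3\n",
    "BABA_1\t80000\t10000\t1\n",
    "BABA_10\t20000\t10000\t1\n",
    "BABA_10\t30000\t10000\t16\n",
    "BABA_10\t80000\t10000\t9\n",
]
result = list(fill_gaps(sample))
print("".join(result), end="")
```

Because `fill_gaps` is a generator, it could also be fed an open file object directly and its output passed to `outfile.writelines(...)`, so the whole file never has to be held in memory.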
