根据特定列的len()提取文本行

2024-10-02 18:17:58 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在尝试编写一个简单的脚本,从VCF文件中提取特定的数据,该文件显示基因组序列中的变体。你知道吗

脚本需要从文件中提取头文件,以及snv,同时省略任何indel。变量显示在两列中,ALT和REF。每列用空格隔开。Indels在ALT或REF中有2个字符,SNVs总是有1个字符。你知道吗

到目前为止,我提取的是标题(总是以###开头),而不是任何变量数据。你知道吗

original_file = open('/home/user/Documents/NA12878.vcf', 'r')
extracted_file = open('NA12878_SNV.txt', 'w+')

for line in original_file:
   if '##' in line:
       extracted_file.write(line)

# Extract SNVs while omitting indels 
# Indels will have multiple entries in the REF or ALT column
# The ALT and REF columns appear at position 4 & 5 respectively

for line in original_file:
    ref = str.split()[3]
    alt = str.split()[4]
    if len(ref) == 1 and len(alt) == 1:
        extracted_file.write(line)

original_file.close()
extracted_file.close()

Tags: 文件数据in脚本reflineopenalt
2条回答
original_file = open('NA12878.vcf', 'r')
extracted_file = open('NA12878_SNV.txt', 'w+')
i=0

for line in original_file:
    if '##' in line:
        extracted_file.write(line)
    else:
        ref = line.split('  ')[3]
        alt = line.split('  ')[4]
        if len(ref) == 1 and len(alt) == 1:
            extracted_file.write(line)

# Extract SNVs while omitting indels 
# Indels will have multiple entries in the REF or ALT column
# The ALT and REF columns appear at position 4 & 5 respectively

original_file.close()
extracted_file.close()

有两个问题:

  1. 第二个循环永远不会执行,因为您已经到达了第一个循环中VCF文件的末尾。您可以看到here如何在同一文本文件上重新开始新循环。你知道吗
  2. 您没有正确分隔行,因为它是制表符分隔的。你知道吗

所以我将代码设置为只使用一个循环执行,并将tab作为split参数。你知道吗

Adirmola给出的答案很好,但是您可以通过应用一些修改来提高代码质量:

# Use "with" context managers to open files.
# The closing will be automatic, even in case of problems.
with open("NA12878.vcf", "r") as vcf_file, \
        open("NA12878_SNV.txt", "w") as snv_file:
    for line in vcf_file:
        # Since you have specific knowledge of the format
        # you can be more specific when identifying header lines
        if line[:2] == "##":
            snv_file.write(line)
        else:
            # You can avoid doing the splitting twice
            # with list unpacking (using _ for ignored fields)
            # https://www.python.org/dev/peps/pep-3132/
            [_, _, _, ref, alt, *_] = line.split("\t")  # "\t" represents a tab
            if len(ref) == 1 and len(alt) == 1:
                snv_file.write(line)

我在你的文件上用python3.6测试了这个,最终得到了554个snv。 这里使用的一些语法(特别是对于列表解包)可能不适用于较旧的python版本。你知道吗

相关问题 更多 >