使用Python迭代两个大列表

2024-05-31 08:40:20 发布

您现在位置:Python中文网/ 问答频道 /正文

我有两个文件都是制表符分隔的。其中一个文件几乎是800k行,它是一个Exonic坐标文件,另一个文件几乎是200k行(它是一个VCF文件)。你知道吗

我正在用python编写一个代码来查找和过滤VCF中位于exonic坐标(Exon Start和End from exonic coordinates File)内的位置,并将其写入一个文件。你知道吗

但是,因为文件很大,所以需要几天才能得到过滤后的输出文件?你知道吗

所以下面的代码部分解决了速度问题,但问题是要弄清楚,是为了加快过滤过程,这就是为什么我用中断退出第二个循环,我想从外循环的开始开始,而不是从第一个循环(外循环)取下一个元素?你知道吗

这是我的密码:

import
import sys
list_coord = []
with open('ref_ordered.txt', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
                 list_coord.append((row[0],row[1],row[2]))

    def parseVcf(vcf,src):
        done = False
        with open(vcf,'r') as f:
                    reader=csv.reader((f),delimiter='\t')
                    vcf_out_split = vcf.split('.')
                    vcf_out_split.insert(2,"output_CORRECT2")
                    outpt = open('.'.join(vcf_out_split),'a')
                    for coord in list_coord:
                            for row in reader:
                                   if '#' not in row[0]:
                                            coor_genom = int(row[1])
                                            coor_exon1 = int(coord[1])+1
                                            coor_exon2 = int(coord[2])
                                            coor_genom_chr = row[0]
                                            coor_exon_chr = coord[0]
                                            ComH = row[7].split(';')
                                            for x in ComH:
                                               if 'DP4=' in x:
                                                 DP4_split=x[4:].split(',')
                                                 if (coor_exon1 <= coor_genom <= coor_exon2):
                                                    if (coor_genom_chr == coor_exon_chr):
                                                       if ((int(DP4_split[2]) >= 1 and int(DP4_split[3]) >= 1)):
                                                         done = True

                                                         outpt.write('\t'.join(row) + '\n')

                                            if done:
                                                    break
                    outpt.close()
for root,dirs,files in os.walk("."):
    for file in files:
      pathname=os.path.join(root,file)
      if file.find("1_1")==0:
        print "Parsing " + file
        parseVcf(pathname, "1_1")

参考_已订购.txt地址:

1   69090   70008
1   367658  368597
1   621095  622034
1   861321  861393
1   865534  865716
1   866418  866469
1   871151  871276
1   874419  874509

1\u 1输入文件:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT     directory
1   14907   rs79585140  A   G   20  .   DP=10;VDB=5.226464e-02;RPB=-6.206015e-01;AF1=0.5;AC1=1;DP4=1,2,5,2;MQ=32;FQ=20.5;PV4=0.5,0.07,0.16,0.33;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.312;CADD=1.415;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=61 GT:PL:GQ    0/1:50,0,51:50
1   14930   rs75454623  A   G   44  .   DP=9;VDB=7.907652e-02;RPB=3.960091e-01;AF1=0.5;AC1=1;DP4=1,2,6,0;MQ=41;FQ=30.9;PV4=0.083,1,0.085,1;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.000;CG=-1.440;CADD=1.241;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=38  GT:PL:GQ    0/1:74,0,58:61
1   15211   rs78601809  T   G   9.33    .   DP=6;VDB=9.014600e-02;RPB=-8.217058e-01;AF1=1;AC1=2;DP4=1,0,3,2;MQ=21;FQ=-37;PV4=1,0.35,1,1;DN=131;DA=T/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.145;CADD=1.611;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=171    GT:PL:GQ    1/1:41,10,0:13
1   16146   .   A   C   25  .   DP=10;VDB=2.063840e-02;RPB=-2.186229e+00;AF1=0.5;AC1=1;DP4=7,0,3,0;MQ=39;FQ=27.8;PV4=1,0.0029,1,0.0086;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=unknown;CP=0.001;CG=-0.555;CADD=2.158;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=197 GT:PL:GQ    0/1:55,0,68:58
1   16257   rs78588380  G   C   40  .   DP=18;VDB=9.421102e-03;RPB=-1.327486e+00;AF1=0.5;AC1=1;DP4=3,11,4,0;MQ=50;FQ=43;PV4=0.011,1,1,1;DN=131;DA=G/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-2.500;CADD=0.359;AA=G;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=308   GT:PL:GQ    0/1:70,0,249:73
1   16378   rs148220436 T   C   39  .   DP=7;VDB=2.063840e-02;RPB=-9.980746e-01;AF1=0.5;AC1=1;DP4=0,4,0,3;MQ=50;FQ=42;PV4=1,0.45,1,1;DN=134;DA=T/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.016;CG=-2.880;CADD=0.699;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-cluster;DSP=227    GT:PL:GQ    0/1:69,0,90:72

输出文件:

1   877831  rs6672356   T   C   44.8    .   DP=2;VDB=6.720000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;DN=116;DA=T/C;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=missense,missense,missense,missense,missense,intron;FD=unknown;AAC=TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,none;PP=343/682,343/715,328/667,327/666,234/573,NA;CDP=1027,1027,982,979,700,NA;GS=101,101,101,101,101,NA;PH=0;CP=0.994;CG=2.510;CADD=0.132;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-cluster,by-1000G;DSP=38;CPG=875731-878363;GESP=C:8470/T:0;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ    1/1:76,6,0:10
1   878000  .   C   T   44.8    .   DP=2;VDB=7.520000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=synonymous,synonymous,synonymous,synonymous,synonymous,intron;FD=unknown;AAC=LEU,LEU,LEU,LEU,LEU,none;PP=376/682,376/715,361/667,360/666,267/573,NA;CDP=1126,1126,1081,1078,799,NA;CP=0.986;CG=3.890;CADD=2.735;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DSP=62;CPG=875731-878363;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA    GT:PL:GQ    1/1:76,6,0:10
1   881627  rs2272757   G   A   205 .   DP=9;VDB=1.301207e-01;AF1=1;AC1=2;DP4=0,0,5,4;MQ=50;FQ=-54;DN=100;DA=G/A;GM=NM_015658.3,XM_005244739.1;GL=NOC2L;FG=synonymous;FD=synonymous-codon,unknown;AAC=LEU;PP=615/750,615/755;CDP=1843;CP=0.082;CG=5.170;CADD=0.335;AA=G;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-frequency,by-cluster,by-1000G;DSP=40;GESP=A:6174/G:6830;PAC=NP_056473.2,XP_005244796.1   GT:PL:GQ    1/1:238,27,0:51

Tags: 文件byrowdpsplitmqxmfq
2条回答

首先,我没有包括任何代码,因为它看起来像我的家庭作业(我有这样的家庭作业)。不过,我将尝试解释我为改进脚本所采取的步骤,尽管我知道我的解决方案还远远不够完美。你知道吗

您的脚本可能会很慢,因为对于csv文件中的每一行,您都要打开、写入和关闭输出文件。尝试列出要添加到输出文件中的行,在完成读取和筛选后,开始写入。你知道吗

您可能还需要考虑为每个过滤器编写函数,并以行作为变量调用这些函数。这样以后就可以很容易地添加过滤器了。我使用一个计数器来跟踪成功的过滤器的数量,如果最后counter == len(amountOfUsedFilers)我将我的行添加到列表中。你知道吗

另外,你为什么要使用outpt = open('.'.join(vcf_out_split),'a')with open(vcf,'r') as f:来尝试在你的选择中保持一致和聪明。你知道吗

生物信息学的胜利!你知道吗

如果你的两个文件都是有序的,你可以通过并行地遍历它们来节省大量的时间,总是在坐标最低的那个文件上前进。这样,您将只处理每一行一次,而不是多次。你知道吗

下面是您的代码的基本版本,它只执行坐标检查(我不完全理解您的DP4条件,所以我将留给您重新添加该部分):

with open(coords_fn) as coords_f, open(vcf_fn) as vcf_f, open(out_fn) as out_f:
    coords = csv.reader(coords_f, delimiter="\t")
    vcf = csv.reader(vcf_f, delimiter="\t")
    out = csv.writer(out_f, delimiter="\t")

    next(vcf) # discard header row, or use out.writeline(next(vcf)) to preserve it!

    try:
        c = next(coords)
        r = next(vcf)

        while True:
            if int(c[1]) >= int(r[1]):   # vcf file is behind
                r = next(vcf)
            elif int(c[2]) < int(r[1]):  # coords file is behind
                c = next(coords)
            else:                        # int(c[1]) < int(r[1]) <= int(c[2])
                out.writeline(r)  # add DP4 check here, and indent this line under it

                r = next(vcf)     # don't indent this line
    except StopIteration: # one of the files has ended
        pass

相关问题 更多 >