Grouping and sorting a file in Python

Posted 2024-10-02 18:19:50


I have a tab-delimited file like this:

gene_name               length
Traes_3AS_4F141FD24.2   24.8    
Traes_4AL_A00EF17B2.1   0.0 
Traes_4AL_A00EF17B2.1   0.9 
Traes_4BS_6943FED4B.1   4.5 
Traes_4BS_6943FED4B.1   42.9    
UCW_Tt-k25_contig_29046 0.4 
UCW_Tt-k25_contig_29046 2.8
UCW_Tt-k25_contig_29046 11.4    
UCW_Tt-k25_contig_29046 12.3    
UCW_Tt-k25_contig_29046 14.4
UCW_Tt-k25_contig_29046 14.2    
UCW_Tt-k25_contig_29046 19.6    
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 21.1    
UCW_Tt-k25_contig_29046 23.7    
UCW_Tt-k25_contig_29046 23.7

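I need to group the rows by gene name and split the file into 3 files: 1) genes whose name is unique; 2) groups in which the difference in length between genes is > 10; 3) groups in which the difference in length between genes is < 10. Here is my attempt: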

from itertools import groupby

def iter_hits(hits):
    for i in range(1,len(hits)):
        (p, c) = hits[i-1], hits[i]
        yield p, c

def is_overlap(hits):
    for p, c in iter_hits(hits):
        if c[1] - p[1] > 10:
            return True

fh = open('my_file','r')
oh1 = open('a', 'w')
oh2 = open('b', 'w')
oh3 = open('c', 'w')

for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[1]= float(hsp[1])
        hits.append(hsp)
    hits.sort(key=lambda x: x[1])
    if len(hits)==1:
        oh = oh3
    elif is_overlap(hits):
        oh = oh1
    else:
        oh = oh2

    for hit in hits:
        oh.write('\t'.join([str(f) for f in hit])+'\n')

The output I need is:

c)
Traes_3AS_4F141FD24.2   24.8

b)
Traes_4AL_A00EF17B2.1   0.0
Traes_4AL_A00EF17B2.1   0.9

a)
Traes_4BS_6943FED4B.1   4.5
Traes_4BS_6943FED4B.1   42.9    
UCW_Tt-k25_contig_29046 0.4 
UCW_Tt-k25_contig_29046 2.8
UCW_Tt-k25_contig_29046 11.4    
UCW_Tt-k25_contig_29046 12.3    
UCW_Tt-k25_contig_29046 14.4
UCW_Tt-k25_contig_29046 14.2    
UCW_Tt-k25_contig_29046 19.6    
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 21.1    
UCW_Tt-k25_contig_29046 23.7    
UCW_Tt-k25_contig_29046 23.7

Also, I apologize for the long question; I could not have explained it well otherwise.


2 answers

Your data appears to be in sorted order already, so you only need to compare the first and last float values of each group:

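from itertools import groupby

# File names follow the question's desired output:
# 'a' = difference > 10, 'b' = difference not above 10, 'c' = unique gene names.
with open('c', 'w') as uniq, open('b', 'w') as lt, open('a', 'w') as gt:
    with open("foo.txt") as f:
        next(f)  # skip the header line
        # groupby only merges adjacent lines, so equal names must be contiguous
        for _, v in groupby(f, lambda x: x.split(None, 1)[0]):
            v = list(v)
            if len(v) == 1:
                uniq.write(v[0])
                continue
            # Difference between the last and first length in the group
            diff = float(v[-1].split(None, 1)[1]) - float(v[0].split(None, 1)[1])
            if diff > 10:
                gt.writelines(v)
            else:
                # A difference of exactly 10 is treated as small here
                lt.writelines(v)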
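
If the lengths within a group are not guaranteed to be sorted (in the sample, 14.4 appears before 14.2), comparing the minimum and maximum of each group is a safer variant of the same idea. The following is only a minimal sketch: the function name `classify_groups`, the `threshold` parameter, and the in-memory list of lines are assumptions made for illustration, and it still assumes that rows with the same gene name are adjacent and that there are no blank lines.

```python
from itertools import groupby

def classify_groups(lines, threshold=10.0):
    """Split data lines into (unique, small_spread, large_spread) lists.

    Hypothetical helper: assumes rows sharing a gene name are adjacent,
    because groupby only merges consecutive items with equal keys.
    """
    unique, small, large = [], [], []
    for _, grp in groupby(lines, key=lambda line: line.split(None, 1)[0]):
        rows = list(grp)
        lengths = [float(row.split()[1]) for row in rows]
        if len(rows) == 1:
            unique.extend(rows)
        elif max(lengths) - min(lengths) > threshold:
            large.extend(rows)
        else:
            # A spread of exactly `threshold` lands here; adjust if needed.
            small.extend(rows)
    return unique, small, large

# A few rows from the question, with one group deliberately unsorted.
sample = [
    "Traes_3AS_4F141FD24.2\t24.8\n",
    "Traes_4AL_A00EF17B2.1\t0.0\n",
    "Traes_4AL_A00EF17B2.1\t0.9\n",
    "Traes_4BS_6943FED4B.1\t4.5\n",
    "Traes_4BS_6943FED4B.1\t42.9\n",
    "UCW_Tt-k25_contig_29046\t14.4\n",
    "UCW_Tt-k25_contig_29046\t14.2\n",
    "UCW_Tt-k25_contig_29046\t0.4\n",
]
unique, small, large = classify_groups(sample)
```

Each returned list can then be passed to `writelines` on the corresponding output file.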

If your goal is:

I need all genes the lengths of which have differences more than 10 to be in a file, i.e 23.7-0.4 > 10 so it should be in a file.

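then in is_overlap(hits) you only need to check the difference between the last and the first element. Because you sort the hits by their second field before calling this function, the last element is the largest and the first is the smallest.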

So you can do:

def is_overlap(hits):
    # hits is sorted by length, so the first and last entries hold the min and max
    return hits[-1][1] - hits[0][1] > 10
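
To see why the adjacent-pair check in the question misses some groups, here is a small comparison on the UCW_Tt-k25_contig_29046 lengths from the sample: no two neighbouring values differ by more than 10, yet the overall range does. The helper names `adjacent_gap_exceeds` and `range_exceeds` are hypothetical and exist only for this illustration.

```python
def adjacent_gap_exceeds(values, threshold=10):
    """True if any two neighbouring sorted values differ by more than threshold."""
    ordered = sorted(values)
    return any(b - a > threshold for a, b in zip(ordered, ordered[1:]))

def range_exceeds(values, threshold=10):
    """True if the largest and smallest values differ by more than threshold."""
    return max(values) - min(values) > threshold

# Lengths of the UCW_Tt-k25_contig_29046 group from the question.
ucw = [0.4, 2.8, 11.4, 12.3, 14.4, 14.2, 19.6, 19.6, 21.1, 23.7, 23.7]

adjacent = adjacent_gap_exceeds(ucw)  # largest neighbouring gap is 2.8 -> 11.4
overall = range_exceeds(ucw)          # 23.7 - 0.4 is well above 10
print(adjacent, overall)
```

The neighbour-by-neighbour check reports no gap above 10 for this group, while the first-versus-last comparison places it in the "> 10" file as the question requires.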
