Grouping and sorting a file in Python

gene_name length
Traes_3AS_4F141FD24.2 24.8
Traes_4AL_A00EF17B2.1 0.0
Traes_4AL_A00EF17B2.1 0.9
Traes_4BS_6943FED4B.1 4.5
Traes_4BS_6943FED4B.1 42.9
UCW_Tt-k25_contig_29046 0.4
UCW_Tt-k25_contig_29046 2.8
UCW_Tt-k25_contig_29046 11.4
UCW_Tt-k25_contig_29046 12.3
UCW_Tt-k25_contig_29046 14.4
UCW_Tt-k25_contig_29046 14.2
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 21.1
UCW_Tt-k25_contig_29046 23.7
UCW_Tt-k25_contig_29046 23.7

from itertools import groupby

def iter_hits(hits):
    for i in range(1,len(hits)):
        (p, c) = hits[i-1], hits[i]
        yield p, c

def is_overlap(hits):
    for p, c in iter_hits(hits):
        if c[1] - p[1] > 10:
            return True

fh = open('my_file','r')
oh1 = open('a', 'w')
oh2 = open('b', 'w')
oh3 = open('c', 'w')

for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[1]= float(hsp[1])
        hits.append(hsp)
    hits.sort(key=lambda x: x[1])
    if len(hits)==1:
        oh = oh3
    elif is_overlap(hits):
        oh = oh1
    else:
        oh = oh2
    for hit in hits:
        oh.write('\t'.join([str(f) for f in hit])+'\n')

c)
Traes_3AS_4F141FD24.2 24.8

b)
Traes_4AL_A00EF17B2.1 0.0
Traes_4AL_A00EF17B2.1 0.9

a)
Traes_4BS_6943FED4B.1 4.5
Traes_4BS_6943FED4B.1 42.9
UCW_Tt-k25_contig_29046 0.4
UCW_Tt-k25_contig_29046 2.8
UCW_Tt-k25_contig_29046 11.4
UCW_Tt-k25_contig_29046 12.3
UCW_Tt-k25_contig_29046 14.4
UCW_Tt-k25_contig_29046 14.2
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 21.1
UCW_Tt-k25_contig_29046 23.7
UCW_Tt-k25_contig_29046 23.7

2 answers

User

#1 · edited 2024-10-02 18:19:50

Your data appears to be already sorted, so you only need to compare the first and last float values of each group:

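from itertools import groupby

# File names follow the question: 'a' = difference > 10, 'b' = difference <= 10, 'c' = single-line genes.
with open('c', 'w') as uniq, open('b', 'w') as lt, open('a', 'w') as gt:
    with open("foo.txt") as f:
        next(f)  # skip the "gene_name length" header line
        for _, v in groupby(f, lambda x: x.split(None, 1)[0]):
            v = list(v)
            if len(v) == 1:
                uniq.write(v[0])
                continue
            # Lines are assumed sorted by length within each gene,
            # so last minus first is the largest difference in the group.
            diff = float(v[-1].split(None, 1)[1]) - float(v[0].split(None, 1)[1])
            if diff > 10:
                gt.writelines(v)
            else:
                # Also covers a difference of exactly 10.
                lt.writelines(v)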
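
If the lengths within a gene are not guaranteed to be in ascending order (the sample has 14.4 listed before 14.2), comparing the first and last lines may not give the true spread. A minimal sketch that uses min() and max() instead is shown below. The function name classify_groups, the bucket names, and the threshold parameter are illustrative assumptions, not part of the original answer, and the lines for each gene are still assumed to be contiguous.

```python
from itertools import groupby

def classify_groups(lines, threshold=10.0):
    """Split 'name length' lines into single / narrow / wide buckets."""
    buckets = {"single": [], "narrow": [], "wide": []}
    # groupby only merges adjacent lines, so each gene's lines must be contiguous.
    for _, grp in groupby(lines, key=lambda line: line.split(None, 1)[0]):
        grp = list(grp)
        lengths = [float(line.split()[1]) for line in grp]
        if len(grp) == 1:
            key = "single"
        elif max(lengths) - min(lengths) > threshold:
            key = "wide"
        else:
            key = "narrow"
        buckets[key].extend(grp)
    return buckets

sample = [
    "Traes_3AS_4F141FD24.2 24.8",
    "Traes_4AL_A00EF17B2.1 0.0",
    "Traes_4AL_A00EF17B2.1 0.9",
    "Traes_4BS_6943FED4B.1 4.5",
    "Traes_4BS_6943FED4B.1 42.9",
]
result = classify_groups(sample)
```

Each bucket can then be written to its own file with writelines, after adding newlines if the input lines were stripped.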

User

#2 · edited 2024-10-02 18:19:50

If your goal is:

I need all genes the lengths of which have differences more than 10 to be in a file, i.e 23.7-0.4 > 10 so it should be in a file.

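then in is_overlap(hits) you only need to check the difference between the last element and the first element. Because you sort the hits by their second field before calling this function, the last element holds the largest value and the first element holds the smallest.

So you can write: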

def is_overlap(hits):
    # hits must already be sorted by their second field (the length).
    return hits[-1][1] - hits[0][1] > 10
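
To see why this differs from the check in the question, the sketch below runs both on the UCW_Tt-k25_contig_29046 values from the sample data. The name has_adjacent_gap and the threshold parameter are assumptions introduced here for illustration; has_adjacent_gap is meant to mirror the question's neighbour-by-neighbour comparison.

```python
def has_adjacent_gap(hits, threshold=10):
    """Question's approach: True if any two neighbouring values differ by more than threshold."""
    return any(c[1] - p[1] > threshold for p, c in zip(hits, hits[1:]))

def is_overlap(hits, threshold=10):
    """Suggested approach: True if largest minus smallest exceeds threshold (hits sorted by value)."""
    return hits[-1][1] - hits[0][1] > threshold

values = [0.4, 2.8, 11.4, 12.3, 14.4, 14.2, 19.6, 19.6, 21.1, 23.7, 23.7]
# Build [name, length] pairs and sort by length, as the question's loop does.
hits = sorted((["UCW_Tt-k25_contig_29046", v] for v in values), key=lambda h: h[1])

adjacent = has_adjacent_gap(hits)  # no neighbouring pair is more than 10 apart
spread = is_overlap(hits)          # 23.7 - 0.4 is more than 10
```

The neighbour-by-neighbour check misses this gene because its values rise in small steps, while the first-versus-last check catches it.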
