Find the common elements in two files and merge them into one file

Posted 2024-09-28 23:16:11


I have a tab-delimited file, file1:

marker1 transcript0 scaff1 1 24
marker2 transcript1 scaff2 1 53
marker3 transcript1 scaff2 1 53
marker4 transcript2 scaff3 1 89
marker5 transcript2 scaff3 1 89
marker6 transcript2 scaff3 1 89

and file2:

contig1 transcript1 scaff2 1 53
contig2 transcript1 scaff2 1 53
contig3 transcript1 scaff2 1 53
contig4 transcript2 scaff3 1 89

The output file I want is:

transcript1 marker2 contig1 scaff2 1 53
transcript1 marker3 contig2 scaff2 1 53
transcript1 0       contig3 scaff2 1 53
transcript2 marker4 contig4 scaff3 1 89
transcript2 marker5 0       scaff3 1 89
transcript2 marker6 0       scaff3 1 89

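Basically, I need to merge the two files wherever they share a transcript (column 2). The two files have different lengths. I tried using a dictionary and joining the lines with commas, but that did not work well. Can you give me some pointers or ideas on how to do this in Python? I tried join: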

 join -1 2 -2 2 file1 file2

and this code:

f1=open('file1','r')
f2=open('file2','r')
output = open('common','w')

dictA= dict()
for line1 in f1:
    listA = line1.rstrip('\n').split('\t')
    dictA[listA[1]] = listA

for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    scaff=new_list[2]
    chrom=new_list[3]
    cm=new_list[4]
    if subject in dictA:
        listA = dictA[subject]
        output.write(subject+'\t'+query+'\t'+str(listA[0])+'\t'+str(listA[1])+'\t'+str(listA[2])+'\t'+str(listA[3])+'\t'+chrom+'\t'+cm+'\t'+scaff+'\n')
output.close()

1 answer
User
Answer 1 · posted 2024-09-28 23:16:11

How about this (Python 3):

from collections import defaultdict
from itertools import zip_longest

with open('file1') as f1, open('file2') as f2, open('common', 'w') as fout:
    # transcript -> trailing columns (scaffold and the two numeric fields)
    remainder = {}

    # transcript -> markers from file1, in file order
    markers = defaultdict(list)
    for line in f1:
        fields = line.split()
        markers[fields[1]].append(fields[0])
        remainder[fields[1]] = fields[2:]

    # transcript -> contigs from file2, in file order
    contigs = defaultdict(list)
    for line in f2:
        fields = line.split()
        contigs[fields[1]].append(fields[0])
        remainder[fields[1]] = fields[2:]

    # Every transcript seen in either file
    transcripts = sorted(set(markers) | set(contigs))
    for transcript in transcripts:
        rest = remainder[transcript]
        # Pair markers with contigs by position; pad the shorter list with '0'
        zipped = zip_longest(markers[transcript], contigs[transcript],
                             fillvalue='0')
        for marker, contig in zipped:
            # Write tab-separated rows to the output file 'common'
            print(transcript, marker, contig, *rest, sep='\t', file=fout)

Output:

transcript0 marker1 0       scaff1 1 24
transcript1 marker2 contig1 scaff2 1 53
transcript1 marker3 contig2 scaff2 1 53
transcript1 0       contig3 scaff2 1 53
transcript2 marker4 contig4 scaff3 1 89
transcript2 marker5 0       scaff3 1 89
transcript2 marker6 0       scaff3 1 89
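
Note that the output above includes transcript0, which appears only in file1, while the desired output in the question lists only transcripts found in both files. If that is what is wanted, one possible variation is to take the intersection of the keys instead of the union. The following is only a sketch under that assumption; the function name `merge_common` and the `fill` parameter are made up for illustration and are not part of the original answer.

```python
from collections import defaultdict
from itertools import zip_longest


def merge_common(lines1, lines2, fill='0'):
    """Merge two tables on column 2, keeping only keys present in both.

    lines1 / lines2 are iterables of whitespace-separated lines
    (open file objects work). Returns a list of rows, each a list of strings.
    """
    rest = {}                      # transcript -> trailing columns
    markers = defaultdict(list)    # transcript -> names from the first table
    contigs = defaultdict(list)    # transcript -> names from the second table

    for names, lines in ((markers, lines1), (contigs, lines2)):
        for line in lines:
            fields = line.split()
            if not fields:         # skip blank lines
                continue
            names[fields[1]].append(fields[0])
            rest[fields[1]] = fields[2:]

    rows = []
    # Intersection of the keys: transcripts seen in both tables
    for transcript in sorted(markers.keys() & contigs.keys()):
        pairs = zip_longest(markers[transcript], contigs[transcript],
                            fillvalue=fill)
        for marker, contig in pairs:
            rows.append([transcript, marker, contig, *rest[transcript]])
    return rows
```

To write the result, pass the two open files to `merge_common` and emit each row with `print(*row, sep='\t', file=fout)`. As in the answer above, markers and contigs are paired purely by their order in the files, so this assumes that positional pairing is acceptable.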
