Is there a way to remove duplicate strings within a string based on a pattern?

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

from itertools import groupby

data = (k.rstrip().split("=Cluster=") for k in open("f_input.txt", 'r'))
final = list(k for k,_ in groupby(list(data)))

with open("new_file.txt", 'a') as f:
    for k in final:
        if k == ['','']:
            f.write("=Cluster=\n")
        elif k == ['']:
            f.write("\n\n")
        else:
            f.write("{}\n".join(k))

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

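3 answers

User

#1 · edited 2024-09-29 00:14:29

The shortest solution in Python :p (it simply shells out to awk, which drops every line that is identical to the line immediately before it):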

import os
os.system("""awk 'line != $0; { line = $0 }' originalfile.txt > dedup.txt""")

Output:

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

(If you are on Windows, awk can be installed easily with Gow.)
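For readers who would rather not depend on awk, the same idea can be sketched in plain Python. This is only an illustration of the adjacent-duplicate filter, not part of the answer above; the function name `drop_adjacent_duplicates` and the sample list are made up for the example.

```python
def drop_adjacent_duplicates(lines):
    """Yield each line unless it equals the line immediately before it."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


# Small in-memory sample taken from the last cluster of the question.
sample = [
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n",
]
deduplicated = list(drop_adjacent_duplicates(sample))
print("".join(deduplicated), end="")
```

Because the function is a generator, it can also be fed an open file object and written out with `writelines`, so only one line is held in memory at a time. Like the awk one-liner, it only catches duplicates that sit directly next to each other.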

User

#2 · edited 2024-09-29 00:14:29

This is how I would do it.

file_in = r'someFile.txt'   
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()
    for line in f_in:
        if '=Cluster=' in line or line.strip() == '':
            seen_spectra = set()
            f_out.write(line)
        else:
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                f_out.write(line)
                seen_spectra.add(new_spectrum)

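This is not a groupby solution, but it is one that is easy to follow and debug. As you mentioned in the comments, your file is 16 GB, so loading it into memory is probably not the best idea.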
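When debugging this approach, it may help to look at what the key expression actually extracts. The sketch below isolates it in a helper and compares it with one possible alternative key, the whole identifier field. Both helper names are hypothetical, and whether two lines that differ only in the PRD prefix should count as duplicates is something only the asker can decide.

```python
def spectrum_number(line):
    """Key used in the answer: the number after the last '='."""
    return line.rstrip().split('=')[-1].split()[0]


def full_identifier(line):
    """Alternative key (an assumption): the whole second field of the line."""
    return line.split()[1]


# Two lines from the question that share a spectrum number
# but differ in the PRD prefix.
a = "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n"
b = "SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n"

print(spectrum_number(a), spectrum_number(b))
print(full_identifier(a) == full_identifier(b))
```

With the number-only key the two lines are treated as the same spectrum, while the full identifier keeps them apart. Swapping the key expression in the loop is a one-line change either way.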

EDIT: "Each cluster has a specific spectrum. It is not possible to have one spec in one cluster and the same in another"

file_in = r'someFile.txt'   
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()
    for line in f_in:
        if line.startswith('SPEC'):
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                seen_spectra.add(new_spectrum)      
                f_out.write(line)          
        else:
            f_out.write(line)
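The difference between the two variants is only the scope of the set: the first resets it at every cluster, the second keeps one set for the whole file. A small sketch can make that visible before running either on the real file. The `dedupe` function and its `reset_per_cluster` flag are hypothetical, and the last sample line is invented to show what happens if the quoted guarantee does not hold.

```python
def dedupe(lines, reset_per_cluster):
    """Yield lines, skipping SPEC lines whose spectrum number was already seen."""
    seen = set()
    for line in lines:
        if line.startswith("SPEC"):
            spectrum = line.rstrip().split("=")[-1].split()[0]
            if spectrum in seen:
                continue
            seen.add(spectrum)
        elif reset_per_cluster:
            # Marker or blank line: start a fresh set for the next cluster.
            seen = set()
        yield line


sample = [
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n",
    "=Cluster=\n",
    # Invented line: same spectrum number, different file, different cluster.
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=3785 true\n",
]
per_cluster = list(dedupe(sample, reset_per_cluster=True))
whole_file = list(dedupe(sample, reset_per_cluster=False))
print(len(per_cluster), len(whole_file))  # 4 3
```

The whole-file variant drops the invented line in the second cluster, so it is only safe if a spectrum number really never appears in more than one cluster.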

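User

#3 · edited 2024-09-29 00:14:29

This opens the file containing the original data, along with a new file to which the unique lines of each group are written.

seen is a set, which is well suited to checking whether something is already in it.

data is a list and keeps track of the "=Cluster=" groups to iterate over.

Then you simply look at each line of each group (named i within data).

If the line is not yet in seen, it is added.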

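import re

with open("input file", 'r') as in_file, open("output file", 'w') as out_file:
    # Split the text into groups, each starting at a "=Cluster=" marker
    # (splitting on an empty match needs Python 3.7+).
    # Note: this reads the whole file into memory.
    data = re.split(r"(?==Cluster=)", in_file.read())
    for i in data:
        seen = set()  # reset for every group
        for line in i.splitlines(keepends=True):
            key = line.strip()
            if key and key in seen:
                continue  # duplicate within this group
            seen.add(key)
            out_file.write(line)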

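Edit: moved seen = set() inside for i in data so that the set is reset for every group; otherwise "=Cluster=" would always already be in it and would not be printed for each group in data.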
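Since data holds every group at once, a very large file may not fit in memory. One possible variation, sketched below under the assumption that each group begins with a line starting with "=Cluster=", builds the groups lazily so that only one group is held at a time. The names `iter_clusters` and `unique_lines` are made up for this example.

```python
def iter_clusters(lines, marker="=Cluster="):
    """Yield one list of lines per group; a new group starts at each marker line."""
    group = []
    for line in lines:
        if line.startswith(marker) and group:
            yield group
            group = []
        group.append(line)
    if group:
        yield group


def unique_lines(group):
    """Drop repeated non-blank lines within one group, keeping first-seen order."""
    seen = set()
    result = []
    for line in group:
        key = line.strip()
        if key and key in seen:
            continue
        seen.add(key)
        result.append(line)
    return result


sample = [
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n",
    "\n",
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n",
]
result = [line for group in iter_clusters(sample) for line in unique_lines(group)]
print("".join(result), end="")
```

An open file object can be passed in place of the sample list, and each deduplicated group can be written out with `writelines` as soon as it is produced.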
