<p>I would do it like this.</p>
<pre><code>file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()
    for line in f_in:
        # A cluster header or a blank line starts a new group: reset the set.
        if '=Cluster=' in line or line.strip() == '':
            seen_spectra = set()
            f_out.write(line)
        else:
            # Key: the first token after the last '=' on the line.
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                f_out.write(line)
                seen_spectra.add(new_spectrum)
</code></pre>
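<p>This is not a <code>groupby</code> solution, but it is one that is easy to follow and debug. As you mentioned in the comments, your file is 16 GB, so loading it into memory is probably not the best idea. The loop reads one line at a time, so only the set of spectra seen so far is held in memory.</p>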
<blockquote>
<p>EDIT: <em>"Each cluster has a specific spectrum. It is not possible to have one spec in one cluster and the same in another"</em></p>
</blockquote>
<pre><code>file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    # The set is never reset, so duplicates are dropped across the whole file.
    seen_spectra = set()
    for line in f_in:
        if line.startswith('SPEC'):
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                seen_spectra.add(new_spectrum)
                f_out.write(line)
        else:
            # Cluster headers, blank lines and anything else pass through.
            f_out.write(line)
</code></pre>
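<p>To try the logic without touching a 16 GB file, the same filtering can be wrapped in a generator and run on a few lines held in a list. This is only a sketch: the names <code>spectrum_key</code>, <code>dedupe_spectra</code> and <code>per_cluster</code> are made up for illustration, and the sample lines are an assumed layout, not taken from your real file.</p>

```python
def spectrum_key(line):
    """Return the first whitespace-separated token after the last '='."""
    return line.rstrip().split('=')[-1].split()[0]


def dedupe_spectra(lines, per_cluster=False):
    """Yield lines, skipping SPEC lines whose key was already seen.

    With per_cluster=True the seen-set is reset at every cluster header
    or blank line; otherwise duplicates are dropped across all lines.
    """
    seen = set()
    for line in lines:
        if line.startswith('SPEC'):
            key = spectrum_key(line)
            if key in seen:
                continue  # duplicate spectrum: drop the line
            seen.add(key)
        elif per_cluster and ('=Cluster=' in line or not line.strip()):
            seen.clear()  # a new cluster starts: forget earlier spectra
        yield line


# Hypothetical sample data; the real file layout may differ.
sample_lines = [
    "=Cluster=\n",
    "SPEC\tspectrum=1\ttrue\n",
    "SPEC\tspectrum=1\ttrue\n",
    "SPEC\tspectrum=2\ttrue\n",
    "\n",
    "=Cluster=\n",
    "SPEC\tspectrum=1\ttrue\n",
    "SPEC\tspectrum=3\ttrue\n",
]

global_result = list(dedupe_spectra(sample_lines))
cluster_result = list(dedupe_spectra(sample_lines, per_cluster=True))
```

<p>Because a file object is itself an iterable of lines, the same generator should also work on the real file: <code>f_out.writelines(dedupe_spectra(f_in))</code> inside the <code>with</code> block above still processes one line at a time.</p>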