Check multiple TSV files and remove all identical rows from each TSV in Python

with open('file2.tsv') as check_file:
    check_set = set([row.split('\t')[0].strip().upper() for row in check_file])

with open('file1.tsv', 'r') as in_file, open('file3.tsv', 'w') as out_file:
    for line in in_file:
        if line.split('\t')[0].strip().upper() in check_set:
            out_file.write(line)

1 answer

User

#1 · Posted 2024-10-06 11:29:45

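You first need to read in all of the TSV files and count how many times each pair of first-two-column values occurs. Python's collections.Counter (a dictionary subclass) can be used for this.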

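As each row is read in, store it in the data dictionary, where the key is the filename and the value is a list of tuples holding the first two column values together with the original row. Using collections.defaultdict means you do not have to create an entry yourself when it does not exist yet before appending to it.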
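To make the two containers concrete, here is a minimal sketch of how they behave. The keys and the row shown are made-up sample values, not data from the question.

```python
from collections import Counter, defaultdict

counts = Counter()
counts[("a", "1")] += 1  # a missing key starts at 0, so no KeyError is raised
counts[("a", "1")] += 1
counts[("b", "2")] += 1

data = defaultdict(list)
# The empty list for 'file1.tsv' is created automatically on first access.
data["file1.tsv"].append((("a", "1"), "a\t1\tfoo\n"))

print(counts[("a", "1")])  # pair seen twice
print(counts[("c", "3")])  # pair never seen: Counter reports 0
print(data["file1.tsv"])
```

Tuples are used as keys because they are hashable, so a pair of column values can be counted as a single unit.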

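Once everything has been read in, counts can be used to decide whether any given row's key was seen exactly once across all files. Rows whose key was seen more than once are skipped.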

from collections import Counter, defaultdict

counts = Counter()      # hold counts of each first two value pairs
data = defaultdict(list)  # hold all data from all files

for tsv in ['file1.tsv', 'file2.tsv', 'file3.tsv']:
    with open(tsv) as f_tsv:
        for row in f_tsv:
            split = list(map(str.strip, row.split('\t')))
            key = tuple(split[:2])  # first and second column values
            counts[key] += 1
            data[tsv].append((key, row))

for tsv, key_rows in data.items():
    with open('x' + tsv, 'w') as f_tsv:
        for key, row in key_rows:
            if counts[key] == 1:
                f_tsv.write(row)

I suggest adding print() statements to get a better understanding of what each variable holds, for example print(counts) and print(data).
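If you would rather experiment without touching any files, the same idea can be tried on in-memory rows. This is only a sketch: the helper name unique_rows and the sample rows are hypothetical and are not part of the answer's code.

```python
from collections import Counter


def unique_rows(files):
    """Hypothetical helper: files maps a name to a list of tab-separated lines."""
    counts = Counter()
    keyed = {}
    for name, rows in files.items():
        keyed[name] = []
        for row in rows:
            # Key on the first two column values, stripped of whitespace.
            key = tuple(part.strip() for part in row.split('\t')[:2])
            counts[key] += 1
            keyed[name].append((key, row))
    # Keep only rows whose key occurred exactly once across all inputs.
    return {
        name: [row for key, row in key_rows if counts[key] == 1]
        for name, key_rows in keyed.items()
    }


# Made-up sample data: the pair ('a', '1') appears in both "files".
sample = {
    'file1.tsv': ['a\t1\tx\n', 'b\t2\ty\n'],
    'file2.tsv': ['a\t1\tz\n', 'c\t3\tw\n'],
}
result = unique_rows(sample)
print(result)
```

Note that with this counting scheme a pair repeated inside a single file is also treated as a duplicate, just as in the file-based version.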

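Note: remove the 'x' + when you are ready. It is there so that the output is written to slightly different filenames, which avoids overwriting the original files while testing.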
