Fixing Python code to change a text file's format

chr1 37091 37122 D00645:305:CCVLRANXX:1:1104:21074:48301 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1104:4580:50451 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1106:13064:5974 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1106:16735:48726 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:2210:5043:83540 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:2204:15744:24410 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:2204:19627:73060 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:2206:8497:68295 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1312:11371:24672 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1312:17050:42431 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1312:12969:62696 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1312:6478:73521 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1312:8402:80222 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1309:19837:15007 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1309:20126:89687 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1310:2838:27860 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1310:7280:85906 0 -
chr1 54832 54863 D00645:305:CCVLRANXX:1:2102:19886:3949 0 -
chr1 74307 74338 D00645:305:CCVLRANXX:1:2203:13233:29983 0 -
chr1 74325 74356 D00645:305:CCVLRANXX:1:1310:7266:92995 0 -
chr1 93529 93560 D00645:305:CCVLRANXX:1:1103:1743:29602 0 +
chr1 93529 93560 D00645:305:CCVLRANXX:1:1101:16098:97354 0 +

infile = open('infile.txt', 'rb')
content = []
for i in infile:
    content.append(i.split())
final = []
for j in range(len(content)):
    if content[j] == content[j-1]:
        final.append(content[j])
with open('outfile.txt','w') as f:
    for sublist in final:
        for item in sublist:
            f.write(item + '\t')
        f.write('\n')

3 Answers

User

#1 · Edited 2024-09-28 05:16:17

You can also use pandas, which makes the solution very simple:

Just read the large txt file into a pandas DataFrame, like this:

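import pandas as pd
# header=None: the file has no header row, so the columns are labelled 0, 1, 2, ...
df = pd.read_csv('infile.txt', sep=' ', header=None)
df.groupby([0,1,2]).size()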

This should give you:

chr1 37091 37122     17
     74325 74356      1
     93529 93560      2

Let me know if this helps.
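The grouping step can be tried without touching any file. The following is a minimal sketch, assuming pandas is installed: the in-memory sample, the shortened read names (`readA`, `readB`, `readC`) and the column names are illustrative assumptions and do not come from the original answer.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for infile.txt
sample = io.StringIO(
    "chr1 37091 37122 readA 0 -\n"
    "chr1 37091 37122 readB 0 -\n"
    "chr1 54832 54863 readC 0 -\n"
)

# Column names are assumptions, chosen only for readability
cols = ["chrom", "start", "end", "name", "score", "strand"]
df = pd.read_csv(sample, sep=r"\s+", header=None, names=cols)

# size() counts the rows in each group, regardless of the other columns
counts = df.groupby(["chrom", "start", "end"]).size().reset_index(name="count")
print(counts.to_string(index=False))
```

Using `sep=r"\s+"` accepts both spaces and tabs, which may be convenient if the real file turns out to be tab-separated.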

User

#2 · Edited 2024-09-28 05:16:17

You can use Counter like this:

from collections import Counter

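content = []
# open in text mode so that split() yields str rather than bytes
with open('infile.txt') as infile:
    for i in infile:
        # append only first 3 columns as one line string
        content.append('  '.join(i.split()[:3]))

# Counter is a dict subclass mapping each string to its count
c = Counter(content)

# most_common() with no argument returns every entry, most frequent first
elements = c.most_common()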

with open('outfile.txt','w') as f:
    for item, freq in elements:
        f.write('{}\t{}\n'.format(item, freq))
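The same counting idea can be checked on a few in-memory lines. This is a minimal sketch, not the answer's exact code: the sample lines and read names are hypothetical, and it uses a tuple of the first three columns as the key instead of a joined string.

```python
from collections import Counter

# Hypothetical sample lines standing in for the contents of infile.txt
lines = [
    "chr1 37091 37122 readA 0 -",
    "chr1 37091 37122 readB 0 -",
    "chr1 93529 93560 readC 0 +",
    "",
]

# Key on the (chrom, start, end) tuple and skip blank lines
regions = Counter(tuple(line.split()[:3]) for line in lines if line.strip())

# most_common() yields (key, count) pairs, most frequent first
for (chrom, start, end), freq in regions.most_common():
    print(chrom, start, end, freq, sep="\t")
```

A tuple key keeps the three fields separate, so they can be unpacked again when writing the output.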

User

#3 · Edited 2024-09-28 05:16:17

You can use a regular dictionary, with the fields being compared as the key:

infile = 'infile.txt'
content = {}

with open(infile, 'r') as fin:
    for line in fin:
        temp = line.split()
        if not temp[1]+temp[2] in content:
            content[temp[1]+temp[2]] = [1, temp[0:3]]
        else:
            content[temp[1]+temp[2]][0]+=1

with open('outfile.txt','w') as fout:
    for key, value in content.items():
        for entry in value[1]:
            fout.write(entry + ' ')
        fout.write(str(value[0]) + '\n')

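The key is the concatenation of the second and third columns. The value is a list: its first element is the counter, and its second element is the list of values from the input file that will be written to the output. The if checks whether an entry with the given key already exists: if it does, the counter is incremented; if not, a new list is created with the counter set to 1 and the appropriate values as the list part.

Note that, for consistency, the program uses the recommended with open in both cases. It also does not read the txt file in binary mode.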
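One caveat with a concatenated string key is that different coordinate pairs can produce the same string (for example "11" + "123" and "111" + "23" both give "11123"), and the chromosome is not part of the key. The following is a minimal sketch of a variant that uses a tuple of the first three columns instead. The function name `count_regions` and the sample lines are hypothetical and not part of the original answer.

```python
def count_regions(lines):
    """Count lines that share the same first three columns.

    Returns a dict keyed by (chrom, start, end) tuples; since Python 3.7
    a dict keeps keys in first-seen order.
    """
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        key = tuple(fields[:3])
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample: the first two regions would collide under "start + end"
sample = [
    "chr1 11 123 readA 0 -",
    "chr1 111 23 readB 0 -",
    "chr1 11 123 readC 0 +",
]

for key, n in count_regions(sample).items():
    print(*key, n)
```

The function takes any iterable of lines, so an open file object can be passed to it directly.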
