Counting duplicate pairs in a file with a constraint

D000001 D000001 1975
D000001 D000001 1976
D000001 D002413 1976
D000001 D002413 1979
D000001 D002413 1987
D000001 D004298 1976
D000002 D000002 1985
D000003 D000900 1975
D000003 D000900 1990
D000003 D004134 1983
D000003 D004134 1986

3 answers

User

#1 · edited 2024-10-01 02:21:48

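Because the file is large, you should not use an in-memory dictionary to manage the data. Read the source file and write the results straight to the target file. You only need three variables:

one to store the current tuple, a second to store the count, and a third to store the highest value. When the tuple changes, write the values to the output file and continue.

This has a very small memory footprint and can handle extremely large files as well. Of course, it only works because the tuples are sorted.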
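A minimal sketch of the three-variable approach described above, assuming whitespace-separated "key1 key2 year" rows that are already sorted by the first two fields. The function name summarize_sorted_pairs and the output format are illustrative and not part of the answer. It tracks the highest year, as this answer states; swapping max for min would give the earliest year instead, which is what the other answers compute.

```python
import io


def summarize_sorted_pairs(src, dst):
    """Stream rows from src to dst, writing one summary line per key pair."""
    current = None   # the (key1, key2) tuple currently being counted
    count = 0        # number of rows seen for the current tuple
    highest = None   # highest year seen for the current tuple

    for line in src:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        key, year = (parts[0], parts[1]), int(parts[2])
        if key != current:
            # the tuple changed: flush the finished group, then start a new one
            if current is not None:
                dst.write(f"{current[0]} {current[1]} {count} {highest}\n")
            current, count, highest = key, 0, year
        count += 1
        highest = max(highest, year)

    if current is not None:  # flush the final group
        dst.write(f"{current[0]} {current[1]} {count} {highest}\n")


# In-memory stand-ins for the source and target files (hypothetical data).
sample = io.StringIO(
    "D000001 D000001 1975\n"
    "D000001 D000001 1976\n"
    "D000001 D002413 1976\n"
)
out = io.StringIO()
summarize_sorted_pairs(sample, out)
print(out.getvalue(), end="")
```

With real data, src and dst would be file objects returned by open(), so only one row and the three variables are held in memory at any time.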

User

#2 · edited 2024-10-01 02:21:48

The groupby and generators way:

import csv
from itertools import groupby

def count_duplicate(it):
    # group by the first two fields
    groups = groupby(it, lambda line: line[:2])
    # this produces (key, group) pairs, where each group is an iterator over
    # ['field0', 'field1', year] rows whose field0 and field1 strings are
    # the same
    # the min_and_count function reduces such a group to a (count, min) pair
    def min_and_count(group):
        i, min_year = 0, 99999
        for _, _, year in group:
            i += 1
            min_year = year if year < min_year else min_year
        return (i, min_year)

    yield from map(lambda x: x[0] + [min_and_count(x[1])], groups)


with open("test.srt") as fp:
    # this reads the lines lazily and filters out empty lines
    lines = filter(bool, csv.reader(fp, delimiter=' '))
    # convert the last value to an integer (still lazily)
    lines = map(lambda line: [line[0], line[1], int(line[2])], lines)
    # write the result to another file
    with open("result_file", "w") as rf:
        for record in count_duplicate(lines):
            rf.write(str(record) + '\n')

NB: this is a Python 3.x solution, where filter and map return iterators instead of lists as they do in Python 2.x.
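A quick way to confirm the laziness this note relies on, assuming Python 3. The sample rows are made up for illustration; nothing is parsed until a value is requested from the chain.

```python
from collections.abc import Iterator

# Hypothetical rows, including an empty one like a blank line from csv.reader.
rows = iter([["a", "b", "1975"], [], ["a", "b", "1976"]])

non_empty = filter(bool, rows)  # drops the empty row, lazily
parsed = map(lambda r: [r[0], r[1], int(r[2])], non_empty)

# In Python 3 both objects are iterators, not lists.
is_lazy = isinstance(non_empty, Iterator) and isinstance(parsed, Iterator)

# Only now is the first row actually read and converted.
first = next(parsed)
print(is_lazy, first)
```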

User

#3 · edited 2024-10-01 02:21:48

Solution:

#!/usr/bin/env python


def readdata(filename):
    last = []  # [field0, field1, min_year] of the group being counted
    count = 0

    with open(filename, "r") as fd:
        for line in fd:
            tokens = line.strip().split()
            if not tokens:
                continue  # skip blank lines
            tokens[2] = int(tokens[2])

            if not last:
                last = tokens

            if tokens[:2] != last[:2]:
                # the pair changed: emit the finished group, start a new one
                yield last[:2], count, last[2]
                last = tokens
                count = 1
            else:
                count += 1

            # keep the smallest year seen so far for the current group
            last[2] = min(tokens[2], last[2])

        if last:  # an empty file yields nothing
            yield last[:2], count, last[2]


with open("output.txt", "w") as fd:
    for words, count, year in readdata("data.txt"):
        fd.write(
            "{0:s} {1:s} ({2:d} {3:d})\n".format(
                words[0], words[1], count, year
            )
        )

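Output (for the sample rows shown in the question):

D000001 D000001 (2 1975)
D000001 D002413 (3 1976)
D000001 D004298 (1 1976)
D000002 D000002 (1 1985)
D000003 D000900 (2 1975)
D000003 D004134 (2 1983)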

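Discussion:

It reads and processes the data iteratively (Python 2.x), so it does not read everything into memory, which allows it to process very large data files.
As long as the input data is sorted, no complex data structure is needed either. We only need to keep track of the last set of tokens and the minimum year for each set of "duplicates".

The actual algorithm is very similar to itertools.groupby (see the other answer, which uses it but assumes Python 3.x).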
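To illustrate why sorted input matters for both this loop and itertools.groupby, here is a small sketch showing that groupby only merges adjacent rows with equal keys. The helper name pair_counts and the rows are made up for illustration.

```python
from itertools import groupby


def pair_counts(rows):
    """Count consecutive rows that share the same first two fields."""
    return [
        (key, len(list(group)))
        for key, group in groupby(rows, key=lambda r: r[:2])
    ]


sorted_rows = [("A", "B", 1975), ("A", "B", 1976), ("A", "C", 1980)]
unsorted_rows = [("A", "B", 1975), ("A", "C", 1980), ("A", "B", 1976)]

print(pair_counts(sorted_rows))    # ("A", "B") is reported once
print(pair_counts(unsorted_rows))  # ("A", "B") is reported twice
```

On unsorted input the same pair shows up as several separate groups, so the counts are only meaningful when equal pairs are adjacent.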

It may be worth noting that this implementation is also O(n) (Big O).
