<p>The <code>groupby</code> and generators way:</p>
<pre><code>import csv
from itertools import groupby


def count_duplicate(it):
    # group by the first two fields; groupby only merges consecutive
    # records, so the input must already be sorted on those fields
    groups = groupby(it, lambda line: line[:2])
    # this produces (key, group) pairs, where a group is an iterator
    # over ['field0', 'field1', year] records whose field0 and field1
    # strings are the same

    # min_and_count converts such a group into a (count, min_year) pair
    def min_and_count(group):
        i, min_year = 0, 99999
        for _, _, year in group:
            i += 1
            min_year = min(min_year, year)
        return (i, min_year)

    yield from map(lambda x: x[0] + [min_and_count(x[1])], groups)


with open("test.srt") as fp:
    # read the lines lazily and filter out the empty ones
    lines = filter(bool, csv.reader(fp, delimiter=' '))
    # convert the last value to an integer (still lazily)
    lines = map(lambda line: [line[0], line[1], int(line[2])], lines)
    # write the result to another file
    with open("result_file", "w") as rf:
        for record in count_duplicate(lines):
            rf.write(str(record) + '\n')
</code></pre>
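<p>One thing to keep in mind is that <code>groupby</code> only merges records that sit next to each other. If the input might not be sorted on the first two fields, a rough sketch like the following sorts first, at the cost of loading everything into memory. The sample data and the name <code>summarize</code> are made up for illustration.</p>
<pre><code>from itertools import groupby
from operator import itemgetter

# Hypothetical sample records: [field0, field1, year], deliberately unsorted
records = [
    ["a", "b", 2001],
    ["c", "d", 1999],
    ["a", "b", 1995],
    ["c", "d", 2003],
    ["a", "b", 2010],
]


def summarize(rows):
    # Sort on the first two fields so that equal keys become adjacent,
    # which is what groupby needs in order to merge them
    key = itemgetter(0, 1)
    for (f0, f1), group in groupby(sorted(rows, key=key), key=key):
        years = [row[2] for row in group]
        yield [f0, f1, (len(years), min(years))]


result = list(summarize(records))
print(result)
# [['a', 'b', (3, 1995)], ['c', 'd', (2, 1999)]]
</code></pre>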
<p><strong>NB:</strong> This is a Python 3.x solution, where <code>filter</code> and <code>map</code> return iterators rather than a <code>list</code> as they do in Python 2.x.</p>
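<p>To see the lazy behaviour for yourself, here is a small sketch on made-up in-memory rows (not the real file). It assumes Python 3 and shows that both objects are one-shot iterators.</p>
<pre><code>from collections.abc import Iterator

# Hypothetical rows, including an empty one that filter should drop
rows = [["a", "b", "2001"], [], ["c", "d", "1999"]]

non_empty = filter(bool, rows)
parsed = map(lambda r: [r[0], r[1], int(r[2])], non_empty)

# In Python 3 both are iterators; nothing has been evaluated yet
is_lazy = isinstance(non_empty, Iterator) and isinstance(parsed, Iterator)
print(is_lazy)

first = next(parsed)     # pulls exactly one record through the pipeline
rest = list(parsed)      # consumes whatever is left
leftover = list(parsed)  # the iterator is now exhausted
print(first, rest, leftover)
</code></pre>
<p>Because the iterators can be consumed only once, the processing in the answer has to happen while the input file is still open.</p>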