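<p>Just use a dict that maps each SSN to a unique id to keep track of the SSNs you have already seen. You only need a single pass over the rows, and the <a href="https://docs.python.org/3.4/library/csv.html" rel="nofollow">csv module</a> will parse the file and do the splitting for you. If you want a completely new file:</p>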
<pre><code>import csv

cn = 10001  # first id to hand out
# newline="" lets the csv module handle line endings itself
with open("test.txt", newline="") as f, open("out.txt", "w", newline="") as tmp:
    r, wr = csv.reader(f, delimiter="|"), csv.writer(tmp, delimiter="|")
    # head is the header row, d maps SSN -> assigned id
    head, d = next(r), {}
    wr.writerow(["ID"] + head)
    for row in r:
        v = row[4]  # SSN column
        # if we have already seen the SSN, use the id assigned
        if v in d:
            wr.writerow([d[v]] + row)
        else:
            # else create new id, add pairing to dict and write
            d[v] = cn
            wr.writerow([cn] + row)
            cn += 1
</code></pre>
<p>Output:</p>
<pre><code>ID|RefID|FirstName|MiddleName|LastName|SSN|DOB|School Year|Age|District LEA|District Description|School LEA|Location Description|title|frng_amt
10001|1|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|2|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|S|13100
10001|3|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|4|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|5|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|S|15000
10001|6|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|7|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|8|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|9|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|10|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|11|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|12|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|13|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|A|13100
10002|14|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|S|15000
10002|15|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|A|13100
10002|16|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|P|014
</code></pre>
<p>If you want to update the original file, you can write to a tempfile and then do a <code>shutil.move</code>:</p>
<pre><code>import csv
import os
from shutil import move
from tempfile import NamedTemporaryFile

cn = 10001  # first id to hand out
tmp = None  # so the finally block is safe if opening the files fails
try:
    with open("test.txt", newline="") as f, \
            NamedTemporaryFile("w", dir=".", delete=False, newline="") as tmp:
        r, wr = csv.reader(f, delimiter="|"), csv.writer(tmp, delimiter="|")
        head, d = next(r), {}
        wr.writerow(["ID"] + head)
        for row in r:
            v = row[4]  # SSN column
            if v in d:
                wr.writerow([d[v]] + row)
            else:
                d[v] = cn
                wr.writerow([cn] + row)
                cn += 1
    # replace original file
    move(tmp.name, "test.txt")
finally:
    # the temp file only still exists if something failed before the move
    if tmp is not None and os.path.isfile(tmp.name):
        os.unlink(tmp.name)
</code></pre>
<p>If the data is ordered as in your input, with all rows for the same SSN next to each other, you can use <code>groupby</code>:</p>
<pre><code>import csv
from itertools import groupby
from operator import itemgetter

cn = 10001  # first id to hand out
with open("test.txt", newline="") as f, open("out.txt", "w", newline="") as tmp:
    r, wr = csv.reader(f, delimiter="|"), csv.writer(tmp, delimiter="|")
    head = next(r)
    wr.writerow(["ID"] + head)
    # each group is a run of consecutive rows with the same SSN (column 4)
    for k, v in groupby(r, key=itemgetter(4)):
        wr.writerows([cn] + sub for sub in v)
        cn += 1
</code></pre>
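<p>One thing to keep in mind is that <code>groupby</code> only groups <em>consecutive</em> rows with the same key. The sketch below illustrates the difference on a few made-up in-memory rows; the sample data and the helper names <code>ids_by_groupby</code> and <code>ids_by_dict</code> are purely illustrative and not part of the code above.</p>

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: [RefID, FirstName, MiddleName, LastName, SSN]
rows = [
    ["1", "JULIE", "A", "ADAMS", "123456789"],
    ["2", "SHIRLEY", "", "ADAMS", "987654321"],
    ["3", "JULIE", "A", "ADAMS", "123456789"],  # same SSN as row 1, not adjacent
]


def ids_by_groupby(rows, start=10001):
    """Assign one id per run of consecutive rows sharing an SSN."""
    out, cn = [], start
    for _, group in groupby(rows, key=itemgetter(4)):
        out.extend([cn] + row for row in group)
        cn += 1
    return out


def ids_by_dict(rows, start=10001):
    """Assign one id per distinct SSN, regardless of row order."""
    seen, out = {}, []
    for row in rows:
        # setdefault only stores the new id the first time an SSN appears
        out.append([seen.setdefault(row[4], start + len(seen))] + row)
    return out


print([r[0] for r in ids_by_groupby(rows)])  # [10001, 10002, 10003]
print([r[0] for r in ids_by_dict(rows)])     # [10001, 10002, 10001]
```

<p>With unordered input the <code>groupby</code> version gives the same SSN two different ids, so either sort the rows by SSN first or stick with the dict approach.</p>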