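<p>For completeness, and to share my enthusiasm and what I have learned, here is the code I am using now. It answers my question, and more.</p>
<p>This part is based on akaRem's approach above. One function populates a dict. It is called twice: once for the file containing the fixes and once for the file to be fixed.</p>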
<pre><code>import codecs, collections
from GetInfiles import *
sourcefile, targetfile = GetInfiles('dat')
# GetInfiles reads two input parameters from the command line,
# verifies they exist as files with the right extension,
# and then returns their names. Code not included here.
resultfile = targetfile[:-4] + '_result.dat'

def recordlist(infile):
    record = collections.OrderedDict()
    reclist = []
    with codecs.open(infile, 'r', 'utf-8-sig') as f:
        for line in f:
            try:
                key, value = line.split(' ', 1)
            except ValueError:
                key = line
                # so this line must be '~EOR~\n'.
                # All other lines must have the shape 'tag: content\n'
                # so if this errors, there's something wrong with an input file
            if not key.startswith('~EOR~'):
                try:
                    record[key].append(value)
                except KeyError:
                    record[key] = [value]
            else:
                reclist.append(record)
                record = collections.OrderedDict()
    return reclist

# put files into ordered dicts
source = recordlist(sourcefile)
target = recordlist(targetfile)

# patching
for fix in source:
    for record in target:
        if fix['ID'] == record['ID']:
            record.update(fix)

# write-out
with codecs.open(resultfile, 'w', 'utf-8-sig') as f:
    for record in target:
        # items() works on both Python 2 and Python 3
        for tag, field in record.items():
            for occ in field:
                line = u'{} {}'.format(tag, occ)
                f.write(line)
        f.write(u'~EOR~\n')
</code></pre>
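<p>It is now an ordered dict. That was not in my OP, but the files need to be cross-checked by humans, so preserving the order makes that easier. (<a href="http://pymotw.com/2/collections/ordereddict.html" rel="nofollow">Using OrderedDict is really easy</a>. My first attempt at finding this functionality led me to odict, but its documentation worried me: no examples, intimidating jargon...)</p>
<p>It also now supports multiple occurrences of any given tag within a record. That was not in my OP either, but I needed it. (The format is called 'Adlib tagged'; it comes from a piece of cataloguing software.)</p>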
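<p>To illustrate the two points above, here is a minimal sketch of how repeated tags get grouped under one key while the first-seen tag order is preserved. The sample lines, the tag names <code>AU</code> and <code>TI</code>, and the helper <code>parse_records</code> are made up for illustration and are not taken from a real Adlib export:</p>

```python
import collections

def parse_records(lines):
    """Group 'tag content' lines into one OrderedDict per record."""
    records = []
    record = collections.OrderedDict()
    for line in lines:
        if line.startswith('~EOR~'):
            # end of record: store it and start a fresh one
            records.append(record)
            record = collections.OrderedDict()
            continue
        key, value = line.split(' ', 1)
        # setdefault creates the list on the first occurrence of a tag
        record.setdefault(key, []).append(value)
    return records

# hypothetical sample data; 'AU' occurs twice in the same record
sample = [
    'ID 1\n',
    'AU Smith\n',
    'TI Some title\n',
    'AU Jones\n',
    '~EOR~\n',
]
records = parse_records(sample)
print(list(records[0].keys()))
print(records[0]['AU'])
```

<p>Both <code>AU</code> values end up in one list under the position where <code>AU</code> first appeared, so on write-out the occurrences of a tag come out together.</p>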
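<p>What differs from akaRem's approach is the patching, which uses <code>update</code> on the target dict. I find this really elegant, like so much in Python. The same goes for <code>startswith</code>. Those are two more reasons I could not resist sharing.</p>
<p>I hope it is useful.</p>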
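<p>A small sketch of what <code>update</code> does to a single record during patching, assuming records shaped like the ones built by <code>recordlist</code> (tag mapped to a list of occurrences). The tags <code>AU</code>, <code>TI</code>, <code>YR</code> and their values are invented for the example:</p>

```python
import collections

# record from the file to be fixed (hypothetical content)
target = collections.OrderedDict([
    ('ID', ['1\n']),
    ('AU', ['Smith\n', 'Jones\n']),
    ('TI', ['Old title\n']),
])
# record from the file with fixes (hypothetical content)
fix = collections.OrderedDict([
    ('ID', ['1\n']),
    ('TI', ['New title\n']),
    ('YR', ['1999\n']),
])

if fix['ID'] == target['ID']:
    # existing tags keep their position, new tags are appended at the end
    target.update(fix)

print(list(target.items()))
```

<p>Note that <code>update</code> replaces the whole list of occurrences for a tag that appears in the fix, it does not merge them. Tags absent from the fix, such as <code>AU</code> here, are left untouched.</p>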