<p>Below is a complete solution. It returns the same data set as your example, only more than twice as fast (at the cost of some extra memory):</p>
<pre><code>def identify_duplicates(data):
    lookup = {}  # quick lookup table: value -> [first_index, duplicate_count]
    result = {}  # store for our final result
    for i, v in enumerate(data):
        if v in lookup:  # already in the lookup table, so it's a duplicate
            if v not in result:  # add it to the result set
                result[v] = lookup[v]
            lookup[v][1] += 1  # increase the duplicate count
        else:
            lookup[v] = [i, 0]  # default state for non-duplicates
    return result

print(identify_duplicates(doiList))
# prints: {'10.1016/j.ijnurstu.2017.05.011 [doi]': [0, 1]}
</code></pre>
<p>The stored index is the first occurrence of the duplicate that was found, as in your example. If you want to store all of the duplicate indices, you can add <code>lookup[v].append(i)</code> after the <code>lookup[v][1] += 1</code> line, but then the data may look a bit odd (the structure would be <code>[first_index, number_of_occurrences, second_index, third_index...]</code>).</p>
<p>Instead, just flip the stored parameters in <code>lookup[v]</code> - use <code>lookup[v] = [0, i]</code> instead of <code>lookup[v] = [i, 0]</code>, and <code>lookup[v][0] += 1</code> instead of <code>lookup[v][1] += 1</code> - and then <code>lookup[v].append(i)</code> will give you a nicely shaped result: <code>[number_of_occurrences, first_index, second_index, third_index...]</code>.</p>
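<p>For illustration, here is a minimal sketch of the function with those two changes applied (the function name and the sample list are made up for the demo; as in the original, the count tracks occurrences beyond the first):</p>

```python
def identify_duplicates_all_indices(data):
    lookup = {}  # value -> [duplicate_count, first_index, later indices...]
    result = {}
    for i, v in enumerate(data):
        if v in lookup:
            if v not in result:  # first duplicate hit: expose the entry
                result[v] = lookup[v]
            lookup[v][0] += 1    # count now lives at index 0
            lookup[v].append(i)  # record every later occurrence
        else:
            lookup[v] = [0, i]   # flipped: [duplicate_count, first_index]
    return result

print(identify_duplicates_all_indices(['a', 'b', 'a', 'a']))
# prints: {'a': [2, 0, 2, 3]}
```

<p>Because <code>result[v]</code> and <code>lookup[v]</code> reference the same list object, later appends are visible in the result without any extra bookkeeping.</p>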