<p>这种方法比其他方法更冗长,但它可能更有效,这取决于几个因素(见我最后的注释)。除非您正在处理大量包含大量配置项的文件,否则我甚至不会考虑将此应用于其他一些建议,但如果性能是一个问题,则此算法可能会有所帮助。你知道吗</p>
<p>从配置字符串到文件集(称为<code>c2f</code>,从文件到配置字符串集(<code>f2c</code>)的字典开始。这两个都可以在全局文件时生成。你知道吗</p>
<p>很明显,c2f是一个字典,其中键是字符串,值是文件集。f2c是一个字典,其中键是文件,值是字符串集。你知道吗</p>
<p>循环f2c的文件键和一个数据项。使用c2f查找包含该项的所有文件。这些是您需要比较的唯一文件。你知道吗</p>
<p>以下是工作代码:</p>
<pre><code># this structure simulates the files system and contents.
cfg_data = {
"config1.txt": ["1.1.1.1", "2.2.2.2"],
"config2.txt": ["1.1.1.1"],
"config3.txt": ["2.2.2.2", "1.1.1.1"],
"config4.txt": ["2.2.2.2"]
}
# Build the dictionaries (this is O(n) over the lines of configuration data)
f2c = dict()
c2f = dict()
for file, data in cfg_data.iteritems():
data_set = set()
for item in data:
data_set.add(item)
if not item in c2f:
c2f[item] = set()
c2f[item].add(file)
f2c[file] = data_set;
# build the results as a list of pairs of lists:
results = []
# track the processed files
processed = set()
for file, data in f2c.iteritems():
if file in processed:
continue
size = len(data)
equivalence_list = []
# get one item from data, preferably the one used by the smallest list of
# files.
item = None
item_files = 0
for i in data:
if item == None:
item = i
item_files = len(c2f[item])
elif len(c2f[i]) < item_files:
item = i
item_files = len(c2f[i])
# All files with the same data as f must have at least the first item of
# data, just look at those files.
for other_file in c2f[item]:
other_data = f2c[other_file]
if other_data == data:
equivalence_list.append(other_file)
# No need to visit these files again
processed.add(other_file)
results.append((data, equivalence_list))
# Display the results
for data, files in results:
print data, ':', files
</code></pre>
<p>添加一个计算复杂性的注释:从技术上讲,这是O((K log N)*(L log M)),其中N是文件的数量,M是唯一配置项的数量,K(<;=N)是具有相同内容的文件的<em>组</em>的数量,L(<;=M)是每L个已处理文件必须成对比较的平均文件数。如果K<;<;N和L<;<;M是有效的</p>