按匹配项分组问题的回答

按匹配项分组

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

这种方法比其他方法更冗长，但它可能更有效，这取决于几个因素（见我最后的注释）。除非您正在处理大量包含大量配置项的文件，否则我甚至不会考虑将此应用于其他一些建议，但如果性能是一个问题，则此算法可能会有所帮助。你知道吗 从配置字符串到文件集（称为<code>c2f</code>，从文件到配置字符串集（<code>f2c</code>）的字典开始。这两个都可以在全局文件时生成。你知道吗 很明显，c2f是一个字典，其中键是字符串，值是文件集。f2c是一个字典，其中键是文件，值是字符串集。你知道吗 循环f2c的文件键和一个数据项。使用c2f查找包含该项的所有文件。这些是您需要比较的唯一文件。你知道吗 以下是工作代码： <pre><code># this structure simulates the files system and contents. cfg_data = { "config1.txt": ["1.1.1.1", "2.2.2.2"], "config2.txt": ["1.1.1.1"], "config3.txt": ["2.2.2.2", "1.1.1.1"], "config4.txt": ["2.2.2.2"] } # Build the dictionaries (this is O(n) over the lines of configuration data) f2c = dict() c2f = dict() for file, data in cfg_data.iteritems(): data_set = set() for item in data: data_set.add(item) if not item in c2f: c2f[item] = set() c2f[item].add(file) f2c[file] = data_set; # build the results as a list of pairs of lists: results = [] # track the processed files processed = set() for file, data in f2c.iteritems(): if file in processed: continue size = len(data) equivalence_list = [] # get one item from data, preferably the one used by the smallest list of # files. item = None item_files = 0 for i in data: if item == None: item = i item_files = len(c2f[item]) elif len(c2f[i]) < item_files: item = i item_files = len(c2f[i]) # All files with the same data as f must have at least the first item of # data, just look at those files. for other_file in c2f[item]: other_data = f2c[other_file] if other_data == data: equivalence_list.append(other_file) # No need to visit these files again processed.add(other_file) results.append((data, equivalence_list)) # Display the results for data, files in results: print data, ':', files </code></pre> 添加一个计算复杂性的注释：从技术上讲，这是O（（K log N）*（L log M）），其中N是文件的数量，M是唯一配置项的数量，K（&lt；=N）是具有相同内容的文件的组的数量，L（&lt；=M）是每L个已处理文件必须成对比较的平均文件数。如果K&lt；&lt；N和L&lt；&lt；M是有效的

按匹配项分组

1 个回答

相关Python问题