<h2>Possible preprocessing</h2>
<h2>Remove unique URLs first</h2>
<p>Since you have many keys but only 10 URLs per key, one possible optimization is to:</p>
<ul>
<li>Find the URLs that appear only once.</li>
<li>Remove those URLs from the data.</li>
<li>Remove the keys that now have fewer than 2 URLs.</li>
</ul>
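<p>Depending on how the URLs are distributed, this might change nothing at all, or it might remove most of the keys.</p>
<p>Either way, this preprocessing needs only a pass over the data, which is much faster than the brute-force quadratic solution, so it is probably worth trying.</p>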
<pre><code>from collections import defaultdict, Counter

keys_by_url = defaultdict(list)
data = {
    "key1": ["1", "3", "4", "7"],
    "key2": ["7", "3", "2", "1"],
    "key3": ["5", "2", "3", "1"],
    "key4": ["4", "5", "1", "3"],
    "key5": ["8", "9", "x", "3"],
    "key6": ["a", "b", "c", "d"]
}

# Inverted index: for each URL, the keys whose list contains it.
for key, urls in data.items():
    for url in urls:
        keys_by_url[url].append(key)
# keys_by_url == {'a': ['key6'], 'c': ['key6'], 'b': ['key6'],
# 'd': ['key6'], '1': ['key3', 'key2', 'key1', 'key4'], '3': ['key3',
# 'key2', 'key1', 'key5', 'key4'], '2': ['key3', 'key2'], '5': ['key3',
# 'key4'], '4': ['key1', 'key4'], '7': ['key2', 'key1'], 'x': ['key5'],
# '9': ['key5'], '8': ['key5']}

# A URL found under a single key cannot be shared, so drop it from that key.
for url, keys in keys_by_url.items():
    if len(keys) == 1:
        unique_key = keys[0]
        data[unique_key].remove(url)
# {'key3': ['5', '2', '3', '1'], 'key2': ['7', '3', '2', '1'], 'key1': ['1', '3', '4', '7'], 'key6': [], 'key5': ['3'], 'key4': ['4', '5', '1', '3']}

# Keep only the keys that still have at least 2 URLs.
trimmed_data = {key: values for key, values in data.items() if len(values) >= 2}
# {'key3': ['5', '2', '3', '1'], 'key2': ['7', '3', '2', '1'], 'key1': ['1', '3', '4', '7'], 'key4': ['4', '5', '1', '3']}
</code></pre>
<p>In the example above, the preprocessing drops <code>key5</code> and <code>key6</code>: neither of them appears in <code>trimmed_data</code>.</p>
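<p>The same preprocessing could also be packaged as a function that leaves the input untouched. This is only a minimal sketch: the name <code>trim_unique_urls</code> and the <code>min_urls</code> parameter are hypothetical, and it assumes that a URL repeated within a single key should count only once.</p>
<pre><code>from collections import Counter


def trim_unique_urls(data, min_urls=2):
    """Return a new dict without URLs that occur under only one key.

    Keys left with fewer than min_urls URLs are dropped as well.
    The input dict and its lists are not modified.
    """
    # Number of distinct keys each URL occurs under
    # (set() ignores duplicates inside a single key's list).
    url_counts = Counter(url for urls in data.values() for url in set(urls))
    trimmed = {}
    for key, urls in data.items():
        shared = [url for url in urls if url_counts[url] > 1]
        if len(shared) >= min_urls:
            trimmed[key] = shared
    return trimmed


data = {
    "key1": ["1", "3", "4", "7"],
    "key2": ["7", "3", "2", "1"],
    "key3": ["5", "2", "3", "1"],
    "key4": ["4", "5", "1", "3"],
    "key5": ["8", "9", "x", "3"],
    "key6": ["a", "b", "c", "d"],
}

trimmed = trim_unique_urls(data)
print(sorted(trimmed))  # ['key1', 'key2', 'key3', 'key4']
</code></pre>
<p>Building a new dict instead of calling <code>remove</code> on the original lists means <code>data</code> can still be inspected afterwards.</p>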
<h2>Finding possible key pairs</h2>
<p>For each key, <code>keys_by_url</code> can be reused to find the keys that share at least two URLs with it:</p>
<pre><code>for key, urls in trimmed_data.items():
    # Count how often each other key shares a URL with this one;
    # the `other_key > key` filter reports every pair only once.
    possible_keys = Counter(
        [other_key for url in urls for other_key in keys_by_url[url] if other_key > key])
    print(key)
    print([k for k in possible_keys if possible_keys[k] > 1])
    print(" -")
</code></pre>
<p>It outputs:</p>
<pre><code>key3
['key4']
-
key4
[]
-
key1
['key3', 'key4', 'key2']
-
key2
['key3', 'key4']
-
</code></pre>
<p>This should be very fast, and it leaves only the interesting key pairs. You could then feed those pairs to @juanpa.arrivillaga's <a href="https://stackoverflow.com/a/43553140/6419007">solution</a> instead of <code>itertools.combinations</code>.</p>
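<p>As an illustration of what could be done with the candidate pairs, the sketch below intersects the URL lists of only those pairs that passed the counting step. It is not the linked solution, just one possible follow-up; <code>pairs_with_common_urls</code> and <code>min_common</code> are hypothetical names.</p>
<pre><code>from collections import Counter, defaultdict


def pairs_with_common_urls(data, min_common=2):
    """Map each (key, other_key) pair to the sorted list of URLs they share."""
    # Inverted index: URL -> set of keys containing it.
    keys_by_url = defaultdict(set)
    for key, urls in data.items():
        for url in urls:
            keys_by_url[url].add(key)

    pairs = {}
    for key, urls in data.items():
        own_urls = set(urls)
        # Only keys that co-occur with `key` on some URL are ever counted,
        # and `other > key` keeps each pair from being reported twice.
        candidates = Counter(
            other for url in own_urls for other in keys_by_url[url] if other > key
        )
        for other, count in candidates.items():
            if count >= min_common:
                pairs[(key, other)] = sorted(own_urls.intersection(data[other]))
    return pairs


data = {
    "key1": ["1", "3", "4", "7"],
    "key2": ["7", "3", "2", "1"],
    "key3": ["5", "2", "3", "1"],
    "key4": ["4", "5", "1", "3"],
    "key5": ["8", "9", "x", "3"],
    "key6": ["a", "b", "c", "d"],
}

for pair, common in sorted(pairs_with_common_urls(data).items()):
    print(pair, common)
</code></pre>
<p>The set intersection runs only for pairs that already share enough URLs, so keys such as <code>key5</code> and <code>key6</code> never reach that step.</p>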
<h2>Shorter version</h2>
<pre><code>from collections import defaultdict, Counter

keys_by_url = defaultdict(list)
data = {
    "key1": ["1", "3", "4", "7"],
    "key2": ["7", "3", "2", "1"],
    "key3": ["5", "2", "3", "1"],
    "key4": ["4", "5", "1", "3"],
    "key5": ["8", "9", "x", "3"],
    "key6": ["a", "b", "c", "d"]
}

for key, urls in data.items():
    for url in urls:
        keys_by_url[url].append(key)

for key, urls in data.items():
    possible_keys = Counter(
        [other_key for url in urls for other_key in keys_by_url[url] if other_key > key])
    at_least_2_common_urls = [k for k in possible_keys if possible_keys[k] > 1]
    if at_least_2_common_urls:
        print(key)
        print(at_least_2_common_urls)
        print(' -')
</code></pre>
<p>It outputs:</p>
<pre><code>key1
['key4', 'key3', 'key2']
-
key3
['key4']
-
key2
['key4', 'key3']
-
</code></pre>