将字典中具有相同值的列表分组为键

2条回答

网友

1楼 · 编辑于 2024-09-29 21:23:42

可能的预处理

先删除唯一的URL

由于您有许多键，但按键只有10个URL，因此一种可能的优化方法是：

查找只出现一次的URL。你知道吗
从数据中删除这些URL。你知道吗
删除现在少于2个URL的键。你知道吗

根据URL的分布情况，它可能不会更改任何内容，也可能会删除大部分密钥。你知道吗

不管怎么说，这种预处理比暴力的二次解决方案要快得多，所以它可能是值得的。你知道吗

from collections import defaultdict, Counter

keys_by_url = defaultdict(list)

data = {
    "key1": ["1", "3", "4", "7"],
    "key2": ["7", "3", "2", "1"],
    "key3": ["5", "2", "3", "1"],
    "key4": ["4", "5", "1", "3"],
    "key5": ["8", "9", "x", "3"],
    "key6": ["a", "b", "c", "d"]
}

for key, urls in data.items():
    for url in urls:
        keys_by_url[url].append(key)
# defaultdict(<type 'list'>, {'a': ['key6'], 'c': ['key6'], 'b': ['key6'],
# 'd': ['key6'], '1': ['key3', 'key2', 'key1', 'key4'], '3': ['key3',
# 'key2', 'key1', 'key5', 'key4'], '2': ['key3', 'key2'], '5': ['key3',
# 'key4'], '4': ['key1', 'key4'], '7': ['key2', 'key1'], 'x': ['key5'],
# '9': ['key5'], '8': ['key5']})

for url, keys in keys_by_url.items():
    if len(keys) == 1:
        unique_key = keys[0]
        data[unique_key].remove(url)

# {'key3': ['5', '2', '3', '1'], 'key2': ['7', '3', '2', '1'], 'key1': ['1', '3', '4', '7'], 'key6': [], 'key5': ['3'], 'key4': ['4', '5', '1', '3']}

trimmed_data = {key: values for key,
                values in data.items() if len(values) >= 2}
# {'key3': ['5', '2', '3', '1'], 'key2': ['7', '3', '2', '1'], 'key1': ['1', '3', '4', '7'], 'key4': ['4', '5', '1', '3']}

在上面的例子中，由于预处理，从原始dict中删除了key5和key6。你知道吗

查找可能的成对键

对于每个键，可以重用keys_by_url来查找至少有两个共同URL的键：

for key, urls in trimmed_data.items():
    possible_keys = Counter(
        [other_key for url in urls for other_key in keys_by_url[url] if other_key > key])
    print(key)
    print([k for k in possible_keys if possible_keys[k] > 1])
    print(" -")

它输出：

key3
['key4']
 -
key4
[]
 -
key1
['key3', 'key4', 'key2']
 -
key2
['key3', 'key4']
 -

这应该是非常快的，只会留下有趣的密钥对。然后你就可以用这两对了@胡安帕.阿里维拉加的solution而不是itertools.combinations。你知道吗

较短版本

from collections import defaultdict, Counter

keys_by_url = defaultdict(list)

data = {
    "key1": ["1", "3", "4", "7"],
    "key2": ["7", "3", "2", "1"],
    "key3": ["5", "2", "3", "1"],
    "key4": ["4", "5", "1", "3"],
    "key5": ["8", "9", "x", "3"],
    "key6": ["a", "b", "c", "d"]
}

for key, urls in data.items():
    for url in urls:
        keys_by_url[url].append(key)

for key, urls in data.items():
    possible_keys = Counter(
        [other_key for url in urls for other_key in keys_by_url[url] if other_key > key])
    at_least_2_common_urls = [k for k in possible_keys if possible_keys[k] > 1]
    if at_least_2_common_urls:
      print(key)
      print(at_least_2_common_urls)
      print(' -')

它输出：

key1
['key4', 'key3', 'key2']
 -
key3
['key4']
 -
key2
['key4', 'key3']
 -

网友

2楼 · 编辑于 2024-09-29 21:23:42

所以，我想不出一个聪明的解决方案，不需要将每个值集与其他值集进行比较。我不指望这是非常性能，但它可能足够了。首先，让我们从我们的字典开始：

In [12]: d

Out[12]:
{'key1': ['1', '3', '4', '7'],
 'key2': ['7', '3', '2', '1'],
 'key3': ['5', '2', '3', '1'],
 'key4': ['4', '5', '1', '3']}

现在，我们将把列表改为frozensets

In [14]: ds = {k:frozenset(v) for k,v in d.items()}

In [16]: ds
Out[16]:
{'key1': frozenset({'1', '3', '4', '7'}),
 'key2': frozenset({'1', '2', '3', '7'}),
 'key3': frozenset({'1', '2', '3', '5'}),
 'key4': frozenset({'1', '3', '4', '5'})}

这允许我们获得交叉点，并且可以散列：

In [17]: ds['key1'] & ds['key2']
Out[17]: frozenset({'1', '3', '7'})

In [18]: inverse = {}

所以，现在我们创建一个中间层。这将花费最长的处理时间：

In [20]: import itertools

In [21]: for k1, k2 in itertools.combinations(ds, 2):
     ...:     inverse.setdefault(ds[k1] & ds[k2], []).extend((k1, k2))
     ...:

In [22]: inverse
Out[22]:
{frozenset({'1', '3'}): ['key1', 'key3', 'key2', 'key4'],
 frozenset({'1', '2', '3'}): ['key2', 'key3'],
 frozenset({'1', '3', '7'}): ['key1', 'key2'],
 frozenset({'1', '3', '5'}): ['key3', 'key4'],
 frozenset({'1', '3', '4'}): ['key1', 'key4']}

现在我们有了你想要的数据，但是现在有了你想要的密钥方案。为此，我们可以采取如下措施：

In [23]: from collections import defaultdict

In [24]: lenmap = defaultdict(lambda: itertools.count(1))

最后：

In [25]: inverse2 = {}
     ...: for k, v in inverse.items():
     ...:     inverse2["group{}.{}".format(len(k), next(lenmap[len(k)]))] = v
     ...:

In [26]: inverse2
Out[26]:
{'group2.1': ['key1', 'key3', 'key2', 'key4'],
 'group3.1': ['key2', 'key3'],
 'group3.2': ['key1', 'key2'],
 'group3.3': ['key3', 'key4'],
 'group3.4': ['key1', 'key4']}

不幸的是，创建inverse的步骤将是多项式时间。它至少是O（n^2），其中n是字典的大小，但实际上，我认为它是O（n^2*m），其中m是键之间成对最小长度的最大值。可能不太可怕。编辑刚刚注意到在一条评论中，每个值都是10个URL的列表。在这种情况下，第二个中间步骤仅仅是O（n^2），因为m是常数。你知道吗

可能的预处理

先删除唯一的URL

查找可能的成对键

较短版本

相关问题更多 >

编程相关推荐

热门问题

热门文章