查找关键字以相同前缀开头的字典值的更有效方法

2024-09-28 21:22:57 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一本字典,它的键是以一组相同的前缀出现的,就像这样:

d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
      "key2":"valD", "key2-22":"valE" }

给定一个查询字符串,我需要查找与以该前缀开头的键相关联的所有值,例如,对于query="key1",我需要得到["valA", "valB", "valC"]

我下面的实现可以工作,但是对于大量查询来说太慢了,因为字典d有大约30000个键,并且大多数键的长度超过20个字符:

^{pr2}$

有没有更快/更有效的方法来实现这一点?在


Tags: 字符串字典querykey2key1个字符pr2vale
3条回答

sortedContainerslib有一个SortedDict实现,一旦对dict进行了排序,就可以左等分查找起始位置,右等分查找最后一个位置,然后使用irange获取范围内的键:

from sortedcontainers import SortedDict
from operator import itemgetter
from itertools import takewhile


d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
  "key2":"valD", "key2-22":"valE","key3":"foo" }

key = "key2"
d = SortedDict(sorted(d.items(), key=itemgetter(0)))
start = d.bisect_left(key)
print([d[key] for key in takewhile(lambda x: x.startswith("key2"), d.irange(d.iloc[start]]))
['valD', 'valE']

一旦您维护了一个Sortedict,使用Sortedict会更有效:

^{pr2}$

您可以避免生成dict.keys()(在python2.x中)生成的中间列表:

result = [d[key] for key in d if key.startswith(query)]

但是您很可能希望使用trie而不是字典,这样您就可以找到与具有公共前缀的键相关联的所有值(trie类似于基于前缀的树)。在

Here您可以找到一些不同的尝试实现。在

A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn".

A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn". (source wikipedia)


让我们比较一下不同解决方案的时间安排:

^{pr2}$
# dict without keys()
%timeit [d[s] for s in d if s.startswith(query)]

    100 loops, best of 3: 7.83 ms per loop

# 11.72% improvement

# PyTrie (https://pypi.python.org/pypi/PyTrie/0.2)
import pytrie
pt = pytrie.Trie(d)

%timeit [pt[s] for s in pt.iterkeys(query)]

    1000 loops, best of 3: 320 µs per loop

# 96.36% improvement

# datrie (https://pypi.python.org/pypi/datrie/0.7)
import datrie
dt = datrie.Trie('0123456789')
for key, val in d.iteritems():
    dt[unicode(key)] = val

%timeit [dt[s] for s in dt.keys(unicode(query))]

    10000 loops, best of 3: 162 µs per loop

# 98.17% improvement

您可以使用suffix tree

#!/usr/bin/env python2
from SuffixTree import SubstringDict # $ pip install https://github.com/JDonner/SuffixTree/archive/master.zip

d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
      "key2":"valD", "key2-22":"valE" }

a = '\n' # anchor
prefixes = SubstringDict()
for key, value in d.items(): # populated the tree *once*
    prefixes[a + key] = value # assume there is no '\n' in key

for query in ["key1", "key2"]: # perform queries
    print query, prefixes[a + query]

输出

^{pr2}$

相关问题 更多 >