Why is it faster to insert keys into a Python dict in order than out of order?

I'm creating a large number of dicts (millions of entries), and I've noticed that they build faster if I insert the keys in order.

I assume this has something to do with hash collisions, but can someone explain exactly why it happens, and whether it is consistent across Python versions?

Here is a contrived example:

import timeit
import random

def get_test_data(num, size):
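    # otest holds the keys in the order they were created; utest is the same keys shuffled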
    olist, ulist = [], []
    for _ in range(num):
        otest = [str(i) for i in range(size)]
        utest = list(otest)
        random.shuffle(utest)
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist

NUM_TESTS = 20
# Precalculate the test data so we only measure dict creation time
ordered, unordered = get_test_data(NUM_TESTS, 1000000)

def test_ordered():
    dict((k, k) for k in ordered.pop())

def test_unordered():
    dict((k, k) for k in unordered.pop())

print "unordered: ",
print timeit.timeit("test_unordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)
print "ordered: ",
print timeit.timeit("test_ordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)

On my machine the output consistently shows the unordered test taking noticeably longer than the ordered one.

I'm using Python 2.7.3 on Ubuntu Precise, x86_64.


2 Answers

I'm almost certain this is what's going on: when you first create otest, the strings are stored sequentially in memory. When you then create utest, it points to the same string objects, but now those memory locations are visited out of order, which hurts cache performance in the unordered test case.

Here is the evidence. I replaced your get_test_data function with this version:

def get_test_data(num, size):
    olist, ulist = [], []
    for _ in range(num):
        nums = range(size)
        random.shuffle(nums)
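        # build the string objects in shuffled order, so they are laid out in memory in that shuffled order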
        utest = [str(i) for i in nums]
        otest = list(utest)
        otest.sort(key=lambda x: int(x))
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist

Now the strings in ulist are constructed contiguously in memory, and olist is built by sorting those same strings with the appropriate key. On my machine, this reverses the runtimes of the two tests.

Checking the source code of the python dict shows that consecutive strings or ints generate fewer collisions. Combined with @skishhore's comment about better cache locality, that is probably the answer.
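
To get a feel for the collision half of that claim, here is a small sketch of my own (written for Python 2 to match the question's environment; the helper initial_slot_collisions is a made-up name, not part of CPython) that counts first-probe collisions using CPython's initial slot computation, hash(key) & (table_size - 1), for consecutive int keys versus randomly distributed hash values:

import random

def initial_slot_collisions(hashes, table_size):
    # CPython picks the first probe slot as hash(key) & (table_size - 1)
    seen = set()
    collisions = 0
    for h in hashes:
        slot = h & (table_size - 1)
        if slot in seen:
            collisions += 1
        seen.add(slot)
    return collisions

TABLE_SIZE = 2 ** 17   # power-of-two table size, as CPython uses
NUM_KEYS = 100000

consecutive = [hash(i) for i in range(NUM_KEYS)]             # hash(i) == i for small ints
randomish = [random.getrandbits(30) for _ in range(NUM_KEYS)]

print "consecutive int keys: ", initial_slot_collisions(consecutive, TABLE_SIZE)
print "random-looking hashes:", initial_slot_collisions(randomish, TABLE_SIZE)

Consecutive ints land in distinct slots (zero first-probe collisions), while random-looking hashes collide thousands of times at the same load factor. Note, though, that in the original benchmark both dicts contain the same keys, which suggests the ordered/unordered gap there comes mainly from cache locality, as demonstrated above.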

Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>

This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.
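
To make that concrete with the hash values quoted above: in a table of size 2**3 = 8, only the low-order 3 bits of each hash select the initial slot, and both the consecutive ints and the four "consecutive" string hashes already fall into distinct slots (the exact string hashes are build-dependent, so this only illustrates the indexing scheme):

>>> [h & 7 for h in (0, 1, 2, 3)]
[0, 1, 2, 3]
>>> [h & 7 for h in (-1658398457, -1658398460, -1658398459, -1658398462)]
[7, 4, 5, 2]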
