Why is it faster to insert keys into a Python dict in order than out of order?

I'm creating a large number of dicts (millions of entries), and I've noticed that they build faster if I insert the keys in order.

I assume this has something to do with hash collisions, but can someone explain exactly why it happens, and whether it is consistent across Python versions?

Here is a contrived example:

import timeit
import random

def get_test_data(num, size):
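    # otest holds the keys in the order they were created; utest is the same keys shuffled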
    olist, ulist = [], []
    for _ in range(num):
        otest = [str(i) for i in range(size)]
        utest = list(otest)
        random.shuffle(utest)
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist

NUM_TESTS = 20
# Precalculate the test data so we only measure dict creation time
ordered, unordered = get_test_data(NUM_TESTS, 1000000)

def test_ordered():
    dict((k, k) for k in ordered.pop())

def test_unordered():
    dict((k, k) for k in unordered.pop())

print "unordered: ",
print timeit.timeit("test_unordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)
print "ordered: ",
print timeit.timeit("test_ordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)

On my machine the output consistently shows the unordered test taking noticeably longer than the ordered one.

I'm using Python 2.7.3 on Ubuntu Precise, x86_64.


2 Answers

I'm almost certain this is what's going on: when you first create otest, the strings are stored sequentially in memory. When you then create utest, it points to the same string objects, but now those memory locations are visited out of order, which hurts cache performance in the unordered test case.

Here is the evidence. I replaced your get_test_data function with this version:

def get_test_data(num, size):
    olist, ulist = [], []
    for _ in range(num):
        nums = range(size)
        random.shuffle(nums)
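        # build the string objects in shuffled order, so they are laid out in memory in that shuffled order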
        utest = [str(i) for i in nums]
        otest = list(utest)
        otest.sort(key=lambda x: int(x))
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist

Now the strings in ulist are constructed contiguously in memory, and olist is built by sorting those same strings with the appropriate key. On my machine, this reverses the runtimes of the two tests.

Checking the source code of the python dict shows that consecutive strings or ints generate fewer collisions. Combined with @skishhore's comment about better cache locality, that is probably the answer.
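
To get a feel for the collision half of that claim, here is a small sketch of my own (written for Python 2 to match the question's environment; the helper initial_slot_collisions is a made-up name, not part of CPython) that counts first-probe collisions using CPython's initial slot computation, hash(key) & (table_size - 1), for consecutive int keys versus randomly distributed hash values:

import random

def initial_slot_collisions(hashes, table_size):
    # CPython picks the first probe slot as hash(key) & (table_size - 1)
    seen = set()
    collisions = 0
    for h in hashes:
        slot = h & (table_size - 1)
        if slot in seen:
            collisions += 1
        seen.add(slot)
    return collisions

TABLE_SIZE = 2 ** 17   # power-of-two table size, as CPython uses
NUM_KEYS = 100000

consecutive = [hash(i) for i in range(NUM_KEYS)]             # hash(i) == i for small ints
randomish = [random.getrandbits(30) for _ in range(NUM_KEYS)]

print "consecutive int keys: ", initial_slot_collisions(consecutive, TABLE_SIZE)
print "random-looking hashes:", initial_slot_collisions(randomish, TABLE_SIZE)

Consecutive ints land in distinct slots (zero first-probe collisions), while random-looking hashes collide thousands of times at the same load factor. Note, though, that in the original benchmark both dicts contain the same keys, which suggests the ordered/unordered gap there comes mainly from cache locality, as demonstrated above.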

Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>

This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.
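
To make that concrete with the hash values quoted above: in a table of size 2**3 = 8, only the low-order 3 bits of each hash select the initial slot, and both the consecutive ints and the four "consecutive" string hashes already fall into distinct slots (the exact string hashes are build-dependent, so this only illustrates the indexing scheme):

>>> [h & 7 for h in (0, 1, 2, 3)]
[0, 1, 2, 3]
>>> [h & 7 for h in (-1658398457, -1658398460, -1658398459, -1658398462)]
[7, 4, 5, 2]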
