Why does random sampling time scale with the dataset size rather than the sample size? (pandas.sample() example)

import pandas as pd
import numpy as np
import time as tm

# generate a small and a large dataset
testSeriesSmall = pd.Series(np.random.randn(10000))
testSeriesLarge = pd.Series(np.random.randn(10000000))

sampleSize = 10
tStart = tm.time()
currSample = testSeriesLarge.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))

tStart = tm.time()
currSample = testSeriesSmall.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesSmall), (tm.time() - tStart)))

sampleSize = 1000
tStart = tm.time()
currSample = testSeriesLarge.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))

tStart = tm.time()
currSample = testSeriesSmall.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesSmall), (tm.time() - tStart)))

2 answers

User

Answer 1 · edited 2024-09-25 04:23:37

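This appears to be an internal NumPy issue. I believe the pandas sample method calls numpy.random.choice. Let's look at how NumPy performs with different array sizes and sample sizes.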

Create the arrays

large = np.arange(1000000)
small = np.arange(1000)

Timing sampling without replacement

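[Code block lost in extraction. By analogy with the with-replacement timings, it presumably ran the two commands below; the timing output is not preserved.]

%timeit np.random.choice(large, 10, replace=False)

%timeit np.random.choice(small, 10, replace=False)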

Timing sampling with replacement

%timeit np.random.choice(large, 10, replace=True)
100000 loops, best of 3: 11.7 µs per loop

%timeit np.random.choice(small, 10, replace=True)
100000 loops, best of 3: 12.2 µs per loop

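Interestingly, without replacement the large array takes nearly 3 orders of magnitude longer, and it is exactly 3 orders of magnitude larger. This suggests that NumPy randomly shuffles the entire array and then takes the first 10 items.

When sampling with replacement, each value is chosen independently, so the timings are the same.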
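The comparison above can be repeated outside IPython with a small harness. This is only a sketch: the helper name `time_choice` and the `repeats` parameter are made up for illustration, and the absolute numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

def time_choice(population_size, sample_size, replace, repeats=3):
    """Return the best wall-clock time (seconds) of one np.random.choice call."""
    data = np.arange(population_size)
    timer = timeit.Timer(
        lambda: np.random.choice(data, sample_size, replace=replace)
    )
    # Take the minimum over a few runs to reduce noise from other processes.
    return min(timer.repeat(repeat=repeats, number=1))

if __name__ == "__main__":
    for size in (1_000, 1_000_000):
        for replace in (False, True):
            seconds = time_choice(size, 10, replace)
            print(f"population={size:>9,} replace={replace!s:<5} {seconds:.6f} s")
```

If the shuffle explanation holds, the replace=False timings should grow roughly with the population size while the replace=True timings stay flat.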

User

Answer 2 · edited 2024-09-25 04:23:37

In your case, pandas.Series.sample() boils down to:

rs = np.random.RandomState()
locs = rs.choice(axis_length, size=n, replace=False)
return self.take(locs)

The slow part is rs.choice():

[code block missing from the source]

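Generating a single random number takes about 10 seconds! If you divide the first argument by 10, it takes about 1 second. That is far too slow!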

With replace=True it is very fast. If you don't mind duplicate entries in the result, that is one workaround.
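A minimal sketch of that workaround, assuming duplicates are acceptable. The wrapper name `sample_with_replacement` is hypothetical; `replace=True` is the existing keyword of `Series.sample`.

```python
import numpy as np
import pandas as pd

def sample_with_replacement(series, n):
    """Draw n values; the same row may be picked more than once."""
    return series.sample(n=n, replace=True)

series = pd.Series(np.random.randn(1_000_000))
picked = sample_with_replacement(series, 10)
# is_unique is False whenever the same row was drawn twice.
print(len(picked), picked.index.is_unique)
```

With n much smaller than the series length, duplicates are rare but still possible, so check `picked.index.is_unique` if the downstream code needs distinct rows.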

The NumPy documentation for choice(replace=False) says:

This is equivalent to np.random.permutation(np.arange(5))[:3]

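That pretty much explains the problem: it generates a huge array of all possible values, shuffles them, and then takes the first N. This is the root cause of the performance problem, and it has already been reported to NumPy: https://github.com/numpy/numpy/pull/5158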
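To make the cost visible, here is a sketch of the shuffle-then-slice strategy described in the documentation. The helper name `choice_via_permutation` is made up for illustration and is not NumPy's internal code.

```python
import numpy as np

def choice_via_permutation(rs, population_size, n):
    """Shuffle every index 0..population_size-1, then keep the first n."""
    # The whole permutation is built even when n is tiny, so the work and
    # the memory are proportional to population_size, not to n.
    return rs.permutation(population_size)[:n]

rs = np.random.RandomState(0)
picked = choice_via_permutation(rs, 1_000_000, 10)
print(picked)
```

Asking for 10 values still allocates and shuffles one million integers, which matches the timings reported above.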

Apparently this is hard to fix in NumPy, because people rely on choice() returning the same results across NumPy versions when the same random seed is used.
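One option not mentioned in the original answer is the newer `Generator` API, which is a separate code path from `RandomState`. The sketch below assumes NumPy 1.17 or later (where `np.random.default_rng` exists); the function name `sample_with_generator` is hypothetical, and whether it is faster for your sizes should be measured rather than assumed.

```python
import numpy as np
import pandas as pd

def sample_with_generator(series, n, seed=None):
    """Pick n distinct positions with a Generator, then take those rows."""
    rng = np.random.default_rng(seed)  # assumption: NumPy >= 1.17
    locs = rng.choice(len(series), size=n, replace=False)
    return series.take(locs)

series = pd.Series(np.random.randn(1_000_000))
picked = sample_with_generator(series, 10, seed=42)
print(picked.index.is_unique)
```

Results from a `Generator` will not match those of `RandomState` for the same seed, so this only suits code that does not depend on the legacy random stream.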

Since your use case is quite narrow, you can do something like this:

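import numpy as np
import pandas as pd

def sample(series, n):
    # Draw 2*n candidate positions with replacement; assumes n << len(series).
    locs = np.random.randint(0, len(series), n*2)
    # pd.unique keeps first-seen order; np.unique sorts, so [:n] would favor low positions.
    locs = pd.unique(locs)[:n]
    assert len(locs) == n, "sample() assumes n << len(series)"
    return series.take(locs)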

This speeds things up considerably:

sample 10 from 10000 values: 0.00735 s
sample 10 from 1000000 values: 0.00944 s
sample 10 from 100000000 values: 1.44148 s
sample 1000 from 10000 values: 0.00319 s
sample 1000 from 1000000 values: 0.00802 s
sample 1000 from 100000000 values: 0.01989 s
sample 100000 from 1000000 values: 0.05178 s
sample 100000 from 100000000 values: 0.93336 s
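The helper above stops with an assertion when too many candidates collide. A variant that keeps drawing until it has n distinct positions could look like the sketch below; the name `sample_unique` and the `oversample` parameter are made up for illustration, and it is still only efficient when n is small relative to the series length.

```python
import numpy as np
import pandas as pd

def sample_unique(series, n, oversample=2):
    """Draw positions with replacement until n distinct ones are collected."""
    if n > len(series):
        raise ValueError("cannot draw more distinct rows than the series has")
    locs = np.empty(0, dtype=np.int64)
    while len(locs) < n:
        draws = np.random.randint(0, len(series), size=oversample * n)
        # pd.unique keeps first-seen order, so earlier draws keep priority
        # and the first n distinct positions form an unbiased sample.
        locs = pd.unique(np.concatenate([locs, draws]))
    return series.take(locs[:n])

series = pd.Series(np.random.randn(1_000_000))
picked = sample_unique(series, 1000)
print(len(picked), picked.index.is_unique)
```

Each pass costs time proportional to n rather than to the series length, and a second pass is only needed when the first one produced too many duplicates.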
