Answer to: Rolling mean with the window size given by a range of column values


1 answer

Anonymous, 1 day ago

pandas' <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.indexers.BaseIndexer.html" rel="nofollow noreferrer">BaseIndexer</a> is quite handy, although it takes a bit of head-scratching to get right. Below, I use <a href="https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html" rel="nofollow noreferrer">np.searchsorted</a> to quickly find the (start, end) indices of each window. Note that <code>np.searchsorted</code> requires the column values to be sorted in ascending order:

<pre><code>import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class RangeWindow(BaseIndexer):
    def __init__(self, val, width):
        # set the base-class attributes (e.g. window_size) that pandas may read
        super().__init__()
        self.val = val.values  # must be sorted ascending for searchsorted
        self.width = width

    # `step` is accepted because pandas 1.5+ requires it in the signature
    # (drop it on older versions); it is ignored here
    def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
        if min_periods is None:
            min_periods = 0
        if closed is None:
            closed = 'left'
        # window offsets relative to each value: centered or forward-looking
        w = (-self.width/2, self.width/2) if center else (0, self.width)
        side0 = 'left' if closed in ['left', 'both'] else 'right'
        side1 = 'right' if closed in ['right', 'both'] else 'left'
        ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
        ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
        ix1 = np.maximum(ix1, ix0 + min_periods)

        return ix0, ix1
</code></pre>

A few deluxe options: <code>min_periods</code>, <code>center</code> and <code>closed</code> are implemented according to what <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html" rel="nofollow noreferrer">DataFrame.rolling</a> specifies.

Application:

<pre><code>df = pd.DataFrame([
    [4.5, 10],
    [4.6, 11],
    [4.8, 9],
    [5.5, 6],
    [5.6, 6],
    [8.1, 10],
    [8.2, 13]
], columns='a b'.split())

df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()

# gives:
0    10.0
1    10.0
2    10.0
3     6.0
4     6.0
5    11.5
6    11.5
Name: b, dtype: float64
</code></pre>

Timing:

<pre><code>df = pd.DataFrame(
    np.random.uniform(0, 1000, size=(1_000_000, 2)),
    columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)

%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()

CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms
</code></pre>

Update on performance:

Following a comment from @anon01, I wondered whether this could be made faster when the rolling involves large windows. It turns out I should have measured pandas' rolling mean and sum performance first... (Premature optimization, anyone?)

Anyway, the idea is to compute a <code>cumsum</code> just once, then take the difference of the elements indexed by the window endpoints:

<pre><code># both functions below work on NumPy arrays (a must be sorted ascending):
def fast_rolling_sum(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))  # z[k] = sum of b[:k]
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return z[ix1] - z[ix0]

def fast_rolling_mean(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))  # z[k] = sum of b[:k]
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return (z[ix1] - z[ix0]) / (ix1 - ix0)
</code></pre>

With this (and the 1-million-row df above), I see:

<pre><code>%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop
</code></pre>

versus:

<pre><code>%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop
</code></pre>

However, pandas is likely doing this kind of optimization already (it is a pretty obvious one). The timings do not increase with larger windows (which is why I said I should have checked first).
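As a sanity check on the cumulative-sum idea, here is a minimal sketch that compares it against a brute-force reference on the small example from the answer. The helper `naive_rolling_mean` is hypothetical (it is not part of the original answer), and the sketch assumes a centered window with both endpoints included and `a` sorted in ascending order.

```python
import numpy as np


def fast_rolling_mean(a, b, width):
    """Centered rolling mean of b over [a_i - width/2, a_i + width/2].

    Assumes `a` is sorted ascending (required by np.searchsorted).
    """
    z = np.concatenate(([0.0], np.cumsum(b)))  # z[k] = sum of b[:k]
    ix0 = np.searchsorted(a, a - width / 2, side="left")
    ix1 = np.searchsorted(a, a + width / 2, side="right")
    # sum over b[ix0:ix1] is z[ix1] - z[ix0]; each window contains a_i itself,
    # so ix1 - ix0 is at least 1
    return (z[ix1] - z[ix0]) / (ix1 - ix0)


def naive_rolling_mean(a, b, width):
    """Hypothetical O(n^2) reference: build each window with a boolean mask."""
    out = np.empty(len(a))
    for i, x in enumerate(a):
        mask = (a >= x - width / 2) & (a <= x + width / 2)
        out[i] = b[mask].mean()
    return out


a = np.array([4.5, 4.6, 4.8, 5.5, 5.6, 8.1, 8.2])
b = np.array([10.0, 11.0, 9.0, 6.0, 6.0, 10.0, 13.0])

fast = fast_rolling_mean(a, b, width=1.0)
naive = naive_rolling_mean(a, b, width=1.0)
print(np.allclose(fast, naive))
```

One caveat worth keeping in mind: differences of cumulative sums can accumulate floating-point error on long arrays or values of very different magnitudes, so a comparison like this should use a tolerance rather than exact equality.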
