Is it possible to use pandas.DataFrame.rolling with a step greater than 1?

Posted 2024-10-02 18:21:12


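In R, you can compute a rolling mean over a window of a given length, and have that window shift by a given amount each time.

Maybe I just haven't found it, but there doesn't seem to be a way to do this in pandas or in any other Python library.

Does anyone know a way around this? Here is an example of what I mean:

[image: example]

Here we have bi-weekly data, and I am computing a two-month moving average that shifts by 1 month, which is 2 rows.

So in R I would do something like: two_month__movavg=rollapply(mydata,4,mean,by = 2,na.pad = FALSE). Is there no equivalent in Python?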

Edit 1:

                      DATE    A DEMAND   ...     AA DEMAND  A Price
    0  2006/01/01 00:30:00  8013.27833   ...     5657.67500    20.03
    1  2006/01/01 01:00:00  7726.89167   ...     5460.39500    18.66
    2  2006/01/01 01:30:00  7372.85833   ...     5766.02500    20.38
    3  2006/01/01 02:00:00  7071.83333   ...     5503.25167    18.59
    4  2006/01/01 02:30:00  6865.44000   ...     5214.01500    17.53

3 answers

You can still use rolling; it just takes a little extra work to assign the result to the right index positions.

Here by = 2:

import numpy as np

by = 2

# Index alignment keeps the rolling mean only at positions 1, 3, 5, ...
df.loc[df.index[np.arange(len(df))%by==1],'New']=df.Price.rolling(window=4).mean()
df
    Price    New
0      63    NaN
1      92    NaN
2      92    NaN
3       5  63.00
4      90    NaN
5       3  47.50
6      81    NaN
7      98  68.00
8     100    NaN
9      58  84.25
10     38    NaN
11     15  52.75
12     75    NaN
13     19  36.75
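The answer above can be reproduced end to end with the Price values from its output table. The following is only a sketch: the names `window`, `full` and `keep` are introduced here for illustration, and the remainder is derived from the window length (an assumption that generalizes the hard-coded `== 1`) so that the first complete window is always kept.

```python
import numpy as np
import pandas as pd

# Price values copied from the table above
df = pd.DataFrame(
    {"Price": [63, 92, 92, 5, 90, 3, 81, 98, 100, 58, 38, 15, 75, 19]}
)

by = 2      # step between consecutive windows
window = 4  # window length

# Full rolling mean; the first window - 1 entries are NaN
full = df["Price"].rolling(window=window).mean()

# Keep only positions with the same remainder as the first complete
# window (position window - 1), here positions 3, 5, 7, ...
keep = np.arange(len(df)) % by == (window - 1) % by
df["New"] = full.where(keep)
print(df)
```

This still computes every window and discards the ones that are not needed, so it saves no work on long series.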

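I know this question was asked a long time ago, but I ran into the same problem: when working with long time series you really want to avoid unnecessary computation for values you are not interested in. Since the rolling method did not implement a step parameter, I wrote a workaround using numpy.

It is basically a combination of the solution in this link and the indexing proposed by BENY.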

import numpy as np


def apply_rolling_data(data, col, function, window, step=1, labels=None):
    """Perform a rolling window analysis on the column `col` of `data`

    Given a dataframe `data` holding time series, call `function` on
    sections of length `window` of the data in column `col`. Append
    the results to `data` in new columns named by `labels`.

    Parameters
    ----------
    data : DataFrame
        Data to be analyzed, the dataframe must store time series
        columnwise, i.e., each column represents a time series and each
        row a time index
    col : str
        Name of the column from `data` to be analyzed
    function : callable
        Function to be called to calculate the rolling window
        analysis, the function must receive as input an array or
        pandas series. Its output must be either a number or a pandas
        series
    window : int
        length of the window to perform the analysis
    step : int
        step to take between two consecutive windows
    labels : list of str, optional
        Names of the output columns, one per value returned by
        `function`. If None, it defaults to 'metric_0', 'metric_1',
        and so on

    Returns
    -------
    data : DataFrame
        Input dataframe with added columns with the result of the
        analysis performed

    """

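    x = _strided_app(data[col].to_numpy(), window, step)
    rolled = np.apply_along_axis(function, 1, x)
    if rolled.ndim == 1:  # `function` returned one number per window
        rolled = rolled[:, None]

    if labels is None:
        labels = [f"metric_{i}" for i in range(rolled.shape[1])]
    elif isinstance(labels, str):
        labels = [labels]

    for label in labels:  # do not shadow the `col` argument
        data[label] = np.nan

    # The mask is True at the last row of every evaluated window
    data.loc[
        data.index[
            [False]*(window-1)
            + list(np.arange(len(data) - (window-1)) % step == 0)],
        labels] = rolled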

    return data


def _strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    """Return a 2-D strided view of `a`: one row per window of length L,
    with consecutive windows starting S elements apart
    """
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(
        a, shape=(nrows, L), strides=(S*n, n))
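As a possible alternative to calling `as_strided` by hand, NumPy 1.20 and later provide `sliding_window_view`, which builds the same kind of window view with bounds checking. The sketch below is not part of the original answer; `strided_windows` is a hypothetical helper name, and the sample values are the Price column from the first answer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def strided_windows(a, window, step):
    """Hypothetical helper: windows of length `window`, `step` apart."""
    a = np.asarray(a)
    # View of all overlapping windows (no copy), then every step-th one
    return sliding_window_view(a, window)[::step]


prices = np.array([63, 92, 92, 5, 90, 3, 81, 98, 100, 58, 38, 15, 75, 19])
windows = strided_windows(prices, window=4, step=2)

# The function is only evaluated on the windows that were kept
means = windows.mean(axis=1)
print(windows.shape)
print(means)
```

Because both the window view and the `[::step]` slice are views, the skipped windows are never materialized or evaluated.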

If the data is not too large, here is a simple way:

by = 2
win = 4
start = 3 ## it is the index of your 1st valid value.
df.rolling(win).mean()[start::by] ## calculate all, choose what you need.
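Newer pandas releases (1.5 and later, which is an assumption about your environment) accept a `step` argument in `rolling` itself. A minimal sketch, using the Price values from the first answer as sample data, compares it with the compute-then-slice approach. Note that the built-in step evaluates rows 0, by, 2*by, ..., so it selects different windows than the `start = 3` slice above.

```python
import pandas as pd

df = pd.DataFrame(
    {"Price": [63, 92, 92, 5, 90, 3, 81, 98, 100, 58, 38, 15, 75, 19]}
)
win, by = 4, 2

# Built-in step (assumes pandas >= 1.5): evaluates rows 0, by, 2*by, ...
stepped = df.rolling(win, step=by).mean()

# The same rows, obtained by computing every window and slicing afterwards
sliced = df.rolling(win).mean().iloc[::by]

print(stepped)
print(sliced)
```

If the windows must line up with the first complete one, as in the R call from the question, the slice with an explicit start remains the simpler option.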
