Reindexing a MultiIndex on a level with dates that are "close"

Posted 2024-06-25 23:27:50

Question

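I have a pandas.Series with a two-level pandas.MultiIndex. The first level is a date. I also have another DatetimeIndex whose values are close to some of the dates in my series.index.levels[0]. I want to reindex my series using the dates from the "other" DatetimeIndex that are close enough to existing dates in the index. Assume that by "close" I mean within two days.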

Setup

import pandas as pd
import numpy as np

np.random.seed([3, 1415])

chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

#                   Equal Date     + 3 Days      - 1 Day       + 2 Days
i0 = pd.to_datetime(
    [              '2018-11-30', '2018-12-16', '2018-12-30', '2019-01-17'              ])
i1 = pd.to_datetime(
    ['2018-10-31', '2018-11-30', '2018-12-13', '2018-12-31', '2019-01-15', '2019-01-31'])
#                    Include       Skip          Include       Include

lvl0 = i0.repeat(5)
lvl1 = np.concatenate(
    [np.random.choice([*chars], size=5, replace=False) for _ in range(4)])

midx = pd.MultiIndex.from_tuples([*zip(lvl0, lvl1)], names=['date', 'ID'])

s0 = pd.Series(np.arange(4).repeat(5), midx, name='stuff')

s0

date        ID
2018-11-30  S     0
            O     0
            J     0
            H     0
            D     0
2018-12-16  Q     1
            B     1
            A     1
            S     1
            P     1
2018-12-30  U     2
            S     2
            A     2
            J     2
            L     2
2019-01-17  K     3
            U     3
            V     3
            S     3
            H     3
Name: stuff, dtype: int64

What I want

Note: same dtype as the original

date        ID
2018-11-30  S     0
            O     0
            J     0
            H     0
            D     0
2018-12-31  U     2
            S     2
            A     2
            J     2
            L     2
2019-01-15  K     3
            U     3
            V     3
            S     3
            H     3
Name: stuff, dtype: int64

What I did

tol = pd.Timedelta('2D')

# 0. This should be the same as the `i0` I used to set up
#    But supposing that wasn't available, we would...
i0 = s0.index.levels[0]

# 1. Broadcast date differences
# 2. Take the absolute value
# 3. Find the position of minimum absolute value for each row
# 4. Define a proposal of new index level values with those positions
i_proposal = i1[np.abs(np.subtract.outer(i0, i1)).argmin(1)]

# 5. Use proposal to get which ones are within the
#    tolerance of 2 days
i_final = i_proposal[np.abs(i_proposal - i0) <= tol]

# 6. set_levels with proposal.
#    because at this point there is a one-to-one correspondence
#    (assigned back, since `inplace=True` is no longer accepted in pandas 2.0+)
s0.index = s0.index.set_levels(i_proposal, level=0)

# 7. use `loc` to pull out the final ones
s0.loc[i_final]

date        ID
2018-11-30  S     0
            O     0
            J     0
            H     0
            D     0
2018-12-31  U     2
            S     2
            A     2
            J     2
            L     2
2019-01-15  K     3
            U     3
            V     3
            S     3
            H     3
Name: stuff, dtype: int64

Problems with my solution

  1. It is the opposite of "smooth"
  2. It operates in place on s0.index
  3. It is O(len(i0) * len(i1)). There should be an O(len(i0) + len(i1)) solution.
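
As a rough illustration of the third point, the full len(i0) x len(i1) difference matrix can be avoided when both indexes are sorted. The sketch below uses a binary search via np.searchsorted, so it is O(len(i0) * log(len(i1))) rather than the linear bound mentioned above. The helper name nearest_within is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def nearest_within(old, new, tol):
    """Hypothetical helper: for each date in `old`, return the position of the
    nearest date in the sorted index `new`, or -1 if it is farther than `tol`."""
    old_v, new_v = old.values, new.values
    pos = np.searchsorted(new_v, old_v)            # insertion points in `new`
    right = np.clip(pos, 0, len(new_v) - 1)        # candidate at or after
    left = np.clip(pos - 1, 0, len(new_v) - 1)     # candidate before
    d_right = np.abs(new_v[right] - old_v)
    d_left = np.abs(old_v - new_v[left])
    nearest = np.where(d_left <= d_right, left, right)
    dist = np.minimum(d_left, d_right)
    return np.where(dist <= tol.to_timedelta64(), nearest, -1)


i0 = pd.to_datetime(['2018-11-30', '2018-12-16', '2018-12-30', '2019-01-17'])
i1 = pd.to_datetime(['2018-10-31', '2018-11-30', '2018-12-13',
                     '2018-12-31', '2019-01-15', '2019-01-31'])

positions = nearest_within(i0, i1, pd.Timedelta('2D'))
```

Only two candidates per date are compared (the neighbours on either side of the insertion point), which is why no outer difference is needed.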

Can anyone think of a better way?


2 Answers

This is a pd.merge_asof problem. I would do it like this:

res = pd.merge_asof(
        s0.to_frame(),                  # should be first, simulate how='left'
        i1.to_frame(),                  # should be second 
        tolerance=pd.Timedelta(days=2), # two days tolerance
        left_on='date',                 # select index level for s0
        right_index=True,               
        direction='nearest')            # default is 'backward', not as useful

s0[res[0].notna()]

date        ID
2018-11-30  S     0
            O     0
            J     0
            H     0
            D     0
2018-12-30  U     2
            S     2
            A     2
            J     2
            L     2
2019-01-17  K     3
            U     3
            V     3
            S     3
            H     3
Name: stuff, dtype: int64

Note that this keeps the index from s0 (which may not be what you want).
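
To see why some rows drop out, here is a minimal sketch of how tolerance and direction='nearest' interact in pd.merge_asof. The frames left and right and the column name newdate are small made-up examples, not the data from the question. Both frames must be sorted on their keys.

```python
import pandas as pd

# Hypothetical toy frames, sorted by their date keys
left = pd.DataFrame({
    'date': pd.to_datetime(['2018-11-30', '2018-12-16', '2018-12-30']),
    'stuff': [0, 1, 2],
})
right = pd.DataFrame({
    'newdate': pd.to_datetime(['2018-11-30', '2018-12-13', '2018-12-31']),
})

matched = pd.merge_asof(
    left, right,
    left_on='date', right_on='newdate',
    direction='nearest',              # look both backward and forward
    tolerance=pd.Timedelta(days=2),   # matches farther than this become NaT
)

# Rows whose nearest right-hand date is too far away have a missing newdate
unmatched = matched['newdate'].isna()
```

Here the middle row (2018-12-16) is three days from its nearest candidate, so its newdate is missing, while the left-hand date column is kept for every row.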


Edit by piR

This is what I wanted

tol = pd.Timedelta(days=2)
right = pd.DataFrame(dict(newdate=i1), i1)
left = s0.to_frame()

kw = dict(
    left=left, right=right, tolerance=tol,
    left_on='date', right_index=True, direction='nearest'
)

res = pd.merge_asof(**kw)
res = res.dropna() \
         .reset_index() \
         .set_index(['newdate', 'ID']) \
         .stuff.rename_axis(['date', 'ID'])
res

date        ID
2018-11-30  S     0
            O     0
            J     0
            H     0
            D     0
2018-12-31  U     2
            S     2
            A     2
            J     2
            L     2
2019-01-15  K     3
            U     3
            V     3
            S     3
            H     3
Name: stuff, dtype: int64

This is very close to what cs95 did, using reindex

s,y=i1.reindex(s0.index.levels[0],tolerance=pd.Timedelta(days=2),method='nearest')

s0.loc[s[y!=-1]]
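
As a sketch of what the indexer holds, the same lookup can be done with Index.get_indexer, which, as far as I know, returns the same integer positions as the second element returned by Index.reindex. The names old, new and mapping are illustrative. The dates are the ones from the question's setup.

```python
import pandas as pd

old = pd.to_datetime(['2018-11-30', '2018-12-16', '2018-12-30', '2019-01-17'])
new = pd.to_datetime(['2018-10-31', '2018-11-30', '2018-12-13',
                      '2018-12-31', '2019-01-15', '2019-01-31'])

# Position in `new` of the nearest date for each entry of `old`;
# -1 marks entries with no date within the tolerance
indexer = new.get_indexer(old, method='nearest', tolerance=pd.Timedelta(days=2))

keep = indexer != -1
# Old date -> replacement date, only for the dates that found a match
mapping = dict(zip(old[keep], new[indexer[keep]]))
```

The -1 entries are what the `y != -1` mask filters out above.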

If needed, change the index level to the values from i1

s = s0.index.levels[0].values
# compare as NumPy arrays; recent pandas versions reject 2-D indexing on an Index
t = abs(i1.values[:, None] - s) / np.timedelta64(1, 'D') <= 2

f=s0.loc[s[t.any(0)]].reset_index(level=1)

f.index=f.index.map(dict(zip(s[t.any(0)],i1[t.any(1)])))
f.set_index('ID',append=True,inplace=True)
f
Out[458]: 
               stuff
date       ID       
2018-11-30 S       0
           O       0
           J       0
           H       0
           D       0
2018-12-31 U       2
           S       2
           A       2
           J       2
           L       2
2019-01-15 K       3
           U       3
           V       3
           S       3
           H       3

Edit by piR

I reworked it like this

lvl0, lvl1 = s0.index.levels
_, indexer = i1.reindex(lvl0, tolerance=tol, method='nearest')
newlvl0 = i1[indexer]
msklvl0 = newlvl0[indexer != -1]

newidx = s0.index.set_levels([newlvl0, lvl1])
s0.set_axis(newidx).loc[msklvl0]

date        ID
2018-11-30  S     0
            O     0
            J     0
            H     0
            D     0
2018-12-31  U     2
            S     2
            A     2
            J     2
            L     2
2019-01-15  K     3
            U     3
            V     3
            S     3
            H     3
Name: stuff, dtype: int64
