Bootstrap sampling by group on a large pandas DataFrame

import pandas as pd

df = pd.DataFrame({
    'personid': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'month': ['Jan', 'Feb', 'Mar', 'Aug', 'Sep', 'Mar', 'Apr', 'May', 'Jun'],
    'values': [100, 200, 300, 400, 500, 600, 700, 800, 900],
})
df

  month  personid  values
0   Jan         1     100
1   Feb         1     200
2   Mar         1     300
3   Aug         2     400
4   Sep         2     500
5   Mar         3     600
6   Apr         3     700
7   May         3     800
8   Jun         3     900

  month  personid  values
0   Mar         3     600
1   Apr         3     700
2   May         3     800
3   Jun         3     900
4   Mar         3     600
5   Apr         3     700
6   May         3     800
7   Jun         3     900
8   Aug         2     400
9   Sep         2     500

def create_bootstrapped_df(df, sampled_personids):
    """
    Create "Block" Bootstrapped DataFrame given a vector of sampled_personids

    Keyword Args:
        df: DataFrame containing cost data at the personid, month level
        sampled_personids: A vector of personids that is already sampled
            with replacement.
    """
    bootstrapped = []
    for person in sampled_personids:
        person_df = df.loc[df.personid == person]
        bootstrapped.append(person_df)
    bootstrapped_sample = pd.concat(bootstrapped)
    bootstrapped_sample.reset_index(drop=True, inplace=True)
    return bootstrapped_sample
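As a minimal sketch of how the function above might be used, assume the draw happened to be personids 3, 3 and 2 (a hypothetical draw, chosen only because it matches the sample result shown earlier). The second function, `create_bootstrapped_df_grouped`, is not from the original post; it is one possible variant that splits the DataFrame once with groupby instead of rescanning every row for each sampled ID.

```python
import pandas as pd

df = pd.DataFrame({
    "personid": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "month": ["Jan", "Feb", "Mar", "Aug", "Sep", "Mar", "Apr", "May", "Jun"],
    "values": [100, 200, 300, 400, 500, 600, 700, 800, 900],
})


def create_bootstrapped_df(df, sampled_personids):
    # Same logic as the function in the question: one boolean scan per draw.
    blocks = [df.loc[df.personid == person] for person in sampled_personids]
    return pd.concat(blocks).reset_index(drop=True)


def create_bootstrapped_df_grouped(df, sampled_personids):
    # Hypothetical variant (an assumption, not the poster's code):
    # split the frame into per-person blocks once, then look blocks up by ID.
    groups = {pid: block for pid, block in df.groupby("personid")}
    blocks = [groups[pid] for pid in sampled_personids]
    return pd.concat(blocks).reset_index(drop=True)


sampled_personids = [3, 3, 2]  # hypothetical draw, already sampled with replacement
looped = create_bootstrapped_df(df, sampled_personids)
grouped = create_bootstrapped_df_grouped(df, sampled_personids)
print(looped)
```

Both functions return person 3's four rows twice, followed by person 2's two rows. Whether the grouped variant is actually faster depends on the number of persons and draws, so it is worth timing on real data.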

2 Answers

User

#1 · edited 2024-09-30 14:20:22

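You can use merge. First, create a bootstrapped_df from randomly sampled personids:

# personids is assumed to be an array of the unique IDs, e.g. df.personid.unique()
bootstrapped_df = pd.DataFrame({
    'personid': np.random.choice(personids, size=personids.size, replace=True)
})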

In my run, this gave:

   personid
0         2
1         1
2         1

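Then use merge with the parameter how='left'. With no on argument, merge joins on the only shared column, personid, and the left join returns one full block of rows per sampled ID, in draw order:

bootstrapped_df = bootstrapped_df.merge(df, how='left')

This gives me the following bootstrapped_df: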

   personid month  values
0         2   Aug     400
1         2   Sep     500
2         1   Jan     100
3         1   Feb     200
4         1   Mar     300
5         1   Jan     100
6         1   Feb     200
7         1   Mar     300

Edit: you can do all of this in a single expression:

bootstrapped_df = (
    pd.DataFrame({
        'personid': np.random.choice(personids, size=personids.size, replace=True)
    })
    .merge(df, how='left')
)
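The merge approach above can be sketched end to end as follows. This is only an illustration under stated assumptions: `personids` is taken to be `df["personid"].unique()`, the seeded `rng` is there for reproducibility, and the `draw` column is a hypothetical addition (not in the answer) that labels each draw so repeated draws of the same person can be told apart.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "personid": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "month": ["Jan", "Feb", "Mar", "Aug", "Sep", "Mar", "Apr", "May", "Jun"],
    "values": [100, 200, 300, 400, 500, 600, 700, 800, 900],
})

# Assumption: personids is the array of unique person IDs.
personids = df["personid"].unique()

# Seeded generator so the sketch is reproducible (seed value is arbitrary).
rng = np.random.default_rng(0)
sampled = rng.choice(personids, size=personids.size, replace=True)

# One row per draw; "draw" is a hypothetical helper column.
draws = pd.DataFrame({"personid": sampled, "draw": np.arange(sampled.size)})

# Left join: every draw pulls in all rows of that person, in draw order.
bootstrapped_df = draws.merge(df, on="personid", how="left")
print(bootstrapped_df)
```

If per-block statistics are needed afterwards, grouping on `draw` rather than `personid` keeps duplicate draws of the same person separate.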

User

#2 · edited 2024-09-30 14:20:22

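Actually, I just figured out a very simple way to do this. If I set personid as the index, I can subset the DataFrame by index label, and it does exactly what I want.

For example, if I do: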

sampled_personids = np.random.choice(personids, size=personids.size, replace=True)

This gives me:

array([1, 2, 2])

If I then do:

df.loc[sampled_personids]

I get:

          month personid values
personid
1         Jan   1        100
1         Feb   1        200
1         Mar   1        300
2         Aug   2        400
2         Sep   2        500
2         Aug   2        400
2         Sep   2        500
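The index-based approach can be sketched as below. The answer does not show the step that sets the index, so `set_index("personid", drop=False)` is an assumption here, picked because the output above still shows personid as a column. The draw `[1, 2, 2]` is fixed by hand to match that output rather than sampled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "personid": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "month": ["Jan", "Feb", "Mar", "Aug", "Sep", "Mar", "Apr", "May", "Jun"],
    "values": [100, 200, 300, 400, 500, 600, 700, 800, 900],
})

# Assumption: keep personid as a column too (drop=False), as in the output above.
indexed = df.set_index("personid", drop=False)

# Fixed draw for illustration; in practice this would come from
# np.random.choice(personids, size=personids.size, replace=True).
sampled_personids = np.array([1, 2, 2])

# Label-based selection on a non-unique index returns every row for each
# requested label, repeated once per occurrence in sampled_personids.
bootstrapped = indexed.loc[sampled_personids]
print(bootstrapped)
```

The result keeps the repeated personid labels as its index; calling `reset_index(drop=True)` on it gives a fresh 0..n-1 index if one is needed.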
