逐列选择Pandas数据帧中的随机值

2条回答

网友

1楼 · 编辑于 2024-09-27 07:34:55

既然你得不到更多的建议，我就试一试：

检查以下代码示例（代码注释中的解释）：

import pandas as pd
import numpy as np
from io import StringIO

str = """userID  dayID  feature0  feature1  feature2  feature3
xy1      0        24      15.3        41        43
xy1      1         5      24.0        34        40
xy1      2        30       7.0         8        10
gh3      0        50       4.0        11        12
gh3      1        49       3.0        59        11
gh3      2         4       9.0        12        15
"""

df = pd.read_table(StringIO(str), sep='\s+')

def randx(dfg):
    # create a list of row-indices and make sure 0,1,2 are all in so that  
    # all dayIDs are covered and the last one is randomly selected from [0,1,2]
    x = [ 0, 1, 2, np.random.randint(3) ]

    # shuffle the list of row-indices
    np.random.shuffle(x)

    # enumerate list-x, with the row-index and the counter aligned with the column-index,
    # to retrieve the actual element in the dataframe. the 2 in enumerate 
    # is to skip the first two columns which are 'userID' and 'dayID'
    return pd.Series([ dfg.iat[j,i] for i,j in enumerate(x,2) ])

    ## you can also return the list of result into one column
#    return [ dfg.iat[j,i] for i,j in enumerate(x,2) ]

def feature_name(x):
    return 'feature{}'.format(x)

# if you have many irrelevant columns, then
# retrieve only columns required for calculations
# if you have 1000+ columns(features) and all are required
# skip the following line, you might instead split your dataframe using slicing,  
# i.e. putting 200 features for each calculation, and then merge the results
new_df = df[[ "userID", "dayID", *map(feature_name, [0,1,2,3]) ]]

# do the calculations
d1 = (new_df.groupby('userID')
            .apply(randx)
            # comment out the following .rename() function if you want to 
            # return list instead of Series
            .rename(feature_name, axis=1)
     )

print(d1)
##
        feature0  feature1  feature2  feature3
userID                                        
gh3          4.0       9.0      59.0      12.0
xy1         24.0       7.0      34.0      10.0

更多想法：

在运行apply（randx）之前，可以给出满足要求的随机行索引列表。例如，如果所有userID都具有相同数量的dayIDs，则可以使用一个列表列表来预置这些行索引。你也可以使用列表字典。在
提醒：如果您使用list of list和L.pop（）生成行索引，请确保列表的数量至少应为number of unique userID+1，因为GroupBy.apply（）在第一个组上调用其函数两次
而不是返回pd系列（）在函数randx（）中，可以直接返回一个列表（请参见函数randx（）中带注释的return）。在这种情况下，所有检索到的特性将保存在一列中（见下文），您可以稍后对它们进行规范化。在
```
userID
gh3    [50, 3.0, 59, 15]
xy1    [30, 7.0, 34, 43]
```
如果它仍然运行缓慢，可以将1000多个列（特性）分成组，即每次运行处理200个特性，相应地分割预定义的行索引，然后合并结果。

更新：在虚拟机（Debian-8，2GB RAM，1个CPU）上的示例测试下面：

N_users = 100
N_days = 7
N_features = 1000

users = [ 'user{}'.format(i) for i in range(N_users) ]
days  = [ 'day{}'.format(i) for i in range(N_days)   ]
data =  []
for u in users:
    for d in days:
        data.append([ u, d, *np.random.rand(N_features)])

def feature_name(x):
    return 'feature{}'.format(x)

df = pd.DataFrame(data, columns=['userID', 'dayID', *map(feature_name, range(N_features))])

def randx_to_series(dfg):
    x = [ *range(N_days), *np.random.randint(N_days, size=N_features-N_days) ]
    np.random.shuffle(x)
    return pd.Series([ dfg.iat[j,i] for i,j in enumerate(x,2) ])

def randx_to_list(dfg):
    x = [ *range(N_days), *np.random.randint(N_days, size=N_features-N_days) ]
    np.random.shuffle(x)
    return [ dfg.iat[j,i] for i,j in enumerate(x,2) ]

In [133]: %timeit d1 = df.groupby('userID').apply(randx_to_series)
7.82 s +/- 202 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

In [134]: %timeit d1 = df.groupby('userID').apply(randx_to_list)
7.7 s +/- 47.2 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

In [135]: %timeit d1 = df.groupby('userID').agg(lambda x: np.random.choice(x,1))
8.18 s +/- 31.1 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

# new test: calling np.random.choice() w/o using the lambda is much faster
In [xxx]: timeit d1 = df.groupby('userID').agg(np.random.choice)
4.63 s +/- 24.7 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

但是，速度与使用agg的原始方法相似(np.随机选择（）），但这个理论上是不正确的。你可能需要定义在你的期望中什么是慢的。在

有关randx_to_series（）的更多测试：

with 2000 features, thus total 2002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:
15.8 s +/- 225 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

with 5000 features, thus total 5002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:
39.3 s +/- 628 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

with 10000 features, thus 10002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:     
1min 21s +/- 1.73 s per loop (mean +/- std. dev. of 7 runs, 1 loop each)

希望这有帮助。在

环境：python3.6.4，Pandas 0.22.0

网友

2楼 · 编辑于 2024-09-27 07:34:55

我承认，我对这个解决方案有点创意。在

我不认为你发布的代码和你在问题中解释的不一样。但是，这里有一段代码，它确实按用户标识随机分配每个特性的日期。在

df.groupby('userID').apply(lambda x: x.apply(lambda x: x.sample(n=1)).ffill().bfill().head(1))

输出：

^{pr2}$

注意，这可能真的很慢，似乎一个新的解决方案可能会更快。在

在

相关问题更多 >

编程相关推荐

热门问题

热门文章

逐列选择Pandas数据帧中的随机值

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >