Answer to: Selecting random values column by column in a Pandas DataFrame




1 answer

Anonymous, 1 day ago


Since you aren't getting more suggestions, I'll give it a try. See the following code example (the explanations are in the code comments):

<pre><code>import pandas as pd
import numpy as np
from io import StringIO

# sample data (named data_str to avoid shadowing the built-in str)
data_str = """userID  dayID  feature0  feature1  feature2  feature3
xy1      0        24       15.3        41        43
xy1      1         5       24.0        34        40
xy1      2        30        7.0         8        10
gh3      0        50        4.0        11        12
gh3      1        49        3.0        59        11
gh3      2         4        9.0        12        15
"""

df = pd.read_csv(StringIO(data_str), sep=r'\s+')

def randx(dfg):
    # create a list of row-indices and make sure 0,1,2 are all in so that
    # all dayIDs are covered; the last one is randomly selected from [0,1,2]
    x = [0, 1, 2, np.random.randint(3)]

    # shuffle the list of row-indices
    np.random.shuffle(x)

    # enumerate list x so that the counter i is the column-index and j is the
    # row-index of the element to retrieve from the dataframe. The 2 in
    # enumerate skips the first two columns, 'userID' and 'dayID', so this
    # relies on the group frame still containing those two columns first
    return pd.Series([dfg.iat[j, i] for i, j in enumerate(x, 2)])

    ## you can also return the results as a list, stored in one column
    # return [dfg.iat[j, i] for i, j in enumerate(x, 2)]

def feature_name(x):
    return 'feature{}'.format(x)

# if you have many irrelevant columns, retrieve only the columns required
# for the calculation. If you have 1000+ columns (features) and all are
# required, skip the following line; you might instead split your dataframe
# using slicing, i.e. 200 features per calculation, and then merge the results
new_df = df[["userID", "dayID", *map(feature_name, [0, 1, 2, 3])]]

# do the calculations
d1 = (new_df.groupby('userID')
            .apply(randx)
            # comment out the following .rename() if you want to
            # return a list instead of a Series
            .rename(feature_name, axis=1)
     )

print(d1)
## one possible output (the selection is random):
#         feature0  feature1  feature2  feature3
# userID
# gh3          4.0       9.0      59.0      12.0
# xy1         24.0       7.0      34.0      10.0
</code></pre>

More thoughts:

<ol>
<li>A list of random row-indices that satisfies the requirements can be prepared before running apply(randx). For example, if all userIDs have the same number of dayIDs, you can use a list of lists to preset these row-indices. You can also use a dictionary of lists. A reminder: if you use a list of lists and L.pop() to hand out the row-indices, make sure the number of lists is at least the number of unique userIDs + 1, since GroupBy.apply() calls its function twice on the first group.</li>
<li>Instead of returning a pd.Series() from randx(), you can return a list directly (see the commented-out return in randx()). In that case all retrieved features are saved in one column (see below), and you can normalize them later.
<pre><code>userID
gh3    [50, 3.0, 59, 15]
xy1    [30, 7.0, 34, 43]
</code></pre></li>
<li>If it still runs slowly, you can split the 1000+ columns (features) into groups, i.e. process 200 features per run, slice the predefined row-indices accordingly, and then merge the results.</li>
</ol>

Update: below is a sample test on a VM (Debian-8, 2GB RAM, 1 CPU):

<pre><code>import numpy as np
import pandas as pd

N_users = 100
N_days = 7
N_features = 1000

users = ['user{}'.format(i) for i in range(N_users)]
days = ['day{}'.format(i) for i in range(N_days)]
data = []
for u in users:
    for d in days:
        data.append([u, d, *np.random.rand(N_features)])

def feature_name(x):
    return 'feature{}'.format(x)

df = pd.DataFrame(data, columns=['userID', 'dayID', *map(feature_name, range(N_features))])

def randx_to_series(dfg):
    # every day index once, plus random day indices for the remaining features
    x = [*range(N_days), *np.random.randint(N_days, size=N_features-N_days)]
    np.random.shuffle(x)
    return pd.Series([dfg.iat[j, i] for i, j in enumerate(x, 2)])

def randx_to_list(dfg):
    x = [*range(N_days), *np.random.randint(N_days, size=N_features-N_days)]
    np.random.shuffle(x)
    return [dfg.iat[j, i] for i, j in enumerate(x, 2)]

In [133]: %timeit d1 = df.groupby('userID').apply(randx_to_series)
7.82 s +/- 202 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

In [134]: %timeit d1 = df.groupby('userID').apply(randx_to_list)
7.7 s +/- 47.2 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

In [135]: %timeit d1 = df.groupby('userID').agg(lambda x: np.random.choice(x,1))
8.18 s +/- 31.1 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

# new test: calling np.random.choice() w/o using the lambda is much faster
In [xxx]: %timeit d1 = df.groupby('userID').agg(np.random.choice)
4.63 s +/- 24.7 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)
</code></pre>

The speed is, however, similar to your original method using agg() with np.random.choice(), but that method is theoretically not correct, since it does not guarantee that every dayID is covered. You may need to define what exactly counts as slow for your purposes.

More tests on randx_to_series():

<pre><code>with 2000 features, thus total 2002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:
15.8 s +/- 225 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

with 5000 features, thus total 5002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:
39.3 s +/- 628 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

with 10000 features, thus 10002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:
1min 21s +/- 1.73 s per loop (mean +/- std. dev. of 7 runs, 1 loop each)
</code></pre>

Hope this helps.

Environment: Python 3.6.4, Pandas 0.22.0
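The same row-index idea can also be written so that it does not depend on column positions inside apply(). The following is only a minimal sketch, not the answerer's code: the helper names `covering_row_indices` and `pick_per_user` and the fixed seed are assumptions made for illustration, and it uses NumPy's `Generator` API, which requires NumPy 1.17 or later. It selects feature columns by name and picks one value per column with integer array indexing, while still making sure every row (dayID) of a user is used at least once.

```python
import numpy as np
import pandas as pd


def covering_row_indices(n_rows, n_cols, rng):
    """Return n_cols row indices in random order, each of 0..n_rows-1 at least once."""
    if n_cols < n_rows:
        raise ValueError("need at least as many columns as rows to cover every row")
    extra = rng.integers(n_rows, size=n_cols - n_rows)  # random filler indices
    idx = np.concatenate([np.arange(n_rows), extra])
    rng.shuffle(idx)  # in-place shuffle, so the guaranteed indices land anywhere
    return idx


def pick_per_user(df, feature_cols, rng):
    """For each userID, pick one value per feature column so that every row is used."""
    users, rows_out = [], []
    for user, dfg in df.groupby("userID"):
        values = dfg[feature_cols].to_numpy()
        rows = covering_row_indices(values.shape[0], values.shape[1], rng)
        # Integer array indexing: element k comes from row rows[k], column k.
        rows_out.append(values[rows, np.arange(values.shape[1])])
        users.append(user)
    return pd.DataFrame(rows_out, index=users, columns=feature_cols)


rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame({
    "userID": ["xy1"] * 3 + ["gh3"] * 3,
    "dayID": [0, 1, 2] * 2,
    "feature0": [24, 5, 30, 50, 49, 4],
    "feature1": [15.3, 24.0, 7.0, 4.0, 3.0, 9.0],
    "feature2": [41, 34, 8, 11, 59, 12],
    "feature3": [43, 40, 10, 12, 11, 15],
})
feature_cols = [c for c in df.columns if c.startswith("feature")]
result = pick_per_user(df, feature_cols, rng)
print(result)
```

Because the columns are selected by name, the sketch does not care whether the group frame contains 'userID' and 'dayID'. Whether it is faster than the apply() version on 1000+ features has not been measured here.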


