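I'm new to the groupby method in pandas and can't seem to get my head around it. I have about 2 million records, and my current code would take 4 days to run because of its inefficient use of "append".
I'm analysing manufacturing data that has two flags indicating problems with the test specimens. The first few flags for each Test_ID should be set to False. (Reason: there is not enough data to analyse the first few rows of each group accurately.)
My inefficient attempt (the result is correct, but it is not fast enough for 2M rows):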
import pandas as pd

df = pd.DataFrame({'Test_ID': ['foo', 'foo', 'foo', 'foo',
                               'bar', 'bar', 'bar'],
                   'TEST_Date': ['2020-01-09 09:49:31',
                                 '2020-01-09 12:16:15',
                                 '2020-01-09 12:47:44',
                                 '2020-01-09 14:39:05',
                                 '2020-01-09 17:39:47',
                                 '2020-01-09 20:44:58',
                                 '2020-01-10 18:40:47'],
                   'Flag1': [True, False, True, False, True, False, False],
                   'Flag2': [True, False, False, False, True, False, False],
                   })
# generate a list of Test_IDs
Test_IDs = list(df['Test_ID'].unique())
# generate a list of columns in the dataframe
cols = list(df)
# generate a new dataframe with the same columns as the original
df_output = pd.DataFrame(columns=cols)
for i in Test_IDs:
    # split the data into groups, iterate over each group
    df_2 = df[df['Test_ID'] == i].copy()
    # set the first two rows of Flag1 to False for each group
    df_2.iloc[:2, df_2.columns.get_loc('Flag1')] = False
    # set the first three rows of Flag2 to False for each group
    df_2.iloc[:3, df_2.columns.get_loc('Flag2')] = False
    # add the latest group onto the output df
    # (DataFrame.append was removed in pandas 2.0; there, use pd.concat([df_output, df_2]))
    df_output = df_output.append(df_2)
print(df_output)
Input:
   Flag1  Flag2            TEST_Date Test_ID
0   True   True  2020-01-09 09:49:31     foo
1  False  False  2020-01-09 12:16:15     foo
2   True  False  2020-01-09 12:47:44     foo
3  False  False  2020-01-09 14:39:05     foo
4   True   True  2020-01-09 17:39:47     bar
5  False  False  2020-01-09 20:44:58     bar
6  False  False  2020-01-10 18:40:47     bar
Output:
   Flag1  Flag2            TEST_Date Test_ID
0  False  False  2020-01-09 09:49:31     foo
1  False  False  2020-01-09 12:16:15     foo
2   True  False  2020-01-09 12:47:44     foo
3  False  False  2020-01-09 14:39:05     foo
4  False  False  2020-01-09 17:39:47     bar
5  False  False  2020-01-09 20:44:58     bar
6  False  False  2020-01-10 18:40:47     bar
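Let's use groupby().cumcount(), which numbers the rows within each Test_ID group starting from 0, so the first few rows of every group can be selected in one vectorized step instead of a loop.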
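A minimal sketch of the groupby().cumcount() approach, using the sample data from the question. It assumes the rows are already in the intended order within each Test_ID (cumcount follows the existing row order); the name `pos` is chosen here for illustration and is not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'Test_ID': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
    'TEST_Date': ['2020-01-09 09:49:31', '2020-01-09 12:16:15',
                  '2020-01-09 12:47:44', '2020-01-09 14:39:05',
                  '2020-01-09 17:39:47', '2020-01-09 20:44:58',
                  '2020-01-10 18:40:47'],
    'Flag1': [True, False, True, False, True, False, False],
    'Flag2': [True, False, False, False, True, False, False],
})

# Position of each row inside its Test_ID group: 0, 1, 2, ...
# (pos is an illustrative name; it is aligned with df's index)
pos = df.groupby('Test_ID').cumcount()

df_output = df.copy()
# First two rows of every group: Flag1 -> False
df_output.loc[pos < 2, 'Flag1'] = False
# First three rows of every group: Flag2 -> False
df_output.loc[pos < 3, 'Flag2'] = False

print(df_output)
```

Because the boolean masks are computed for the whole frame at once, no per-group copies or repeated appends are needed, which is where the original loop spends its time.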