Python,迭代和修改df字典中的数据帧

2024-09-22 10:17:21 发布

您现在位置:Python中文网/ 问答频道 /正文

来自C的背景(几年前),而且对Python非常陌生,我正在努力优化我的代码。从字面上讲,for循环非常慢。在

在下面的代码中,向Dict中的每个数据帧添加计算列的循环似乎是一个巨大的瓶颈。在

我已经读了很多关于解决这个问题的方法,比如;VectorisationNumba,但我认为我对Python了解不够,无法真正理解和使用它们。在

事实上,我对这两种方法的尝试都失败了,可能是不正确的实现,除了我用过的一个测试np.哪里. 这向我展示了我的for循环/计算有多糟糕。在

我将在我的工作示例中省略这些尝试,但如果需要,可以在以后添加:

import pandas as pd
import numpy as np
import datetime as date
import itertools

def points(row):
val = 0
if row['Ob2'] > 0.5:
    foo = row['Ob3'] - row['Ob1']
    if foo < 0.1:
        val = 1 - foo
else:
    val = 0
return val

print("Start: "+ str(date.datetime.now()))
print()

player_list = ['player' + str(x) for x in range(1,71)]

data = pd.DataFrame({'Names': player_list*1000,\
                   'Ob1' : np.random.rand(70000),\
                   'Ob2' : np.random.rand(70000) ,\
                   'Ob3' : np.random.rand(70000)})

#create list of unique pairs
comboNames = list(itertools.combinations(data.Names.unique(), 2))

#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame for elem in comboNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names.isin(key)]
    DataFrameDict[key] = DataFrameDict[key].sort_values(['Ob1'])

print("DF fill: "+ str(date.datetime.now()))
print()

#Add test calculated column
for tbl in DataFrameDict:
    DataFrameDict[tbl]['Test'] = DataFrameDict[tbl].apply(points, axis=1) #Slow loop
    #example vectorised, hugh dif is run time
    #DataFrameDict[tbl]['Test'] = np.where((DataFrameDict[tbl]['Ob2']>0.5),1,0)

print("Calc'd: "+ str(date.datetime.now()))
print()

headers = ['Player1','Player2','Score','Count']
summary = pd.DataFrame(([tbl[0], tbl[1], DataFrameDict[tbl]['Test'].sum(),
                      DataFrameDict[tbl]['Test'].astype(bool).sum(axis=0)] for tbl in DataFrameDict),
                      columns=headers).sort_values(['Score'], ascending=[False])

print("Fin: "+ str(date.datetime.now()))
print()

编辑:该函数添加一列,该列是每个df中两个“players”的比较,因此我们/我无法将其应用于源数据源。很抱歉没有说清楚。

显然,我需要回溯并学习一些Python基础知识,但是我的老板正在等待真正的脚本,它花了3个小时运行一个标准的500个“名称”(125K个数据帧)。在

如果有人能帮我优化它,我将不胜感激!在

EDIT2:更好地表示现实世界中的问题

^{pr2}$

我的Solution由于混乱,不想在这里发帖。在


Tags: keyinimportfordatadatetimedatenp
3条回答
import pandas as pd
import numpy as np
import datetime as date
import itertools

player_list = ['player' + str(x) for x in range(1,71)]

data = pd.DataFrame({'Names': player_list*1000,\
                   'Ob1' : np.random.rand(70000),\
                   'Ob2' : np.random.rand(70000) ,\
                   'Ob3' : np.random.rand(70000)})

data['Test'] = np.where(data['Ob2'] > 0.5, np.where(data['Ob3'] - data['Ob1'] < 0.1, 1 - (data['Ob3'] - data['Ob1']), 0), 0)


comboNames = list(itertools.combinations(data.Names.unique(), 2))
DataFrameDict = {elem : pd.DataFrame for elem in comboNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names.isin(key)]
    DataFrameDict[key] = DataFrameDict[key].sort_values(['Ob1'])

headers = ['Player1','Player2','Score','Count']
summary = pd.DataFrame(([tbl[0], tbl[1], DataFrameDict[tbl]['Test'].sum(),
                      DataFrameDict[tbl]['Test'].astype(bool).sum(axis=0)] for tbl in DataFrameDict),
                      columns=headers).sort_values(['Score'], ascending=[False])

我尽量保留你的代码。我把你的功能改成了np.哪里而不是apply,并在创建dict之前添加了test列,因为正如我在评论中所表达的那样,在那时执行apply没有任何意义。在

使用%%timeit时,每个循环得到26.2 s±1.15 s(平均值±标准偏差,7次运行,每个循环1次)

编辑:

这是我最快的速度:

^{pr2}$

我的目标是不使用循环或dicts来进一步提高速度。在

我的函数ScoreAndCount返回每个玩家的分数和计数。这个帕金森病获取函数的返回值并将其添加到初始df中。在

然后,我使用了itertools组合,并将其作为自己的数据帧,称为summary。然后,我将summary df的player1和player2列与原始df中的names列合并。在

下一步,我把玩家的分数和计数加起来,去掉不必要的列,然后进行排序。我最后每圈157ms。最慢的步骤是concat和merge,但是我想不出办法绕过它们,进一步提高速度。在

编辑3

我们将为两个测试设置一个种子并使用相同的数据df:

np.random.seed(0)

player_list = ['player' + str(x) for x in range(1,71)]

data = pd.DataFrame({'Names': player_list*10,\
                    'Ob1' : np.random.rand(700),\
                    'Ob2' : np.random.rand(700) ,\
                    'Ob3' : np.random.rand(700)})

data.head()

      Names    Ob1         Ob2         Ob3
0   player1 0.548814    0.373216    0.313591
1   player2 0.715189    0.222864    0.365539
2   player3 0.602763    0.080532    0.201267
3   player4 0.544883    0.085311    0.487148
4   player5 0.423655    0.221396    0.990369

接下来我们将使用您的确切代码,并检查player1和player2之间的dict。在


def points(row):
    val = 0
    if row['Ob2'] > 0.5:
        foo = row['Ob3'] - row['Ob1']
        if foo < 0.1:
            val = 1 - foo
    else:
        val = 0
    return val

#create list of unique pairs
comboNames = list(itertools.combinations(data.Names.unique(), 2))
DataFrameDict = {elem : pd.DataFrame for elem in comboNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names.isin(key)]
    DataFrameDict[key] = DataFrameDict[key].sort_values(['Ob1'])

#Add test calculated column
for tbl in DataFrameDict:
    DataFrameDict[tbl]['Test'] = DataFrameDict[tbl].apply(points, axis=1)

DataFrameDict[('player1', 'player2')].head()

     Names     Ob1        Ob2          Ob3     Test
351 player2 0.035362    0.013509    0.384273    0.0
630 player1 0.062636    0.305047    0.571550    0.0
561 player2 0.133461    0.758194    0.964210    0.0
211 player2 0.216897    0.056877    0.417333    0.0
631 player2 0.241902    0.557987    0.983555    0.0

接下来,我们将执行您在摘要中所做的操作,并获取测试列的总和,这将是player1和player2生成的分数

DataFrameDict[('player1', 'player2')]['Test'].sum()

8.077455441105938

所以我们得到了8.0774。现在如果我说的是真的,如果我们在Edit2中编写代码,那么player1和player2之间的分数将是8.077。在

data['test'] = np.where(data['Ob2'] > 0.5, np.where(data['Ob3'] - data['Ob1'] < 0.1, 1 - (data['Ob3'] - data['Ob1']), 0), 0)

def ScoreAndCount(row):
    score = row.sum()
    count = row.astype(bool).sum()
    return score, count

df = data.groupby('Names')['test'].apply(ScoreAndCount).reset_index()
df = pd.concat([df['Names'], df.test.apply(pd.Series).rename(columns = {0: 'Score', 1:'Count'})], axis = 1)

summary = pd.DataFrame(itertools.combinations(data.Names.unique(), 2), columns = ['Player1', 'Player2'])
summary = summary.merge(df, left_on = 'Player1', right_on = 'Names')\
                 .merge(df, left_on = 'Player2', right_on = 'Names')\
                 .drop(columns = ['Names_x', 'Names_y'])

summary['Score'] = summary['Score_x'] + summary['Score_y']
summary['Count'] = summary['Count_x'] + summary['Count_y']
summary.drop(columns = ['Score_x','Count_x', 'Score_y','Count_y'], inplace = True)
summary = summary.sort_values('Score', ascending = False)

现在我们将使用player1和player2检查行

summary[(summary['Player1'] == 'player1')&(summary['Player2'] == 'player2')]

    Player1 Player2   Score    Count
0   player1 player2 8.077455    6.0

如您所见,我通过edit2从player1player2计算出的分数与您在代码中所做的完全相同。在

我可以用numba对函数进行矢量化,结果代码用%%timeit在~8秒内运行。我听从了本·帕普的建议,事先计算了测试柱。我还预先对值进行了排序,并整理了DataFrameDict的创建。在


%%timeit
import pandas as pd
import numpy as np
import datetime as date
import itertools
import numba

@numba.vectorize
def points(a,b,c):
    val = 0
    if b > 0.5:
        foo = c - a
        if foo < 0.1:
            val = 1 - foo
    else:
        val = 0
    return val

player_list = ['player' + str(x) for x in range(1,71)]

data = pd.DataFrame({'Names': player_list*1000,\
                   'Ob1' : np.random.rand(70000),\
                   'Ob2' : np.random.rand(70000) ,\
                   'Ob3' : np.random.rand(70000)})

data['Test'] = points(data['Ob1'].values,data['Ob2'].values,data['Ob3'].values)
data = data.sort_values(['Ob1'])

comboNames = list(itertools.combinations(data.Names.unique(), 2))
DataFrameDict = {elem : data.loc[data.Names.isin(elem)] for elem in comboNames}

headers = ['Player1','Player2','Score','Count']
summary = pd.DataFrame(([tbl[0], tbl[1], DataFrameDict[tbl]['Test'].sum(),
                      DataFrameDict[tbl]['Test'].astype(bool).sum(axis=0)] for tbl in DataFrameDict),
                      columns=headers).sort_values(['Score'], ascending=[False])

每个回路8.52 s±204 ms(7次运行的平均值±标准偏差,每个回路1次)

我关注您的函数point和调用apply的for循环。在

函数Point可以转换成这个条件(a_dfDataFrameDict中的每个数据帧):

(a_df['Ob2'] > 0.5) & (a_df['Ob3'] - a_df['Ob1'] < 0.01)

在这种情况下,将值1 - x['Ob3'] + x['Ob1']分配给Test列。其他所有操作都将0分配给Test。所以,让我们将新列Test分配给每个a_df。然后,只过滤符合上述条件的行,以缩小数据集的范围并为此子集设置新值。最后,将这个子集Test列值更新回一个_df['Test'],并将其分配回DataFrameDict字典。所以,你的for循环将变成:

^{pr2}$

这跑得很快

输出:根据指定的条件,DataFrameDict列的每个数据帧都填充了Test列。我从DataFrameDict随机选取一个最终数据帧来显示输出。在

In [1288]: DataFrameDict[('player65', 'player67')]
Out[1288]:
          Names       Ob1       Ob2       Ob3      Test
61456  player67  0.000271  0.686051  0.729086  0.000000
25824  player65  0.001281  0.505552  0.296550  0.000000
25544  player65  0.001398  0.770805  0.471477  0.000000
65864  player65  0.001999  0.147407  0.291841  0.000000
33104  player65  0.002661  0.254329  0.126290  0.000000
42554  player65  0.003172  0.529603  0.181796  0.000000
28064  player65  0.003663  0.227429  0.558233  0.000000
24844  player65  0.005517  0.096817  0.710771  0.000000
2584   player65  0.005974  0.338904  0.582034  0.000000
42694  player65  0.005996  0.171637  0.765277  0.000000
6154   player65  0.006126  0.181239  0.295149  0.000000
65234  player65  0.008386  0.180613  0.994273  0.000000
5034   player65  0.008921  0.013060  0.305063  0.000000
21766  player67  0.010950  0.590966  0.481547  0.000000
53054  player65  0.010957  0.731794  0.262754  0.000000
15956  player67  0.010996  0.046718  0.153172  0.000000
36046  player67  0.011634  0.250039  0.064184  0.000000
50394  player65  0.011835  0.995986  0.834281  0.000000
64326  player67  0.011974  0.499262  0.745194  0.000000
30236  player67  0.013029  0.101714  0.143509  0.000000
23374  player65  0.014865  0.158185  0.575582  0.000000
1256   player67  0.014915  0.938301  0.629850  0.000000
10216  player67  0.015122  0.450750  0.137085  0.000000
21904  player65  0.016372  0.147897  0.786882  0.000000
34854  player65  0.016603  0.513692  0.676243  0.000000
33806  player67  0.016820  0.063896  0.577731  0.000000
29816  player67  0.017565  0.060496  0.151780  0.000000
6924   player65  0.017652  0.121581  0.117512  0.000000
39126  player67  0.017990  0.516819  0.663672  0.000000
39896  player67  0.018085  0.031526  0.075832  0.000000
...         ...       ...       ...       ...       ...
61526  player67  0.985386  0.512073  0.754241  1.231145
48926  player67  0.985504  0.007080  0.671456  0.000000
16234  player65  0.985775  0.846647  0.998181  0.000000
12736  player67  0.985846  0.283997  0.667314  0.000000
47874  player65  0.986084  0.052026  0.508918  0.000000
29886  player67  0.986655  0.998440  0.068136  1.918518
49416  player67  0.986706  0.833053  0.182814  1.803892
42486  player67  0.986797  0.608128  0.136219  1.850578
55644  player65  0.987796  0.215898  0.561002  0.000000
1814   player65  0.987935  0.324954  0.525433  0.000000
7554   player65  0.988910  0.664914  0.674546  1.314365
59774  player65  0.989147  0.235214  0.913588  0.000000
58444  player65  0.989467  0.645191  0.533468  1.455999
62856  player67  0.989470  0.523544  0.302838  1.686632
48646  player67  0.990588  0.522521  0.201132  1.789456
11336  player67  0.990629  0.932360  0.756544  1.234085
31774  player65  0.990881  0.981641  0.943824  1.047057
18964  player65  0.992287  0.808989  0.948321  1.043967
14486  player67  0.992909  0.437701  0.484678  0.000000
12246  player67  0.994027  0.542903  0.234830  1.759197
33596  player67  0.994257  0.949055  0.098368  1.895889
6436   player67  0.994661  0.444211  0.572136  0.000000
4194   player65  0.995022  0.721113  0.584195  1.410826
42696  player67  0.995065  0.516103  0.918737  1.076328
51026  player67  0.995864  0.877335  0.516737  1.479127
14136  player67  0.997691  0.134021  0.913969  0.000000
47664  player65  0.998051  0.628051  0.722695  1.275357
55924  player65  0.998079  0.828749  0.151217  1.846863
18474  player65  0.998780  0.200990  0.098713  0.000000
41296  player67  0.998884  0.167139  0.504899  0.000000

[2000 rows x 5 columns]

相关问题 更多 >