How to improve the performance of merging on a list-typed column in pandas

Posted 2024-09-28 01:31:04


I'm having trouble merging two pandas DataFrames.

I have two DataFrames like the following.

teams:

         date  team_member_1  team_member_2
0  2017-11-21              1              6
1  2017-11-21              2              7
2  2017-11-21              3              8
3  2017-11-21              4              9
4  2017-11-21              5             10
5  2018-01-01              1             10
6  2018-01-01              2              9
7  2018-01-01              3              8
8  2018-01-01              4              7
9  2018-01-01              5              6

designations:

         date designation      ids
0  2017-11-21           a  [1, 10]
1  2017-11-21           b   [2, 9]
2  2017-11-21           c   [3, 8]
3  2017-11-21           d   [4, 7]
4  2017-11-21           e   [5, 6]
5  2018-01-01           f   [1, 2]
6  2018-01-01           g   [3, 4]
7  2018-01-01           h   [5, 6]
8  2018-01-01           i   [7, 8]
9  2018-01-01           j  [9, 10]

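Now I need to add a team_member_1_designation column to the teams table. My approach is to first explode the designations table into the form shown below, then merge it with teams on date and member id: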

          date designation  id
0   2017-11-21           a   1
1   2017-11-21           a  10
2   2017-11-21           b   2
3   2017-11-21           b   9
4   2017-11-21           c   3
5   2017-11-21           c   8
6   2017-11-21           d   4
7   2017-11-21           d   7
8   2017-11-21           e   5
9   2017-11-21           e   6
10  2018-01-01           f   1
11  2018-01-01           f   2
12  2018-01-01           g   3
13  2018-01-01           g   4
14  2018-01-01           h   5
15  2018-01-01           h   6
16  2018-01-01           i   7
17  2018-01-01           i   8
18  2018-01-01           j   9
19  2018-01-01           j  10

The code I wrote to explode the designations table is:

(designations
    .set_index(designations.columns.drop('ids').tolist())
    .ids.apply(pd.Series)
    .stack()
    .reset_index()
    .rename(columns={0: 'id'}))

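However, when the table is huge, this explode operation takes a very long time (assume I have designations and teams for 50,000 teams/team members every day, for 20 years).

Is there a cheaper way to add the team_member_1_designation column to the teams table without exploding the designations table?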


1 answer
User
#1 · Posted 2024-09-28 01:31:04

You can use map with a dictionary keyed by (date, id) tuples:

#create dictionary with keys created by tuples
z = zip(designations['date'], designations['designation'], designations['ids'])
d = {(i, x):j for i, j, k in z for x in k}
#for the sample data the dictionary is equivalent to:
d = {('2017-11-21', 1): 'a', ('2017-11-21', 10): 'a', ('2017-11-21', 2): 'b', 
     ('2017-11-21', 9): 'b', ('2017-11-21', 3): 'c', ('2017-11-21', 8): 'c', 
     ('2017-11-21', 4): 'd', ('2017-11-21', 7): 'd', ('2017-11-21', 5): 'e', 
     ('2017-11-21', 6): 'e', ('2018-01-01', 1): 'f', ('2018-01-01', 2): 'f', 
     ('2018-01-01', 3): 'g', ('2018-01-01', 4): 'g', ('2018-01-01', 5): 'h', 
     ('2018-01-01', 6): 'h', ('2018-01-01', 7): 'i', ('2018-01-01', 8): 'i', 
     ('2018-01-01', 9): 'j', ('2018-01-01', 10): 'j'}

#convert 2 columns to tuples
#keep the index of teams so the assignment below aligns row by row
s = pd.Series(list(map(tuple, teams[['date','team_member_1']].values.tolist())),
              index=teams.index)
print (s)
0    (2017-11-21, 1)
1    (2017-11-21, 2)
2    (2017-11-21, 3)
3    (2017-11-21, 4)
4    (2017-11-21, 5)
5    (2018-01-01, 1)
6    (2018-01-01, 2)
7    (2018-01-01, 3)
8    (2018-01-01, 4)
9    (2018-01-01, 5)
dtype: object

teams['id'] = s.map(d)
print (teams)

         date  team_member_1  team_member_2 id
0  2017-11-21              1              6  a
1  2017-11-21              2              7  b
2  2017-11-21              3              8  c
3  2017-11-21              4              9  d
4  2017-11-21              5             10  e
5  2018-01-01              1             10  f
6  2018-01-01              2              9  f
7  2018-01-01              3              8  g
8  2018-01-01              4              7  g
9  2018-01-01              5              6  h
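The same lookup can be applied to both member columns without building an intermediate Series of tuples. The following is only a minimal sketch on a cut-down sample: the column name team_member_2_designation is an assumption (the question only names team_member_1_designation), and pairs missing from the dictionary come back as None.

```python
import pandas as pd

# Cut-down sample data in the same shape as the question's tables.
teams = pd.DataFrame({
    "date": ["2017-11-21", "2017-11-21", "2018-01-01"],
    "team_member_1": [1, 2, 1],
    "team_member_2": [6, 7, 10],
})
designations = pd.DataFrame({
    "date": ["2017-11-21"] * 4 + ["2018-01-01"] * 2,
    "designation": ["a", "b", "d", "e", "f", "j"],
    "ids": [[1, 10], [2, 9], [4, 7], [5, 6], [1, 2], [9, 10]],
})

# Build the (date, member id) -> designation dictionary once.
lookup = {
    (date, member): name
    for date, name, ids in zip(designations["date"],
                               designations["designation"],
                               designations["ids"])
    for member in ids
}

# Reuse it for every member column; the "_designation" suffix is hypothetical.
for col in ["team_member_1", "team_member_2"]:
    keys = zip(teams["date"], teams[col])
    teams[col + "_designation"] = [lookup.get(key) for key in keys]

print(teams)
```

Building the dictionary is a single pass over the designations rows, so the list column never has to be expanded into a second DataFrame.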

I think .apply(pd.Series) is not recommended if you need a solution with good performance.

A better approach is to use the DataFrame constructor:

cols = designations.columns.difference(['ids']).tolist()
df1 = designations.set_index(cols)['ids']

df2 = pd.DataFrame(df1.values.tolist(), index=df1.index).stack().reset_index(name='id')
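One detail worth checking with the constructor approach: stack() adds an extra index level holding the position inside each list, and reset_index turns it into an additional column. The sketch below, on a cut-down sample, drops that level first so the result has only date, designation and id. The dropna/astype steps are an assumption added for lists of unequal length, where the constructor pads with NaN.

```python
import pandas as pd

designations = pd.DataFrame({
    "date": ["2017-11-21", "2017-11-21", "2018-01-01"],
    "designation": ["a", "b", "f"],
    "ids": [[1, 10], [2, 9], [1, 2]],
})

# All columns except the list column become the index.
cols = designations.columns.difference(["ids"]).tolist()
ids = designations.set_index(cols)["ids"]

# One column per list position, then stack back into a long Series.
wide = pd.DataFrame(ids.values.tolist(), index=ids.index)
long = (wide.stack()
            .dropna()          # only matters when lists have unequal length
            .astype("int64")
            .droplevel(-1)     # drop the list-position level added by stack()
            .reset_index(name="id"))

print(long)
```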

Or a numpy-style solution using repeat and itertools.chain:

from itertools import chain

idx = designations.index.repeat(designations['ids'].str.len())

df2 =(designations.reindex(idx)
         .assign(id=list(chain.from_iterable(designations['ids'].tolist())))
         .drop('ids', axis=1))

teams = teams.merge(df2.rename(columns={'id':'team_member_1'}), 
                    on=['date','team_member_1'], 
                    how='left')
print (teams)
         date  team_member_1  team_member_2 designation
0  2017-11-21              1              6           a
1  2017-11-21              2              7           b
2  2017-11-21              3              8           c
3  2017-11-21              4              9           d
4  2017-11-21              5             10           e
5  2018-01-01              1             10           f
6  2018-01-01              2              9           f
7  2018-01-01              3              8           g
8  2018-01-01              4              7           g
9  2018-01-01              5              6           h
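On newer pandas versions (0.25 and later) there is also a built-in DataFrame.explode, which is not part of the answer above but does the same reshaping in one call. A minimal sketch on a cut-down sample follows. The cast to int64 is there because explode leaves the column with object dtype, and the target column name is taken from the question.

```python
import pandas as pd

teams = pd.DataFrame({
    "date": ["2017-11-21", "2017-11-21", "2018-01-01"],
    "team_member_1": [1, 2, 1],
    "team_member_2": [6, 7, 10],
})
designations = pd.DataFrame({
    "date": ["2017-11-21", "2017-11-21", "2018-01-01"],
    "designation": ["a", "b", "f"],
    "ids": [[1, 10], [2, 9], [1, 2]],
})

# One row per (date, designation, member id).
long = designations.explode("ids").rename(
    columns={"ids": "team_member_1",
             "designation": "team_member_1_designation"})
# explode() keeps object dtype; cast so the key dtype matches teams.
long["team_member_1"] = long["team_member_1"].astype("int64")

teams = teams.merge(long, on=["date", "team_member_1"], how="left")
print(teams)
```

Whether this is faster than the dictionary lookup on the full data is something to measure rather than assume.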
