How do I join rows that share the same column value in Python?

Posted 2024-06-28 19:27:32

I have a Twitter dataset (more than 6,000,000 rows) and want to extract conversation threads from it. Suppose it looks like this:

import pandas as pd

data = pd.DataFrame({'Tweet_ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
                     'Reply_to_tweet_ID': [None, '1', '1', '3', None, '1', '4', '2', '4'],
                     'Max Speed': [380., 370., 24., 26., 584., 48., 8., 123., None]})
    Tweet_ID    Reply_to_tweet_ID   Max Speed
0   1           None                380
1   2           1                   370
2   3           1                   24
3   4           3                   26
4   5           None                584
5   6           1                   48
6   7           4                   8
7   8           2                   123
8   9           4                   None

Basically, I need to match Reply_to_tweet_ID against Tweet_ID and concatenate the matching rows into a single row. The result should look like this:

Tweet_ID1   Reply_to_tweet_ID1  Max Speed1  Tweet_ID2   Reply_to_tweet_ID2  Max Speed2  Tweet_ID3   Reply_to_tweet_ID3  Max Speed3  Tweet_ID4   Reply_to_tweet_ID4  Max Speed4
1           None                380         2           1                   370         3           1                   24          6           1                   48
2           1                   370         8           2                   123
3           1                   24          4           3                   26
4           3                   26          7           4                   8           9           4                   None

A similar question has been asked before, but its answer does not really solve this.

My code is:

df = data.set_index(['Reply_to_tweet_ID', data.groupby('Reply_to_tweet_ID')\
.cumcount().add(1)])[['Tweet_ID','Max Speed']]\
.unstack().reset_index()

df.columns = ["{}{}".format(a, b) for a, b in df.columns]

df = df[df.Reply_to_tweet_ID != 'None']

But the result looks like this:

    Reply_to_tweet_ID   Tweet_ID1   Tweet_ID2   Tweet_ID3   Max Speed1  Max Speed2  Max Speed3
0                   1           2           3           6          370          24          48
1                   2           8         NaN         NaN          123         NaN         NaN
2                   3           4         NaN         NaN           26         NaN         NaN
3                   4           7           9         NaN            8        None         NaN


1 answer
User
#1 · Posted 2024-06-28 19:27:32

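IIUC, you can do this with a "self-join" using merge, then reshape the dataframe, flatten the MultiIndex column headers, and merge the result back onto the original dataframe: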

# Create merged dataframe, data_m, joining each Tweet_ID to the rows whose
# Reply_to_tweet_ID points at it (an inner self-join)
data_m = data[['Tweet_ID']].merge(data[['Reply_to_tweet_ID', 'Max Speed', 'Tweet_ID']],
                                  left_on='Tweet_ID',
                                  right_on='Reply_to_tweet_ID',
                                  suffixes=('', '_y'))

# Use `set_index` with `groupby` and `cumcount`, then `unstack`, to
# reshape long to wide for dataframe, data_u
data_u = data_m.set_index(['Tweet_ID', data_m.groupby('Tweet_ID').cumcount() + 1]).unstack()
data_u = data_u.sort_index(axis=1, level=1)

# Flatten the MultiIndex column header using a list comprehension
data_u.columns = [f'{i}{j}' for i, j in data_u.columns]

# Merge dataframe data_u back onto the original dataframe, data
print(data.merge(data_u, on='Tweet_ID'))

Output:

  Tweet_ID Reply_to_tweet_ID  Max Speed  Max Speed1 Reply_to_tweet_ID1 Tweet_ID_y1  Max Speed2 Reply_to_tweet_ID2 Tweet_ID_y2  Max Speed3 Reply_to_tweet_ID3 Tweet_ID_y3
0        1              None      380.0       370.0                  1           2        24.0                  1           3        48.0                  1           6
1        2                 1      370.0       123.0                  2           8         NaN                NaN         NaN         NaN                NaN         NaN
2        3                 1       24.0        26.0                  3           4         NaN                NaN         NaN         NaN                NaN         NaN
3        4                 3       26.0         8.0                  4           7         NaN                  4           9         NaN                NaN         NaN
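
If the parent tweet should sit in the first slot and the columns should be numbered Tweet_ID1, Reply_to_tweet_ID1, Max Speed1, ... as in the question, one possible variation is to stack each parent with its direct replies in long form and unstack once. This is only a sketch: `threads_wide`, `thread` and `slot` are names made up for the illustration, and it assumes `Tweet_ID` is unique and that only direct replies (one level deep) are wanted.

```python
import pandas as pd

data = pd.DataFrame({
    'Tweet_ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
    'Reply_to_tweet_ID': [None, '1', '1', '3', None, '1', '4', '2', '4'],
    'Max Speed': [380., 370., 24., 26., 584., 48., 8., 123., None],
})


def threads_wide(df):
    """Hypothetical helper: one row per replied-to tweet, parent in slot 1,
    its direct replies in slots 2..n."""
    cols = ['Tweet_ID', 'Reply_to_tweet_ID', 'Max Speed']

    # Direct replies, keyed by the tweet they answer; numbered from slot 2.
    replies = df.dropna(subset=['Reply_to_tweet_ID']).copy()
    replies['thread'] = replies['Reply_to_tweet_ID']
    replies['slot'] = replies.groupby('thread').cumcount() + 2

    # Tweets that received at least one reply occupy slot 1 of their thread.
    parents = df[df['Tweet_ID'].isin(replies['thread'])].copy()
    parents['thread'] = parents['Tweet_ID']
    parents['slot'] = 1

    # Long form -> wide form: one row per thread, one column group per slot.
    long = pd.concat([parents, replies], ignore_index=True)
    wide = long.set_index(['thread', 'slot'])[cols].unstack('slot')

    # Order columns slot by slot, keeping the original column order in a slot.
    slots = sorted(wide.columns.get_level_values('slot').unique())
    order = pd.MultiIndex.from_tuples([(c, s) for s in slots for c in cols])
    wide = wide.reindex(columns=order)

    # Flatten ('Tweet_ID', 1) into 'Tweet_ID1', and so on.
    wide.columns = [f'{c}{s}' for c, s in wide.columns]
    return wide.reset_index(drop=True)


result = threads_wide(data)
print(result)
```

On the sample data this gives one row for each tweet that has at least one reply (tweets 1 to 4). Tweets without replies, such as 5, are left out, which matches the expected result in the question. The width of the result grows with the largest number of replies to any single tweet, so it may be worth checking that number before running this on the full dataset.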
