How do I merge multiple DataFrames and add dummy-variable columns with Pandas?

Posted 2024-06-30 14:03:57

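I have a question about merging multiple DataFrames and adding a column of dummy variables.

I currently have two raw input DataFrames. The first answers the question "Which color do you like most?" and the second answers the question "On a scale from 1 to 7, how much do you dislike this color?"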

import pandas as pd

df1 = pd.DataFrame({'id': ['01','02'],
                    'like_wave_1': ['red','red'],
                    'like_wave_2': ['red','yellow']})
print(df1)

df2 = pd.DataFrame({'id': ['01','02'],
                    'dislike_wave1_yellow': ['7','2'],
                    'dislike_wave1_red':['1','1'],
                    'dislike_wave1_blue':['2','7'],
                    'dislike_wave2_yellow': ['7','1'],
                    'dislike_wave2_red':['1','2'],
                    'dislike_wave2_blue':['3','7']})
print(df2)

The following code builds the skeleton of the expected output DataFrame:

from itertools import product

list_id = ['01','02']
list_color = ['yellow','red','blue']
list_wave = ['1','2']
expand = list(product(list_id, list_color, list_wave))
df = pd.DataFrame.from_records(expand, columns=['id', 'color', 'wave'])
print(df)
    id   color wave
0   01  yellow    1
1   01  yellow    2
2   01     red    1
3   01     red    2
4   01    blue    1
5   01    blue    2
6   02  yellow    1
7   02  yellow    2
8   02     red    1
9   02     red    2
10  02    blue    1
11  02    blue    2

I want to add two columns to df:

(1) "like": a column indicating whether the given id picked the given color in the given wave (1 = yes, 0 = no)

(2) "dislike": the dislike rating (1 to 7) that the given id gave the given color in the given wave

So the DataFrame I expect is:

    id   color wave  like  dislike
0   01  yellow    1     0        7
1   01  yellow    2     0        7
2   01     red    1     1        1
3   01     red    2     1        1
4   01    blue    1     0        2
5   01    blue    2     0        3
6   02  yellow    1     0        2
7   02  yellow    2     1        1
8   02     red    1     1        1
9   02     red    2     0        2
10  02    blue    1     0        7
11  02    blue    2     0        7

Could you help me solve this? Thank you very much for your answers.


2 Answers

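Before merging, we can reshape each DataFrame with pivot_longer from pyjanitor:

import janitor  # pyjanitor; importing it registers pivot_longer on DataFrames

left = (df1.pivot_longer('id',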
                         names_to=('.value', 'num'), 
                         names_pattern=r".+_(.+)_(\d$)")
           .rename(columns={"wave":"color",
                            "num":"wave"})
           .assign(like = 1)
         )

left
 
   id wave   color  like
0  01    1     red     1
1  02    1     red     1
2  01    2     red     1
3  02    2  yellow     1


right = (df2.pivot_longer('id',
                          names_to=(".value", "dislike", "color"), 
                          names_pattern = r".+_(.+)(\d)_(.+)", 
                          sort_by_appearance=True)
           .rename(columns = {"dislike":"wave", "wave":"dislike"})
          )

right
 
    id wave   color dislike
0   01    1  yellow       7
1   01    1     red       1
2   01    1    blue       2
3   01    2  yellow       7
4   01    2     red       1
5   01    2    blue       3
6   02    1  yellow       2
7   02    1     red       1
8   02    1    blue       7
9   02    2  yellow       1
10  02    2     red       2
11  02    2    blue       7

right.merge(left, how = 'outer').fillna(0)

    id wave   color dislike  like
0   01    1  yellow       7   0.0
1   01    1     red       1   1.0
2   01    1    blue       2   0.0
3   01    2  yellow       7   0.0
4   01    2     red       1   1.0
5   01    2    blue       3   0.0
6   02    1  yellow       2   0.0
7   02    1     red       1   1.0
8   02    1    blue       7   0.0
9   02    2  yellow       1   1.0
10  02    2     red       2   0.0
11  02    2    blue       7   0.0
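For readers who would rather not install pyjanitor, the same reshape-then-merge idea can be sketched with plain pandas. This is only an illustrative alternative, not the answerer's method: the names `likes`, `dislikes`, `skeleton` and `result` are made up here, and it assumes the column names always follow the `like_wave_N` and `dislike_waveN_color` patterns from the question. Left-merging onto the skeleton keeps the row and column order of the expected output, and the final casts give integer `like` and `dislike` columns.

```python
from itertools import product

import pandas as pd

df1 = pd.DataFrame({'id': ['01', '02'],
                    'like_wave_1': ['red', 'red'],
                    'like_wave_2': ['red', 'yellow']})
df2 = pd.DataFrame({'id': ['01', '02'],
                    'dislike_wave1_yellow': ['7', '2'],
                    'dislike_wave1_red': ['1', '1'],
                    'dislike_wave1_blue': ['2', '7'],
                    'dislike_wave2_yellow': ['7', '1'],
                    'dislike_wave2_red': ['1', '2'],
                    'dislike_wave2_blue': ['3', '7']})

# Likes: one row per (id, wave); the cell value is the liked color.
likes = df1.melt(id_vars='id', var_name='wave', value_name='color')
likes['wave'] = likes['wave'].str.extract(r'(\d+)$', expand=False)
likes['like'] = 1

# Dislikes: one row per (id, wave, color); the cell value is the rating.
dislikes = df2.melt(id_vars='id', var_name='key', value_name='dislike')
parts = dislikes['key'].str.extract(r'wave(?P<wave>\d+)_(?P<color>.+)$')
dislikes = dislikes.drop(columns='key').join(parts)

# Skeleton of every (id, color, wave) combination, as in the question.
skeleton = pd.DataFrame(
    list(product(['01', '02'], ['yellow', 'red', 'blue'], ['1', '2'])),
    columns=['id', 'color', 'wave'])

keys = ['id', 'wave', 'color']
result = (skeleton
          .merge(likes, on=keys, how='left')
          .merge(dislikes, on=keys, how='left'))
# Combinations missing from `likes` were not picked, so they become 0.
result['like'] = result['like'].fillna(0).astype(int)
# Ratings were stored as strings in df2; convert them to integers.
result['dislike'] = result['dislike'].astype(int)
print(result)
```
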

Try converting both frames into a format that is compatible with each other:

DF1

# Get df1 into usable format
df1 = df1.set_index('id')
# Create Multi Index by splitting columns on '_'
df1.columns = df1.columns.str.split('_', expand=True)
# Stack to create long format frame
df1 = df1.stack().reset_index()
# Fix column names to match df2/output
df1.columns = ['id', 'wave', 'color']
# Set like to 1 for these since this table indicates likes
df1['like'] = 1

df1

   id wave   color  like
0  01    1     red     1
1  01    2     red     1
2  02    1     red     1
3  02    2  yellow     1

DF2

# Get df2 into usable format
# Set index to ID
df2 = df2.set_index('id')
# Create Multi Index by splitting columns on '_'
df2.columns = df2.columns.str.split('_', expand=True)
# Stack to create long format frame
df2 = df2.stack(level=[1, 2]).reset_index()
# Fix column names to match df1
df2.columns = ['id', 'wave', 'color', 'dislike']
# Turn "wave1" into 1, "wave2" into 2, ... etc.
df2['wave'] = df2['wave'].str.lstrip('wave')

df2

    id wave   color dislike
0   01    1    blue       2
1   01    1     red       1
2   01    1  yellow       7
3   01    2    blue       3
4   01    2     red       1
5   01    2  yellow       7
6   02    1    blue       7
7   02    1     red       1
8   02    1  yellow       2
9   02    2    blue       7
10  02    2     red       2
11  02    2  yellow       1

Then merge the frames together:

# Merge On Common Columns
df3 = df1.merge(df2, on=['id', 'wave', 'color'], how='outer')

# Fill empty values in like and dislike with 0 (only 1s in source DF1)
# (Fill dislikes in case there are likes in df1 that are not dislikes in df2)
df3[['like', 'dislike']] = df3[['like', 'dislike']].fillna(0).astype(int)

# Sort Values and fix index (to match output in question)
df3 = df3.sort_values(
    ['id', 'color'], ascending=[True, False]
).reset_index(drop=True)

df3

    id wave   color  like dislike
0   01    1  yellow     0       7
1   01    2  yellow     0       7
2   01    1     red     1       1
3   01    2     red     1       1
4   01    1    blue     0       2
5   01    2    blue     0       3
6   02    1  yellow     0       2
7   02    2  yellow     1       1
8   02    1     red     1       1
9   02    2     red     0       2
10  02    1    blue     0       7
11  02    2    blue     0       7
