Joining two DataFrames based on list columns

2 answers

User
#1 · edited 2024-10-03 19:24:55

If you are using pandas 1.2.0 or later (released December 26, 2020), the Cartesian product (cross join) can be simplified as follows:
df = df1.merge(df2, how='cross')  # simplified cross join for pandas >= 1.2.0
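To make the effect of the cross join concrete, here is a minimal sketch with two small frames (the data is made up for illustration). Every row of df1 is paired with every row of df2, so the result should have len(df1) * len(df2) rows.

```python
import pandas as pd

# Illustrative data, not taken from the question
df1 = pd.DataFrame({'ColumnA1': ['id1', 'id2'],
                    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA2': ['id3', 'id4'],
                    'ColumnB2': [['a', 'b', 'c', 'x'], ['d', 'e', 'f']]})

# how='cross' requires pandas >= 1.2.0 and takes no join keys
crossed = df1.merge(df2, how='cross')

print(crossed.shape)                      # 2 rows x 2 rows -> 4 rows
print(crossed[['ColumnA1', 'ColumnA2']])  # every id pairing appears once
```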
Also, if performance (execution time) matters to you, prefer list(map(...)) over the slower apply(..., axis=1).
Using apply(..., axis=1):
%%timeit
df['overlap'] = df.apply(lambda x: len(set(x['ColumnB1']).intersection(set(x['ColumnB2']))), axis=1)

800 µs ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using list(map(...)):
%%timeit
df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))

217 µs ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that list(map(...)) is more than 3 times faster here (about 217 µs versus 800 µs per loop).
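The %%timeit cells above are IPython magics. Outside a notebook, a rough sketch of the same comparison with the standard-library timeit module might look like the following. The sample data and the function names overlap_apply and overlap_map are made up for illustration, and absolute timings will vary with machine and pandas version.

```python
import timeit

import pandas as pd

# Illustrative data, not taken from the question
df = pd.DataFrame({
    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']] * 2,
    'ColumnB2': [['a', 'b', 'c', 'x'], ['d', 'e', 'f']] * 2,
})

def overlap_apply(frame):
    # Row-wise apply: builds a Series object for every row, which is slow
    return frame.apply(
        lambda row: len(set(row['ColumnB1']).intersection(set(row['ColumnB2']))),
        axis=1,
    ).tolist()

def overlap_map(frame):
    # Walks the two columns in parallel without per-row Series overhead
    return list(map(lambda x, y: len(set(x).intersection(set(y))),
                    frame['ColumnB1'], frame['ColumnB2']))

# Both approaches should compute the same overlap counts
result_apply = overlap_apply(df)
result_map = overlap_map(df)

t_apply = timeit.timeit(lambda: overlap_apply(df), number=200)
t_map = timeit.timeit(lambda: overlap_map(df), number=200)
print(f"apply: {t_apply:.4f} s, list(map): {t_map:.4f} s for 200 runs")
```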
The complete code, for reference:
import pandas as pd

data = {'ColumnA1': ['id1', 'id2'],
        'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]}
df1 = pd.DataFrame(data)

data = {'ColumnA2': ['id3', 'id4'],
        'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z'], ['d', 'e', 'f', 'p', 'q', 'r']]}
df2 = pd.DataFrame(data)

# Cartesian product of the two frames (pandas >= 1.2.0)
df = df1.merge(df2, how='cross')

# Number of elements shared by the two lists in each row
df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))

# Keep only the pairs that share at least 2 elements
df = df[df['overlap'] >= 2]
print(df)

User
#2 · edited 2024-10-03 19:24:55

Take the Cartesian product of the rows and check each resulting row.
The code is commented inline.
import pandas as pd

df1 = pd.DataFrame(
    {
        'ColumnA': ['id1', 'id2'],
        'ColumnB': [['a', 'b', 'c'], ['a', 'd', 'e']],
    }
)
df2 = pd.DataFrame(
    {
        'ColumnA': ['id3'],
        'ColumnB': [['a', 'b', 'c', 'x', 'y', 'z']],
    }
)

# Take the Cartesian product of both dataframes via a constant dummy key
df1['k'] = 0
df2['k'] = 0
# drop(columns=...) is used because the positional axis argument of drop() was removed in pandas 2.0
df = pd.merge(df1, df2, on='k').drop(columns='k')

# Check the overlap of the lists and compute the overlap length
df['overlap'] = df.apply(lambda x: len(set(x['ColumnB_x']).intersection(set(x['ColumnB_y']))), axis=1)

# Keep the rows whose overlap length is greater than 2
df = df[df['overlap'] > 2]
print(df)
Output:
   ColumnA_x  ColumnB_x  ColumnA_y           ColumnB_y  overlap
0        id1  [a, b, c]        id3  [a, b, c, x, y, z]        3
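The two answers can be combined into one small helper. The sketch below is only an illustration: the function name pairs_with_overlap and its parameters are hypothetical, it assumes pandas >= 1.2.0, and it assumes both frames use the same name for the list column, so that the merge adds the _x and _y suffixes seen in the output above.

```python
import pandas as pd

def pairs_with_overlap(left, right, list_col, min_overlap):
    """Hypothetical helper: cross join two frames and keep the row pairs
    whose list columns share at least `min_overlap` elements."""
    # Shared column names receive the _x / _y suffixes
    crossed = left.merge(right, how='cross', suffixes=('_x', '_y'))
    crossed['overlap'] = list(map(
        lambda x, y: len(set(x).intersection(set(y))),
        crossed[list_col + '_x'], crossed[list_col + '_y'],
    ))
    return crossed[crossed['overlap'] >= min_overlap].reset_index(drop=True)

df1 = pd.DataFrame({'ColumnA': ['id1', 'id2'],
                    'ColumnB': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA': ['id3'],
                    'ColumnB': [['a', 'b', 'c', 'x', 'y', 'z']]})

# "overlap > 2" from the answer above is the same as "overlap >= 3"
result = pairs_with_overlap(df1, df2, 'ColumnB', min_overlap=3)
print(result)
```

Passing the threshold as an argument makes it easier to switch between the ">= 2" filter of the first answer and the "> 2" filter of the second.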
