Mapping dataframe columns based on priority

Posted on 2024-10-03 15:25:58


I have a main dataframe (main_df) like:

     A     B       X    Y     Id1
0  cat  cat1  catabc  0.1  uuid01
1  cat  cat1  catxyz  0.4  uuid02
2  cat  cat2  catpqr  0.5  uuid01
3  dog  dog1  dogxyz  0.3  uuid03
4  dog  dog2  dogpqr  0.2  uuid02
5  dog  dog2  dogabc  0.8  uuid01

Another dataframe (map_df) will help with the mapping:

     A     B       X    Y   Id2
0  cat  cat1  catxyz  0.4  nx01
1  cat  cat1     NaN  NaN  nx02
2  cat   NaN     NaN  NaN  nx03
3  dog  dog1  dogxyz  NaN  nx04
4  dog   NaN     NaN  NaN  nx05

The priority order for the column mapping is: ['A', 'B', 'X', 'Y']

So, if every cell of a map_df row matches a row of main_df, the corresponding element of Id2 should be added to that row of main_df. If Y is missing, the mapping should fall back to ['A', 'B', 'X']; if X is also missing, it should continue with ['A', 'B'], and so on.

I was able to get a partial result with a dataframe merge. For example:

main_df = main_df.merge(map_df, how='left', on=['A', 'B', 'X', 'Y'])

But I couldn't work out the priority-based part.
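To make the intent concrete, the fallback I have in mind looks roughly like the sketch below: merge on progressively shorter key prefixes and fill Id2 only where it is still empty. This is only a sketch of the desired behaviour (it assumes each map_df key is unique at its level):

import numpy as np

result = main_df.copy()
result["Id2"] = np.nan
keys = ["A", "B", "X", "Y"]

# try the full key first, then progressively shorter prefixes
for i in range(len(keys), 0, -1):
    subset = keys[:i]
    # map_df rows whose key is exactly this long: the prefix columns are
    # filled and every lower-priority column is NaN
    level = map_df[map_df[subset].notna().all(axis=1)
                   & map_df[keys[i:]].isna().all(axis=1)]
    merged = result.merge(level[subset + ["Id2"]], how="left",
                          on=subset, suffixes=("", "_new"))
    # fill only the rows that are still unmapped
    result["Id2"] = result["Id2"].fillna(merged["Id2_new"])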

As a result of the mapping, the dataframe (result_df) should look like this:

     A     B       X    Y     Id1   Id2
0  cat  cat1  catabc  0.1  uuid01  nx02
1  cat  cat1  catxyz  0.4  uuid02  nx01
2  cat  cat2  catpqr  0.5  uuid01  nx03
3  dog  dog1  dogxyz  0.3  uuid03  nx04
4  dog  dog2  dogpqr  0.2  uuid02  nx05
5  dog  dog2  dogabc  0.8  uuid01  nx05

2 Answers

Use the concept of how SQL engines handle complex queries: generate a partial Cartesian product, then filter the result down to the required set.

In this case, calculate a weight for each match.

import numpy as np

result_df = (main_df
 # minimum match - just A, effectively a partial Cartesian product
 .merge(map_df, on=["A"], suffixes=("", "_m"))
 # calculate the weight of the match: NaN is neutral, a mismatch penalises, a match rewards
 .assign(
     B_w=lambda dfa: np.where(dfa["B_m"].isna(), 0, np.where(dfa["B"] == dfa["B_m"], 1, -1) * 1000),
     X_w=lambda dfa: np.where(dfa["X_m"].isna(), 0, np.where(dfa["X"] == dfa["X_m"], 1, -1) * 100),
     Y_w=lambda dfa: np.where(dfa["Y_m"].isna(), 0, np.where(dfa["Y"] == dfa["Y_m"], 1, -1) * 10),
     w=lambda dfa: dfa.loc[:, ["B_w", "X_w", "Y_w"]].sum(axis=1)
 )
 # the biggest weight is the one we want
 .sort_values(["A", "B", "X", "Y", "w"], ascending=[True, True, True, True, False])
 .groupby(["A", "B", "X", "Y"]).first()
 # clean up and don't show the workings...
 .reset_index()
 .loc[:, ["A", "B", "X", "Y", "Id1", "Id2"]]
)
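The powers of ten act like digit positions: a decision at a higher-priority column can never be outvoted by the lower-priority columns combined. A quick sanity check of that design choice:

# worst case for a B match: X and Y both mismatch, yet the total stays positive
assert 1000 - 100 - 10 > 0
# one level down: an X match survives a Y mismatch
assert 100 - 10 > 0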

Alternatively, in this case it's the longest available key that wins:

import io
import pandas as pd

main_df = pd.read_csv(io.StringIO("""     A     B       X    Y     Id1
0  cat  cat1  catabc  0.1  uuid01
1  cat  cat1  catxyz  0.4  uuid02
2  cat  cat2  catpqr  0.5  uuid01
3  dog  dog1  dogxyz  0.3  uuid03
4  dog  dog2  dogpqr  0.2  uuid02
5  dog  dog2  dogabc  0.8  uuid01"""), sep=r"\s+")
map_df = pd.read_csv(io.StringIO("""     A     B       X    Y   Id2
0  cat  cat1  catxyz  0.4  nx01
1  cat  cat1     NaN  NaN  nx02
2  cat   NaN     NaN  NaN  nx03
3  dog  dog1  dogxyz  NaN  nx04
4  dog   NaN     NaN  NaN  nx05"""), sep=r"\s+")

result_df = (main_df
 # minimum match - just A, effectively a partial Cartesian product
 .merge(map_df, on=["A"], suffixes=("", "_m"))
 # simpler to use an empty string instead of NaN
 .fillna("")
 # synthetic column holding the rest of the key, plus its length
 .assign(c=lambda dfa: dfa["B_m"] + dfa["X_m"] + dfa["Y_m"].astype(str),
         l=lambda dfa: dfa["c"].str.len())
 # the longest synthetic key is the one we want
 .sort_values(["A", "B", "X", "Y", "l"], ascending=[True, True, True, True, False])
 .groupby(["A", "B", "X", "Y"]).first()
 # clean up and don't show the workings...
 .reset_index()
 .loc[:, ["A", "B", "X", "Y", "Id1", "Id2"]]
)

Output (using the weighting approach; the last two rows come back in a different order than in the question's result_df because of the sort, but the mapping itself is identical):

     A     B       X    Y     Id1   Id2
0  cat  cat1  catabc  0.1  uuid01  nx02
1  cat  cat1  catxyz  0.4  uuid02  nx01
2  cat  cat2  catpqr  0.5  uuid01  nx03
3  dog  dog1  dogxyz  0.3  uuid03  nx04
4  dog  dog2  dogabc  0.8  uuid01  nx05
5  dog  dog2  dogpqr  0.2  uuid02  nx05

This will get the job done, though it's not very pretty.

It's an iterative algorithm that makes a forward pass over the data followed by a backward pass.

The forward pass resolves the merges at progressively lower priority, eliminating the matched rows from the data structure. It uses an inner-merge strategy.

The backward pass updates the output from the lowest priority up to the highest. I noticed the final update calls are a bit slow.

merge_on = ['A', 'B', 'X', 'Y']
tot_cols = len(merge_on)

operation_main_df = main_df.copy()
outputs = []

# forward pass on progressively smaller key sets
for first_idx in range(len(merge_on)):
    # reset_index()/set_index() keep main_df's row labels on the merge result
    # so the update() calls below align rows correctly (merge does not accept
    # on= together with right_index=True)
    merged_id2 = (operation_main_df
        .reset_index()
        .merge(map_df, how='inner', on=merge_on[0:tot_cols - first_idx],
               suffixes=("", "_y"))
        .set_index('index'))

    # the relevant output has the right number of NaN columns; the rest is garbage
    outputs.append(merged_id2[merged_id2.isna().sum(axis=1) == first_idx])

# backward updating pass
reverse_it = iter(outputs[::-1])

out = next(reverse_it)[['A', 'B', 'X', 'Y', 'Id1', 'Id2']].copy()
for el in reverse_it:
    out.update(el)
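For reference, DataFrame.update aligns on the index and copies only the non-NaN values from its argument, which is what lets each more specific match overwrite the less specific one here. A tiny standalone illustration (the frames and values are made up):

import numpy as np
import pandas as pd

low = pd.DataFrame({"Id2": ["nx_low", "nx_low"]}, index=[0, 1])
high = pd.DataFrame({"Id2": ["nx_high", np.nan]}, index=[0, 1])
low.update(high)            # row 0 is overwritten; row 1 keeps its value
print(low["Id2"].tolist())  # ['nx_high', 'nx_low']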


Output:


     A     B       X    Y     Id1   Id2
0  cat  cat1  catabc  0.1  uuid01  nx02
1  cat  cat1  catxyz  0.4  uuid02  nx01
2  cat  cat2  catpqr  0.5  uuid01  nx03
3  dog  dog1  dogxyz  0.3  uuid03  nx04
4  dog  dog2  dogpqr  0.2  uuid02  nx05
5  dog  dog2  dogabc  0.8  uuid01  nx05

Edit: a 10x speedup on my machine. (Most likely because the plain column selection carries pandas' internal "might be a copy" flag, so every update() write ran the costly SettingWithCopy check; .copy() clears that flag.)

# change
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']]

# into
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']].copy()
