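Suppose we have two DataFrames, each containing a column of similar string values. What is the most efficient and/or effective way to match rows whose column values are similar, based on a comparison function such as textdistance's implementation of Jaro-Winkler?
Example DataFrames: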
import pandas as pd
import textdistance

first_df = pd.DataFrame(['Cars and cats', 'Spaceship', 'Captain Marvel', 'Dune', 'Bucks in 6'], columns=['Title'])
second_df = pd.DataFrame(['Captain Harlock', 'Cats and dogs', 'Buccuneers', 'Dune buggy', 'Milwaukee Bucks'], columns=['Title'])
What I have in mind:
Implementation:
# Both frames share the column name 'Title', so the suffixes tell the two sides apart after the cross join
comparison_df = first_df.merge(second_df, how='cross', suffixes=('_first', '_second'))
# Score every pair of titles
comparison_df['similarity_score'] = comparison_df.apply(lambda row: textdistance.jaro_winkler.normalized_similarity(row['Title_first'], row['Title_second']), axis=1)
display(comparison_df)
# Keep only the highest-scoring match for each title from first_df
comparison_df = comparison_df.sort_values('similarity_score', ascending=False).drop_duplicates(subset=['Title_first'], keep='first')
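One possible alternative, as a rough sketch only: pick the best match for each title with plain Python instead of building the full cross join and calling `DataFrame.apply` row by row. The helper names `stand_in_similarity` and `best_matches` are hypothetical, and the default scorer is the standard library's `difflib.SequenceMatcher`, which is a stand-in and not Jaro-Winkler; it is used here only so the sketch runs without third-party packages. This still performs len(left) × len(right) comparisons, so it only removes the DataFrame overhead and the intermediate cross-joined frame.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a, b):
    # Stand-in scorer from the standard library (NOT Jaro-Winkler).
    # Returns a similarity ratio between 0 and 1.
    return SequenceMatcher(None, a, b).ratio()


def best_matches(left_titles, right_titles, similarity=stand_in_similarity):
    """For each left title, return a (left, best_right, score) tuple."""
    results = []
    for left in left_titles:
        # Scan the right-hand titles once and keep only the top-scoring one.
        best = max(right_titles, key=lambda right: similarity(left, right))
        results.append((left, best, similarity(left, best)))
    return results


first_titles = ['Cars and cats', 'Spaceship', 'Captain Marvel', 'Dune', 'Bucks in 6']
second_titles = ['Captain Harlock', 'Cats and dogs', 'Buccuneers', 'Dune buggy', 'Milwaukee Bucks']

matches = best_matches(first_titles, second_titles)
for left, right, score in matches:
    print(f'{left!r} -> {right!r} ({score:.3f})')
```

To score with Jaro-Winkler instead, the same function could presumably be called as `best_matches(first_df['Title'], second_df['Title'], similarity=textdistance.jaro_winkler.normalized_similarity)`, and the resulting tuples turned back into a DataFrame with `pd.DataFrame(matches, columns=[...])`.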
Any suggestions are welcome. Thanks in advance.