Checking similarity between terms within a DataFrame column

Posted 2024-10-04 03:16:38


I need to check whether two or more words in a list are similar. For this I am using the Jaro-Winkler distance, as follows:

from similarity.jarowinkler import JaroWinkler

word1='sweet chili'
word2='sriracha chilli'

jarowinkler = JaroWinkler()
print(jarowinkler.similarity(word1, word2))

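It seems to detect similarity between the words, but I need to set a threshold so that only words that are at least 80% similar are selected. My difficulty is in checking all the words in a dataframe column: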

Words

sweet chili
sriracha chilli
tomato
mayonnaise 
water
milk
still water
sparkling water
wine
chicken 
beef
...

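What I would like to do is:
- start from the first element and check its similarity against every other element; if the similarity is greater than the threshold (80%), save it in a new array;
- do the same for the second element (sriracha chilli);
- and so on.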

Can you tell me how to run a loop like this?


1 answer
User
Reply 1 · Posted 2024-10-04 03:16:38
  • Using the given data
  • Use .apply on the Words column to score every word against every other word
  • If the actual dataframe has many columns, consider creating a new dataframe from only the Words column
    • new_df = pd.DataFrame({'Words': df.Words})
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from similarity.jarowinkler import JaroWinkler
import numpy as np

df = pd.DataFrame({'Words': ['sweet chili', 'sriracha chilli', 'tomato', 'mayonnaise ', 'water', 'milk', 'still water', 'sparkling water', 'wine', 'chicken ', 'beef']})

# create the Jaro-Winkler similarity object
jarowinkler = JaroWinkler()

# remove whitespace
df.Words = df.Words.str.strip()

# create column of matching values for each word
words = df.Words.tolist()

for word in words:
    df[word] = df.Words.apply(lambda x: jarowinkler.similarity(x, word))

|    | Words           |   sweet chili |   sriracha chilli |   tomato |   mayonnaise |    water |     milk |   still water |   sparkling water |     wine |   chicken |     beef |
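|---:|:----------------|--------------:|------------------:|---------:|-------------:|---------:|---------:|--------------:|------------------:|---------:|----------:|---------:|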
|  0 | sweet chili     |      1        |          0.605772 | 0.419192 |     0.39697  | 0.513131 | 0        |      0.515152 |          0.460101 | 0.560606 |  0.322511 | 0.560606 |
|  1 | sriracha chilli |      0.605772 |          1        | 0.411111 |     0.388889 | 0.344444 | 0.438889 |      0.460101 |          0.488889 | 0.438889 |  0.529365 | 0        |
|  2 | tomato          |      0.419192 |          0.411111 | 1        |     0.488889 | 0.411111 | 0.472222 |      0.590909 |          0.411111 | 0        |  0        | 0        |
|  3 | mayonnaise      |      0.39697  |          0.388889 | 0.488889 |     1        | 0.433333 | 0.45     |      0.460606 |          0.544444 | 0.45     |  0.328571 | 0        |
|  4 | water           |      0.513131 |          0.344444 | 0.411111 |     0.433333 | 1        | 0        |      0.430303 |          0.511111 | 0.633333 |  0.447619 | 0.483333 |
|  5 | milk            |      0        |          0.438889 | 0.472222 |     0.45     | 0        | 1        |      0.560606 |          0.538889 | 0.5      |  0.595238 | 0        |
|  6 | still water     |      0.515152 |          0.460101 | 0.590909 |     0.460606 | 0.430303 | 0.560606 |      1        |          0.749854 | 0.44697  |  0.489177 | 0        |
|  7 | sparkling water |      0.460101 |          0.488889 | 0.411111 |     0.544444 | 0.511111 | 0.538889 |      0.749854 |          1        | 0.544444 |  0.431746 | 0        |
|  8 | wine            |      0.560606 |          0.438889 | 0        |     0.45     | 0.633333 | 0.5      |      0.44697  |          0.544444 | 1        |  0.595238 | 0.5      |
|  9 | chicken         |      0.322511 |          0.529365 | 0        |     0.328571 | 0.447619 | 0.595238 |      0.489177 |          0.431746 | 0.595238 |  1        | 0        |
| 10 | beef            |      0.560606 |          0        | 0        |     0        | 0.483333 | 0        |      0        |          0        | 0.5      |  0        | 1        |
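
To make the numbers in the table easier to interpret, here is a minimal pure-Python sketch of a Jaro-Winkler score. It is not the library's code: the function names and the parameters prefix_scale and boost_threshold are assumptions made for illustration, and the value 0.7 for boost_threshold was chosen only because it is consistent with entries such as water / wine (0.633333, no prefix bonus) in the table above. Other entries may differ from what the library returns.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1] (illustrative sketch, not the library's code)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, boost_threshold=0.7):
    """Jaro score plus a bonus for a shared prefix of up to 4 characters.

    prefix_scale and boost_threshold are assumed defaults for this sketch.
    """
    score = jaro(s1, s2)
    if score <= boost_threshold:
        return score  # no prefix bonus for weak matches
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler('water', 'wine'), 6))
print(jaro_winkler('sweet chili', 'milk'))
```

Because characters only match inside a sliding window, two strings with no characters in comparable positions score exactly 0, which is why zeros appear in the table.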

View the values greater than 80%

  • There are none, apart from the exact matches (each word compared with itself)
df.set_index('Words', inplace=True)

np.where(df[words] > 0.8, df[words], np.nan)

array([[ 1., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [nan,  1., nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan,  1., nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan,  1., nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan,  1., nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan,  1., nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan,  1., nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan,  1., nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan,  1., nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan,  1., nan],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,  1.]])
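
If the goal is the "new array" of similar words per element described in the question, one possible sketch is to score each unordered pair once and collect the matches in a dictionary. The names similar_words and stand_in_score are hypothetical. The default scorer uses difflib.SequenceMatcher from the standard library purely so the example runs without extra packages; it is not Jaro-Winkler, so its values differ from the table above. Passing jarowinkler.similarity as score should give results in line with the answer's numbers, assuming that score is symmetric.

```python
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_score(a, b):
    # Stand-in scorer from the standard library; this is NOT Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()


def similar_words(words, score=stand_in_score, threshold=0.8):
    """Map each word to the other words whose score exceeds threshold.

    Assumes score(a, b) == score(b, a), so each unordered pair is
    scored only once. Duplicate words collapse into a single key.
    """
    matches = {word: [] for word in words}
    for a, b in combinations(words, 2):
        if score(a, b) > threshold:
            matches[a].append(b)
            matches[b].append(a)
    return matches


words = ['sweet chili', 'sriracha chilli', 'tomato', 'water', 'still water']
for word, similar in similar_words(words, threshold=0.8).items():
    print(word, '->', similar)
```

With the library installed, the call would look like similar_words(words, score=jarowinkler.similarity, threshold=0.8). Words with no match above the threshold keep an empty list, so every input word still appears in the result.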

Add a heatmap

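# boolean mask that hides the upper triangle (diagonal included),
# since the similarity matrix is symmetric
mask = np.zeros_like(df[words], dtype=bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(df[words], mask=mask, square=True, cmap="YlGnBu")
plt.show()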
