Creating a distance matrix for all pairs of strings with Pandas

list(combinations(mylist,2))
[('foo', 'bar'), ('foo', 'baz'), ('foo', 'foo'), ('foo', 'foo'), ('bar', 'baz'), ('bar', 'foo'), ('bar', 'foo'), ('baz', 'foo'), ('baz', 'foo'), ('foo', 'foo')]

       foo  bar  baz  foo  foo
1 foo    0    3    3    0    0
2 bar    3    0    1    3    3
3 baz    3    1    0    3    3
4 foo    0    3    3    0    0
5 foo    0    3    3    0    0

2 Answers

User

#1 · Edited 2024-05-03 06:35:57

To compute the Levenshtein distance I used the Levenshtein module (requires pip install python-Levenshtein), together with fuzzywuzzy:

import Levenshtein as lv

Then, since we use Numpy functions, mylist must be converted to a Numpy array:

lst = np.array(mylist)

To compute the whole result, run:

result = pd.DataFrame(np.vectorize(lv.distance)(lst[:, np.newaxis], lst[np.newaxis, :]),
    index=lst, columns=lst)

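Details:

np.vectorize(lv.distance) is a vectorized version of the lv.distance function.
(lst[:, np.newaxis], lst[np.newaxis, :]) is a numpythonic idiom: a column view and a row view of the lst array, which broadcast against each other to supply the arguments for consecutive calls of the above function, one call per (row, column) pair.
Thanks to broadcasting, the whole matrix is produced by a single expression, which is especially convenient on large arrays. Note that np.vectorize is mainly a convenience wrapper and still calls lv.distance once per pair, so most of the running time is spent in lv.distance itself.
pd.DataFrame(...) converts the above result (a Numpy array) to a DataFrame with the proper index and column names.
If you wish, use your original function instead of lv.distance.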
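
As a rough illustration of that last point, here is a minimal pure-Python sketch of a distance function that could stand in for lv.distance. The name levenshtein and the contents of mylist are assumptions (mylist is inferred from the tables in this thread), and this is the textbook dynamic-programming edit distance, not the implementation used by the Levenshtein module.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance: insertions, deletions, substitutions cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

# Assumed input, inferred from the tables shown in this thread
mylist = ['foo', 'bar', 'baz', 'foo', 'foo']

# Full n x n matrix as nested lists, one row per string
matrix = [[levenshtein(x, y) for y in mylist] for x in mylist]
```

Passing np.vectorize(levenshtein) in place of np.vectorize(lv.distance) in the expression above should give the same DataFrame, only more slowly, since every pair is evaluated in Python.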

结果是：

     foo  bar  baz  foo  foo
foo    0    3    3    0    0
bar    3    0    1    3    3
baz    3    1    0    3    3
foo    0    3    3    0    0
foo    0    3    3    0    0

User

#2 · Edited 2024-05-03 06:35:57

Let's try to modify the function slightly so that distances for duplicate entries are not computed more than once:

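from itertools import product

import pandas as pd

def ld(a):
    # Work on the unique strings only, so repeated entries cost nothing extra
    u = set(a)
    # Levenshtein.classic is kept as posted; it is not defined in this answer
    return {b:Levenshtein.classic(*b) for b in product(u,u)}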

dist = ld(mylist)

(pd.Series(list(dist.values()), pd.MultiIndex.from_tuples(dist.keys()))
   .unstack()
   .reindex(mylist)
   .reindex(mylist,axis=1)
)

Output:

     foo  bar  baz  foo  foo
foo    0    3    3    0    0
bar    3    0    1    3    3
baz    3    1    0    3    3
foo    0    3    3    0    0
foo    0    3    3    0    0
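
The same idea can be pushed a little further when the distance is symmetric and zero for identical strings, as Levenshtein distance is: each unordered pair of unique strings needs only one call. The sketch below is an illustration under those assumptions. pairwise_matrix and toy_distance are hypothetical names, and toy_distance is a simple stand-in (mismatched positions plus length difference) used only to keep the example self-contained; swap in a real edit-distance function in practice.

```python
from itertools import combinations

def pairwise_matrix(items, distance):
    """Build the full matrix while calling `distance` once per unordered pair."""
    unique = list(dict.fromkeys(items))   # unique values, first-seen order
    cache = {(u, u): 0 for u in unique}   # assumes distance(x, x) == 0
    for a, b in combinations(unique, 2):  # each unordered pair exactly once
        cache[a, b] = cache[b, a] = distance(a, b)  # assumes symmetry
    # Expand back to the original order, duplicates included
    return [[cache[x, y] for y in items] for x in items]

calls = 0

def toy_distance(a, b):
    """Stand-in distance: mismatched positions plus length difference."""
    global calls
    calls += 1
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

# Assumed input, inferred from the tables shown in this thread
mylist = ['foo', 'bar', 'baz', 'foo', 'foo']
matrix = pairwise_matrix(mylist, toy_distance)
```

For this five-element list with three unique strings, the distance function is called 3 times rather than 25. The nested list can then be wrapped as pd.DataFrame(matrix, index=mylist, columns=mylist) to get the labelled table.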
