
Creating a distance matrix for all string pairs with pandas

Python

精慕HU 2023-07-11 16:44:52

I have a list that I want to convert into a distance matrix:

from pylev3 import Levenshtein
from itertools import combinations

mylist = ['foo', 'bar', 'baz', 'foo', 'foo']

The following generates all the pairs from the list that are needed to compute the matrix:

list(combinations(mylist, 2))

[('foo', 'bar'), ('foo', 'baz'), ('foo', 'foo'), ('foo', 'foo'), ('bar', 'baz'), ('bar', 'foo'), ('bar', 'foo'), ('baz', 'foo'), ('baz', 'foo'), ('foo', 'foo')]

The distance for each pair can then be computed with:

def ld(a):
    return [Levenshtein.classic(*b) for b in combinations(a, 2)]

ld(mylist)

[3, 3, 0, 0, 1, 3, 3, 3, 3, 0]

However, I am stuck on creating a matrix-like DataFrame in pandas. Is there an elegant way to do this in pandas?

       foo  bar  baz  foo  foo
1 foo    0    3    3    0    0
2 bar    3    0    1    3    3
3 baz    3    1    0    3    3
4 foo    0    3    3    0    0
5 foo    0    3    3    0    0
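One way to bridge the gap between the flat list returned by ld(mylist) and the square matrix asked for is to write the pair distances into the upper triangle of a NumPy array and mirror them. The sketch below is only an illustration under assumptions: the levenshtein helper is a hypothetical pure-Python stand-in for Levenshtein.classic so that the snippet runs without pylev3, and it relies on np.triu_indices yielding index pairs in the same order as combinations.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def levenshtein(s, t):
    """Pure-Python edit distance (hypothetical stand-in for Levenshtein.classic)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


mylist = ['foo', 'bar', 'baz', 'foo', 'foo']
n = len(mylist)

# One distance per unordered pair, in the order combinations() yields them.
condensed = [levenshtein(a, b) for a, b in combinations(mylist, 2)]

# triu_indices(n, k=1) lists the (i, j) positions with i < j in row-major
# order, which is the same order as combinations(range(n), 2).
matrix = np.zeros((n, n), dtype=int)
rows, cols = np.triu_indices(n, k=1)
matrix[rows, cols] = condensed
matrix = matrix + matrix.T  # mirror onto the lower triangle; diagonal stays 0

df = pd.DataFrame(matrix, index=mylist, columns=mylist)
print(df)
```

The diagonal is left at zero because the distance from a string to itself is always zero, so only the n*(n-1)/2 pairs from combinations need to be computed.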


2 Answers

温温酱

Contributed 1752 experience points, received over 4 upvotes

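Let's modify the function slightly so that the distance is not recomputed for duplicate entries:

import pandas as pd
from itertools import product
from pylev3 import Levenshtein

def ld(a):
    # Only compute distances between the unique strings
    u = set(a)
    return {b: Levenshtein.classic(*b) for b in product(u, u)}

dist = ld(mylist)

# Pivot the (row, column) -> distance mapping into a square frame,
# then repeat rows and columns to match the original list
(pd.Series(list(dist.values()), pd.MultiIndex.from_tuples(list(dist.keys())))
   .unstack()
   .reindex(mylist)
   .reindex(mylist, axis=1)
)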

Output:

     foo  bar  baz  foo  foo
foo    0    3    3    0    0
bar    3    0    1    3    3
baz    3    1    0    3    3
foo    0    3    3    0    0
foo    0    3    3    0    0
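The deduplication idea can be taken one step further: product(u, u) still evaluates both (a, b) and (b, a), although the Levenshtein distance is symmetric. The following is a rough sketch rather than part of the answer above. The names distance_frame and levenshtein are hypothetical, and the pure-Python levenshtein helper merely stands in for Levenshtein.classic so that the snippet runs on its own.

```python
from itertools import combinations_with_replacement

import pandas as pd


def levenshtein(s, t):
    """Pure-Python edit distance (hypothetical stand-in for Levenshtein.classic)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def distance_frame(strings, metric=levenshtein):
    """Square distance frame with one metric call per unordered unique pair."""
    unique = sorted(set(strings))
    dist = {}
    for a, b in combinations_with_replacement(unique, 2):
        d = metric(a, b)
        dist[(a, b)] = d
        dist[(b, a)] = d  # reuse the value for the mirrored pair
    square = pd.Series(
        list(dist.values()),
        index=pd.MultiIndex.from_tuples(list(dist.keys())),
    ).unstack()
    # The unique labels are expanded back to the original, possibly repeated, order.
    return square.reindex(index=strings, columns=strings)


mylist = ['foo', 'bar', 'baz', 'foo', 'foo']
df = distance_frame(mylist)
print(df)
```

For the three unique strings here this needs six metric calls instead of the nine made by product(u, u); the saving grows with the number of unique values.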

Answered 2023-07-11

慕码人8056858

Contributed 1803 experience points, received over 6 upvotes

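To compute the Levenshtein distance I used the Levenshtein module (requires pip install python-Levenshtein), which is commonly paired with fuzzywuzzy.

import numpy as np
import pandas as pd
import Levenshtein as lv

Then, since we are going to use NumPy functions, mylist has to be converted to a NumPy array:

lst = np.array(mylist)

To compute the whole result, run:

result = pd.DataFrame(np.vectorize(lv.distance)(lst[:, np.newaxis], lst[np.newaxis, :]),
    index=lst, columns=lst)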

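Details:

np.vectorize(lv.distance) is a vectorized version of the lv.distance function.
(lst[:, np.newaxis], lst[np.newaxis, :]) is a "numpythonic" idiom: broadcasting a column view against a row view of lst produces the "each with every" argument pairs for successive calls of the function above.
np.vectorize is mainly a convenience: it is essentially a Python-level loop over the elements, so it keeps the code compact but does not by itself make the computation faster on large arrays.
pd.DataFrame(...) converts the result (a NumPy array) into a DataFrame with the proper index and column names.
If you prefer, use your original function instead of lv.distance.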

The result is:

     foo  bar  baz  foo  foo
foo    0    3    3    0    0
bar    3    0    1    3    3
baz    3    1    0    3    3
foo    0    3    3    0    0
foo    0    3    3    0    0
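To make the broadcasting step in the details more concrete, the sketch below prints the shapes involved. It is an illustration under assumptions, not the answer's exact code: the pure-Python levenshtein helper is a hypothetical stand-in for lv.distance so that the snippet runs without python-Levenshtein, and otypes=[int] is added only to fix the output dtype.

```python
import numpy as np
import pandas as pd


def levenshtein(s, t):
    """Pure-Python edit distance (hypothetical stand-in for lv.distance)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


lst = np.array(['foo', 'bar', 'baz', 'foo', 'foo'])

col = lst[:, np.newaxis]  # column view, shape (5, 1)
row = lst[np.newaxis, :]  # row view, shape (1, 5)
print(col.shape, row.shape)

# Broadcasting (5, 1) against (1, 5) calls the function once per cell
# of the resulting (5, 5) grid: col[i, 0] is paired with row[0, j].
pairwise = np.vectorize(levenshtein, otypes=[int])
matrix = pairwise(col, row)
print(matrix.shape)

result = pd.DataFrame(matrix, index=lst, columns=lst)
print(result)
```

Because every cell triggers one Python call, this evaluates all n*n pairs, including duplicates and mirrored pairs; it is compact rather than economical.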

Answered 2023-07-11
