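I found some Python code for the Damerau-Levenshtein edit distance through Google, but the comments on it often say the algorithm is incorrect, and I am confused.
Can anyone share correct Python code for the Damerau-Levenshtein distance?
Which of the following is right?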
https://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/
"""Compute the Damerau-Levenshtein distance between two given strings (s1 and s2)"""
def damerau_levenshtein_distance(s1, s2):
    # Python 2 code, quoted as published (xrange does not exist in Python 3)
    d = {}
    lenstr1 = len(s1)
    lenstr2 = len(s2)
    for i in xrange(-1,lenstr1+1):
        d[(i,-1)] = i+1
    for j in xrange(-1,lenstr2+1):
        d[(-1,j)] = j+1
    for i in xrange(lenstr1):
        for j in xrange(lenstr2):
            if s1[i] == s2[j]:
                cost = 0
            else:
                cost = 1
            d[(i,j)] = min(
                d[(i-1,j)] + 1, # deletion
                d[(i,j-1)] + 1, # insertion
                d[(i-1,j-1)] + cost, # substitution
            )
            if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:
                d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition
    return d[lenstr1-1,lenstr2-1]

I tried "zx" to "xyz": the algorithm returns 3, but the correct answer is 2, so this code is invalid.
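The recurrence quoted above only ever swaps two adjacent characters and never edits them again, which appears to be the restricted variant usually called optimal string alignment (OSA). A minimal Python 3 sketch of that recurrence, assuming a list-based matrix and the illustrative name `osa_distance` (neither is from the original post), reproduces the result reported for "zx" and "xyz":

```python
def osa_distance(s1, s2):
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] is the distance between s1[:i] and s2[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Only adjacent characters may be swapped, and the swapped
            # pair cannot be edited further: this is the restriction.
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("zx", "xyz"))  # 3
print(osa_distance("ab", "ba"))   # 1
```

Under this reading the value 3 is the expected OSA result rather than a coding slip; getting 2 requires the unrestricted algorithm, which allows an insertion between the two transposed characters.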
https://gist.github.com/pombredanne/0d83ad58f45986ddeb0917266e106be0
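Based on the pseudocode from Wikipedia: https://en.wikipedia.org/wiki/Damerau-Levenshtein_distance
Possible improvement, by treating 1 addition + 1 deletion as 1 substitution for the characters between the transposed characters:
The Damerau-Levenshtein distance for "abcdef" and "abcfad" is 3:
Or alternatively:
Clearly, (2) and (3) in the second analysis are really just one substitution:
Using this variant, the distance between "abcdef" and "abcfad" is actually 2.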
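To make the 3-versus-2 claim concrete, here is a small arithmetic sketch of the transposition term for this one pair of strings. The variable names follow the Wikipedia pseudocode (`i`, `k`, `j`, with `ell` standing in for ℓ), and `prefix_cost` is an illustrative name for the cost of the shared prefix "abc"; this is a worked example, not a full implementation.

```python
a, b = "abcdef", "abcfad"

# 1-based positions of the transposed pair "d"/"f" in each string
i, k = a.index("f") + 1, a.index("d") + 1    # i = 6, k = 4
j, ell = b.index("d") + 1, b.index("f") + 1  # j = 6, ell = 4

prefix_cost = 0  # "abc" -> "abc" needs no edits

# Wikipedia's term: deletions in between + the transposition + insertions in between
standard = prefix_cost + (i - k - 1) + 1 + (j - ell - 1)

# The variant: one deletion paired with one insertion counts as one substitution
variant = prefix_cost + max(i - k - 1, j - ell - 1) + 1

print(standard, variant)  # 3 2
```

Note that the variant no longer computes the standard Damerau-Levenshtein distance, so its results will differ from reference implementations on inputs like this one.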
def damerau_levenshtein_distance_improved(a, b):
    # Python 2 code, quoted as published (relies on xrange and a list-returning range)
    # "Infinity" -- greater than maximum possible edit distance
    # Used to prevent transpositions for first characters
    INF = len(a) + len(b)
    # Matrix: (M + 2) x (N + 2)
    matrix = [[INF for n in xrange(len(b) + 2)]]
    matrix += [[INF] + range(len(b) + 1)]
    matrix += [[INF, m] + [0] * len(b) for m in xrange(1, len(a) + 1)]
    # Holds last row each element was encountered: DA in the Wikipedia pseudocode
    last_row = {}
    # Fill in costs
    for row in xrange(1, len(a) + 1):
        # Current character in a
        ch_a = a[row-1]
        # Column of last match on this row: DB in pseudocode
        last_match_col = 0
        for col in xrange(1, len(b) + 1):
            # Current character in b
            ch_b = b[col-1]
            # Last row with matching character
            last_matching_row = last_row.get(ch_b, 0)
            # Cost of substitution
            cost = 0 if ch_a == ch_b else 1
            # Compute substring distance
            matrix[row+1][col+1] = min(
                matrix[row][col] + cost, # Substitution
                matrix[row+1][col] + 1, # Addition
                matrix[row][col+1] + 1, # Deletion
                # Transposition
                # Start by reverting to cost before transposition
                matrix[last_matching_row][last_match_col]
                    # Cost of letters between transposed letters
                    # 1 addition + 1 deletion = 1 substitution
                    + max((row - last_matching_row - 1),
                          (col - last_match_col - 1))
                    # Cost of the transposition itself
                    + 1)
            # If there was a match, update last_match_col
            if cost == 0:
                last_match_col = col
        # Update last row for current character
        last_row[ch_a] = row
    # Return last element
    return matrix[-1][-1]

This code does not work.
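The post does not say how the code fails. One plausible cause, assuming it was run under Python 3, is the matrix setup: `xrange` no longer exists, and `range` returns a lazy object that cannot be concatenated to a list. The sketch below shows the error and a Python 3 form of the same initialization; `init_matrix` is a hypothetical helper name used only for this illustration.

```python
def init_matrix(len_a, len_b):
    """Python 3 form of the matrix setup used in the quoted function."""
    inf = len_a + len_b
    matrix = [[inf] * (len_b + 2)]
    # range() is lazy in Python 3, so convert it before concatenating
    matrix += [[inf] + list(range(len_b + 1))]
    matrix += [[inf, m] + [0] * len_b for m in range(1, len_a + 1)]
    return matrix

try:
    [0] + range(3)  # the original concatenation pattern under Python 3
except TypeError as exc:
    print("TypeError:", exc)

print(init_matrix(2, 3))
```

If this is indeed the failure, replacing every `xrange` with `range` and wrapping the concatenated `range` in `list()` should be enough to run the function, though it would still return the variant distance rather than the standard one.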
The pseudocode from Wikipedia below does not work either: a string cannot be used as an index in k := da[b[j]] and da[a[i]] := i.
algorithm DL-distance is
    input: strings a[1..length(a)], b[1..length(b)]
    output: distance, integer

    da := new array of |Σ| integers
    for i := 1 to |Σ| inclusive do
        da[i] := 0

    let d[−1..length(a), −1..length(b)] be a 2-d array of integers, dimensions length(a)+2, length(b)+2
    // note that d has indices starting at −1, while a, b and da are one-indexed.

    maxdist := length(a) + length(b)
    d[−1, −1] := maxdist
    for i := 0 to length(a) inclusive do
        d[i, −1] := maxdist
        d[i, 0] := i
    for j := 0 to length(b) inclusive do
        d[−1, j] := maxdist
        d[0, j] := j

    for i := 1 to length(a) inclusive do
        db := 0
        for j := 1 to length(b) inclusive do
            k := da[b[j]]
            ℓ := db
            if a[i] = b[j] then
                cost := 0
                db := j
            else
                cost := 1
            d[i, j] := minimum(d[i−1, j−1] + cost,  //substitution
                               d[i, j−1] + 1,     //insertion
                               d[i−1, j ] + 1,     //deletion
                               d[k−1, ℓ−1] + (i−k−1) + 1 + (j−ℓ−1)) //transposition
        da[a[i]] := i
    return d[length(a), length(b)]

Thanks.
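In the pseudocode, `da` is an array indexed by alphabet symbol, which has no direct Python equivalent when the symbols are characters. A minimal Python 3 sketch, assuming a dict with a default of 0 in place of that array and a matrix shifted by one so that the pseudocode's index −1 becomes 0, could look like this. The function name `damerau_levenshtein` and the name `ell` (for ℓ) are illustrative choices, not part of the original post.

```python
def damerau_levenshtein(a, b):
    """Unrestricted Damerau-Levenshtein distance between sequences a and b."""
    len_a, len_b = len(a), len(b)
    maxdist = len_a + len_b
    # Last row of `a` in which each symbol was seen; a missing key means 0.
    da = {}
    # d[i + 1][j + 1] here holds the pseudocode's d[i, j].
    d = [[0] * (len_b + 2) for _ in range(len_a + 2)]
    d[0][0] = maxdist
    for i in range(len_a + 1):
        d[i + 1][0] = maxdist
        d[i + 1][1] = i
    for j in range(len_b + 1):
        d[0][j + 1] = maxdist
        d[1][j + 1] = j
    for i in range(1, len_a + 1):
        db = 0  # last column in this row where a[i] matched b[j]
        for j in range(1, len_b + 1):
            k = da.get(b[j - 1], 0)
            ell = db
            if a[i - 1] == b[j - 1]:
                cost = 0
                db = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,   # substitution
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                # transposition: the pseudocode's d[k-1, l-1] is d[k][ell] here
                d[k][ell] + (i - k - 1) + 1 + (j - ell - 1),
            )
        da[a[i - 1]] = i
    return d[len_a + 1][len_b + 1]

print(damerau_levenshtein("zx", "xyz"))         # 2
print(damerau_levenshtein("abcdef", "abcfad"))  # 3
```

The dict keeps the memory for `da` proportional to the number of distinct symbols actually seen, and the `maxdist` border rows make the transposition term too expensive to be chosen when no earlier match exists.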
Posted on 2019-10-14 07:27:59
I have been using the following code, and so far it has served me well:
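# Calculates a normalized Levenshtein similarity of 2 strings
# (1.0 for identical strings, 0.0 for completely different ones).
# Note: this is plain Levenshtein, so a transposition is not a single edit.
def levenshtein(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    # Row zz starts at zz; cells past the first row and column are overwritten below
    matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
    for zz in range(l2):
        for sz in range(l1):
            if s1[sz] == s2[zz]:
                matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
            else:
                matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
    distance = float(matrix[l2][l1])
    if max(l1, l2) == 0:
        return 1.0  # two empty strings are identical; avoids division by zero
    result = 1.0 - distance / max(l1, l2)
    return result

If you do not need normalization, it should be easy to remove the last part of the code.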
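For reference, the un-normalized form the answer alludes to might look like the sketch below. It assumes a two-row layout instead of the full matrix, and the name `levenshtein_distance` is illustrative. It also shows why plain Levenshtein does not answer the original question: swapping two adjacent characters costs 2 edits here, whereas Damerau-Levenshtein counts it as 1.

```python
def levenshtein_distance(s1, s2):
    """Plain (un-normalized) Levenshtein distance, keeping only two rows."""
    previous = list(range(len(s1) + 1))
    for row, ch2 in enumerate(s2, start=1):
        current = [row]
        for col, ch1 in enumerate(s1, start=1):
            cost = 0 if ch1 == ch2 else 1
            current.append(min(current[col - 1] + 1,       # left neighbour
                               previous[col] + 1,          # upper neighbour
                               previous[col - 1] + cost))  # diagonal (substitution or match)
        previous = current
    return previous[-1]

print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_distance("ab", "ba"))           # 2
```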
https://datascience.stackexchange.com/questions/60019