文章/答案/技术大牛

发布

社区首页 >问答首页 >Damerau-Levenshtein用Python编辑距离

问Damerau-Levenshtein用Python编辑距离
EN

Data Science用户

提问于 2019-09-11 07:45:09

回答 1查看 5.2K关注 0票数 2

我通过Google在Damerau Levensthein编辑距离上找到了一些python代码，但是当我查看他们的评论时，很多人说算法是不正确的。我很困惑。

有人能在Damerau Levensthein距离上共享正确的python代码吗？

哪一个是对的？

https://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/

第一个：

"""Compute the Damerau-Levenshtein distance between two given strings (s1 and s2)"""


    def damerau_levenshtein_distance(s1, s2):
    
    
        d = {}
        lenstr1 = len(s1)
        lenstr2 = len(s2)
        for i in xrange(-1,lenstr1+1):
            d[(i,-1)] = i+1
        for j in xrange(-1,lenstr2+1):
            d[(-1,j)] = j+1
    
        for i in xrange(lenstr1):
            for j in xrange(lenstr2):
                if s1[i] == s2[j]:
                    cost = 0
                else:
                    cost = 1
                d[(i,j)] = min(
                               d[(i-1,j)] + 1, # deletion
                               d[(i,j-1)] + 1, # insertion
                               d[(i-1,j-1)] + cost, # substitution
                              )
                if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:
                    d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition
    
        return d[lenstr1-1,lenstr2-1]

我尝试了"zx“到"xyz"，算法回答3，但正确的答案是2，所以这是无效的。

第二个：

https://gist.github.com/pombredanne/0d83ad58f45986ddeb0917266e106be0

Damerau-Levenshtein编辑distane实现

基于维基百科的伪代码：https://en.wikipedia.org/wiki/Damerau-Levenshtein_距离

通过处理1加法+1缺失=转置字符之间的1替换，可能的改进：

Damerau-Levenshtein距离为"abcdef“和"abcfad”= 3：

用"d“代替"f”
用"e“代替"a”
用"f“代替"d”

或者另一种选择：

转置"d“和"f”
删除"a“
插入"e“

很明显，第二次分析中的(2)和(3)实际上只是一个替代：

转置"d“和"f”
用"e“代替"a”

使用这个变体，"abcdef“和"abcfad”之间的距离实际上是2。

def damerau_levenshtein_distance_improved(a, b):

    # "Infinity" -- greater than maximum possible edit distance
    # Used to prevent transpositions for first characters

    INF = len(a) + len(b)

    # Matrix: (M + 2) x (N + 2)
    matrix  = [[INF for n in xrange(len(b) + 2)]]
    matrix += [[INF] + range(len(b) + 1)]
    matrix += [[INF, m] + [0] * len(b) for m in xrange(1, len(a) + 1)]

    # Holds last row each element was encountered: DA in the Wikipedia pseudocode
    last_row = {}

    # Fill in costs
    for row in xrange(1, len(a) + 1):
        # Current character in a
        ch_a = a[row-1]

        # Column of last match on this row: DB in pseudocode
        last_match_col = 0

        for col in xrange(1, len(b) + 1):
            # Current character in b
            ch_b = b[col-1]

            # Last row with matching character
            last_matching_row = last_row.get(ch_b, 0)

            # Cost of substitution
            cost = 0 if ch_a == ch_b else 1

            # Compute substring distance
            matrix[row+1][col+1] = min(
                matrix[row][col] + cost, # Substitution
                matrix[row+1][col] + 1,  # Addition
                matrix[row][col+1] + 1,  # Deletion

                # Transposition
                # Start by reverting to cost before transposition
                matrix[last_matching_row][last_match_col]
                    # Cost of letters between transposed letters
                    # 1 addition + 1 deletion = 1 substitution
                    + max((row - last_matching_row - 1),
                          (col - last_match_col - 1))
                    # Cost of the transposition itself
                    + 1)

            # If there was a match, update last_match_col
            if cost == 0:
                last_match_col = col

        # Update last row for current character
        last_row[ch_a] = row

    # Return last element
    return matrix[-1][-1]

这段代码不起作用。

下面维基百科中的psedo代码也不起作用。字符串不能作为k := da[bJ]和da[a我] := i的索引

algorithm DL-distance is
    input: strings a[1..length(a)], b[1..length(b)]
    output: distance, integer
    
    da := new array of |Σ| integers
    for i := 1 to |Σ| inclusive do
        da[i] := 0
    
    let d[−1..length(a), −1..length(b)] be a 2-d array of integers, dimensions length(a)+2, length(b)+2
    // note that d has indices starting at −1, while a, b and da are one-indexed.
    
    maxdist := length(a) + length(b)
    d[−1, −1] := maxdist
    for i := 0 to length(a) inclusive do
        d[i, −1] := maxdist
        d[i, 0] := i
    for j := 0 to length(b) inclusive do
        d[−1, j] := maxdist
        d[0, j] := j
    
    for i := 1 to length(a) inclusive do
        db := 0
        for j := 1 to length(b) inclusive do
            k := da[b[j]]
            ℓ := db
            if a[i] = b[j] then
                cost := 0
                db := j
            else
                cost := 1
            d[i, j] := minimum(d[i−1, j−1] + cost,  //substitution
                               d[i,   j−1] + 1,     //insertion
                               d[i−1, j  ] + 1,     //deletion
                               d[k−1, ℓ−1] + (i−k−1) + 1 + (j-ℓ−1)) //transposition
        da[a[i]] := i
    return d[length(a), length(b)]

谢谢。

python

implementation

回答 1

Data Science用户

发布于 2019-10-14 07:27:59

我一直在使用以下代码，到目前为止，它对我的服务很好：

#Calculates the normalized Levenshtein distance of 2 strings
def levenshtein(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    matrix = [list(range(l1 + 1))] * (l2 + 1)
    for zz in list(range(l2 + 1)):
      matrix[zz] = list(range(zz,zz + l1 + 1))
    for zz in list(range(0,l2)):
      for sz in list(range(0,l1)):
        if s1[sz] == s2[zz]:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
        else:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
    distance = float(matrix[l2][l1])
    result = 1.0-distance/max(l1,l2)
    return result

如果不需要标准化，那么应该很容易删除代码的最后部分。

票数 1

页面原文内容由Data Science提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://datascience.stackexchange.com/questions/60019

复制

相似问题

问Damerau-Levenshtein用Python编辑距离
EN

第一个：

第二个：

Damerau-Levenshtein编辑distane实现

回答 1

Data Science用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Damerau-Levenshtein用Python编辑距离EN

第一个：

第二个：

Damerau-Levenshtein编辑distane实现

回答 1

Data Science用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Damerau-Levenshtein用Python编辑距离
EN