首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >Damerau-Levenshtein用Python编辑距离

Damerau-Levenshtein用Python编辑距离
EN

Data Science用户
提问于 2019-09-11 07:45:09
回答 1查看 5.2K关注 0票数 2

我通过Google在Damerau Levensthein编辑距离上找到了一些python代码,但是当我查看他们的评论时,很多人说算法是不正确的。我很困惑。

有人能在Damerau Levensthein距离上共享正确的python代码吗?

哪一个是对的?

https://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/

第一个:

代码语言:javascript
复制
"""Compute the Damerau-Levenshtein distance between two given strings (s1 and s2)"""


    def damerau_levenshtein_distance(s1, s2):
    
    
        d = {}
        lenstr1 = len(s1)
        lenstr2 = len(s2)
        for i in xrange(-1,lenstr1+1):
            d[(i,-1)] = i+1
        for j in xrange(-1,lenstr2+1):
            d[(-1,j)] = j+1
    
        for i in xrange(lenstr1):
            for j in xrange(lenstr2):
                if s1[i] == s2[j]:
                    cost = 0
                else:
                    cost = 1
                d[(i,j)] = min(
                               d[(i-1,j)] + 1, # deletion
                               d[(i,j-1)] + 1, # insertion
                               d[(i-1,j-1)] + cost, # substitution
                              )
                if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:
                    d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition
    
        return d[lenstr1-1,lenstr2-1]

我尝试了"zx“到"xyz",算法回答3,但正确的答案是2,所以这是无效的。

第二个:

https://gist.github.com/pombredanne/0d83ad58f45986ddeb0917266e106be0

Damerau-Levenshtein编辑distane实现

基于维基百科的伪代码:https://en.wikipedia.org/wiki/Damerau-Levenshtein_距离

通过处理1加法+1缺失=转置字符之间的1替换,可能的改进:

Damerau-Levenshtein距离为"abcdef“和"abcfad”= 3:

  1. 用"d“代替"f”
  2. 用"e“代替"a”
  3. 用"f“代替"d”

或者另一种选择:

  1. 转置"d“和"f”
  2. 删除"a“
  3. 插入"e“

很明显,第二次分析中的(2)和(3)实际上只是一个替代:

  1. 转置"d“和"f”
  2. 用"e“代替"a”

使用这个变体,"abcdef“和"abcfad”之间的距离实际上是2。

代码语言:javascript
复制
def damerau_levenshtein_distance_improved(a, b):

    # "Infinity" -- greater than maximum possible edit distance
    # Used to prevent transpositions for first characters

    INF = len(a) + len(b)

    # Matrix: (M + 2) x (N + 2)
    matrix  = [[INF for n in xrange(len(b) + 2)]]
    matrix += [[INF] + range(len(b) + 1)]
    matrix += [[INF, m] + [0] * len(b) for m in xrange(1, len(a) + 1)]

    # Holds last row each element was encountered: DA in the Wikipedia pseudocode
    last_row = {}

    # Fill in costs
    for row in xrange(1, len(a) + 1):
        # Current character in a
        ch_a = a[row-1]

        # Column of last match on this row: DB in pseudocode
        last_match_col = 0

        for col in xrange(1, len(b) + 1):
            # Current character in b
            ch_b = b[col-1]

            # Last row with matching character
            last_matching_row = last_row.get(ch_b, 0)

            # Cost of substitution
            cost = 0 if ch_a == ch_b else 1

            # Compute substring distance
            matrix[row+1][col+1] = min(
                matrix[row][col] + cost, # Substitution
                matrix[row+1][col] + 1,  # Addition
                matrix[row][col+1] + 1,  # Deletion

                # Transposition
                # Start by reverting to cost before transposition
                matrix[last_matching_row][last_match_col]
                    # Cost of letters between transposed letters
                    # 1 addition + 1 deletion = 1 substitution
                    + max((row - last_matching_row - 1),
                          (col - last_match_col - 1))
                    # Cost of the transposition itself
                    + 1)

            # If there was a match, update last_match_col
            if cost == 0:
                last_match_col = col

        # Update last row for current character
        last_row[ch_a] = row

    # Return last element
    return matrix[-1][-1]

这段代码不起作用。

下面维基百科中的psedo代码也不起作用。字符串不能作为k := da[bJ]和da[a我] := i的索引

代码语言:javascript
复制
algorithm DL-distance is
    input: strings a[1..length(a)], b[1..length(b)]
    output: distance, integer
    
    da := new array of |Σ| integers
    for i := 1 to |Σ| inclusive do
        da[i] := 0
    
    let d[−1..length(a), −1..length(b)] be a 2-d array of integers, dimensions length(a)+2, length(b)+2
    // note that d has indices starting at −1, while a, b and da are one-indexed.
    
    maxdist := length(a) + length(b)
    d[−1, −1] := maxdist
    for i := 0 to length(a) inclusive do
        d[i, −1] := maxdist
        d[i, 0] := i
    for j := 0 to length(b) inclusive do
        d[−1, j] := maxdist
        d[0, j] := j
    
    for i := 1 to length(a) inclusive do
        db := 0
        for j := 1 to length(b) inclusive do
            k := da[b[j]]
            ℓ := db
            if a[i] = b[j] then
                cost := 0
                db := j
            else
                cost := 1
            d[i, j] := minimum(d[i−1, j−1] + cost,  //substitution
                               d[i,   j−1] + 1,     //insertion
                               d[i−1, j  ] + 1,     //deletion
                               d[k−1, ℓ−1] + (i−k−1) + 1 + (j-ℓ−1)) //transposition
        da[a[i]] := i
    return d[length(a), length(b)]

谢谢。

EN

回答 1

Data Science用户

发布于 2019-10-14 07:27:59

我一直在使用以下代码,到目前为止,它对我的服务很好:

代码语言:javascript
复制
#Calculates the normalized Levenshtein distance of 2 strings
def levenshtein(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    matrix = [list(range(l1 + 1))] * (l2 + 1)
    for zz in list(range(l2 + 1)):
      matrix[zz] = list(range(zz,zz + l1 + 1))
    for zz in list(range(0,l2)):
      for sz in list(range(0,l1)):
        if s1[sz] == s2[zz]:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
        else:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
    distance = float(matrix[l2][l1])
    result = 1.0-distance/max(l1,l2)
    return result

如果不需要标准化,那么应该很容易删除代码的最后部分。

票数 1
EN
页面原文内容由Data Science提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://datascience.stackexchange.com/questions/60019

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档