Q: Speeding up a Cython Damerau-Levenshtein implementation

Stack Overflow user

Asked 2011-04-07 20:37:03

6 answers · 1.5K views · 0 followers · 2 votes

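I have the following Cython implementation, based on the Wikipedia article, that computes the Damerau-Levenshtein distance between two strings, but it is currently too slow for my needs. I have a list of about 600000 strings and I have to find misspellings in that list.

I would be glad if anyone could suggest algorithmic improvements or some Python/Cython magic that reduces the script's running time. I don't really care how much space it uses, only how long the computation takes.

Profiling the script on about 2000 strings shows that it spends 80% of the total running time (24 of 30 seconds) in the damerauLevenshteinDistance function, and I am out of ideas on how to make it faster.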

def damerauLevenshteinDistance(a, b, h):
    """
    a = source sequence
    b = comparing sequence
    h = matrix to store the metrics (currently nested list)
    """
    cdef int inf,lena,lenb,i,j,x,i1,j1,d,db
    alphabet = getAlphabet((a,b))
    lena = len(a)
    lenb = len(b)
    inf = lena + lenb + 1
    da = [0 for x in xrange(0, len(alphabet))]
    for i in xrange(1, lena+1):
        db = 0
        for j in xrange(1, lenb+1):
            i1 = da[alphabet[b[j-1]]]
            j1 = db
            d = 1
            if (a[i-1] == b[j-1]):
                d = 0
                db = j
            h[i+1][j+1] = min(
                h[i][j]+d,
                h[i+1][j]+1,
                h[i][j+1]+1,
                h[i1][j1]+(i-i1-1)+1+(j-j1-1)
            )
        da[alphabet[a[i-1]]] = i
    return h[lena+1][lenb+1]

cdef getAlphabet(words):
    """
    construct an alphabet out of the lists found in the tuple words with a
    sequential identifier for each word
    """
    cdef int i
    alphabet = {}
    i = 0
    for wordList in words:
        for letter in wordList:
            if letter not in alphabet:
                alphabet[letter] = i
                i += 1
    return alphabet
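
For readers who want to run the function above: it expects h to be filled in before the call, and the question does not show that step. Below is a minimal pure-Python sketch of the same recurrence. It assumes the border values from the Wikipedia pseudocode (an "infinity" of lena + lenb + 1 in the outermost row and column, then 0..n in the next ones). The helper name make_matrix and the use of a dict for da, in place of getAlphabet plus a list, are illustrative choices rather than the asker's code.

```python
def make_matrix(lena, lenb):
    """Build the (lena+2) x (lenb+2) matrix with the assumed border values."""
    inf = lena + lenb + 1
    h = [[0] * (lenb + 2) for _ in range(lena + 2)]
    h[0][0] = inf
    for i in range(lena + 1):
        h[i + 1][0] = inf   # sentinel column
        h[i + 1][1] = i     # cost of deleting the first i characters of a
    for j in range(lenb + 1):
        h[0][j + 1] = inf   # sentinel row
        h[1][j + 1] = j     # cost of inserting the first j characters of b
    return h


def damerau_levenshtein(a, b):
    """Unrestricted Damerau-Levenshtein distance, same recurrence as above."""
    lena, lenb = len(a), len(b)
    h = make_matrix(lena, lenb)
    da = {}  # last row (1-based) in which each character of a was seen
    for i in range(1, lena + 1):
        db = 0  # last column in this row where a[i-1] matched b
        for j in range(1, lenb + 1):
            i1 = da.get(b[j - 1], 0)
            j1 = db
            d = 1
            if a[i - 1] == b[j - 1]:
                d = 0
                db = j
            h[i + 1][j + 1] = min(
                h[i][j] + d,                                   # match or substitution
                h[i + 1][j] + 1,                               # insertion
                h[i][j + 1] + 1,                               # deletion
                h[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),   # transposition
            )
        da[a[i - 1]] = i
    return h[lena + 1][lenb + 1]
```

Allocating the matrix on every call is itself a cost, which is presumably why the original passes h in; a preallocated matrix sized for the longest string can be reused as long as the borders are reset for each pair.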

Tags: python, string-matching, cython

6 Answers

Stack Overflow user

Accepted answer

Posted 2011-04-07 21:21:10

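At least for longer strings, you should get better performance from a different algorithm that does not have to compute every value in the lena⋅lenb matrix. For example, it is often unnecessary to compute the exact cost of the [lena][0] corner of the matrix, which represents the cost of starting by deleting all characters in a.

A better algorithm could be to always look at the point with the lowest weight computed so far, and then take one step in every direction from there. That way you may reach the target position without examining every position in the matrix.

An implementation of this algorithm could use a priority queue and look like this: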

from heapq import heappop, heappush

def distance(a, b):
    # Queue entries are (weight, i, j): i characters of a and j characters
    # of b have been consumed at a total cost of weight.
    pq = [(0, 0, 0)]
    lena = len(a)
    lenb = len(b)
    while True:
        (wgh, i, j) = heappop(pq)
        if i == lena and j == lenb:
            return wgh
        if i < lena:
            # deleted
            heappush(pq, (wgh+1, i+1, j))
        if j < lenb:
            # inserted
            heappush(pq, (wgh+1, i, j+1))
        if i < lena and j < lenb:
            if a[i] == b[j]:
                # unchanged
                heappush(pq, (wgh, i+1, j+1))
            else:
                # changed
                heappush(pq, (wgh+1, i+1, j+1))
        # ... more possibilities for changes, like your "+(i-i1-1)+1+(j-j1-1)"

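This is only a rough implementation, and it can be improved a lot:

When adding new coordinates to the queue, check:
- if the coordinates have already been processed, don't add them again
- if the coordinates are currently in the queue, keep only the instance with the better weight attached

Use a priority queue implemented in C instead of the heapq module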
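
The two queue checks listed above can be sketched as follows. This is one possible reading, not the answerer's code: a dictionary named best (a name introduced here) records the lowest weight pushed for each coordinate, worse candidates are never pushed, and stale queue entries are skipped when popped. Like the rough version, it omits the transposition step, so it computes the plain Levenshtein distance.

```python
from heapq import heappop, heappush


def distance(a, b):
    """Lowest-weight-first search over the edit matrix (no transpositions)."""
    lena, lenb = len(a), len(b)
    best = {(0, 0): 0}   # lowest weight pushed so far for each (i, j)
    pq = [(0, 0, 0)]     # entries are (weight, i, j)

    def push(wgh, i, j):
        # Only queue a coordinate if this path is strictly better than
        # anything already queued or processed for it.
        if wgh < best.get((i, j), lena + lenb + 1):
            best[(i, j)] = wgh
            heappush(pq, (wgh, i, j))

    while pq:
        wgh, i, j = heappop(pq)
        if wgh > best[(i, j)]:
            continue  # stale entry: a better one was pushed later
        if i == lena and j == lenb:
            return wgh
        if i < lena:
            push(wgh + 1, i + 1, j)                    # deletion
        if j < lenb:
            push(wgh + 1, i, j + 1)                    # insertion
        if i < lena and j < lenb:
            push(wgh + (a[i] != b[j]), i + 1, j + 1)   # match or substitution
```

Skipping stale entries on pop avoids having to remove items from the middle of the heap, which heapq does not support directly.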

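Votes: 0

Stack Overflow user

Posted 2011-10-14 23:30:12

If several words are returned in the search (that is, if you need to compute the Damerau-Levenshtein distance for the same input strings more than once), you could consider using a dictionary (or hashmap) to cache the results. Here is an implementation in C#: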

    // Requires: using System.Collections.Generic;
    // The cache is keyed by the strings themselves. Keying by GetHashCode() can
    // return a wrong distance when two different strings share a hash code.
    private static readonly Dictionary<string, Dictionary<string, int>> DamerauLevenshteinDictionary =
        new Dictionary<string, Dictionary<string, int>>();

    public static int DamerauLevenshteinDistanceWithDictionaryCaching(string word1, string word2)
    {
        Dictionary<string, int> word1Dictionary;

        if (!DamerauLevenshteinDictionary.TryGetValue(word1, out word1Dictionary))
        {
            // word1 hasn't been found in the dictionary, so add an entry for it.
            word1Dictionary = new Dictionary<string, int>();
            DamerauLevenshteinDictionary.Add(word1, word1Dictionary);
        }

        int distance;

        if (!word1Dictionary.TryGetValue(word2, out distance))
        {
            // The match with word2 hasn't been cached yet: compute and store it.
            // DamerauLevenshteinDistance is the uncached function, defined elsewhere.
            distance = DamerauLevenshteinDistance(word1, word2);
            word1Dictionary.Add(word2, distance);
        }

        // Either way, the distance is now in the dictionary.
        return distance;
    }

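Votes: 1

Stack Overflow user

Posted 2011-04-07 21:23:19

It looks like you could statically type more of the code than you currently do, which would improve the speed.

You could also look at a Cython implementation of Levenshtein distance as an example: http://hackmap.blogspot.com/2008/04/levenshtein-in-cython.html

Votes: 0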

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/5581120
