首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >Damerau-Levenshtein距离算法,禁用删除计数

Damerau-Levenshtein距离算法,禁用删除计数
EN

Stack Overflow用户
提问于 2012-08-19 14:34:47
回答 3查看 3.1K关注 0票数 3

我如何禁用删除计数,在这个实现的Damerau-Levenshtein距离算法,或如果有其他的算法已经实现,请指出它。

示例(禁用删除计数):

string1:你好吗?

string2: oyu怎么样了?

距离:1(对于4删除不计算)

下面是算法:

代码语言:javascript
复制
    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer_transpositionRow
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            var im1 = i - 1;
            var im2 = i - 2;
            var minDistance = threshold;
            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var jm1 = j - 1;
                var jm2 = j - 2;
                var cost = string1[im1] == string2[jm1] ? 0 : 1;

                var del = d[im1, j] + 1;
                var ins = d[i, jm1] + 1;
                var sub = d[im1, jm1] + cost;

                //Math.Min is slower than native code
                //d[i, j] = Math.Min(del, Math.Min(ins, sub));
                d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;

                if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
                    d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);

                if (d[i, j] < minDistance)
                    minDistance = d[i, j];
            }

            if (minDistance > threshold)
                return int.MaxValue;
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold
            ? int.MaxValue
            : d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }
EN

回答 3

Stack Overflow用户

回答已采纳

发布于 2012-08-23 08:48:07

代码语言:javascript
复制
 public static int DamerauLevenshteinDistance( string string1
                                            , string string2
                                            , int threshold)
{
    // Return trivial case - where they are equal
    if (string1.Equals(string2))
        return 0;

    // Return trivial case - where one is empty
    // WRONG FOR YOUR NEEDS: 
    // if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
    //      return (string1 ?? "").Length + (string2 ?? "").Length;

    //DO IT THIS WAY:
    if (String.IsNullOrEmpty(string1))
        // First string is empty, so every character of 
        // String2 has been inserted:
        return (string2 ?? "").Length;
    if (String.IsNullOrEmpty(string2))
        // Second string is empty, so every character of string1 
        // has been deleted, but you dont count deletions:
        return 0;

    // DO NOT SWAP THE STRINGS IF YOU WANT TO DEAL WITH INSERTIONS
    // IN A DIFFERENT MANNER THEN WITH DELETIONS:
    // THE FOLLOWING IS WRONG FOR YOUR NEEDS:
    // // Ensure string2 (inner cycle) is longer_transpositionRow
    // if (string1.Length > string2.Length)
    // {
    //     var tmp = string1;
    //     string1 = string2;
    //     string2 = tmp;
    // }

    // Return trivial case - where string1 is contained within string2
    if (string2.Contains(string1))
        //all changes are insertions
        return string2.Length - string1.Length;

    // REVERSE CASE: STRING2 IS CONTAINED WITHIN STRING1
    if (string1.Contains(string2))
        //all changes are deletions which you don't count:
        return 0;

    var length1 = string1.Length;
    var length2 = string2.Length;


    // PAY ATTENTION TO THIS CHANGE!
    // length1+1 rows is way too much! You need only 3 rows (0, 1 and 2)
    // read my explanation below the code!
    // TOO MUCH ROWS: var d = new int[length1 + 1, length2 + 1];
    var d = new int[2, length2 + 1];

    // THIS INITIALIZATION COUNTS DELETIONS. YOU DONT WANT IT
    // or (var i = 0; i <= d.GetUpperBound(0); i++)
    //    d[i, 0] = i;

    // But you must initiate the first element of each row with 0:
    for (var i = 0; i <= 2; i++)
        d[i, 0] = 0;


    // This initialization counts insertions. You need it, but for
    // better consistency of code I call the variable j (not i):
    for (var j = 0; j <= d.GetUpperBound(1); j++)
        d[0, j] = j;


    // Now do the job:
    // for (var i = 1; i <= d.GetUpperBound(0); i++)
    for (var i = 1; i <= length1; i++)
    {
        //Here in this for-loop: add "%3" to evey term 
        // that is used as first index of d!

        var im1 = i - 1;
        var im2 = i - 2;
        var minDistance = threshold;
        for (var j = 1; j <= d.GetUpperBound(1); j++)
        {
            var jm1 = j - 1;
            var jm2 = j - 2;
            var cost = string1[im1] == string2[jm1] ? 0 : 1;

            // DON'T COUNT DELETIONS!  var del = d[im1, j] + 1;
            var ins = d[i % 3, jm1] + 1;
            var sub = d[im1 % 3, jm1] + cost;

            // Math.Min is slower than native code
            // d[i, j] = Math.Min(del, Math.Min(ins, sub));
            // DEL DOES NOT EXIST  
            // d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;
            d[i % 3, j] = ins <= sub ? ins : sub;

            if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
                d[i % 3, j] = Math.Min(d[i % 3, j], d[im2 % 3, jm2] + cost);

            if (d[i % 3, j] < minDistance)
                minDistance = d[i % 3, j];
        }

        if (minDistance > threshold)
            return int.MaxValue;
    }

    return d[length1 % 3, d.GetUpperBound(1)] > threshold
        ? int.MaxValue
        : d[length1 % 3, d.GetUpperBound(1)];
}

来了我的解释为什么你只需要3行:

看看这一行:

代码语言:javascript
复制
var d = new int[length1 + 1, length2 + 1];

如果一个字符串的长度为n,另一个字符串的长度为m,那么您的代码需要一个(n+1)*(m+1)整数的空格。每个整数需要4字节。如果字符串很长,这就是浪费内存。如果这两个字符串都是35.000字节长的,则需要超过4GB的内存!

在这段代码中,您计算并为d[i,j]编写了一个新值。要做到这一点,您可以从它的上邻居(d[i,jm1]),从它的左邻居(d[im1,j]),从它的左上角邻居(d[im1,jm1]),从它的双上双左邻居(d[im2,jm2])读取值。因此,您只需要从实际行和之前的2行中获得值。

您不需要任何其他行的值。那你为什么要把它们储存起来?三行就足够了,我的更改使shure可以在任何时候使用这3行而不读取任何错误的值。

票数 6
EN

Stack Overflow用户

发布于 2012-08-21 20:09:49

我建议不要重写这个特定的算法来处理“免费”编辑的特定情况。其中许多从根本上简化了问题的概念,以至于度量无法传递任何有用的信息。

例如,当替换为空闲时,所有字符串之间的距离是它们之间长度的差异。只需将较小的字符串转换为较大字符串的前缀并添加所需的字母即可。(您可以保证没有较小的距离,因为编辑距离的每个字符都需要一个插入。)

当换位是免费的,问题减少到确定字母计数的差异之和。(由于所有字谜之间的距离为0,所以对每个字符串中的字母进行排序和交换或删除较大字符串的非公共元素是最佳策略。数学参数类似于前面的例子。)

在这种情况下,当插入和删除空闲时,任意两个字符串之间的编辑距离为零。如果只有插入或删除是免费的,这就破坏了距离度量的对称性--如果使用自由删除,从a到aa的距离是1,而从aa到a的距离是1。根据应用程序的不同,这可能是可取的;但我不确定这是否是您感兴趣的东西。您将需要极大地改变所提出的算法,因为它使得所提到的一个字符串的假设总是比另一个字符串长。

票数 2
EN

Stack Overflow用户

发布于 2012-08-20 10:37:06

尝试将var del = d[im1, j] + 1;改为var del = d[im1, j];,我认为这解决了您的问题。

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/12027324

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档