文章/答案/技术大牛

发布

社区首页 >问答首页 >字符串C#列表中字符串匹配的最优比较算法

问字符串C#列表中字符串匹配的最优比较算法
EN

Stack Overflow用户

提问于 2015-06-01 16:22:34

回答 5查看 4.3K关注 0票数 3

比如说我有10万字的单子。我想知道一个给定的字符串是否匹配该列表中的任何单词，并且我想以最快的方式完成它。此外，我还想知道是否有其他的单词，是由第一个字符开始形成的，在这个字符串中是否出现在列表中。

例如：

说你的绳子是"icedtgg“

“冰”“冰”

我试图想出一个最优的比较算法，它告诉我下面的单词是否在我的列表中。

到目前为止，我有10万字的列表存储在一个

Dicitonary<char, List<string>> WordList;

其中char是单词的第一个字符，而List<string>是以该字符开头的所有单词。

因此，WordList['a']列出了所有以'a‘开头的单词(“猿”、“苹果”、“艺术”等)。“b”列出了以b等开头的所有单词。

因为我知道我所有的单词都以'i‘开头，所以我首先可以将我的解决方案从100,000字缩小到仅仅以'i’开头的单词。

List<string> CurrentWordList = WordList['i'];

现在我检查

if( CurrentWordList[0].Length == 1 )

然后我知道我的第一个字符串是匹配的"i“，因为"i”将是列表中的第一个单词im。这些列表是预先按字母顺序排序的，这样不会减慢匹配的速度。

有什么想法吗？

*不，这不是一种HW辅助，我是一个程序软件架构师，试图为乐趣/爱好/游戏开发找到一个最优的匹配算法。

string

algorithm

optimization

language-agnostic

回答 5

Stack Overflow用户

回答已采纳

发布于 2015-06-02 01:35:41

我决定添加这个答案并不是因为它是解决您的问题的最佳解决方案，而是为了说明两个可能的解决方案，它们相对简单，并且在某种程度上符合您似乎遵循的方法。

下面的(非优化的)示例提供了一个极端的简单前缀trie实现，它使用每个消耗的字符一个节点。

public class SimplePrefixTrie
{
    private readonly Node _root = new Node(); // root represents empty string.

    private class Node
    {
        public Dictionary<char, Node> Children;
        public bool IsTerminal; // whether a full word ends here.

        public Node Find(string word, int index)
        {
            var child = default(Node);
            if (index < word.Length && Children != null)
                Children.TryGetValue(word[index], out child);
            return child;
        }

        public Node Add(string word, int toConsume)
        {
            var child = default(Node);
            if (toConsume == word.Length)
                this.IsTerminal = true;
            else if (Children == null || !Children.TryGetValue(word[toConsume], out child))
            {
                if (Children == null)
                    Children = new Dictionary<char, Node>();
                Children[word[toConsume]] = child = new Node();
            }
            return child;
        }
    }

    public void AddWord(string word)
    {
        var ndx = 0;
        var cur = _root;
        while (cur != null)
            cur = cur.Add(word, ndx++);
    }

    public IEnumerable<string> FindWordsMatchingPrefixesOf(string searchWord)
    {
        var ndx = 0;
        var cur = _root;
        while (cur != null)
        {
            if (cur.IsTerminal)
                yield return searchWord.Substring(0, ndx);
            cur = cur.Find(searchWord, ndx++);
        }
    }
}

下面还添加了压缩前缀trie的一个简单实现。它遵循与上面的示例几乎相同的方法，但是存储共享前缀部分，而不是单个字符。当现有的存储前缀变为共享并需要分成两部分时，它会拆分节点。

public class SimpleCompressedPrefixTrie
{
    private readonly Node _root = new Node();

    private class Node
    {
        private Dictionary<char, Node> _children;
        public string PrefixValue = string.Empty;
        public bool IsTerminal;

        public Node Add(string word, ref int startIndex)
        {
            var n = FindSharedPrefix(word, startIndex);
            startIndex += n;
            if (n == PrefixValue.Length) // full prefix match
            {
                if (startIndex == word.Length) // full match
                    IsTerminal = true;
                else
                    return AddToChild(word, ref startIndex);
            }
            else // partial match, need to split this node's prefix.
                SplittingAdd(word, n, ref startIndex);
            return null;
        }

        public Node Find(string word, ref int startIndex, out int matchLen)
        {
            var n = FindSharedPrefix(word, startIndex);
            startIndex += n;
            matchLen = -1;
            if (n == PrefixValue.Length)
            {
                if (IsTerminal)
                    matchLen = startIndex;
                var child = default(Node);
                if (_children != null && startIndex < word.Length && _children.TryGetValue(word[startIndex], out child))
                {
                    startIndex++; // consumed map key character.
                    return child;
                }
            }
            return null;
        }

        private Node AddToChild(string word, ref int startIndex)
        {
            var key = word[startIndex++]; // consume the mapping character
            var nextNode = default(Node);
            if (_children == null)
                _children = new Dictionary<char, Node>();
            else if (_children.TryGetValue(key, out nextNode))
                return nextNode;
            var remainder = word.Substring(startIndex);
            _children[key] = new Node() { PrefixValue = remainder, IsTerminal = true };
            return null; // consumed.
        }

        private void SplittingAdd(string word, int n, ref int startIndex)
        {
            var curChildren = _children;
            _children = new Dictionary<char, Node>();
            _children[PrefixValue[n]] = new Node()
            {
                PrefixValue = this.PrefixValue.Substring(n + 1),
                IsTerminal = this.IsTerminal,
                _children = curChildren
            };
            PrefixValue = PrefixValue.Substring(0, n);
            IsTerminal = startIndex == word.Length;
            if (!IsTerminal)
            {
                var prefix = word.Length > startIndex + 1 ? word.Substring(startIndex + 1) : string.Empty;
                _children[word[startIndex]] = new Node() { PrefixValue = prefix, IsTerminal = true };
                startIndex++;
            }
        }

        private int FindSharedPrefix(string word, int startIndex)
        {
            var n = Math.Min(PrefixValue.Length, word.Length - startIndex);
            var len = 0;
            while (len < n && PrefixValue[len] == word[len + startIndex])
                len++;
            return len;
        }
    }

    public void AddWord(string word)
    {
        var ndx = 0;
        var cur = _root;
        while (cur != null)
            cur = cur.Add(word, ref ndx);
    }

    public IEnumerable<string> FindWordsMatchingPrefixesOf(string searchWord)
    {
        var startNdx = 0;
        var cur = _root;
        while (cur != null)
        {
            var matchLen = 0;
            cur = cur.Find(searchWord, ref startNdx, out matchLen);
            if (matchLen > 0)
                yield return searchWord.Substring(0, matchLen);
        };
    }
}

用法示例：

var trie = new SimplePrefixTrie(); // or new SimpleCompressedPrefixTrie();
trie.AddWord("hello");
trie.AddWord("iced");
trie.AddWord("i");
trie.AddWord("ice");
trie.AddWord("icecone");
trie.AddWord("dtgg");
trie.AddWord("hicet");
foreach (var w in trie.FindWordsMatchingPrefixesOf("icedtgg"))
    Console.WriteLine(w);

产出：

i
ice
iced

更新:选择正确的数据结构

我认为一个更新可以提供一些价值来说明如何选择一个适合这个问题的数据结构是非常重要的，并且涉及到什么样的权衡。因此，我创建了一个小的基准应用程序，它测试到目前为止对这个问题提供的答案中的策略，而不是基准参考实现。

朴素：是最简单的天真解决方案。
JimMischel:基于这个答案的方法。
MattyMerrix:是基于您自己的答案这里。
JimMattyDSL:结合了'JimMischel‘和'MattyMerrix’的方法，并在排序列表中使用了更优化的二进制字符串搜索。
CompessedTrie和SimpleTrie基于这个答案中描述的两个实现。

完整的基准代码可以在这个要旨中找到。在10,000、100,000和1,000,000 (随机生成的字符序列)字典中运行它并搜索所有5,000个词的前缀匹配的结果如下：

将5000个单词匹配到10000个最大长度为25的字典中

       Method              Memory (MB)         Build Time (s)        Lookup Time (s)
        Naive          0.64-0.64, 0.64     0.001-0.002, 0.001     6.136-6.312, 6.210
   JimMischel          0.84-0.84, 0.84     0.013-0.018, 0.016     0.083-0.113, 0.102
  JimMattyDSL          0.80-0.81, 0.80     0.013-0.018, 0.016     0.008-0.011, 0.010
   SimpleTrie       24.55-24.56, 24.56     0.042-0.056, 0.051     0.002-0.002, 0.002
CompessedTrie          1.84-1.84, 1.84     0.003-0.003, 0.003     0.002-0.002, 0.002
  MattyMerrix          0.83-0.83, 0.83     0.017-0.017, 0.017     0.034-0.034, 0.034

将5000个单词匹配到100000个最大长度为25的字典中

       Method              Memory (MB)         Build Time (s)        Lookup Time (s)
        Naive          6.01-6.01, 6.01     0.024-0.026, 0.025  65.651-65.758, 65.715
   JimMischel          6.32-6.32, 6.32     0.232-0.236, 0.233     1.208-1.254, 1.235
  JimMattyDSL          5.95-5.96, 5.96     0.264-0.269, 0.266     0.050-0.052, 0.051
   SimpleTrie    226.49-226.49, 226.49     0.932-0.962, 0.951     0.004-0.004, 0.004
CompessedTrie       16.10-16.10, 16.10     0.101-0.126, 0.111     0.003-0.003, 0.003
  MattyMerrix          6.15-6.15, 6.15     0.254-0.269, 0.259     0.414-0.418, 0.416

将5000个单词匹配到1000000个最大长度为25的字典中

       Method              Memory (MB)         Build Time (s)        Lookup Time (s)
   JimMischel       57.69-57.69, 57.69     3.027-3.086, 3.052  16.341-16.415, 16.373
  JimMattyDSL       60.88-60.88, 60.88     3.396-3.484, 3.453     0.399-0.400, 0.399
   SimpleTrie 2124.57-2124.57, 2124.57  11.622-11.989, 11.860     0.006-0.006, 0.006
CompessedTrie    166.59-166.59, 166.59     2.813-2.832, 2.823     0.005-0.005, 0.005
  MattyMerrix       62.71-62.73, 62.72     3.230-3.270, 3.251     6.996-7.015, 7.008

如您所见，(非空间优化的)尝试所需的内存要高得多。它增加了所有测试实现的字典O(N)的大小。

正如预期的那样，尝试的查找时间或多或少是恒定的: O(k)，仅取决于搜索项的长度。对于其他实现，根据要搜索的字典的大小，时间将增加。

注意，可以为这个问题构造更多的最优实现，这将使搜索时间接近O(k)，并允许更紧凑的存储和减少内存占用。如果你映射到一个减少的字母表(例如。‘-’Z‘)，这也是可以利用的东西。

票数 6

Stack Overflow用户

发布于 2015-06-02 18:55:54

那么，您只想在字典中找到输入字符串的前缀单词吗？您可以比建议的任何方法更有效地完成这一任务。这真的只是一个修改过的合并。

如果您的单词列表包含一个由第一个字母组成的字典，每个条目都包含一个以该字母开头的单词排序列表，那么这就可以了。最坏的情况是O(n + m)，其中n是以字母开头的单词数，m是输入字符串的长度。

var inputString = "icegdt";
// get list of words that start with the first character
var wordsList = MyDictionary[input_string[0]];

// find all words that are prefixes of the input string
var iInput = 0;
var iWords = 0;
var prefix = inputString.Substring(0, iInput+1);
while (iInput < inputString.Length && iWords < wordsList.Count)
{
    if (wordsList[iWords] == prefix)
    {
        // wordsList[iWords] is found!
        ++iWords;
    }
    else if (wordsList[iWords] > prefix)
    {
        // The current word is alphabetically after the prefix.
        // So we need the next character.
        ++iInput;
        if (iInput < inputString.Length)
        {
            prefix = inputString.Substring(0, iInput+1);
        }
    }
    else
    {
        // The prefix is alphabetically after the current word.
        // Advance the current word.
        ++iWord;
    }
}

如果这是您要做的全部工作(查找作为输入字符串前缀的字典单词)，那么按第一个字符编制索引的字典没有特殊的原因。给定一个按顺序排列的单词列表，您可以在第一个字母上进行二进制搜索以找到起始点。这将比字典查找花费更多的时间，但与搜索单词列表以寻找匹配的时间相比，时间差将非常小。此外，排序的单词列表将比字典方法占用更少的内存。

如果要进行不区分大小写的比较，请将比较代码更改为：

    var result = String.Compare(wordsList[iWords], prefix, true);
    if (result == 0)
    {
        // wordsList[iWords] is found!
        ++iWords;
    }
    else if (result > 0)
    {

这也减少了每次迭代中字符串比较的次数，使之精确地减少到每迭代一个。

票数 2

Stack Overflow用户

发布于 2015-06-01 17:00:16

while (x < str.Length-1)
{
    if (ChrW(10) == GetChar(str, x) && ChrW(13) == GetChar(str, x+1))
     {
       // x+2 - This new line
     }
   x++;
}

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/30578450

复制

相似问题

问字符串C#列表中字符串匹配的最优比较算法
EN

回答 5

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问字符串C#列表中字符串匹配的最优比较算法EN

回答 5

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问字符串C#列表中字符串匹配的最优比较算法
EN