Q: Longest common substring optimization

Stack Overflow user

Asked 2012-05-07 07:10:56

Answers: 1 · Views: 532 · Followers: 0 · Votes: 2

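Can anyone help me optimize my longest common substring code? I have to read very large files (up to 2 GB), but I don't know which data structure to use. There is no hash map in C++. TBB has a concurrent hash map, but it is very complicated to use with this algorithm. I solved the problem with a **L matrix, but it is greedy with memory and cannot be used for large inputs. The matrix is full of zeros, which could be eliminated by using a map> and storing only the non-zero entries, but that is really slow and practically unusable. Speed is very important. The code is as follows: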

    // L[i][j] will contain the length of the longest common substring
    // ending at position i in refSeq and position j in otherSeq
    size_t **L = new size_t*[refSeq.length()];
    for(size_t i=0; i<refSeq.length();++i)
        L[i] = new size_t[otherSeq.length()];

    // iteration over the characters of the reference sequence
    for(size_t i=0; i<refSeq.length();i++){
        // iteration over the characters of the sequence to compare
        for(size_t j=0; j<otherSeq.length();j++){
            // if the characters are the same,
            // increase the consecutive matching score from the previous cell
            if(refSeq[i]==otherSeq[j]){
                if(i==0 || j==0)
                    L[i][j]=1;
                else
                    L[i][j] = L[i-1][j-1] + 1;
            }
            // or reset the matching score to 0
            else
                L[i][j]=0;
        }
    }

    // output the matches for this sequence
    // length must be at least minMatchLength
    // and the longest possible.
    for(size_t i=0; i<refSeq.length();i++){
        for(size_t j=0; j<otherSeq.length();j++){

            if(L[i][j]>=minMatchLength) {
                //this sequence is part of a longer one
                if(i+1<refSeq.length() && j+1<otherSeq.length() && L[i][j]<=L[i+1][j+1])
                    continue;
                //this sequence is part of a longer one
                if(i<refSeq.length() && j+1<otherSeq.length() && L[i][j]<=L[i][j+1])
                    continue;
                //this sequence is part of a longer one
                if(i+1<refSeq.length() && j<otherSeq.length() && L[i][j]<=L[i+1][j])
                    continue;
                cout << i-L[i][j]+2 << " " << i+1 << " " << j-L[i][j]+2 << " " << j+1 << "\n";

                // output the matching sequences for debugging :
                //cout << refSeq.substr(i-L[i][j]+1,L[i][j]) << "\n";
                //cout << otherSeq.substr(j-L[i][j]+1,L[i][j]) << "\n";
            }
        }
    }
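
Since L[i][j] depends only on L[i-1][j-1], the full matrix is not strictly needed: two rows of length otherSeq.length()+1 are enough. The following is a minimal sketch of that idea, not the method used in this thread. The names Match and findMatches are hypothetical, and the sketch assumes C++11. It applies only the diagonal "can this match be extended?" test (by comparing the next characters directly) and leaves out the two neighbour-cell filters from the code above. Memory drops to O(m), but the running time is still O(n*m), so on its own this is unlikely to be fast enough for 2 GB inputs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: one common substring that cannot be
// extended further along its diagonal. Positions are 0-based.
struct Match {
    std::size_t refStart;    // start of the match in refSeq
    std::size_t otherStart;  // start of the match in otherSeq
    std::size_t length;      // number of matching characters
};

// Hypothetical helper: same recurrence as the L matrix above,
// but keeping only the previous and the current row.
std::vector<Match> findMatches(const std::string& refSeq,
                               const std::string& otherSeq,
                               std::size_t minMatchLength) {
    std::vector<Match> matches;
    const std::size_t n = refSeq.size();
    const std::size_t m = otherSeq.size();

    // prev[j + 1] plays the role of L[i-1][j], curr[j + 1] of L[i][j].
    // Index 0 is a zero border, so no special case for i == 0 or j == 0.
    std::vector<std::size_t> prev(m + 1, 0), curr(m + 1, 0);

    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            if (refSeq[i] == otherSeq[j]) {
                // extend the run ending at (i-1, j-1)
                curr[j + 1] = prev[j] + 1;
                const std::size_t len = curr[j + 1];
                // the run continues if the next characters also match
                const bool extendable = i + 1 < n && j + 1 < m &&
                                        refSeq[i + 1] == otherSeq[j + 1];
                if (len >= minMatchLength && !extendable)
                    matches.push_back({i + 1 - len, j + 1 - len, len});
            } else {
                // mismatch: reset the run length
                curr[j + 1] = 0;
            }
        }
        prev.swap(curr);  // the current row becomes the previous row
    }
    return matches;
}
```

To reproduce the 1-based output format of the original code, a caller would print refStart+1, refStart+length, otherStart+1 and otherStart+length for each returned match.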

Tags: performance, optimization, parallel-processing, nested-loops, longest-substring

1 answer

Stack Overflow user

Answered 2012-05-10 01:02:36

There is also an Intel contest on this same problem.

Maybe they will publish some solutions once it is over.

http://software.intel.com/fr-fr/articles/AYC-early2012_home/

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/10475060
