文章/答案/技术大牛

发布

社区首页 >问答首页 >用于扫描字符串列表的Aho-Corasick算法

问用于扫描字符串列表的Aho-Corasick算法
EN

Code Review用户

提问于 2020-06-06 23:17:23

回答 1查看 241关注 0票数 2

这是对我以前的问题的后续，它是关于查找可迭代的最小值和最大值的。针对这一问题，提出了算法.下面是我使用阿糖皮图书馆的解决方案。

问题的短期再讨论：

你会得到两个数组(genes和health)，其中一个有“基因”的名字，另一个是“基因”的重量(也就是健康)。然后，您给出了一串字符串，每个字符串包含值m和n，它们表示要应用于genes和health数组的切片的开始和结束，以及“gene”-字符串，我们需要对它们确定是否健康。然后，我们需要为最健康的字符串和最不健康的字符串返回健康值。

我想这段代码可能有问题，但不确定是什么。它对小型测试用例很好，给出了与以前版本的解决方案大致相同的时间，但对于大型测试用例，我的PC基本上是挂起的。

一个小型测试案例的例子：

genes = ['a', 'b', 'c', 'aa', 'd', 'b']
health = [1, 2, 3, 4, 5, 6]
gene1 = "1 5 caaab" (result = 19 = max) 
gene2 = "0 4 xyz" (result = 0 = min) 
gene3 = "2 4 bcdybc" (result = 11)

大型testcase (2个列出每个100 K元素；testcase 41K+元素)：txt在我的dropbox (2,80 MB) (对pastebin来说太大了)

因此，我有两个问题：( 1)我的代码有什么问题，我如何才能赋予它的性能( 2)如何应用Aho-Corasick而不转到任何非标准库(因为，很可能，它不能安装在HackerRank服务器上)

def geneshealth(genes, health, testcase):
    from ahocorapy.keywordtree import KeywordTree
    import math

    min_weight = math.inf
    max_weight = -math.inf

    for case in testcase:
        #construct the keyword tree from appropriately sliced "genes" list
        kwtree = KeywordTree(case_insensitive=True)
        fl, ceil, g = case.split()
        for i in genes[int(fl):int(ceil)+1]:
            kwtree.add(i)
        kwtree.finalize()
        #search the testcase list for matches
        result = list(kwtree.search_all(g))

        hea = 0
        for gn, _ in result:
            for idx, val in enumerate(genes):
                if val == gn:
                    hea += health[idx]

        if hea < min_weight:
            min_weight = hea
        if hea > max_weight:
            max_weight = hea
    return(min_weight, max_weight)

python-3.x

programming-challenge

python

beginner

回答 1

Code Review用户

回答已采纳

发布于 2020-06-07 06:03:08

此代码很慢，因为：

它为每个测试用例构建一个新的关键字树。只要建造一次，利用所有的基因。
它构建了所有匹配关键字的列表。KeywordTree.search_all()是一个生成器，直接循环它。和,
它循环遍历基因列表来找到基因索引，这样它就可以找到health.，构建一个以基因作为键和一个(索引，健康)元组作为值的分块。

像这样的东西(未经测试)：

import math
from collections import defaultdict
from ahocorapy.keywordtree import KeywordTree


def geneshealth(genes, health, testcases):

    # build the kwtree using all the genes 
    kwtree = KeywordTree(case_insensitive=True)
    for gene in genes:
        kwtree.add(gene)
    kwtree.finalize()

    # build a dict that maps a gene to a list of (index, health) tuples
    index_and_health = defaultdict(list)
    for gene, data in zip(genes, enumerate(health)):
        index_and_health[gene].append(data)

    min_dna_health = math.inf
    max_dna_health = -math.inf

    for case in testcases:
        start, end, dna = case.split()
        start = int(start)
        end = int(end)

        dna_health = 0

        # search the dna for any genes in the kwtree
        # note: we don't care where the gene is in the dna
        for gene, _ in kwtree.search_all(dna):

            for gene_index, gene_health in index_and_health[gene]:

                # only genes that are within the testcase limits
                # contribute dna_health
                if start <= gene_index <= end:
                    dna_health += gene_health

        # keep the min/max weight
        if dna_health < min_dna_health:
            min_dna_health = dna_health

        if dna_health > max_dna_health:
            max_dna_health = dna_health

    return(min_dna_health, max_dna_health)

票数 1

页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://codereview.stackexchange.com/questions/243500

复制

相似问题

问用于扫描字符串列表的Aho-Corasick算法
EN

回答 1

Code Review用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用于扫描字符串列表的Aho-Corasick算法EN

回答 1

Code Review用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用于扫描字符串列表的Aho-Corasick算法
EN