
Q: Efficiently keeping track of the top-k keys of a dictionary based on value

Stack Overflow user

Asked 2013-03-15 06:50:19

Answers: 3, views: 2.7K, followers: 0, votes: 4

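How do you efficiently keep track of the top-k keys of a dictionary with the largest values while the dictionary's keys are being updated?

I have tried the naive approach of creating a sorted list from the dictionary after every update (as described in "Getting key with maximum value in dictionary?"), but that is very expensive and does not scale.

Real-world example:

Counting word frequencies from an infinite stream of data. At any given moment the program may be asked to report whether a word is among the current top-k most frequent values. How do we accomplish this efficiently?

collections.Counter is too slow: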

>>> from itertools import permutations
>>> from collections import Counter
>>> from timeit import timeit
>>> c = Counter()
>>> for x in permutations(xrange(10), 10):
    c[x] += 1


>>> timeit('c.most_common(1)', 'from __main__ import c', number=1)
0.7442058258093311
>>> sum(c.values())
3628800

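It takes almost a second to compute this value!

I am looking for O(1) time for a most_common() function. This should be doable with another data structure that internally stores only the current top-k items and keeps track of the current minimum value.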

sorting

dictionary

scalability

python

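3 Answers

Stack Overflow user

Accepted answer

Answered 2013-03-15 09:23:36

We can implement a class that keeps track of the top-k values, since I don't believe the standard library has this built in. It would be kept up to date in parallel with the main dictionary object (probably a Counter). You could also use it as an attribute of a subclass of the main dictionary object.

Implementation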

from collections import defaultdict


class MostCommon(object):
    """Keep track of the top-k key-value pairs.

    Attributes:
        top: Integer representing the top-k items to keep track of.
        store: Dictionary of the top-k items.
        min: The current minimum of any top-k item.
        min_set: Dictionary where keys are counts, and values are the
            set of keys with that count.
    """
    def __init__(self, top):
        """Create a new MostCommon object to track key-value pairs.

        Args:
            top: Integer representing the top-k values to keep track of.
        """
        self.top = top
        self.store = dict()
        self.min = None
        self.min_set = defaultdict(set)

    def _update_existing(self, key, value):
        """Update an item that is already one of the top-k values."""
        # Only strictly increasing values are handled for now.
        assert value > self.store[key]
        old = self.store[key]
        self.min_set[old].remove(key)
        if not self.min_set[old]:
            # Always drop an empty bucket so min() never sees a stale count.
            del self.min_set[old]
        self.min_set[value].add(key)
        self.store[key] = value
        if old == self.min:  # The minimum may have changed.
            self.min = min(self.min_set)

    def __contains__(self, key):
        """Boolean if the key is one of the top-k items."""
        return key in self.store

    def __setitem__(self, key, value):
        """Assign a value to a key.

        The item won't be stored if it is less than the minimum (and
        the store is already full). If the item is already in the store,
        the value will be updated along with the `min` if necessary.
        """
        # Store it if we aren't full yet.
        if len(self.store) < self.top:
            if key in self.store:  # We already have this item.
                self._update_existing(key, value)
            else:  # Brand new item.
                self.store[key] = value
                self.min_set[value].add(key)
                if self.min is None or value < self.min:
                    self.min = value
        else:  # We're full. The value must exceed the minimum to be added.
            if value > self.min:  # New item must be larger than current min.
                if key in self.store:  # We already have this item.
                    self._update_existing(key, value)
                else:  # Brand new item.
                    # Make room by removing one of the current minimums.
                    old = self.min_set[self.min].pop()
                    del self.store[old]
                    # Delete the set if there are no old minimums left.
                    if not self.min_set[self.min]:
                        del self.min_set[self.min]
                    # Add the new item.
                    self.min_set[value].add(key)
                    self.store[key] = value
                    self.min = min(self.min_set.keys())

    def __repr__(self):
        if len(self.store) < 10:
            store = repr(self.store)
        else:
            length = len(self.store)
            largest = max(self.store.values())
            store = '<len={length}, max={largest}>'.format(length=length,
                                                           largest=largest)
        return ('{self.__class__.__name__}(top={self.top}, min={self.min}, '
                'store={store})'.format(self=self, store=store))

Example usage

>>> common = MostCommon(2)
>>> common
MostCommon(top=2, min=None, store={})
>>> common['a'] = 1
>>> common
MostCommon(top=2, min=1, store={'a': 1})
>>> 'a' in common
True
>>> common['b'] = 2
>>> common
MostCommon(top=2, min=1, store={'a': 1, 'b': 2})
>>> common['c'] = 3
>>> common
MostCommon(top=2, min=2, store={'c': 3, 'b': 2})
>>> 'a' in common
False
>>> common['b'] = 4
>>> common
MostCommon(top=2, min=3, store={'c': 3, 'b': 4})

Access after updating values is indeed O(1):

>>> counter = Counter()
>>> for x in permutations(xrange(10), 10):
        counter[x] += 1

>>> common = MostCommon(1)
>>> for key, value in counter.iteritems():
    common[key] = value

>>> common
MostCommon(top=1, min=1, store={(9, 7, 8, 0, 2, 6, 5, 4, 3, 1): 1})
>>> timeit('repr(common)', 'from __main__ import common', number=1)
1.3251570635475218e-05

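Access is O(1). A set-item call, however, is an O(n) operation whenever it changes the minimum, where n is the number of top values being tracked. That is still better than Counter, which is O(n) on every access, where n is the size of the entire vocabulary!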
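The idea of hanging the tracker off a subclass of the main dictionary could look roughly like the sketch below. It is written for Python 3, and the names `TopKCounter`, `top` and `in_top` are made up for illustration. It assumes counts only ever increase, and for brevity it finds the smallest tracked key with an O(k) scan rather than the count buckets used in the answer above.

```python
from collections import Counter


class TopKCounter(Counter):
    """Hypothetical Counter subclass that also tracks its k largest counts.

    Assumes counts only ever increase, as in the word-frequency example,
    and that items are added one at a time via c[key] += 1.
    """

    def __init__(self, k):
        super().__init__()
        self.k = k
        self.top = {}  # key -> count for the keys currently in the top k

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if key in self.top or len(self.top) < self.k:
            self.top[key] = value
            return
        # Tracker is full: evict the smallest tracked key if we beat it.
        smallest = min(self.top, key=self.top.get)  # O(k) scan
        if value > self.top[smallest]:
            del self.top[smallest]
            self.top[key] = value

    def in_top(self, key):
        """O(1) membership test against the tracked top-k keys."""
        return key in self.top


counts = TopKCounter(2)
for word in "a b a c a b d".split():
    counts[word] += 1
```

After the loop, `counts.in_top("a")` answers the membership question without touching the rest of the vocabulary.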

Votes: 0

Stack Overflow user

Answered 2013-03-15 08:14:31

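collections.Counter.most_common does a pass over all the values, finding the N largest by putting them in a heap as it goes (in, I think, O(M log N) time, where M is the total number of dictionary elements).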
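As a rough illustration of that single pass (Python 3; as far as I know CPython's `most_common(n)` delegates to `heapq.nlargest` in much this way, but treat that as an assumption about the implementation rather than a documented guarantee):

```python
import heapq
from collections import Counter
from operator import itemgetter

counts = Counter("abracadabra")

# One pass over all M (key, count) pairs, keeping at most N candidates
# in a heap; here N = 2.
top2 = heapq.nlargest(2, counts.items(), key=itemgetter(1))
```

Every call repeats the full pass, which is why the cost grows with the size of the whole dictionary rather than with N.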

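heapq, as suggested in the comments, might work well: in parallel with the dictionary, maintain a heapq of the N largest values, and whenever you modify the dict, check whether the value is in the heap or should now be in it. The problem, as you have already noticed, is that the interface has no real way to modify the "priority" of an element that is already present (in your case the priority is the negated count, since heapq is a min-heap).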

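You can modify the relevant item in place and then run heapq.heapify to restore the heap property. This requires a linear pass over the heap (of size N) to find the relevant item (unless you do extra bookkeeping to associate elements with positions, which is probably not worth it), and another linear pass to re-heapify. When an element was not in the heap and now should be, you add it by replacing the smallest element (in linear time, barring some additional structure).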
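A minimal sketch of this modify-then-heapify approach, in Python 3. The helper name `set_count` and the `[-count, key]` entry layout are assumptions made for the example; entries are lists so they can be changed in place.

```python
import heapq


def set_count(heap, key, new_count):
    """Change key's count in a heap of [-count, key] entries.

    Both the search and the re-heapify are linear in len(heap).
    """
    for entry in heap:
        if entry[1] == key:
            entry[0] = -new_count  # counts are negated for the min-heap
            break
    else:
        raise KeyError(key)
    heapq.heapify(heap)  # restore the heap invariant in O(N)


heap = [[-3, "a"], [-1, "b"], [-2, "c"]]
heapq.heapify(heap)
set_count(heap, "b", 5)
```

With negated counts, `heap[0]` is always the entry with the largest count.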

The private interface of heapq, though, includes a function _siftdown with this comment:

# 'heap' is a heap at all indices >= startpos, except possibly for pos.  pos
# is the index of a leaf with a possibly out-of-order value.  Restore the
# heap invariant.

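Sounds good! Calling heapq._siftdown(heap, 0, pos_of_relevant_idx) will fix the heap in O(log N) time. Of course, you first have to find the position of the entry you are incrementing, which takes linear time. You could maintain a dictionary from elements to indices to avoid that (and also keep a pointer to the position of the smallest element), but then you would either have to copy the source of _siftdown and modify it to update that dictionary whenever it swaps things around, or do a linear-time pass to rebuild the dictionary afterwards (but avoiding linear passes was the whole point...).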
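Here is a hedged Python 3 sketch of the linear search followed by `_siftdown`. Note that `heapq._siftdown` is a private CPython helper and may change or disappear without notice; the `increment` function and the `[-count, key]` entry layout are assumptions for illustration.

```python
import heapq


def increment(heap, key, amount=1):
    """Raise key's count in a heap of [-count, key] entries.

    Relies on heapq._siftdown, a private helper that is not part of the
    documented heapq API.
    """
    for pos, entry in enumerate(heap):  # linear search for the entry
        if entry[1] == key:
            break
    else:
        raise KeyError(key)
    entry[0] -= amount  # a larger count means a smaller negated priority
    heapq._siftdown(heap, 0, pos)  # move the entry toward the root


heap = [[-5, "a"], [-3, "b"], [-4, "c"], [-1, "d"]]
heapq.heapify(heap)
increment(heap, "d", 10)
```

Because an increment can only make the negated priority smaller, the entry only ever needs to move toward the root, which is the direction `_siftdown` handles.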

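If you are careful, this should work out to O(log N) time. It turns out, though, that there is something called a Fibonacci heap, which supports all the operations you need in (amortized) constant time. Unfortunately, this is one of those cases where big-O is not the whole story: the complexity of Fibonacci heaps means that in practice, except perhaps for very large heaps, they are not actually faster than binary heaps. Also (perhaps "therefore"), I did not find a standard Python implementation in a quick search, although the Boost C++ libraries do include one.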

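I would first try using heapq, doing the linear search for the element you are changing and then calling _siftdown; this is O(N) time, compared to O(M log N) for the Counter approach. If that turns out to be too slow, you could maintain the additional dictionary of indices and write your own version of _siftdown that updates the dict, which should end up taking O(log N) time. If that is still too slow (which I doubt), you could look for a Python wrapper around Boost's Fibonacci heap (or another implementation), but I really doubt it would be worth the trouble.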
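The index-dictionary variant might be sketched as follows in Python 3. Everything here (`IndexedMaxHeap`, `pos`, `_sift_up`, `peek`) is a hypothetical illustration rather than part of heapq. For simplicity it stores plain counts in a max-heap and keeps every key in the heap instead of only the N largest, and it assumes counts only increase.

```python
class IndexedMaxHeap:
    """Hypothetical binary max-heap of counts with a key -> index map."""

    def __init__(self):
        self.heap = []  # list of [count, key] entries
        self.pos = {}   # key -> index of its entry in self.heap

    def _swap(self, i, j):
        h = self.heap
        h[i], h[j] = h[j], h[i]
        # Keep the index map in step with every swap.
        self.pos[h[i][1]] = i
        self.pos[h[j][1]] = j

    def _sift_up(self, i):
        # Move entry i toward the root while it beats its parent.
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[i][0] <= self.heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def increment(self, key, amount=1):
        """Add amount (assumed positive) to key's count in O(log n)."""
        if key not in self.pos:
            self.pos[key] = len(self.heap)
            self.heap.append([0, key])
        i = self.pos[key]  # O(1) lookup replaces the linear search
        self.heap[i][0] += amount
        self._sift_up(i)

    def peek(self):
        """Return (key, count) for the current maximum in O(1)."""
        count, key = self.heap[0]
        return key, count


words = IndexedMaxHeap()
for word in "a b a c a b d".split():
    words.increment(word)
```

The only change from a plain binary heap is that `_swap` updates `pos`, which is the modification to `_siftdown` that the answer describes.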

Votes: 2

Stack Overflow user

Answered 2013-03-15 06:52:28

Use collections.Counter; it already does this for your real-world example. Do you have any other use cases?

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/15426421
