I compared the performance of the statistics module's mean function with the simple sum(l)/len(l) approach and found that mean is for some reason very slow. I used timeit on the two code snippets below to compare them. Does anyone know what causes the huge difference in execution speed? I'm using Python 3.5.
from timeit import repeat
print(min(repeat('mean(l)',
'''from random import randint; from statistics import mean; \
l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))
The code above takes about 0.043 seconds to execute on my machine.
from timeit import repeat
print(min(repeat('sum(l)/len(l)',
'''from random import randint; from statistics import mean; \
l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))
The code above takes about 0.000565 seconds to execute on my machine.
Posted on 2016-05-30 21:22:30
Python's statistics module is not built for speed, but for precision.
In the specification for this module, it appears that:

The built-in sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the naive mean above fails this "torture test":

assert mean([1e30, 1, 3, -1e30]) == 1

It returns 0 instead of 1, a purely computational error of 100%. Using math.fsum inside mean would make it more accurate with float data, but it also has the side effect of converting any arguments to float, even when that is unnecessary. E.g. we should expect the mean of a list of Fractions to be a Fraction, not a float.
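Both the failure mode and the type-preservation point can be reproduced directly (a minimal sketch; the input values are just illustrative):

```python
import math
import statistics
from fractions import Fraction

data = [1e30, 1, 3, -1e30]

# Naive mean: 1e30 + 1 rounds back to 1e30, so the small terms vanish.
print(sum(data) / len(data))        # 0.0
print(statistics.mean(data))        # 1.0
print(math.fsum(data) / len(data))  # 1.0 -- fsum is exact for floats...

# ...but fsum always produces a float, while statistics.mean
# preserves the input type:
print(statistics.mean([Fraction(1, 2), Fraction(1, 3)]))  # 5/12
```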
By contrast, if we look at the implementation of _sum() in this module, the first lines of the method's docstring seem to confirm this:
def _sum(data, start=0):
"""_sum(data [, start]) -> (type, sum, count)
Return a high-precision sum of the given numeric data as a fraction,
together with the type to be converted to and the count of items.
[...] """因此,statistics实现sum,而不是对Python内置的sum()函数进行简单的一行调用,它本身大约需要20行,体内有一个嵌套的for循环。
This happens because statistics._sum chooses to guarantee maximum precision for all the types of number it could encounter (even if they differ wildly from one another), instead of simply prioritizing speed.
Hence, it is normal that the built-in sum proves about a hundred times faster. The cost is a much lower precision whenever you happen to call it with exotic numbers.
Other options
If you need to prioritize speed in your algorithms, you should look at NumPy instead, whose algorithms are implemented in C.
NumPy's mean is nowhere near as precise as statistics in the long run, but it implements (since 2013) a routine based on pairwise summation, which is better than a naive sum/len (more info in the link).
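Pairwise summation splits the data recursively and adds the halves, so rounding error grows roughly like O(log n) rather than O(n). A minimal pure-Python sketch (NumPy's actual implementation is in C and uses a larger, unrolled base case; the cutoff of 8 here is an arbitrary choice):

```python
def pairwise_sum(a):
    """Recursive pairwise summation of a sequence of numbers."""
    n = len(a)
    if n <= 8:                  # small base case: plain left-to-right sum
        return sum(a)
    mid = n // 2                # split in half and sum each part separately
    return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])
```

Note that for the torture-test list above this still returns 0.0: pairwise summation reduces accumulated rounding error over long arrays, it does not eliminate catastrophic cancellation.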
However...
import numpy as np
import statistics
np_mean = np.mean([1e30, 1, 3, -1e30])
statistics_mean = statistics.mean([1e30, 1, 3, -1e30])
print('NumPy mean: {}'.format(np_mean))
print('Statistics mean: {}'.format(statistics_mean))
> NumPy mean: 0.0
> Statistics mean: 1.0
Posted on 2016-05-30 21:27:58
If you do care about speed, use numpy/scipy/pandas instead:
In [119]: from random import randint; from statistics import mean; import numpy as np;
In [122]: l=[randint(0, 10000) for i in range(10**6)]
In [123]: mean(l)
Out[123]: 5001.992355
In [124]: %timeit mean(l)
1 loop, best of 3: 2.01 s per loop
In [125]: a = np.array(l)
In [126]: np.mean(a)
Out[126]: 5001.9923550000003
In [127]: %timeit np.mean(a)
100 loops, best of 3: 2.87 ms per loop
Conclusion: it will be orders of magnitude faster - in my example it was 700 times faster - but perhaps not as precise (since numpy does not use the Kahan summation algorithm).
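Kahan-style compensated summation carries a running correction term alongside the float total. A minimal sketch of the Neumaier variant (an improved Kahan scheme; this is an illustration only - neither library does this: statistics sums exact fractions and NumPy uses pairwise summation):

```python
def neumaier_sum(values):
    """Kahan-Babuska (Neumaier) compensated summation of floats."""
    total = 0.0
    comp = 0.0                       # compensation for lost low-order bits
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            comp += (total - t) + x  # low-order bits of x were lost
        else:
            comp += (x - t) + total  # low-order bits of total were lost
        total = t
    return total + comp

print(sum([1e30, 1, 3, -1e30]) / 4)           # 0.0
print(neumaier_sum([1e30, 1, 3, -1e30]) / 4)  # 1.0
```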
Posted on 2016-05-30 21:14:20
A while back I asked the same question, but once I noticed the _sum function called in mean at line 317 of the source, I understood why:
def _sum(data, start=0):
"""_sum(data [, start]) -> (type, sum, count)
Return a high-precision sum of the given numeric data as a fraction,
together with the type to be converted to and the count of items.
If optional argument ``start`` is given, it is added to the total.
If ``data`` is empty, ``start`` (defaulting to 0) is returned.
Examples
--------
>>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
(<class 'float'>, Fraction(11, 1), 5)
Some sources of round-off error will be avoided:
>>> _sum([1e50, 1, -1e50] * 1000) # Built-in sum returns zero.
(<class 'float'>, Fraction(1000, 1), 3000)
Fractions and Decimals are also supported:
>>> from fractions import Fraction as F
>>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
(<class 'fractions.Fraction'>, Fraction(63, 20), 4)
>>> from decimal import Decimal as D
>>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
>>> _sum(data)
(<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)
Mixed types are currently treated as an error, except that int is
allowed.
"""
count = 0
n, d = _exact_ratio(start)
partials = {d: n}
partials_get = partials.get
T = _coerce(int, type(start))
for typ, values in groupby(data, type):
T = _coerce(T, typ) # or raise TypeError
for n,d in map(_exact_ratio, values):
count += 1
partials[d] = partials_get(d, 0) + n
if None in partials:
# The sum will be a NAN or INF. We can ignore all the finite
# partials, and just look at this special one.
total = partials[None]
assert not _isfinite(total)
else:
# Sum all the partial sums using builtin sum.
# FIXME is this faster if we sum them in order of the denominator?
total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
Compared with calling the built-in sum, there are many more operations happening here, and per the docstring, mean computes a high-precision sum.
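The key idea in _sum above is that every number is converted to an exact integer ratio (via _exact_ratio) and the numerators are accumulated per denominator, so no rounding happens until the very end. The same effect can be sketched with fractions.Fraction directly, which also converts a float to its exact ratio:

```python
from fractions import Fraction

data = [1e30, 1, 3, -1e30]

# Every float has an exact rational value, so summing Fractions is exact.
exact_total = sum(Fraction(x) for x in data)
print(exact_total)                      # 4
print(float(exact_total) / len(data))   # 1.0
```

This exactness is what the dict of partial numerators buys; the cost is rational arithmetic, which is far slower than hardware float addition.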
You can see that using mean can give a different output:
In [7]: l = [.1, .12312, 2.112, .12131]
In [8]: sum(l) / len(l)
Out[8]: 0.6141074999999999
In [9]: mean(l)
Out[9]: 0.6141075
https://stackoverflow.com/questions/37533666
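The two results differ because sum accumulates a rounding error at every addition, while mean rounds only once, when converting the exact rational total back to float. This can be checked by computing the exact mean with Fractions (a sketch of the same idea _sum uses internally):

```python
import statistics
from fractions import Fraction

l = [.1, .12312, 2.112, .12131]

# Exact rational mean, rounded to float only at the very end:
exact_mean = sum(Fraction(x) for x in l) / len(l)
print(float(exact_mean) == statistics.mean(l))  # True
```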