我正在从这个链接黑客正午时间学习Timsort
有一条语句:“当运行次数等于或略小于2的幂时,合并2个数组更有效。Timsort选择minrun来确保这种效率,方法是确保minrun等于或小于2的幂。”
为什么“当运行次数等于或略小于2的幂时,合并2个数组更有效”?
发布于 2020-03-01 21:55:40
它适用于一般/随机情况下的平衡合并(如果给定的数据不是随机的,但具有长的自然运行,那么minrun和它的选择可能并不重要,而平衡更多地取决于运行的长度而不是运行的次数)。
在一般/随机的情况下,您希望生成精确的minrun元素的运行。然后,当运行次数为2的幂时,您就会在整个过程中得到完全平衡的合并。如果运行的次数略大于2的幂,则会导致非常不平衡的合并。如果运行的次数比2的幂小一点,则只会出现一点不平衡。
同样,对于一般/随机情况,这也是(至少大部分)。例如,如果您有9个自然运行长度为800,100,100,100,100,100,100,100,100,100,100元素,那么您也得到了完全平衡的合并,尽管运行的次数略大于2。
提姆在listsort.txt中对minrun的选择的描述摘录
Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case! Consider N=2112:
>>> divmod(2112, 32)
(66, 0)
>>>
If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32. The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end. The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).
If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced. Better!https://stackoverflow.com/questions/60416523
复制相似问题