素筛的想法很吸引人--但是如果你想让它更快的话,就必须分割它来增加代码的局部性。
Walish的see_sieve.cpp(见下文)向我展示了它的工作速度--快速。我花了一段时间来理解这些段是如何管理的。一旦我的版本开始工作,它比.cpp快5%。我有点失望--瓦利什的cpp看起来比超快更优雅。
在从int切换到unsigned,从(r-1)/2切换到r/2之后,我的代码现在几乎和一样快了。这是我的第一个问题:is这是一个特例?所有的r's都是奇怪的,但是编译器可能不知道。无论如何,我把所有的都改为无符号的(除了main()和argc )。
我需要这个索引到值的转换(2*i+1) (或者反之亦然)。我认为,我的片段的紧凑性在内部循环中显示了,候选人被删除了:
segment[r/2] = 0; THIS VERSION
sieve[j] = false; WALISH'S这是我的第二个猜测:由于内存使用在这里是至关重要的,我通过省略(而不仅仅是跳过)偶数来获得速度。
加:我的片段开始时没有2-13倍数。
在这里,我的注释代码:
/* Segmented Prime Sieve with pre-sieved pattern
Segmentation tricks based on Kim Walish's segmented_sieve.cpp, and wikipedia */
/* Counts primes under 1G in 0.4s (2.3GHz i5) (above .cpp: 0.7s) */
#include
#include
#include
#include
#include
/* This "inflates" a preprime pattern like "101" to "10.101101101.01" (dot = new zero) */
unsigned grow_pat_by(unsigned p, char *pattern, unsigned patsize) {
unsigned i;
/* Multiply by p */
for (i = 1; i < p; i++)
memcpy(pattern + i*patsize, pattern, patsize);
/* Sieve out every p */
for (i = p/2; i < p*patsize; i += p)
pattern[i] = 0;
return p * patsize;
}
/* Classic/simple self sieve (odds only), with lower limit for pre-sieved sieves */
void sieve_above(char *sieve, unsigned start, unsigned max) {
unsigned p, pp;
for (p = start; p <= sqrt(max); p += 2)
if (sieve[p/2])
for (pp = p*p; pp <= max; pp += 2*p)
sieve[pp/2] = 0;
}
/* Translates an index in preprimes[] into number in primes[]. Return value could be used to know size/number */
unsigned * gather_primes(char *preprimes, unsigned min, unsigned max, unsigned *primes) {
unsigned i;
for (i = min/2; i <= max/2; i++)
if (preprimes[i])
*primes++ = 2*i+1;
return primes;
}
/* A new segment activates higher primes[] into new runners[]
* In a segment[], the numbers are indexed: num=2*i+1 and i=(num-1)/2, but the "-1" does not matter (odd integers)
* Unsigned makes it about 20% faster */
void do_segment(unsigned segm_start, unsigned segm_end, unsigned char *segment, unsigned *primes, unsigned *runners, unsigned *irun) {
/* Do new prime squares fit in this segment? */
unsigned p;
while((p = primes[*irun]) > 0 && p*p <= segm_end)
runners[(*irun)++] = p*p - segm_start;
/* Run each runner[i] through and past the end of this segment */
unsigned segm_size = segm_end - segm_start;
unsigned i, r;
for(i = 0; (r = runners[i]) > 0; i++) {
while (r < segm_size) {
segment[r/2] = 0;
r += 2*primes[i];
}
/* Reset for next segment: same pattern, different offsets */
runners[i] = r - segm_size;
}
}
/* Initialize pattern[], preprimes[], primes[] and allocate runners[]. Prepare segment[] and call do_segment() */
void segmented_sieve(const unsigned MAX) {
const unsigned SQR = sqrt(MAX);
unsigned i;
/* pattern[] is used for preprimes[] and segment[]
* "101" is "1-(3)-5" i.e. "3" is sieved out. Even numbers are left out. Relation: number value is 2*index + 1 */
static char pattern[3*5*7*11*13 * 2] = {1, 0, 1};
unsigned patsize = 3;
/* grow (multiply and sieve) until patsize reaches magical ~30KB*/
patsize = grow_pat_by( 5, pattern, patsize);
patsize = grow_pat_by( 7, pattern, patsize);
patsize = grow_pat_by(11, pattern, patsize);
patsize = grow_pat_by(13, pattern, patsize);
/* preprimes[] shall deliver primes up to ~65K = sqrt(4G) = 32KB phys. (no odds) */
/* So copy pattern twice (phys. patsize: 15015)*/
char *preprimes = malloc(256*256/2);
memcpy(preprimes, pattern, patsize);
memcpy(preprimes+patsize, pattern, patsize);
/* Primes 2-13 can be skipped */
const unsigned FIRST = 17;
sieve_above(preprimes, FIRST, SQR);
/* Fill primes[] up to SQR, followed by zeroes. 7000 is enough to hold primes up to 65500 */
unsigned *primes = calloc(7000, sizeof*primes);
gather_primes(preprimes, FIRST, SQR, primes);
free(preprimes);
/* Optimize: stretch pattern[] by 2 (without sieving) to make segments a bit bigger (30KB) */
memcpy(pattern + patsize, pattern, patsize);
patsize *= 2;
/* The segment is a working copy of pattern[] that should fit into lowest cache
* 30030 is a primorial, but the "x2" only slips back in because of "stretching" of the odds-only 15015-pattern
* Without stretching, or with a second one, speed goes down (locality/cache) */
static unsigned char segment[3*5*7*11*13 * 2];
/* runners[i] has primes[i]'s current offset for do_segment() */
unsigned *runners = calloc(7000, sizeof*runners);
/* Segments advance in numbers by 2xpatsize (no odds in pattern)
* First one has "1" set, but lacks "2,3,5,7,11,13", thus "count" starts at 5. Last segment is simply truncated to MAX */
unsigned count = 5;
unsigned irun = 0;
unsigned segm_start, segm_end;
for (segm_start = 0; segm_start < MAX; segm_start += 2*patsize) {
/* Fresh pattern for segment */
memcpy(segment, pattern, patsize);
segm_end = segm_start + 2*patsize;
if (segm_end > MAX)
segm_end = MAX;
do_segment(segm_start, segm_end, segment, primes, runners, &irun);
/* Count (and possibly print by numbers) the sieved segment */
for(i=0; 2*i+1 < segm_end - segm_start; i++) {
count += segment[i];
//if (segment[i])
//printf("%d\n", segm_start + 2*i+1);
}
}
printf("PI of %d (prime count): %d\n", MAX, count);
return ;
}
int main(int argc, char** argv) {
if (argc < 2) {
printf("No limit given\n");
return 1;
}
unsigned MAX = atoi(argv[1]);
if (MAX < 100)
printf("Choose a number above 100, and below 4G. Not: %d\n", MAX);
segmented_sieve(MAX);
return 0;
}这里是Walish的c++版本;查看他的for(;...循环:
kim.walisch@gmail.com/@简短--这是一个简单的/ Eratosthenes分割筛的实现,并进行了一些优化。它在/IntelCorei7-6700 3.4GHz上用0.8秒(单线程)生成低于10^9的/素数。//@许可公共域。#include # #include # #include #include /在这里设置CPU的L1数据缓存大小(以字节为单位),Constint64_t L1D_CACHE_SIZE = 32768;//使用Eratosthenes的分段筛子生成素数。/该算法使用O(n log n)操作和O(sqrt(n))空间。/ @param限制筛子素数<=限制。/ segmented_sieve(int64_t极限){ int64_t sqrt = ( int64_t ) std::sqrt(极限);int64_t segment_size = std::max(sqrt,L1D_CACHE_SIZE);int64_t计数=(极限< 2)?0: 1;//我们筛选引物>= 3 int64_t i= 3;int64_t n= 3;int64_t S= 3;std::vector (Segment_size);std::vector is_prime(sqrt + 1,true);std::vector素数;std::vector倍数;for (int64_t low = 0;低<=极限;低+= segment_size) { std::fill(sieve.begin(),sieve.end(),true);// current段= int64_t high =低+ segment_size - 1;high =std:min(高,限);//使用Eratosthenes的简单筛子生成筛选素数,用于(;i*i <= high;i += 2) if ( is_prime ) for (int64_t j=i* i;j <= sqrt;j += i) is_prime= false;//初始化用于分段筛子的筛分素数:(;S*S <= high;<= += 2) { if (is_prime) {primes.push_back();multiples.push_back(s *S- low);} //筛(std::size_t i= 0;i< primes.size();i++) { int64_t j=倍数;(int64_t k=素数* 2;j< segment_size;j += k)筛子=假;倍数=j- segment_size;} for (;n <=高);N += 2)如果(count++) // n是素数count++;} std::cout << count <<“primes .”<< std::endl;}/用法:./segmented_<= 2) segmented_sieve(std:环礁(Argv);<< segmented_sieve(1000000000);返回0;}
我可以用-O2确认他在i5-8xxx上的0.8s。我的版本需要-O3 (和-lm)。
我能进一步提高速度和/或设计吗?
如果它变得更快,我将不得不从unsigned改为long for MAX。我测试的最高值是30亿。
需要考虑的数学问题太多了,然后编译器需要帮助将17转化为8:10001 -> 1000。
(哦,这只是一次右转)
更新:变量可以保持int与segment[r>>1] = 0 (430 or ),而不是[r/2] (500 Or)甚至[(r-1)/2] (550 Or)。
使用unsigned r时,r>>1和r/2都是410 are。足够好了。
上面的版本(每个人都没有签名)仍然是最快的:400-405 is。(我见过0.400次,但从未见过0.399次(10亿次))
通常,这类优化是不赞成的,但这种内部循环在紧凑型的赔率段上并不是“正常”的。
unsigned *primes的不同之处在于:402 makes。因此,90%以上的unsigneds是不必要的。
发布于 2020-12-30 15:00:51
参数检查-我们测试缺少的参数;如果有额外的参数,我们也应该失败,而不仅仅是忽略它们。atoi()是一个很差的函数,可以将字符串转换为整数--特别是它接受负值,生成有符号的值,并且它不拒绝超出范围的值或尾随垃圾。考虑一下strtoul()或strtoull()。
在检查之后,我们应该打印一条错误消息(到stderr,而不是stdout),然后以失败状态退出。
对签名和未签名的处理很差;我的第一个测试给出了奇怪的输出:
./254091 3000000000
PI of -1294967296 (prime count): 144449537这是很容易避免的,看起来你对正确性一点也不在意。
正如您在描述中所暗示的,您可能需要更改用于算术的类型。但是它分散在代码中,给你很多地方可以改变。一个更好的开始应该是添加一个ty胡枝子,它可以很容易地被更改为可以使用更广泛的类型。
这里有一个更好的main():
int main(int argc, char** argv)
{
if (argc != 2) {
fprintf(stderr, "Usage: %s LIMIT\n", argv[0]);
return EXIT_FAILURE;
}
char *end;
unsigned long max = strtoul(argv[1], &end, 0);
if (*end || max < 100 || max > UINT_MAX) {
fprintf(stderr, "Choose a number above 100, and below %u. Not: %s\n",
UINT_MAX, argv[1]);
return EXIT_FAILURE;
}
segmented_sieve((unsigned int)max);
}发布于 2020-12-31 23:30:48
一些关于预先形成功能的关注是正确的,而不仅仅是快速。
Slow和.
sqrt(max)可能会对每个循环进行评估。这给了经典的方法一种不公平的业绩评估。
for (p = start; p <= sqrt(max); p += 2)有些编译器会“知道”sqrt(max)每次都会返回相同的结果,因此只调用sqrt(max)一次。这是不应该依赖的事情发生。
// Alternative
unsigned end = sqrt(max);
for (p = start; p <= end; p += 2)...也许是错误的
没有指定sqrt()来返回完全平方的确切平方根。随着高质量的实施,这个问题是罕见的。不过,总比整数截断好。
unsigned end = ceil(sqrt(max));当max (可能是64位整数类型)转换为double、sqrt()计算中不精确以及将结果从double转换为整数类型时,可能会出现问题。
代码可以使用您自己的isqrt()来避免这种情况--或者使用for (p = start; p <= max/p; p += 2)。看下一个。
Code风险溢出
当i * i <= high接近类型的最大值时,high可能会溢出。考虑i*i <= high可以是真的,但是下一次迭代中的(i+2)*(i+2)会溢出。
当i > 0时不会溢出的替换:
for (; i <= high/i; i += 2)发布于 2020-12-31 23:43:21
在这里,当计算剩余的筛分指数时,主要的差别是可见的:
n += 2
for (; n <= high; n += 2)
if (sieve[n - low]) // n is a prime
count++;i++
for(i=0; 2*i+1 < segm_end - segm_start; i++) {
count += segment[i];我的片段是双密度的,因为它们只有赔率。是的,+= 2跳过事件很容易,但浪费缓存空间。
2*i+1不够复杂,不足以失去这一优势。一个问题。使此i无签名无助于此。
这里有一个只有必要的unsigned的版本和一个关于该[r>>1]的注释。
void do_segment(int segm_start, int segm_end, // real numbers
char *segment, unsigned *primes, int *runners, // arrays
int *irun) { // activated primes/runners
unsigned p;
while((p = primes[*irun]) > 0 && p*p <= segm_end)
runners[(*irun)++] = p*p - segm_start;
unsigned segm_size = segm_end - segm_start;
unsigned i, r;
for(i = 0; (r = runners[i]) > 0; i++) {
while (r < segm_size) {
segment[r>>1] = 0; // or [r/2] but not [(r-1)/2] (1% slower)
r += 2 * primes[i];
}
runners[i] = r - segm_size;
}
}https://codereview.stackexchange.com/questions/254091
复制相似问题