The answer at http://mathoverflow.net/a/50691 contains a program that generates the list of distinct prime factors of each number n, temporarily stores that list (a certificate of compositeness) for n in an associative array cmp[], and then does further work based on the number of distinct prime factors of the composite number.

To explain the central for loop (n running from 2 to limit) briefly: any key n present in cmp[] is known to be composite, its certificate cmp[n] (a list of its prime factors, without multiplicity) is examined, and each factor f in the certificate is appended to the certificate for n+f (the list at cmp[n+f]). If n has no entry, it is prime, and the factor n is added as a certificate for the number n+n. I do not recall seeing this algorithm before, and am glad to see another version of it.

(The display at the link is not properly indented: the following statement “for(k in dir)…” is an inner loop which also does cool stuff but is not directly related to the Assembly Line algorithm.)

Gerhard “Ask Me About System Design” Paseman, 2015.03.31

First, the more important thing O’Neill points out in her delightful paper is that trial division is slower than the Sieve not just because its individual operations are slower, but because it performs many more of them.

Consider, as an extreme case, the number 10975969, which is the square of the prime 3313, the 466th prime. The Sieve never touches bucket 10975969 until it reaches 3313, at which point it squares 3313, sets that bucket, and then knocks out the handful of further multiples of 3313 before passing 11 million and stopping. By contrast, trial division must attempt 466 divisions to discover that the number is composite. It must attempt the same 466 divisions, all failing, for each of the 1481 primes between 10975969 and 11 million, while the Sieve simply observes in each case that the number has not yet been discovered to be composite, and prints it, in O(1) time. Many of the composite numbers take substantial time to expose as well; 10975973, for example, has only three prime factors (101, 109, 997), corresponding to the three times the Sieve touches its bucket, but its smallest prime factor is 101, the 26th prime, so trial division must try each of the first 26 primes before hitting upon a divisor. So trial division must do about eight times as much work for 10975973, even if division is no more expensive than addition.

Most odd composite numbers, of course, are divisible by 3, 5, 7, or 11, so the extra work done is on average not very large.

The consequence is that the Sieve does 49551204 accesses to its table to compute the primes under 11 million, if you do the usual begin-at-i² and ignore-even-numbers optimizations. (I’ve placed my Sieve program at http://canonical.org/~kragen/sw/inexorable-misc/sieve.c.) That’s an average of just over 4.5 operations per candidate number.

By contrast, your trial division code (in the form http://canonical.org/~kragen/sw/inexorable-misc/trialdivision.cc) needs to do 628869700 multiplies and divides, about 57 per candidate number, more than 12× as much. (And, without optimizing, it takes about four times as long as the unheapified Sieve on my machine.)

However, consider the terrible penalty the heapified version must pay: every single one of those bucket operations incurs O(lg N) heap-maintenance work. By the end, N in this case is 726517, almost a million, so lg N is about 19. And then we start worrying about constant factors: maybe division is slower than addition, but swapping a couple of numbers in a heap is surely more expensive than just reading one from a table; the heap consists of pairs of numbers, so its working set is twice the size of the trial-division case’s, and blows out your L1 and possibly your L2 cache that much sooner; C++ compilers are notoriously poor at optimizing STL instantiations (I believe this is documented in Elements of Programming, but if not, it is documented in the earlier draft, Notes on Programming); heap accesses are more random, so perhaps your cache hit rate goes down further; and so on.

If you could somehow reduce the number of heap operations per bucket knocked out, you could get the best of both worlds: a prime-spewing algorithm that is efficient in both time and space. There’s a straightforward way to do this for the small primes, which is where the problem exists! Instead of testing a *single* candidate number against the top items in the heap, use a fixed-size table of candidates; 4096 items, say, which could be 512 bytes (as a bitmap) or 4096 bytes (a byte per candidate), depending on which turns out to be faster. Initially these are the numbers 0 to 4095, later 4096 to 8191, and so on. Then, when you pull a prime out of the heap, you knock out all its multiples in this table, instead of just one. For primes over 2048, there’s no difference, but for 3, 5, and 7 — which, as I said before, are factors of the majority of composite numbers, and therefore account for the majority of the heap operations in your code — the difference is about three orders of magnitude.

I’m not at all sure that the heapified version (with a binary heap) is asymptotically faster than trial division. Obviously if you substitute a Fibonacci heap, it’s asymptotically faster.
