Another Scoop by Dijkstra?

Edsger W. Dijkstra (1930–2002) is known and remembered for many things in programming and in computing. But it seems that the problem of efficiently generating the sequence of prime numbers (2, 3, 5, 7, …) is not among them. I recently re-read his 1972 Notes on Structured Programming [5] and noticed for the first time, tucked away in a few lines, a remarkable insight that resolves the stand-off between the Sieve of Eratosthenes (efficient in terms of time, but not memory) and the method of Trial Division (efficient in terms of memory, but not time).

Primes: from tables to programs

First some background. Prime numbers have been bugging (some) humans for many centuries. The great Gauss himself was among those who fell under the spell of the primes. For lack of a published table of primes he made one up for himself; he pored over its columns in search of a pattern. In despair he gave up, concluding that it looked like the numbers were generated by some demon throwing dice. And, bingo, this probabilistic model led him [7] to the famous conjecture about the distribution of primes that turned out to be correct, but took almost a century to be proved.

Given this history, it is not surprising that those bitten by the bug of the primes welcomed the computer as an alternative to printed tables. Jacob Philip Kulik (1793 – 1863) devoted twenty years to a manuscript from which a printed table of the prime numbers up to eleven million was prepared [1]. Errors were corrected in the process; unkind souls allege that many remain. By the time computers arrived, more dependable sources existed, though few and far between. The process by which they were created inspired more confidence than poor Kulik slaving away in solitude, but worries about accuracy remained.

With the computer, a new phenomenon arose: the possibility of a program that generated in a matter of minutes (now seconds) all primes up to eleven million, the range of Kulik’s table. Now, rather than rely on authority, we can convince ourselves of the correctness of the table by studying the page or so of code that it takes to generate the first N primes. Still, even now there are trusting souls who put their faith in tables. For example, I found on the web an applet claiming to list the primes up to a hundred million, with that range served up in convenient chunks of a thousand at a time. But wouldn’t you rather rely on twenty lines of code that do the same job? (Appendix A).

The function in Appendix A is simple, and wasteful. For example, soon after eliminating all factors 2, it tries to divide by 4. It continues spending much of its time performing such superfluous operations. Yet it is quite effective for undemanding tasks like serving up the factorizations of, say, a thousand numbers. For reasons of uniformity, the code in Appendix A has been dumbed down to just indicating primality.

Trial Division versus Sieve of Eratosthenes

When we eliminate the redundancy of the function in Appendix A, we arrive at the method of Trial Division, which only tries the successive primes as potential divisors. The primes that it needs are generated by itself in an earlier iteration. Briefly, Trial Division works like this. Starting from the last prime found, generate as candidate primes the successive odd numbers and test them for divisibility by all the primes in succession. If no divisor is found by the time you reach the prime that is just greater than the square root of the candidate, you can stop and add the candidate as next prime to the table (Appendix B).

Trial Division is one of the stock examples in programming courses. Another is the Sieve of Eratosthenes, another algorithm that generates the sequence of primes. It is one of the earliest algorithms and was invented thousands of years ago. The name of the algorithm suggests that you imagine a sieve into which you drop all the numbers. You shake the sieve, and then the prime numbers come falling through. It helps to imagine the sieve as a superposition of sub-sieves: S2, S3, S5, S7, … Sub-sieve S2 catches all multiples of 2, S3 catches all multiples of 3, and so on for 5, 7, 11, … Each sieve keeps back the multiples of the prime it is dedicated to, so that only primes get through.

The process can be illustrated by the following bit of ASCII art, depicting a sort of Morse-code rendering of the Sieve of Eratosthenes.

```   ____________________________________________________________________________

0000000011111111112222222222333333333344444444445555555555666666666677
2345678901234567890123456789012345678901234567890123456789012345678901

S2  -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
S3  .-..-..-..-..-..-..-..-..-..-..-..-..-..-..-..-..-..-..-..-..-..-..-..
S5  ...-....-....-....-....-....-....-....-....-....-....-....-....-....-.
S7  .....-......-......-......-......-......-......-......-......-......-.
S11 .........-..........-..........-..........-..........-..........-.....

Figure 1.
The Sieve of Eratosthenes consisting of superposition of the sieves S2, S3,
S5, S7, and S11. Each column consisting of dots only corresponds to a prime.
____________________________________________________________________________```

Read the top two lines vertically, giving the numbers to be sieved: 02, 03, …, 10, 11, …, 70, 71. By default, numbers are rendered as dots. In the line for S2 the dot at each multiple of 2 has been replaced by a dash; the one after that has a dash at each multiple of 3, and so on for 5, 7, and 11. As a result, any column with dots only indicates that the number labeling the column is a prime.

The number theorists Derrick Lehmer (father and son) built devices based on bicycle chains with rods at certain significant intervals. The assembly could be set up in such a way that it acted as a sieve of Eratosthenes, or as a generator of solutions of a Diophantine equation [2].

To simulate the Sieve on a computer you could use an array starting with all zero bits and write ones every p locations, for p equal to 2, 3, 5, 7, and so on. Where Trial Division uses divisions, the Sieve generates the various sieves by means of additions. The Sieve used to be preferred because divisions were slower than additions. In the old days not all computers had hardware to perform the divisions required by Trial Division, so there could be a huge difference in speed.

The Assembly-Line algorithm

If you want to generate all primes up to N, then the Sieve method requires an amount of memory proportional to N. Trial Division needs memory for the table of the first primes, and this table need only have a size of the square root of N. So here is the stand-off: the Sieve is faster, while Trial Division needs less memory.

The difference between the two methods is an example of a phenomenon that is not restricted to computing. Imagine that you have to manufacture a large number N of items. Suppose that each item requires M operations A, B, C, … to be performed on it. Imagine N in the thousands and M to be about the square root of N. Suppose furthermore that each operation requires some start-up time. For example you need to consult the cheat sheet on how to perform the operation and you have to get the tools ready. As a result it takes considerably less time to repeat the same operation on successive items than the same number of different operations on the same item.

This situation suggests two strategies:

1. minimize time — line up all N items and perform on each in turn A, then on each in turn operation B, and so on.
2. minimize number of items in process — perform operations A, B, C, … on the same item, and get it out of the way.

Of course we want a method that combines the advantages of both approaches.

Primes to be generated can be likened to items to be manufactured. The operations are ensuring non-divisibility by 2, by 3, by 5, and so on. The Sieve method is like laying out all items simultaneously. As a result, ensuring non-divisibility can be done fast: only additions are needed. Trial Division is like performing all operations on a single item before proceeding to the next. Because ensuring non-divisibility by a given prime is done in isolation from the same operation on the other items, the division operation is needed, which is more expensive than addition. As in manufacturing, the disadvantages of both strategies suggest looking for a way that combines the advantages of both approaches.

In manufacturing the assembly line has been invented to combine speed with reducing the number of in-process items. In an assembly line there is one worker for each operation, who only performs that operation on the items as they are successively presented by a traveling belt. Operations are performed at maximum speed. The device of the assembly line reduces the number of in-process items from N to M.

Let us translate the idea of the assembly line to the Sieve of Eratosthenes as illustrated in Figure 1. Suppose we want to check whether N = 36 is a prime. The workers on the assembly line are represented as the set of multiples of 2, 3, 5, 7, and 11 that are in column N or the first such to the right. For N = 36 these are 36, 40, 42, and 44. As one of the multiples is equal to N, N is dismissed as non-prime. N is incremented. Any multiples that are to the left of N are incremented by the corresponding prime; in this case 36 is incremented to 38 in its role as multiple of 2 and to 39 in its role as multiple of 3. Now we have that N = 37 is less than the least multiple under consideration; this proves it to be prime.

The operations here are addition and identification of the least of the multiples. This latter operation is potentially significant: for N at eleven million there are over three thousand multiples. Therefore the multiples should be organized as a heap.

See Appendix C for further explanation and a C++ implementation.

Whence the Assembly-Line algorithm?

What should we acknowledge as the source of the Assembly-Line algorithm? Before I noticed Dijkstra’s hidden gem, the only source I knew for the assembly-line algorithm was a 2008 paper by Melissa O’Neill [3]. O’Neill refers to the analogue of the assembly line as “just-in-time”. This sounds like an analogy with manufacturing, but it is not. “Just-in-time” manufacturing refers to coordination between factories, not to what happens within a single factory. Apart from this discrepancy, “just-in-time” is an apt characterization of a balanced assembly line.

O’Neill gives no source for the Assembly-Line algorithm, so I assume she is one of the inventors of it. The book by Crandall and Pomerance [4] seems unaware of it. Their pseudo-code for the Sieve is unclear about details. I surmise they were not aware of the Assembly-Line algorithm because of their comment on page 121 “The biggest computer limitation on sieves is the enormous amount of space they can consume.”

The earliest description of the Assembly-Line algorithm that I know of is Dijkstra’s essay “Notes on Structured Programming” [5]. Dijkstra’s discovery may not have received the attention it deserved because it was not presented as a novel solution for an important problem in prime-number generation. Dijkstra just wanted to describe in detail how to go about systematically developing an algorithm, some algorithm. He picked the generating of prime numbers as an example that is somewhat interesting, yet short enough to fit in the report he was writing.

For his example of systematic program development Dijkstra chooses the Trial Division algorithm. In an aside he remarks that the friends to whom he showed the draft objected that “Everybody knows that the Sieve is more efficient”, and refused to read further. After meticulously motivating each tiniest step, Dijkstra almost completes an algorithm for the Trial Division method. The last step is the actual division.

At the last step, implementing how to test for divisibility the candidate for the next prime, Dijkstra fills in the obvious modulo instruction. This would have resulted, if transposed to the C programming language, in something like the code in Appendix B. But then he says (page 37): “To give the algorithm an unexpected turn we shall assume the absence of a convenient remainder computation.” He continues with a brief outline of how he could continue the development of his algorithm to one that maintains a location for each prime and ensures that each of these locations contains the least multiple of the prime that is not less than the number under consideration. These multiples need only an addition to be updated; a division is never needed. Yet he never needs more multiples of primes in memory than the square root of the size of the table to be produced.

There it is: among many other things, the Master invented prime-number generation according to the assembly-line principle. The only thing that is missing is a mention of the need to repeatedly find the smallest of these multiples; that is, the need of a suitable data structure such as a heap.

The Assembly-Line algorithm in action

It may be that in the early seventies the expected speed of the Sieve was only attributed to the speed of addition. Since then all algorithms have been subjected to asymptotic analyses. As [3] describes, the Sieve is not just faster by a constant factor, but is also asymptotically faster. Encouraged by this scientific nugget, I implemented (Appendix C) the Assembly-Line algorithm, using STL, the standard library for C++, for the required heap data structure. I compared its results with the those of the program in Appendix B on the re-computation of Kulik’s table, all prime numbers up to 11 million.

To my surprise, Trial Division, though asymptotically slower, does it about five times as fast as the Assembly-Line algorithm. Of course, being asymptotically faster does not mean faster on small inputs. Apparently, for this problem, 11 million is still a small input, too small for the asymptotic difference in speed to prevail.

Winding up

In winding up I need to report that, to prevent this from getting even longer, I omitted another fascinating skirmish of Dijkstra’s with prime numbers. This is Chapter 17 (“An exercise attributed to R.W. Hamming”) in A Discipline of Programming [6]. Here he uses again what I call the assembly-line idea, but without acknowledging any connection with prime numbers and missing the opportunity to use addition.

Another time, perhaps.

References

[1] Waclaw Sierpinski (Andrzej Schinzel, ed.): Elementary theory of numbers. Elsevier, 1988, page 119.
[2] http://en.wikipedia.org/wiki/Lehmer_sieve (December 31, 2010)
[3] O’Neill, Melissa E., “The Genuine Sieve of Eratosthenes”, Journal of Functional Programming, Published online by Cambridge University Press 9 October 2008.
[4] Richard E. Crandall and Carl Pomerance: Prime numbers: a computational perspective. Springer-Verlag, 2005.
[5] O.J. Dahl, E.W. Dijkstra, and C.A.R. Hoare: Structured Programming. Academic Press, 1972.
[6] E.W. Dijkstra: A Discipline of Programming. Prentice-Hall, 1976.
[7] Marcus du Sautoy: The Music of the Primes. HarperCollins, 2003.

Appendix A

A C++ program to print the prime numbers in a block of one thousand.

```#include <iostream>

using namespace std;
typedef unsigned nat; // natural number

bool prime(nat n) {
nat n0 = n, factor = 2, product = 1;
while (factor*factor <= n) {
while (n%factor == 0) {
n /= factor; product *= factor;
}     factor++;
}
return ((n != 1) && (n == n0));
}
void primes(nat n) {   if (n >= 1000)
for (nat i = n-999; i <= n; i++)
if (prime(i)) cout << i << endl;
}```

Appendix B

```typedef unsigned nat; // natural number

void prTable(nat p[], nat n) {
// Place the first n prime numbers in p[0..n-1]
// using Trial Division.
nat k;
assert(n > 1);
p[0] = 2; p[1] = 3;
k = 2;
while (k < n) {
// p[0..k-1] are first k primes
nat cand = p[k-1]+2;
nat j = 0;
while (p[j]*p[j] <= cand) {
// p[0..j] do not divide cand
j++;
if (cand%p[j] == 0) {
j = 0; cand += 2;
}
}
p[k++] = cand;
}
// p[0..k-1] are first k primes
// k == n
}```

Appendix C

The Assembly-Line algorithm.

To determine whether N is a prime, we maintain as a heap S(N), a set containing a multiple of each of the primes up to the ceiling of the square root of N. We can increment or decrement a multiple; this means increment or decrement (if possible) by the prime corresponding to the multiple. We say that a multiple in S(N) is “too big” (with respect to N) if it can be decremented without making it less than N. We say that a multiple in S(N) is “too small” (with respect to N) if it is less than N. The theorem that is at the basis of the Assembly-Line algorithm says that N is prime if no multiple in S(N) is too big and if N is less than the least multiple; that is, less than the top of the heap.

An algorithm that starts with S(N) that does not contain a multiple that is too big can maintain this property by only incrementing multiples that are too small. This suggests as algorithm for determining the next prime at or after N.

```Ensuring none of its multiples is too big,
create S(N) as a heap; // smallest multiple at top

while (N >= top of heap) {
// There may be a multiple that is too small.
while (N > top of heap) {
// There is a multiple that is too small.
increment top of heap
} // No multiple is too big.
if (N == top of heap) {
// N is not a prime; try next N.
N = N+2;
// No multiple is too big.
}
}
// No multiple is too big and N < top of heap
// Therefore N is prime.```

C++ code:

```#include <iostream>
#include <vector>
#include <algorithm>
#include <utility>

using namespace std;
typedef unsigned nat; // natural number
typedef pair<nat,nat> pnn;

bool gt(pnn x, pnn y) {
return x.first > y.first;
}
void primes(nat table[], nat n, nat lim) {
// place first n primes in table[0..n-1]
// ensure p[n-1] <= lim
table[0] = 2;
table[1] = 3;
vector<pnn> heap; vector<pnn>::iterator i;
nat max = 0; // max prime for heap
while (max*max <= lim) max++;
heap.push_back(pnn(2*2,2));
heap.push_back(pnn(3*3,3));
push_heap(heap.begin(), heap.end(), gt);
nat N = 5, count = 2;
while (count < n) {
while (N >= heap.begin() -> first) {
while (N > heap.begin() -> first) {
// increment least multiple by corresponding prime
pop_heap(heap.begin(), heap.end(), gt);
vector<pnn>::iterator last = heap.end()-1;
last -> first +=  last -> second;
push_heap(heap.begin(), heap.end(), gt);
}
if (N == heap.begin() -> first)
// N not a prime; try next candidate
N+=2;
}
// N < heap.begin() -> first
// N is prime
table[count++] = N;
if (N <= max) {
heap.push_back(pnn(N*N,N));
push_heap(heap.begin(), heap.end(), gt);
// new pair absorbed in heap
}
// N is a prime; try next candidate anyway
N += 2;
}
}
```

Thanks to Paul McJones and to Tomas Bednar for pointing out an error in the first posted version.

2 Responses to “Another Scoop by Dijkstra?”

1. Kragen Javier Sitaker Says:

First, the more important thing O’Neill points out in her delightful paper is that trial division is slower than the Sieve not just because the individual operations are slower, but because it performs many fewer of them.

Consider, as an extreme case, the number 10975969, which is the square of the prime 3313, the 466th prime. The Sieve never touches bucket 10975969 until it reaches 3313, at which point it squares 3313, sets the bucket, adds 3313, discovers that the result is over 11 million, and stops. By contrast, trial division must attempt 466 divisions to discover that the number is composite. It must attempt the same 466 divisions, all failing, for each of the 1481 primes between 10975969 and 11 million, while the Sieve simply observes in each case that the number has not yet been discovered to be composite, and prints it, in O(1) time. For many of the composite numbers, as well, it will take substantial time to discover their composite nature; 10975973, for example, has only three prime factors (101, 109, 997), corresponding to three times that the Sieve will touch its bucket, but its smallest prime factor is 101, the 26th prime, so trial division must try each of the first 26 prime numbers before hitting upon a divisor. So trial division must do eight times as much work for 10975973, even if division is no more expensive than addition.

Most odd composite numbers, of course, are divisible by 3, 5, 7, or 11, so the extra work done is on average not very large.

The consequence is that the Sieve does 49551204 accesses to its table to compute the primes under 11 million, if you do the usual begin-at-i² and ignore-even-numbers optimizations. (I’ve placed my Sieve program at http://canonical.org/~kragen/sw/inexorable-misc/sieve.c.) That’s an average of just over 4.5 operations per candidate number.

By contrast, your trial division code (in the form http://canonical.org/~kragen/sw/inexorable-misc/trialdivision.cc) needs to do 628869700 multiplies and divides, about 57 per candidate number, more than 12× as much. (And, without optimizing, it takes about four times as long as the unheapified Sieve on my machine.)

However, consider the terrible penalty the heapified version must pay: every single one of those bucket operations must pay a heap penalty of O(lg N) heap maintenance operations. By the end, N in this case is 726517, almost a million, so O(lg N) is about 19. And then we start worrying about constant factors: maybe division is slower than addition; but swapping a couple of numbers in a heap is more expensive than just reading one from a table, perhaps; and the heap actually consists of pairs of numbers, so the working set exceeds your L1 and possibly your L2 cache by twice as much as in the trial-division case; C++ compilers are notoriously poor at optimizing STL instantiations (I believe this is documented in Elements of Programming, but if not, it is documented in the earlier draft, Notes on Programming); and heap accesses are more random, so perhaps your cache hit rate will go down further; and so on.

If you could somehow reduce the number of heap operations per bucket knocked out, you could get the best of both worlds: a prime-spewing algorithm that is efficient in both time and space. There’s a straightforward way to do this for the small primes, which is where the problem exists! Instead of testing a single candidate number against the top items in the heap, use a fixed-size table of candidates; 4096 items, say, which could be 512 bytes or 4096 bytes, depending on which turns out to be faster. Initially these are the numbers 0 to 4095, later 4096 to 8191, and so on. Then, when you pull a prime out of the heap, you knock out all the candidates in this table, instead of just one. For primes over 2048, there’s no difference, but for 3, 5, and 7 — which, as I said before, are factors of the majority of composite numbers, and therefore account the majority of the heap operations in your code — the difference is about three orders of magnitude.

I’m not at all sure that the heapified version (with a binary heap) is asymptotically faster than trial division. Obviously if you substitute a Fibonacci heap, it’s asymptotically faster.

2. Kragen Javier Sitaker Says:

By the way, when I said “Elements of Programming” in my comment above, I was referring to Stepanov and McJones’s book, not your own book by the same name. I apologize for the ambiguity, and I wish Stepanov and McJones had chosen a non-colliding name.