A Standard of Excellence

When an industry-wide standard gets established, it is often lamented as a regression to the mediocre, if not outright bad. VHS vs Sony Beta comes to mind. This time, however, I come to praise a standard, not to bury one. I am going to talk about a success from that era: IEEE standard 754 for binary floating-point arithmetic.


Though the standard was only ratified in 1985, most of the work had been done several years before. Up till the late 1970s the computing community was resigned to the situation in which the manufacturers that dominated the market for scientific computers fielded processors with execrably bad implementations of the main tool: floating-point arithmetic. Users had to live with “features” such as:

  • Numbers that tested as nonzero in a comparison, but behaved as zero in division.
  • Numbers that underwent non-negligible change when multiplied by 1.0.
  • X-Y could evaluate to zero for different X and Y.

The whole idea of floating-point arithmetic was that you could just replace the real variables in a mathematically verified algorithm by floating-point numbers. In reality programmers had to insert code like X := (X+X)-X; at critical spots [1].

At the time there was no shortage of experts who could tell what was wrong and how to improve it. One of the best was William Kahan at Berkeley, who was in fact consulting for IBM, with little effect on the egregious shortcomings of their processors, as quality of floating-point processing was not regarded as a competitive advantage. The major players at the time, IBM and CDC, both had bad floating-point arithmetic.

Here we have a glimpse of the situation:

The [CDC] 6600 was the first commercial supercomputer, outperforming everything then available by a wide margin. While expensive, for those that needed the absolutely fastest computer available there was nothing else on the market that could compete. When other companies (namely IBM) attempted to create machines with similar performance, he [Seymour Cray] increased the challenge by releasing the 5-fold faster CDC 7600. [2]

As you can see, speed was all that matters. Speed was something that the VPs in charge of multimillion dollar purchases could understand. They neither knew nor cared what was done at that speed.

To find state-of-the-art floating-point arithmetic one had to look elsewhere, in the unlikely direction of a cheap mini-computer, in this case the PDP-11 from DEC. To the relief of those who cared about quality in computer arithmetic, the subsequent VAX (bigger and faster) had arithmetic that was no worse. But the big players like CDC and IBM, and, by this time, also Cray Computer, felt no need to do anything about their deficient floating-point arithmetic.

While DEC could safely be ignored around 1980, Intel was not even noticed as a potential source of processors. One of the few companies that took Intel seriously was Intel. They planned a floating-point co-processor and retained Kahan as consultant. He advised them to adopt the specifications of the arithmetic of the DEC VAX. This was state-of-the-art, and much better than that of the big players. It was also insanely ambitious for a single chip with a mere 40,000 transistors, which is what Intel had available.

However, Intel declined to follow Kahan’s advice. They did not want the state of the art: they want the best possible. Kahan, working with J. Coonen and H. Stone, produced a specification of a floating-point arithmetic that was the experts’ dream. By the early eighties, when the dinosaurs still roamed the earth [4], Intel had implemented this in 40,000 transistors.

By 1985, when the IEEE standard 754 was finally ratified, there were half a dozen implementations on the market, ominously for the then supercomputers, all these were single-chip CMOS (co-)processors.

In this success story Kahan [1] and other participants emphasize the altruism displayed by their colleagues. The IBM representative was supportive, even though the products of his company at the time were nowhere near compliant. But I can’t help taking a less rosy view of the situation: all participants had a vested interest in avoiding the situation that the same numerical program would give different results on their different computers. I can’t help thinking that they were more worried about results seen to be wrong than about them actually being wrong. As long as the machines wouldn’t contradict each other, all would be fine.

A program giving significantly different results on different machines is likely to be unsound numerically; it may well be that the different results are all wrong, including the result obtained by an IEEE-754 compliant machine. Such a program needs to be analyzed by an expert and rewritten. This would be expensive. The “old anarchy”, much deplored by those who celebrate IEEE 754, could actually be a valuable warning that the algorithm underlying the program is not stable or that the problem is ill-conditioned. Not welcome news. The Brave New World of IEEE 754 prevents such warnings from ever arising.

The Standard Methodology is:

Design an algorithm in terms of real numbers that is stable. Replace the reals by floating-point numbers; replace operations on reals by the ones on the floating-point numbers that have the same name. Apply to a model that is numerically well-conditioned.

The problems with this are (a) that the stability depends on properties of the reals (like associativity of operators) that their floating-point counterparts do not have and (b) that it is often not easy, even for an expert, to determine whether a model is well-conditioned. For example, for some computations the condition depends on a certain eigenvalue of a matrix. And it may be that determining that eigenvalue is harder than the problem you want to solve in the first place.

Thus, a valuable supplement to the Standard Methodology may be to run your program program on a CDC 6600 as well as on an IBM 360, both bad, and in different ways. If the results do not differ significantly, then this can be used to boost confidence. Otherwise, it should be “back to the drawing board”. Intel, in a document celebrating their heroic role in IEEE 754, includes a very apt warning:

“The standard doesn’t provide a guarantee that the answers are the “right” answer. That may require careful error analysis. However the standard provides many mechanisms that make it easier for a program to get the “right” answer on conforming computers from different vendors.” — John Crawford [3]

Yes, and you need an expert for the required careful analysis. Every engineer has gigaflops on his desktop, but there’s not an expert in sight.

Enough of these dark thoughts. What is so wonderful about IEEE 754? Admirers of the standard may give different answers; here I’ll give mine.

  • The reasonable accuracy achievable by a guard digit, which had been known for decades, but ignored by most of the industry, is guaranteed. The guarantee can be expressed as: if the floating-point result of an operation coincides with the real-valued result, then that is the result; otherwise it is one of the bounds of the smallest floating-point interval containing the real-valued result. And you can pick which of the bounds you want; see next.
  • Rounding mode is not fixed. It can be upward, downward, towards zero, or to nearest. This can be changed at runtime.
  • One of the quirks of number systems is that you have to have an asymmetric collection of numbers if you, correctly, want a single zero. Then you necessarily have a zero in the middle and symmetrically on either side the positive and negative numbers. If you want symmetry, then you must have either no zeroes, or two, a positive and negative one. This latter option seems to be inherent in the notion of a floating-point number system, so it’s not surprising that IEEE 754 has a positive and a negative zero.But the standard provides for the natural counterparts of the zeros, which are the infinities, positive and negative, so that division by a zero of a nonzero finite number gives you infinity of the right sign and so that division of such a number by infinity gives the zero of the right sign. In this way computation can continue after division by zero if that makes sense in your model, or can be trapped if not.

In a future post I hope to be able to say something about the Standard Methodology of numerical computation and its alternatives. In the Appendix I present in 27 lines the salient details of IEEE 754. I feel that these are also the most important details of IEEE 754-2008, the current version.


[1] http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html
[2] http://en.wikipedia.org/wiki/Seymour_Cray
[3] http://www.intel.com/standards/floatingpoint.pdf
[4] The last of Seymour Cray’s companies, Cray Computer Corporation, filed for bankruptcy in 1995 (from Wikipedia article “Seymour Cray”). Behind the current “Cray” computers is a succession of other companies that passed on the rights to the “Cray” marque.


By now the 1985 IEEE 754 has been replaced IEEE 754-2008. It’s a fat wad of paper not particularly encouraging for those whose curiosity has been aroused by the story so far. Apart from some niggling details, the original is a subset of the current standard. It so happens that I advocate a model of computing, about which more later, that allows one even to ignore most of the much smaller old standard. All that I want to know about floating-point arithmetic is the part of the standard described in the following few lines of ASCII-art, lifted straight out of my code:

/* Precis of the IEEE standard 754 format
Layout of fields:
Single-length  31-31: sign  30-23: exponent  22-0: significand
Double-length  63-63: sign  62-52: exponent  51-0: significand
Least significant bit: 0
Most significant bit: 31 (single-length), 63 (double-length)

Single-length                Double-length               Value
------------------------   ---------------------------
 exponent    significand   exponent        significand
U     U-Bias              U      U-Bias
0     -127        0       0      -1023         0         0
0     -127      nonzero   0      -1023        nonzero   see Denormalized
1-254 -126-+127 anything  1-2046 -1022-+1023  anything  see Normal
255   +128       0        2047   +1024         0        infinity
255   +128      nonzero   2047   +1024        nonzero   NaN

Sign bit:  1 ~ negative    0 ~ positive
Significand: the bits after the binary point
Bias:    single-length, 127       double-length, 1023
Emin:    single-length, -126      double-length, -1022
Columns U: value of exponent field as unsigned integer
Columns U-Bias: as above, with Bias subtracted
Normal: (1+0.significand)*2^(exponent-bias)
Denormalized: (0.significand)*2^Emin

The single-length format occupies 32 bits, numbered from 31 for the most significant to 0 for the least. Double length has 64 bits, numbered similarly. Emin is the least value of the exponent. Bias is the number to be subtracted from the exponent field (interpreted as unsigned number) to get the exponent.

The exciting thing about this little table is that it lets you have a peek under the hood of the computer on your desk. Let’s have a look at what one tenth looks like in IEEE 754. In the following C program one tenth in single-length floating-point is aliased to an unsigned integer and then printed out in hexadecimal. This way we get to see the bits, which we then interpret with above table.


int main() {
union fltInt {float F; unsigned U;} x;
x.F = 0.1;
printf("%f %x\n", x.F, x.U);
/* Output:
0.100000 3dcccccd

Hm … can we make sense of 3dcccccd? First lay it out in bits:

 3    d    c    c    c    c    c    d
0011 1101 1100 1100 1100 1100 1100 1101

Then re-arrange according to single-length floating-point format:

sign  exponent    significand
0     01111011    10011001100110011001101

To make sense of the significand (the fractional part after the binary point) we have to realize that what we see here is not the one that belongs to one tenth, because you cannot represent it as a binary numeral. As one tenth is a rational number, it is an infinitely repeating binary numeral. As ten only has small prime factors, there have to be several repetition cycles here. Hence 3dcccccd has to be the rounded version of

sign  exponent    significand
0     01111011    10011001100110011001100110011001...ad inf

The sign bit is 0, which means positive (see table). The exponent is 123 as unsigned integer; -4 after subtracting the bias of 127 (see table). According to formula Normal in the table, the value is

    + 2^{-4} * 1.10011001100110011001100110011001...ad inf
=   + 2^{-4} * (1 + (1 + 9/16 + 9/(16^2) + ...)/16)
=   + 2^{-4} * (1 + 3/5)
=   + 1/10.

The infinite binary expansion of 1/10 is between 3dcccccc and 3dcccccd. It is closer to the latter. The default rounding mode is “round to nearest”. As the program did not change the rounding mode, we should find 3dcccccd, which we did.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: