Unprotected Numerical Computation and Its Safe Alternatives

Conventional numerical computation is marvelously cheap, and it seems to work most of the time. The computing profession has long neglected to develop methods appropriate for those situations where cheapness is not an overriding concern and where “seems to work most of the time” is not good enough.

Many textbooks in numerical analysis start by showing that in certain situations the small errors caused by the use of floating-point arithmetic can rapidly grow and render the result of the computation meaningless. Such textbooks describe conditions under which mathematical algorithms can safely be transplanted to floating-point hardware: that the problem be well-conditioned and that the algorithm be stable. In the early days computing centres employed numerical analysts to make sure that none of the scarce processor cycles got wasted on meaningless results caused by ill-conditioned problems or unstable algorithms.

Much has changed since then. Thousands of scientists and engineers have on their desks gigaflops of computing power, and there is not a numerical analyst in sight. Although an even larger part of problem-solving is entrusted to the computer, the same fragile methodology is followed. And still the only justification is that conventional numerical computation seems to work most of the time.

Robust alternatives have existed for decades. These are not magical cures for ill-conditioned problems and unstable algorithms. They are robust in the sense that it is impossible, or near-impossible (depending on the alternative), that an incorrect result is returned without a clear warning. A responsible response in such a case is to critically examine the model (is it well-conditioned?) and the algorithm (is it stable?).

Such robustness comes at the price of increased demands on the processor(s) and memory. So far practitioners have invariably preferred to use the abundant increase in computing capacity for running the same problem with a finer mesh rather than for the security attainable by robust methods.

In science this preference is not likely to change: by nature, scientists feel at home with Rube Goldberg devices held together by sealing wax and string. Engineers operate in a different world. In a court of law, a professional engineer can only appeal to his “best effort” in using software that is not worse than what everybody else uses. In turn, these widely accepted packages can only appeal to being best efforts in selecting algorithms that are numerically stable. It is up to the user to select an algorithm appropriate to the problem and to ensure that the problem is well-conditioned. Even for a numerical analyst this is sometimes hard to tell; harder than solving the problem itself. The time may eventually come that these facts penetrate to the engineer’s mind.

The above should not be a revelation. A promotional document of Intel says it well:

The standard doesn’t provide a guarantee that the answers are the “right” answer. That may require careful error analysis. However the standard provides many mechanisms that make it easier for a program to get the “right” answer on conforming computers from different vendors. — John Crawford [1]

Like many manufacturers who warn that you “may need to consult your physician” for safe use of the product, Intel says that your problem “may require careful error analysis”. Yes, and who can tell you whether your problem actually does fall into this dread category? That would have to be one of those rare and expensive numerical experts.

Before we continue, let me try to state briefly what I mean by “conventional numerical computation”. Let’s say we are modeling an equilibrium state. It could be the state of a petroleum refinery or of an aircraft in flight. Although these are dynamical systems, suppose we are interested in the static state of equilibrium. The state variables have to satisfy a system of equations (let’s say, nonlinear algebraic equations), and we need to solve the system. Unless the system is trivial, an algorithm for solving it will be iterative. That is, the algorithm itself is a dynamical system. For the algorithm to be considered at all, it has to be a dynamical system that evolves by its own dynamic law to an equilibrium state, and this needs to coïncide with the solution of the system of equations.

This is what numerical analysts do for a living: invent such dynamical systems in the form of algorithms. In our example, this could be the n-dimensional version of Newton’s method. A proper textbook will not only describe the algorithm, but also conditions under which it will converge. What’s missing from the conditions is the assumption that holds for most of the entire textbook: all variables range over the real numbers and all operations are the ones defined for the real numbers. That is, the algorithm itself is an abstract dynamical system; its properties hold in the abstract setting. The properties have been proved with some manipulation of formulas. In this manipulation it is taken for granted that certain laws, like associativity of addition and multiplication, hold. And they do: the proof is given for the abstract dynamical system.

So far so good. Now comes the sleight of hand: every real variable is replaced by a floating-point variable. Moreover, every operation on the reals is replaced by … by what? That’s hard to say more exactly than: the operation on floating-point numbers that has the same name as one of the operations on the reals. This is what you are up to: prove that the resulting concrete dynamical system, the one that takes the computer through its paces, converges, and converges to a state near to the state of its abstract counterpart.

In the most favourable case, the abstract dynamical system has been proved correct. Most people don’t even realize that what goes on in the computer is not what has been proved correct. That it is not just a matter of the abstract dynamical system’s variables being jiggled a bit by rounding, but where even the operations themselves are perturbed. This is what I call unprotected numerical computation. The fact that it works most of the time is a bloody marvel.

Before this is getting too long, I should do two things. One is to show some concrete examples of computations going off the rails. The other is to give you some idea of safe alternatives to unprotected numerical computation.

All users appreciate that answers can be off because rounding errors, though small, can accumulate. Some reason like “My computation takes about half an hour, say a thousand seconds. At a gigaflop per second that adds up to 10^12 operations. Double-length arithmetic gives 10^-15 relative error. Hmm … could I get a total relative error of 10^-3? Not good … but wait! These errors cancel each other most of the time, so I’ll end up not much worse than 10^-15 relative error”. As chapter one of the numerical analysis textbook explains, this error model is not satisfactory. Appendix A shows an example (attributed to S. Rump of Hamburg) where a few dozen double-length operations are enough to wreak total havoc.

What gives away the evil expression in Appendix A is the fact that the results for single length and double length are different. However, as Parker [2] shows, this is no guarantee against error. See Appendix B.

Are there alternatives to the Conventional Method? I’m aware of two: interval arithmetic [4] and Monte Carlo arithmetic [2, 3]. Let’s start with the latter. Remember that what motivated the competing computer manufacturers to collaborate on a standard for floating-point arithmetic was that their products gave different answers for the same problem while none of the manufacturers was confident that their own were any better. As long as everyone’s results were the same, all would be well, at least on the sales front.

The big boys in numerics in pre-standard time were CDC, IBM, and Cray. All three had bad floating-point arithmetic compared with the standard to come, and bad in different ways. At least theoretically it was possible for an inmate of an IBM installation to go over to the neighbours with a CDC and ask for the favour of running the program. If the results didn’t differ too much, then you could be reasonably confident that the results were not far from correct. But now your IBM is IEEE standard and whatever the neighbours have, theirs is IEEE standard too. To the industry’s great relief, it speaks with One Voice.

Admittedly, running over to the neighbours for a check is not practical. And, as Intel’s Palmer said in the above quote, the standard makes it easier to develop safe alternatives to unprotected computation. Monte Carlo arithmetic [2,3] is a case in point. It uses the idea of running over to the neighbours for a check, but thanks to the standard, you can get a mild version of a different arithmetic right in your own processor, even without interrupting your computation. Monte Carlo arithmetic makes use of the standard’s provision to change rounding mode under program control while the program is running. Repeated runs can then be regarded as a sample and analysed statistically. I count this as a safe alternative because an unduly high standard deviation casts doubt on the validity of the result.

The other safe alternative is interval arithmetic [4]. Here one operates on intervals of numbers rather than on individual numbers. Let [x,y] assume x not greater than y and let it stand for the interval of all numbers from x to y. Then, for example, interval subtraction turns out to be [a,b]-[c,d] = [a-d,b-c]. The bounds a-d and b-c in the result are computed in floating-point arithmetic and are in general subject to rounding error. The standard makes it possible to compute a-d in rounding mode towards minus infinity and b-c in rounding mode towards plus infinity. Rounding introduces uncertainty about a result. By picking the rounding modes in this way, we are assured that, in spite of the unavoidably introduced uncertainty, the result interval contains all possible values. In this sense interval arithmetic is a safe alternative.

Of course, in a problematic problem, neither Monte Carlo nor interval arithmetic can give you the solution. But either will tell you that you have a problem. In Monte Carlo arithmetic in the form of an excessive standard deviation; in interval arithmetic in the form of an excessively wide interval. For example, interval evaluation of function f1 in Appendix A results in an interval with a width in the order of 10^21; a clear warning, if nothing else.

The excessive width is due to catastrophic cancellation of figures. Suppose we do not have significant cancellation, but the typical large number of operations that we have our gigaflops for. Then the accumulation of the small rounding errors does become a problem, because in interval arithmetic they are all in the same direction for each of the bounds. As a consequence, the result interval, though correct, is much too wide. It is so wide because it reflects the (unlikely) worst case of the unprotected version of the same computation. In this respect Monte Carlo arithmetic is better: it gives a single result with an error under statistical control.

I would not recommend merely intervalized versions of conventional algorithms. Interval arithmetic is of great interest, but only because of the algorithms it makes possible that are not even dreamt of in conventional computation. For example, algorithms that give intervals for all zeros of a system of nonlinear equations; algorithms for the hardest case of mathematical programming: constrained global optimization for objective functions with large numbers of local extrema. When used with the variant of interval arithmetic that is known as interval constraints [5] one can even mix in integer variables in the type of optimization just mentioned.

It is time to dump the fiction that we can perform the error analysis that would be necessary to justify unprotected numerical computation. The safe alternatives are safe in that they warn of trouble and that they can be relied on in the absence of warning. To get this certainty we have to live with the possibility that the safe method gives a warning while unprotected execution gives a correct result.


[1] http://www.intel.com/standards/floatingpoint.pdf
[2] “Monte Carlo Arithmetic: exploiting randomness in floating-point arithmetic” by D. Stott Parker, Research Report CSD-970002, UCLA Computer Science Department, 1997.
[3] “Monte Carlo arithmetic: how to gamble with floating-point and win” by D. Stott Parker, Brad Pierce, and Paul R. Eggert; Computing in Science and Engineering, Vol. 2, No. 4, pp. 58-68, July/August 2000.
[4] Introduction to Interval Analysis by R.E. Moore, R.B. Kearfott, and M.J. Cloud; SIAM, 2009.
[5] Numerica: A Modeling Language for Global Optimization by P. Van Hentenryck, Y. Deville, L. Michel; MIT Press, 1997.


Thanks to Dr E.J. van Kampen of the Technical University in Delft for checking out some results in interval arithmetic.

Appendix A

In the C++ program below the function f1 requires a few dozen floating-point operations. The expression in f1 happens to simplify to the one in function f0. If the two functions give different results, as they do here, then the latter is more likely correct.

#include <stdio.h>

template<class flpt>
flpt f1(flpt x, flpt y) {
  flpt x2 = x*x;
  flpt y2 = y*y, y4 = y2*y2, y6 = y2*y4, y8 = y4*y4;
  return (333.75-x2)*y6 + x2*(11*x2*y2-121*y4-2) +
         5.5*y8 + x/(2*y);
template<class flpt>
flpt f0(flpt x, flpt y) { return (-2+x/(2*y)); }
int main() {
  printf("%e  %e  %e\n",
/* Output:
-2.341193e+29  1.172604e+00  -8.273961e-01

Appendix B

#include <iostream>
#include <iomanip>
using namespace std;

template<class flpt>
flpt f(flpt x0, flpt x1) {
  return (111.0 - 1130.0/x1 + 3000/(x0*x1));

int main() {
  cout << setiosflags(ios::fixed) << setprecision(8);
  float f0, f1, tempf; double d0, d1, tempd;
  const int N = 30;
  f0 = d0 = 2;
  f1 = d1 = -4.0;
  for (int i = 1; i <= N; i++) {
    tempf = f1; tempd = d1;
    f1 = f<float>(f0,f1); d1 = f<double>(d0,d1);
    f0 = tempf; d0 = tempd;
  cout << f1 << "  " << d1 << endl;
/* Output:
100.00000000  100.00000000

This C++ program computes the same short iteration in both single and double length, with good agreement. However, the true value is

990176025870222717970867 / 164874117215934539909207 =


See Parker [2].

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: