Computer Math is Not Like Real Math

Numerical Programming

This code snippet repeatedly divides a value by 2, after first adding and subtracting 1, and tests whether the value remains larger than 0. For most developers it's no surprise that the value eventually becomes so small that the computer rounds it to zero - this is underflow.

In 64-bit double precision, the smallest positive non-zero number is the subnormal 2^-1074.

However, this loop stops after just 54 iterations.

The last non-zero value of epsilon is 2^-53.

It turns out that (1 + 2^-53) - 1 = 0 in double precision. What may be surprising to some is that the loop dies while epsilon is still many orders of magnitude larger than 2^-1074 - the culprit is rounding in the addition, not underflow.

// constants
var Zero = 0.0, Half = 0.5, One = 1.0;

var count = 0;
var epsilon = One;
while (epsilon > Zero) {
  count++;

  // in exact arithmetic, (epsilon + 1) - 1 => epsilon;
  // in floating point, the result is 0 once epsilon reaches 2^-53
  epsilon += One;
  epsilon -= One;

  epsilon *= Half;  // divide by 2
}
console.log('count ' + count);  // prints: count 54
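
For comparison, removing the add and subtract of 1 gives the behavior most developers expect: pure halving survives all the way down to the subnormal limit. A minimal sketch:

// Pure halving, with no add/subtract of 1: epsilon survives until it
// falls below the smallest positive subnormal, 2^-1074.
var count2 = 0;
var epsilon2 = 1.0;
while (epsilon2 > 0) {
  count2++;
  epsilon2 *= 0.5;
}
console.log('count ' + count2);  // prints: count 1075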

Representation

Floating-point representation can be written as:

sign x mantissa x 2^exponent

The three components are:

  1. The sign bit applies to the overall value, not the exponent.
  2. The exponent is stored using 11 bits, with a bias of 1023. The lower half of the range represents negative exponents, and the upper half represents positive exponents.
  3. The mantissa (or significand) is stored using 52 bits and holds the fractional part of the number. Because normalized values always have a leading bit of 1, that bit is left implicit, so effectively there are 53 binary bits of precision. One usually writes the bits from left to right as:

    b0 x 2^0 + b1 x 2^-1 + ... + b52 x 2^-52

    For example, written with an exponent of 0, the mantissa for 0.5 = 2^-1 looks like

    01000000000000000000000000000000000000000000000000000

Some caveats about the representation given here: the process is more complicated at the hardware level. Guard digits can come into play during arithmetic, and some bit patterns are reserved for special values like NaN, ±Infinity and -0. If you'd like to dig deeper, look up the IEEE 754 standard or start with Wikipedia.
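
One can also inspect these fields directly. Below is a minimal sketch using a DataView to dump the raw bits of a double; the helper name toBits is my own:

// Hypothetical helper: dump the 64 bits of a double as
// sign | exponent | mantissa fields.
function toBits(x) {
  var view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);  // big-endian: the sign bit comes first
  var bits = '';
  for (var i = 0; i < 8; i++) {
    bits += view.getUint8(i).toString(2).padStart(8, '0');
  }
  return bits.slice(0, 1) + ' | ' + bits.slice(1, 12) + ' | ' + bits.slice(12);
}

console.log(toBits(1.0));  // 0 | 01111111111 | 000...0  (biased exponent 1023 -> 0)
console.log(toBits(0.5));  // 0 | 01111111110 | 000...0  (biased exponent 1022 -> -1)

Note that the stored mantissa of 0.5 is all zeros because the leading 1 is implicit.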

So why does the loop stop so soon?

That is, why does 1 + 2^-53 = 1 on a computer? To add 1 = 1 x 2^0 and epsilon = 1 x 2^-53, there are two steps:

  1. convert the values so the exponents match, then
  2. add the mantissas.

In this example, the mantissa of 1 x 2^-53 is shifted right until its exponent reaches 0; in other words, the value is rewritten with a mantissa of 2^-53 and an exponent of 0, as shown in the table below. When the mantissas are then added, the lowest significant bit is shifted out and dropped.

value (as mantissa x 2^exponent)   mantissa (as base 2)                                      exponent (as base 10)
1 = 1 x 2^0                        10000000000000000000000000000000000000000000000000000       0
2^-53 = 1 x 2^-53                  10000000000000000000000000000000000000000000000000000     -53
1 + 2^-53 = 1                      10000000000000000000000000000000000000000000000000000 [1]   0

In the last row, the shifted bit of 2^-53 (shown as [1]) lands one place beyond the 53-bit mantissa, so it is dropped and the sum is exactly 1.
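
The rounding can be confirmed directly in a console; Number.EPSILON is the standard name for 2^-52:

// 1 + 2^-53 falls exactly halfway between 1 and the next double,
// and round-to-nearest-even sends it back to 1.
var eps53 = Math.pow(2, -53);
var eps52 = Math.pow(2, -52);
console.log(1 + eps53 === 1);           // true: the low bit is dropped
console.log(1 + eps52 === 1);           // false: 1 + 2^-52 is representable
console.log(eps52 === Number.EPSILON);  // true: 2^-52 is the machine epsilon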

If the loop is run in single precision (32-bit floats, which carry a 24-bit significand), the same rounding stops it after 25 iterations.
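
JavaScript numbers are always 64-bit doubles, but Math.fround rounds a value to the nearest 32-bit float, so the single-precision behavior can be simulated. A minimal sketch:

// Same loop, with each intermediate result rounded to a 32-bit float.
var countF = 0;
var epsilonF = 1.0;
while (epsilonF > 0) {
  countF++;
  epsilonF = Math.fround(epsilonF + 1);
  epsilonF = Math.fround(epsilonF - 1);
  epsilonF = Math.fround(epsilonF * 0.5);
}
console.log('count ' + countF);  // prints: count 25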

There is nothing language-specific about this behavior; one can construct similar examples in almost any language. You can even try the process on a calculator or phone to find the size of its mantissa. I've found that some calculators carry more precision than the IEEE 754 double format.

An additional point on performance

Floating-point division is much more expensive than floating-point multiplication, which is why the division by 2 was written as epsilon *= 0.5 instead of epsilon /= 2.0. It's a good habit to use multiplication instead of division when possible. In this trivial example performance is not a concern, but in complicated algorithms one often looks for small optimizations such as this; one cannot always assume the compiler will do it for you.
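
One caution on this trick: multiplying by 0.5 is exact because 0.5 is a power of two, but most reciprocals are themselves rounded, so a substitution like x * 0.1 for x / 10 is not always bit-identical. For example:

console.log(7 * 0.5 === 7 / 2);   // true: both operations are exact
console.log(3 * 0.1 === 3 / 10);  // false: 3 * 0.1 gives 0.30000000000000004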

And so ...

This example demonstrates a few key points about all numerical codes:

  1. In general, mathematical identities may no longer hold when implemented in code, even if the code is logically correct. The code may work as expected almost all of the time but fail for certain values!
  2. Arithmetic is not always associative; i.e. the order of arithmetic operations can change the result. For example, sorting an array of numbers and adding them from smallest to largest may produce a more accurate result than adding them in a random order (see the sketch after this list). One would have to weigh the cost of sorting against the accuracy requirements.
  3. When possible, avoid combining numbers of widely different magnitudes. A common practice for applications is to scale all values to a comparable range for internal manipulation.
  4. Developing tests can be tricky for numerical codes.
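
As a small illustration of points 2 and 3, here is a sketch summing the same three values in two orders; the numbers are chosen so that the small term is absorbed by the large ones:

var a = 1e16, b = -1e16, c = 1.0;

// grouping the large terms first lets the small one survive
console.log((a + b) + c);  // 1

// adding c to the huge value b first absorbs it completely
console.log(a + (b + c));  // 0: b + c rounds back to -1e16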

Hopefully this example shows you why numerical coding is interesting and fun!