Floating Point: How Computers Represent Real Numbers

2023-02-08 · Leonardo Benicio

A deep exploration of IEEE 754 floating point representation, the mathematics behind binary fractions, precision limits, and the subtle bugs that can arise when working with real numbers in code.

Every programmer eventually encounters the classic puzzle: why does 0.1 + 0.2 not equal 0.3? The answer lies in how computers represent real numbers using floating point arithmetic. Understanding IEEE 754 floating point is essential for anyone writing numerical code, financial software, scientific simulations, or graphics applications.

1. The Challenge of Representing Real Numbers

Computers work with finite binary representations, but real numbers can require unlimited precision.

1.1 The Fundamental Problem

Integers are straightforward:
42 in binary = 101010
-7 in binary (two's complement, 8-bit) = 11111001

But many real numbers need infinitely many digits:
π = 3.14159265358979323846...
1/3 = 0.33333333... (repeating)

Even simple decimals can be infinite in binary:
0.1 (decimal) = 0.0001100110011... (repeating in binary)
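
You can see this directly in Python (whose float is a 64-bit double) by printing the exact value that actually gets stored for 0.1:

from decimal import Decimal

print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625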

1.2 Why Not Fixed Point?

Fixed point representations dedicate a fixed number of bits to the integer and fractional parts:

Fixed point (16.16 format):
┌────────────────┬────────────────┐
│  Integer part  │ Fractional part│
│   (16 bits)    │   (16 bits)    │
└────────────────┴────────────────┘

Example: 123.456
Integer: 123 = 0000000001111011
Fraction: 0.456 × 65536 ≈ 29884 = 0111010010111100

Problems with fixed point:
- Limited range (integer part maxes out near 32767 in signed 16.16)
- Precision is fixed regardless of magnitude
- 0.000001 and 1000000.0 both get only 16 fractional bits

1.3 Scientific Notation to the Rescue

Floating point mimics scientific notation:

Scientific notation:
6.022 × 10²³ (Avogadro's number)
1.602 × 10⁻¹⁹ (electron charge)

Components:
- Significand (mantissa): 6.022, 1.602
- Base: 10
- Exponent: 23, -19

Binary floating point:
1.01101 × 2⁵ = 101101 (binary) = 45 (decimal)
1.01101 × 2⁻³ = 0.00101101 (binary) = 0.17578125 (decimal)

2. IEEE 754 Format

The IEEE 754 standard defines floating point representation.

2.1 Single Precision (32-bit)

┌───┬──────────────────────┬───────────────────────────────────────┐
│ S │     Exponent         │              Mantissa                 │
│1b │      8 bits          │              23 bits                  │
└───┴──────────────────────┴───────────────────────────────────────┘
 31   30              23   22                                     0

S: Sign bit (0 = positive, 1 = negative)
Exponent: Biased by 127 (stored value = actual exponent + 127)
Mantissa: Fractional part (implicit leading 1)

Value = (-1)^S × 1.Mantissa × 2^(Exponent - 127)

2.2 Double Precision (64-bit)

┌───┬───────────────────────────┬──────────────────────────────────────────────────────┐
│ S │        Exponent           │                      Mantissa                        │
│1b │        11 bits            │                      52 bits                         │
└───┴───────────────────────────┴──────────────────────────────────────────────────────┘
 63   62                    52   51                                                    0

Value = (-1)^S × 1.Mantissa × 2^(Exponent - 1023)

2.3 Encoding Example

Let’s encode -6.75 in single precision:

Step 1: Convert to binary
6.75 = 6 + 0.75
6 = 110 (binary)
0.75 = 0.5 + 0.25 = 0.11 (binary)
6.75 = 110.11 (binary)

Step 2: Normalize
110.11 = 1.1011 × 2²

Step 3: Extract components
Sign: 1 (negative)
Exponent: 2 + 127 = 129 = 10000001
Mantissa: 1011 (drop leading 1, pad with zeros)
         10110000000000000000000

Step 4: Combine
1 10000001 10110000000000000000000

Hex: 0xC0D80000
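
A quick cross-check in Python: packing the value as a big-endian single precision float produces the same bytes.

import struct

print(struct.pack('>f', -6.75).hex())  # 'c0d80000'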

2.4 Decoding Example

Decode 0x40490FDB (single precision):

Binary: 0 10000000 10010010000111111011011

Sign: 0 (positive)
Exponent: 10000000 = 128, actual = 128 - 127 = 1
Mantissa: 1.10010010000111111011011

Value: 1.10010010000111111011011 × 2¹
     = 11.0010010000111111011011
     = 3 + 0.140625 + ...
     ≈ 3.14159274...

This is π (to single precision accuracy)!
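
The reverse direction works the same way; unpacking the bytes recovers the single precision approximation of π:

import struct

print(struct.unpack('>f', bytes.fromhex('40490fdb'))[0])  # 3.1415927410125732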

3. Special Values

IEEE 754 reserves certain bit patterns for special cases.

3.1 Zero

Positive zero: 0 00000000 00000000000000000000000 = +0.0
Negative zero: 1 00000000 00000000000000000000000 = -0.0

+0.0 == -0.0 evaluates to true
But 1.0/+0.0 = +∞ and 1.0/-0.0 = -∞
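
The same behavior is visible from Python, except that a literal 1.0/-0.0 raises ZeroDivisionError instead of returning -∞; the sign bit is still observable through copysign:

import math

print(0.0 == -0.0)               # True
print(math.copysign(1.0, -0.0))  # -1.0: the sign bit survives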

3.2 Infinity

Positive infinity: 0 11111111 00000000000000000000000 = +∞
Negative infinity: 1 11111111 00000000000000000000000 = -∞

Created by:
- Division by zero: 1.0/0.0 = +∞
- Overflow: 1e38 * 1e38 = +∞ (in single precision)

Properties:
- ∞ + 1 = ∞
- ∞ - ∞ = NaN
- ∞ × 0 = NaN
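
These properties can be checked with Python's math.inf (Python raises on literal division by zero, so construct infinity explicitly):

import math

print(math.inf + 1)         # inf
print(math.inf - math.inf)  # nan
print(math.inf * 0)         # nan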

3.3 NaN (Not a Number)

NaN: Exponent all 1s, non-zero mantissa

Examples:
0 11111111 10000000000000000000000 = quiet NaN
0 11111111 00000000000000000000001 = signaling NaN

Created by:
- 0.0/0.0
- ∞ - ∞
- sqrt(-1)

Properties:
- NaN != NaN (NaN is not equal to itself!)
- NaN + anything = NaN
- Ordered comparisons (<, <=, >, >=, ==) with NaN return false; != returns true
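
A short Python check of the NaN rules:

import math

nan = float('nan')
print(nan == nan)       # False
print(nan != nan)       # True
print(nan < 1.0)        # False
print(math.isnan(nan))  # True: the reliable way to detect NaN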

3.4 Denormalized Numbers (Subnormals)

Normal numbers: exponent ≠ 0, implicit leading 1
Denormals: exponent = 0, implicit leading 0

Smallest normal (single): 1.0 × 2⁻¹²⁶ ≈ 1.18 × 10⁻³⁸
Smallest denormal (single): 2⁻²³ × 2⁻¹²⁶ = 2⁻¹⁴⁹ ≈ 1.4 × 10⁻⁴⁵

Denormals fill the gap between 0 and smallest normal:

     0    denormals    smallest normal
     │◄──────────────►│◄───────────────
     │ gradual        │ normal numbers
     │ underflow      │
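
The same gradual underflow exists in double precision; in Python (math.ulp needs 3.9+), the double-precision boundaries are easy to inspect:

import math
import sys

print(sys.float_info.min)  # 2.2250738585072014e-308 (smallest normal double)
print(math.ulp(0.0))       # 5e-324 (smallest subnormal double)
print(5e-324 / 2)          # 0.0: below the subnormal range, underflows to zero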

4. Precision and Rounding

Understanding the limits of floating point precision.

4.1 Significant Digits

Single precision (23-bit mantissa + implicit 1):
24 bits of precision
≈ 7.22 decimal digits

Double precision (52-bit mantissa + implicit 1):
53 bits of precision
≈ 15.95 decimal digits

Example (single precision):
16777216.0 + 1.0 = 16777216.0  (not 16777217!)

Why? 16777216 = 2²⁴, needs 25 bits to represent 16777217
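
Reproducing this in Python means rounding through a 32-bit float; to_f32 below is just an illustrative helper built on struct:

import struct

def to_f32(x):
    # round a Python float (64-bit) to single precision and back
    return struct.unpack('>f', struct.pack('>f', x))[0]

print(to_f32(16777216.0 + 1.0))  # 16777216.0: 16777217 is not representable in single precision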

4.2 The Infamous 0.1 + 0.2 Problem

>>> 0.1 + 0.2
0.30000000000000004

>>> 0.1 + 0.2 == 0.3
False

Why does this happen?

0.1 in binary = 0.0001100110011001100110011... (repeating)

Stored as (double):
0.1 ≈ 0.1000000000000000055511151231257827021181583404541015625

0.2 in binary = 0.001100110011001100110011... (repeating)

Stored as (double):
0.2 ≈ 0.2000000000000000111022302462515654042363166809082031250

Sum:
0.1 + 0.2 ≈ 0.3000000000000000444089209850062616169452667236328125

0.3 stored as:
0.3 ≈ 0.2999999999999999888977697537484345957636833190917968750

The stored representations don't match!
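
Python's float.hex() shows the underlying bit patterns and makes the mismatch explicit:

print((0.1 + 0.2).hex())  # 0x1.3333333333334p-2
print((0.3).hex())        # 0x1.3333333333333p-2  (differs in the last hex digit)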

4.3 Rounding Modes

IEEE 754 defines five rounding modes:

1. Round to Nearest, Ties to Even (default)
   2.5 → 2, 3.5 → 4, 4.5 → 4, 5.5 → 6
   Minimizes cumulative rounding error

2. Round Toward Zero (truncation)
   2.9 → 2, -2.9 → -2

3. Round Toward +∞ (ceiling)
   2.1 → 3, -2.9 → -2

4. Round Toward -∞ (floor)
   2.9 → 2, -2.1 → -3

5. Round to Nearest, Ties Away from Zero
   2.5 → 3, -2.5 → -3
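
Python's built-in round() follows the default ties-to-even mode, which is easy to verify:

print(round(2.5), round(3.5), round(4.5), round(5.5))  # 2 4 4 6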

4.4 Epsilon and ULP

Machine epsilon (ε): the gap between 1.0 and the next representable value
Single: ε ≈ 1.19 × 10⁻⁷
Double: ε ≈ 2.22 × 10⁻¹⁶

ULP (Unit in Last Place): gap between adjacent floats
At 1.0: ULP = ε
At 2.0: ULP = 2ε
At 1024.0: ULP = 1024ε

The gap between representable numbers grows with magnitude!
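
Python exposes both quantities (math.ulp needs 3.9+):

import math
import sys

eps = sys.float_info.epsilon
print(eps)                     # 2.220446049250313e-16
print(math.ulp(1.0) == eps)    # True
print(math.ulp(1024.0) / eps)  # 1024.0: the gap scales with magnitude
print(1.0 + eps / 4 == 1.0)    # True: too small to change 1.0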

5. Common Pitfalls and Bugs

Floating point arithmetic has many subtle traps.

5.1 Equality Comparison

// WRONG: Direct equality comparison
if (result == expected) { ... }

// BETTER: Epsilon comparison
#define EPSILON 1e-9
if (fabs(result - expected) < EPSILON) { ... }

// BEST: Relative epsilon
bool approximately_equal(double a, double b, double rel_epsilon) {
    double diff = fabs(a - b);
    a = fabs(a);
    b = fabs(b);
    double largest = (b > a) ? b : a;
    return diff <= largest * rel_epsilon;
}

5.2 Catastrophic Cancellation

// Computing b² - 4ac when b² ≈ 4ac subtracts two large, nearly
// equal values: the leading digits cancel and mostly rounding
// error remains.
//
// Example: a = 1, c = 1, b = 2.000000001
// b²  = 4.000000004000000001
// 4ac = 4
// True discriminant = 4.000000001e-9, but b² rounded to double
// keeps only ~16 digits, so the subtraction below recovers only
// about half of the discriminant's significant digits.

double discriminant = b*b - 4*a*c;  // Catastrophic cancellation

// Better: Reformulate if possible
// Or use higher precision arithmetic

5.3 Accumulation Error

// Summing many small numbers
float sum = 0.0f;
for (int i = 0; i < 10000000; i++) {
    sum += 0.1f;  // Each addition adds error
}
// sum ends up around 1.09e6, not 1000000.0 (once sum is large,
// each 0.1f addition rounds to a multiple of the local ULP)

// Solution: Kahan summation
float sum = 0.0f;
float c = 0.0f;  // Running compensation for lost low-order bits
for (int i = 0; i < 10000000; i++) {
    float y = 0.1f - c;      // c is zero initially
    float t = sum + y;       // sum is big, y small, low-order digits lost
    c = (t - sum) - y;       // (t - sum) recovers high part of y
    sum = t;                 // algebraically, c should be zero
}
// sum ≈ 1000000.0 (much more accurate)
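
In Python, math.fsum uses an exact-summation technique in the same spirit as this compensation (it tracks the rounding error terms instead of discarding them) and fixes the classic accumulation example:

import math

vals = [0.1] * 10
print(sum(vals))        # 0.9999999999999999
print(math.fsum(vals))  # 1.0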

5.4 Order of Operations Matters

// These are NOT equivalent:
double a = (x + y) + z;
double b = x + (y + z);

// Example where order matters:
// x = 1e30, y = -1e30, z = 1.0
// (x + y) + z = 0 + 1 = 1
// x + (y + z) = 1e30 + (-1e30) = 0 (z absorbed)

// Compilers may reorder with -ffast-math!
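
The same absorption effect is easy to reproduce in Python:

x, y, z = 1e30, -1e30, 1.0
print((x + y) + z)  # 1.0
print(x + (y + z))  # 0.0: z is absorbed by -1e30 before the cancellation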

5.5 Integer to Float Conversion

// Large integers may lose precision
long long big = 9007199254740993LL;  // 2^53 + 1
double d = (double)big;
long long back = (long long)d;
// back = 9007199254740992  (lost 1!)

// 2^53 = 9007199254740992 is the last consecutive integer
// representable exactly in double precision
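
The same boundary shows up in Python when converting large ints to float:

big = 2**53 + 1                # 9007199254740993
print(float(big))              # 9007199254740992.0: the +1 is lost
print(int(float(big)) == big)  # False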

6. Floating Point in Practice

Real-world considerations for numerical code.

6.1 Financial Calculations

// NEVER use float/double for money!
double price = 19.99;
double quantity = 3;
double total = price * quantity;
// total might be 59.96999999999999...

// Use fixed-point decimal types
// Store cents as integers
long price_cents = 1999;
long quantity = 3;
long total_cents = price_cents * quantity;  // 5997 cents = $59.97

// Or use decimal libraries
// Python: from decimal import Decimal
// Java: BigDecimal
// C#: decimal type

6.2 Numerical Stability

// Unstable: Variance calculation (one-pass, naive)
double sum = 0, sum_sq = 0;
for (int i = 0; i < n; i++) {
    sum += x[i];
    sum_sq += x[i] * x[i];
}
double variance = (sum_sq - sum*sum/n) / (n-1);  // Catastrophic cancellation!

// Stable: Welford's online algorithm
double mean = 0, M2 = 0;
for (int i = 0; i < n; i++) {
    double delta = x[i] - mean;
    mean += delta / (i + 1);
    double delta2 = x[i] - mean;
    M2 += delta * delta2;
}
double variance = M2 / (n - 1);
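
A small Python comparison (illustrative data with a large mean) shows how badly the naive formula can fail while Welford's method stays accurate:

data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]  # true sample variance = 30.0
n = len(data)

# Naive one-pass formula
s = sum(data)
sq = sum(x * x for x in data)
naive = (sq - s * s / n) / (n - 1)

# Welford's online algorithm
mean, m2 = 0.0, 0.0
for i, x in enumerate(data, start=1):
    delta = x - mean
    mean += delta / i
    m2 += delta * (x - mean)

print(naive)         # about -170.67: a negative "variance" caused by cancellation
print(m2 / (n - 1))  # 30.0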

6.3 Comparing Floating Point Numbers

// For testing/assertions: relative + absolute tolerance
bool close_enough(double a, double b, double rel_tol, double abs_tol) {
    // Handle special cases
    if (isnan(a) || isnan(b)) return false;
    if (isinf(a) || isinf(b)) return a == b;
    
    double diff = fabs(a - b);
    
    // Absolute tolerance for numbers near zero
    if (diff < abs_tol) return true;
    
    // Relative tolerance for larger numbers
    double largest = fmax(fabs(a), fabs(b));
    return diff <= largest * rel_tol;
}

// Python's math.isclose default:
// rel_tol=1e-9, abs_tol=0.0
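
Python's math.isclose implements the same idea; note that the absolute tolerance defaults to zero, so comparisons against 0.0 need abs_tol set explicitly:

import math

print(math.isclose(0.1 + 0.2, 0.3))            # True (default rel_tol=1e-9)
print(math.isclose(1e-12, 0.0))                # False: abs_tol defaults to 0.0
print(math.isclose(1e-12, 0.0, abs_tol=1e-9))  # True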

6.4 Compiler Flags and Fast Math

# GCC/Clang fast-math flags
-ffast-math        # Enables all unsafe optimizations

# Individual flags:
-fno-signed-zeros      # Assume +0 = -0
-fno-trapping-math     # Assume no FP exceptions
-ffinite-math-only     # Assume no inf/nan
-fassociative-math     # Allow (a+b)+c = a+(b+c)
-freciprocal-math      # Allow x/y = x*(1/y)

# These break IEEE 754 compliance but can be 2-3x faster
# Use only when you understand the implications!

6.5 Hardware Considerations

x87 FPU (legacy):
- 80-bit extended precision internally
- Results depend on precision control register
- Can give different results than SSE

SSE/SSE2:
- Native 32/64-bit operations
- More predictable results
- Default on modern x86-64

ARM NEON:
- May flush denormals to zero by default
- Different rounding behavior possible

Always test numerical code on target hardware!

7. Extended and Alternative Formats

Beyond standard float and double.

7.1 Extended Precision

x87 Extended (80-bit):
Sign: 1 bit
Exponent: 15 bits
Mantissa: 64 bits (explicit leading bit)
Range: ±3.4 × 10⁻⁴⁹³² (smallest normal) to ±1.2 × 10⁴⁹³²
Precision: ~19 decimal digits

IEEE 754-2008 Quad (128-bit):
Sign: 1 bit
Exponent: 15 bits
Mantissa: 112 bits
Range: ±6.5 × 10⁻⁴⁹⁶⁶ to ±1.2 × 10⁴⁹³²
Precision: ~34 decimal digits

7.2 Half Precision (16-bit)

IEEE 754 Half (binary16):
┌───┬─────────────┬────────────────────────┐
│ S │  Exponent   │       Mantissa         │
│1b │   5 bits    │       10 bits          │
└───┴─────────────┴────────────────────────┘

Range: ±6.1 × 10⁻⁵ to ±65504
Precision: ~3.3 decimal digits

Used in:
- Machine learning (bfloat16 variant popular in AI)
- Graphics (color values, HDR)
- Storage (when precision less critical than size)
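
Python's struct module supports binary16 via the 'e' format code (Python 3.6+), which makes the precision loss easy to see; to_half below is just an illustrative helper:

import struct

def to_half(x):
    # round-trip a double through IEEE binary16
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_half(0.1))      # 0.0999755859375: only ~3 decimal digits survive
print(to_half(65504.0))  # 65504.0: the largest finite half-precision value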

7.3 bfloat16

Google Brain Float16:
┌───┬─────────────┬────────────────┐
│ S │  Exponent   │   Mantissa     │
│1b │   8 bits    │   7 bits       │
└───┴─────────────┴────────────────┘

Same exponent range as float32
Truncated mantissa (less precision)

Benefits for ML:
- Fast conversion to/from float32
- Same dynamic range as float32
- Sufficient precision for neural network weights
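
Because bfloat16 is essentially float32 with the low 16 mantissa bits dropped, a rough conversion can be sketched by masking those bits (real converters usually round to nearest rather than truncate; to_bfloat16 is an illustrative helper):

import struct

def to_bfloat16(x):
    # keep sign, 8-bit exponent, and the top 7 mantissa bits of the float32 value
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

print(to_bfloat16(3.14159265))  # 3.140625: float32 range, ~2-3 digit precision
print(to_bfloat16(1e38))        # still finite: same exponent range as float32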

7.4 Arbitrary Precision Libraries

# Python mpmath (binary floating point at arbitrary precision)
from mpmath import mp, mpf
mp.dps = 50  # about 50 significant decimal digits

x = mpf('0.1') + mpf('0.2')
print(x)  # 0.3 (to 50 digits)
# Note: mpf values are still binary, so 0.1 is still inexact;
# the representation error is just pushed out to ~50 digits.
# For exact decimal arithmetic, use Python's decimal module.

# GMP (GNU Multiple Precision)
# MPFR (Multiple Precision Floating-Point Reliable)

8. Floating Point Exceptions

IEEE 754 defines five exception conditions.

8.1 Exception Types

1. Invalid Operation
   - 0/0, ∞-∞, 0×∞, sqrt(-1)
   - Result: NaN

2. Division by Zero
   - x/0 where x ≠ 0
   - Result: ±∞

3. Overflow
   - Result too large to represent
   - Result: ±∞ or ±MAX (depending on rounding)

4. Underflow
   - Result too small (becomes denormal or zero)
   - Result: denormal or 0

5. Inexact
   - Result required rounding
   - Happens on almost every operation!

8.2 Exception Handling in C

#include <fenv.h>
#include <math.h>   // sqrt
#include <stdio.h>  // printf

int main() {
    // Clear exception flags
    feclearexcept(FE_ALL_EXCEPT);
    
    // Perform operations
    double x = 1.0 / 0.0;  // Division by zero
    double y = sqrt(-1.0); // Invalid
    
    // Check which exceptions occurred
    if (fetestexcept(FE_DIVBYZERO)) {
        printf("Division by zero occurred\n");
    }
    if (fetestexcept(FE_INVALID)) {
        printf("Invalid operation occurred\n");
    }
    
    // Can also trap on exceptions (platform-specific)
    return 0;
}

8.3 Debugging with Floating Point Exceptions

// Enable trapping (causes SIGFPE on exception)
#define _GNU_SOURCE
#include <fenv.h>

void enable_fp_exceptions() {
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);
}

// Now invalid operations crash immediately
// Much easier to find the source of NaN!

9. Testing Floating Point Code

Strategies for verifying numerical correctness.

9.1 Unit Testing Approaches

import math
import unittest

class TestNumerics(unittest.TestCase):
    def test_close_values(self):
        result = compute_something()  # stand-in for the function under test
        expected = 3.14159
        
        # assertAlmostEqual uses places (decimal places)
        self.assertAlmostEqual(result, expected, places=5)
        
        # Or use math.isclose for relative tolerance
        self.assertTrue(math.isclose(result, expected, rel_tol=1e-6))
    
    def test_special_values(self):
        # In Python, 0.0 / 0.0 and 1.0 / 0.0 raise ZeroDivisionError,
        # so build the special values directly
        self.assertTrue(math.isnan(float('nan')))
        self.assertTrue(math.isinf(math.inf))
        
    def test_edge_cases(self):
        # Test near overflow
        # Test near underflow
        # Test with denormals
        # Test with very large/small inputs
        pass

9.2 Property-Based Testing

import math

from hypothesis import given, strategies as st

# Bounds keep the inputs away from subnormal and near-overflow values,
# where the simple relative-error check below would not hold.
@given(st.floats(min_value=1e-300, max_value=1e300))
def test_square_root_property(x):
    result = math.sqrt(x)
    # Squaring the rounded square root should land within a few ULPs of x
    assert math.isclose(result * result, x, rel_tol=1e-10)

9.3 Reference Implementations

// Compare against an arbitrary precision library
#include <assert.h>
#include <math.h>
#include <mpfr.h>

void verify_accuracy() {
    mpfr_t x, expected;
    mpfr_init2(x, 256);  // 256 bits precision
    mpfr_init2(expected, 256);
    
    // Compute in high precision
    mpfr_set_d(x, 0.1, MPFR_RNDN);
    mpfr_sin(expected, x, MPFR_RNDN);
    
    // Compare with standard library
    double std_result = sin(0.1);
    double mpfr_result = mpfr_get_d(expected, MPFR_RNDN);
    
    double error = fabs(std_result - mpfr_result);
    assert(error < 1e-15);  // Within expected precision
    
    mpfr_clear(x);
    mpfr_clear(expected);
}

10. Performance Optimization

Making floating point code fast.

10.1 SIMD Vectorization

// Scalar loop
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

// Vectorized (AVX2, 8 floats at once)
// Assumes a, b, c are 32-byte aligned and n is a multiple of 8
// (otherwise use _mm256_loadu_ps and handle the remainder separately)
#include <immintrin.h>
for (int i = 0; i < n; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]);
    __m256 vb = _mm256_load_ps(&b[i]);
    __m256 vc = _mm256_add_ps(va, vb);
    _mm256_store_ps(&c[i], vc);
}

10.2 Avoiding Denormals

// Denormal operations can be 10-100x slower!

// Solution 1: Flush denormals to zero (DAZ + FTZ)
#include <immintrin.h>
_mm_setcsr(_mm_getcsr() | 0x8040);  // 0x8000 = FTZ (flush to zero), 0x0040 = DAZ (denormals are zero)

// Solution 2: Add small constant to avoid denormals
#define SMALL_CONSTANT 1e-30f
float safe_divide(float a, float b) {
    return a / (b + SMALL_CONSTANT);
}

10.3 Division is Expensive

// Division is ~10-20x slower than multiplication

// SLOW: Multiple divisions by same value
for (int i = 0; i < n; i++) {
    result[i] = data[i] / scale;  // Division each iteration
}

// FAST: Compute reciprocal once
float inv_scale = 1.0f / scale;
for (int i = 0; i < n; i++) {
    result[i] = data[i] * inv_scale;  // Multiplication each iteration
}

10.4 Fast Approximations

// Fast inverse square root (Quake III)
// Famous but now mostly obsolete (rsqrtss is faster)
// The original used `long` and pointer casts; int32_t plus memcpy
// is the portable way to do the same bit trick.
#include <stdint.h>
#include <string.h>
float q_rsqrt(float number) {
    int32_t i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y = number;
    memcpy(&i, &y, sizeof(i));            // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);            // Magic constant!
    memcpy(&y, &i, sizeof(y));
    y = y * (threehalfs - (x2 * y * y));  // One Newton-Raphson step
    return y;
}

// Modern: Use hardware RSQRT with Newton-Raphson refinement
#include <immintrin.h>
float fast_rsqrt(float x) {
    __m128 v = _mm_set_ss(x);
    v = _mm_rsqrt_ss(v);  // Hardware approximation
    // Optional: Newton-Raphson for more precision
    return _mm_cvtss_f32(v);
}

11. Language-Specific Considerations

Different languages handle floating point differently.

11.1 Python

# Python float is always 64-bit double
import sys
print(sys.float_info)

# Decimal for exact decimal arithmetic
from decimal import Decimal, getcontext
getcontext().prec = 50  # 50 significant digits

d = Decimal('0.1') + Decimal('0.2')
print(d == Decimal('0.3'))  # True!

# Fractions for exact rational arithmetic
from fractions import Fraction
f = Fraction(1, 10) + Fraction(2, 10)
print(f == Fraction(3, 10))  # True!

11.2 JavaScript

// JavaScript only has 64-bit doubles (Number)
console.log(0.1 + 0.2);  // 0.30000000000000004

// BigInt for exact large integers (but no decimals)
const big = 9007199254740993n;  // Beyond safe integer

// For exact decimals, use libraries like decimal.js

11.3 Java

// StrictMath for reproducible results
double a = StrictMath.sin(0.5);

// BigDecimal for exact decimal arithmetic
import java.math.BigDecimal;
BigDecimal d = new BigDecimal("0.1")
    .add(new BigDecimal("0.2"));
System.out.println(d.equals(new BigDecimal("0.3")));  // true

// strictfp keyword for reproducible floating point
strictfp class ReproducibleMath {
    double compute(double x) {
        return Math.sin(x) * Math.cos(x);
    }
}

11.4 Rust

// Rust has f32 and f64, follows IEEE 754
let x: f64 = 0.1 + 0.2;
println!("{}", x);  // 0.30000000000000004

// Explicit handling of special values
if x.is_nan() { /* handle */ }
if x.is_infinite() { /* handle */ }

// Total ordering for floats (including NaN)
use std::cmp::Ordering;
fn total_cmp(a: f64, b: f64) -> Ordering {
    a.total_cmp(&b)
}

12. Real-World Floating Point Stories

Historical incidents and lessons learned from floating point bugs.

12.1 The Patriot Missile Failure (1991)

During the Gulf War, a Patriot missile battery failed to intercept 
a Scud missile, resulting in 28 deaths.

Root cause: Time accumulated in 0.1 second increments
- 0.1 cannot be represented exactly in binary
- Error: ~0.000000095 seconds per tick
- After 100 hours: 0.34 second drift
- Scud traveling at Mach 5: 500+ meter targeting error

The fix: Periodic system restart (not implemented)

Lesson: Tiny errors accumulate over time

12.2 The Vancouver Stock Exchange Index (1982)

The Vancouver Stock Exchange started a new index at 1000.000.

After 22 months, the index had fallen to ~520.
Problem: Should have been around ~1098.

Root cause: Truncation instead of rounding
- Index recalculated thousands of times daily
- Each calculation truncated to 3 decimal places
- Each truncation lost a tiny amount of value
- Cumulative loss: nearly half the index value!

The fix: Proper rounding, recalculation from scratch

12.3 The Ariane 5 Explosion (1996)

The Ariane 5 rocket exploded 37 seconds after launch.
Cost: $370 million cargo lost.

Root cause: Float-to-integer conversion overflow
- Horizontal velocity stored as 64-bit float
- Converted to 16-bit signed integer
- Ariane 5 was faster than Ariane 4
- Velocity exceeded 32767 (16-bit max)
- Exception handler shut down navigation
- Rocket veered off course, self-destructed

Lesson: Always check range before conversion

12.4 Excel 2007 Bug

Excel 2007 displayed certain calculation results incorrectly.

=850*77.1 displayed as 100000 (the stored double was ≈65534.99999999999,
just below the true value 65535)

Root cause: Special case in display formatting
- A handful of bit patterns extremely close to 65535 or 65536
- The formatting code had an incorrect boundary check
- The binary representation was correct
- Only the display was wrong

Lesson: Floating point bugs can hide in unexpected places

12.5 Games and Physics Engines

Common floating point issues in games:

1. Coordinate precision at world edges
   - Player at (1000000, 1000000) has less precision
   - Objects jitter or behave erratically
   - Solution: Floating origin (re-center world)

2. Deterministic multiplayer
   - Different CPUs give different results
   - x87 vs SSE, compiler flags matter
   - Solution: Fixed point, or strict FP settings

3. Physics tunneling
   - Fast objects pass through walls
   - Position update exceeds collision bounds
   - Solution: Continuous collision detection

13. Interval Arithmetic and Error Bounds

Tracking and bounding floating point error.

13.1 Interval Arithmetic Basics

// Instead of a single value, track [lower, upper] bounds
#include <fenv.h>  // fesetround
typedef struct {
    double lo;  // Lower bound
    double hi;  // Upper bound
} Interval;

Interval interval_add(Interval a, Interval b) {
    // Round lo down and hi up so the true sum is guaranteed to lie inside
    Interval result;
    fesetround(FE_DOWNWARD);
    result.lo = a.lo + b.lo;
    fesetround(FE_UPWARD);
    result.hi = a.hi + b.hi;
    fesetround(FE_TONEAREST);  // restore the default rounding mode
    return result;
}

// After computation, interval width shows error bounds
double width(Interval i) {
    return i.hi - i.lo;
}

13.2 Applications of Interval Arithmetic

Validated numerics:
- Prove results are correct within bounds
- Detect when computation is unstable
- Used in formal verification

Computer graphics:
- Ray-box intersection with guaranteed correctness
- Robust geometric predicates

Scientific computing:
- Verified solutions to differential equations
- Trusted optimization results

13.3 Error Analysis Example

// Analyzing error in polynomial evaluation
// f(x) = x³ - 3x + 2 at x = 1.0000001

// Direct evaluation (Horner's method)
double horner(double x) {
    return (x * x - 3) * x + 2;  // 4 ops: 2 mul, 1 sub, 1 add
}

// Error analysis:
// Each operation introduces error ≤ 0.5 ULP
// Errors can compound or cancel
// Result error: typically a few ULPs

// For critical code: use compensated algorithms
// or interval arithmetic for guaranteed bounds

14. Floating Point in Databases and Distributed Systems

Special considerations for data storage and transmission.

14.1 Database Storage

-- Different databases handle floats differently

-- PostgreSQL: real (32-bit), double precision (64-bit)
-- Exact comparison unreliable
SELECT * FROM t WHERE price = 19.99;  -- Risky!
SELECT * FROM t WHERE ABS(price - 19.99) < 0.001;  -- Better

-- For money: Use DECIMAL/NUMERIC
CREATE TABLE products (
    id INT PRIMARY KEY,
    price DECIMAL(10, 2)  -- 10 digits, 2 after decimal
);

14.2 Serialization Challenges

// Binary serialization: preserve exact bits
void serialize_double(double d, uint8_t *buf) {
    memcpy(buf, &d, sizeof(double));
}

// Text serialization: can lose precision!
printf("%.15g", d);  // May not round-trip exactly

// Round-trip guarantee requires enough digits
// Double: 17 significant digits for round-trip
printf("%.17g", d);  // Guaranteed round-trip

// Or use hexadecimal float format
printf("%a", d);  // Example: 0x1.921fb54442d18p+1

14.3 Distributed Computation

Challenges in distributed floating point:

1. Different hardware architectures
   - x86 vs ARM may give slightly different results
   - GPU vs CPU differences
   
2. Non-deterministic reduction
   - Parallel sum depends on order
   - Results vary between runs
   
3. Solutions:
   - Require specific rounding mode
   - Use reproducible reduction algorithms
   - Define tolerance for result matching
   - Use fixed-point for critical calculations

14.4 Cross-Platform Reproducibility

// Achieving reproducible results across platforms

// 1. Use strict floating point
#pragma STDC FENV_ACCESS ON

// 2. Avoid excess precision
// Compile with: -ffp-contract=off
// Or explicitly round intermediate results
float temp = (float)(a * b);  // Force single precision

// 3. Specify rounding mode
fesetround(FE_TONEAREST);

// 4. Handle denormals consistently
// Either flush to zero everywhere, or preserve everywhere

// 5. Be careful with transcendental functions
// sin(), exp(), etc. may differ between platforms
// Consider Taylor series or lookup tables for consistency

15. Advanced Topics

Cutting-edge developments in floating point.

15.1 Posit Numbers

An alternative to IEEE 754 floats:

Posit format:
┌───┬────────────┬──────────┬───────────┐
│ S │   Regime   │ Exponent │  Fraction │
│1b │  variable  │ variable │  variable │
└───┴────────────┴──────────┴───────────┘

Key differences:
- Tapered precision (more bits near 1.0)
- No NaN or ±∞ (controversial)
- Simpler exception handling
- Claimed better accuracy for many tasks

Status: Research, not widely adopted yet

15.2 Stochastic Rounding

Traditional: Round to nearest (deterministic)
Stochastic: Probabilistically round up or down

Example: 2.3 rounded stochastically
- 70% chance → 2
- 30% chance → 3
- Expected value = 2.3 (unbiased!)

Benefits for machine learning:
- Prevents systematic rounding bias
- Better gradient flow in training
- Enables lower precision without accuracy loss
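
A minimal sketch of stochastic rounding to integers (assuming a uniform random tie-break; not tied to any particular hardware implementation):

import math
import random

def stochastic_round(x):
    # round up with probability equal to the fractional part,
    # so the expected value equals x (unbiased)
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if random.random() < frac else 0)

# The average of many stochastic roundings of 2.3 approaches 2.3
print(sum(stochastic_round(2.3) for _ in range(100000)) / 100000)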

15.3 Mixed Precision Computing

Strategy: Use different precisions for different purposes

Example in deep learning:
1. Store weights in FP32
2. Compute forward pass in FP16/BF16
3. Accumulate in FP32
4. Update weights in FP32

Benefits:
- 2x memory savings
- 2-8x compute speedup (tensor cores)
- Minimal accuracy loss

Requires careful loss scaling to prevent underflow

15.4 Unum and Type III Unums

Universal Numbers (Unums):
- Variable-size representation
- Exact arithmetic when possible
- Track uncertainty explicitly

Type III Unums (Posits + Valids):
- Posits for individual values
- Valids for interval bounds
- Goal: Replace IEEE 754

Current status: Academic research
Adoption barriers: Hardware support, ecosystem

16. Practical Guidelines by Domain

Different applications have different floating point needs.

16.1 Scientific Computing

Requirements:
- Maximum precision for simulation accuracy
- Error propagation awareness
- Reproducibility for verification

Recommendations:
✓ Use double precision by default
✓ Consider quad precision for ill-conditioned problems
✓ Implement error estimation
✓ Use stable algorithms (pivoting, Kahan summation)
✓ Verify against analytical solutions when available
✓ Document numerical assumptions and limitations

16.2 Financial Applications

Requirements:
- Exact decimal arithmetic for regulations
- No rounding surprises
- Audit trail accuracy

Recommendations:
✓ NEVER use float/double for money
✓ Use decimal types (BigDecimal, Decimal, NUMERIC)
✓ Define rounding rules explicitly
✓ Store as integers (cents, basis points)
✓ Test with boundary values and regulatory scenarios
✓ Document rounding policy in specifications

16.3 Machine Learning and AI

Requirements:
- Throughput over precision
- Memory efficiency for large models
- GPU/accelerator compatibility

Recommendations:
✓ Use FP16/BF16 for inference when possible
✓ Mixed precision training (FP16 compute, FP32 accum)
✓ Monitor for overflow/underflow during training
✓ Use loss scaling for gradient underflow prevention
✓ Quantization-aware training for INT8 deployment
✓ Test model accuracy across precision levels

16.4 Graphics and Games

Requirements:
- Fast performance for real-time rendering
- Visual correctness (not numerical)
- Large world coordinate handling

Recommendations:
✓ Float32 for most calculations
✓ Floating origin for open worlds
✓ Double precision for physics simulation core
✓ Robust geometric predicates for collision
✓ Fast approximations where visual error is acceptable
✓ Test at extreme coordinates and time values

16.5 Embedded Systems

Requirements:
- Limited hardware resources
- Deterministic timing
- Power efficiency

Recommendations:
✓ Consider fixed-point arithmetic
✓ Use software float emulation if no FPU
✓ Profile floating point instruction costs
✓ Avoid denormals (flush to zero)
✓ Minimize divisions (precompute reciprocals)
✓ Consider CORDIC algorithms for trig functions

16.6 Web and JavaScript Applications

Requirements:
- Cross-browser consistency
- Only 64-bit doubles available (in Number)
- Interoperability with JSON

Recommendations:
✓ Use BigInt for large exact integers
✓ Decimal.js or similar for precise decimals
✓ Be aware of JSON number limitations
✓ Validate numeric ranges from user input
✓ Use explicit rounding for display
✓ Test across browsers and platforms

17. Debugging Floating Point Issues

Systematic approaches to finding and fixing floating point bugs.

17.1 Diagnostic Techniques

// Print exact representation
#include <stdio.h>
#include <string.h>  // memcpy
void print_float_bits(float f) {
    unsigned int bits;
    memcpy(&bits, &f, sizeof(bits));
    printf("Value: %g\n", f);
    printf("Hex: 0x%08X\n", bits);
    printf("Sign: %d\n", (bits >> 31) & 1);
    printf("Exp: %d (biased %d)\n", 
           ((bits >> 23) & 0xFF) - 127,
           (bits >> 23) & 0xFF);
    printf("Mantissa: 0x%06X\n", bits & 0x7FFFFF);
}

// Hex float format in C99
printf("%a\n", 0.1);  // Prints: 0x1.999999999999ap-4

17.2 Common Symptoms and Causes

Symptom: Result is NaN
Causes:
├─ 0/0, ∞-∞, 0×∞
├─ sqrt of negative number
├─ Uninitialized floating point variable
└─ Error propagated from earlier computation

Symptom: Result is unexpectedly zero
Causes:
├─ Underflow (value too small)
├─ Catastrophic cancellation
└─ Denormal flushing

Symptom: Results differ between debug/release
Causes:
├─ Different optimization levels
├─ x87 vs SSE codegen
├─ Fast-math flags enabled in release
└─ Uninitialized memory (different in debug)

Symptom: Results differ between machines
Causes:
├─ Different CPU architectures
├─ Different compiler versions
├─ Different math library implementations
└─ Different SIMD instruction sets

17.3 Floating Point Sanitizers

# GCC/Clang undefined behavior sanitizer catches some FP issues
clang -fsanitize=undefined,float-divide-by-zero program.c

# Enable FP exceptions to catch issues early
#define _GNU_SOURCE
#include <fenv.h>
feenableexcept(FE_INVALID | FE_OVERFLOW | FE_DIVBYZERO);
// Now bad operations cause SIGFPE

# Valgrind can detect some issues
valgrind --tool=memcheck ./program

17.4 Systematic Testing Strategy

1. Boundary values
   ├─ Zero (positive and negative)
   ├─ Smallest positive normal and denormal
   ├─ Largest finite value
   ├─ Infinity and NaN
   └─ Powers of 2 (exactly representable)

2. Special cases
   ├─ Values that can't be represented exactly (0.1, 0.2)
   ├─ Values near overflow threshold
   ├─ Values near underflow threshold
   └─ Values that cause cancellation

3. Randomized testing
   ├─ Property-based testing (Hypothesis, QuickCheck)
   ├─ Comparison with arbitrary precision library
   └─ Cross-platform result comparison

4. Stress testing
   ├─ Many iterations of accumulation
   ├─ Deep recursion with floating point
   └─ Extreme input ranges

18. Summary

Floating point representation is a fundamental compromise between range, precision, and performance:

IEEE 754 format:

  • Sign bit, biased exponent, implicit-leading-1 mantissa
  • Single (32-bit): ~7 decimal digits
  • Double (64-bit): ~16 decimal digits

Special values:

  • Zero (positive and negative)
  • Infinity (positive and negative)
  • NaN (not equal to itself)
  • Denormals (gradual underflow)

Common pitfalls:

  • Never compare floats with ==
  • 0.1 cannot be represented exactly
  • Order of operations affects results
  • Large magnitude differences cause precision loss

Best practices:

  • Use double unless you have good reason not to
  • Never use float/double for money
  • Use epsilon comparisons with relative tolerance
  • Consider Kahan summation for accuracy
  • Test edge cases and special values
  • Understand your compiler’s optimization flags

Understanding floating point is essential for writing correct numerical code. The abstractions are leaky, the edge cases are numerous, and the bugs are subtle. But with knowledge of how floating point works internally, you can anticipate problems and write robust numerical software.