含微室: Inexact Sum of Floating Point Numbers

It is common to calculate sum. Usually, a sum over $n$ numbers has $n - 1$ roundings.

Use $fl(x)$ as the floating number round $x$ (usually rounded by hardware, the closest representable number, with some even/odd considerations). So $n$ numbers $x_1, x_2, \ldots, x_n$ is usually calculated as

$$S(x_1, x_2, \ldots, x_n) = fl(x_n + fl(x_{n-1} + \cdots ))$$

This calculation is acceptable in most case since floating type, like double, has quite large precision. Things become worse when the number of values are large, and values are in a wild range, esp. when a lot of them could cancel each other, and the result is very small compared to some numbers. In tradition, we could use Kahan summation algorithm to improve the precision.

In this post, I try a simple algorithm which would give the rounding floating point number of the exact sum.

Two number sum could be represented exactly as the floating sum and residual, i.e. $a + b = fl(a + b) + e$, with e also a floating number. In Knuth's TAOCP book, there is a simple algorithm discussion to find the $e$.

Similar to this idea, and also considering floating formats usually have fixed width digits (mantissa part) and exponents, so the sum could be represented by a limit numbers sum. That is, for numbers $x_1, x_2, \ldots, x_n$, we could find up to $L$ numbers $y_j$, such that

$$\sum_i x_i = \sum_{j=1}^{L} y_j$$.

We can assume that a minimal number of $y_i$ have no zeros,and no overlapped precession, i.e. any $i$, $j$, $fl(y_i + y_j) = y_i$, or $y_j$, since we could always rewrite $y_i + y_j$ to $fl(y_i + y_j) + e$. Based on this, we could bound the $L$ as the exponent span divided by the digits size.

To calculate numbers sum, we manage a number series (length up to $L$), starting from empty. We add numbers to this series one by one. When the length exceeds the limit $L$, we merge some overlaps $y_i + y_j$ to $fl(y_i + y_j) + e$, and some $e$ or even $fl(y_i + y_j)$ are zeros and omitted, which makes the length bounded by $L$ again. After all numbers added, we merge all overlaps and the number in the series with largest absolute value is the final result.

Code:

Inexact Sum of Floating Point Numbers

No comments: