The equivalent of “When I was your age, we used to walk to school in the snow. *Barefoot*”, in DSP, is to say we used to do DSP in fixed-point math. The fixed-point system could be made of integer and shift operations in software, or built into fixed-point DSP chips such as the 56K family, or other hardware implementations. Floating point was not available in dedicated DSP chips, and took too many cycles on a CPU. But the general usefulness of floating point led to it being optimized in CPUs, making it the overwhelming choice for DSP today.

A floating binary point allows floating point math a huge range. It essentially gives us a magic ruler that can be stretched or shrunk at will, to measure huge or tiny distances with the same relative accuracy. The catch is the word “relative”. There are the same number of tick marks on the ruler at any size, and that makes for some properties that might not be immediately obvious. Besides the effects of math involving quantized samples, significant errors to watch out for are in addition, multiplication, and denormals. We’ll cover denormals in the next post.

### Definitions

Floating point is well defined in IEEE standards that are widely adopted in order to give consistent results from one computer and processor to another. There are many ways to do floating point, but these are the only ones we need consider.

“Single precision” (32-bit) floating point, “float” in C, is ample to hold samples of a digital audio stream. It has a sign bit, 23 bits of mantissa, and 8 bits of exponent. But floating point execution units typically keep results in a “normalized” state, where the leading digit is always a “1” and the exponent is adjusted accordingly. This makes best use of the available mantissa bits. Since the leading bit is always 1, it’s omitted, yielding 24 mantissa bits, effectively. But we have a sign bit too, so it’s equivalent to 25-bit audio. It matches up conveniently to 24-bit audio converters.

Double precision (64-bit), “double” in C, gives substantially more of everything, at little cost in processing time (though twice the memory use). A sign bit, 52 bits of mantissa, and 11 bits of exponent. The increase in exponent isn’t important for audio, but the 54 bits of precision is a big plus to minimize accumulated errors.

For an idea of how the two compare, precision-wise, consider that there are 31557600 seconds in a year. The 25 bits of effective precision (if using the full range of negative to positive) in a 32-bit float has a range of 33554432 (2^{25}), a resolution of about a second out of a year. A 64-bit double has a range of 18014398509481984 (2^{54}), for a resolution of about two nanoseconds.

For fixed point, it depends on the implementation, but the standard is the 56K family, which multiplies 24-bit (leading binary point) numbers for a result with 48 bits to the right of the binary point and 8 to the left. The extra 8 bits allow headroom for multiply-accumulate cycles to grow greater than one.

### Which size to use

32-bit floats are ample for storage and for the sample busses between processing—the 64-bit sample size used for buffers by some DAWs is overkill. 32-bit floats are adequate for many computations, on par with 24-bit fixed point systems (better in most regards but falling short in some—the 56K’s extended-precision accumulators allow long FIRs to be more precise). Either size has far more range in the exponent than we’ll use—audio simply doesn’t require that incredible dynamic range.

But CPUs are heavily optimized for double-precision floating point, and the added precision is often necessary in audio DSP algorithms. For typical modern Mac/PC host processors, stick with doubles in your DSP calculations; the performance difference is small, and your life will be a lot easier, with less chance of a surprise. But you may need to use single-precision, especially in DSP chips or SIMD and some processors where the performance difference is large, so you should understand the limitations.

### Errors in multiplication

When you multiply two arbitrary numbers of *n* digits of precision, the result has *2n* digits of precision. 9 x 9 = 81, 99 x 99 = 9801, .9 x .9 = .81. But floating point processors do not give results in a higher precision—the result is truncated to fit in the same number of bits as the operands (essentially, .1111 x .9999 = .1110 instead of the precise .11108889, for an example of 4-digit decimal math). This means an algorithm using single precision floats may have significantly more error than the same algorithm done in a fixed point DSP like the 56K family, which allows accumulation of double-precision results. You need to consider the effects of this error on your algorithm. Or, just stick with double precision computation, which has so much precision that you can usually ignore this truncation error for audio (but watch out for things like raising values to high powers—you can’t freely code arbitrarily high-order IIR filters even with doubles, which is why we usually factor higher orders into a cascade of biquads).

### Errors in addition

If you’ve worked with fixed-point math, you’re used to the idea that you discard precision from multiplying when storing back to the original sample size, but addition is basically perfect (as long as you don’t overflow, by either guarding against it, or using an accumulator with headroom). With floating point, you have the same fundamental issues as with multiplication (but with no choice), and a whole new problem with addition. I think this is something many don’t consider, so this section might be my most important point.

*Floating point doesn’t do well when adding two values of vastly different sizes.*

If our magic ruler is stretched to measure the distance from here to the sun, each tick is about 4.5 km, so adding a measurement of a bug’s wing to it will do nothing: it’s too small of a distance to make it to the next tick, the next highest possible number. (With zero placed at the midpoint of the span, the effective mantissa range of a 32-bit float is 2^{25}, or 33554432. The scale is our distance to the sun, 150 million km, and 150000000 km / 33554432 is about 4.5 km per “tick”. For doubles, it’s half the width of the finest human hair!)

This can be a huge problem in iterative math. Consider that a 32-bit unsigned integer has a maximum value of 4,294,967,295. As long as you don’t exceed that amount, you can count through every value from 0 to that maximum by adding one for each count. A 32-bit float has a maximum value of 3.402823 × 10^{38}, but can count through a maximum of 16,777,216 steps incrementally, no matter the size. After that, the increment is too small compared with the size of the running total, and has no effect.

For instance, a big mistake would be to use a 32-bit float as a timer, incrementing every sample period, to use in an automation ramp calculation. Even if the time increment is a very small number, the timer will stop incrementing after six minutes because the increment’s mantissa can no longer be aligned with that of the running total. 6.3 minutes * 60 seconds/minute * 44100 samples/second is about 2^{24} (16777216) samples—we’re assuming only positive values here. The good news is that when using doubles, it would take a few thousand years to hit that point.

Try this:

    // Incrementing integer and float
    #include <iostream>
    #include <cstdint>
    using namespace std;

    int main() {
        // Number of increments to attempt in each run
        const long objectives[] = { 100, 10000, 1000000, 100000000, 1000000000 };
        for (const long objective : objectives) {
            float counterFp = 0;
            uint32_t counterInt = 0;
            for (long idx = 0; idx < objective; idx++) {
                counterInt += 1;
                counterFp += 1;  // has no effect once counterFp reaches 2^24
            }
            cout.precision(20);
            cout << counterInt << ", " << counterFp << endl;
        }
    }

Output:

    100, 100
    10000, 10000
    1000000, 1000000
    100000000, 16777216
    1000000000, 16777216

Note that fixed point math is not without this problem, but for fixed point it’s obvious that small numbers have fewer significant bits after the leading zeros, and obvious when they are too small to fit in the fixed point representation. With floating point, you can be easily fooled into thinking you have more precision in a value than you can use for an addition operation, because the usable precision depends on the value you’re adding it to.