Floating point caveats

The equivalent of “When I was your age, we use to walk to school in the snow. Barefoot”, in DSP, is to say we used to do DSP in fixed point math. The fixed-point system could be made of integer and shift operations in software, or built into fixed-point DSP chips such as the 56K family, or other hardware implementations. Floating point was not available in dedicated DSP chips, and took too many cycles on a CPU. But the general usefulness of floating point lead to it being optimized in CPUs, making it the overwhelming choice for DSP today.

A floating binary point allows floating point math a huge range. It essentially gives us a magic ruler that can be stretch or shrunk at will, to measure huge or tiny distances with the same relative accuracy. The catch is the word “relative”. There are the same number of tick marks on the ruler at any size, and that makes for some properties that might not be immediately obvious. Besides the effects of math involving quantized samples, significant errors to watch out for are in addition, multiplication, and denormals. We’ll cover denormals in the next post.


Floating point is well defined in IEEE standards that are widely adopted in order to give consistent results from one computer and processor to another. There are many ways to do floating point, but these are the only ones we need consider.

“Single precision” (32-bit) floating point, “float” in C, is ample to hold samples of a digital audio stream. It has a sign bit, 23 bits of mantissa, and 8 bits of exponent. But floating point execution units typically keep results in a “normalized” state, where the leading digit is always a “1” and the exponent is adjusted accordingly. This makes best use of the available mantissa bits. Since the leading bit is always 1, it’s omitted, yielding 24 mantissa bits, effectively. But we have a sign bit too, so it’s equivalent to 25-bit audio. It matches up conveniently to 24-bit audio converters.

Double precision (64-bit), “double” in C, gives substantially more of everything, at little cost in processing time (though twice the memory use). A sign bit, 52 bits of mantissa, and 11 bits of exponent. The increase in exponent isn’t important for audio, but the 54 bits of precision is a big plus to minimize accumulated errors.

For an idea of how the two compare, precision-wise, consider that there are 31557600 seconds in a year. The 25 bits of effective precision (if using the full range of negative to positive) in a 32-bit float has a range of 33554432 (225), a resolution of about a second out of a year. A 64-bit double has a range of 18014398509482000 (254), for a resolution of about two nanoseconds.

For fixed point, it depends on the implementation but the standard is the 56k family, which multiplies 24-bit (leading decimal point) numbers for a result with 48 bits to the right of the decimal point and 8 to the left. The extra 8 bits allow headroom for multiply-accumulate cycles to grow greater than one.

Which size to use

32-bit floats are ample for storage and for the sample busses between processing—the 64-bit sample size used for buffers by some DAWs is overkill. 32-bit floats are adequate for many computations, on par with 24-bit fixed point systems (better in most regards but falling short in some—the 56K’s extended-precision accumulators allow long FIRs to be more precise). Either size has far more range in the exponent than we’ll use—audio simply doesn’t require that incredible dynamic range.

But CPUs are heavily optimized for double-precision floating point, and the added precision is often necessary in audio DSP algorithms. For typical modern Mac/PC host processors, stick with doubles in your DSP calculations; the performance difference is small, and your life will be a lot easier, with less chance of a surprise. But you may need to use single-precision, especially in DSP chips or SIMD and some processors where the performance difference is large, so you should understand the limitations.

Errors in multiplication

When you multiply two arbitrary numbers of n digits of precision, the result has 2n digits of precision. 9 x 9 = 81, 99 x 99 = 9801, .9 x .9 = .81. But floating point processors do not give results in a higher precision—the result is truncated to fit in the same number of bits as the operands (essentially, .1111 x .9999 = .1110 instead of the precise .11108889, for an example of 4-digit decimal math). This means an algorithm using single precision floats may have significantly more error than the same algorithm done in a fixed point DSP like the 56K family, which allows accumulation of double-precision results. You need to consider the effects of this error on your algorithm. Or, just stick with double precision computation, which has so much precision that you can usually ignore this truncation error for audio (but watch out for things like raising values to high powers—you can’t freely code arbitrarily high-order IIR filters even with doubles, which is why we usually factor higher orders into a cascade of biquads).

Errors in addition

If you’ve worked with fixed-point math, you’re used to the idea that you discard precision from multiplying when storing back to the original sample size, but addition is basically perfect (as long as you don’t overflow, by either guarding against it, or using an accumulator with headroom). With floating point, you have the same fundamental issues as with multiplication (but with no choice), and a whole new problem with addition. I think this is something many don’t consider, so this section might be my most important point.

Floating point doesn’t do well when adding two values of vastly different sizes.

If our magic ruler is stretched to measure the distance from here to the sun, each tick is bout 5 km, so adding a measurement of a bug’s wing to it will do nothing—it’s too small of a distance to make it to the next tick, the next highest possible number. (With zero placed at the midpoint of the span, the effective mantissa range of a 32-bit float is 225, or 33554432. The scale is our distance to the sun, 150 million km—150000000 km / 33554432 is about 5 km per “tick”. For doubles, it’s half the width of the finest human hair!)

This can be a huge problem in iterative math. Consider that a 32-bit unsigned integer has a maximum value of 4,294,967,295. As long as you don’t exceed that amount, you can count through every value from 0 to that maximum by adding one for each count. A 32-bit float has a maximum value of 3.402823 × 1038, but can count through a maximum of 16,777,216 steps incrementally, no matter the size. After that, the increment is too small compared with the size of the running total, and has no effect.

For instance, a big mistake would be to use a 32-bit float as a timer, incrementing every sample period, to use in an automation ramp calculation. Even if the time increment is a very small number, the timer will stop incrementing after six minutes because the increment’s mantissa can no longer be aligned with that of the running total. 6.3 minutes * 60 seconds/minute * 44100 samples/second is about 224 (16777216) samples—we’re assuming only positive values here. The good news is that when using doubles, it would take a few thousand years to hit that point.

Try this:

// Incrementing integer and float
#include <iostream>
#include <cstdint>
using namespace std;

int main() {
    const long objectives[] = { 100, 10000, 1000000, 100000000, 1000000000 };
    for(const long objective : objectives) {
        float counterFp = 0;
        uint32_t counterInt = 0;
        for (long idx = 0; idx < objective; idx ++) {
            counterInt += 1;
            counterFp += 1;
        cout << counterInt << ",  " << counterFp << endl;
100,  100
10000,  10000
1000000,  1000000
100000000,  16777216
1000000000,  16777216

Note that fixed point math is not without this problem, but for fixed point it’s obvious that small numbers use fewer significants after the leading zeros, and obvious when they are too small to fit in the fixed point representation. With floating point, you can be easily fooled into thinking you have more precision in a value than you can use for an addition operation, because the usable precision is dependent on the value you’re adding it to.

This entry was posted in Math. Bookmark the permalink.

9 Responses to Floating point caveats

  1. Bart says:

    I was looking forward to new earlevel.com articles, and here we are! Very good article as always. I knew about the addition problem of floating points, but i never came across the stretching ruler comparison, a great way to look at it!
    For higher order filters, it might be worth referring to cascading into second order sections. (Although I still don’t understand exactly why that helps)

    • Nigel Redmon says:

      Yes, I did consider elaborating on “you can’t freely code arbitrarily high-order IIR filters”, but didn’t want to go off on a tangent about something covered elsewhere in the site. Maybe I’ll add a few words or a link (or maybe just insist everyone read my whole site 😉

      On why it helps…it’s complicated to say what order you can get away with, because it depends on what filter topology and pole and zero locations. For instance, direct form biquads have bad quantization effects near the unit circle a zero and π, but very good in the middle. If you need to place a pole near the unit circle, it’s disastrous if the pole quantizes to the unit circle or outside it. But in a nutshell it’s this: A high order recursive filter has more points of quantization error, and they are all recirculating, affecting all the poles and zeros. Splitting the filters up isolates and limits the errors.

  2. Bart says:

    Indeed i was thinking of a link to e.g. the cascading topic within your own, over the years elaborate blog, ie. https://www.earlevel.com/main/2016/09/29/cascading-filters/ provides insight in how to do this properly.
    Since we like to compare numeric behaviour to facts about the more tanglible or intuitive physical world, in order to get a deeper understanding, does following metaphor make sense…? Cascading low order filters is like applying multiple thin layers of varnish on a surface. The positive and negative errors in thickness will average out over the surface. (Chemical effects aside) one thick layer will have intensified errors at e.g. corners.
    Probably a mathematician would try to prove the cascading advantage in an exact manner without relying on any intuition, but I guess that would get very complicated.

    • Nigel Redmon says:

      Yes, that’s a good link, I’ll add…

      Mathematically, a biquad has a second order, quadratic, equation in both numerator and denomination—the highest order is the square. As you can imagine, moving up in powers (filter orders) pumps up the effects of any quantization order quickly as you add in higher exponents.

      But note that running a higher order filter as a cascade of biquads requires no more storage (same number of unit delays), no more cpu.

  3. Bart says:

    Maybe all this does not belong to above topic but… well the “higher order exponents pump up quantization” explanation I still fail to understand, since our numeric implementation does not use exponents, but uses delay lines in the form of state variables of a filter. So no higher order exponents are used in the actual calculation.
    I tried to find an explanation for the cascading benefits on the web and found one that leans to what you stated earlier. Wow it starts to make sense while typing this… In non-cascaded form, a quantization of one coefficient will affect ALL poles or ALL zeros, so e.g 4rd order, each pole position is affected by the quantization of each of the 4 coefficients in the numerator of the transfer function. So each pole is moved 4 times. When cascading 2 second orders as an alternative, each pole is only moved twice. The sections dont affect each other’s poles. I guess i need to think of a better metaphor than layers of varnish 🙂

    • Nigel Redmon says:

      Yes, that’s a good way to look at it. And positioning of poles and zeros in higher order filters is more critical (why else would someone choose a higher order, high computational cost filter if there weren’t going to make critical use of them?), so you really end up with a different filter (and possibly an unstable one). At least you can bake in smaller errors at each stage with a cascade.

      On order: OK, you don’t see the powers, but remember, a biquad is a discrete realization of am analog filter of order 2, etc. You may see only summing, multiplying and unit delays, but what happens when you feed that back…And it really adds up quickly. Developing my first plug-in years ago (Amp Farm!), I was pretty surprised when I checked the impulse response of a routine biquad in the 100-200 Hz area—I’d assumed the 56k’s precision was plenty. And at 192k, the threshold would be four times worse. And that’s just a second order filter. Floating point is more forgiving, and especially doubles, but it’s surprising how quick the problem escalates with filter order.

      Found this, search “Design IIR Filters Using Cascaded Biquads”

  4. robert bristow-johnson says:

    even though i have lazily written audio processing code in pure floating-point (because i was using a SHArC or it was native PC) i agree that careful, fixed-point processing is best. 32-bit fixed point with 6 dB headroom is 34 dB better S/N than 32-bit IEEE floating point.

    get used to using “int32_t” and “int64_t” in C/C++. and remember that the C/C++ standard is stupid and the immediate result of multiplying two int32_t variables is another int32_t, not a 64-bit value, and your most-significant 32 bits are truncated.

    so you have to cast one of your int32_t variables to 64-bit first, and *then* multiply (and the result will not have truncated the top 32 bits). if the compiler is smart, maybe it will know that the multiplication of the casted 32 bit int will not require a sign extension, and that it does not need to do four 32×32 mults into a 64 bit result register but only one.

    maybe we should have more of a discussion of doing efficient DSP in C/C++. like how to do delay lines and how to deal with the wrap-around of a circular buffer. there are at least 2 qualitatively different techniques i am aware of. and if you’re processing in sample blocks (which you don’t, but i do), then that sample block size might have to be part of the design of the optimal circular buffer.

    • Nigel Redmon says:

      Good comments, Robert, agreed. And just to make it, clear, I write primitives as a single sample process. but they end up in larger processes that handle blocks, always. That’s why I write them inline (and avoid function call overhead). So that its functionally the same to write the primitive and write a loop using it, as just writing a loop to do everything, but it’s more flexible. (And to be clear, a “loop” might involve a lot of other stuff, it may employ template meta programming and the loop may execute unrolled, etc.) I’ve seen DSP libraries that always process blocks, to avoid function call overhead, and it’s awkward to use those primitives to write a bigger process like a reverb algorithm, especially processes with feedback paths.

Leave a Reply

Your email address will not be published. Required fields are marked *