To dither means to add noise to our audio signal. Yes, we add noise on purpose, and it is a good thing.
How can adding noise be a good thing??!!!
We add noise to make a trade. We trade a little low-level hiss for a big reduction in distortion. It’s a good trade, and one that our ears like.
The problem results from something Nyquist didn’t mention about a real-world implementation—the shortcoming of using a fixed number of bits (16, for instance) to accurately represent our sample points. The technical term for this is “finite wordlength effects”.
At first blush, 16 bits sounds pretty good—96 dB dynamic range, we’re told. And it is pretty good—if you use all of it all of the time. We can’t. We don’t listen to full-amplitude (“full code”) sine waves, for instance. If you adjust the recording to allow for peaks that hit the full sixteen bits, that means much of the music is recorded at a much lower volume—using fewer bits.
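For the curious, here is a rough sketch of where the 96 dB figure comes from, and of what recording below full level costs us. It assumes the usual rule that dynamic range is the ratio of full scale to one quantization step; the function names are made up for illustration.

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Ratio of full scale to one quantization step, in decibels."""
    return 20.0 * math.log10(2 ** bits)

def bits_in_use(bits: int, level_db: float) -> float:
    """Approximate bits exercised by a signal peaking at level_db (0 = full code).

    Each bit is worth about 6.02 dB, so every 6 dB below full code gives up
    roughly one bit.
    """
    db_per_bit = 20.0 * math.log10(2.0)
    return bits + level_db / db_per_bit

for bits in (8, 16, 24):
    print(f"{bits} bits: {dynamic_range_db(bits):.1f} dB")

# A passage sitting 40 dB below the peaks of a 16-bit recording
print(f"{bits_in_use(16, -40.0):.1f} bits in use")
```

The loop prints roughly 48, 96, and 144 dB. The last line shows that a quiet passage 40 dB down is effectively represented with only a little over nine bits.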
In fact, if you think about the quietest sine wave you can play back this way, you’ll realize it’s one bit in amplitude—and therefore plays back as a square wave. Yikes! Talk about distortion. It’s easy to see that the lower the signal levels, the higher the relative distortion. Equally disturbing, components smaller than the level of one bit simply won’t be recorded at all.
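A small sketch makes both problems visible. Levels here are expressed in units of one least-significant bit, and plain rounding with no dither is assumed; the helper names are hypothetical.

```python
import math

def quantize(x: float) -> int:
    """Round a level given in LSB units to the nearest integer code."""
    return math.floor(x + 0.5)

def sine(amplitude_lsb: float, n: int = 16) -> list:
    """One cycle of a sine wave, n samples, amplitude in LSB units."""
    return [amplitude_lsb * math.sin(2 * math.pi * (k + 0.5) / n) for k in range(n)]

one_bit = [quantize(s) for s in sine(1.0)]   # sine with a peak of one LSB
sub_bit = [quantize(s) for s in sine(0.4)]   # sine smaller than one LSB

print(one_bit)   # only the codes -1, 0, and 1 appear: a squared-off wave
print(sub_bit)   # all zeros: the signal is not recorded at all
```

The first list holds flat runs of 1 and -1 with a few zeros at the crossings, which is nothing like the smooth sine we started with. The second list is silence.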
This is where dither comes in. If we add a little noise to the recording process… well, first, an analogy…
Try this experiment yourself, right now. Spread your fingers and hold them up a few inches in front of one eye, and close the other. Try to read this text. Your fingers will certainly block portions of the text (the smaller the text, the more you’ll be missing), making reading difficult.
Wag your hand back and forth (to and fro!) quickly. You’ll be able to read all of the text easily. You’ll see the blur of your hand in front of the text, but it’s definitely an improvement over what we had before.
The blur is analogous to the noise we add in dithering. We trade off a little added noise for a much better picture of what’s underneath.
Back to audio
For audio, dithering is done by adding noise of a level less than the least-significant bit before rounding to 16 bits. The added noise has the effect of spreading the many short-term errors across the audio spectrum as broadband noise. We can make small improvements to this dithering algorithm (such as shaping the noise to areas where it’s less objectionable), but the process remains simply one of adding the minimal amount of noise necessary to do the job.
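As a minimal sketch of that step, assume the samples are floating-point values in the range -1.0 to 1.0 and that the noise has a triangular distribution made by summing two uniform random values. The triangular shape is a common choice rather than something this article specifies, and the function names are invented for the example.

```python
import math
import random

def tpdf_noise(rng: random.Random) -> float:
    """Triangular-distribution noise in LSB units (sum of two uniform values).

    Assumption: peaks reach +/-1 LSB, but most values are far smaller.
    """
    return rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)

def to_16bit(sample: float, rng: random.Random, dither: bool = True) -> int:
    """Convert a float sample (-1.0..1.0) to a 16-bit integer code."""
    scaled = sample * 32767.0            # now measured in 16-bit LSB units
    if dither:
        scaled += tpdf_noise(rng)        # add the noise BEFORE rounding
    code = math.floor(scaled + 0.5)      # round to the nearest code
    return max(-32768, min(32767, code)) # clamp to the 16-bit range

rng = random.Random(0)                   # fixed seed so runs are repeatable
print(to_16bit(0.25, rng, dither=False)) # the same code every time
print([to_16bit(0.25, rng) for _ in range(8)])  # wobbles between neighboring codes
```

The important detail is the order: the noise goes in before the rounding, so the rounding error no longer follows the signal.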
An added bonus
Besides reducing the distortion of the low-level components, dither lets us hear components below the level of our least-significant bit! How? By jiggling a signal that’s not large enough to cause a bit transition on its own, the added noise pushes it over the transition point for a fraction of the time that is statistically proportional to its actual amplitude level. Our ears and brain, skilled at separating such a signal from the background noise, do the rest. Just as we can follow a conversation in a much louder room, we can pull the weak signal out of the noise.
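Here is a rough way to see this in numbers. A sine wave of only 0.3 LSB is rounded with and without dither, and then a simple correlation against the original frequency stands in, very crudely, for what our ears and brain do. The triangular noise, the seed, and all the names are assumptions made for this sketch.

```python
import math
import random

def quantize(x: float) -> int:
    """Round a level given in LSB units to the nearest integer code."""
    return math.floor(x + 0.5)

def tpdf_noise(rng: random.Random) -> float:
    """Triangular-distribution dither in LSB units (assumed shape)."""
    return rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)

def recovered_amplitude(codes, period: int) -> float:
    """Estimate the sine amplitude hiding in the codes by correlation."""
    acc = sum(c * math.sin(2 * math.pi * k / period) for k, c in enumerate(codes))
    return 2.0 * acc / len(codes)

rng = random.Random(42)
period, n, amplitude = 100, 20000, 0.3   # amplitude is well under one LSB
signal = [amplitude * math.sin(2 * math.pi * k / period) for k in range(n)]

undithered = [quantize(s) for s in signal]
dithered = [quantize(s + tpdf_noise(rng)) for s in signal]

print(recovered_amplitude(undithered, period))  # 0.0: nothing was recorded
print(recovered_amplitude(dithered, period))    # close to 0.3
```

Without dither every code is zero, so there is nothing to find. With dither the individual codes look like noise, yet the average behavior carries the 0.3 LSB signal.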
Going back to our hand-waving analogy, you can demonstrate this principle for yourself. Pick a large text character (or an object around you), and view it by looking through a gap between your fingers. Close the gap so that you can see only a portion of the character in any one position. Now jiggle your hand back and forth. Even though you can’t see the entire character at any one instant, your brain will average and assemble the different views to put the character together. It may look fuzzy, but you can easily discern it.
When do we need to dither?
At its most basic level, dither is required only when reducing the number of bits used to represent a signal. So, an obvious need for dither is when you reduce a 16-bit sound file to eight bits. Instead of truncating or rounding to fit the samples into the reduced word size—creating harmonic and intermodulation distortion—the added dither spreads the error out over time, as broadband noise.
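To make the 16-bit to 8-bit case concrete, the sketch below reduces a quiet sine wave both ways and measures how much third harmonic (a distortion product that was not in the original) ends up in the result. The test signal, the triangular dither, and the function names are all assumptions chosen for illustration.

```python
import math
import random

def reduce_to_8bit(sample16: int, rng: random.Random, dither: bool) -> int:
    """Requantize a 16-bit sample to 8 bits (one 8-bit step = 256)."""
    scaled = sample16 / 256.0
    if dither:
        # Assumed dither: triangular noise, added before rounding
        scaled += rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)
    return max(-128, min(127, math.floor(scaled + 0.5)))

def harmonic_amplitude(samples, period: int, harmonic: int) -> float:
    """Estimate the amplitude of one sine harmonic by correlation."""
    acc = sum(s * math.sin(2 * math.pi * harmonic * k / period)
              for k, s in enumerate(samples))
    return 2.0 * acc / len(samples)

rng = random.Random(1)
period, n = 100, 20000
# A quiet 16-bit sine: its peak is about 0.6 of one 8-bit step
source = [round(154 * math.sin(2 * math.pi * k / period)) for k in range(n)]

plain = [reduce_to_8bit(s, rng, dither=False) for s in source]
dithered = [reduce_to_8bit(s, rng, dither=True) for s in source]

h3_plain = harmonic_amplitude(plain, period, 3)
h3_dithered = harmonic_amplitude(dithered, period, 3)
print(f"third harmonic, rounded only: {abs(h3_plain):.3f}")
print(f"third harmonic, dithered:     {abs(h3_dithered):.3f}")
```

With plain rounding the third harmonic is a sizable fraction of an 8-bit step. With dither it drops to nearly nothing, because the error has been turned into noise that does not line up with any harmonic.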
But there are less obvious reductions in wordlength happening all the time as you work with digital audio. First, when you record, you are reducing from an essentially unlimited wordlength (an analog signal) to 16 bits. You must dither at this point, but don’t bother to check the specs on your equipment—noise in your recording chain typically is more than adequate to perform the dithering!
At this point, if you simply played back what you recorded, you wouldn’t need to dither again. However, almost any kind of signal processing causes a reduction of bits, and prompts the need to dither. The culprit is multiplication. When you multiply two 16-bit values, you get a 32-bit value. You can’t simply discard or round off the extra bits; you must dither.
Any form of gain change uses multiplication, so you need to dither. This means not only when the volume level of a digital audio track is something other than 100%, but also when you mix multiple tracks together (which generally has an implied level scaling built in). And any form of filtering uses multiplication and requires dithering afterwards.
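A gain change might look something like the sketch below. It assumes the gain is stored as a 16-bit fixed-point number where 32768 would mean 1.0, and it uses triangular dither; both are illustrative choices, and the function name is hypothetical.

```python
import math
import random

def apply_gain(sample16: int, gain_q15: int, rng: random.Random) -> int:
    """Scale a 16-bit sample by a 16-bit fixed-point gain, then dither to 16 bits."""
    product = sample16 * gain_q15          # the full product needs up to 32 bits
    scaled = product / 32768.0             # back in 16-bit units, plus a fraction
    scaled += rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)  # assumed dither
    return max(-32768, min(32767, math.floor(scaled + 0.5)))

rng = random.Random(7)
half = 16384                               # gain of 0.5 in this fixed-point format

# 1001 * 0.5 = 500.5, which has no exact 16-bit code
results = [apply_gain(1001, half, rng) for _ in range(20000)]
print(sorted(set(results)))                # only the two neighboring codes
print(sum(results) / len(results))         # average lands close to 500.5
```

No single output sample can hold 500.5, but the dithered results split between the two nearest codes so that, on average, the half step is preserved instead of being thrown away.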
The process of normalizing (adjusting a sound file’s level so that its peaks are at full level) is also a gain change and requires dithering. Some people normalize a signal after every digital edit they make, mistakenly thinking they are maximizing the signal-to-noise ratio. In fact, they are doing nothing except increasing noise and distortion, since the noise level is “normalized” along with the signal and the signal has to be redithered or suffer more distortion. Don’t normalize until you’re done processing and wish to adjust the level to full code.
Your digital audio editing software should know this and dither automatically when appropriate. One caveat is that dithering does require some computational power itself, so the software is more likely to take shortcuts when doing “real-time” processing as compared to processing a file in a non-real-time manner. So, an application that presents you with a live on-screen mixer with live effects for real-time control of digital track mixdown is likely to skimp in this area, whereas an application that must complete its process before you can hear the result doesn’t need to.
Is that the best we can do?
If we use high enough resolution, dither becomes unnecessary. For audio, this means 24 bits (or 32-bit floating point). At that point, the dynamic range is such that the least-significant bit is equivalent to the amplitude of noise at the atomic level, so there is no sense going further. Audio digital signal processors usually work at this resolution, so they can do their intermediate calculations without fear of significant errors, and dither only when it’s time to deliver the result as 16-bit values. (That’s OK, since there aren’t any 24-bit accurate A/D converters to record with. We could compute a 24-bit accurate waveform, but there are no 24-bit accurate D/A converters to play it back on either! Still, a 24-bit system would be great because we could do all the processing and editing we want, then dither only when we want to hear it.)