Convolution is a convoluted topic, and that’s what it means (convoluted, from Merriam-Webster: “Extremely complex and difficult to follow. Intricately folded, twisted, or coiled.”).
Really, it’s more difficult to explain why you would want to use convolution than it is to explain the mathematical function itself. I wrote a more technical article nearly a year ago, and it went unpublished because I didn’t have time to write the interactive and animated graphs that I wanted to accompany it. Revisiting the topic, I decided it was better to explain it in words from an intuitive point of view, and to follow up later with an article on the mathematical implementation, along with audio examples.
I hope that most people are familiar—from either personal experience, or maybe a cartoon—with the effect of an echo off a distant canyon wall. You shout, and moments later you hear your shout repeated back to you, though not as loud (the original in red, the quieter echo in blue):
If we gave an impulse (perhaps firing a starter pistol), we’d hear a response that has the same spacing and amplitude:
Note that we don’t need to go to that canyon to get the same results in a recording studio—we could mix together a shout of “Hello!” with an attenuated and delayed copy of it. Our impulse response tells us, precisely, how much to attenuate and to delay the copy.
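As a rough sketch of that studio mix, assuming the shout is just a short list of sample values and the echo is described by a delay in samples and a gain (the name `add_echo` and the numbers are made up for illustration):

```python
def add_echo(signal, delay, gain):
    """Mix a signal with one delayed, attenuated copy of itself."""
    out = [0.0] * (len(signal) + delay)
    for n, x in enumerate(signal):
        out[n] += x                 # the direct sound
        out[n + delay] += gain * x  # the echo: later and quieter
    return out

shout = [1.0, 0.5, -0.25]           # hypothetical "Hello!" samples
echoed = add_echo(shout, delay=4, gain=0.5)
print(echoed)
```

With a delay shorter than the shout, the two copies overlap and their samples simply add, which is the jumbled case of shouting without pausing.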
Now, consider what happens when you continue to shout instead of pausing to hear the reflection:
A nearby listener would hear the original speech, starting at the beginning (the first pop), and a delayed, quieter copy starting at the time of the second pop. The two speeches would be jumbled together.
Now consider being inside of an empty gymnasium, where you hear not just one discrete echo, but many, including echoes of echoes as the sound bounces between walls. We could get an impulse response of the gym with a starter pistol, and it too would tell us where to overlap copies—of whatever speech or sound might go on in the gym—and their relative volumes.
As you can imagine, piecing together a facsimile from a signal (speech, music…) and the room’s known impulse response gets more complicated (convoluted!) as the impulse response has more features. In the digital realm, our “features” are individual samples, so the complexity is determined by how many samples there are in the impulse response—the longer the impulse response, the greater the number of computations required to scale and add in “copies”. You won’t want to simulate the results of a reverberant room manually like we did with the single-echo example. Fortunately, we can do much better—we can compute the results exactly, given an exact impulse response. We say that we convolve the signal with an impulse response—the process is called convolution (just like we multiply two numbers in a process called multiplication).
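A minimal sketch of that scale-and-add process, assuming both the signal and the impulse response are plain lists of samples (the function name `convolve` is my own, and this direct form is the simplest approach, not the fastest):

```python
def convolve(signal, impulse_response):
    """Direct convolution: add one scaled, delayed copy of the signal
    for every sample of the impulse response."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for k, h in enumerate(impulse_response):  # k is the delay, h the volume
        for n, x in enumerate(signal):
            out[n + k] += h * x
    return out

# One echo, two samples late and half as loud.
result = convolve([1.0, 2.0, 3.0], [1.0, 0.0, 0.5])
print(result)
```

The nested loops make the cost visible: the number of multiply-adds is the signal length times the impulse response length.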
More on getting the impulse response
There are many ways to generate an impulse. Have you ever gone into a nearly empty gym or warehouse and clapped your hands together sharply once, to hear the “sound” of the room? You were analyzing its impulse response. Popping a balloon is another way. A perfect impulse has equal amounts of all frequencies, like white noise condensed into a spike. It’s impossible to attain this ideal impulse, but we can get close enough to handle the audio band. Often, however, impulse responses of large rooms are taken by sweeping a sine wave through the audio band (a “chirp”), because it’s easier to get a more accurate result, and a better signal-to-noise ratio, than by trying to make a loud impulse that’s practically ideal. In essence, a chirp is an impulse spread over time.
In the digital realm, an impulse can be readily approximated by its band-limited version: a single unit sample in the midst of zero samples. To get the impulse response of a digital filter, for instance, run this single-sample impulse through the filter; the impulse response is its output. For an FIR filter, the impulse response is equal to its coefficients (because, conversely, standard FIR filters are normally implemented by convolution).
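To illustrate, here is a small sketch that runs a single-sample impulse through a direct-form FIR filter. The function `fir_filter` and the three coefficients are hypothetical, chosen only to show that the output reproduces the coefficients:

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR: each output is a weighted sum of recent inputs."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:          # ignore inputs from before the start
                acc += c * samples[n - k]
        out.append(acc)
    return out

coeffs = [0.25, 0.5, 0.25]             # hypothetical smoothing filter
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]    # unit sample followed by zeros
response = fir_filter(coeffs, impulse)
print(response)
```

The first three output samples match the coefficients, and the output is zero once the impulse has passed through all the taps.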
And, of course, we can compute an impulse response instead of measuring it. We do this routinely for FIR filters. And to combine two serial FIR filters into one, just convolve their impulse responses (which is to say, their taps). We could calculate the response of an imagined room as well, for use as a reverb effect.
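A quick check of the serial-filter claim, as a sketch with two made-up sets of taps (`taps_a`, `taps_b`) and a direct-form helper `convolve` written only for this example:

```python
def convolve(a, b):
    """Direct convolution of two sample lists."""
    if not a or not b:
        return []
    out = [0.0] * (len(a) + len(b) - 1)
    for k, h in enumerate(b):
        for n, x in enumerate(a):
            out[n + k] += h * x
    return out

taps_a = [0.5, 0.5]     # hypothetical two-point average
taps_b = [1.0, -1.0]    # hypothetical first difference
combined = convolve(taps_a, taps_b)

signal = [1.0, 2.0, 3.0, 4.0]
two_passes = convolve(convolve(signal, taps_a), taps_b)
one_pass = convolve(signal, combined)
print(combined, two_passes == one_pass)
```

Filtering with the combined taps gives the same output as running the two filters one after the other.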
Using convolution for audio effects
For changing signals such as music, longer delays have less correlation, and sound like echoes, while shorter delays cause more frequency cancellation and sound like filtering. This allows us a wide range of tonal and spatial effects for audio via convolution.
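The filtering end of that range can be seen with the shortest possible delay. This sketch assumes an impulse response of two equal samples (an "echo" one sample late) and two made-up test signals, one constant and one alternating at the highest frequency the sample rate allows:

```python
def convolve(a, b):
    """Direct convolution of two sample lists."""
    if not a or not b:
        return []
    out = [0.0] * (len(a) + len(b) - 1)
    for k, h in enumerate(b):
        for n, x in enumerate(a):
            out[n + k] += h * x
    return out

one_sample_echo = [0.5, 0.5]    # direct sound plus a copy one sample later
slow = [1.0] * 8                # lowest frequency: a constant level
fast = [1.0, -1.0] * 4          # highest frequency: alternating samples

slow_out = convolve(slow, one_sample_echo)
fast_out = convolve(fast, one_sample_echo)
print(slow_out)
print(fast_out)
```

Apart from the first and last samples, the constant signal passes through unchanged while the alternating one cancels to zero, so this very short "echo" behaves as a lowpass filter.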
And while I use the term “impulse response” throughout this article, there’s nothing stopping you from convolving any two sounds, including instruments—a trumpet note convolved with a bowed cymbal, for example.
Convolution is a useful tool for reproducing linear, time-invariant effects.
Linear means that the output simply scales with the input at a constant ratio. An identical input signal half as loud produces the same output half as loud. Examples of linear effects are typical fixed filters and echoes. A distortion pedal is non-linear: playing louder creates not just a louder version of the same sound, but a different sound.
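The half-as-loud test can be tried directly. In this sketch, `echo` and `clip` are hypothetical stand-ins for a linear effect and a distortion pedal, and `scales_with_input` checks only the scaling property described here, on one test signal:

```python
def echo(signal):
    """Linear: the input plus a half-volume copy two samples later."""
    out = [0.0] * (len(signal) + 2)
    for n, x in enumerate(signal):
        out[n] += x
        out[n + 2] += 0.5 * x
    return out

def clip(signal, limit=0.5):
    """Non-linear: hard clipping, a crude stand-in for distortion."""
    return [max(-limit, min(limit, x)) for x in signal]

def scales_with_input(effect, signal, gain=0.5):
    """Does a quieter input give the same output, just quieter?"""
    quiet_in = [gain * x for x in signal]
    quiet_out = [gain * y for y in effect(signal)]
    return effect(quiet_in) == quiet_out

test_signal = [0.25, 1.0, -0.75]
print(scales_with_input(echo, test_signal))   # True
print(scales_with_input(clip, test_signal))   # False
```

The sample values are chosen so the floating-point comparison is exact; with arbitrary values a tolerance would be safer.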
Time-invariant means that the impulse response doesn’t change over time. If you input a signal to a time-invariant system right now, the output will sound the same as doing it five minutes from now—nothing changes except the five minutes. A flanger is not time-invariant. Playing right now, your signal might start at the top of the sweep, while playing at an arbitrary time later it might start mid-sweep or at the bottom.
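The same idea can be checked by delaying the input and seeing whether the output is merely delayed by the same amount. In this sketch, `fixed_gain` is a trivially time-invariant effect, and `alternating_gain` is a much simplified, hypothetical stand-in for a swept effect like a flanger, because its behavior depends on when each sample arrives:

```python
def fixed_gain(signal):
    """Time-invariant: every sample is treated the same way."""
    return [0.5 * x for x in signal]

def alternating_gain(signal):
    """Time-varying: the gain depends on the sample's position in time."""
    return [x * (1.0 if n % 2 == 0 else 0.5) for n, x in enumerate(signal)]

def is_time_invariant(effect, signal, delay=1):
    """Compare 'delay, then process' with 'process, then delay'."""
    delayed_in = [0.0] * delay + signal
    delayed_out = [0.0] * delay + effect(signal)
    return effect(delayed_in) == delayed_out

test_signal = [1.0, 2.0, 3.0]
print(is_time_invariant(fixed_gain, test_signal))        # True
print(is_time_invariant(alternating_gain, test_signal))  # False
```

A single impulse response can describe the first effect but not the second, since the second would measure differently depending on when the impulse was fired.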
Convolution is not convenient for time-varying effects, as they would require that the impulse response change constantly. You could do this (cycle through changing, possibly interpolated impulse responses), but that’s not a practical solution for most effects.
Likewise, convolution for non-linear effects would require a different impulse response for different instantaneous levels at the input. To be fully general, that would be for every possible input level (65,536 for 16-bit resolution), though more practically most effects could be done by using far fewer levels and interpolation, because good-sounding audio processes are not completely random: the saturation level of a distortion effect rolls on gradually and monotonically, it doesn’t jump all over the place. Still, convolution loses much of its appeal for non-linear effects, because most non-linear effects can be done more simply other ways.
Even though convolution has been used in filtering since the dawn of digital audio, most musicians are aware of the term from convolution reverb. Convolution reverb is a boon for giving people access to “realistic” acoustic spaces, but it shares all of the limitations, and more, with algorithmic reverb. It’s an exaggeration to say that it puts your instruments in a real space—more like it puts your instruments through a speaker (or speakers) in a physical space, and gives you the sound mic’d at a point in that space (with multiple mics for multiple impulse responses for stereo and other multi-channel effects processing). And you lose the chance of capturing non-linearities and time variations, which may play a part in some spaces.
Want the effect of a sound coming from within a closed cardboard box? Generate an impulse inside the box, and capture it outside the box. Need the effect of someone shouting for help from inside a storm drain for a movie without making the actor climb into the storm drain? Maybe you can lower a sound generator into a storm drain and mic it from the outside, then convolve the actor’s voice with that impulse response—and no one needs to get dirty. The sound of a telephone or other small speaker? Wire an electrical impulse directly, and mic the output to get the response.
The web has many pre-made impulse responses, so we can use spaces that we don’t have access to. Play your pipe organ sample via the room response of a large cathedral—or play your guitar through a classic spring reverb, played through an antique radio…