As an overview, this post covers the Nyquist-Shannon Sampling Theorem, Fletcher-Munson curves/charts, what aliasing is and how anti-aliasing is used to eliminate it, what quantization noise is, and finally, how dithering can be used in various ways, such as improving the average accuracy of sampling across a large number of samples, or masking problems in audio. If you want to watch the video first, here it is:
Although watching the video is the best way to learn about this topic, because of my illustrations on the whiteboard, I've also put a copy of the audio portion of that tutorial video on SoundCloud, for people who would like to download it to listen to in vehicles, while travelling, etc. Here's the audio-only version:
Nyquist-Shannon Sampling Theorem
So why are CDs sampled at 44.1 kHz? If film/video is often shown at between 24 and 30 frames per second, why is audio at more than a thousand times that rate? Why not sample at something like one thousand times per second or a nice round number like 10,000 Hz? Well, first of all, in movies, you aren’t sampling a frequency, you’re showing the equivalent of a photograph. Completely different situations. But as for the 44,100 Hz, we first need to understand the bare essentials of the Nyquist Theorem, which I only touched on very briefly in Audio Tutorial #06.
The Nyquist-Shannon Theorem was named first and foremost after a scientist (Harry Nyquist) who published research in 1928 about pulse samples, although that research wasn’t actually exactly about the Theorem that later bore his name. In fact, quite a few different scientists contributed to the subject. And sometimes it’s just called “The Sampling Theorem.” Personally, I’m glad that Claude Shannon got his name attached. Shannon didn’t invent Boolean algebra (that was George Boole, back in the 19th century), but in the 1930s he showed that it could be used to design switching circuits, which is the foundation of digital computers, and he later founded information theory. Without that work, we would not have computers as we know them. Look him up.
The Nyquist Theorem essentially states that if you’re going to capture an audio signal (record a sound) accurately, your sample rate must be at least double the highest frequency in the signal. Let me break this down. We’re talking about a situation where a real-life sound (analogue) needs to somehow be converted into a digital representation (sampled). Essentially, the more frequently a sound is sampled, the more accurate the results will be: the digital waveform that is created will be closer to whatever the real waveform originally was. So Nyquist basically stated that if you’re looking for the bare minimum acceptable sample rate, you take your highest frequency and double it.
Let me also define a term right now that is important. Whatever sample rate you pick, the “Nyquist frequency” is half that rate. So for CD audio, the Nyquist frequency is 22.05 kHz. For DVD-V, which is sampled at 48 kHz, the Nyquist frequency is 24 kHz.
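To make those two definitions concrete, here is a tiny Python sketch. The function names are my own, made up for illustration, and the numbers are simply the CD and DVD-V rates mentioned above.

```python
def nyquist_frequency(sample_rate_hz: float) -> float:
    """Return the Nyquist frequency, which is half of the sample rate."""
    return sample_rate_hz / 2.0


def minimum_sample_rate(highest_frequency_hz: float) -> float:
    """Return the bare-minimum sample rate: double the highest frequency."""
    return 2.0 * highest_frequency_hz


print(nyquist_frequency(44_100))    # CD audio: 22050.0
print(nyquist_frequency(48_000))    # DVD-V audio: 24000.0
print(minimum_sample_rate(22_050))  # back the other way: 44100.0
```

The two functions are just inverses of each other: pick a sample rate and you get a Nyquist frequency, or pick the highest frequency you care about and you get the lowest sample rate that can capture it.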
Now of course, the math to back this up is complex, but I don’t want to get bogged down in higher mathematics. Think of it this way: If you don’t take enough samples, you’ll get an inaccurate representation of the original audio signal. I’ve talked about that in the accompanying video. But when you take at least two samples for every oscillation, your representation starts to become fairly accurate. Of course, even higher sample rates would be better and more accurate, but “double the highest frequency” is the bare minimum. And you don't want to go too high above the bare minimum, because that starts to consume excessive computer resources with decreasing incremental gains.
Now, think back to what is considered to be the usual range for human hearing: 20 Hz to 20,000 Hz. Since the majority of people can’t hear anything above 20 kHz, when an audio engineer is doing final mastering on a song, he/she will probably put a filter on the track to try to eliminate frequencies above 20 kHz. Why bother keeping them, if nobody can hear them? So that means that once the mastering is done, the highest frequency is supposed to be around 20 kHz. Use Nyquist, and you’ll see that double that number is 40 kHz, which should be our minimum effective sample rate to hear an accurate representation of the audio.
But wait, 40 kHz is not the same as 44.1 kHz! Well, you have to understand that high-cut filters don’t work perfectly at an exact frequency. It’s more of a roll-off. So if you’re trying to cut everything above 20 kHz, you’ll still have a bit of stuff at 21 kHz and 22 kHz coming through, although it’ll be quite diminished. So some sources say that when the people who wrote the standards for CDs were trying to come up with a number, they picked 22.05 kHz as being the highest frequency that really mattered. So double that was 44.1 kHz. And that became the new standard, even though it was a somewhat arbitrary number. Mind you, other sources say that it relates to the fact that video tape was originally used for digital mastering of CDs and give a highly technical (and plausible) proof of the math as related to video standards. And some other sources point out, perhaps just for fun, that 44,100 is the product of the squares of the first four prime numbers (2^2 × 3^2 × 5^2 × 7^2).
Whatever the actual reasoning, the main thing is that people can’t generally hear frequencies above 20 kHz, so the Nyquist Theorem says that audio has to be recorded with a sample rate of at least 40 kHz, and for some reason a slightly more conservative number of 44.1 kHz was picked for CDs, and remains the standard to this day.
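The prime-number curiosity is easy to check for yourself. This is just a quick arithmetic sketch in Python, and the variable names are my own:

```python
from math import prod

first_four_primes = [2, 3, 5, 7]

# Square each prime, then multiply the squares together: 4 * 9 * 25 * 49.
product_of_squares = prod(p ** 2 for p in first_four_primes)
print(product_of_squares)  # 44100

# Same thing written the other way around: (2 * 3 * 5 * 7) squared = 210 squared.
print(prod(first_four_primes) ** 2)  # 44100
```

It doesn't tell you anything about why the number was chosen, but it does confirm that the arithmetic works out.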
A Fletcher-Munson curve is used to represent ranges of "equivalent loudness" at various frequencies. This is a fairly subjective measure, since a person has to estimate the perceived volume of a sound, but tests of large samples of the population have given some fairly detailed results over time. Essentially if you pick a line on the graph, and follow it, you'll be able to see what volume for any particular frequency is required to be "equivalent" in perceived volume to a different frequency at a different actual volume. Here's a chart:
Aliasing and Anti-Aliasing
If an engineer didn’t filter out frequencies above 20 kHz, what would happen? Well, the simple answer is that those frequencies would “still be there” even though we couldn’t hear them. The problem would be that these inaudible frequencies would get sampled. Any frequencies that are higher than half the sample rate (the Nyquist frequency) don’t get sampled accurately. The equipment doing the sampling perceives a different waveform than what it’s actually looking at.
There is actually a mathematical way to predict the “fake” frequency that the A/D converter perceives. For a frequency that falls between half the sample rate and the sample rate itself, it is the sample rate minus the frequency. So if you had audio at 33.1 kHz going through something being sampled at 44.1 kHz, the converter thinks that it is hearing a waveform with a frequency of 44.1 - 33.1 kHz, or 11 kHz. So you get artifacts at 11 kHz in your audio. The 11 kHz frequency is thus called the “alias” of the original frequency, its false identity. To further complicate matters, consider that nearly every real-world sound has harmonics. So a tone at 10 kHz can have a harmonic at 30 kHz (among other frequencies), so you also have to consider the effects of alias problems from those harmonics.
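Here is a small Python sketch of that prediction. The function names are my own. I've written the calculation in a slightly more general "folding" form so that it also works for frequencies above the sample rate, but for anything between the Nyquist frequency and the sample rate it reduces to "sample rate minus frequency." For 33.1 kHz sampled at 44.1 kHz, that works out to 11 kHz.

```python
import math


def alias_frequency(frequency_hz: float, sample_rate_hz: float) -> float:
    """Return the frequency that a tone appears to have once it is sampled.

    The input is folded around the nearest multiple of the sample rate, so
    the result always lands between 0 and the Nyquist frequency.
    """
    nearest_multiple = round(frequency_hz / sample_rate_hz) * sample_rate_hz
    return abs(frequency_hz - nearest_multiple)


def sampled_cosine(frequency_hz: float, sample_rate_hz: float, count: int) -> list:
    """Return the first `count` samples of a cosine wave at the given frequency."""
    return [
        math.cos(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(count)
    ]


print(alias_frequency(33_100, 44_100))  # 11000
print(alias_frequency(30_000, 44_100))  # 14100 (the harmonic of a 10 kHz tone)
print(alias_frequency(10_000, 44_100))  # 10000 (below Nyquist, so unchanged)

# The sampled values of the 33.1 kHz tone and its 11 kHz alias match,
# which is why the converter cannot tell them apart.
high = sampled_cosine(33_100, 44_100, 8)
low = sampled_cosine(11_000, 44_100, 8)
print(all(abs(h - l) < 1e-9 for h, l in zip(high, low)))  # True
```

The last check is the important one: once the samples have been taken, there is nothing left in the data to distinguish the real tone from its alias.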
Anti-aliasing is very simple. It is the name for the process whereby the high frequencies are filtered out so they don’t create aliases. I referred to this already in the previous section: anti-aliasing is basically just the application of a high-cut filter to eliminate the high frequencies that aren’t needed, so they don’t create aliases (artifacts and distortion) in the good, audible part of the frequency spectrum. By the way, anti-aliasing is also used extensively in graphics, and one of the links at the bottom of this post has some good information re. the graphical applications of anti-aliasing.
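If you're curious what a high-cut filter looks like in code, here is a rough Python sketch of one common textbook design, a windowed-sinc low-pass filter. This is only an illustration, not how any particular converter does it. I'm assuming a hypothetical signal that is already available at four times the CD rate (176.4 kHz), and the function names, the 20 kHz cutoff, and the 101-tap length are my own choices.

```python
import math


def lowpass_fir(cutoff_hz: float, sample_rate_hz: float, num_taps: int = 101) -> list:
    """Design a windowed-sinc low-pass (high-cut) filter with a Hamming window."""
    fc = cutoff_hz / sample_rate_hz  # cutoff as a fraction of the sample rate
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response (a sinc), handled separately at x = 0.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        # The Hamming window tapers the ends so the filter has a finite length.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise so very low frequencies pass at gain 1


def gain_at(taps: list, frequency_hz: float, sample_rate_hz: float) -> float:
    """Return how much the filter scales a sine wave at the given frequency."""
    w = 2 * math.pi * frequency_hz / sample_rate_hz
    re = sum(t * math.cos(w * n) for n, t in enumerate(taps))
    im = sum(t * math.sin(w * n) for n, t in enumerate(taps))
    return math.hypot(re, im)


oversampled_rate = 176_400  # assumed: 4 x 44.1 kHz
taps = lowpass_fir(20_000, oversampled_rate)

for freq in (1_000, 10_000, 20_000, 25_000, 33_100):
    print(freq, round(gain_at(taps, freq, oversampled_rate), 4))
```

When you run it, you should see gains close to 1 for the audible frequencies, a partial reduction right around the cutoff (that's the roll-off), and gains close to 0 for something like 33.1 kHz, which is exactly the kind of content that would otherwise turn into an alias.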
When you're taking a sample of an instantaneous signal level (i.e. analogue-to-digital conversion, or ADC), the difference between your recorded or stored value of the measurement and the true value of the signal is called the quantization noise. Basically, this error is caused by rounding or truncation of data during the sampling of the signal. It can also happen during signal processing and data communication. So in other words, quantization noise consists of the minor errors in accuracy during any of these processes. Luckily, if quantization noise becomes a problem in your audio, it might be possible to mitigate that with the use of dither.
When calculations are performed on audio data, certain patterns arise. That’s because the calculations are deterministic: the results are the same no matter how many times you repeat the calculation. Through a complicated process, these calculations can produce audio artifacts in consistent parts of the frequency spectrum that the human ear can notice slightly. The process of reducing the bit depth from 24-bit to 16-bit can cause those same unwanted patterns. We want to get rid of those patterns, to make the audio sound smoother. And as noted above, we can also have problems with quantization noise that occurs during the sampling process.
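A quick Python sketch shows where that rounding error comes from. I'm assuming the usual convention of representing a signal as a number between -1.0 and 1.0 and rounding it to a signed integer grid. The function name is my own.

```python
def quantize(value: float, bits: int) -> float:
    """Round a value between -1.0 and 1.0 to the nearest level available at
    the given bit depth, and return the rounded value as a float."""
    levels = 2 ** (bits - 1)                    # e.g. 32768 for 16-bit
    code = round(value * levels)                # nearest integer code
    code = max(-levels, min(levels - 1, code))  # stay within the valid codes
    return code / levels


true_value = 0.3
for bits in (4, 8, 16, 24):
    stored = quantize(true_value, bits)
    error = stored - true_value
    print(bits, stored, error)
```

At 4 bits, the value 0.3 gets stored as 0.25, which is a large error. As the bit depth goes up, the stored value gets closer and closer to 0.3, but the error never quite disappears. As long as the signal isn't clipping, the error is never larger than half of one step.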
Dithering is a process by which a tiny bit of random “noise” is added during processing, and it has the effect of “smoothing out” anomalies. A real-world attempt at an analogy? Let’s say that you’ve got a pool of water that is perfectly still except for the fact that there is a bag of golf balls hanging over it, and a golf ball drops out of the bag into the water once every three seconds. That disturbance, where the golf balls keep hitting, is very obvious. However, if in addition to the golf ball, there are tons of small pebbles landing all over the surface randomly, the disturbance of the golf ball is a lot less obvious. The other small bits of noise help “drown out” the obvious disturbance. I guess that a more realistic analogy would be on a golf course. If you shank a ball into a water trap on a calm day, it’s easy to see it land in the water. But if there is rain disturbing the surface of the water, it’s a lot harder to notice the golf ball hitting. Think of the obvious disturbance of the golf ball as being analogous to the audio artifact that we need to mask, and the constant disturbances from the rain as being our noise for dithering.
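Here is a small Python sketch of the idea. It uses one common kind of dither, triangular (TPDF) noise made by adding two uniform random numbers together. That choice, the function names, and the test level of 0.3 of a step are my own assumptions for illustration, not a description of what your audio editor does internally.

```python
import random


def tpdf_dither(rng: random.Random) -> float:
    """Return triangular noise between -1 and +1 step (two uniform values added)."""
    return rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)


def average_output(level: float, use_dither: bool,
                   count: int = 20_000, seed: int = 1) -> float:
    """Quantize the same level many times and return the average stored value.

    `level` is measured in quantization steps, so 0.3 means a signal that is
    smaller than one step.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(count):
        noise = tpdf_dither(rng) if use_dither else 0.0
        total += round(level + noise)  # round to the nearest whole step
    return total / count


print(average_output(0.3, use_dither=False))  # 0.0: the detail is lost every time
print(average_output(0.3, use_dither=True))   # close to 0.3
```

Without dither, a level of 0.3 of a step rounds down to zero every single time, so that detail is simply gone. With dither, any individual sample is noisier, but on average the stored values come out very close to 0.3. The consistent error has been traded for a small amount of random noise, which is much less objectionable to the ear.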
The availability of excellent dithering algorithms on most systems today, combined with 24-bit recording capabilities (which means the noise floor in a digital system is extremely low) means that you don’t really have to worry about recording signals at a fairly low level and then having to deal with lower-resolution quantization noise, or systemic noise. So when you’re recording a multi-track project, you don’t have to try to get every single track up around -5 to -3 dBFS for best results. You can probably record everything down around -12 to -10 dBFS and give yourself lots of headroom to work with during mixing, without running into noise problems.
If you’ve done your project at one bit depth and want to reduce the bit depth of the final result (i.e. converting a 24-bit session to a 16-bit track destined for CD), you take that final version of your song and convert it. There will usually be an option in your audio editor that asks if you want to apply dither when converting. There are also lots of complicated options and algorithms that can be applied, with respect to dither types and noise-shaping. That’s beyond the level of discussion that we want to get into today. Just go with the defaults if you’re not sure what to pick. If things sound funny after the conversion, try again with a different algorithm.
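You can put rough numbers on that headroom argument. In this Python sketch I'm assuming that the recording levels are measured in dB relative to full scale (dBFS), and I'm using the simple rule of thumb that each bit adds about 6 dB of range. The function names are my own.

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Approximate range between full scale and one quantization step, in dB."""
    return 20 * math.log10(2 ** bits)


def dbfs_to_amplitude(dbfs: float) -> float:
    """Convert a level in dBFS to a fraction of full scale (0 dBFS = 1.0)."""
    return 10 ** (dbfs / 20)


print(round(dynamic_range_db(16), 1))   # 96.3
print(round(dynamic_range_db(24), 1))   # 144.5
print(round(dbfs_to_amplitude(-12), 3)) # 0.251

# Range left between a -12 dBFS peak and the bottom of a 24-bit system.
print(round(dynamic_range_db(24) - 12, 1))  # 132.5
```

So even if your peaks sit at -12 dBFS in a 24-bit recording, you still have far more range underneath them than a 16-bit CD has in total.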
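For the curious, the 24-bit to 16-bit conversion described above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (plain triangular dither, no noise-shaping, and a made-up function name). Real audio editors offer more sophisticated options.

```python
import random


def reduce_24_to_16(samples_24bit, use_dither: bool = True, seed: int = 0) -> list:
    """Convert signed 24-bit integer samples to signed 16-bit integer samples."""
    rng = random.Random(seed)
    converted = []
    for sample in samples_24bit:
        scaled = sample / 256  # one 16-bit step is worth 256 24-bit steps
        if use_dither:
            # Triangular dither, up to one 16-bit step in either direction.
            scaled += rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)
        code = round(scaled)
        converted.append(max(-32768, min(32767, code)))  # keep within 16-bit range
    return converted


quiet_detail = [100] * 20_000  # a level smaller than one 16-bit step

plain = reduce_24_to_16(quiet_detail, use_dither=False)
dithered = reduce_24_to_16(quiet_detail, use_dither=True)

print(sum(plain) / len(plain))        # 0.0: the quiet detail vanishes
print(sum(dithered) / len(dithered))  # close to 100 / 256, about 0.39
```

Without dither, everything smaller than half of a 16-bit step just disappears. With dither, that quiet detail survives on average, hidden inside a very low level of noise.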
Obviously, I’ve covered these subjects in a fairly superficial manner. Baby steps. Hopefully, if you watched the video, that gave you a lot of additional insight. Now you know the general theory behind these subjects that are important to audio engineers. If you want to do further research on your own, I’ll put some links here now. Be forewarned! The physics and mathematics behind these topics can be pretty intense! Especially with dithering algorithms.
If you’ve read all the way through this, you obviously want to learn more about audio recording and music production work. I don’t have a ton of written tutorials like this online, but I do have quite a few detailed YouTube videos that you might enjoy. I've got an organized list of those videos in the index of my "videos" page on my main website. If you're interested in any of those topics, you should bookmark this page right now:
Thanks for your interest in this series, and thanks for sharing this post or links to any of the videos.
Follow Jonathan Clark on other sites:
Main Site: www.djbolivia.ca
Music Blog: djbolivia.blogspot.ca
If you enjoy my tutorials, and want to make a small donation to help purchase additional video equipment to use in future tutorials, here's my Bitcoin wallet address with a QR version: 19VhVFnw76Vor86SDoN2CSLcarQeZZqysE