The 8000-samples-per-second PCM signal, at 16 bits per sample, results in 128,000 bits per second of information. That's fairly high, especially in the world of wireline telephone networks, in which every bit represented some collection of additional copper lines that needed to be laid in the ground. Therefore, the concept of audio compression was brought to bear on the subject.
An audio or video compression mechanism is often referred to as a codec, short for coder-decoder. The reason is that the compressed signal is often thought of as being in a code: some sequence of bits that is meaningful to the decoder but not much else. (Unfortunately, in anything digital, the term code is used far too often.)
The simplest coder that can be thought of is a null codec. A null codec doesn't touch the audio: you get out what you put in. More meaningful codecs reduce the amount of information in the signal. All lossy compression algorithms (and most audio and video codecs are lossy) stem from the realization that the human mind and senses cannot detect every slight variation in the media being presented. There is a lot of noise that can be added, in just the right ways, and no one will notice, because we are more sensitive to certain types of variations than to others.

For audio, we can think of it this way. As you drive along the highway, listening to AM radio, there is always some amount of noise creeping in, whether from your car passing behind a concrete building, or under power lines, or behind hills. This noise is always there, but you don't always hear it. Sometimes the noise is excessive, and the station becomes annoying to listen to or incomprehensible, drowned out by static. Other times, however, the noise is there but does not interfere with your ability to hear what is being said. The human mind is able to compensate for quite a lot of background noise, silently deleting it from perception, as anyone who has noticed the refrigerator's compressor stop, or realized that a crowded, noisy room has just gone quiet, can attest. Lossy compression, then, is the art of knowing which types of noise the listener can tolerate, which they cannot stand, and which they might not even be able to hear.
(Why noise? Lossy compression is a method of deleting information, which may or may not be needed. Clearly, every bit is needed to restore the signal to its original sampled state. Deleting a few bits requires that the decompressor, or decoder, restore those deleted bits' worth of information on the other end, filling them in with whatever the algorithm states is appropriate. The result is a signal that differs from the original, and that difference is distortion. Subtract the two signals, and the resulting difference signal is the noise that was added to the original signal by the compression algorithm. One need only amplify this noise signal to appreciate how it sounds.)
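To make this concrete, here is a toy sketch. The "codec" is a hypothetical quantizer that simply zeroes out low bits; subtracting the decoded signal from the original exposes the noise the compression added:

```python
import math

def quantize(sample: int, dropped_bits: int) -> int:
    """Crudely 'compress' by deleting the low bits; the 'decoder'
    fills the deleted bits back in with zeros."""
    return (sample >> dropped_bits) << dropped_bits

# A hypothetical test signal: a 16-bit, 8000-samples-per-second sine tone.
original = [int(20000 * math.sin(2 * math.pi * 440 * n / 8000))
            for n in range(8000)]
decoded = [quantize(s, 8) for s in original]

# Subtracting the two signals yields the noise added by the "codec".
noise = [o - d for o, d in zip(original, decoded)]

def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))

print("signal RMS:", round(rms(original)), "noise RMS:", round(rms(noise)))
```

Amplifying `noise` (multiplying it up and playing it back) is exactly the exercise the paragraph above suggests.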
1 G.711 and Logarithmic Compression
The first, and simplest, lossy compression codec for audio that we need to look at is called logarithmic compression. Sixteen bits is a lot to encode the intensity of an audio sample. Sixteen bits was chosen because it has fine enough detail to adequately represent the variations of the softer sounds that might be recorded. But louder sounds do not need such fine detail. The higher the intensity of the sample, the more detailed the 16-bit sampling is relative to that intensity. In other words, the 16-bit resolution was chosen conservatively, and is excessively precise for higher intensities. As it turns out, higher intensities can tolerate more error than lower ones in a relative sense as well: a higher-intensity sample may tolerate four times as much error as a signal half as intense, rather than the two times you would expect for a linear process. The reason has to do with how the ear perceives sound, and is why sound levels are measured in decibels. This is precisely what logarithmic compression exploits. Convert the intensities to decibels, where a 1 dB change sounds roughly the same at all intensities, and a good half of the 16 bits can be thrown away. Thus, we get a 2:1 compression ratio.
The ITU G.711 standard is the first common codec we will see, and it uses this logarithmic compression. There are two flavors of G.711: μ-law and A-law. μ-law is used in the United States, and bases its compression on a discrete form of taking the logarithm of the incoming signal. First, the signal is reduced to a 14-bit signal by discarding the two least-significant bits. Then, the signal is divided into ranges, each range having 16 intervals (four bits' worth), with twice the spacing of the next smaller range. Table 1 shows the conversion table.
Input Range | Number of Intervals in Range | Spacing of Intervals | Left Four Bits of Compressed Code | Right Four Bits of Compressed Code
---|---|---|---|---
8158 to 4063 | 16 | 256 | 0x8 | number of interval
4062 to 2015 | 16 | 128 | 0x9 | number of interval
2014 to 991 | 16 | 64 | 0xa | number of interval
990 to 479 | 16 | 32 | 0xb | number of interval
478 to 223 | 16 | 16 | 0xc | number of interval
222 to 95 | 16 | 8 | 0xd | number of interval
94 to 31 | 16 | 4 | 0xe | number of interval
30 to 1 | 15 | 2 | 0xf | number of interval
0 | 1 | 1 | 0xf | 0xf
−1 | 1 | 1 | 0x7 | 0xf
−2 to −31 | 15 | 2 | 0x7 | number of interval
−32 to −95 | 16 | 4 | 0x6 | number of interval
−96 to −223 | 16 | 8 | 0x5 | number of interval
−224 to −479 | 16 | 16 | 0x4 | number of interval
−480 to −991 | 16 | 32 | 0x3 | number of interval
−992 to −2015 | 16 | 64 | 0x2 | number of interval
−2016 to −4063 | 16 | 128 | 0x1 | number of interval
−4064 to −8159 | 16 | 256 | 0x0 | number of interval
The number of the interval is where the input falls within the range. For example, 90 maps to 0xee: 90 − 31 = 59, and 59 ÷ 4 = 14.75, which rounds down to interval 14, or 0xe. (Of course, the original 16-bit signal was four times, or two bits, larger, so 360 would have been one such 16-bit input, as would any number between 348 and 363. This range represents the loss of information, as 348 and 363 come out the same.)
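The positive half of the table can be sketched in code. The function below follows the table's conventions as described here; it is an illustration, not a bit-exact G.711 implementation (the real μ-law codec, for instance, also complements bits of the output byte for transmission, and the negative half is omitted):

```python
# (low, high, spacing, left_code) for positive 14-bit inputs,
# transcribed from the positive half of Table 1.
SEGMENTS = [
    (4063, 8158, 256, 0x8),
    (2015, 4062, 128, 0x9),
    (991, 2014, 64, 0xa),
    (479, 990, 32, 0xb),
    (223, 478, 16, 0xc),
    (95, 222, 8, 0xd),
    (31, 94, 4, 0xe),
    (1, 30, 2, 0xf),
]

def mulaw_encode_positive(sample16: int) -> int:
    """Compress a non-negative 16-bit sample per Table 1 (sketch only)."""
    v = sample16 >> 2            # discard two LSBs: 16 bits -> 14 bits
    if v == 0:
        return 0xff              # zero has its own code in the table
    if v > 8158:
        return 0x8f              # clip into the topmost interval
    for low, high, spacing, left in SEGMENTS:
        if low <= v <= high:
            interval = (v - low) // spacing   # where v falls in the range
            return (left << 4) | interval
```

Running `mulaw_encode_positive(360)` reproduces the worked example: any 16-bit input from 348 through 363 comes out as 0xee.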
A-law is similar, but uses a slightly different set of spacings, based on an algorithm that is easier to see when the numbers are written out in binary form. The process is simply to take the binary number and encode it by saving only four significant digits (beyond the implied leading one), and to record the base-2 (binary) exponent. This is how floating-point numbers are encoded. Let's look at the previous example. The number 360 is encoded in 16-bit binary as
0000 0001 0110 1000
with spaces placed every four digits for readability. A-law uses only the top 13 bits, so the three least-significant bits are dropped, leaving 45. As this number is unsigned, it can be represented in floating point as

1.01101 (binary) × 2⁵.
The first four significant digits (ignoring the first 1, which must be there for us to write the number in binary scientific notation, or floating point) are "0110", and the exponent is 5. A-law then records the number as
0001 0110
where the first bit is the sign (0), the next three are the exponent minus four (here, 5 − 4 = 1, or 001), and the last four are the significant digits.
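This floating-point-style encoding can be sketched for non-negative samples as follows. It illustrates the scheme described above and is not a bit-exact A-law implementation (the real codec also toggles alternate bits of the final byte for transmission, and small values below the first normalized range are handled linearly, as assumed here):

```python
def alaw_encode_positive(sample16: int) -> int:
    """Encode a non-negative 16-bit sample floating-point style (sketch)."""
    v = sample16 >> 3                  # A-law uses only the top 13 bits
    if v < 32:                         # smallest range: linear, exponent field 0
        return v >> 1
    exponent = v.bit_length() - 1      # position of the leading 1 bit
    mantissa = (v >> (exponent - 4)) & 0x0f   # four bits after the leading 1
    return ((exponent - 4) << 4) | mantissa   # sign bit 0 for positive
```

For the worked example, `alaw_encode_positive(360)` yields 0b00010110: 360 truncates to 45 = 1.01101₂ × 2⁵, the exponent field is 5 − 4 = 1, and the mantissa is 0110.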
A-law is used on European telephone systems. For voice over IP, either flavor will usually work, and most devices speak both, no matter where they are sold. The distinctions are now mostly historical.
G.711 compression preserves the number of samples, and encodes each sample independently of the others. Therefore, it is easy to figure out how the samples can be packaged into packets or blocks: they can be cut arbitrarily, and a byte is a sample. This makes the codec quite flexible for voice mobility, and it should be a preferred option.
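A quick illustration of why packetization is so simple: one byte per sample at 8,000 samples per second means the payload size is just the packet duration times the rate. (The 20 ms interval used below is a common packetization choice, not something mandated by G.711.)

```python
SAMPLE_RATE = 8000       # samples per second
BYTES_PER_SAMPLE = 1     # one byte per sample after G.711 compression

def g711_payload_bytes(packet_ms: int) -> int:
    """Payload size, in bytes, of a G.711 packet of the given duration."""
    return SAMPLE_RATE * packet_ms // 1000 * BYTES_PER_SAMPLE

print(g711_payload_bytes(20))  # a 20 ms packet carries a 160-byte payload
```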
Error concealment, or packet loss concealment (PLC), is the means by which a codec can recover from packet loss, by faking the sound at the receiver until the stream catches up. G.711 has an extension for this, known as G.711I, or G.711 Appendix I. The most trivial error concealment technique is to play silence, which does not really conceal the error. A better technique is to repeat the last valid sample set (usually, a 10 ms or 20 ms packet's worth) until the stream catches up. The problem is that, should the last sample set have contained a plosive (any of the consonants that have a stop to them, like p, d, t, or k), the plosive will be repeated, producing an effect reminiscent of a quickly skipping record player or a 1980s science-fiction television character.[*] Appendix I states that, to avoid this effect, the previous samples should be tested for the fundamental wavelength, and then blocks of those wavelengths should be cross-faded together to produce a more seamless recovery. This is a purely heuristic scheme for error recovery, and competes, to some extent, with just repeating the last segment and then going silent.
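The repeat-the-last-packet technique can be sketched as follows. This is the trivial approach with an added fade-out, so a trapped plosive dies away rather than stuttering; it is not the pitch-matched cross-fading of Appendix I:

```python
def conceal(last_good_frame, lost_frames, fade=0.5):
    """Trivial packet loss concealment: replay the last good frame once
    per lost packet, attenuating each repetition so the replayed sound
    fades toward silence instead of skipping like a record player."""
    out = []
    gain = 1.0
    for _ in range(lost_frames):
        gain *= fade
        out.append([int(s * gain) for s in last_good_frame])
    return out

# Two lost packets: the substitute frames are half, then a quarter, as loud.
print(conceal([1000, -1000], 2))
```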
In many cases, G.711 is not even mentioned when it is being used. Instead, the codec may be referred to as PCM with μ-law or A-law encoding.
2 G.729 and Perceptual Compression
ITU G.729 and the related G.729a specify a more advanced encoding scheme, one that does not work sample by sample. Rather, it uses mathematical rules to relate neighboring samples to one another. The incoming sample stream is divided into 10 ms blocks (with 5 ms from the next block also required), and each block is then analyzed as a unit. G.729 provides a 16:1 compression ratio, as each incoming block of 80 samples at 16 bits each (160 bytes) is brought down to a ten-byte encoded block.
The concept behind G.729 compression is to use perceptual compression to classify the type of signal within the 10 ms block. The goal is to figure out how neighboring samples relate. Surely they do relate, because they come from the same voice and the same pitch, and pitch is a concept that requires time (thus, more than one sample). G.729 uses a couple of techniques to figure out what the block must "sound like," so that it can throw away much of the signal and transmit only the description of the sound.
To figure out what the sample block sounds like, G.729 uses Code-Excited Linear Prediction (CELP). The idea is that the encoder and decoder share a codebook of the basics of sounds. Each entry in the codebook can be used to generate some type of sound. G.729 maintains two codebooks: one fixed, and one that adapts with the signal. The model behind CELP is that the human voice is basically created by a simple set of flat vocal cords, which excite the airways. The airways (the mouth, tongue, and so on) are then thought of as signal filters, which have a rather specific, predictable effect on the sound coming up from the throat.
The signal is first brought in, and linear prediction is applied. Linear prediction tries to relate the samples in the block to the previous samples, and finds the optimal mapping. ("Optimal" does not always mean "good," as there is almost always an optimal way to approximate a function using a fixed number of parameters, even if the approximation is dead wrong.) The excitation represents the overall type of sound, a hum or a hiss, depending on the word being said. This is usually a simple sound, an "uhhh" or "ahhh." The linear predictor figures out how the humming gets shaped, as a simple filter. What's left over, then, is how the sound started in the first place: the excitation that makes up the more complicated, nuanced part of speech. The linear prediction's effects are removed, and the remaining signal is the residue, which must relate to the excitations. The nuances are looked up in the codebook, which contains some common residues and some others that are adaptive. Together, the information needed for the linear prediction and the codebook matches is packaged into the ten-byte output block, and the encoding is complete. The encoded block contains information on the pitch of the sound, the adaptive and fixed codebook entries that best match the excitation for the block, and the linear prediction match.
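A toy example may help show what linear prediction buys. The sketch below fits a one-tap predictor by least squares (G.729's predictor is tenth-order, over a 10 ms block, so this is a simplification) and shows that, for a predictable signal, the residual carries far less energy than the signal itself:

```python
import math

def lpc_order1(signal):
    """Fit the least-squares one-tap predictor x[n] ~ a * x[n-1] and
    return the coefficient and the residual (the 'excitation' left over)."""
    num = sum(signal[n] * signal[n - 1] for n in range(1, len(signal)))
    den = sum(signal[n - 1] ** 2 for n in range(1, len(signal)))
    a = num / den
    residual = [signal[n] - a * signal[n - 1] for n in range(1, len(signal))]
    return a, residual

# A stand-in for a voiced block: a slowly varying tone at 8 kHz is highly
# predictable, so one tap removes most of its energy.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(80)]
a, residual = lpc_order1(tone)

def energy(xs):
    return sum(x * x for x in xs)

print("residual/signal energy ratio:", round(energy(residual) / energy(tone), 4))
```

Only the predictor coefficient and a compact description of the residual need to be transmitted, which is the heart of the bit savings.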
On the other side, the decoding process looks up the excitations in the codebooks. These excitations are then filtered through the linear predictor. The hope is that the results sound like human speech. And, often, they do. However, anyone who has used a cellphone is aware that, at times, it can render human speech into a facsimile that sounds quite like the person talking, but is made up of no recognizable syllables. That results from a CELP decoder struggling with a lossy channel, where some of the information is missing, and it is forced to fill in the blanks.
G.729a is an annex to G.729, a modification that uses a simpler structure to encode the signal. It is compatible with G.729, and so the two can be thought of as interchangeable for the purposes of this discussion.
3 Other Codecs
There are other voice codecs beginning to appear in the context of voice mobility. These codecs are not as prevalent as G.711 and G.729 (some are not available outside of softphones and open-source implementations), but they are worth a brief mention. These newer coders focus on improved error concealment, better delay or jitter tolerance, or a richer sound. One such example is the Internet Low Bitrate Codec (iLBC), which is used in a number of consumer peer-to-peer voice applications such as Skype.
Because the overhead of packets on most voice mobility networks is rather high, finding the highest amount of compression should not be the aim when establishing the network. Instead, it is better to find the codecs that are supported by the equipment and provide the highest quality of voice over the expected conditions of the network. For example, G.711 is fine in many conditions, and G.729 might not be necessary.