Sunday, January 3, 2010

Codecs

When speech is carried on the Internet, it is of course carried in digital form. Speech is also carried digitally through most portions of modern telephone networks (although in the PSTN, it is still normally converted to analog form on the last mile of transmission over analog telephone lines). Having the speech signals available digitally provides the opportunity to use digital speech processing techniques, and particularly speech coders, which can compress the digital bit stream to low bit rates—with trade-offs against increased delay, implementation complexity/cost, and quality. In this section, we will discuss motivations for speech compression, review the basics of coding, discuss some specific coders, and look at the trade-offs that must be understood to decide which kind of coding is appropriate for specific applications (and, in particular, whether compression is desirable).

Motivations for Speech Coding in Internet-Telephony Integration

Classically, speech compression techniques have been used in situations where bandwidth is limited or very expensive. For example, prior to the development of high-bandwidth digital fiber-optic undersea cables, undersea telephone trunks were carried on very expensive analog coaxial cable systems, and the utilization of these expensive trunks was increased by the use of voice activity detection (VAD) techniques. The technique used by these early systems, known as time assignment speech interpolation (TASI), was to detect when speech was present on a given channel and then, in real time, switch it through to one of a set of undersea cable trunks. When the channel went silent due to a pause in the conversation (typically caused by Party A pausing to listen to Party B’s response—in normal, polite conversation, there are only short intervals when both parties talk simultaneously!), the channel would be disconnected and the trunk given over to another active speaker. Given the activity factor of normal conversation, these systems could achieve compression of about 2:1.

Another situation in which the need for speech compression seems obvious is that of wireless transmission—for example, in cellular telephony systems. Over-the-air bandwidth is limited both by government regulation and by the potential for multiple users of the free-space channel to interfere with each other. In addition, due to the tremendously high growth rate of cellular telephony, more and more callers seek to use the same limited bandwidth. As a result, modern cellular telephony standards do indeed provide speech coder options, which achieve various levels of compression and increased utilization of the scarce wireless bandwidth.

Turning to the subject—the integration of the Internet and telephony—the situation is somewhat less clear. On the telephony side, if we are talking about wireline telephony, the cost of bandwidth is a relatively small part of the total cost of providing service. On the Internet side, one of the defining characteristics of the explosive growth pattern that we have seen in recent years has been the utilization of higher- and higher-bandwidth channels to interconnect nodes in the Internet. To a great degree, this is caused by (and, in turn, enables) the running of multimedia applications over the Internet and especially the World Wide Web, involving the transfer of large image files, audio and video streams, and so on. With so much bandwidth, and with so many applications apparently using a larger average bandwidth than ordinary telephone speech, why would anyone want to compress speech on the Internet?

In fact, there are several possible reasons. One simple reason is that bandwidth for access to the Internet is often quite severely constrained, as is the case for consumers who dial up using modems with rates of 56 kbps or lower. In this case, voice and all other active applications have to share a bandwidth that is narrower than that normally provided in the telephone network for speech alone. Another motivation may be integration with wireless networks. As noted earlier, wireless voice typically uses coding for speech compression in order to conserve the scarce, expensive over-the-air bandwidth. So, wireless voice over IP may employ compression for exactly the same reason. Even in the case of an end-to-end path that is only partly wireless, keeping the voice encoded in a compressed format would avoid the degradation of voice quality that can come with repeated encoding and decoding (so-called tandem encoding).

Another whole class of applications for voice coding on the Internet may be revealed if we allow ourselves to step outside the traditional telephony definitions of 4 kHz, point-to-point, real-time voice. For example, integrated messaging applications require the storage of voice signals, and compression can then be important in reducing storage requirements.

Similarly, the flexibility afforded by an all-digital medium may be used to encode voice (or audio) for higher quality, including preserving more of the bandwidth of the input signal and employing multichannel communication techniques (for example, stereo). Applications include the creation of highly realistic teleconferences and the transmission of music. For these applications, efficient coding techniques may be used to keep the consumption of bandwidth by the high-quality service to reasonable levels.

Broadcast applications constitute another class whose needs differ from those of traditional point-to-point telephony. Broadcasting to n locations can consume n times the bandwidth of a point-to-point transmission, so there is a high motivation to code for compression. At the same time, broadcast applications typically are more tolerant of delay, which means that more sophisticated and complex coding algorithms may be used without introducing noticeable impairments.

Voice Coding Basics

Some specific references are given on the subject of voice coding in our bibliography. In this section, our goal is to provide some intuitive understanding of how coders work so that the material that follows makes sense to the general reader.

Everyone probably understands that voices and other sounds in general may differ in frequency (or, in musical terms, pitch). A woman’s voice is usually higher in frequency (or pitch) than a man’s, and a child’s may be higher still. The music of a bassoon is lower in frequency than that of a piccolo. Also, complex sounds actually consist of more than one frequency. In music, the overtones of the saxophone differ from those of the oboe, so that these two instruments sound different even if both are playing the same note. Similarly, we can distinguish between the voices of two people whom we know, even if their voices are about equal in overall pitch, due to the complex structure of frequencies produced by the human voice, which differs in detail from individual to individual.

An idea that is quite fundamental to voice coding is that the range of important frequencies in a given sound (and, in particular, the sound of the human voice) is limited. We need only reproduce this range of frequencies for the transmitted sound to be recognizable and useful. It is true that the quality and naturalness of the sound will be better if more frequencies are transmitted. However, we are all familiar with an example in which very useful voice transmission is accomplished using a quite limited range of frequencies: in telephone networks, frequencies higher than 4 kHz (4000 vibrations per second) are not transmitted, and, in fact, the range actually carried is closer to 200 to 3400 Hz. Nonetheless, we can not only understand what is said, but usually we can recognize and distinguish familiar voices as well.

Digital coding starts with sampling the continuous sound waveform at discrete intervals (see Figure 1). An important theorem, the sampling theorem, states that the waveform may be completely reproduced from its samples if the sampling rate is at least twice as great as the highest frequency contained in the sound. Now you can see why we were so concerned with the range of frequencies to be transmitted: it tells us what sampling rate is needed. For telephone speech that is limited to 4 kHz, the sampling rate is 2 × 4000 = 8000 samples per second.

Figure 1: Sound sampling.
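
The sampling-rate arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the second function shows the standard folding (aliasing) behavior that explains why frequencies above half the sampling rate have to be filtered out before sampling.

```python
def min_sampling_rate_hz(highest_frequency_hz: float) -> float:
    """Sampling rate required by the sampling theorem: twice the highest frequency."""
    return 2.0 * highest_frequency_hz


def apparent_frequency_hz(tone_hz: float, sampling_rate_hz: float) -> float:
    """Frequency at which a pure tone shows up after sampling.

    Tones above half the sampling rate fold back into the band below it.
    """
    folded = tone_hz % sampling_rate_hz
    return min(folded, sampling_rate_hz - folded)


if __name__ == "__main__":
    print(min_sampling_rate_hz(4000))          # telephone speech limited to 4 kHz
    print(apparent_frequency_hz(3000, 8000))   # inside the band: unchanged
    print(apparent_frequency_hz(5000, 8000))   # outside the band: folds back
```

With an 8000-samples-per-second coder, a 5000 Hz tone would be indistinguishable from a 3000 Hz tone, which is one way to see why the input is band-limited to 4 kHz first.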

Each of these samples is a measurement, typically of an electrical voltage somewhere inside the coding system. However, in order to be carried in packets on the Internet (or, indeed, on a digital telephone network), the output of the coder needs to be a string of bits. This is accomplished by encoding each voltage measurement sample as a binary number. Another critical parameter is how many bits will be used to encode each sample. Research done over 40 years ago showed that excellent speech reproduction could be achieved with 8 bits per sample. At 8 bits per sample and 8000 samples per second, this implies a bit rate out of the coder of 64,000 bits per second (64 kbps).
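
As a rough sketch of the step just described, the following Python fragment turns a voltage measurement into an 8-bit binary code and computes the resulting bit rate. The uniform quantizer and the normalized full-scale range of ±1.0 are our simplifying assumptions, not a description of any particular coder.

```python
def pcm_bit_rate_bps(sampling_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate out of a PCM coder: samples per second times bits per sample."""
    return sampling_rate_hz * bits_per_sample


def quantize_uniform(voltage: float, full_scale: float = 1.0, bits: int = 8) -> int:
    """Map a voltage in [-full_scale, +full_scale] to an integer code 0 .. 2**bits - 1."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    clipped = max(-full_scale, min(full_scale, voltage))  # out-of-range inputs are clipped
    code = int((clipped + full_scale) / step)
    return min(code, levels - 1)  # the top edge of the range belongs to the last level


if __name__ == "__main__":
    print(pcm_bit_rate_bps(8000, 8))             # 64000
    print(format(quantize_uniform(0.0), "08b"))  # mid-scale sample as 8 bits
```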

The system that we have just described is the most basic form of digital coder, called a pulse-code modulation (PCM) system. In practice, one additional step is taken to improve the performance of such coders and, in fact, to ensure that 8 bits per sample is sufficient to encode the samples. This is companding, in which the dynamic range (range between maximum and minimum values) is compressed at the coder and then decompressed (expanded) at the decoder. Telephone systems in the United States and some other places use a companding formula called μ-law, and those in Europe and some other places use a companding formula called A-law, leading to one of those troublesome international standards differences! Both the A-law and μ-law systems have been standardized by ITU-T as G.711.
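
To give a feel for companding, here is a minimal Python sketch of the continuous μ-law curve with μ = 255, operating on samples normalized to the range -1.0 to +1.0. It is an illustration of the idea only: G.711 itself specifies a segmented approximation that works directly on integer codes, and the function names here are ours.

```python
import math

MU = 255.0  # value of mu used for 8-bit telephone companding


def mu_law_compress(x: float) -> float:
    """Compress a normalized sample (-1..1): small amplitudes are stretched out."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress, applied at the decoder."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


if __name__ == "__main__":
    for x in (0.001, 0.01, 0.1, 1.0):
        y = mu_law_compress(x)
        print(f"{x:6.3f} -> {y:.3f} -> {mu_law_expand(y):.3f}")
```

Because quiet sounds occupy a much larger share of the compressed range than they would on a linear scale, the 8 bits are spent where the ear is most sensitive to error.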

By the way, how does the decoder work? For PCM, it is relatively simple to describe. The sequence of 8-bit binary numbers is turned back into a string of voltage pulses, the magnitude of each pulse corresponding to the encoded voltage measurement. This string of pulses is then passed through a circuit called a low-pass filter, which interpolates between the pulses, producing a smooth signal. (In terms of frequency, the action of the low-pass filter is to remove irrelevant high-frequency components; in the case of standard telephony, everything above 4 kHz is removed.) What remains is a close reproduction of the original voice waveform with quantization noise, which represents the slightly inaccurate encoding of the voltage measurements as finite-length binary numbers.
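
The quantization noise mentioned above can be measured directly. The sketch below, under our own simplifying assumptions (a uniform quantizer without companding, a full-scale test tone, and no low-pass filter), encodes each sample, decodes it to the middle of its quantization step, and compares signal power with error power.

```python
import math


def quantize(x: float, bits: int = 8) -> int:
    """Uniformly quantize a normalized sample (-1..1) to an integer code."""
    levels = 2 ** bits
    step = 2.0 / levels
    code = int((max(-1.0, min(1.0, x)) + 1.0) / step)
    return min(code, levels - 1)


def reconstruct(code: int, bits: int = 8) -> float:
    """Decoder side: turn a code back into the voltage at the center of its step."""
    step = 2.0 / (2 ** bits)
    return -1.0 + (code + 0.5) * step


def quantization_snr_db(bits: int = 8, tone_hz: float = 997.0,
                        rate_hz: int = 8000, n: int = 8000) -> float:
    """Signal-to-quantization-noise ratio, in dB, for a full-scale test tone."""
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n):
        x = math.sin(2.0 * math.pi * tone_hz * i / rate_hz)
        err = x - reconstruct(quantize(x, bits), bits)
        signal_power += x * x
        noise_power += err * err
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    for bits in (4, 8, 12):
        print(bits, "bits:", round(quantization_snr_db(bits), 1), "dB")
```

Each added bit halves the step size, so the noise falls steadily as the binary numbers get longer; the error for any one sample is never more than half a step.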

Achieving a Lower Bit Rate

Very simply, the goal of compressed speech coding for telephony is to use less than 64 kbps of bandwidth while preserving desirable characteristics of the speech. Here we will briefly discuss some basic approaches to reducing the bit rate that is required for carrying digital speech below the nominal 64 kbps.

A simple variation of PCM that achieves a lower bit rate with still quite good quality is differential PCM (DPCM). In DPCM, information about the difference between succeeding samples is transmitted instead of their absolute value. This takes advantage of the fact that two succeeding samples will often be quite close in value. With a slightly more sophisticated variant, called adaptive differential PCM (ADPCM), it is easy to get quite excellent speech reproduction while using only half the bit rate of straight PCM (that is, using 32 kbps). ADPCM has been standardized by ITU-T as G.726.
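
The core idea of DPCM, transmitting differences rather than absolute values, can be sketched as follows. This is deliberately the simplest possible form: the differences are sent exactly, whereas a practical DPCM or ADPCM coder would also quantize them to a few bits and (in the adaptive case) adjust the step size as it goes. The sample values are made up for illustration.

```python
def dpcm_encode(samples: list[int]) -> list[int]:
    """Replace each sample by its difference from the previous sample."""
    previous = 0
    differences = []
    for sample in samples:
        differences.append(sample - previous)
        previous = sample
    return differences


def dpcm_decode(differences: list[int]) -> list[int]:
    """Rebuild the samples by accumulating the received differences."""
    value = 0
    samples = []
    for difference in differences:
        value += difference
        samples.append(value)
    return samples


if __name__ == "__main__":
    samples = [100, 102, 105, 104, 101]   # hypothetical neighboring sample values
    encoded = dpcm_encode(samples)
    print(encoded)                        # [100, 2, 3, -1, -3]
    print(dpcm_decode(encoded))           # [100, 102, 105, 104, 101]
```

After the first value, the differences are much smaller than the samples themselves, which is why they can be represented with fewer bits.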

To get lower bit rates, it is necessary to adopt much more sophisticated approaches. Some low-bit-rate coders attempt to take advantage of the fact that the input signal is known to be human speech. Vocoding, in which an electrical model of the human vocal tract is constructed and used as the basis of a low-bit-rate coder/decoder system, is a decades-old idea from speech research that has been made practical by advances in high-speed electronics. Besides vocoders, other types of very-low-bit-rate coders include parametric coders and waveform interpolation coders. The ITU-T has standardized a number of low-bit-rate coders, including the following:

  • G.723.1, low-bit-rate coder for multimedia applications, 6.3 and 5.3 kbps

  • G.728, 16-kbps low-delay code-excited linear prediction (LD-CELP) coder

  • G.729, 8-kbps conjugate-structure algebraic-code-excited linear prediction (CS-ACELP)

For an excellent discussion of these low-bit-rate ITU coders, see Cox and Kroon (1996).

Since we are interested in integrating Internet and telephony, it is important to note that the coder bit rates we have quoted do not, of course, take into account various overheads that are introduced when voice is packetized, compared with circuit-switched voice. Packetization is quite a complex subject in its own right, and outside the scope of our present subject, speech coding. Suffice it to say that, depending on the specific choice of coder, packetization technique, and protocol stack, it is quite possible to use up most (or even all!) of the bandwidth gained in compression through packetization overhead. Other choices can result in a net bandwidth gain compared with uncompressed circuit-switched voice. Obviously, packetization is an area that requires careful attention if achieving actual bandwidth savings is important to the application.
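
A back-of-the-envelope calculation shows how this can happen. The sketch below assumes one particular protocol stack that the text does not specify: RTP over UDP over IPv4 with minimum-size headers (12 + 8 + 20 = 40 bytes per packet), no header compression, and no link-layer overhead. The packet intervals are likewise our choices for illustration.

```python
RTP_HEADER_BYTES = 12   # fixed RTP header, no extensions (assumed stack)
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20  # IPv4 header without options
HEADER_BYTES = RTP_HEADER_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES


def packetized_rate_bps(coder_rate_bps: float, packet_interval_ms: float,
                        header_bytes: int = HEADER_BYTES) -> float:
    """Bit rate on the wire when coder output is sent as one packet per interval."""
    payload_bits = coder_rate_bps * packet_interval_ms / 1000
    packets_per_second = 1000 / packet_interval_ms
    return (payload_bits + 8 * header_bytes) * packets_per_second


if __name__ == "__main__":
    print(packetized_rate_bps(8000, 20))    # 8-kbps coder, 20 ms of speech per packet
    print(packetized_rate_bps(8000, 10))    # same coder, 10 ms per packet
    print(packetized_rate_bps(64000, 20))   # 64-kbps PCM, 20 ms per packet
```

Under these assumptions an 8-kbps coder occupies 24 kbps with 20 ms packets and 40 kbps with 10 ms packets, so the headers outweigh the speech itself. Longer packets reduce the overhead but add delay.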

Trade-offs

In spite of the truly impressive advances that have been made in the past few years both in developing more sophisticated algorithms for compression and in high-speed electronics to run them, the world of speech coding still provides many illustrations of the earthy adage: There’s no such thing as a free lunch. In general, lower-bit-rate coders introduce more delay in the signal path, are more complex and expensive to implement, and involve more compromises to voice quality. This section discusses these trade-offs and is intended to help you decide whether you want to use voice compression for your application and, if so, how aggressive you can afford to be.

Delay

Voice communication can be highly sensitive to total end-to-end delay. Excessive delay interrupts the normal conversational pattern in which speakers reply to each other’s utterances and also exacerbates the problem of echoes in communication circuits. Delay is the reason why links via geostationary satellites are, at present, only used on very thin traffic routes in the modern international telephone network. Even with echo cancellation systems in place, the hundreds of milliseconds of delay introduced by the trip up to the satellite and back is very disruptive to conversation, which you will notice immediately if you ever make a call over such a circuit. The strong preference is to use optical fiber routed over the earth’s surface (or under the ocean) wherever it is feasible.

The most fundamental component of delay introduced by speech processing is called algorithmic delay. Algorithmic delay comes about because most speech coders work by doing an analysis on a batch of speech samples. Some minimum amount of speech is needed to do this analysis, and the time to accumulate this number of samples is an irreducible delay component: the algorithmic delay. Another component added by the coder is processing delay, the time for the coder hardware to analyze the speech and the decoder hardware to reproduce it. This component can be reduced by using faster hardware.

Cox and Kroon (1996) state that for ease of communication, the total system delay, which includes these coder components plus the one-way communication delay, should be less than 200 ms. The algorithmic delay for G.729 and G.723.1 coders is 15 and 37.5 ms, respectively. Assuming typical processing delays and communication over a serial connection (such as a circuit-switched transport), operating at the bit rate of the coder, the total system delays will be 35 and 97.5 ms, respectively.

If a packet network such as the Internet is involved, there may be an additional packet filling delay. For example, for G.729, the coder outputs 80 bits of compressed speech every 10 ms. If the packet size is 160 bits, this means we have to wait an additional 10 ms before we can transmit the packet, thereby increasing the overall system delay.
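
The packet filling arithmetic generalizes easily. The following sketch uses the figures quoted above (80-bit frames every 10 ms for G.729, a 35 ms system delay on a serial connection, and the 200 ms guideline); the function names and the way the components are simply added together are our own simplification.

```python
import math

DELAY_GUIDELINE_MS = 200.0  # total system delay limit cited from Cox and Kroon (1996)


def packet_fill_delay_ms(packet_payload_bits: int, frame_bits: int,
                         frame_interval_ms: float) -> float:
    """Extra wait, beyond the first frame, needed to fill one packet with coder frames."""
    frames_per_packet = math.ceil(packet_payload_bits / frame_bits)
    return (frames_per_packet - 1) * frame_interval_ms


def total_delay_ms(system_delay_ms: float, fill_delay_ms: float,
                   other_delay_ms: float = 0.0) -> float:
    """Simple sum of the delay components considered here."""
    return system_delay_ms + fill_delay_ms + other_delay_ms


if __name__ == "__main__":
    fill = packet_fill_delay_ms(160, 80, 10.0)   # G.729 frames in a 160-bit packet
    total = total_delay_ms(35.0, fill)
    print(fill, total, total < DELAY_GUIDELINE_MS)
```

Any further delay in the network or in other equipment along the path would be passed as other_delay_ms and counts against the same guideline.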

From an application point of view, you may want to avoid the use of aggressive low-bit-rate coding in situations where the quality of interaction counts for a lot—teleconferencing, for example, or calls that your salesforce makes to customers. By contrast, a one-way voice broadcast would not be much impaired by some extra delay. Another issue to look out for is added delay from other active electronics in the path.

Complexity

The issue of complexity is of direct concern to designers of equipment. The more demanding a speech processing algorithm is of processing power and memory, the bigger and more expensive the digital signal processor (DSP) or other specialized chip needs to be. For the purchaser of equipment, this translates primarily into an impact on price, but possibly also on other parameters of interest, such as power consumption in a wireless handset, which determines how long you can talk before the battery runs out.

Quality

The tried-and-true method of measuring quality in voice communications, and the one that is still used to evaluate speech coders, is the subjective test of mean opinion score (MOS). This is a test in which people are asked to listen to the speech and rate its quality as bad, poor, fair, good, or excellent. Cox and Kroon (1996) have compiled the results of many MOS tests of ITU-T and other standardized coders.
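
Computing the score itself is simple, as the sketch below shows. It assumes the conventional numbering of the five categories from 1 (bad) to 5 (excellent), and the listener ratings are invented purely for illustration.

```python
from statistics import mean

# Conventional five-point scale; the numeric values are the usual assignment.
OPINION_SCORES = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}


def mean_opinion_score(ratings: list[str]) -> float:
    """Average the listeners' category ratings on the 1-to-5 scale."""
    return mean(OPINION_SCORES[rating] for rating in ratings)


if __name__ == "__main__":
    ratings = ["good", "good", "fair", "excellent"]  # hypothetical listener panel
    print(mean_opinion_score(ratings))               # (4 + 4 + 3 + 5) / 4 = 4
```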

Behind the seemingly scientific nature of mean opinion score testing are many issues that are difficult to quantify. How do the coders perform in the presence of a variety of types of background noise? Can individual speakers be recognized by the sound of their voices? What if the sound is something other than voice (music, for example)? The best thing for a prospective system purchaser to do is listen, of course, and test the system in as close an approximation of the intended environment as possible.

The bottom line is that integration of the PSTN and the Internet presents opportunities to use very sophisticated, modern voice coding techniques, but it is up to you as the system developer or purchaser to decide whether the advantages are worth the cost and potential trade-offs in quality.