SP-6.1

Performance assessment of tandem connection of enhanced cellular coders
Simao F Campos Neto, Franklin L Corcoran (COMSAT Laboratories), Ara Karahisar (Teleglobe International)

The growth and increased competition in the second-generation (digital) cellular communication market have led service providers to improve the speech quality in their systems by introducing enhanced speech coders. Advancements in speech coding have allowed designers to aim for toll quality with these enhanced coders, and investigation of the impact of speech coders on the end-to-end quality of the public switched telephone network (PSTN) is necessary. This paper continues a series of studies on the impact of tandem connection of cellular systems; here the quality of the enhanced cellular coders for the major systems in use today is studied in the context of PSTN interconnection. A major conclusion of this study is that deployment of enhanced coders in second-generation cellular systems makes possible a substantial increase in the quality of cellular connections when in tandem with other speech coders in long-haul international networks.

SP-6.2

TTS based very low bit rate speech coder
Ki-Seung Lee, Richard V. Cox (AT&T Labs-Research, SIPS Lab)

This paper addresses a speech coder which uses a Text-To-Speech (TTS) synthesis system to achieve very low bit rates (below 1 kbps). The main issue of the work is the accurate coding of the pitch (F0) and gain contours, which are principal components of prosody. This is of paramount interest since correct prosody will increase naturalness and an efficient coding scheme will provide high coding gain. Together with the phonetic transcription, the F0 and gain contours constitute the parameters that the TTS system needs to synthesize the speech signal. Piecewise linear approximation is used to code the F0 parameter. A technique which minimizes bit rate while keeping the F0 error below a given threshold is described. To obtain both high compression and smoothly changing gain contours, the variance of the signal averaged over each half-phoneme length is transmitted as gain information. With single-speaker stimuli and a priori text transcription information, we obtained natural-sounding speech at an average bit rate of about 300 bps.
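
The piecewise linear coding of F0 under an error threshold can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a simple greedy breakpoint search, assumed here only for illustration, and the names `segment_error`, `piecewise_linear_breakpoints`, and `reconstruct` are hypothetical. It keeps as few breakpoints as the greedy rule allows while holding the interpolation error at or below `max_err`.

```python
def segment_error(f0, i, j):
    """Largest absolute error when samples i..j are replaced by a straight line."""
    worst = 0.0
    for k in range(i + 1, j):
        interp = f0[i] + (f0[j] - f0[i]) * (k - i) / (j - i)
        worst = max(worst, abs(f0[k] - interp))
    return worst


def piecewise_linear_breakpoints(f0, max_err):
    """Greedy (assumed, not from the paper) choice of breakpoint indices.

    Each segment is extended one sample at a time and closed just before
    the interpolation error would exceed max_err.
    """
    n = len(f0)
    if n == 0:
        return []
    breakpoints = [0]
    start = 0
    while start < n - 1:
        end = start + 1
        while end + 1 < n and segment_error(f0, start, end + 1) <= max_err:
            end += 1
        breakpoints.append(end)
        start = end
    return breakpoints


def reconstruct(f0, breakpoints):
    """Rebuild the contour by linear interpolation between the breakpoints."""
    out = [0.0] * len(f0)
    for b in breakpoints:
        out[b] = float(f0[b])
    for a, b in zip(breakpoints, breakpoints[1:]):
        for k in range(a + 1, b):
            out[k] = f0[a] + (f0[b] - f0[a]) * (k - a) / (b - a)
    return out


# Hypothetical F0 contour in Hz: a linear rise followed by a linear fall.
contour = [100, 110, 120, 130, 120, 110, 100]
points = piecewise_linear_breakpoints(contour, max_err=1.0)
approx = reconstruct(contour, points)
```

Only the breakpoint positions and their F0 values would need to be transmitted; a looser threshold yields fewer breakpoints and hence a lower bit rate, at the cost of a less accurate contour.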

SP-6.3

Wideband Speech Coding with Toll Quality Based on IA-Model
Ling Kok Ng, Gang Li (School of EEE, Nanyang Technological University, Singapore 639798), Xiao Lin (Center for Signal Processing, Nanyang Technological University, Singapore 639798), Guoan Bi (School of EEE, Nanyang Technological University, Singapore 639798)

In this paper, we propose an instantaneous amplitude (IA) based model for speech signal representation. This avoids the difficulty of dealing with time-varying phases and allows us to perform an optimization procedure easily, so that the synthetic signal can be made as close to the original as possible. A simplified frequency-picking algorithm is derived to shorten the processing time while still maintaining the quality of the synthetic speech. Experiments show that the synthetic speech produced with the developed technique is of toll quality and almost perceptually indistinguishable from the original speech. Initial work on coding the parameters of the IA model for speech sampled at 16 kHz has been done, and toll-quality synthesized speech at a bit rate of 40 kbps is achieved.

SP-6.4

4 kb/s Multi-Pulse Based CELP Speech Coding Using Excitation Switching
Kazunori Ozawa (NEC Corporation)

This paper proposes an MP-CELP (Multi-Pulse-based CELP) speech coder operating at 4 kb/s. In MP-CELP, amplitudes or signs of the multi-pulse excitation are simultaneously vector quantized (VQ). To improve speech quality in background noise conditions, the excitation signal is switched between voiced and unvoiced speech, and the number of pulses is greatly increased for unvoiced speech by restricting pulse locations. Further, to improve voiced speech quality, a delayed-decision search selects the combination of adaptive codebook lag, pulse location, sign codevector, and gain codevector that minimizes distortion. The subjective evaluation results show that the speech quality of 4 kb/s MP-CELP is close to that of ITU-T G.723.1 (6.3 kb/s) and G.729 (8 kb/s) in the M-IRS clean speech condition. For background noise conditions, the introduction of excitation switching and pulse location restriction improves the MOS value by 0.4. However, further improvement is still required, except for the interfering talker condition.

SP-6.5

An Adaptive Multi-Rate Speech Coder For Digital Cellular Telephony
Erdal Paksoy (Texas Instruments), Juan Carlos De Martin (Polytechnic of Turin), Alan V McCree (Texas Instruments), Christian G Gerlach (Alcatel SEL AG), Anand Anandakumar, Wai-Ming Lai, Vishu Viswanathan (Texas Instruments)

We have developed an adaptive multi-rate (AMR) speech coder designed to operate under the GSM digital cellular full rate (22.8 kb/s) and half rate (11.4 kb/s) channels and to maintain high quality in the presence of highly varying background noise and channel conditions. Within each total rate, several codec modes with different source/channel bit rate allocations are used. The speech coders in each codec mode are based on the CELP algorithm operating at rates ranging from 11.85 kb/s down to 5.15 kb/s, where the lowest rate coder is a source controlled multi-modal speech coder. The decoders monitor channel quality at both ends of the wireless link using the soft values for the received bits and assist the base station in selecting the codec mode that is appropriate for a given channel condition. The coder was submitted to the GSM AMR standardization competition and met the qualification requirements in an independent formal MOS test.
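
The source/channel split within a fixed gross rate is simple arithmetic, sketched below. The gross rates and the two speech coder rates are the ones quoted in the abstract; the helper name `channel_coding_rate` is hypothetical, and treating the whole remainder as channel coding is an assumption that ignores any signalling overhead.

```python
# Gross channel rates in kb/s, as stated in the abstract.
GROSS_RATE = {"full": 22.8, "half": 11.4}


def channel_coding_rate(channel, source_rate):
    """Return the kb/s left for channel coding in a given codec mode.

    Assumption: everything not spent on speech (source) bits goes to
    channel coding.
    """
    gross = GROSS_RATE[channel]
    if source_rate > gross:
        raise ValueError("source rate does not fit in this channel")
    return round(gross - source_rate, 2)


# Highest and lowest speech coder rates quoted in the abstract.
full_rate_high = channel_coding_rate("full", 11.85)
full_rate_low = channel_coding_rate("full", 5.15)
half_rate_low = channel_coding_rate("half", 5.15)
```

In the full-rate channel the 11.85 kb/s coder leaves about 10.95 kb/s for protection, while the 5.15 kb/s coder leaves about 17.65 kb/s, which is why lower source rates suit poorer channel conditions. The 11.85 kb/s coder cannot fit in the 11.4 kb/s half-rate channel at all.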

SP-6.6

An Adaptive Post-Filtering Technique Based on The Modified Yule-Walker Filter
Azhar Mustapha, Suat Yeldener (COMSAT Laboratories, Clarksburg, Maryland, USA)

This paper presents an adaptive time-domain post-filtering technique based on the modified Yule-Walker filter. Conventionally, post-filtering is derived from an original LPC spectrum. In general, this time-domain technique produces an unpredictable spectral tilt that is hard to control with the modified LPC synthesis, inverse, and high-pass filtering, and it causes unnecessary attenuation or amplification of some frequency components, which muffles the speech. This effect increases when voice coders are connected in tandem. Another approach to designing a post-filter was developed by McAulay and Quatieri, but it can only be used in sinusoidal-based speech coders. We have developed a new time-domain post-filtering technique. This technique eliminates the problem of spectral tilt in the speech spectrum and can be applied to various speech coders. The new post-filter has a flat frequency response at the formant peaks of the speech spectrum. Instead of looking at the modified LPC synthesis, inverse, and high-pass filtering as in the conventional time-domain technique, the new technique gathers information about the poles of the LPC spectrum. This post-filtering technique has been used in a 4 kb/s Harmonic Excitation Linear Predictive Coder (HE-LPC), and subjective listening tests have indicated that it outperforms the conventional technique in both one and two tandem connections.

SP-6.7

A Modular Approach to Speech Enhancement with an Application to Speech Coding
Anthony J Accardi, Richard V Cox (AT&T Labs - Research, Florham Park, NJ 07932)

Ephraim and Malah's MMSE-LSA speech enhancement algorithm, while robust and effective, is difficult to tune and adjust for the tradeoff between noise reduction and distortion. We suggest a means of generalizing this design, which allows for other estimators besides the MMSE-LSA to be used within the same supporting framework. When a modified version of Ephraim and Van Trees's spectral domain constrained signal subspace estimator is used in this manner, we obtain a system with greater flexibility and similar performance. We also explore the possibility of using different speech enhancement techniques as pre-processors for different parameter extraction modules of the IS-641 speech coder. We show that such a strategy can increase the quality of the coded speech and lead to a system that is more robust to differing noise types.

SP-6.8

On Speech Coding in a Perceptual Domain
Gernot Kubin (Vienna University of Technology), W. Bastiaan Kleijn (KTH (Royal Institute of Technology))

In many speech coders, the distortion criterion operates on the speech signal or a signal obtained by adaptive linear filtering of the speech signal. To satisfy computational and delay constraints, the distortion criterion must be reduced to a very simple approximation of the auditory system. This drawback of conventional approaches motivates a new speech coding paradigm in which the coding is performed in a domain where the single-letter squared-error criterion forms an accurate representation of perception. The new paradigm requires a model of the auditory periphery which is accurate, can be inverted with relatively low computational effort, and represents the signal with relatively few parameters. In this paper we develop such a model of the auditory periphery and discuss its suitability for speech coding. Our results indicate that the new paradigm in general and our auditory model in particular form a promising basis for speech and audio coding.

Last Update: February 4, 1999 Ingo Höntsch