1:00, SPEECH-L7.1
A CANDIDATE PROPOSAL FOR A 3GPP ADAPTIVE MULTI-RATE WIDEBAND SPEECH CODEC
C. ERDMANN, P. VARY, K. FISCHER, W. XU, M. MARKE, T. FINGSCHEIDT, I. VARGA, M. KAINDL, C. QUINQUIS, B. KOVESI, D. MASSALOUX
This paper describes an adaptive multi-rate wideband (AMR-WB) speech codec
proposed for the GSM system and also for
the evolving Third Generation (3G) mobile
speech services. The speech codec is based on SB-CELP
(Subband-Code-Excited Linear Prediction)
with five modes operating at bit rates from 24 kbit/s down to 9.1 kbit/s.
The respective channel coding schemes are based on RSC (Recursive Systematic
Code) and UEP (Unequal Error Protection). Both the source and channel codecs are designed to be
as homogeneous as possible to guarantee robust transmission over current and future mobile radio channels.
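As a rough illustration of the channel-coding building block named in the
abstract, the sketch below implements a generic rate-1/2 recursive systematic
convolutional (RSC) encoder in Python; the (7,5)-octal generator polynomials
are illustrative defaults and are not taken from the 3GPP candidate.

# Minimal RSC encoder sketch. The generator polynomials below
# (feedback 1 + D + D^2, parity 1 + D^2) are illustrative only.
def rsc_encode(bits, feedback=(1, 1, 1), parity=(1, 0, 1)):
    """Encode a bit sequence; returns (systematic, parity) bit streams."""
    state = [0] * (len(feedback) - 1)      # shift-register contents
    sys_out, par_out = [], []
    for b in bits:
        fb = b                             # feedback bit = input XOR feedback taps
        for tap, s in zip(feedback[1:], state):
            fb ^= tap & s
        p = parity[0] & fb                 # parity bit from the feedforward taps
        for tap, s in zip(parity[1:], state):
            p ^= tap & s
        sys_out.append(b)                  # systematic bit = input bit
        par_out.append(p)
        state = [fb] + state[:-1]          # shift the register
    return sys_out, par_out

sys_bits, par_bits = rsc_encode([1, 0, 1, 1, 0, 0, 1])
print(sys_bits, par_bits)

In a UEP arrangement, the perceptually most sensitive source bits would be
carried in a more strongly protected class (for example, with less puncturing
of the parity stream) than the remaining bits.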
1:20, SPEECH-L7.2
AN EMBEDDED ADAPTIVE MULTI-RATE WIDEBAND SPEECH CODER
A. MCCREE, T. UNNO, A. ANANDAKUMAR, A. BERNARD, E. PAKSOY
This paper presents a multi-rate wideband speech coder with bit rates
from 8 to 32 kb/s. The coder uses a split-band approach, where the
input signal, sampled at 16 kHz, is split into two equal frequency
bands from 0-4 kHz and 4-8 kHz, each of which is decimated to an 8 kHz
sampling rate. The lower band is coded using the Adaptive Multi-rate
(AMR) family of high-quality narrowband speech coders, while the
higher band is represented by a simple but effective parametric model.
A complete solution including this wideband speech coder, channel
coding for various GSM channels, and dynamic rate adaptation, easily
passed all Selection Rules and ranked second overall in the recent
3GPP AMR Wideband Selection Testing. Besides high performance,
additional advantages of the embedded split-band approach include ease
of implementation, reduced complexity, and simplified interoperation
with narrowband speech coders.
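A minimal sketch of such a split-band front end (an illustration, not the
authors' implementation) is given below: a 16 kHz input is split into 0-4 kHz
and 4-8 kHz bands with a half-band FIR pair and each band is decimated to an
8 kHz sampling rate; the filter design and length are assumed for illustration.

# Two-band split of a 16 kHz signal into 0-4 kHz and 4-8 kHz components,
# each decimated to 8 kHz. Filter length/design are illustrative choices.
import numpy as np
from scipy.signal import firwin, lfilter

def split_band(x, numtaps=65):
    """Return (low_band, high_band), each at half the input sampling rate."""
    h_lp = firwin(numtaps, 0.5)                    # half-band lowpass (cutoff = fs/4)
    h_hp = h_lp * (-1.0) ** np.arange(numtaps)     # modulate lowpass to highpass
    low = lfilter(h_lp, 1.0, x)[::2]               # filter, then decimate by 2
    high = lfilter(h_hp, 1.0, x)[::2]              # 4-8 kHz band aliases to baseband
    return low, high

x = np.random.randn(16000)                         # one second of test signal at 16 kHz
low, high = split_band(x)
print(low.shape, high.shape)                       # (8000,) (8000,)

In the coder described above, the low band would then be passed to the
narrowband AMR coder and the high band to the parametric model.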
1:40, SPEECH-L7.3
OPTIMAL ESTIMATION OF SUBBAND SPEECH FROM NONUNIFORM NON-RECURRENT SIGNAL-DRIVEN SPARSE SAMPLES
P. PENEV, L. IORDANOV
Speech signals are composed of auditory objects that are
localized in time but can appear anywhere in the record. We
introduce a strategy for non-recurrent, irregular, signal-driven
sampling and subsequent maximum-likelihood interpolation of
speech subbands that achieves object constancy: the
representation of an auditory object is precisely locked to
the timing of its features, but is otherwise constant.
Moreover, the reconstruction fidelity can be traded flexibly
for sampling rate, over a broad range of signal-to-noise
ratios and application requirements. In an experiment with
wide-band speech, we find a regime in the rate/distortion
curve that has almost perfect reconstruction at a rate
substantially lower than the respective Nyquist rate.
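One way such a reconstruction can be posed (an assumption for illustration,
not the authors' formulation) is sketched below: under white Gaussian noise,
maximum-likelihood interpolation of a bandlimited subband model from
nonuniform samples reduces to a least-squares fit, here with a truncated
Fourier basis standing in for the subband model.

# Least-squares (ML under white Gaussian noise) interpolation of a signal
# from irregular, sparse samples using a truncated Fourier basis.
import numpy as np

def ml_interpolate(t_samp, y_samp, t_grid, n_harmonics=20, period=1.0):
    """Fit a truncated Fourier series to sparse samples and evaluate it on t_grid."""
    def design(t):
        cols = [np.ones_like(t)]
        for k in range(1, n_harmonics + 1):
            cols.append(np.cos(2 * np.pi * k * t / period))
            cols.append(np.sin(2 * np.pi * k * t / period))
        return np.column_stack(cols)
    coeffs, *_ = np.linalg.lstsq(design(t_samp), y_samp, rcond=None)
    return design(t_grid) @ coeffs

rng = np.random.default_rng(0)
t_samp = np.sort(rng.uniform(0.0, 1.0, 120))       # irregular, signal-driven sample times
y_samp = np.sin(2 * np.pi * 5 * t_samp) + 0.05 * rng.standard_normal(120)
y_hat = ml_interpolate(t_samp, y_samp, np.linspace(0.0, 1.0, 1000))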
2:00, SPEECH-L7.4
VARIABLE-SIZE VECTOR ENTROPY CODING OF SPEECH AND AUDIO
Y. SHOHAM
Many modern analog media coders employ some form of entropy
coding (EC). Usually, a simple per-letter EC is used to keep
the coder's complexity and price low. In some coders, individual
symbols are grouped into small fixed-size vectors before EC is
applied. In this work we extend this approach to form Variable-
Size Vector EC (VSVEC), in which vector sizes may range from 1 to
several hundred. The method is complexity-constrained in the sense
that the vector size is always as large as a pre-set complexity
limit allows. The idea is studied in the framework of an MDCT
transform coder. It is shown experimentally, using diverse audio
material, that a rate reduction of about 37% can be achieved. The
method is not specific to MDCT coding, however, and can be
incorporated in various speech, audio, image and video coders.
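A toy sketch of the complexity-constrained grouping idea (not the paper's
actual VSVEC algorithm) follows: consecutive quantized symbols are merged
into one vector for joint entropy coding as long as the joint-alphabet size,
used here as a stand-in for complexity, stays below a preset cap; the
alphabet sizes and the cap are assumed values.

# Complexity-constrained variable-size vector formation: grow each vector
# while the joint alphabet (product of per-symbol alphabet sizes) fits the cap.
import numpy as np

def group_variable_size(symbols, alphabet_sizes, complexity_cap=4096):
    """Split a symbol stream into variable-size vectors under a complexity limit."""
    vectors, start = [], 0
    while start < len(symbols):
        size, joint = 0, 1
        while (start + size < len(symbols)
               and joint * alphabet_sizes[start + size] <= complexity_cap):
            joint *= alphabet_sizes[start + size]
            size += 1
        size = max(size, 1)                        # always emit at least one symbol
        vectors.append(tuple(symbols[start:start + size]))
        start += size
    return vectors

rng = np.random.default_rng(1)
syms = rng.integers(0, 4, 256).tolist()            # toy quantized transform coefficients
alph = [4] * 256                                   # nominal 2-bit alphabet per symbol
vecs = group_variable_size(syms, alph)
print(len(vecs), "vectors, mean size", 256 / len(vecs))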
2:20, SPEECH-L7.5
WIDEBAND SPEECH AND AUDIO CODING USING GAMMATONE FILTER BANKS
E. AMBIKAIRAJAH, J. EPPS, L. LIN
Considerable research attention has been directed towards speech and audio coding algorithms capable of producing high-quality coded speech and audio; however, few of these use signal representations that account for temporal as well as spectral detail. This paper presents a new technique for 16 kHz wideband speech and audio coding, whereby analysis and synthesis are performed using a linear-phase gammatone filter bank. The outputs of these critical-band filters are processed to obtain a series of pulse trains that represent neural firing. Auditory masking is then applied to reduce the number of pulses, producing a more compact time-frequency parameterization. The critical-band gains and the pulse amplitudes and positions are then coded using a combination of non-uniform quantization, arithmetic coding and vector quantization. This coding paradigm produces high-quality coded speech and audio, is based upon well-known models of the auditory system, is highly scalable, and has moderate complexity.
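A minimal sketch of the analysis stage is given below; it assumes FIR
gammatone approximations, log-spaced centre frequencies, and half-wave
rectification as a crude stand-in for pulse extraction, and it does not
reproduce the paper's linear-phase design or the masking and quantization
stages.

# 4th-order gammatone analysis bank with half-wave rectified outputs.
# Centre frequencies, filter length and order are illustrative choices.
import numpy as np
from scipy.signal import lfilter

def gammatone_fir(fc, fs, order=4, duration=0.016):
    """FIR approximation of a gammatone impulse response centred at fc."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)        # Glasberg-Moore ERB bandwidth
    g = t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def analyse(x, fs=16000, n_channels=21):
    """Filter x through the gammatone bank and half-wave rectify each output."""
    fcs = np.geomspace(100.0, 7000.0, n_channels)  # log-spaced centre frequencies
    outputs = [lfilter(gammatone_fir(fc, fs), 1.0, x) for fc in fcs]
    return [np.maximum(o, 0.0) for o in outputs]   # crude pulse-like representation

x = np.random.randn(16000)
channels = analyse(x)
print(len(channels), channels[0].shape)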
2:40, SPEECH-L7.6
FREQUENCY SELECTIVITY VIA THE SPENT METHODOLOGY FOR WIDEBAND SPEECH COMPRESSION
M. KOKES, J. GIBSON
In speech and audio coding, frequency selectivity of the basis
functions is an important property of the codec. The more precise the
frequency selectivity, the lower the chance of audible coding artifacts
due to uncanceled aliasing. In this work, we use
Campbell's coefficient rate and the spectral entropy (SpEnt) of the
source random process as a guide to formulate adaptive nonuniform
modulated lapped biorthogonal transforms (NMLBT). The use of the
NMLBT allows for efficient implementation of a time-varying transform
that possesses both good frequency and time resolution at all
times. By coupling the SpEnt methodology with the MLBT, we
develop band combining strategies to produce an adaptive NMLBT. This
new frequency selection process comprises a non-linear approximation
method to determine the best N basis functions for a speech frame.
We implement a wideband speech compression scheme based on our
strategy and verify its improved performance at 16 and 24 kbps.
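The sketch below illustrates the flavour of an entropy-driven non-linear
approximation; a plain DFT stands in for the adaptive NMLBT, and the mapping
from spectral entropy to the number N of retained coefficients is an assumed
heuristic, not the authors' method.

# Per-frame spectral entropy sets N; the N largest-magnitude transform
# coefficients are kept (non-linear approximation on an assumed DFT basis).
import numpy as np

def spectral_entropy(frame):
    """Entropy (bits) of the normalized power spectrum of one frame."""
    p = np.abs(np.fft.rfft(frame)) ** 2
    p = p / (np.sum(p) + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))

def best_n_coefficients(frame, n_max=None):
    """Keep the N largest-magnitude DFT coefficients, with N driven by spectral entropy."""
    spec = np.fft.rfft(frame)
    h = spectral_entropy(frame)
    n = int(np.clip(2 ** h, 1, n_max or len(spec)))  # 2**entropy ~ active bins (assumed mapping)
    keep = np.argsort(np.abs(spec))[-n:]
    sparse = np.zeros_like(spec)
    sparse[keep] = spec[keep]
    return np.fft.irfft(sparse, len(frame)), n

frame = np.sin(2 * np.pi * 440 * np.arange(320) / 16000) + 0.1 * np.random.randn(320)
recon, n_kept = best_n_coefficients(frame)
print("kept", n_kept, "of", len(np.fft.rfft(frame)), "coefficients")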