Session: SPEECH-P4
Time: 9:30 - 11:30, Wednesday, May 9, 2001
Location: Exhibit Hall Area 3
Title: Topics In Speech Coding
Chair: Bastiaan Kleijn

9:30, SPEECH-P4.1
APPROXIMATING AND EXPLOITING THE RESIDUAL REDUNDANCIES -- APPLICATIONS TO EFFICIENT RECONSTRUCTION OF SPEECH OVER NOISY CHANNELS
F. LAHOUTI, A. KHANDANI
Exploiting the residual redundancy in a source coder output stream during the decoding process has proven to be a bandwidth-efficient way to combat noisy-channel degradation. In this paper, we consider soft reconstruction of LSF parameters in the IS-641 CELP coder transmitted over a noisy channel. We propose two schemes. The first scheme attempts to exploit the interframe residual redundancies in the sequence of received parameters. The second approach exploits both interframe and intraframe residual redundancies. Simulation results are provided which demonstrate the efficiency of the algorithms. Another issue addressed here is a methodology to efficiently approximate and store the residual redundancies, or the a priori transition probabilities. For quantizers with high rates, calculating these probabilities requires a huge number of source samples. Storing them also requires a large amount of memory. These issues can make the decoder design process impractical. The proposed method is based on the classification of the signal domain. The presented schemes provide high quality error concealment solutions for CELP coders.

9:30, SPEECH-P4.2
CHANNEL OPTIMIZED MATRIX QUANTIZATION (COMQ) FOR LSP PARAMETERS OVER WAVEFORM CHANNELS
J. PÉREZ-CÓRDOBA, A. RUBIO, J. LÓPEZ-SOLER, V. SÁNCHEZ
Combined source and channel coding is a technique to mitigate channel errors without increasing the bit rate. The channel optimized vector quantizer (COVQ) achieves these objectives in the context of vector quantization. This paper presents a study of the channel optimized matrix quantizer (COMQ), an extension of the COVQ technique, applied to quantizing the Line Spectral Pair (LSP) parameters. Gaussian and slow-fading Rayleigh channels are considered, and GMSK (Gaussian Minimum Shift-Keying) is used as the modulation technique. Several channel signal-to-noise ratios (CSNRs) are considered to measure the performance of this system. In addition, for comparison purposes, the performance of other schemes for quantizing the LSP parameters is computed.

9:30, SPEECH-P4.3
HYBRID MULTI-MODE/MULTI-RATE CS-ACELP SPEECH CODING FOR ADAPTIVE VOICE OVER IP
G. RUGGERI, F. BERITELLI, S. CASALE
This paper presents a hybrid Multi-Mode/Multi-Rate, toll quality CS-ACELP coder developed for Voice over IP applications. The coder uses coding modes compatible with the three 6.4, 8, and 11.8 kbit/s coding schemes standardised by ITU-T in G.729. In particular, the algorithm presents 4 coding categories, with an average bit rate ranging between about 3 and 8 kbit/s, that adapt the rate to changes in network conditions.

9:30, SPEECH-P4.4
IMPROVED VOICE ACTIVITY DETECTION BASED ON A SMOOTHED STATISTICAL LIKELIHOOD RATIO
Y. CHO, K. AL-NAIMI, A. KONDOZ
This paper presents the behavioural mechanism of a statistical model-based voice activity detector (VAD), featuring a likelihood ratio test for the activity decision. From investigation of the VAD, it is found that detection errors can occur frequently at speech offset regions because of the delay term in the decision-directed parameter estimator, employed for the estimation of an unknown parameter of the likelihood ratio. Hence, this paper proposes a smoothed likelihood ratio to alleviate the detection errors at the offset region. Objective test results show that the proposed scheme achieves a considerable performance improvement for the VAD. Additionally, the proposed VAD gives detection performance superior to that of the G.729B VAD and comparable with that of AMR VAD option 2.
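
As an illustration of the smoothing idea only: the abstract does not give the exact form of the smoothed likelihood ratio, so the sketch below assumes a first-order recursive average of the per-frame log likelihood ratio. The function names, the smoothing factor `kappa`, and the threshold are hypothetical and are not taken from the paper.

```python
def smooth_log_likelihood_ratios(log_lrs, kappa=0.9):
    """First-order recursive smoothing of per-frame log likelihood ratios.

    Assumed form: s[t] = kappa * s[t-1] + (1 - kappa) * log_lr[t],
    with s[0] initialised to the first observation.
    """
    smoothed = []
    state = None
    for value in log_lrs:
        if state is None:
            state = value
        else:
            state = kappa * state + (1.0 - kappa) * value
        smoothed.append(state)
    return smoothed


def detect_speech(log_lrs, threshold=0.0, kappa=0.9):
    """Declare speech whenever the smoothed log likelihood ratio exceeds the threshold."""
    return [s > threshold for s in smooth_log_likelihood_ratios(log_lrs, kappa)]


# Toy sequence: the third frame dips below the threshold on its own,
# but the smoothed statistic keeps the decision at "speech".
frame_log_lrs = [2.0, 2.0, -0.5, 2.0]
smoothed = smooth_log_likelihood_ratios(frame_log_lrs, kappa=0.5)
decisions = detect_speech(frame_log_lrs, threshold=0.0, kappa=0.5)
```

A larger `kappa` gives more memory, which in this sketch behaves somewhat like a hangover at speech offsets, at the cost of a slower reaction at onsets.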

9:30, SPEECH-P4.5
MULTIPLEXED PREDICTIVE CODING OF SPEECH
S. ANDERSEN, G. KUBIN
In this paper we present a novel method for predictive coding with application to transmission of speech over packet-switched networks. Our method uses multiplexing to distribute a part of the information about a segment of each speech signal in several data packets while keeping the data packet rate and payload for that part of the information unchanged. We investigate three multiplexing schemes: packet hopping, Hadamard multiplexing, and an extension of Hadamard multiplexing that exploits a nonlinear preprocessing and estimation method. We show by means of formal AB-preference tests that multiplexed predictive coding can lead to coders that are more robust to packet losses than scalar quantization and packet loss concealment according to the G.711 standard.

9:30, SPEECH-P4.6
PARAMETER INTERPOLATION TO ENHANCE THE FRAME ERASURE ROBUSTNESS OF CELP CODERS IN PACKET NETWORKS
J. WANG, J. GIBSON
Frame erasure (FE) robustness is an important quality measure for voice over IP (VoIP) networks. Recovery of the erased frames from the received information is crucial to realize this robustness. We allow the lost frames to be recovered from both the "previous" and "next" good frames. We first give quantitative distortion comparisons between predictive and interpolative frame recovery. Then we add FE-robust LSF coding modes to the popular ITU G.723.1 and G.729 CELP coders. These FE-robust modes utilize intraframe LSF VQ and incur no bit-rate increase for the G.723.1 coder and a small increase (0.4 kb/s) for G.729. Simulations show that FE-robust coding with interpolation achieves average spectral distortions 0.7-1.8 dB smaller than those of the original coders. Significant quality improvement was achieved by combined implementation of FE-robust coding, LSF and pitch interpolation, and a proposed fixed codebook excitation recovery method.
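
To make the contrast between the two recovery styles concrete, here is a minimal sketch that assumes plain linear interpolation between the LSF vectors of the surrounding good frames, with repetition of the previous frame as a simple predictive-style baseline. The function names and the toy three-element vectors are hypothetical; the paper's actual recovery rules may differ.

```python
def interpolate_lsf(prev_lsf, next_lsf, position=0.5):
    """Linearly interpolate between the LSF vectors of the good frames around an erasure.

    position = 0.0 returns the previous frame, 1.0 the next frame.
    """
    if len(prev_lsf) != len(next_lsf):
        raise ValueError("LSF vectors must have the same order")
    return [(1.0 - position) * p + position * n
            for p, n in zip(prev_lsf, next_lsf)]


def repeat_previous_lsf(prev_lsf):
    """Baseline that uses only past information: reuse the last good frame's LSFs."""
    return list(prev_lsf)


# Toy normalised LSF vectors (real coders typically use order 10).
prev_lsf = [0.10, 0.25, 0.40]
next_lsf = [0.20, 0.35, 0.60]
recovered = interpolate_lsf(prev_lsf, next_lsf)
baseline = repeat_previous_lsf(prev_lsf)
```

Because the result is a convex combination of two increasing vectors, the interpolated LSFs stay in increasing order. Interpolation requires the "next" good frame, so it adds delay that the repetition baseline does not.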

9:30, SPEECH-P4.7
A SPEECH SPECTRUM DISTORTION MEASURE WITH INTERFRAME MEMORY
F. NORDÉN, T. ERIKSSON
In this paper we present a novel spectral distortion measure with interframe memory. The memory makes it possible to take into account the dynamics of the time evolution of the speech spectrum, which has been shown to have a significant effect on the perceived speech quality. Memory is introduced by linear filtering of the time evolution of the difference log spectrum. This facilitates smoothing of the spectrum while retaining the ability to track quick transitions. Our results point to a substantially improved performance when rapidly evolving spectrum errors are penalized in the measure.

9:30, SPEECH-P4.8
ESTIMATION OF MISSING LSF PARAMETERS USING GAUSSIAN MIXTURE MODELS
R. MARTIN, C. HOELPER, I. WITTKE
Speech transmission over packet networks has to cope with packet delays and packet losses. When a packet loss occurs, the missing information must be estimated. In this contribution we focus on restoring the spectral parameters of a speech coder. A novel approach to estimating missing Line Spectral Frequency (LSF) parameters using Gaussian Mixture Models (GMM) is proposed. We present the estimation algorithm and study its performance when one or several LSF parameters are lost. We show that a GMM of a relatively low order (approx. 20) is sufficient to achieve a substantial improvement in parameter SNR. Therefore, the new estimation procedure requires much less memory than histogram-based estimation methods.
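
The following sketch shows one way a GMM can be used to fill in missing parameters: the minimum mean-square error (conditional mean) estimate given the parameters that did arrive. It assumes diagonal covariances, which is a simplification chosen here for brevity and is not stated in the abstract; the function name and the two-component toy model are hypothetical.

```python
import math


def estimate_missing(x, missing, weights, means, variances):
    """Conditional-mean estimate of the missing entries of x under a diagonal-covariance GMM.

    x         : list of floats (values at the missing indices are ignored)
    missing   : set of indices whose values were lost
    weights   : mixture weights, one per component
    means     : per-component mean vectors
    variances : per-component variance vectors (diagonal covariances)
    """
    observed = [i for i in range(len(x)) if i not in missing]
    # Log posterior (up to a constant) of each component given the observed entries.
    log_post = []
    for w, mu, var in zip(weights, means, variances):
        lp = math.log(w)
        for i in observed:
            lp += -0.5 * math.log(2.0 * math.pi * var[i])
            lp += -0.5 * (x[i] - mu[i]) ** 2 / var[i]
        log_post.append(lp)
    # Normalise in a numerically stable way.
    peak = max(log_post)
    post = [math.exp(lp - peak) for lp in log_post]
    total = sum(post)
    post = [p / total for p in post]
    # With diagonal covariances, the conditional mean of a missing entry
    # within one component is that component's mean for the entry.
    estimate = list(x)
    for i in missing:
        estimate[i] = sum(p * mu[i] for p, mu in zip(post, means))
    return estimate


# Toy two-component model over a 2-dimensional parameter vector.
weights = [0.5, 0.5]
means = [[0.1, 0.2], [0.5, 0.8]]
variances = [[0.001, 0.001], [0.001, 0.001]]
# The second entry was lost; 0.0 is only a placeholder.
restored = estimate_missing([0.1, 0.0], {1}, weights, means, variances)
```

With full covariance matrices, each component's conditional mean would also include a regression term on the observed entries. The memory argument in the abstract is visible even in this sketch: the model stores only weights, means, and variances per component rather than a multidimensional histogram.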

9:30, SPEECH-P4.9
PERCEPTUAL EVALUATION OF SPEECH QUALITY (PESQ) - A NEW METHOD FOR SPEECH QUALITY ASSESSMENT OF TELEPHONE NETWORKS AND CODECS
A. RIX, J. BEERENDS, M. HOLLIER, A. HEKSTRA
Previous objective speech quality assessment models, such as bark spectral distortion (BSD), the perceptual speech quality measure (PSQM), and measuring normalizing blocks (MNB), have been found to be suitable for assessing only a limited range of distortions. A new model has therefore been developed for use across a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay. Known as perceptual evaluation of speech quality (PESQ), it is the result of integration of the perceptual analysis measurement system (PAMS) and PSQM99, an enhanced version of PSQM. PESQ is expected to become a new ITU-T recommendation P.862, replacing P.861 which specified PSQM and MNB.

9:30, SPEECH-P4.10
SOURCE-DRIVEN PACKET MARKING FOR SPEECH TRANSMISSION OVER DIFFERENTIATED SERVICES NETWORKS
J. DE MARTIN
We present a source-driven approach to packet marking for speech transmission over packet networks implementing the Differentiated Services model. Packets generated by the speech coder are examined: if deemed perceptually critical, they are marked as premium and sent on a "virtual wire"; otherwise, they are sent as regular best-effort traffic. Applied to speech coded with the ITU-T 8 kb/s speech coding standard G.729, the proposed source-driven packet marking scheme outperforms source-transparent techniques and provides clearly better perceptual quality than the unprotected case while sending as little as 1/5 of the coded bitstream as premium traffic. Audio samples are available at http://demartin.polito.it/icassp2001/.