Chair: Richard Stern, Carnegie Mellon University (USA)
Ananth Sankar, SRI International (USA)
Chin-Hui Lee, AT&T Bell Laboratories (USA)
We present a maximum likelihood (ML) stochastic matching approach to decrease the acoustic mismatch between a test utterance Y and a given set of speech hidden Markov models Λ_X so as to reduce the recognition performance degradation caused by possible distortions in the test utterance. This mismatch may be reduced in two ways: 1) by an inverse distortion function F_ν(·) that maps Y into an utterance X which matches better with the models Λ_X, and 2) by a model transformation function G_η(·) that maps Λ_X to the transformed model Λ_Y which matches better with the utterance Y. The functional form of the transformations depends upon our prior knowledge about the mismatch, and the parameters are estimated along with the recognized string in a maximum likelihood manner using the EM algorithm. Experimental results verify the efficacy of the approach in improving the performance of a continuous speech recognition system in the presence of mismatch due to different transducers and transmission channels.
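The joint estimation can be sketched for the simplest inverse distortion function, a fixed additive cepstral bias b (so the compensated features are y − b), estimated by EM against a Gaussian mixture standing in for the speech models. The GMM parameters and synthetic data below are illustrative, not taken from the paper.

```python
import numpy as np

def em_bias_estimate(Y, means, variances, weights, n_iter=10):
    """ML estimate of a fixed cepstral bias b (x_t = y_t - b) under a
    diagonal-covariance clean-speech GMM, via EM."""
    b = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        X = Y - b                                   # compensated features
        # E-step: component posteriors for each frame
        log_p = (-0.5 * ((X[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
                 - 0.5 * np.log(variances).sum(-1)[None] + np.log(weights)[None])
        log_p -= log_p.max(axis=1, keepdims=True)
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: closed-form, precision-weighted update of the bias
        num = (post[:, :, None] * (Y[:, None, :] - means[None]) / variances[None]).sum((0, 1))
        den = (post[:, :, None] / variances[None]).sum((0, 1))
        b = num / den
    return b

# Illustrative check: distort GMM samples with a known bias and recover it.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [6.0, 6.0]])
variances = np.ones((2, 2))
weights = np.array([0.5, 0.5])
comp = rng.integers(0, 2, size=500)
clean = means[comp] + rng.standard_normal((500, 2))
true_bias = np.array([1.0, -0.5])
b_hat = em_bias_estimate(clean + true_bias, means, variances, weights)
```

In the paper the mixture is the recognizer's own HMM state distributions and the bias is estimated jointly with the decoded string; the closed-form M-step above is the core of that update.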
Olivier Siohan, CRIN-CNRS & INRIA - Lorraine (FRANCE)
This paper addresses the problem of speech recognition in a noisy environment by finding a robust speech parametric space. The framework of Linear Discriminant Analysis (LDA) is used to derive an efficient speech parametric space for noisy speech recognition from a classical static+dynamic MFCC space. We first show that the derived LDA space can lead to higher discrimination than the MFCC-related space, even at low signal-to-noise ratio (SNR). Then, we test the robustness of the LDA space to variations between the training and testing SNR. Experiments are performed on a continuous speech recognition task, where speech is degraded with various noises: Gaussian noise, F16, Lynx helicopter, autobus, hair dryer. It was found that LDA is highly sensitive to SNR variations for white noises (Gaussian, hair dryer), while remaining quite efficient for the others.
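The kind of LDA projection the paper builds on can be sketched as follows; the two synthetic classes below stand in for the phonetic classes and MFCC features of the actual task.

```python
import numpy as np

def lda_transform(X, y, n_dims):
    """Fisher LDA: project features X (N, D) with class labels y onto the
    n_dims most discriminant directions, the top eigenvectors of inv(Sw) Sb."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    W = evecs.real[:, order[:n_dims]]
    return X @ W, W

# Two classes separated along dimension 0, noise elsewhere.
rng = np.random.default_rng(1)
X0 = rng.standard_normal((200, 3))
X1 = rng.standard_normal((200, 3))
X1[:, 0] += 4.0
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)
Z, W = lda_transform(X, y, 1)
```

The projected space Z concentrates the class separation into few dimensions, which is what makes the derived space more discriminant than the raw features.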
Yasuhiro Minami, NTT Human Interface Laboratories (JAPAN)
Sadaoki Furui, NTT Human Interface Laboratories (JAPAN)
This paper proposes an adaptation method for universal noise (additive noise and multiplicative distortion) based on the HMM composition (compensation) technique. Although the original HMM composition can be applied only to additive noise, our new method can estimate multiplicative distortion by maximizing the likelihood value. Signal-to-noise ratio is automatically estimated as part of the estimation of multiplicative distortion. Phoneme recognition experiments show that this method improves recognition accuracy for noisy and distorted speech.
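The multiplicative term can be illustrated in a toy setting: in the log-spectral domain a gain g composes with the noise model as log(g·exp(μs) + exp(μn)). Below, a simple grid search stands in for the paper's likelihood maximization, and all model values are made up for the example.

```python
import numpy as np

def estimate_gain(noisy, mu_s, mu_n, var, gains):
    """Pick the multiplicative gain whose composed model
    log(g * exp(mu_s) + exp(mu_n)) gives the noisy log-spectra
    the highest Gaussian likelihood."""
    best_g, best_ll = None, -np.inf
    for g in gains:
        mu_y = np.log(g * np.exp(mu_s) + np.exp(mu_n))   # composed mean
        ll = -0.5 * (((noisy - mu_y) ** 2) / var).sum()
        if ll > best_ll:
            best_g, best_ll = g, ll
    return best_g

mu_s = np.array([2.0, 1.0])     # clean-speech log-spectral means
mu_n = np.array([0.0, 0.0])     # noise log-spectral means
var = np.ones(2)
noisy = np.log(0.5 * np.exp(mu_s) + np.exp(mu_n))   # true gain is 0.5
g_hat = estimate_gain(noisy, mu_s, mu_n, var, [0.25, 0.5, 1.0, 2.0])
```

Once g is known, the SNR implied by the composed model follows directly, which is the sense in which the method estimates SNR as a by-product of the distortion estimate.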
M. J. F. Gales, Cambridge University (U.K.)
S. J. Young, Cambridge University (U.K.)
In previous papers the use of Parallel Model Combination (PMC) for noise robustness has been described. Various fast implementations have been proposed, though to date, compensating all the parameters of a system has required Gaussian integration. This paper introduces an alternative method that can compensate all the parameters of the recognition system, whilst reducing the computational load of this task. Furthermore, the technique offers an additional degree of flexibility, as it allows the number of components to be chosen and optimised using standard iterative techniques. The new technique is referred to as Data-driven PMC (DPMC). It is evaluated on the Resource Management database, with noise artificially added from the NOISEX-92 database. The performance of DPMC is found to be comparable to that of PMC, at a far lower computational cost. In complex noise environments, a slight improvement in performance is obtained by modelling the noise source more accurately with multiple components and then reducing the number of components back to the original number.
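The data-driven step can be sketched for a single diagonal log-spectral Gaussian per model: sample from the speech and noise Gaussians, push the samples through the additive-noise mismatch function, and refit a Gaussian. This is a deliberate simplification of DPMC, with illustrative values.

```python
import numpy as np

def dpmc_combine(mu_s, var_s, mu_n, var_n, n_samples=20000, seed=0):
    """Data-driven combination sketch: draw speech and noise log-spectral
    samples, add them in the linear domain, and re-estimate a Gaussian
    for the corrupted-speech distribution."""
    rng = np.random.default_rng(seed)
    s = rng.normal(mu_s, np.sqrt(var_s), size=(n_samples, len(mu_s)))
    n = rng.normal(mu_n, np.sqrt(var_n), size=(n_samples, len(mu_n)))
    y = np.log(np.exp(s) + np.exp(n))   # mismatch function for additive noise
    return y.mean(axis=0), y.var(axis=0)

# When the noise sits far below the speech, the corrupted model should be
# nearly identical to the clean one.
mu_y, var_y = dpmc_combine(np.array([1.0]), np.array([0.01]),
                           np.array([-9.0]), np.array([0.01]))
```

Because the compensation works on samples rather than closed-form integrals, the same machinery can fit more (or fewer) components to the corrupted distribution, which is the flexibility the abstract refers to.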
Pedro J. Moreno, Carnegie Mellon University (USA)
Bhiksha Raj, Carnegie Mellon University (USA)
Evandro Gouvêa, Carnegie Mellon University (USA)
Richard M. Stern, Carnegie Mellon University (USA)
In this paper we introduce a new family of environmental compensation algorithms called Multivariate Gaussian Based Cepstral Normalization (RATZ). RATZ assumes that the effects of unknown noise and filtering on speech features can be compensated by corrections to the mean and variance of components of Gaussian mixtures, and an efficient procedure for estimating the correction factors is provided. The RATZ algorithm can be implemented to work with or without the use of stereo development data that had been simultaneously recorded in the training and testing environments. Blind RATZ partially overcomes the loss of information that would have been provided by stereo training through the use of a more accurate description of how noisy environments affect clean speech. We evaluate the performance of the two RATZ algorithms using the CMU SPHINX-II system on the alphanumeric census database and compare their performance with that of previous environmental-robustness algorithms developed at CMU.
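The stereo variant can be sketched as follows: with simultaneously recorded clean and noisy frames, the mean correction for each Gaussian is the posterior-weighted average of the frame-wise differences. This is a simplified, illustrative version of the estimation, with made-up model parameters.

```python
import numpy as np

def ratz_corrections(X, Y, means, variances, weights):
    """Per-Gaussian mean corrections r_k estimated from stereo data:
    clean frames X and simultaneously recorded noisy frames Y, weighted
    by each frame's clean-speech component posterior."""
    log_p = (-0.5 * ((X[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
             - 0.5 * np.log(variances).sum(-1)[None] + np.log(weights)[None])
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)            # (T, K) posteriors
    diff = Y - X                                       # (T, D) observed shifts
    return (post[:, :, None] * diff[:, None, :]).sum(0) / post.sum(0)[:, None]

# Two clean components; the "environment" shifts each region differently.
rng = np.random.default_rng(2)
means = np.array([[0.0], [5.0]])
variances = np.ones((2, 1))
weights = np.array([0.5, 0.5])
X = np.vstack([rng.normal(0.0, 0.5, (300, 1)), rng.normal(5.0, 0.5, (300, 1))])
Y = X + np.vstack([np.full((300, 1), 1.0), np.full((300, 1), -2.0)])
r = ratz_corrections(X, Y, means, variances, weights)
```

At decode time each mixture mean would be shifted by its r_k; the blind variant replaces the observed differences with a model of how noise moves the clean distributions.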
Leonardo Neumeyer, SRI International (USA)
Mitchel Weintraub, SRI International (USA)
This paper compares three techniques for recognizing continuous speech in the presence of additive car noise: 1) transforming the noisy acoustic features using a mapping algorithm, 2) adaptation of the Hidden Markov Models (HMMs), and 3) combination of mapping and adaptation. We show that at low signal-to-noise ratio (SNR) levels, compensating in the feature and model domains yields similar performance. We also show that adapting the HMMs with the mapped features produces the best performance. The algorithms were implemented using SRI's DECIPHER™ speech recognition system and were tested on the 1994 ARPA-sponsored CSR evaluation test spoke 10.
Seokyong Moon, University of Washington (USA)
Jenq-Neng Hwang, University of Washington (USA)
The hidden Markov model (HMM) inversion algorithm is proposed and applied to robust speech recognition for general types of mismatched conditions. The Baum-Welch HMM inversion algorithm is a dual procedure to the Baum-Welch HMM reestimation algorithm, which is the most widely used HMM training technique in speech recognition. The forward training of an HMM, based on the Baum-Welch reestimation, finds the model parameters that optimize some criterion, usually maximum likelihood (ML), with given speech inputs. On the other hand, the inversion of an HMM finds speech inputs that optimize some criterion with given model parameters. The performance of the proposed HMM inversion, in conjunction with HMM reestimation, for robust speech recognition under additive noise corruption and microphone mismatch conditions compares favorably with that of other noisy speech recognition techniques, such as the projection-based first-order cepstrum normalization (FOCN) and the robust minimax (MINIMAX) classification techniques.
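One EM-style step of such an inversion can be sketched for diagonal Gaussians: holding the state occupation probabilities fixed, the ML input frames are precision-weighted combinations of the state means. A full inversion would alternate this with recomputing the occupancies; the numbers below are illustrative.

```python
import numpy as np

def invert_hmm_frames(gammas, means, variances):
    """HMM-inversion sketch: with state occupation probabilities gammas
    (T, J) held fixed, the ML speech input for each frame is the
    precision-weighted combination of the Gaussian state means (diagonal
    covariance case)."""
    prec = gammas[:, :, None] / variances[None]        # (T, J, D) weights
    return (prec * means[None]).sum(1) / prec.sum(1)

gammas = np.array([[1.0, 0.0],     # frame 0: entirely in state 0
                   [0.5, 0.5]])    # frame 1: split between the two states
means = np.array([[0.0], [4.0]])
variances = np.ones((2, 1))
x_hat = invert_hmm_frames(gammas, means, variances)
```

This is the sense in which inversion is the dual of reestimation: the same expected-likelihood objective is maximized over the inputs instead of over the model parameters.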
Keizaburo Takagi, NEC Corporation (JAPAN)
Hiroaki Hattori, NEC Corporation (JAPAN)
Takao Watanabe, NEC Corporation (JAPAN)
This paper proposes a rapid environment adaptation algorithm based on spectrum equalization (REALISE). In practical speech recognition applications, differences between training and testing environments often seriously diminish recognition accuracy. These environmental differences can be classified into two types: difference in additive noise and difference in multiplicative noise in the spectral domain. The proposed method calculates time-alignment between a testing utterance and the closest reference pattern to it, and then calculates the noise differences between the two according to the time-alignment. Then, we adapt all reference patterns to the testing environment using the differences. Finally, the testing utterance is recognized using the adapted reference patterns. In a 250 Japanese word recognition task, in which the training and testing microphones were of two different types, REALISE improved recognition accuracy from 87% to 96%.
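The alignment-and-equalization idea can be sketched with a plain DTW and a constant channel offset; the real system operates on a recognizer's reference patterns and spectra, so everything below is illustrative.

```python
import numpy as np

def dtw_path(A, B):
    """Plain DTW between frame sequences A (T1, D) and B (T2, D);
    returns the aligned index pairs on the minimum-cost path."""
    T1, T2 = len(A), len(B)
    cum = np.full((T1 + 1, T2 + 1), np.inf)
    cum[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            c = np.linalg.norm(A[i - 1] - B[j - 1])
            cum[i, j] = c + min(cum[i - 1, j - 1], cum[i - 1, j], cum[i, j - 1])
    path, i, j = [], T1, T2
    while i > 0 and j > 0:           # backtrack from the end of both sequences
        path.append((i - 1, j - 1))
        step = int(np.argmin([cum[i - 1, j - 1], cum[i - 1, j], cum[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def channel_difference(test_logspec, ref_logspec):
    """Average log-spectral difference along the alignment: an estimate of
    the multiplicative mismatch to apply to every reference pattern."""
    path = dtw_path(test_logspec, ref_logspec)
    return np.mean([test_logspec[i] - ref_logspec[j] for i, j in path], axis=0)

# Illustrative check: a constant channel offset between test and reference
# is recovered from the aligned difference.
ref = np.array([[10.0 * ((t * 7) % 13)] for t in range(10)])
test = ref + 0.7
channel = channel_difference(test, ref)
```

In the log-spectral domain an average difference captures the multiplicative (channel) part of the mismatch; the additive-noise part would be handled separately, as the abstract describes.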
H.G. Hirsch, Aachen University of Technology (GERMANY)
C. Ehrlicher, Aachen University of Technology (GERMANY)
Two new techniques are presented to estimate the noise spectra or the noise characteristics of noisy speech signals. No explicit speech pause detection is required. Only past noisy segments of about 400 ms duration are needed for the estimation, so the algorithm is able to quickly adapt to slowly varying noise levels or slowly changing noise spectra. This technique can be combined with a nonlinear spectral subtraction scheme. It can be shown to enhance noisy speech and to improve the performance of speech recognition systems. Another application is the realization of a robust voice activity detection.
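The estimation idea can be sketched as follows: take the lowest-energy frames within a short window of past magnitude spectra as the noise estimate (no pause detector needed), then oversubtract with a spectral floor. The window size, fraction, and factors below are illustrative choices, not the paper's exact weighting.

```python
import numpy as np

def estimate_noise(past_frames, frac=0.4):
    """Average the lowest-energy fraction of recent magnitude spectra;
    with a short window this tracks slowly varying noise without any
    explicit speech/pause decision."""
    energies = past_frames.sum(axis=1)
    k = max(1, int(frac * len(past_frames)))
    lowest = np.argsort(energies)[:k]
    return past_frames[lowest].mean(axis=0)

def nonlinear_spectral_subtraction(frame, noise, alpha=2.0, floor=0.1):
    """Oversubtract the noise estimate and floor the result."""
    return np.maximum(frame - alpha * noise, floor * noise)

# Ten past frames standing in for the ~400 ms window: six noise-only
# frames (magnitude 1.0) mixed with four speech frames (magnitude 5.0).
past = np.vstack([np.full((6, 4), 1.0), np.full((4, 4), 5.0)])
noise_est = estimate_noise(past)
enhanced = nonlinear_spectral_subtraction(np.full(4, 5.0), noise_est)
```

Comparing a current frame against such a noise estimate also yields a simple, noise-robust voice activity decision, the other application mentioned above.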
Devang Naik, Rutgers University (USA)
This paper introduces a new methodology to remove the residual effects of speech from the cepstral mean used for channel normalization. The approach is based on filtering the eigenmodes of speech that are most susceptible to convolutional distortions caused by transmission channels. The filtering of Linear Prediction (LP) poles and their corresponding eigenmodes for a speech segment is investigated for speaker identification systems under channel mismatch. An algorithm based on pole filtering has been developed to improve upon the commonly employed Cepstral Mean Subtraction. Speaker identification experiments are presented using speech from the TIMIT database and from the San Diego portion of the KING database. The new technique is shown to offer improved recognition accuracy under cross-channel scenarios when compared to conventional methods.
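Pole filtering can be sketched on a toy all-pole model: clamp each LP pole's radius before converting to cepstra, so that narrow speech resonances contribute less to the cepstral mean that will be subtracted. The clamp radius and model order are illustrative choices.

```python
import numpy as np

def pole_filter_lpc(a, r_max=0.9):
    """Pole-filtering sketch: clamp the radius of each LP pole to r_max,
    broadening narrow resonances. a holds the coefficients of
    A(z) = 1 + sum_k a_k z^-k."""
    poles = np.roots(np.concatenate(([1.0], a)))
    radii = np.minimum(np.abs(poles), r_max)
    poles = radii * np.exp(1j * np.angle(poles))
    return np.real(np.poly(poles))[1:]          # back to LP coefficients

def lpc_cepstrum(a, n_cep):
    """Standard LPC-to-cepstrum recursion for the all-pole model 1/A(z):
    c_n = -a_n - sum_{k<n} (k/n) c_k a_{n-k}."""
    c = np.zeros(n_cep)
    for n in range(1, n_cep + 1):
        acc = -a[n - 1] if n <= len(a) else 0.0
        for k in range(1, n):
            if 1 <= n - k <= len(a):
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# A single real pole at 0.95 is pulled back to 0.9; the cepstrum of
# 1/(1 - p z^-1) is p^n / n, so the filtered cepstra are 0.9, 0.405, 0.243.
a_filtered = pole_filter_lpc(np.array([-0.95]))
ceps = lpc_cepstrum(a_filtered, 3)
```

Averaging such pole-filtered cepstra over a segment gives a channel estimate less contaminated by speech, which is then subtracted in place of the plain cepstral mean.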