Chair: Richard Stern, Carnegie Mellon University (USA)
Ananth Sankar, SRI International (USA)
Chin-Hui Lee, AT&T Bell Laboratories (USA)
We present a maximum likelihood (ML) stochastic matching approach to decrease the acoustic mismatch between a test utterance Y and a given set of speech hidden Markov models Λ_X so as to reduce the recognition performance degradation caused by possible distortions in the test utterance. This mismatch may be reduced in two ways: 1) by an inverse distortion function F_ν(·) that maps Y into an utterance X which matches better with the models Λ_X, and 2) by a model transformation function G_η(·) that maps Λ_X to the transformed model Λ_Y which matches better with the utterance Y. The functional form of the transformations depends upon our prior knowledge about the mismatch, and the parameters are estimated along with the recognized string in a maximum likelihood manner using the EM algorithm. Experimental results verify the efficacy of the approach in improving the performance of a continuous speech recognition system in the presence of mismatch due to different transducers and transmission channels.
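The joint estimation can be sketched for the simplest inverse distortion function, a fixed additive cepstral bias b (so the compensated features are y − b), estimated by EM against a Gaussian mixture standing in for the speech models. The GMM parameters and synthetic data below are illustrative, not taken from the paper.

```python
import numpy as np

def em_bias_estimate(Y, means, variances, weights, n_iter=10):
    """ML estimate of a fixed cepstral bias b (x_t = y_t - b) under a
    diagonal-covariance clean-speech GMM, via EM."""
    b = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        X = Y - b                                   # compensated features
        # E-step: component posteriors for each frame
        log_p = (-0.5 * ((X[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
                 - 0.5 * np.log(variances).sum(-1)[None] + np.log(weights)[None])
        log_p -= log_p.max(axis=1, keepdims=True)
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: closed-form, precision-weighted update of the bias
        num = (post[:, :, None] * (Y[:, None, :] - means[None]) / variances[None]).sum((0, 1))
        den = (post[:, :, None] / variances[None]).sum((0, 1))
        b = num / den
    return b

# Illustrative check: distort GMM samples with a known bias and recover it.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [6.0, 6.0]])
variances = np.ones((2, 2))
weights = np.array([0.5, 0.5])
comp = rng.integers(0, 2, size=500)
clean = means[comp] + rng.standard_normal((500, 2))
true_bias = np.array([1.0, -0.5])
b_hat = em_bias_estimate(clean + true_bias, means, variances, weights)
```

In the paper the mixture is the recognizer's own HMM state distributions and the bias is estimated jointly with the decoded string; the closed-form M-step above is the core of that update.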
Olivier Siohan, CRIN-CNRS & INRIA - Lorraine (FRANCE)
This paper addresses the problem of speech recognition in a noisy environment by finding a robust speech parametric space. The framework of Linear Discriminant Analysis (LDA) is used to derive an efficient speech parametric space for noisy speech recognition from a classical static+dynamic MFCC space. We first show that the derived LDA space can lead to higher discrimination than the MFCC-related space, even at low signal-to-noise ratio (SNR). Then, we test the robustness of the LDA space to variations between the training and testing SNR. Experiments are performed on a continuous speech recognition task, where speech is degraded with various noises: Gaussian noise, F16, Lynx helicopter, autobus, hair dryer. It was found that LDA is highly sensitive to SNR variations for white noises (Gaussian, hair dryer), while remaining quite efficient for the others.
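The kind of LDA projection the paper builds on can be sketched as follows; the two synthetic classes below stand in for the phonetic classes and MFCC features of the actual task.

```python
import numpy as np

def lda_transform(X, y, n_dims):
    """Fisher LDA: project features X (N, D) with class labels y onto the
    n_dims most discriminant directions, the top eigenvectors of inv(Sw) Sb."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    W = evecs.real[:, order[:n_dims]]
    return X @ W, W

# Two classes separated along dimension 0, noise elsewhere.
rng = np.random.default_rng(1)
X0 = rng.standard_normal((200, 3))
X1 = rng.standard_normal((200, 3))
X1[:, 0] += 4.0
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)
Z, W = lda_transform(X, y, 1)
```

The projected space Z concentrates the class separation into few dimensions, which is what makes the derived space more discriminant than the raw features.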
Yasuhiro Minami, NTT Human Interface Laboratories (JAPAN)
Sadaoki Furui, NTT Human Interface Laboratories (JAPAN)
This paper proposes an adaptation method for universal noise (additive noise and multiplicative distortion) based on the HMM composition (compensation) technique. Although the original HMM composition can be applied only to additive noise, our new method can estimate multiplicative distortion by maximizing the likelihood value. Signal-to-noise ratio is automatically estimated as part of the estimation of multiplicative distortion. Phoneme recognition experiments show that this method improves recognition accuracy for noisy and distorted speech.
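The multiplicative term can be illustrated in a toy setting: in the log-spectral domain a gain g composes with the noise model as log(g·exp(μs) + exp(μn)). Below, a simple grid search stands in for the paper's likelihood maximization, and all model values are made up for the example.

```python
import numpy as np

def estimate_gain(noisy, mu_s, mu_n, var, gains):
    """Pick the multiplicative gain whose composed model
    log(g * exp(mu_s) + exp(mu_n)) gives the noisy log-spectra
    the highest Gaussian likelihood."""
    best_g, best_ll = None, -np.inf
    for g in gains:
        mu_y = np.log(g * np.exp(mu_s) + np.exp(mu_n))   # composed mean
        ll = -0.5 * (((noisy - mu_y) ** 2) / var).sum()
        if ll > best_ll:
            best_g, best_ll = g, ll
    return best_g

mu_s = np.array([2.0, 1.0])     # clean-speech log-spectral means
mu_n = np.array([0.0, 0.0])     # noise log-spectral means
var = np.ones(2)
noisy = np.log(0.5 * np.exp(mu_s) + np.exp(mu_n))   # true gain is 0.5
g_hat = estimate_gain(noisy, mu_s, mu_n, var, [0.25, 0.5, 1.0, 2.0])
```

Once g is known, the SNR implied by the composed model follows directly, which is the sense in which the method estimates SNR as a by-product of the distortion estimate.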
M. J. F. Gales, Cambridge University (U.K.)
S. J. Young, Cambridge University (U.K.)
In previous papers the use of Parallel Model Combination (PMC) for noise robustness has been described. Various fast implementations have been proposed, though to date, compensating all the parameters of a system has required Gaussian integration. This paper introduces an alternative method that can compensate all the parameters of the recognition system, whilst reducing the computational load of this task. Furthermore, the technique offers an additional degree of flexibility, as it allows the number of components to be chosen and optimised using standard iterative techniques. The new technique is referred to as Data-driven PMC (DPMC). It is evaluated on the Resource Management database, with noise artificially added from the NOISEX-92 database. The performance of DPMC is found to be comparable to that of PMC, at a far lower computational cost. In complex noise environments, a slight improvement in performance is obtained by modelling the noise source more accurately with multiple components and then reducing the number of components back to the original number.
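The data-driven step can be sketched for a single diagonal log-spectral Gaussian per model: sample from the speech and noise Gaussians, push the samples through the additive-noise mismatch function, and refit a Gaussian. This is a deliberate simplification of DPMC, with illustrative values.

```python
import numpy as np

def dpmc_combine(mu_s, var_s, mu_n, var_n, n_samples=20000, seed=0):
    """Data-driven combination sketch: draw speech and noise log-spectral
    samples, add them in the linear domain, and re-estimate a Gaussian
    for the corrupted-speech distribution."""
    rng = np.random.default_rng(seed)
    s = rng.normal(mu_s, np.sqrt(var_s), size=(n_samples, len(mu_s)))
    n = rng.normal(mu_n, np.sqrt(var_n), size=(n_samples, len(mu_n)))
    y = np.log(np.exp(s) + np.exp(n))   # mismatch function for additive noise
    return y.mean(axis=0), y.var(axis=0)

# When the noise sits far below the speech, the corrupted model should be
# nearly identical to the clean one.
mu_y, var_y = dpmc_combine(np.array([1.0]), np.array([0.01]),
                           np.array([-9.0]), np.array([0.01]))
```

Because the compensation works on samples rather than closed-form integrals, the same machinery can fit more (or fewer) components to the corrupted distribution, which is the flexibility the abstract refers to.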
Pedro J. Moreno, Carnegie Mellon University (USA)
Bhiksha Raj, Carnegie Mellon University (USA)
Evandro Gouvêa, Carnegie Mellon University (USA)
Richard M. Stern, Carnegie Mellon University (USA)
In this paper we introduce a new family of environmental compensation algorithms called Multivariate Gaussian Based Cepstral Normalization (RATZ). RATZ assumes that the effects of unknown noise and filtering on speech features can be compensated by corrections to the mean and variance of components of Gaussian mixtures, and an efficient procedure for estimating the correction factors is provided. The RATZ algorithm can be implemented to work with or without the use of stereo development data that had been simultaneously recorded in the training and testing environments. Blind RATZ partially overcomes the loss of information that would have been provided by stereo training through the use of a more accurate description of how noisy environments affect clean speech. We evaluate the performance of the two RATZ algorithms using the CMU SPHINX-II system on the alphanumeric census database and compare their performance with that of previous environmental-robustness algorithms developed at CMU.
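The stereo variant can be sketched as follows: with simultaneously recorded clean and noisy frames, the mean correction for each Gaussian is the posterior-weighted average of the frame-wise differences. This is a simplified, illustrative version of the estimation, with made-up model parameters.

```python
import numpy as np

def ratz_corrections(X, Y, means, variances, weights):
    """Per-Gaussian mean corrections r_k estimated from stereo data:
    clean frames X and simultaneously recorded noisy frames Y, weighted
    by each frame's clean-speech component posterior."""
    log_p = (-0.5 * ((X[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
             - 0.5 * np.log(variances).sum(-1)[None] + np.log(weights)[None])
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)            # (T, K) posteriors
    diff = Y - X                                       # (T, D) observed shifts
    return (post[:, :, None] * diff[:, None, :]).sum(0) / post.sum(0)[:, None]

# Two clean components; the "environment" shifts each region differently.
rng = np.random.default_rng(2)
means = np.array([[0.0], [5.0]])
variances = np.ones((2, 1))
weights = np.array([0.5, 0.5])
X = np.vstack([rng.normal(0.0, 0.5, (300, 1)), rng.normal(5.0, 0.5, (300, 1))])
Y = X + np.vstack([np.full((300, 1), 1.0), np.full((300, 1), -2.0)])
r = ratz_corrections(X, Y, means, variances, weights)
```

At decode time each mixture mean would be shifted by its r_k; the blind variant replaces the observed differences with a model of how noise moves the clean distributions.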
Leonardo Neumeyer, SRI International (USA)
Mitchel Weintraub, SRI International (USA)
This paper compares three techniques for recognizing continuous speech in the presence of additive car noise: 1) transforming the noisy acoustic features using a mapping algorithm, 2) adaptation of the Hidden Markov Models (HMMs), and 3) combination of mapping and adaptation. We show that at low signal-to-noise ratio (SNR) levels, compensating in the feature and model domains yields similar performance. We also show that adapting the HMMs with the mapped features produces the best performance. The algorithms were implemented using SRI's DECIPHER™ speech recognition system and were tested on the 1994 ARPA-sponsored CSR evaluation test spoke 10.
Seokyong Moon, University of Washington (USA)
Jenq-Neng Hwang, University of Washington (USA)
The hidden Markov model (HMM) inversion algorithm is proposed and applied to robust speech recognition for general types of mismatched conditions. The Baum-Welch HMM inversion algorithm is a dual procedure to the Baum-Welch HMM reestimation algorithm, which is the most widely used HMM training technique in speech recognition. The forward training of an HMM, based on the Baum-Welch reestimation, finds the model parameters that optimize some criterion, usually maximum likelihood (ML), with given speech inputs. On the other hand, the inversion of an HMM finds speech inputs that optimize some criterion with given model parameters. The performance of the proposed HMM inversion, in conjunction with HMM reestimation, for robust speech recognition under additive noise corruption and microphone mismatch conditions compares favorably with that of other noisy speech recognition techniques, such as the projection-based first-order cepstrum normalization (FOCN) and the robust minimax (MINIMAX) classification techniques.
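One EM-style step of such an inversion can be sketched for diagonal Gaussians: holding the state occupation probabilities fixed, the ML input frames are precision-weighted combinations of the state means. A full inversion would alternate this with recomputing the occupancies; the numbers below are illustrative.

```python
import numpy as np

def invert_hmm_frames(gammas, means, variances):
    """HMM-inversion sketch: with state occupation probabilities gammas
    (T, J) held fixed, the ML speech input for each frame is the
    precision-weighted combination of the Gaussian state means (diagonal
    covariance case)."""
    prec = gammas[:, :, None] / variances[None]        # (T, J, D) weights
    return (prec * means[None]).sum(1) / prec.sum(1)

gammas = np.array([[1.0, 0.0],     # frame 0: entirely in state 0
                   [0.5, 0.5]])    # frame 1: split between the two states
means = np.array([[0.0], [4.0]])
variances = np.ones((2, 1))
x_hat = invert_hmm_frames(gammas, means, variances)
```

This is the sense in which inversion is the dual of reestimation: the same expected-likelihood objective is maximized over the inputs instead of over the model parameters.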
Keizaburo Takagi, NEC Corporation (JAPAN)
Hiroaki Hattori, NEC Corporation (JAPAN)
Takao Watanabe, NEC Corporation (JAPAN)
This paper proposes a rapid environment adaptation algorithm based on spectrum equalization (REALISE). In practical speech recognition applications, differences between training and testing environments often seriously diminish recognition accuracy. These environmental differences can be classified into two types: difference in additive noise and difference in multiplicative noise in the spectral domain. The proposed method calculates time-alignment between a testing utterance and the closest reference pattern to it, and then calculates the noise differences between the two according to the time-alignment. Then, we adapt all reference patterns to the testing environment using the differences. Finally, the testing utterance is recognized using the adapted reference patterns. In a 250 Japanese word recognition task, in which the training and testing microphones were of two different types, REALISE improved recognition accuracy from 87% to 96%.
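The alignment-and-equalization idea can be sketched with a plain DTW and a constant channel offset; the real system operates on a recognizer's reference patterns and spectra, so everything below is illustrative.

```python
import numpy as np

def dtw_path(A, B):
    """Plain DTW between frame sequences A (T1, D) and B (T2, D);
    returns the aligned index pairs on the minimum-cost path."""
    T1, T2 = len(A), len(B)
    cum = np.full((T1 + 1, T2 + 1), np.inf)
    cum[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            c = np.linalg.norm(A[i - 1] - B[j - 1])
            cum[i, j] = c + min(cum[i - 1, j - 1], cum[i - 1, j], cum[i, j - 1])
    path, i, j = [], T1, T2
    while i > 0 and j > 0:           # backtrack from the end of both sequences
        path.append((i - 1, j - 1))
        step = int(np.argmin([cum[i - 1, j - 1], cum[i - 1, j], cum[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def channel_difference(test_logspec, ref_logspec):
    """Average log-spectral difference along the alignment: an estimate of
    the multiplicative mismatch to apply to every reference pattern."""
    path = dtw_path(test_logspec, ref_logspec)
    return np.mean([test_logspec[i] - ref_logspec[j] for i, j in path], axis=0)

# Illustrative check: a constant channel offset between test and reference
# is recovered from the aligned difference.
ref = np.array([[10.0 * ((t * 7) % 13)] for t in range(10)])
test = ref + 0.7
channel = channel_difference(test, ref)
```

In the log-spectral domain an average difference captures the multiplicative (channel) part of the mismatch; the additive-noise part would be handled separately, as the abstract describes.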
H.G. Hirsch, Aachen University of Technology (GERMANY)
C. Ehrlicher, Aachen University of Technology (GERMANY)
Two new techniques are presented to estimate the noise spectra or the noise characteristics of noisy speech signals. No explicit speech pause detection is required. Only past noisy segments of about 400 ms duration are needed for the estimation, so the algorithm is able to quickly adapt to slowly varying noise levels or slowly changing noise spectra. This technique can be combined with a nonlinear spectral subtraction scheme. It can be shown to enhance noisy speech and to improve the performance of speech recognition systems. Another application is the realization of a robust voice activity detection.
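The estimation idea can be sketched as follows: take the lowest-energy frames within a short window of past magnitude spectra as the noise estimate (no pause detector needed), then oversubtract with a spectral floor. The window size, fraction, and factors below are illustrative choices, not the paper's exact weighting.

```python
import numpy as np

def estimate_noise(past_frames, frac=0.4):
    """Average the lowest-energy fraction of recent magnitude spectra;
    with a short window this tracks slowly varying noise without any
    explicit speech/pause decision."""
    energies = past_frames.sum(axis=1)
    k = max(1, int(frac * len(past_frames)))
    lowest = np.argsort(energies)[:k]
    return past_frames[lowest].mean(axis=0)

def nonlinear_spectral_subtraction(frame, noise, alpha=2.0, floor=0.1):
    """Oversubtract the noise estimate and floor the result."""
    return np.maximum(frame - alpha * noise, floor * noise)

# Ten past frames standing in for the ~400 ms window: six noise-only
# frames (magnitude 1.0) mixed with four speech frames (magnitude 5.0).
past = np.vstack([np.full((6, 4), 1.0), np.full((4, 4), 5.0)])
noise_est = estimate_noise(past)
enhanced = nonlinear_spectral_subtraction(np.full(4, 5.0), noise_est)
```

Comparing a current frame against such a noise estimate also yields a simple, noise-robust voice activity decision, the other application mentioned above.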
Devang Naik, Rutgers University (USA)
This paper introduces a new methodology to remove the residual effects of speech from the cepstral mean used for channel normalization. The approach is based on filtering the eigenmodes of speech that are most susceptible to convolutional distortions caused by transmission channels. The filtering of Linear Prediction (LP) poles and their corresponding eigenmodes for a speech segment is investigated for speaker identification systems under channel mismatch. An algorithm based on pole filtering has been developed to improve upon the commonly employed Cepstral Mean Subtraction. Speaker identification experiments are presented using speech from the TIMIT database and from the San Diego portion of the KING database. The new technique is shown to offer improved recognition accuracy under cross-channel scenarios when compared to conventional methods.
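Pole filtering can be sketched on a toy all-pole model: clamp each LP pole's radius before converting to cepstra, so that narrow speech resonances contribute less to the cepstral mean that will be subtracted. The clamp radius and model order are illustrative choices.

```python
import numpy as np

def pole_filter_lpc(a, r_max=0.9):
    """Pole-filtering sketch: clamp the radius of each LP pole to r_max,
    broadening narrow resonances. a holds the coefficients of
    A(z) = 1 + sum_k a_k z^-k."""
    poles = np.roots(np.concatenate(([1.0], a)))
    radii = np.minimum(np.abs(poles), r_max)
    poles = radii * np.exp(1j * np.angle(poles))
    return np.real(np.poly(poles))[1:]          # back to LP coefficients

def lpc_cepstrum(a, n_cep):
    """Standard LPC-to-cepstrum recursion for the all-pole model 1/A(z):
    c_n = -a_n - sum_{k<n} (k/n) c_k a_{n-k}."""
    c = np.zeros(n_cep)
    for n in range(1, n_cep + 1):
        acc = -a[n - 1] if n <= len(a) else 0.0
        for k in range(1, n):
            if 1 <= n - k <= len(a):
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# A single real pole at 0.95 is pulled back to 0.9; the cepstrum of
# 1/(1 - p z^-1) is p^n / n, so the filtered cepstra are 0.9, 0.405, 0.243.
a_filtered = pole_filter_lpc(np.array([-0.95]))
ceps = lpc_cepstrum(a_filtered, 3)
```

Averaging such pole-filtered cepstra over a segment gives a channel estimate less contaminated by speech, which is then subtracted in place of the plain cepstral mean.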