Session: SPEECH-L8
Time: 1:00 - 3:00, Thursday, May 10, 2001
Location: Room 151
Title: Noise Adaptation for Robust Speech Recognition
Chair: Alex Acero

1:00, SPEECH-L8.1
EFFICIENT ON-LINE ACOUSTIC ENVIRONMENT ESTIMATION FOR FCDCN IN A CONTINUOUS SPEECH RECOGNITION SYSTEM
J. DROPPO, L. DENG, A. ACERO
A number of cepstral de-noising algorithms perform quite well when trained and tested under similar acoustic environments, but degrade quickly under mismatched conditions. We present two key results that make these algorithms practical in real noise environments, with the ability to adapt to different acoustic environments over time. First, we show that it is possible to leverage the existing de-noising computations to estimate the acoustic environment on-line and in real time. Second, we show that it is not necessary to collect large amounts of training data in each environment: clean data with artificially mixed noise is sufficient. When this new method is used as a pre-processing stage, a large vocabulary speech recognition system can be made robust to a wide variety of acoustic environments. With synthetic training data, we are able to reduce the word error rate by 27%.
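
A minimal NumPy sketch of the on-line environment tracking idea: per-environment diagonal-covariance GMM likelihoods (the same quantities a cepstral de-noiser already computes) drive a recursively updated posterior over environments. The model structures and the forgetting constant are illustrative assumptions, not details from the paper.

```python
import numpy as np

def log_gmm_likelihood(frame, weights, means, variances):
    """Log-likelihood of one cepstral frame under a diagonal-covariance GMM."""
    diff = frame - means                                        # (K, D)
    exponent = -0.5 * np.sum(diff**2 / variances
                             + np.log(2 * np.pi * variances), axis=1)
    return np.logaddexp.reduce(np.log(weights) + exponent)

def update_env_posterior(log_post, frame, env_gmms, forgetting=0.995):
    """One recursive update of the per-environment log-posterior.

    env_gmms: list of (weights, means, variances) tuples, one per environment.
    The forgetting factor decays old evidence so the posterior can track
    environment changes over time.
    """
    log_like = np.array([log_gmm_likelihood(frame, *g) for g in env_gmms])
    log_post = forgetting * log_post + log_like
    return log_post - np.logaddexp.reduce(log_post)             # renormalize
```

The correction applied to each frame would then be a posterior-weighted blend of the per-environment FCDCN correction vectors.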

1:20, SPEECH-L8.2
ENVIRONMENTAL ADAPTATION BASED ON FIRST ORDER APPROXIMATION
C. CERISARA, L. RIGAZIO, R. BOMAN, J. JUNQUA
In this paper, we propose an algorithm that compensates for both additive and convolutional noise. The goal of this method is to achieve efficient environmental adaptation to realistic environments, in terms of both computation time and memory. The algorithm described in this paper is an extension of an additive noise adaptation algorithm presented in [1]. Experimental results are given on a realistic database recorded in a car. This database is further filtered by a low-pass filter to combine additive and channel noise. The proposed adaptation algorithm reduces the error rate by 75% on this database, compared to our baseline system without environmental adaptation.
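
For concreteness, a first-order (vector-Taylor-series-style) update of Gaussian means and variances under the standard log-spectral mismatch function y = x + h + log(1 + exp(n - x - h)) might look as sketched below; the paper's exact formulation may differ, and the additive-noise variance contribution is omitted here for brevity.

```python
import numpy as np

def adapt_gaussian(clean_means, clean_vars, noise_mean, channel):
    """First-order adaptation of a clean-speech Gaussian to noise + channel.

    All quantities are per-dimension log-spectral estimates: noise_mean is
    the additive-noise mean, channel the convolutional term h.
    """
    z = noise_mean - clean_means - channel
    noisy_means = clean_means + channel + np.logaddexp(0.0, z)  # mismatch function
    jacobian = np.exp(-np.logaddexp(0.0, z))    # 1/(1+e^z), numerically stable
    noisy_vars = jacobian**2 * clean_vars       # first-order variance update
    return noisy_means, noisy_vars
```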

1:40, SPEECH-L8.3
HIERARCHICAL STOCHASTIC FEATURE MATCHING FOR ROBUST SPEECH RECOGNITION
H. JIANG, F. SOONG, C. LEE
In this paper we investigate how to improve the robustness of a speech recognizer in a noisy, mismatched environment when only a single or a few test utterances are available for compensating the mismatch. A new hierarchical tree-based transformation is proposed to enhance the conventional stochastic matching algorithm in the cepstral feature space. The tree-based hierarchical transformation is estimated under two criteria: (i) maximum likelihood (ML), using the current test utterance; (ii) sequential maximum a posteriori (MAP), using the current and previous utterances. Recognition results obtained using a hands-free database show that the proposed feature compensation is robust. A significant performance improvement has been observed over conventional stochastic matching.
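
The ML branch of the tree-based idea can be sketched as follows: cepstral residuals (frame minus aligned Gaussian mean) are accumulated at a leaf and propagated to its ancestors, and each leaf backs off to the shallowest ancestor with enough data. The node structure and the min_frames threshold are illustrative assumptions, not the paper's settings.

```python
import numpy as np

class BiasNode:
    """One node of a hierarchical bias-transformation tree (ML sketch)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.residual_sum = 0.0
        self.count = 0

    def accumulate(self, residual):
        """Add one residual (frame minus aligned Gaussian mean) up the tree."""
        node = self
        while node is not None:
            node.residual_sum = node.residual_sum + residual
            node.count += 1
            node = node.parent

    def bias(self, min_frames=50):
        """ML bias estimate, backing off to the parent when data are scarce."""
        node = self
        while node.parent is not None and node.count < min_frames:
            node = node.parent
        return node.residual_sum / max(node.count, 1)
```

With only a few test frames every leaf backs off to the root (a single global bias); as more utterances arrive, deeper and more specific transforms take over, which is the point of the hierarchy.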

2:00, SPEECH-L8.4
MODEL-COMBINATION-BASED ACOUSTIC MAPPING
M. WESTPHAL, A. WAIBEL
We propose a new method for compensating distortions in the speech signal caused by environment changes. The basic method concentrates on additive noise, but can be extended to also address channel changes and, to some extent, speaker changes. Combined with adaptation techniques, it leads to large error rate reductions for mobile speech applications. It is thereby more efficient than adapting the acoustic model of the recognizer and more powerful than simple noise reduction techniques.
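
The abstract leaves the mapping itself unspecified; one plausible reading, sketched below, combines clean log-spectral codewords with a noise estimate by log-addition (as in parallel model combination) and maps each noisy frame back toward the clean space with an MMSE weighting over codewords. The codebook, the Euclidean distance, and the tau scale are assumptions, not the paper's method.

```python
import numpy as np

def map_to_clean(noisy_frame, clean_codewords, noise_logspec, tau=1.0):
    """Map one noisy log-spectral frame toward the clean feature space.

    clean_codewords: (K, D) clean log-spectra; noise_logspec: (D,) noise
    estimate. Each codeword is "corrupted" by log-add model combination,
    then the frame is explained as an MMSE mixture of codewords.
    """
    noisy_codewords = np.logaddexp(clean_codewords, noise_logspec)
    d2 = np.sum((noisy_frame - noisy_codewords) ** 2, axis=1)
    w = np.exp(-0.5 * (d2 - d2.min()) / tau)    # stable soft assignment
    w /= w.sum()
    return w @ clean_codewords                  # MMSE clean estimate
```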

2:20, SPEECH-L8.5
RECURSIVE ESTIMATION OF TIME-VARYING ENVIRONMENTS FOR ROBUST SPEECH RECOGNITION
Y. ZHAO, S. WANG, K. YEN
An EM-type recursive estimation algorithm is formulated in the DFT domain for joint estimation of the time-varying parameters of the distortion channel and additive noise from online degraded speech. Speech features are estimated on the fly from the posterior estimates of short-time speech power spectra. Experiments were performed on speaker-independent continuous speech recognition using perceptually based linear prediction cepstral coefficients, log energy, and temporal regression coefficients as features. Speech data were taken from the TIMIT database and degraded by a simulated time-varying channel and noise. Experimental results show a significant improvement in word recognition accuracy from the proposed recursive estimation, compared with direct recognition using a baseline system and with speech feature estimation using a batch EM algorithm.
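
A heavily simplified sketch of the recursive idea, assuming the per-DFT-bin model |Y|^2 = |H|^2 |X|^2 + |N|^2 and exponential forgetting in place of the paper's full EM derivation; the Wiener-style E-step and the rho constant below are illustrative stand-ins, not the paper's update equations.

```python
import numpy as np

EPS = 1e-10

def recursive_step(y_pow, h_pow, n_pow, x_prior, rho=0.98):
    """One recursive update per frame; all arguments are per-DFT-bin arrays.

    y_pow: observed power spectrum; h_pow, n_pow: current channel and noise
    power estimates; x_prior: prior speech power (e.g. from a speech model).
    Returns the speech power estimate and the updated parameters.
    """
    speech_part = h_pow * x_prior
    gain = speech_part / (speech_part + n_pow)      # Wiener-style E-step
    x_hat = gain * y_pow / np.maximum(h_pow, EPS)   # posterior speech power
    # M-step with exponential forgetting, so the estimates can track
    # time-varying channel and noise:
    n_pow = rho * n_pow + (1 - rho) * np.maximum(y_pow - h_pow * x_hat, EPS)
    h_pow = (rho * h_pow
             + (1 - rho) * np.maximum(y_pow - n_pow, EPS) / np.maximum(x_hat, EPS))
    return x_hat, h_pow, n_pow
```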

2:40, SPEECH-L8.6
SEQUENTIAL NOISE ESTIMATION WITH OPTIMAL FORGETTING FOR ROBUST SPEECH RECOGNITION
M. AFIFY, O. SIOHAN
Mismatch is known to degrade the performance of speech recognition systems. In real-life applications the mismatch is usually non-stationary, and a general way to compensate for slowly time-varying mismatch is to use sequential algorithms with forgetting. The forgetting factor is usually chosen empirically on development data, with no optimality criterion. In this paper we introduce a framework for obtaining an optimal forgetting factor. The proposed method is applied in conjunction with a sequential noise estimation algorithm, but can be extended to sequential bias or affine transformation estimation. Speech recognition experiments validate the proposed method, first under a controlled scenario on the 5K Wall Street Journal task corrupted by different noise types, and then under a real-life scenario on speech recorded in a noisy car environment.
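
A minimal sketch of the sequential estimator, with a simple numerical search over candidate forgetting factors standing in for the paper's optimality criterion; the one-step predictive squared-error score and the candidate grid are assumptions made for illustration.

```python
import numpy as np

def sequential_noise_mean(frames, rho):
    """Recursive noise-mean estimate over (T, D) noise-frame features."""
    n = frames[0].copy()
    estimates = [n.copy()]
    for y in frames[1:]:
        n = rho * n + (1.0 - rho) * y               # exponential forgetting
        estimates.append(n.copy())
    return np.array(estimates)

def pick_forgetting(frames, grid=(0.90, 0.95, 0.98, 0.99)):
    """Choose the forgetting factor with the best one-step prediction."""
    best_rho, best_score = grid[0], -np.inf
    for rho in grid:
        est = sequential_noise_mean(frames, rho)
        score = -np.sum((frames[1:] - est[:-1]) ** 2)   # predictive error
        if score > best_score:
            best_rho, best_score = rho, score
    return best_rho
```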