1:00, SPEECH-L8.1
EFFICIENT ON-LINE ACOUSTIC ENVIRONMENT ESTIMATION FOR FCDCN IN A CONTINUOUS SPEECH RECOGNITION SYSTEM
J. DROPPO, L. DENG, A. ACERO
A number of cepstral de-noising algorithms exist that perform well when trained and tested under similar acoustic environments, but degrade quickly under mismatched conditions.
We present two key results that make these algorithms practical in real noise environments, with the ability to adapt to different acoustic environments over time. First, we show that it is possible to leverage the existing de-noising computations to estimate the acoustic environment on-line and in real time. Second, we show that it is not necessary to collect large amounts of training data in each environment--clean data with artificial mixing is sufficient.
When this new method is used as a pre-processing stage to a large vocabulary speech recognition system, it can be made robust to a wide variety of acoustic environments. With synthetic training data, we are able to reduce the word error rate by 27%.
1:20, SPEECH-L8.2
ENVIRONMENTAL ADAPTATION BASED ON FIRST ORDER APPROXIMATION
C. CERISARA, L. RIGAZIO, R. BOMAN, J. JUNQUA
In this paper, we propose an algorithm that compensates for both additive and convolutional noise. The goal is efficient environmental adaptation to realistic environments, in terms of both computation time and memory. The algorithm described in this paper is an extension of an additive noise adaptation algorithm presented in [1]. Experimental results are given on a realistic database recorded in a car; this database is further filtered by a low-pass filter to combine additive and channel noise. The proposed adaptation algorithm reduces the error rate on this database by 75% compared with our baseline system without environmental adaptation.
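To illustrate the general idea behind first-order environmental compensation (this is a generic sketch of the standard log-spectral mismatch model with a Jacobian-style linearization, not the authors' exact formulation): in the log-spectral domain, noisy speech y relates to clean speech x, channel h, and noise n by y = x + h + log(1 + exp(n - x - h)), and a first-order Taylor expansion lets model means track small changes in the environment estimates cheaply.

```python
import numpy as np

def mismatch(x, h, n):
    """Log-spectral mismatch function: y = x + h + log(1 + exp(n - x - h))."""
    return x + h + np.log1p(np.exp(n - x - h))

def adapt_mean_first_order(mu_x, h, n, h_new, n_new):
    """First-order update of a noisy-speech log-spectral mean when the
    channel/noise estimates move from (h, n) to (h_new, n_new).
    Illustrative only; function and variable names are our own."""
    # Partial derivative of the mismatch function w.r.t. the noise term
    # at the expansion point: dy/dn = 1 / (1 + exp(x + h - n)).
    g = 1.0 / (1.0 + np.exp(mu_x + h - n))
    mu_y0 = mismatch(mu_x, h, n)
    # dy/dh = 1 - g; linear correction toward the new environment.
    return mu_y0 + (1.0 - g) * (h_new - h) + g * (n_new - n)
```

When the noise is far below the speech (g near 0) the update reduces to a pure channel shift; when noise dominates (g near 1) the mean simply follows the noise estimate, which matches the intuition behind such linearized compensation schemes.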
1:40, SPEECH-L8.3
HIERARCHICAL STOCHASTIC FEATURE MATCHING FOR ROBUST SPEECH RECOGNITION
H. JIANG, F. SOONG, C. LEE
In this paper we investigate how to improve the robustness of a speech recognizer in a noisy, mismatched environment when only one or a few test utterances are available for compensating the mismatch. A new hierarchical tree-based transformation is proposed to enhance the conventional stochastic matching algorithm in the cepstral feature space. The tree-based hierarchical transformation is estimated under two criteria: i) maximum likelihood (ML), using the current test utterance; ii) sequential maximum a posteriori (MAP), using the current and previous utterances. Recognition results obtained on a hands-free database show that the proposed feature compensation is robust, with significant performance improvement over conventional stochastic matching.
2:00, SPEECH-L8.4
MODEL-COMBINATION-BASED ACOUSTIC MAPPING
M. WESTPHAL, A. WAIBEL
We propose a new method for compensating distortions in the speech signal caused by environment changes. The basic method concentrates on additive noise, but can be extended to also address channel and, to some extent, speaker changes. Combined with adaptation techniques, it yields large error rate reductions for mobile speech applications. It is thereby more efficient than adapting the acoustic model of the recognizer and more powerful than simple noise reduction techniques.
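The core operation in model-combination approaches of this kind can be sketched as a log-add of clean-speech and noise statistics in the linear spectral domain (a minimal illustration of the general technique, not this paper's specific acoustic mapping; all names here are our own):

```python
import numpy as np

def combine_means(mu_x, mu_n):
    """Combine a clean-speech log-spectral mean mu_x with a noise mean mu_n:
    exponentiate to the linear power domain, add, and take the log again."""
    return np.logaddexp(mu_x, mu_n)

def map_to_clean(y, mu_x, mu_n):
    """Crude acoustic mapping: shift a noisy observation y back toward the
    clean feature space by the offset the combined model predicts."""
    mu_y = combine_means(mu_x, mu_n)
    return y - (mu_y - mu_x)
```

When the noise mean is far below the speech mean, the combined mean collapses to the clean one and the mapping becomes the identity, which is the expected limiting behavior.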
2:20, SPEECH-L8.5
RECURSIVE ESTIMATION OF TIME-VARYING ENVIRONMENTS FOR ROBUST SPEECH RECOGNITION
Y. ZHAO, S. WANG, K. YEN
An EM-type recursive estimation algorithm is formulated in the DFT domain for joint estimation of the time-varying parameters of the distortion channel and additive noise from online degraded speech. Speech features are estimated on the fly from the posterior estimates of short-time speech power spectra. Experiments were performed on speaker-independent continuous speech recognition using features of perceptually based linear prediction cepstral coefficients, log energy, and temporal regression coefficients. Speech data were taken from the TIMIT database and degraded by a simulated time-varying channel and noise. Experimental results showed significant improvement in recognition word accuracy from the proposed recursive estimation, compared with direct recognition using a baseline system and with speech feature estimation using a batch EM algorithm.
2:40, SPEECH-L8.6
SEQUENTIAL NOISE ESTIMATION WITH OPTIMAL FORGETTING FOR ROBUST SPEECH RECOGNITION
M. AFIFY, O. SIOHAN
Mismatch is known to degrade the performance of speech recognition systems. In real-life applications mismatch is usually non-stationary, and a general way to compensate for slowly time-varying mismatch is to use sequential algorithms with forgetting. The forgetting factor is usually chosen empirically on development data, with no optimality criterion. In this paper we introduce a framework for obtaining an optimal forgetting factor. The proposed method is applied in conjunction with a sequential noise estimation algorithm, but can be extended to sequential bias or affine transformation estimation. The method is validated by speech recognition experiments conducted first under a controlled scenario, on the 5K Wall Street Journal task corrupted by different noise types, and then under a real-life scenario, on speech recorded in a noisy car environment.
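A sequential estimator with forgetting of the kind this abstract refers to can be sketched as an exponentially weighted running average (a generic illustration with a fixed, hand-chosen forgetting factor; the paper's contribution is precisely a criterion for setting that factor optimally, which is not reproduced here):

```python
import numpy as np

class SequentialNoiseEstimator:
    """Exponentially forgetting running estimate of a noise spectrum.
    lam is the forgetting factor: values near 1 average over a long
    history (slow tracking), smaller values forget faster and track
    non-stationary noise more aggressively. Class and parameter names
    are our own, for illustration."""

    def __init__(self, lam=0.95):
        self.lam = lam
        self.est = None

    def update(self, frame):
        """Fold one (noise-dominated) frame into the running estimate."""
        frame = np.asarray(frame, dtype=float)
        if self.est is None:
            self.est = frame.copy()
        else:
            self.est = self.lam * self.est + (1.0 - self.lam) * frame
        return self.est
```

Under stationary noise the estimate converges to the noise mean; when the noise level jumps, the convergence speed (and hence the tracking/variance trade-off) is governed entirely by lam, which is why choosing it by an optimality criterion rather than by hand matters.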