Session: SPEECH-L5
Time: 3:30 - 5:30, Wednesday, May 9, 2001
Location: Room 151
Title: Acoustic Adaptation / Normalization 2
Chair: Ananth Sankar

3:30, SPEECH-L5.1
ANISOTROPIC MAP DEFINED BY EIGENVOICES FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
H. BOTTERWECK
A general method is examined which unifies the eigenvoice approach and MAP adaptation. The a priori distribution for MAP is chosen to be anisotropic, with the eigenvoices as preferred directions, while still allowing adaptation in all other directions. This allows a priori knowledge about typical speaker variability to be exploited within the MAP framework. The approach has two advantages: long-term adaptation leads to the same good results as the MAP method, whereas for ultra-short adaptation in the range of 1--2 seconds the overfitting seen with maximum likelihood techniques is avoided. The method is applied to large vocabulary continuous speech recognition, and results are compared with our recent transfer of the maximum likelihood eigenvoice method to LVCSR. Even after only one recognized word, significant WER improvements of up to 6% relative are observed for gender-independent recognition; a 14% improvement is obtained after 5 seconds.
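As an illustrative sketch only (this notation is an assumption, not the paper's exact formulation), the anisotropic prior can be pictured as a Gaussian over the HMM mean supervector whose variance is inflated along the eigenvoice directions, with MAP adaptation maximizing the usual posterior:

    % Sketch, not the paper's formulation.  \mu: mean supervector, e_k: eigenvoices,
    % \sigma^2: baseline prior variance, \tau_k^2 > \sigma^2: variance along eigenvoice k.
    p(\mu) = \mathcal{N}(\mu \mid \mu_0, \Sigma_0), \qquad
    \Sigma_0 = \sigma^2 I + \sum_k (\tau_k^2 - \sigma^2)\, e_k e_k^{\top}, \qquad
    \hat{\mu}_{\mathrm{MAP}} = \arg\max_{\mu}\, p(X \mid \mu)\, p(\mu)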

3:50, SPEECH-L5.2
CONFIDENCE-MEASURE-DRIVEN UNSUPERVISED INCREMENTAL ADAPTATION FOR HMM-BASED SPEECH RECOGNITION
D. CHARLET
In this work, we first review the usual ways of taking confidence measures into account in unsupervised adaptation and then propose a new unsupervised incremental adaptation scheme based on ranking the adaptation data according to their confidence measures. A semi-supervised adaptation process is also proposed: a confidence measure is used to select the bulk of the data for unsupervised adaptation, and the remaining small part of the data is handled in a supervised mode. Experiments are conducted on a field database, in which generic context-dependent phoneme HMMs are adapted to task- and field-specific conditions. These experiments show a significant improvement for unsupervised adaptation when confidence measures are used. We also show that the adaptation rate (which measures how heavily adaptation data are weighted relative to prior data) strongly influences the effectiveness of the confidence measure in unsupervised adaptation.
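As a rough illustration (the function name and the 90/10 split are assumptions, not taken from the paper), a confidence-ranked split of the adaptation data could be sketched in Python as:

    # Hypothetical sketch: rank decoded utterances by confidence, use the most
    # confident portion for unsupervised adaptation (recognizer output as labels)
    # and route the least confident remainder to supervised adaptation.
    def split_adaptation_data(utterances, confidences, unsup_fraction=0.9):
        ranked = sorted(zip(confidences, utterances), key=lambda p: p[0], reverse=True)
        cut = int(len(ranked) * unsup_fraction)
        unsupervised = [u for _, u in ranked[:cut]]
        supervised = [u for _, u in ranked[cut:]]
        return unsupervised, supervised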

4:10, SPEECH-L5.3
MULTIPLE-CLUSTER ADAPTIVE TRAINING SCHEMES
M. GALES
This paper examines the training of multiple-cluster systems using adaptive training schemes. Various forms of transformation and canonical model are described in a consistent framework, allowing re-estimation formulae for all cases to be derived simply. Initial experiments using these schemes on a large vocabulary speech recognition task are presented. They indicate that achieving the best performance when adapting these multiple-cluster systems requires adaptive training schemes rather than simpler cluster initialisation schemes.
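For background, one standard multiple-cluster (cluster adaptive training) form, written here in conventional notation rather than the paper's, interpolates the cluster means of the canonical model with per-speaker weights:

    % Conventional CAT notation (an assumption, not copied from the paper):
    % \mu_{p,m}: mean of component m in cluster p, \lambda_p^{(s)}: weight of
    % cluster p for speaker s, estimated from that speaker's adaptation data.
    \mu_m^{(s)} = \sum_{p=1}^{P} \lambda_p^{(s)}\, \mu_{p,m}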

4:30, SPEECH-L5.4
NEURAL-NETWORK-BASED HMM ADAPTATION FOR NOISY SPEECH
S. FURUI, D. ITOH
This paper proposes a new method, using neural networks, of adapting phone HMMs to noisy speech. The neural networks are designed to map clean speech HMMs to noise-adapted HMMs, using noise HMMs and signal-to-noise ratios (SNRs) as inputs, and are trained to minimize the mean square error between the output HMMs and the target noise-adapted HMMs. In evaluation, the proposed method was used to recognize noisy broadcast-news speech in speaker-dependent and speaker-independent modes. The trained networks were confirmed to be effective in recognizing new speakers under new noise and various SNR conditions.
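A minimal sketch of the idea follows; the architecture, dimensions, and training step shown are assumptions for illustration, not the authors' design.

    import torch
    import torch.nn as nn

    # Illustrative MLP (assumed architecture): maps a clean-HMM mean vector, a
    # noise-HMM mean vector and an SNR value to a noise-adapted mean vector,
    # trained with a mean-squared-error objective.
    class HMMAdapterNet(nn.Module):
        def __init__(self, feat_dim=39, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * feat_dim + 1, hidden),  # clean mean + noise mean + SNR
                nn.Tanh(),
                nn.Linear(hidden, feat_dim),          # predicted noise-adapted mean
            )

        def forward(self, clean_mean, noise_mean, snr_db):
            x = torch.cat([clean_mean, noise_mean, snr_db.unsqueeze(-1)], dim=-1)
            return self.net(x)

    # One training step on dummy data; real targets would be HMM means
    # re-estimated on matched noisy speech.
    model, loss_fn = HMMAdapterNet(), nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    clean, noise, snr, target = (torch.randn(8, 39), torch.randn(8, 39),
                                 torch.randn(8), torch.randn(8, 39))
    loss = loss_fn(model(clean, noise, snr), target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()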

4:50, SPEECH-L5.5
SPEAKER COMPENSATION WITH SINE-LOG ALL-PASS TRANSFORMS
J. MCDONOUGH, F. METZE, H. SOLTAU, A. WAIBEL
In recent work, we proposed the rational all-pass transform (RAPT) as the basis for a speaker adaptation scheme intended for use with a large vocabulary speech recognition system. It was shown that RAPT-based adaptation reduces to a linear transformation of cepstral means, much like the better-known maximum likelihood linear regression (MLLR). In a set of speech recognition experiments conducted on the Switchboard Corpus, we obtained a word error rate (WER) of 37.9% using RAPT adaptation, a significant improvement over the 39.5% achieved with MLLR. In the present work, we propose the sine-log all-pass transform (SLAPT) as a replacement for the RAPT. Our findings indicate that the SLAPT is just as effective as the RAPT at reducing WER when used as the basis for a variety of speaker compensation schemes and, in addition, allows far more tractable computation of transformed cepstral sequences and estimation of optimal transform parameters.
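Schematically (notation assumed here, not taken from the paper), all-pass-transform adaptation acts on the cepstral means as a constrained linear map:

    % \mu: cepstral mean, A(\alpha): matrix determined by the small set of
    % all-pass transform parameters \alpha; compare MLLR's unconstrained
    % \hat{\mu} = A\mu + b.
    \hat{\mu} = A(\alpha)\, \mu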

5:10, SPEECH-L5.6
VERY FAST ADAPTATION WITH A COMPACT CONTEXT-DEPENDENT EIGENVOICE MODEL
R. KUHN, F. PERRONNIN, P. NGUYEN, J. JUNQUA, L. RIGAZIO
The ``eigenvoice'' technique achieves rapid speaker adaptation by employing prior knowledge of speaker space to place strong constraints on the initial model for each new speaker; it has recently been shown to achieve good performance in a large-vocabulary system. This paper describes a new way of applying the eigenvoice technique to context-dependent acoustic modeling, called the ``Eigencentroid plus Delta Trees'' (EDT) model. Here, the context-dependent model is defined so that it consists of a speaker-dependent component with a small number of parameters linked to a speaker-independent component with far more parameters. The eigenvoice technique can be applied to the speaker-dependent component to attain very fast adaptation of the entire context-dependent model (e.g., 10% relative reduction in error rate after 3 sentences). EDT requires only a small number of parameters to represent speaker space and works even if only a small amount of data is available per training speaker.
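For background, the standard eigenvoice decomposition (conventional notation, not necessarily the paper's) expresses the speaker-dependent component as a low-dimensional combination of eigenvoices:

    % \bar{\mu}: speaker-independent mean supervector, e_k: eigenvoices,
    % w_k^{(s)}: weights estimated from the new speaker's adaptation data.
    \mu^{(s)} = \bar{\mu} + \sum_{k=1}^{K} w_k^{(s)}\, e_k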