Chair: C.H. Lee, AT&T Bell Laboratories (USA)
G. Zavaliagkos, Northeastern University (USA)
R. Schwartz, BBN Systems and Technologies (USA)
J. Makhoul, BBN Systems and Technologies (USA)
We present a framework for maximum a posteriori (MAP) adaptation of large-scale HMM speech recognizers. In this framework, we introduce mechanisms that take advantage of correlations present among HMM parameters in order to maximize the number of parameters that can be adapted from a limited number of observations. We also separately explore the feasibility of instantaneous adaptation techniques, which attempt to improve recognition on a single sentence, the same sentence used to estimate the adaptation. In a nutshell, we show that sizable gains (20-40% reduction in error rate) can be achieved by either batch or incremental adaptation for large-vocabulary recognition of native speakers. The same techniques cut the error rate for recognition of non-native speakers by factors of 2 to 4, bringing their performance much closer to native-speaker performance. We also demonstrate good improvements in performance (25-30%) when instantaneous adaptation is used for recognition of non-native speakers.
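As a minimal sketch of the core MAP update underlying such frameworks (not the authors' full system), the adapted Gaussian mean interpolates between the speaker-independent prior mean and the sample mean of the adaptation frames; the prior weight tau is an assumed hyperparameter:

import numpy as np

def map_adapt_mean(mu_prior, frames, tau=10.0):
    """MAP estimate of a Gaussian mean: a count-weighted interpolation
    between the speaker-independent prior mean and the sample mean of the
    adaptation frames assigned to this Gaussian (tau is assumed)."""
    n = len(frames)
    if n == 0:
        return mu_prior            # nothing observed: fall back to the prior
    sample_mean = np.asarray(frames).mean(axis=0)
    return (tau * mu_prior + n * sample_mean) / (tau + n)

# With few frames the estimate stays near the prior; with many it follows the data.
adapted = map_adapt_mean(np.zeros(2), np.ones((5, 2)))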
Vassilios Digalakis, SRI International (USA)
Leonardo Neumeyer, SRI International (USA)
The performance and robustness of a speech recognition system can be improved by adapting the speech models to the speaker, the channel and the task. In continuous mixture-density hidden Markov models the number of component densities is typically very large, and it may not be feasible to acquire a large amount of adaptation data for robust maximum-likelihood estimates. To solve this problem, we propose a constrained estimation technique for Gaussian mixture densities, and combine it with Bayesian techniques to improve its asymptotic properties. We evaluate our algorithms on the large-vocabulary Wall Street Journal corpus for nonnative speakers of American English. The recognition error rate is comparable to the speaker-independent accuracy achieved for native speakers.
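A minimal sketch of the constrained-estimation idea, assuming the constraint takes the form of a single affine transform shared across many component densities (the estimation of A and b from adaptation data is omitted):

import numpy as np

def transform_gaussians(means, covs, A, b):
    """Apply one shared affine transform to every component density, so a
    small number of transform parameters adapts a very large mixture set."""
    new_means = [A @ mu + b for mu in means]
    new_covs = [A @ S @ A.T for S in covs]   # covariances transformed consistently
    return new_means, new_covs

# Example with two 2-D components and an (assumed pre-estimated) transform.
A, b = np.eye(2) * 1.1, np.array([0.2, -0.1])
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 2 * np.eye(2)]
new_means, new_covs = transform_gaussians(means, covs, A, b)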
S. M. Ahadi, Cambridge University (UK)
P. C. Woodland, Cambridge University (UK)
A key issue in speaker adaptation is extracting the maximum information from a limited amount of adaptation data. In particular, it is important that parameters of (context-dependent) HMMs that are not observed in the adaptation data can still be updated. In the Regression-based Model Prediction (RMP) approach, sets of speaker-independent linear relationships between different parameters in the HMM set are found from training data. During adaptation, distributions with sufficient adaptation data are used to update the parameters of poorly adapted models via these pre-computed regression relationships. The method uses Bayesian techniques to combine parameter estimates from different sources. Evaluation on the ARPA Resource Management corpus gave a worthwhile reduction in error rate with just a single adaptation sentence, and showed that RMP consistently outperforms MAP estimation with the same amount of adaptation data.
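A rough sketch of the RMP idea, with hypothetical regression coefficients and a simple count-weighted stand-in for the Bayesian combination step:

# Speaker-independent regression (slope a, intercept b) relating the mean of a
# poorly observed distribution to that of a frequently observed one; a and b are
# hypothetical values standing in for coefficients fitted from training data.
a, b = 0.9, 0.05

def predict_unseen_mean(mu_observed_adapted):
    """Regression-based prediction for a distribution lacking adaptation data."""
    return a * mu_observed_adapted + b

def combine_estimates(mu_pred, n_pred, mu_direct, n_direct):
    """Bayesian-style combination of estimates from different sources,
    weighted by (hypothetical) effective observation counts."""
    return (n_pred * mu_pred + n_direct * mu_direct) / (n_pred + n_direct)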
Masahiro Tonomura, ATR Interpreting Telecommunications Research Labs (JAPAN)
Tetsuo Kosaka, ATR Interpreting Telecommunications Research Labs (JAPAN)
Shoichi Matsunaga, ATR Interpreting Telecommunications Research Labs (JAPAN)
This paper proposes a novel speech adaptation algorithm that enables adaptation even with a small amount of speech data. It unifies two efficient conventional speaker adaptation techniques, maximum a posteriori (MAP) estimation and transfer vector field smoothing (VFS), and is designed to avoid the weaknesses of both. Higher phoneme recognition performance was obtained with this algorithm than with either method individually. The phoneme recognition error rate was reduced from 22.0% to 19.1% using this algorithm for a speaker-independent model with seven adaptation phrases. Furthermore, a priori knowledge concerning speaker characteristics was incorporated by generating an initial HMM from the speech of a speaker cluster selected on the basis of speaker similarity. Adaptation using this initial model reduced the phoneme recognition error rate from 22.0% to 17.7%.
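A sketch of the VFS half of such a combination, under the common formulation in which transfer vectors for units observed in the adaptation data are interpolated to all units by distance weighting (the neighbor count k and the weighting scheme are assumptions):

import numpy as np

def vfs_smooth(mu_si, mu_map, k=3):
    """Vector field smoothing sketch: transfer vectors (MAP-adapted mean minus
    SI mean) observed for some units are spread to all units by distance-
    weighted interpolation over the k nearest observed units."""
    observed = list(mu_map)
    delta = {j: mu_map[j] - mu_si[j] for j in observed}
    smoothed = {}
    for i, mu in mu_si.items():
        nearest = sorted(observed, key=lambda j: np.linalg.norm(mu - mu_si[j]))[:k]
        w = np.array([1.0 / (1e-6 + np.linalg.norm(mu - mu_si[j])) for j in nearest])
        w /= w.sum()
        smoothed[i] = mu + sum(wj * delta[j] for wj, j in zip(w, nearest))
    return smoothed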
Jerome R. Bellegarda, Apple Computer Inc. (USA)
Peter V. de Souza, IBM (USA)
David Nahamoo, IBM (USA)
Mukund Padmanabhan, IBM (USA)
Michael A. Picheny, IBM (USA)
Lalit R. Bahl, IBM (USA)
Speaker adaptation typically involves customizing some existing (reference) models in order to account for the characteristics of a new speaker. This work considers the slightly different paradigm of customizing some reference data for the purpose of populating the new speaker's space, and then using the resulting (augmented) data to derive the customized models. The data augmentation technique is based on the metamorphic algorithm first proposed in [1], assuming that a relatively modest amount of data (100 sentences) is available from each new speaker. This constraint requires that reference speakers be selected with some care. The performance of this method is illustrated on a portion of the Wall Street Journal task.
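The metamorphic algorithm of [1] is not reproduced here; as a loose illustration of the data-augmentation paradigm only, reference frames might be shifted toward the new speaker's space by a per-class mean offset before model re-estimation (the offset rule is an assumption):

import numpy as np

def augment_reference(ref_frames, new_frames):
    """Shift each class of reference frames by the per-class mean offset
    toward the new speaker, then pool them with the new speaker's own data
    before re-estimating the models. Inputs map class labels to (n, d) arrays."""
    augmented = {}
    for c, ref in ref_frames.items():
        new = new_frames.get(c)
        if new is None or len(new) == 0:
            augmented[c] = ref                 # class unseen in the 100 sentences
            continue
        offset = new.mean(axis=0) - ref.mean(axis=0)
        augmented[c] = np.vstack([new, ref + offset])
    return augmented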
Jun-ichi Takahashi, NTT Human Interface Laboratories (JAPAN)
Shigeki Sagayama, NTT Human Interface Laboratories (JAPAN)
This paper presents a fast and incremental speaker adaptation method called MAP/VFS, a basic technique for the on-line adaptation that will be important in building practical speech recognition systems. The concept is to combine maximum a posteriori (MAP) estimation, in other words Bayesian learning, as intra-class training with vector field smoothing (VFS) as inter-class smoothing. Speaker adaptation experiments show that the adaptation speed of incremental MAP can be significantly accelerated by the inter-class smoothing of VFS, and that VFS also consistently improves and stabilizes the recognition performance of MAP. From these results, it is found that fast, word-by-word speaker adaptation can be achieved by the simple processing of MAP/VFS without pooling adaptation training data.
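A minimal sketch of the incremental MAP component, kept as running sufficient statistics so that no adaptation data needs to be pooled (tau is an assumed prior weight; the VFS smoothing step would follow each update):

import numpy as np

class IncrementalMAPMean:
    """Word-by-word MAP mean estimate from running sufficient statistics."""
    def __init__(self, mu_prior, tau=10.0):
        self.mu_prior = np.asarray(mu_prior, dtype=float)
        self.tau = tau
        self.n = 0
        self.acc = np.zeros_like(self.mu_prior)

    def update(self, frames):
        """Fold in one utterance's frames and return the current MAP mean."""
        frames = np.asarray(frames)
        self.n += len(frames)
        self.acc += frames.sum(axis=0)
        return (self.tau * self.mu_prior + self.acc) / (self.tau + self.n)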
S.J. Cox, University of East Anglia (UK)
A technique for adapting speaker-independent speech recognition models to the voice of a new speaker is presented. The technique is capable of estimating adapted parameters for all the speech models when only a small subset of the recognition vocabulary is spoken by the new speaker. Whereas previous methods have often assumed a transformation between the speaker-independent models and the adapted models, this technique models the relationship between different speech units using linear regression. The regression models are built off-line using the training-set data. At recognition time, the speech models are adapted using the regression models and the new speaker's data, a procedure which is computationally cheap. Experimental results show a halving of the recognition error rate when only about 8% of the vocabulary is given as enrollment data, and a 78% reduction in the error rate when half the vocabulary is given.
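A sketch of how such inter-unit regressions might be fitted off-line across training speakers by ordinary least squares (the per-speaker mean inputs are an assumption about the data layout):

import numpy as np

def fit_unit_regression(x_means, y_means):
    """Fit, off-line, a linear map from one speech unit's mean to another's
    across training speakers (one row per speaker), so units unseen in the
    enrollment data can be predicted cheaply at recognition time.
    Returns (W, b) with y ~ x @ W + b."""
    X = np.hstack([x_means, np.ones((len(x_means), 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X, y_means, rcond=None)
    return coef[:-1], coef[-1]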
Ming-Whei Feng, GTE Laboratories Inc. (USA)
Speaker adaptation has received a considerable amount of attention in recent years. Most of the previous work focused on techniques that require a certain amount of speech to be collected from the target speaker. This paper presents two speaker adaptation methods, a feature normalization and an HMM parameter adaptation, developed to improve a speaker-independent HMM-based speech recognition system. The proposed adaptation algorithms are text-independent and do not require target speech collection. The feature normalization normalizes the target speech to reduce acoustic inter-speaker and environmental variability; the HMM parameter adaptation dynamically modifies the recognition system parameters to model the target speech. We carried out recognition experiments to assess the performance, using two different speaker-independent recognizers as the baseline systems: a continuous digit recognizer and a keyword recognition system. The results show that when both adaptation techniques are combined, the word error rate of the digit recognizer on the TI Connected Digit corpus is reduced by about 30% and the detection error of the keyword recognition system on the Road Rally corpora is reduced by about 40%.
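The paper's exact normalization is not specified here; cepstral mean normalization is a standard text-independent instance of the feature-normalization idea:

import numpy as np

def cepstral_mean_norm(cepstra):
    """Subtract the utterance-level cepstral mean from every frame, a standard
    text-independent way to reduce speaker and channel bias (a stand-in for,
    not necessarily, the paper's exact method)."""
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)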
Qiang Huo, University of Hong Kong (HONG KONG)
Chorkin Chan, University of Hong Kong (HONG KONG)
In this paper, on-line adaptation of semi-continuous (or tied-mixture) hidden Markov models (SCHMMs) is studied. A theoretical formulation of segmental quasi-Bayes learning of the mixture coefficients in SCHMMs for speech recognition is presented, and the practical issues related to the use of this algorithm for on-line speaker adaptation are addressed. A pragmatic on-line adaptation approach that combines long-term adaptation of the mixture coefficients with short-term adaptation of the mean vectors of the Gaussian mixture components is also proposed. The viability of these techniques is confirmed in a series of comparative experiments using a 26-word English alphabet vocabulary.
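A minimal sketch of one quasi-Bayes step for the mixture coefficients, assuming Dirichlet hyperparameters updated with each utterance's expected component occupancies:

import numpy as np

def quasi_bayes_update(alpha, expected_counts):
    """One on-line quasi-Bayes step for tied-mixture weights: fold the
    utterance's expected component occupancies into the Dirichlet
    hyperparameters, then read off posterior-mean mixture coefficients."""
    alpha = alpha + expected_counts   # updated prior for the next utterance
    return alpha, alpha / alpha.sum()

# Example: three mixture coefficients updated by one utterance's counts.
alpha, weights = quasi_bayes_update(np.array([5.0, 3.0, 2.0]),
                                    np.array([1.2, 0.5, 0.3]))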
Yunxin Zhao, University of Illinois at Urbana-Champaign (USA)
A self-learning adaptation technique is presented which handles speaker- and channel-induced spectral variations without enrollment speech. At the acoustic level, the distortion spectral bias is estimated in two steps using unsupervised maximum-likelihood estimation: in the first step, the probability distributions of the speech spectral features are assumed uniform, to cope with severely mismatched channels; in the second step, the spectral bias is reestimated assuming Gaussian distributions for the spectral features. At the phone-unit level, unsupervised sequential adaptation is performed via Bayesian estimation from the on-line, bias-removed speech data, and iterative adaptation is further performed for dictation applications. Over four 198-sentence test sets, on a continuous speech recognition task with an 853-word vocabulary and a grammar perplexity of 105, the largest gain in average word accuracy was from a baseline of -0.3% to 85.2%, and the highest average word accuracy achieved was 89.4%, up from a baseline of 56.5%.
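A sketch of the second estimation step under an assumed identity-covariance Gaussian model: the bias is the posterior-weighted average difference between frames and component means.

import numpy as np

def estimate_bias(frames, means, posteriors):
    """Second-step ML estimate of an additive spectral bias: the posterior-
    weighted average of frame-minus-mean differences, assuming Gaussian
    densities with identity covariance. posteriors[t, k] = P(k | frame t)."""
    diffs = frames[:, None, :] - means[None, :, :]              # (T, K, D)
    return (posteriors[:, :, None] * diffs).sum(axis=(0, 1)) / posteriors.sum()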