SPEAKER ADAPTATION

Chair: C.H. Lee, AT&T Bell Laboratories (USA)



Batch, Incremental and Instantaneous Adaptation Techniques for Speech Recognition

Authors:

G. Zavaliagkos, Northeastern University (USA)
R. Schwartz, BBN Systems and Technologies (USA)
J. Makhoul, BBN Systems and Technologies (USA)

Volume 1, Page 676

Abstract:

We present a framework for maximum a posteriori (MAP) adaptation of large-scale HMM speech recognizers. In this framework, we introduce mechanisms that exploit the correlations present among HMM parameters in order to maximize the number of parameters that can be adapted from a limited number of observations. We also separately explore the feasibility of instantaneous adaptation, which attempts to improve recognition on a single sentence, the same sentence that is used to estimate the adaptation. In summary, we show that sizable gains (a 20-40% reduction in error rate) can be achieved by either batch or incremental adaptation for large-vocabulary recognition of native speakers. The same techniques cut the error rate for non-native speakers by a factor of 2 to 4, bringing their performance much closer to that of native speakers. We also demonstrate that good improvements in performance (25-30%) are realized when instantaneous adaptation is used for recognition of non-native speakers.
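
For context, a minimal sketch of the standard MAP update for a single Gaussian mean, the general form on which frameworks of this kind build (the correlation-based parameter sharing described in the abstract is specific to the paper and not shown; notation ours): given a speaker-independent prior mean \mu_0 with prior weight \tau, and occupancy posteriors \gamma_t for adaptation frames x_t,

    \hat{\mu}_{MAP} = ( \tau \mu_0 + \sum_t \gamma_t x_t ) / ( \tau + \sum_t \gamma_t ),

so the estimate moves smoothly from the prior toward the sample mean as the amount of adaptation data grows.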


Acrobat PDF file of whole paper:

ic950676.pdf




Speaker Adaptation Using Combined Transformation and Bayesian Methods

Authors:

Vassilios Digalakis, SRI International (USA)
Leonardo Neumeyer, SRI International (USA)

Volume 1, Page 680

Abstract:

The performance and robustness of a speech recognition system can be improved by adapting the speech models to the speaker, the channel and the task. In continuous mixture-density hidden Markov models the number of component densities is typically very large, and it may not be feasible to acquire a large amount of adaptation data for robust maximum-likelihood estimates. To solve this problem, we propose a constrained estimation technique for Gaussian mixture densities, and combine it with Bayesian techniques to improve its asymptotic properties. We evaluate our algorithms on the large-vocabulary Wall Street Journal corpus for nonnative speakers of American English. The recognition error rate is comparable to the speaker-independent accuracy achieved for native speakers.
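
As a hedged illustration of combining a shared transformation with Bayesian re-estimation (a generic sketch, not necessarily the exact scheme of this paper; notation ours): all Gaussians g in a class share one affine transform of their means and covariances,

    \hat{\mu}_g = A \mu_g + b,    \hat{\Sigma}_g = A \Sigma_g A^{\top},

which can be estimated robustly from little data, and the transformed parameters can then serve as priors in a MAP step so that, as more adaptation data arrive, the estimates approach the speaker-dependent maximum-likelihood values.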


Acrobat PDF file of whole paper:

ic950680.pdf




Rapid Speaker Adaptation Using Model Prediction

Authors:

S. M. Ahadi, Cambridge University (UK)
P. C. Woodland, Cambridge University (UK)

Volume 1, Page 684

Abstract:

A key issue in speaker adaptation is extracting the maximum information from a limited amount of adaptation data. In particular, it is important that parameters of (context-dependent) HMMs that are not observed in the adaptation data can still be updated. In the Regression-based Model Prediction (RMP) approach, sets of speaker-independent linear relationships between different parameters in the HMM set are found from training data. During adaptation, distributions with sufficient adaptation data are used to update the parameters of poorly adapted models via these pre-computed regression relationships. The method uses Bayesian techniques to combine parameter estimates from different sources. Evaluation on the ARPA Resource Management corpus shows a worthwhile reduction in error rate with just a single adaptation sentence, and that RMP consistently outperforms MAP estimation given the same amount of adaptation data.
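
A hedged sketch of the prediction step (notation ours, not the paper's): if distribution i has enough adaptation data to give a reliable adapted mean \hat{\mu}_i, the mean of a poorly observed distribution j can be predicted through a pre-computed speaker-independent regression,

    \tilde{\mu}_j = a_{ij} \hat{\mu}_i + b_{ij},

and a Bayesian combination then weights this prediction against whatever direct estimate is available for j.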


Acrobat PDF file of whole paper:

ic950684.pdf




Speaker Adaptation Based on Transfer Vector Field Smoothing Using Maximum a posteriori Probability Estimation

Authors:

Masahiro Tonomura, ATR Interpreting Telecommunications Research Labs (JAPAN)
Tetsuo Kosaka, ATR Interpreting Telecommunications Research Labs (JAPAN)
Shoichi Matsunaga, ATR Interpreting Telecommunications Research Labs (JAPAN)

Volume 1, Page 688

Abstract:

This paper proposes a novel speaker adaptation algorithm that enables adaptation even with a small amount of speech data. The algorithm unifies two efficient conventional speaker adaptation techniques, maximum a posteriori (MAP) estimation and transfer vector field smoothing (VFS), and is designed to avoid the weaknesses of both. Higher phoneme recognition performance was obtained with this algorithm than with either method individually, showing the superiority of the proposed approach. The phoneme recognition error rate was reduced from 22.0% to 19.1% using this algorithm on a speaker-independent model with seven adaptation phrases. Furthermore, a priori knowledge of speaker characteristics was incorporated by generating an initial HMM from the speech of a speaker cluster selected on the basis of speaker similarity. Adaptation from this initial model reduced the phoneme recognition error rate from 22.0% to 17.7%.
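
For background, the VFS idea in brief (a sketch; the exact neighbourhoods and weights used in the paper may differ, and the notation is ours): each adapted distribution i defines a transfer vector \hat{\mu}_i - \mu_i, and a distribution k with little or no adaptation data is moved by a smoothed average of the transfer vectors of its acoustic neighbours N(k),

    \hat{\mu}_k = \mu_k + ( \sum_{i \in N(k)} w_{ki} ( \hat{\mu}_i - \mu_i ) ) / ( \sum_{i \in N(k)} w_{ki} ),

with similarity weights w_{ki}; MAP estimation then supplies the per-distribution training that smoothing alone does not provide.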


Acrobat PDF file of whole paper:

ic950688.pdf




Experiments Using Data Augmentation for Speaker Adaptation

Authors:

Jerome R. Bellegarda, Apple Computer Inc. (USA)
Peter V. de Souza, IBM (USA)
David Nahamoo, IBM (USA)
Mukund Padmanabhan, IBM (USA)
Michael A. Picheny, IBM (USA)
Lalit R. Bahl, IBM (USA)

Volume 1, Page 692

Abstract:

Speaker adaptation typically involves customizing some existing (reference) models to account for the characteristics of a new speaker. This work considers the slightly different paradigm of customizing some reference data for the purpose of populating the new speaker's space, and then using the resulting (augmented) data to derive the customized models. The data augmentation technique is based on the metamorphic algorithm first proposed in [1], assuming that a relatively modest amount of data (100 sentences) is available from each new speaker. This constraint requires that reference speakers be selected with some care. The performance of this method is illustrated on a portion of the Wall Street Journal task.


Acrobat PDF file of whole paper:

ic950692.pdf




Vector-Field-Smoothed Bayesian Learning for Incremental Speaker Adaptation

Authors:

Jun-ichi Takahashi, NTT Human Interface Laboratories (JAPAN)
Shigeki Sagayama, NTT Human Interface Laboratories (JAPAN)

Volume 1, Page 696

Abstract:

This paper presents a fast, incremental speaker adaptation method called MAP/VFS. It is a basic technique for on-line adaptation, which will be important in constructing practical speech recognition systems. The concept is based on combining maximum a posteriori (MAP) estimation, in other words Bayesian learning, as intra-class training with vector field smoothing (VFS) as inter-class smoothing. Speaker adaptation experiments show that the adaptation speed of incremental MAP can be significantly accelerated by VFS's inter-class smoothing, and that the recognition performance of MAP can also be improved and consistently stabilized by VFS. From these results, it is found that fast, word-by-word speaker adaptation can be achieved by the simple processing of MAP/VFS without pooling adaptation training data.
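
A minimal sketch of the incremental (on-line) MAP recursion that such word-by-word adaptation relies on (notation ours): after each new utterance n, the posterior becomes the prior for the next utterance,

    \mu^{(n)} = ( \tau^{(n-1)} \mu^{(n-1)} + \sum_t \gamma_t^{(n)} x_t^{(n)} ) / ( \tau^{(n-1)} + \sum_t \gamma_t^{(n)} ),
    \tau^{(n)} = \tau^{(n-1)} + \sum_t \gamma_t^{(n)},

so no adaptation data need be pooled; VFS-style inter-class smoothing can then be applied across distributions after each update.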


Acrobat PDF file of whole paper:

ic950696.pdf




A Speaker Adaptation Technique Using Linear Regression

Authors:

S.J. Cox, University of East Anglia (UK)

Volume 1, Page 700

Abstract:

A technique for adapting speaker-independent speech recognition models to the voice of a new speaker is presented. The technique is capable of estimating adapted parameters for all the speech models when only a small subset of the recognition vocabulary is spoken by the new speaker. Whereas previous methods have often assumed a transformation between the speaker-independent models and the adapted models, this technique models the relationship between different speech units using linear regression. The regression models are built off-line from the training-set data. At recognition time, the speech models are adapted using the regression models and the new speaker's data, a procedure that is computationally cheap. Experimental results show that the recognition error rate is halved when only about 8% of the vocabulary is given as enrollment data, and is reduced by 78% when half the vocabulary is given.


Acrobat PDF file of whole paper:

ic950700.pdf




Speaker Adaptation Based on Spectral Normalization and Dynamic HMM Parameter Adaptation

Authors:

Ming-Whei Feng, GTE Laboratories Inc. (USA)

Volume 1, Page 704

Abstract:

Speaker adaptation has received considerable attention in recent years. Most previous work focused on techniques that require a certain amount of speech to be collected from the target speaker. This paper presents two speaker adaptation methods, a feature normalization and an HMM parameter adaptation, developed to improve a speaker-independent HMM-based speech recognition system. The proposed adaptation algorithms are text-independent and do not require target speech collection. The feature normalization normalizes the target speech to reduce acoustic inter-speaker and environmental variability, while the HMM parameter adaptation dynamically modifies the recognition system parameters to model the target speech. We carried out recognition experiments to assess performance, using two different speaker-independent recognizers as baseline systems: a continuous digit recognizer and a keyword recognition system. The results show that when both adaptation techniques are combined, the word error rate of the digit recognizer on the TI Connected Digit corpus is reduced by about 30% and the detection error of the keyword recognition system on the Road Rally corpora is reduced by about 40%.


Acrobat PDF file of whole paper:

ic950704.pdf




On-line Bayes Adaptation of SCHMM Parameters for Speech Recognition

Authors:

Qiang Huo, University of Hong Kong (HONG KONG)
Chorkin Chan, University of Hong Kong (HONG KONG)

Volume 1, Page 708

Abstract:

In this paper, on-line adaptation of semi-continuous (or tied-mixture) hidden Markov models (SCHMMs) is studied. A theoretical formulation of segmental quasi-Bayes learning of the mixture coefficients in SCHMMs for speech recognition is presented, and the practical issues related to the use of this algorithm for on-line speaker adaptation are addressed. A pragmatic on-line adaptation approach that combines long-term adaptation of the mixture coefficients with short-term adaptation of the mean vectors of the Gaussian mixture components is also proposed. The viability of these techniques is confirmed in a series of comparative experiments using a 26-word English alphabet vocabulary.
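
For reference, the conjugate Dirichlet update that quasi-Bayes adaptation of mixture coefficients builds on (a sketch under the usual conjugate-prior assumptions, not the paper's full segmental formulation; notation ours): with a prior Dir(\nu_1, ..., \nu_K) over a state's K mixture weights, and expected component counts c_k collected from each incoming utterance,

    \nu_k \leftarrow \nu_k + c_k,    \hat{w}_k = ( \nu_k - 1 ) / ( \sum_j ( \nu_j - 1 ) )    (assuming \nu_k > 1),

so the hyperparameters accumulate evidence on-line and a MAP point estimate of the weights is available after every utterance.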


Acrobat PDF file of whole paper:

ic950708.pdf




Iterative Self-Learning Speaker and Channel Adaptation under Various Initial Conditions

Authors:

Yunxin Zhao, University of Illinois at Urbana-Champaign (USA)

Volume 1, Page 712

Abstract:

A self-learning adaptation technique is presented which handles speaker- and channel-induced spectral variations without enrollment speech. At the acoustic level, the distortion spectral bias is estimated in two steps using unsupervised maximum-likelihood estimation: in the first step, the probability distributions of the speech spectral features are assumed uniform to cope with severely mismatched channels; in the second step, the spectral bias is re-estimated assuming Gaussian distributions for the spectral features. At the phone-unit level, unsupervised sequential adaptation is performed via Bayesian estimation from the on-line, bias-removed speech data, and iterative adaptation is further performed for dictation applications. Over four 198-sentence test sets, on a continuous speech recognition task with a vocabulary size of 853 and a grammar perplexity of 105, the largest improvement in average word accuracy is from a baseline of -0.3% to 85.2%, and the highest average word accuracy achieved is 89.4%, up from a baseline of 56.5%.
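
A minimal sketch of the second-step bias estimate under the simplifying assumption of equal covariances (the paper's two-step uniform/Gaussian procedure and its iteration refine this; notation ours): with Gaussian means \mu_i and occupancy posteriors \gamma_t(i) computed from the current models over T frames,

    \hat{b} = (1/T) \sum_{t=1}^{T} ( x_t - \sum_i \gamma_t(i) \mu_i ),

and recognition then proceeds on the bias-removed features x_t - \hat{b}, with phone-level Bayesian adaptation applied on top.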


Acrobat PDF file of whole paper:

ic950712.pdf
