Session: SPEECH-P3
Time: 3:30 - 5:30, Tuesday, May 8, 2001
Location: Exhibit Hall Area 6
Title: Acoustic Adaptation and Normalization 1
Chair: Mark Gales

3:30, SPEECH-P3.1
DURATION NORMALIZATION FOR IMPROVED RECOGNITION OF SPONTANEOUS AND READ SPEECH VIA MISSING FEATURE METHODS
J. NEDEL, R. STERN
Hidden Markov Models (HMMs) are known to model the duration of sound units poorly. In this paper we present a technique to normalize the duration of each phone to overcome this weakness, with the conjecture that speech with normalized phone durations may be better modeled and discriminated using standard HMM acoustic models. Duration normalization is accomplished by dropping frames if a phone is longer than the desired duration, and by adding "missing" frames and reconstructing them if a phone is shorter than the desired duration. If phone segmentations are known a priori, we achieve a 15.8% relative reduction in WER on spontaneous speech and a 10.3% relative reduction in WER on read speech. Preliminary work with automatic phone segmentations derived from the data is also presented.
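As a rough illustration of the idea (not the authors' missing-feature reconstruction, which is more elaborate), phone-level duration normalization can be sketched as uniform frame resampling: frames are dropped when the phone is too long, and "missing" frames are filled in by linear interpolation when it is too short. All names are illustrative:

```python
import numpy as np

def normalize_phone_duration(frames, target_len):
    """Resample one phone's frame sequence to a fixed target length.

    Frames are dropped (uniform subsampling) when the phone is longer
    than target_len, and missing frames are reconstructed by linear
    interpolation of neighbouring frames when it is shorter.
    """
    frames = np.asarray(frames, dtype=float)
    n, dim = frames.shape
    if n == target_len:
        return frames
    # Fractional source positions for each target frame.
    pos = np.linspace(0.0, n - 1, target_len)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = (pos - lo)[:, None]
    return (1.0 - frac) * frames[lo] + frac * frames[hi]
```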

3:30, SPEECH-P3.2
EIGENSPACE-BASED MAXIMUM A POSTERIORI LINEAR REGRESSION FOR RAPID SPEAKER ADAPTATION
K. CHEN, H. WANG
In this paper, we present an eigenspace-based approach toward prior density selection for the MAPLR framework. The proposed approach was developed by analyzing a priori knowledge of the training speakers via probabilistic principal component analysis (PPCA), so as to construct an eigenspace for speaker-specific full regression matrices and to derive a set of bases called eigen-matrices. The priors of the MAPLR transformations for each outside speaker are then chosen in the space spanned by the first K eigen-matrices. By incorporating the PPCA model into the MAPLR scheme, the number of free parameters in choosing the priors can be effectively reduced, while the underlying structure of the acoustic space and the precise modeling of the inter-dimensional correlation among the model parameters are well preserved. Adaptation experiments showed that the proposed approach significantly outperformed the conventional MLLR approach using either diagonal or full regression matrices.
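A minimal sketch of the eigenspace construction, assuming a set of per-speaker full regression matrices is available; plain PCA via SVD stands in for the PPCA machinery of the paper, and all names are illustrative:

```python
import numpy as np

def eigen_matrices(speaker_mats, K):
    """PCA over vectorized per-speaker regression matrices.

    Returns the mean matrix and the first K 'eigen-matrices'
    (principal directions reshaped back to matrix form).
    """
    X = np.stack([m.ravel() for m in speaker_mats])   # (S, d*d)
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    shape = speaker_mats[0].shape
    return mean.reshape(shape), [Vt[k].reshape(shape) for k in range(K)]

def project_prior(W, mean_mat, bases):
    """Prior for a new speaker: projection of W onto the K-dim eigenspace."""
    d = (W - mean_mat).ravel()
    coeffs = [float(d @ b.ravel()) for b in bases]
    return mean_mat + sum(c * b for c, b in zip(coeffs, bases)), coeffs
```

Restricting the prior to the span of the first K eigen-matrices is what cuts the number of free parameters from d*d down to K.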

3:30, SPEECH-P3.3
EMAP-BASED SPEAKER ADAPTATION WITH ROBUST CORRELATION ESTIMATION
E. JON, D. KIM, N. KIM
In this paper, we propose a method to enhance the performance of extended maximum a posteriori (EMAP) estimation using probabilistic principal component analysis (PPCA). PPCA is used to robustly estimate the correlation matrix among separate hidden Markov model (HMM) parameters, which is then applied in the EMAP scheme for speaker adaptation. PPCA is computationally efficient and performs better than the method previously used for EMAP. Through various experiments on continuous digit recognition, it is shown that the PPCA-based EMAP approach gives enhanced performance, especially when only a small amount of adaptation data is available.
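The PPCA covariance smoothing behind this approach has a well-known closed form (Tipping and Bishop): keep the top-q eigen-directions exactly and replace the discarded eigenvalues by their average, yielding a full-rank estimate even from scarce data. A small sketch, with illustrative names:

```python
import numpy as np

def ppca_covariance(X, q):
    """PPCA estimate of a covariance matrix (closed form).

    Keeps the top-q eigen-directions of the sample covariance and
    replaces the discarded eigenvalues by their average sigma^2,
    giving the smoothed full-rank estimate C = W W^T + sigma^2 I.
    """
    S = np.cov(X, rowvar=False)               # sample covariance (d x d)
    vals, vecs = np.linalg.eigh(S)
    vals, vecs = vals[::-1], vecs[:, ::-1]    # sort descending
    d = S.shape[0]
    sigma2 = vals[q:].mean()
    W = vecs[:, :q] * np.sqrt(np.maximum(vals[:q] - sigma2, 0.0))
    return W @ W.T + sigma2 * np.eye(d)
```

Note that the estimate preserves the total variance (trace) of the sample covariance while guaranteeing positive definiteness.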

3:30, SPEECH-P3.4
LINEAR FEATURE SPACE PROJECTIONS FOR SPEAKER ADAPTATION
G. SAON, G. ZWEIG, M. PADMANABHAN
We extend the well-known technique of constrained Maximum Likelihood Linear Regression (MLLR) to compute a projection (instead of a full rank transformation) on the feature vectors of the adaptation data. We model the projected features with phone-dependent Gaussian distributions and also model the complement of the projected space with a single class-independent, speaker-specific Gaussian distribution. Subsequently, we compute the projection and its complement using maximum likelihood techniques. The resulting ML transformation is shown to be equivalent to performing a speaker-dependent heteroscedastic discriminant (or HDA) projection. Our method is in contrast to traditional approaches which use a single speaker-independent projection, and do speaker adaptation in the resulting subspace. Experimental results on Switchboard show a 3% relative improvement in the word error rate over constrained MLLR in the projected subspace only.
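A toy sketch of the scoring model only (the ML estimation of the projection itself is the hard part and is omitted): the first p rows of the transform feed phone-dependent Gaussians, while the complement rows are scored by a single shared, speaker-specific Gaussian. Names and the diagonal-covariance simplification are assumptions:

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log-density of a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def score(x, A, class_means, class_vars, comp_mean, comp_var):
    """Per-class scores for feature vector x under a rank-p projection.

    The first p rows of A project into the class-discriminating
    subspace; the remaining d-p rows span its complement, modelled by
    one class-independent Gaussian shared across all classes.
    """
    p = class_means.shape[1]
    z = A @ x
    shared = log_gauss(z[p:], comp_mean, comp_var)
    return np.array([log_gauss(z[:p], m, v)
                     for m, v in zip(class_means, class_vars)]) + shared
```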

3:30, SPEECH-P3.5
ONLINE SPEAKER ADAPTATION BASED ON QUASI-BAYES LINEAR REGRESSION
J. CHIEN, C. HUANG
This paper presents an online linear regression adaptation of hidden Markov models (HMMs). Our aim is to sequentially improve the speech recognizer in nonstationary environments via linear regression adaptation. A quasi-Bayes linear regression (QBLR) algorithm is developed to perform online adaptation, where the regression matrix is estimated using QB theory. In the estimation, we specify the prior density of the regression matrix as a matrix variate normal distribution and derive a posterior density belonging to the same distribution family, so that the optimal regression matrix can be easily calculated. The reproducible prior/posterior density pair also provides a meaningful mechanism for sequential learning of the prior statistics. At each sequential epoch, only the updated prior statistics and the current observed data are required for adaptation. As special cases, QBLR reduces to maximum likelihood linear regression (MLLR) and maximum a posteriori linear regression (MAPLR). Experiments show that QBLR is effective for speaker adaptation in car environments.
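The sequential-learning mechanism can be illustrated on a single row of the regression matrix under a Gaussian prior (a scalar-noise simplification of the matrix variate normal treatment in the paper). Because the posterior stays in the prior's family, only the updated statistics need be carried between epochs, and two sequential updates match one batch update:

```python
import numpy as np

class QBRowUpdater:
    """Sequential Bayes update for one row w of a regression matrix.

    Prior N(m, P^{-1}); after each epoch's data (X, y) the posterior is
    again Gaussian, so only (m, P) are stored between epochs.
    Observation model: y = X w + noise, with noise variance sigma2.
    """
    def __init__(self, m0, P0, sigma2=1.0):
        self.m = np.array(m0, dtype=float)
        self.P = np.array(P0, dtype=float)
        self.sigma2 = sigma2

    def update(self, X, y):
        # Posterior precision accumulates; posterior mean blends the
        # prior statistics with the current epoch's data.
        P_new = self.P + X.T @ X / self.sigma2
        self.m = np.linalg.solve(P_new, self.P @ self.m + X.T @ y / self.sigma2)
        self.P = P_new
        return self.m
```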

3:30, SPEECH-P3.6
RAPID ADAPTATION USING PENALIZED-LIKELIHOOD METHODS
H. ERDOGAN, Y. GAO, M. PICHENY
In this paper, we introduce new rapid adaptation techniques that extend and improve two successful methods introduced previously: cluster weighting (CW) and MAPLR. First, we introduce a new adaptation scheme called CWB, which extends cluster weighting by including a bias term and a reference speaker model. CWB is shown to improve adaptation performance compared to CW. Second, we introduce an extension of cluster weighting that uses penalized-likelihood objective functions to stabilize the estimation and provide soft constraints. Third, we propose a variant of MAPLR adaptation that uses prior speaker information. Previously, prior distributions of transforms in MAPLR were obtained from the same adaptation data, from speaker-independent HMM means, or by heuristics. We propose instead to derive the priors from prior information about speaker variability, using CW or CWB weights. Penalized-likelihood (Bayesian) theory serves as a tool to combine transformation-based adaptation with prior speaker information, resulting in effective rapid adaptation techniques. The techniques are shown to outperform full, block-diagonal, and diagonal MLLR, as well as several other recently proposed methods for rapid adaptation.
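A hypothetical sketch of CWB-style estimation: each adapted Gaussian mean is a weighted combination of cluster means plus a shared bias, and a ridge penalty pulling the weights toward reference-speaker weights stands in for the paper's penalized-likelihood objective. All names and the least-squares simplification are assumptions:

```python
import numpy as np

def cwb_weights(M, targets, w_ref, lam=0.1):
    """Cluster weighting with bias (CWB), penalized least-squares flavour.

    M[g] holds the C cluster means for Gaussian g (shape C x d); the
    adapted mean is sum_c w_c * M[g][c] + b.  Weights w and a shared
    bias b are estimated by ridge-penalized least squares, with the
    penalty pulling w toward the reference-speaker weights w_ref.
    """
    G, C, d = M.shape
    rows, ys = [], []
    for g in range(G):
        for j in range(d):
            r = np.zeros(C + d)
            r[:C] = M[g, :, j]      # cluster-weight part
            r[C + j] = 1.0          # bias term for dimension j
            rows.append(r)
            ys.append(targets[g, j])
    A, y = np.array(rows), np.array(ys)
    prior = np.concatenate([w_ref, np.zeros(d)])
    theta = np.linalg.solve(A.T @ A + lam * np.eye(C + d),
                            A.T @ y + lam * prior)
    return theta[:C], theta[C:]
```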

3:30, SPEECH-P3.7
UNSUPERVISED SPEAKER ADAPTATION BASED ON THE SUFFICIENT HMM STATISTICS OF SELECTED SPEAKERS
S. YOSHIZAWA, A. BABA, K. MATSUNAMI, Y. MERA, M. YAMADA, K. SHIKANO
This paper describes an efficient method of unsupervised speaker adaptation. The method is based on (1) selecting a subset of speakers who are acoustically close to the test speaker, and (2) calculating adapted model parameters from the previously stored sufficient HMM statistics of the selected speakers' data. Only a small amount of unsupervised data from the test speaker is required, and because the sufficient HMM statistics are precomputed, adaptation is fast. Compared with a pre-clustering method, the proposed method obtains a better speaker cluster because the clustering result is determined online from the test speaker's data. Experimental results show that the proposed method achieves a larger improvement over the speaker-independent model than MLLR does. Moreover, the proposed method uses only one unsupervised sentence utterance, while MLLR typically requires more than ten supervised sentence utterances.
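The two steps can be sketched as follows, with a toy diagonal-covariance GMM as the speaker-closeness score and per-Gaussian occupancy counts and first-order sums as the stored sufficient statistics; all names and simplifications are assumptions:

```python
import numpy as np

def gmm_loglik(X, weights, means, var):
    """Mean log-likelihood of frames X under a diagonal-covariance GMM."""
    lls = [np.log(w) - 0.5 * np.sum(
               np.log(2 * np.pi * var) + (X - m) ** 2 / var, axis=1)
           for w, m in zip(weights, means)]
    return float(np.mean(np.logaddexp.reduce(lls, axis=0)))

def adapt_from_selected(test_frames, speaker_gmms, stats, N=2):
    """(1) Select the N speakers whose GMMs best fit the test data,
    (2) build adapted means from their pre-stored sufficient statistics.

    stats[s] = (gamma, gamma_x): per-Gaussian occupancy counts and
    first-order sums accumulated offline for speaker s.
    """
    scores = [gmm_loglik(test_frames, *g) for g in speaker_gmms]
    selected = np.argsort(scores)[-N:]
    gamma = sum(stats[s][0] for s in selected)
    gamma_x = sum(stats[s][1] for s in selected)
    return gamma_x / gamma[:, None]            # adapted means
```

Because only the pooled statistics are recombined, no EM iterations over the selected speakers' raw data are needed at adaptation time.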

3:30, SPEECH-P3.8
RAPID SPEAKER ADAPTATION USING A PRIORI KNOWLEDGE BY EIGENSPACE ANALYSIS OF MLLR PARAMETERS
N. WANG, S. LEE, F. SEIDE, L. LEE
This paper considers the problem of rapid speaker adaptation in speech recognition. In particular, we exploit an approach based on a combination of transformations, which utilizes the concepts of both maximum likelihood linear regression (MLLR) and eigenvoice adaptation. We analyze three different ways to realize this concept and formulate a fast algorithm for maximum likelihood estimation of the coefficients for test speakers. The best approach properly utilizes the a priori knowledge in the speaker-independent models when constructing the eigenspace for speaker characteristics, while using MLLR matrices to represent specific speakers, thereby reducing the online memory and computation requirements of the adaptation phase. This approach leads to models identical to those of eigenvoice adaptation based on MLLR-adapted speaker models. The experimental results and discussion also provide a useful analysis of the integration of the MLLR and eigenvoice approaches.

3:30, SPEECH-P3.9
TOWARDS NON-STATIONARY MODEL-BASED NOISE ADAPTATION FOR LARGE VOCABULARY SPEECH RECOGNITION
T. KRISTJANSSON, L. DENG, A. ACERO, B. FREY
Recognition rates of speech recognition systems are known to degrade substantially when there is a mismatch between training and deployment environments. One approach to tackling this problem is to transform the acoustic models based on the channel distortion and noise characteristics of the new environment. Currently, most model adaptation strategies assume that the noise characteristics are stationary. We present results for using multiple noise distributions for the Whisper large vocabulary speech recognition system. The Vector Taylor Series method for adaptation of the distributions is used, and either a weighted average of the noise states or the locally best noise states is used. Our results indicate that for certain types of noise, significant gains in recognition accuracy can be achieved.
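For log-spectral features, the zeroth-order VTS relation between clean and corrupted means has a simple closed form; a sketch of it, plus the weighted average over multiple noise states corresponding to one of the strategies described above (channel and variance terms omitted, names illustrative):

```python
import numpy as np

def vts_adapt_mean(mu_x, mu_n, mu_h=0.0):
    """Zeroth-order VTS adaptation of a clean log-spectral mean.

    Corrupted-speech mean for additive noise mu_n and channel mu_h:
        mu_y = mu_x + mu_h + log(1 + exp(mu_n - mu_x - mu_h))
    """
    return mu_x + mu_h + np.log1p(np.exp(mu_n - mu_x - mu_h))

def adapt_with_noise_states(mu_x, noise_means, weights):
    """Non-stationary extension: weighted average over noise states."""
    ys = np.stack([vts_adapt_mean(mu_x, mn) for mn in noise_means])
    return np.tensordot(weights, ys, axes=1)
```

The formula captures the expected masking behaviour: when speech dominates, the adapted mean stays near the clean mean; when noise dominates, it approaches the noise mean.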

3:30, SPEECH-P3.10
HYPOTHESIS-DRIVEN ADAPTATION (HYDRA): A FLEXIBLE EIGENVOICE ARCHITECTURE
S. PETERS
In this article, a new architecture for speech recognition is introduced. As with many existing speech systems, this new approach involves multi-pass processing. In the present case, however, second-pass models are constructed on-line for each active hypothesis. Models for each hypothesized segment of the current utterance are constructed from linear combinations of "data cluster models" that have been trained on low-variability clusters of the training corpus. The data cluster weights are determined using an "eigenvoice" mechanism that is operative on low-complexity, low-definition models. Once determined, the same weights are used to construct high-complexity, high-definition second-pass models generated over the same data clusters. Results from a simple recognition task are reported to demonstrate the interesting properties of the new architecture. The limitations, trade-offs and some possible extensions of the proposed approach are discussed.
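A toy sketch of the weight-sharing idea: weights estimated against low-definition cluster models are reused unchanged to combine the high-definition second-pass models over the same data clusters. Least squares stands in for the eigenvoice estimation, and all names are illustrative:

```python
import numpy as np

def estimate_cluster_weights(low_def_means, target):
    """Weights from low-definition cluster models: the least-squares
    combination of cluster mean supervectors matching the target
    statistics of the hypothesised segment."""
    A = np.stack(low_def_means, axis=1)        # (d_low, C)
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return w

def build_second_pass_model(weights, high_def_means):
    """Reuse the same weights over high-definition cluster models."""
    return np.tensordot(weights, np.stack(high_def_means), axes=1)
```

The cheap low-definition pass determines the combination; only the final model construction touches the expensive high-definition parameters.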