
Abstract: Session SP-24


SP-24.1  

Fast Speaker Adaptation Using a priori Knowledge
Roland Kuhn (Speech Technology Laboratory, Panasonic Technologies Inc.), Patrick Nguyen (Institut Eurecom), Jean-Claude Junqua, Robert C Boman, Nancy A Niedzielski, Steven C Fincke, Kenneth L Field, Matteo Contolini (Speech Technology Laboratory, Panasonic Technologies Inc.)

Recently, we presented a radically new class of fast adaptation techniques for speech recognition, based on prior knowledge of speaker variation. To obtain this prior knowledge, one applies a dimensionality reduction technique to T vectors of dimension D derived from T speaker-dependent (SD) models. This offline step yields T basis vectors, the eigenvoices. We constrain the model for a new speaker S to lie in the space spanned by the first K eigenvoices. Speaker adaptation then involves estimating K eigenvoice coefficients for the new speaker; typically, K is very small compared to the original dimension D. Here, we review how to find the eigenvoices, give a maximum-likelihood estimator for the new speaker's eigenvoice coefficients, and summarize mean adaptation experiments carried out on the Isolet database. We present new results assessing how changes in the training of the SD models affect performance. Finally, we interpret the first few eigenvoices obtained.
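
A minimal sketch of the eigenvoice idea in Python, assuming PCA (via SVD) as the dimensionality reduction and a simple least-squares fit of the K coefficients; the paper's maximum-likelihood estimator works from HMM state statistics rather than an observed partial supervector, so treat this as illustrative only.

    import numpy as np

    def train_eigenvoices(sd_supervectors, K):
        # sd_supervectors: (T, D), one concatenated mean vector per SD model.
        mean = sd_supervectors.mean(axis=0)
        # SVD of the centered data gives the principal directions
        # (the "eigenvoices") without forming the D x D covariance.
        _, _, Vt = np.linalg.svd(sd_supervectors - mean, full_matrices=False)
        return mean, Vt[:K]                               # (D,), (K, D)

    def adapt_speaker(mean, eigenvoices, obs_mean, observed_idx):
        # Fit K coefficients from the supervector dimensions actually seen
        # in the adaptation data, then reconstruct the full supervector.
        E = eigenvoices[:, observed_idx]                  # (K, M)
        y = obs_mean - mean[observed_idx]                 # (M,)
        w, *_ = np.linalg.lstsq(E.T, y, rcond=None)       # K coefficients
        return mean + w @ eigenvoices

    # Toy usage: T = 20 speakers, D = 100, K = 3, 30 observed dimensions.
    rng = np.random.default_rng(0)
    sv = rng.normal(size=(20, 100))
    mu, ev = train_eigenvoices(sv, K=3)
    adapted = adapt_speaker(mu, ev, sv[0, :30], np.arange(30))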


SP-24.2  

Speaker Adaptation Using Maximum Likelihood Model Interpolation
Zuoying Wang, Feng Liu (Electronic Engineering Department, Tsinghua Univ., Beijing, China)

A speaker adaptation scheme named maximum likelihood model interpolation (MLMI) is proposed. The basic idea of MLMI is to compute the speaker-adapted (SA) model of a test speaker as a linear convex combination of a set of speaker-dependent (SD) models. Given a set of training speakers, we first calculate the SD models for each training speaker as well as the speaker-independent (SI) models. The mean vector of the SA model is then computed as a weighted sum of the SD mean vectors, while the covariance matrix is the same as that of the SI model. An algorithm is given that estimates the weight parameters so as to maximize the likelihood of the SA model given the adaptation data. Experiments show that 3 adaptation sentences can give a significant performance improvement. As the number of SD models increases, further improvement can be obtained.
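
A minimal sketch of the MLMI weight estimation for a single Gaussian state, assuming a diagonal SI covariance and using a generic constrained optimizer; the paper's own estimation algorithm over full HMMs is not reproduced here.

    import numpy as np
    from scipy.optimize import minimize

    def mlmi_weights(sd_means, si_var, adapt_frames):
        # sd_means: (T, D) SD mean vectors; si_var: (D,) SI variances;
        # adapt_frames: (N, D) adaptation data.  Returns simplex weights (T,).
        T = sd_means.shape[0]
        # With a fixed covariance, the Gaussian log-likelihood of the frames
        # depends on the mean only through the frame average.
        x_bar = adapt_frames.mean(axis=0)

        def neg_loglik(lam):
            mu = lam @ sd_means            # SA mean = convex combination
            return np.sum((x_bar - mu) ** 2 / si_var)

        cons = ({'type': 'eq', 'fun': lambda l: l.sum() - 1.0},)
        res = minimize(neg_loglik, np.full(T, 1.0 / T),
                       bounds=[(0.0, 1.0)] * T, constraints=cons,
                       method='SLSQP')
        return res.x

    # Toy usage: 5 SD models, 10-dim features, 30 adaptation frames.
    rng = np.random.default_rng(1)
    w = mlmi_weights(rng.normal(size=(5, 10)), np.ones(10),
                     rng.normal(size=(30, 10)))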


SP-24.3  

Speaker Adaptation with All-Pass Transforms
John W McDonough, William J Byrne (Center for Language and Speech Processing, The Johns Hopkins University)

In recent work, a class of transforms was proposed which achieves a remapping of the frequency axis much like conventional vocal tract length normalization. These mappings, known collectively as all-pass transforms (APT), were shown to produce substantial improvements in the performance of a large vocabulary speech recognition system when used to normalize incoming speech prior to recognition. In that application, the most advantageous characteristic of the APT was its cepstral-domain linearity; this linearity makes speaker normalization simple to implement and provides for robust estimation of the parameters characterizing individual speakers. In the current work, we exploit the APT to develop a speaker adaptation scheme in which the cepstral means of a speech recognition model are transformed to better match the speech of a given speaker. In a set of speech recognition experiments conducted on the Switchboard Corpus, we report reductions in word error rate of 3.7% absolute.
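
A sketch of the cepstral-domain linearity the abstract highlights, for the simplest (first-order, bilinear) all-pass warping: the warped cepstrum is a linear function of the original cepstrum, so adapting HMM cepstral means is one matrix multiply. The quadrature construction below is an illustration and is not the paper's estimation procedure.

    import numpy as np

    def bilinear_warp(omega, alpha):
        # Frequency mapping of the first-order all-pass filter
        # (z^-1 - alpha) / (1 - alpha * z^-1), for |alpha| < 1.
        return omega + 2.0 * np.arctan(
            alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

    def apt_matrix(alpha, n_ceps, n_grid=4096):
        # Build A with c_warped[n] = sum_m A[n, m] * c[m] for even real
        # cepstra, using the fact that the inverse of the warp with
        # parameter alpha is the warp with parameter -alpha.
        w = np.linspace(0.0, np.pi, n_grid)
        w_src = bilinear_warp(w, -alpha)
        A = np.empty((n_ceps, n_ceps))
        for m in range(n_ceps):
            basis = np.ones_like(w) if m == 0 else 2.0 * np.cos(m * w_src)
            for n in range(n_ceps):
                # (1/pi) * integral over [0, pi], via a uniform-grid average.
                A[n, m] = np.mean(basis * np.cos(n * w))
        return A

    # Adapting a cepstral mean vector is then a single linear map.
    A = apt_matrix(alpha=0.1, n_ceps=13)
    mu_adapted = A @ np.random.default_rng(2).normal(size=13)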


SP-24.4  

Improved Methods for Vocal Tract Normalization
Lutz Welling, Stephan Kanthak, Hermann Ney (RWTH Aachen - University of Technology)

This paper presents improved methods for vocal tract normalization (VTN), along with experimental tests on three databases. We propose a new method for VTN in training: using acoustic models with a single Gaussian density per state to select the normalization scales prevents the models from learning the normalization scales of the training speakers. We show that using single Gaussian densities to select the normalization scales in training results in lower error rates than using mixture densities. For VTN in recognition, we propose an improvement of the well-known multiple-pass strategy: using an unnormalized acoustic model for the first recognition pass, instead of a normalized model, yields lower error rates. In recognition tests, this method is compared with a fast variant of VTN. The multiple-pass strategy is efficient but suboptimal, because the normalization scale and the word sequence are determined sequentially. We found that for telephone digit string recognition this suboptimality reduces the VTN gain in recognition performance by 30% relative. On the German spontaneous scheduling task Verbmobil, the WSJ task, and the German telephone digit string corpus SieTill, the proposed methods for VTN reduce the error rates significantly.
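
A minimal sketch of maximum-likelihood selection of the normalization scale under a single-Gaussian model, as the abstract advocates for training. The warp grid is a common VTN convention rather than this paper's exact setup, and warp_features is a hypothetical stand-in: a real system recomputes the features with a warped filterbank for each candidate scale.

    import numpy as np

    def warp_features(feats, alpha):
        # Toy stand-in for recomputing features with a warped filterbank.
        return feats * alpha

    def loglik_single_gaussian(feats, mu, var):
        return -0.5 * np.sum((feats - mu) ** 2 / var
                             + np.log(2.0 * np.pi * var))

    # Grid-search the normalization scale by maximum likelihood under a
    # single-Gaussian model (here collapsed to one Gaussian for brevity).
    rng = np.random.default_rng(3)
    mu, var = np.zeros(13), np.ones(13)
    feats = rng.normal(loc=0.05, size=(200, 13))
    warps = np.arange(0.88, 1.13, 0.02)
    best_alpha = max(warps, key=lambda a: loglik_single_gaussian(
        warp_features(feats, a), mu, var))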


SP-24.5  

Rapid Speech Recognizer Adaptation to New Speakers
Vassilis Digalakis (Technical U. of Crete), Sid Berkowitz (Department of Defense), Enrico Bocchieri (AT&T), Costas Boulis (Technical U. of Crete), William Byrne (Johns Hopkins U.), Heather Collier (West Virginia U.), Adrian Corduneanu (U. of Toronto), Ashvin Kannan (Nuance Communications), Sanjeev Khudanpur (Johns Hopkins U.), Ananth Sankar (SRI)

This paper summarizes the work of the "Rapid Speech Recognizer Adaptation" team in the workshop held at Johns Hopkins University in the summer of 1998. The project addressed the modeling of dependencies between units of speech, with the goal of making more effective use of small amounts of data for speaker adaptation. A variety of methods were investigated, and their effectiveness on a rapid adaptation task defined on the SWITCHBOARD conversational speech corpus is reported.


SP-24.6  

Tree-Structured Models of Parameter Dependence for Rapid Adaptation in Large Vocabulary Conversational Speech Recognition
Ashvin Kannan (Nuance Communications), Sanjeev P Khudanpur (Johns Hopkins University)

Two models of statistical dependence between acoustic model parameters of a large vocabulary conversational speech recognition (LVCSR) system are investigated for the purpose of rapid speaker- and environment-adaptation from a very small amount of speech: (i) a Gaussian multiscale process governed by a stochastic linear dynamical system on a tree, and (ii) a simple hierarchical tree-structured prior. Both methods permit Bayesian (MAP) estimation of acoustic model parameters without parameter-tying even when no samples are available to independently estimate some parameters due to the limited amount of adaptation data. Modeling methodologies are contrasted, and comparative performance of the two on the Switchboard task is presented under identical test conditions for supervised and unsupervised adaptation with controlled amounts of adaptation speech. Both methods provide significant (1% absolute) gain in accuracy over adaptation methods that do not exploit the dependence between acoustic model parameters.
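
A minimal sketch of MAP adaptation under a simple hierarchical tree-structured prior, the second of the two models above: each node's mean is shrunk toward its parent's adapted mean, so parameters with no adaptation data still move with their ancestors. The Node structure and relevance factor tau are hypothetical, and the Gaussian multiscale model (i) is not implemented here.

    from dataclasses import dataclass, field
    import numpy as np

    @dataclass
    class Node:
        si_mean: np.ndarray          # speaker-independent mean
        count: float = 0.0           # adaptation frames pooled over the subtree
        obs_sum: np.ndarray = None   # pooled sum of adaptation frames
        children: list = field(default_factory=list)

    def map_adapt(node, parent_mean=None, tau=10.0):
        # Top-down pass: the conjugate prior at each node is centered on the
        # parent's adapted mean, giving MAP estimates without parameter tying.
        prior = node.si_mean if parent_mean is None else parent_mean
        if node.count > 0:
            node.adapted = (node.obs_sum + tau * prior) / (node.count + tau)
        else:
            node.adapted = prior     # no data at all: fall back on the prior
        for child in node.children:
            map_adapt(child, node.adapted, tau)

    # Toy usage: one observed leaf pulls the root estimate, which in turn
    # moves the unobserved leaf.
    root = Node(si_mean=np.zeros(3), count=5.0, obs_sum=np.full(3, 2.0))
    leaf_obs = Node(si_mean=np.zeros(3), count=5.0, obs_sum=np.full(3, 2.0))
    leaf_unobs = Node(si_mean=np.zeros(3))
    root.children = [leaf_obs, leaf_unobs]
    map_adapt(root)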


SP-24.7  

Correlation Modeling Of MLLR Transform Biases For Rapid HMM Adaptation To New Speakers
Enrico L Bocchieri (AT&T Research), Vassilis Digalakis (Technical University Of Crete), Adrian Corduneanu (University Of Toronto), Costas Boulis (Technical University Of Crete)

This paper concerns rapid adaptation of hidden Markov model (HMM) based speech recognizers to a new speaker, when only a few speech samples (one minute or less) are available from that speaker. A widely used family of adaptation algorithms defines adaptation as a linearly constrained re-estimation of the HMM Gaussians. With so little speech data, tight constraints must be introduced by reducing the number of linear transforms and by specifying certain transform structures (e.g., block diagonal). We hypothesize that under these adaptation conditions, the residual errors of the adapted Gaussian parameters can be represented and corrected by dependency models estimated from a training corpus. Thus, after introducing a particular class of linear transforms, we develop correlation models of the transform parameters. In rapid adaptation experiments on the SWITCHBOARD corpus, the proposed algorithm performs better than both transform-constrained adaptation and adaptation by correlation modeling of the HMM parameters.
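
A sketch of the dependency-modeling idea with a joint Gaussian over stacked transform bias parameters: biases that the scarce adaptation data cannot estimate are predicted from the observed ones by Gaussian conditioning. The parameterization is hypothetical; the paper's transform class and estimation details differ.

    import numpy as np

    def fit_joint_gaussian(bias_matrix):
        # bias_matrix: (n_speakers, n_params) stacked transform biases
        # collected from the training corpus.
        return bias_matrix.mean(axis=0), np.cov(bias_matrix, rowvar=False)

    def predict_missing(mu, cov, obs_idx, obs_vals, eps=1e-6):
        # Conditional-Gaussian prediction of unobserved bias parameters
        # given the ones estimated from the new speaker's data.
        n = len(mu)
        mis_idx = np.setdiff1d(np.arange(n), obs_idx)
        S_oo = cov[np.ix_(obs_idx, obs_idx)] + eps * np.eye(len(obs_idx))
        S_mo = cov[np.ix_(mis_idx, obs_idx)]
        pred = mu[mis_idx] + S_mo @ np.linalg.solve(S_oo, obs_vals - mu[obs_idx])
        full = np.empty(n)
        full[obs_idx], full[mis_idx] = obs_vals, pred
        return full

    # Toy usage: 50 training speakers, 8 bias parameters, first 3 observed.
    rng = np.random.default_rng(4)
    B = rng.normal(size=(50, 8))
    mu, cov = fit_joint_gaussian(B)
    biases = predict_missing(mu, cov, np.arange(3), B[0, :3])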


SP-24.8  

Speech Recognition in a Reverberant Environment using Matched Filter Array (MFA) Processing and Linguistic-Tree Maximum Likelihood Linear Regression (LT-MLLR) Adaptation
Prabhu Raghavan, Richard J Renomeron, Chiwei Che, Dong-Suk Yuk, James L Flanagan (CAIP Center, Rutgers University, Piscataway, NJ 08854)

Performance of automatic speech recognition systems trained on close-talking data suffers in a distant-talking environment because of the mismatch between training and testing conditions. Microphone array sound capture can reduce some of this mismatch by removing ambient noise and reverberation, but by itself offers insufficient improvement in performance. However, using array signal capture in conjunction with Hidden Markov Model (HMM) adaptation of the clean-speech models can result in improved recognition accuracy. This paper describes an experiment in which the output of an 8-element microphone array system using MFA processing is used for speech recognition with LT-MLLR adaptation. The recognition is done in two passes. In the first pass, an HMM trained on clean data is used to recognize the speech. Using the results of this pass, the HMM is adapted to the environment with the LT-MLLR algorithm. This adapted model, combining MFA and LT-MLLR, yields improved recognition performance.
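
A minimal sketch of the matched-filter-array idea: each microphone channel is convolved with the time-reversed source-to-microphone impulse response, and the channels are summed. The toy impulse responses below are random decaying sequences; a real MFA system measures or estimates the room responses.

    import numpy as np

    def mfa_process(mic_signals, impulse_responses):
        # mic_signals: list of 1-D arrays, one per microphone;
        # impulse_responses: matching source-to-mic impulse responses.
        out = None
        for x, h in zip(mic_signals, impulse_responses):
            y = np.convolve(x, h[::-1])    # matched filter = time-reversed IR
            out = y if out is None else out + y
        return out

    # Toy usage: 8 microphones, as in the experiment above.
    rng = np.random.default_rng(5)
    src = rng.normal(size=1000)
    irs = [rng.normal(size=64) * np.exp(-np.arange(64) / 16.0)
           for _ in range(8)]
    mics = [np.convolve(src, h) for h in irs]
    enhanced = mfa_process(mics, irs)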



