Abstract: Session SP-24
|
SP-24.1
|
Fast Speaker Adaptation Using a priori Knowledge
Roland Kuhn (Speech Technology Laboratory, Panasonic Technologies Inc.),
Patrick Nguyen (Institut Eurecom),
Jean-Claude Junqua,
Robert C Boman,
Nancy A Niedzielski,
Steven C Fincke,
Kenneth L Field,
Matteo Contolini (Speech Technology Laboratory, Panasonic Technologies Inc.)
Recently, we presented a radically new class of fast adaptation techniques for speech recognition,
based on prior knowledge of speaker variation.
To obtain this prior knowledge, one applies
a dimensionality reduction technique to T vectors of dimension
D derived from T speaker-dependent (SD) models. This offline step yields T basis vectors,
the eigenvoices. We constrain the model for new speaker S to be located in the space spanned
by the first K eigenvoices. Speaker adaptation involves estimating K eigenvoice
coefficients for the new speaker; typically, K is very small compared to the original dimension D.
Here, we review how to find the eigenvoices, give a maximum-likelihood estimator for
the new speaker's eigenvoice coefficients, and summarize mean adaptation experiments
carried out on the Isolet database. We present new results which assess the impact on
performance of changes in training of the SD models.
Finally, we interpret the first few eigenvoices obtained.
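A minimal sketch of the two steps in Python/NumPy, assuming the SD models are flattened into mean supervectors, with a plain least-squares projection standing in for the maximum-likelihood coefficient estimator (which additionally weights the fit by the HMM occupation statistics); all dimensions and data here are illustrative:

import numpy as np

# T speaker-dependent models, each flattened into a supervector of dimension D
# (random stand-ins; in practice these are concatenated HMM mean vectors).
T, D, K = 100, 2000, 10
rng = np.random.default_rng(0)
sd_supervectors = rng.standard_normal((T, D))

# Offline step: dimensionality reduction (plain PCA via the SVD) yields the
# eigenvoices, i.e. the basis vectors of speaker space.
mean_voice = sd_supervectors.mean(axis=0)
_, _, vt = np.linalg.svd(sd_supervectors - mean_voice, full_matrices=False)
eigenvoices = vt[:K]                         # first K eigenvoices, shape (K, D)

# Online step: the new speaker S is constrained to the span of the first K
# eigenvoices, so adaptation estimates only K coefficients (K << D).
new_speaker_stats = rng.standard_normal(D)   # stand-in for adaptation statistics
coeffs = eigenvoices @ (new_speaker_stats - mean_voice)
adapted_model = mean_voice + coeffs @ eigenvoices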
|
SP-24.2
|
Speaker Adaptation Using Maximum Likelihood Model Interpolation
Zuoying Wang (Electronic Engineering Department, Tsinghua Univ., Beijing, China),
Feng Liu (Electronic Engineering Department, Tsinghua Univ., Beijing, China)
A speaker adaptation scheme named maximum likelihood model interpolation (MLMI) is proposed.
The basic idea of MLMI is to compute the speaker-adapted (SA) model of a test speaker
by a linear convex combination of a set of speaker-dependent (SD) models.
Given a set of training speakers, we first calculate the corresponding SD models
for each training speaker as well as the speaker-independent (SI) models.
Then, the mean vector of the SA model is computed as the weighted sum of the set of SD mean vectors,
while the covariance matrix is the same as that of the SI model.
An algorithm is given that estimates the weight parameters by maximizing
the likelihood of the adaptation data under the SA model.
Experiments show that 3 adaptation sentences can give a significant performance improvement.
As the number of SD models increases, further improvement can be obtained.
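The weight estimation can be sketched as a variance-weighted least-squares fit, assuming each adaptation frame has already been aligned to a state (e.g. by a first decoding pass); the crude simplex projection at the end is for illustration only and is not the paper's algorithm:

import numpy as np

# Illustrative sizes: N training speakers, S states, D-dimensional features.
N, S, D = 20, 5, 13
rng = np.random.default_rng(0)
sd_means = rng.standard_normal((S, N, D))   # SD mean vectors per state and speaker
si_var = np.ones((S, D))                    # diagonal SI variances (kept by the SA model)

# Adaptation data: frames aligned to states by a first pass.
frames = rng.standard_normal((30, D))
states = rng.integers(0, S, size=30)

# Maximize the SA-model likelihood of the adaptation data over the weights w:
# with fixed alignments this is the weighted least-squares problem
#   min_w sum_t || x_t - sd_means[s_t].T @ w ||^2_(1/si_var),
# solved here by its normal equations.
A = 1e-6 * np.eye(N)                        # small ridge for numerical stability
b = np.zeros(N)
for x, s in zip(frames, states):
    M = sd_means[s] / si_var[s]             # rows scaled by the inverse variances
    A += M @ sd_means[s].T
    b += M @ x
w = np.linalg.solve(A, b)
w = np.clip(w, 0.0, None)
w /= w.sum()                                # crude renormalization onto the simplex

sa_means = np.einsum('n,snd->sd', w, sd_means)  # SA means: weighted sums of SD means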
|
SP-24.3
|
Speaker Adaptation with All-Pass Transforms
John W McDonough,
William J Byrne (Center for Language and Speech Processing, The Johns Hopkins University)
In recent work, a class of transforms was proposed which achieves a
remapping of the frequency axis much like conventional vocal tract
length normalization. These mappings, known collectively as
all-pass transforms (APT), were shown to produce substantial
improvements in the performance of a large vocabulary speech
recognition system when used to normalize incoming speech prior to
recognition. In this application, the most advantageous characteristic
of the APT was its cepstral-domain linearity; this linearity makes
speaker normalization simple to implement, and provides for the robust
estimation of the parameters characterizing individual speakers. In
the current work, we exploit the APT to develop a speaker
adaptation scheme in which the cepstral means of a speech recognition
model are transformed to better match the speech of a given
speaker. In a set of speech recognition experiments conducted on the
Switchboard Corpus, we report reductions in word error rate of 3.7%
absolute.
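The cepstral-domain linearity is easy to demonstrate numerically: warping the frequency axis with a first-order all-pass maps the cepstra through a fixed matrix. The sketch below builds that matrix by numerical integration; the single-parameter warp and the value of alpha are illustrative, as the APT family in the paper is more general:

import numpy as np

def allpass_warp(omega, alpha):
    # Frequency mapping of the first-order all-pass A(z) = (z - alpha)/(1 - alpha*z).
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega), 1.0 - alpha * np.cos(omega))

def cepstral_warp_matrix(n_cep, alpha, n_grid=4096):
    # Builds W such that warped_cepstra = W @ cepstra, using the Fourier-series
    # view of real cepstra: log S(w) = c0 + 2 * sum_k c_k cos(k w), so the warped
    # cepstra are the cosine coefficients of log S(theta(w)).
    w_hat = np.linspace(0.0, np.pi, n_grid)
    theta = allpass_warp(w_hat, alpha)
    basis = np.cos(np.outer(np.arange(n_cep), theta))   # c_k evaluated on the warped axis
    basis[1:] *= 2.0
    cos_out = np.cos(np.outer(np.arange(n_cep), w_hat))
    return np.trapz(cos_out[:, None, :] * basis[None, :, :], w_hat, axis=2) / np.pi

# Adapting the cepstral mean of a Gaussian for one speaker is a matrix multiply:
W = cepstral_warp_matrix(n_cep=13, alpha=0.1)
warped_mean = W @ np.ones(13)   # illustrative cepstral mean vector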
|
SP-24.4
|
Improved Methods for Vocal Tract Normalization
Lutz Welling,
Stephan Kanthak,
Hermann Ney (RWTH Aachen - University of Technology)
This paper presents improved methods for vocal tract normalization (VTN) along with
experimental tests on three databases.
We propose a new method for VTN in training:
using acoustic models with single Gaussian densities per state to select
the normalization scales prevents the models from learning
the normalization scales of the training speakers.
We show that using single Gaussian densities for selecting the normalization scales
in training results in lower error rates than using mixture densities.
For VTN in recognition, we propose an improvement of
the well-known multiple-pass strategy: using
an unnormalized acoustic model for the first recognition pass,
instead of a normalized model, yields lower error
rates. In recognition tests, this method is compared with a fast
variant of VTN.
The multiple-pass strategy is an efficient method but it is
suboptimal because the normalization scale and the word
sequence are determined sequentially. We found that for telephone digit string recognition
this suboptimality reduces the VTN gain in recognition performance by 30% relative.
On the German spontaneous scheduling task Verbmobil, the
WSJ task, and the German telephone digit string corpus SieTill,
the proposed methods for VTN reduce the error rates significantly.
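A sketch of the scale selection, assuming a per-speaker grid search that maximizes the likelihood of single-Gaussian models; the feature extractor below is a pure placeholder (a real implementation rescales the frequency axis inside the filterbank), so only the search logic is meaningful:

import numpy as np

def extract_features(speech, warp):
    # Placeholder for cepstral analysis with the frequency axis scaled by `warp`;
    # in a real front end the rescaling happens inside the mel filterbank.
    return np.tanh(warp * speech)

def single_gaussian_loglik(feats, mean, var):
    # Log-likelihood under a diagonal Gaussian; the paper's point is that such
    # simple densities cannot absorb speaker scales, so the search must undo them.
    return -0.5 * np.sum((feats - mean) ** 2 / var + np.log(2.0 * np.pi * var))

def select_warp(speech, mean, var, grid=np.arange(0.88, 1.13, 0.02)):
    # Grid search for the normalization scale with the highest likelihood.
    scores = [single_gaussian_loglik(extract_features(speech, a), mean, var)
              for a in grid]
    return grid[int(np.argmax(scores))]

rng = np.random.default_rng(0)
best_scale = select_warp(rng.standard_normal((200, 13)),
                         mean=np.zeros(13), var=np.ones(13))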
|
SP-24.5
|
Rapid Speech Recognizer Adaptation to New Speakers
Vassilis Digalakis (Technical U. of Crete),
Sid Berkowitz (Department of Defense),
Enrico Bocchieri (AT&T),
Costas Boulis (Technical U. of Crete),
William Byrne (Johns Hopkins U.),
Heather Collier (West Virginia U.),
Adrian Corduneanu (U. of Toronto),
Ashvin Kannan (Nuance Communications),
Sanjeev Khudanpur (Johns Hopkins U.),
Ananth Sankar (SRI)
This paper summarizes the work of the "Rapid Speech Recognizer
Adaptation" team in the workshop held at Johns Hopkins University in
the summer of 1998. The project addressed the modeling of
dependencies between units of speech with the goal of making more
effective use of small amounts of data for speaker adaptation. A
variety of methods were investigated and their effectiveness in a
rapid adaptation task defined on the SWITCHBOARD conversational speech
corpus is reported.
|
SP-24.6
|
Tree-Structured Models of Parameter Dependence for Rapid Adaptation in Large Vocabulary Conversational Speech Recognition
Ashvin Kannan (Nuance Communications),
Sanjeev P Khudanpur (Johns Hopkins University)
Two models of statistical dependence between acoustic model parameters
of a large vocabulary conversational speech recognition (LVCSR) system
are investigated for the purpose of rapid speaker- and
environment-adaptation from a very small amount of speech: (i) a
Gaussian multiscale process governed by a stochastic linear dynamical
system on a tree, and (ii) a simple hierarchical tree-structured
prior. Both methods permit Bayesian (MAP) estimation of acoustic
model parameters without parameter-tying even when no samples are
available to independently estimate some parameters due to the limited
amount of adaptation data. Modeling methodologies are contrasted, and
comparative performance of the two on the Switchboard task is
presented under identical test conditions for supervised and
unsupervised adaptation with controlled amounts of adaptation speech.
Both methods provide significant (1% absolute) gain in accuracy over
adaptation methods that do not exploit the dependence between acoustic
model parameters.
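The second model (the hierarchical tree-structured prior) can be sketched as top-down MAP smoothing: each node combines its own adaptation counts with its parent's estimate, so parameters with no samples back off to an ancestor instead of staying untouched. The tree, counts, and prior weight tau below are all illustrative:

import numpy as np

class Node:
    def __init__(self, children=(), data_sum=None, count=0.0):
        self.children = list(children)
        self.data_sum = data_sum    # sum of adaptation frames mapped to this node
        self.count = count          # number of such frames (0 if unobserved)

def map_means(node, parent_mean, tau=10.0):
    # MAP estimate: count-weighted interpolation between the node's own data
    # and the parent's estimate; with count == 0 it backs off to the parent,
    # so every leaf Gaussian is updated even from very little speech.
    if node.count > 0:
        mean = (node.data_sum + tau * parent_mean) / (node.count + tau)
    else:
        mean = parent_mean
    if not node.children:
        return [mean]
    return sum((map_means(child, mean, tau) for child in node.children), [])

# Toy tree: a root over two leaves, only the first of which has adaptation data.
D = 3
leaf_a = Node(data_sum=5.0 * np.ones(D), count=5.0)
leaf_b = Node()                              # unobserved leaf: inherits from root
root = Node(children=[leaf_a, leaf_b])
leaf_means = map_means(root, parent_mean=np.zeros(D))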
|
SP-24.7
|
Correlation Modeling Of MLLR Transform Biases For Rapid HMM Adaptation To New Speakers
Enrico L Bocchieri (AT&T Research),
Vassilis Digalakis (Technical University Of Crete),
Adrian Corduneanu (University Of Toronto),
Costas Boulis (Technical University Of Crete)
This paper concerns rapid adaptation of
hidden Markov model (HMM) based speech
recognizers to a new speaker, when only
a few speech samples (one minute or less)
are available from the new speaker.
A widely used family of adaptation algorithms
defines adaptation as a linearly constrained
reestimation of the HMM Gaussians.
With limited speech data, tight constraints must
be introduced, by reducing the number of linear
transforms and by specifying certain transform
structures (e.g. block diagonal).
We hypothesize that under these adaptation
conditions, the residual errors
of the adapted Gaussian parameters can be
represented and corrected by dependency models,
as estimated from a training corpus.
Thus, after introducing a particular class of
linear transforms, we develop correlation models
of the transform parameters. In rapid adaptation
experiments on the SWITCHBOARD corpus,
the proposed algorithm performs better
than both transform-constrained adaptation and
adaptation by correlation modeling of the HMM
parameters.
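One way to realize the dependency model, sketched below under simplifying assumptions: fit a joint Gaussian over the stacked per-class bias vectors on the training corpus, then predict the biases of classes unseen in the adaptation data from the observed ones with the conditional-Gaussian formula. All sizes and the synthetic training biases are illustrative:

import numpy as np

rng = np.random.default_rng(0)

# Offline: per-speaker bias vectors for C transform classes (D dims each),
# estimated on training speakers; here synthetic correlated data stands in.
C, D, n_speakers = 4, 3, 500
mixing = rng.standard_normal((C * D, C * D))
train_biases = rng.standard_normal((n_speakers, C * D)) @ mixing
mu = train_biases.mean(axis=0)
Sigma = np.cov(train_biases, rowvar=False)

# Online: with a minute of speech, only classes 0 and 1 get reliable biases.
obs = np.arange(0, 2 * D)               # observed bias components
mis = np.arange(2 * D, C * D)           # components to be predicted
b_obs = rng.standard_normal(len(obs))   # stand-in for the estimated biases

# Conditional-Gaussian prediction of the missing biases given the observed ones:
#   E[b_mis | b_obs] = mu_m + S_mo @ S_oo^{-1} @ (b_obs - mu_o)
S_oo = Sigma[np.ix_(obs, obs)]
S_mo = Sigma[np.ix_(mis, obs)]
b_mis = mu[mis] + S_mo @ np.linalg.solve(S_oo, b_obs - mu[obs])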
|
SP-24.8
|
Speech Recognition in a Reverberant Environment using Matched Filter Array (MFA) Processing and Linguistic-Tree Maximum Likelihood Linear Regression (LT-MLLR) Adaptation
Prabhu Raghavan,
Richard J Renomeron,
Chiwei Che,
Dong-Suk Yuk,
James L Flanagan (CAIP Center, Rutgers University, Piscataway, NJ 08854)
Performance of automatic speech recognition systems trained on
close-talking data suffers when used in a distant-talking environment
due to the mismatch in training and testing conditions. Microphone
array sound capture can reduce some mismatch by removing ambient noise
and reverberation but offers insufficient improvement in
performance. However, using array signal capture in conjunction with Hidden
Markov Model (HMM) adaptation on the clean-speech models can result in
improved recognition accuracy. This paper describes an experiment in
which the output of an 8-element microphone array system using MFA
processing is used for speech recognition with LT-MLLR adaptation.
The recognition is done in two passes. In the first pass, an HMM
trained on clean data is used to recognize the speech. Using the
results of this pass, the HMM is adapted to the environment
using the LT-MLLR algorithm. This adapted model, a product of MFA and
LT-MLLR, results in improved recognition performance.
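MFA processing itself can be sketched compactly: each microphone signal is filtered with the time reverse of that channel's room impulse response and the outputs are summed, coherently reinforcing the direct path. The impulse responses below are synthetic stand-ins for measured ones:

import numpy as np

def mfa_process(mic_signals, impulse_responses):
    # Matched filter array: convolve each channel with its time-reversed room
    # impulse response, then average across channels.
    out = None
    for x, h in zip(mic_signals, impulse_responses):
        y = np.convolve(x, h[::-1])      # matched filter = time-reversed RIR
        out = y if out is None else out + y
    return out / len(mic_signals)

# Toy setup: 8 microphones, synthetic impulse responses standing in for the room.
rng = np.random.default_rng(0)
n_mics, sig_len, rir_len = 8, 16000, 512
source = rng.standard_normal(sig_len)
rirs = rng.standard_normal((n_mics, rir_len))
mics = [np.convolve(source, h)[:sig_len] for h in rirs]   # reverberant captures
enhanced = mfa_process(mics, rirs)   # this output feeds the recognition front end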
|