Authors:
Roland Kuhn,
Patrick Nguyen,
Jean-Claude Junqua,
Robert C Boman,
Nancy A Niedzielski,
Steven C Fincke,
Kenneth L Field,
Matteo Contolini,
Page (NA) Paper number 1587
Abstract:
Recently, we presented a radically new class of fast adaptation techniques
for speech recognition, based on prior knowledge of speaker variation.
To obtain this prior knowledge, one applies a dimensionality reduction
technique to T vectors of dimension D derived from T speaker-dependent
(SD) models. This offline step yields T basis vectors, the eigenvoices.
We constrain the model for new speaker S to be located in the space
spanned by the first K eigenvoices. Speaker adaptation involves estimating
K eigenvoice coefficients for the new speaker; typically, K is very
small compared to the original dimension D. Here, we review how to find
the eigenvoices, give a maximum-likelihood estimator for the new speaker's
eigenvoice coefficients, and summarize mean adaptation experiments
carried out on the Isolet database. We present new results which assess
the impact on performance of changes in training of the SD models.
Finally, we interpret the first few eigenvoices obtained.
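
As a rough illustration of the eigenvoice idea, the following Python/NumPy sketch performs the offline dimensionality-reduction step by PCA on SD mean supervectors and then constrains a new speaker to the span of the first K eigenvoices. The least-squares projection shown here merely stands in for the maximum-likelihood coefficient estimator the paper derives, and all function names are illustrative.

```python
# Illustrative sketch of the eigenvoice approach (names are not from the
# paper): PCA via SVD on T speaker-dependent mean supervectors, then a
# least-squares projection onto the first K eigenvoices. Plain least
# squares is used only to keep the example self-contained; the paper
# gives a maximum-likelihood estimator for the coefficients.
import numpy as np

def train_eigenvoices(sd_supervectors, K):
    """sd_supervectors: (T, D) array, one concatenated mean vector per
    SD model. Returns the average voice and the first K eigenvoices."""
    mean = sd_supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(sd_supervectors - mean, full_matrices=False)
    return mean, vt[:K]                      # shapes (D,) and (K, D)

def adapt_new_speaker(mean, eigenvoices, observed_supervector):
    """Constrain the new speaker's model to the eigenvoice subspace."""
    w, *_ = np.linalg.lstsq(eigenvoices.T,
                            observed_supervector - mean, rcond=None)
    return mean + eigenvoices.T @ w          # adapted mean supervector
```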
Authors:
Zuoying Wang, Electronic Engineering Department, Tsinghua Univ., Beijing, China (China)
Feng Liu, Electronic Engineering Department, Tsinghua Univ., Beijing, China (China)
Page (NA) Paper number 1368
Abstract:
A speaker adaptation scheme named maximum likelihood model interpolation
(MLMI) is proposed. The basic idea of MLMI is to compute the speaker
adapted (SA) model of a test speaker by a linear convex combination
of a set of speaker dependent (SD) models. Given a set of training
speakers, we first calculate the corresponding SD models for each training
speaker as well as the speaker-independent (SI) models. Then, the mean
vector of the SA model is computed as the weighted sum of the SD
mean vectors, while the covariance matrix is the same as that
of the SI model. An algorithm to estimate the weight parameters is
given which maximizes the likelihood of the SA model given the adaptation
data. Experiments show that three adaptation sentences can give a significant
performance improvement. As the number of SD models increases, further
improvement can be obtained.
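
A toy sketch of the MLMI update for a single Gaussian state is given below: the adapted mean is a convex combination of the SD means, the covariance is taken from the SI model, and the weights are fit to aligned adaptation frames by projected gradient ascent on the Gaussian log-likelihood. The abstract does not spell out the weight-estimation algorithm, so this stand-in is an assumption, and the clip-and-renormalize step is only a crude simplex projection.

```python
# Toy MLMI sketch for one Gaussian state (illustrative, with assumed
# interfaces): the adapted mean is a convex combination of SD means, the
# covariance is the SI covariance, and the weights are fit to aligned
# adaptation frames by projected gradient ascent on the log-likelihood.
import numpy as np

def mlmi_weights(frames, sd_means, si_cov, steps=200, lr=0.01):
    """frames: (N, D) adaptation frames aligned to this state.
    sd_means: (R, D), one row per SD model. Returns simplex weights."""
    R = len(sd_means)
    w = np.full(R, 1.0 / R)
    prec = np.linalg.inv(si_cov)
    for _ in range(steps):
        resid = frames - w @ sd_means            # (N, D)
        # Gradient of the Gaussian log-likelihood w.r.t. the weights.
        grad = sd_means @ prec @ resid.sum(axis=0)
        w += lr * grad / len(frames)
        w = np.clip(w, 0.0, None)                # crude simplex projection:
        w /= w.sum()                             # clip, then renormalize
    return w

# The adapted mean is then `mlmi_weights(...) @ sd_means`, with the SI
# covariance left unchanged, as described in the abstract.
```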
Authors:
John W McDonough,
William J Byrne,
Page (NA) Paper number 2093
Abstract:
In recent work, a class of transforms was proposed that achieves a
remapping of the frequency axis much like conventional vocal tract
length normalization. These mappings, known collectively as all-pass
transforms (APT), were shown to produce substantial improvements in
the performance of a large vocabulary speech recognition system when
used to normalize incoming speech prior to recognition. In this application,
the most advantageous characteristic of the APT was its cepstral-domain
linearity; this linearity makes speaker normalization simple to implement,
and provides for the robust estimation of the parameters characterizing
individual speakers. In the current work, we exploit the APT to develop
a speaker adaptation scheme in which the cepstral means of a speech
recognition model are transformed to better match the speech of a given
speaker. In a set of speech recognition experiments conducted on the
Switchboard Corpus, we report reductions in word error rate of 3.7%
absolute.
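
The cepstral-domain linearity mentioned above can be made concrete with a small numerical sketch: for a first-order all-pass (bilinear-transform) warp, the warped cepstrum is a fixed matrix times the original cepstrum, so adapting cepstral means reduces to one matrix-vector product per mean. The construction below builds that matrix by numerical integration; it illustrates the property rather than reproducing the authors' implementation.

```python
# Numerical illustration of the cepstral-domain linearity of a first-order
# all-pass (bilinear-transform) warp: the warped cepstrum is A @ c for a
# fixed matrix A(alpha). This is a sketch of the property, not the
# authors' code.
import numpy as np

def bilinear_warp(omega, alpha):
    """Frequency mapping of the first-order all-pass transform."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega)
                                   / (1.0 - alpha * np.cos(omega)))

def cepstral_warp_matrix(n_ceps, alpha, n_grid=2048):
    """Build A by numerically integrating
    c'_n = (1/pi) * int_0^pi log S(u) cos(n*theta(u)) theta'(u) du,
    where log S(u) = sum_m w_m c_m cos(m u), w_0 = 1 and w_m = 2 (m >= 1)."""
    u = np.linspace(0.0, np.pi, n_grid)
    theta = bilinear_warp(u, alpha)
    dtheta = np.gradient(theta, u)                        # theta'(u)
    weights = np.full(n_ceps, 2.0)
    weights[0] = 1.0
    rows = np.cos(np.outer(np.arange(n_ceps), theta)) * dtheta
    cols = np.cos(np.outer(np.arange(n_ceps), u)) * weights[:, None]
    return (rows @ cols.T) * ((u[1] - u[0]) / np.pi)
```

With alpha = 0 the matrix is, up to quadrature error, the identity; small positive or negative values of alpha stretch or compress the frequency axis, mimicking vocal tract length differences between speakers.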
Authors:
Lutz Welling,
Stephan Kanthak,
Hermann Ney,
Page (NA) Paper number 1436
Abstract:
This paper presents improved methods for vocal tract normalization
(VTN) along with experimental tests on three databases. We propose
a new method for VTN in training: using acoustic models with single
Gaussian densities per state to select the normalization scales
prevents the models from learning the normalization scales of the
training speakers. We show that using single Gaussian densities for
selecting the normalization scales in training results in lower error
rates than using mixture densities. For VTN in recognition, we propose
an improvement of the well-known multiple-pass strategy: using an
unnormalized acoustic model for the first recognition pass instead
of a normalized model yields lower error rates. In recognition
tests, this method is compared with a fast variant of VTN. The multiple-pass
strategy is an efficient method but it is suboptimal because the normalization
scale and the word sequence are determined sequentially. We found that
for telephone digit string recognition this suboptimality reduces the
VTN gain in recognition performance by 30% relative. On the German
spontaneous scheduling task Verbmobil, the WSJ task and the German
telephone digit string corpus SieTill the proposed methods for VTN
reduce the error rates significantly.
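
A minimal sketch of the warp-scale selection described above: each training speaker's normalization scale is chosen by a grid search that maximizes the forced-alignment likelihood of the warped features under the single-Gaussian-per-state model. The `warp_features` and `alignment_loglik` callables are placeholders for a feature extractor with a configurable warp factor and an alignment scorer; they are assumptions, not a real API.

```python
# Sketch of warp-scale selection in VTN training, with assumed interfaces:
# `warp_features(utt, alpha)` extracts features at warp factor alpha and
# `alignment_loglik(model, feats, text)` scores a forced alignment.
import numpy as np

WARP_GRID = np.arange(0.88, 1.13, 0.02)   # assumed search range

def select_warp_scale(utterances, transcripts, single_gauss_model,
                      warp_features, alignment_loglik):
    """Grid search over warp scales for one training speaker."""
    def total_loglik(alpha):
        return sum(alignment_loglik(single_gauss_model,
                                    warp_features(utt, alpha), text)
                   for utt, text in zip(utterances, transcripts))
    return max(WARP_GRID, key=total_loglik)
```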
Authors:
Vassilis Digalakis,
Sid Berkowitz,
Enrico L Bocchieri,
Costas Boulis,
William J Byrne,
Heather Collier,
Adrian Corduneanu,
Ashvin Kannan,
Sanjeev P Khudanpur,
Ananth Sankar,
Page (NA) Paper number 2102
Abstract:
This paper summarizes the work of the "Rapid Speech Recognizer Adaptation"
team in the workshop held at Johns Hopkins University in the summer
of 1998. The project addressed the modeling of dependencies between
units of speech with the goal of making more effective use of small
amounts of data for speaker adaptation. A variety of methods were investigated
and their effectiveness in a rapid adaptation task defined on the SWITCHBOARD
conversational speech corpus is reported.
Authors:
Ashvin Kannan,
Sanjeev P Khudanpur,
Page (NA) Paper number 2197
Abstract:
Two models of statistical dependence between acoustic model parameters
of a large vocabulary conversational speech recognition (LVCSR) system
are investigated for the purpose of rapid speaker- and environment-adaptation
from a very small amount of speech: (i) a Gaussian multiscale process
governed by a stochastic linear dynamical system on a tree, and (ii)
a simple hierarchical tree-structured prior. Both methods permit Bayesian
(MAP) estimation of acoustic model parameters without parameter-tying
even when no samples are available to independently estimate some parameters
due to the limited amount of adaptation data. Modeling methodologies
are contrasted, and comparative performance of the two on the Switchboard
task is presented under identical test conditions for supervised and
unsupervised adaptation with controlled amounts of adaptation speech.
Both methods provide significant (1% absolute) gain in accuracy over
adaptation methods that do not exploit the dependence between acoustic
model parameters.
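
A minimal sketch of the second model, the hierarchical tree-structured prior: each node's mean is a MAP combination of its own adaptation statistics and its parent's estimate, so parameters with no adaptation samples back off smoothly up the tree. The node layout and the single shrinkage constant `tau`, playing the role of the prior's equivalent sample size, are illustrative assumptions.

```python
# Minimal sketch of the tree-structured prior (model ii), with assumed
# data structures: a top-down pass shrinks each node's mean toward its
# parent's estimate; nodes with no data inherit the parent's mean.
import numpy as np

class Node:
    def __init__(self, prior_mean, children=()):
        self.prior_mean = np.asarray(prior_mean)  # SI estimate at this node
        self.children = list(children)
        self.stats_sum = None                     # sum of aligned frames
        self.stats_count = 0                      # number of aligned frames

def map_adapt(node, parent_mean=None, tau=10.0):
    """Top-down MAP pass over the tree of Gaussian means."""
    prior = node.prior_mean if parent_mean is None else parent_mean
    if node.stats_count > 0:
        node.adapted_mean = ((tau * prior + node.stats_sum)
                             / (tau + node.stats_count))
    else:
        node.adapted_mean = prior                 # back off to the parent
    for child in node.children:
        map_adapt(child, parent_mean=node.adapted_mean, tau=tau)
```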
Authors:
Enrico L Bocchieri,
Vassilis Digalakis,
Adrian Corduneanu,
Costas Boulis,
Page (NA) Paper number 2343
Abstract:
This paper concerns rapid adaptation of hidden Markov model (HMM) based
speech recognizers to a new speaker, when only a few speech samples (one
minute or less) are available from the new speaker. A widely used family
of adaptation algorithms defines adaptation as a linearly constrained
reestimation of the HMM Gaussians. With so little speech data, tight constraints
must be introduced, by reducing the number of linear transforms and
by specifying certain transform structures (e.g. block diagonal). We
hypothesize that under these adaptation conditions, the residual errors
of the adapted Gaussian parameters can be represented and corrected
by dependency models, as estimated from a training corpus. Thus, after
introducing a particular class of linear transforms, we develop correlation
models of the transform parameters. In rapid adaptation experiments
on the SWITCHBOARD corpus, the proposed algorithm performs better than
both transform-constrained adaptation and adaptation by correlation
modeling of the HMM parameters.
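
The dependency-modeling step can be sketched as follows: fit a joint Gaussian over per-speaker transform parameter vectors on the training corpus, then predict parameters of transforms that received no adaptation data from those that did, via the conditional mean. This illustrates the idea of correlation models of transform parameters under assumed shapes, not the paper's exact algorithm.

```python
# Sketch of correlation modeling of transform parameters (assumed shapes,
# not the paper's exact algorithm): fit a joint Gaussian over per-speaker
# transform parameter vectors, then predict unobserved parameters from
# observed ones via the conditional mean of that Gaussian.
import numpy as np

def fit_correlation_model(train_params):
    """train_params: (S, P) transform parameters for S training speakers."""
    return train_params.mean(axis=0), np.cov(train_params, rowvar=False)

def predict_missing(mu, cov, observed_idx, observed_vals):
    """Conditional-Gaussian prediction of the unobserved parameters."""
    missing_idx = [i for i in range(len(mu)) if i not in set(observed_idx)]
    S_oo = cov[np.ix_(observed_idx, observed_idx)]
    S_mo = cov[np.ix_(missing_idx, observed_idx)]
    delta = observed_vals - mu[observed_idx]
    return missing_idx, mu[missing_idx] + S_mo @ np.linalg.solve(S_oo, delta)
```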
Authors:
Prabhu Raghavan,
Richard J Renomeron,
Chiwei Che,
Dong-Suk Yuk,
James L Flanagan,
Page (NA) Paper number 2002
Abstract:
The performance of automatic speech recognition systems trained on
close-talking data suffers in distant-talking environments due to the
mismatch between training and testing conditions. Microphone array sound
capture can reduce some mismatch by removing ambient noise and reverberation
but offers insufficient improvement in performance. However, using
array signal capture in conjunction with Hidden Markov Model (HMM)
adaptation on the clean-speech models can result in improved recognition
accuracy. This paper describes an experiment in which the output of
an 8-element microphone array system using MFA processing is used for
speech recognition with LT-MLLR adaptation. The recognition is done
in two passes. In the first pass, an HMM trained on clean data is used
to recognize the speech. Using the results of this pass, the HMM
is adapted to the environment with the LT-MLLR algorithm. This adapted
model, a product of MFA and LT-MLLR, results in improved recognition
performance.
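
The two-pass procedure reads as a short loop: decode the array output with the clean-speech HMM, use the hypothesis as supervision for LT-MLLR-style adaptation, then decode again. In the sketch below, `decode`, `estimate_mllr`, and `apply_mllr` are placeholder interfaces, not a real toolkit API.

```python
# Sketch of the two-pass decode-adapt-decode procedure with placeholder
# interfaces: the first-pass hypothesis supervises an LT-MLLR-style
# transform estimate, and the adapted model is used for the second pass.
def two_pass_recognize(array_audio, clean_hmm,
                       decode, estimate_mllr, apply_mllr):
    hyp = decode(clean_hmm, array_audio)            # first pass
    transform = estimate_mllr(clean_hmm, array_audio, hyp)
    adapted = apply_mllr(clean_hmm, transform)      # adapt clean-speech HMM
    return decode(adapted, array_audio)             # second pass
```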