Abstract: Session SP-11
SP-11.1
Frame Discrimination Training of HMMs for Large Vocabulary Speech Recognition
Dan Povey,
Philip C Woodland (Cambridge University Engineering Dept)
This paper describes the application of a discriminative HMM parameter
estimation technique called Frame Discrimination (FD) to medium and
large vocabulary continuous speech recognition. Previous work showed
that FD training gave better results than maximum mutual information
(MMI) training for small tasks. The use of FD for much larger tasks
required the development of a technique for rapidly finding the
most likely set of Gaussians in the system for each frame.
Experiments on the Resource Management and North American Business tasks
show that FD training can give improvements comparable to MMI while being
less computationally intensive.
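To make the shortlisting idea concrete, here is a minimal sketch of one
standard way to find the most likely Gaussians for a frame quickly: cluster
the Gaussian means with k-means, then score a frame only against Gaussians
belonging to its nearest clusters. This is an illustration under assumed
details, not the technique developed in the paper; all function names and
parameter values are hypothetical.

    import numpy as np

    def build_clusters(means, num_clusters=64, iters=10, seed=0):
        # Plain k-means over the Gaussian mean vectors (illustrative only).
        rng = np.random.default_rng(seed)
        centers = means[rng.choice(len(means), num_clusters, replace=False)].copy()
        for _ in range(iters):
            d = ((means[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(1)
            for c in range(num_clusters):
                if (assign == c).any():
                    centers[c] = means[assign == c].mean(0)
        return centers, assign

    def shortlist_gaussians(frame, centers, assign, num_nearest=4):
        # Score the frame against cluster centers only, then return the
        # indices of all Gaussians living in the closest few clusters.
        d = ((centers - frame) ** 2).sum(-1)
        near = np.argsort(d)[:num_nearest]
        return np.flatnonzero(np.isin(assign, near))

    # toy usage: 2000 Gaussian means in 39 dimensions
    means = np.random.randn(2000, 39)
    centers, assign = build_clusters(means)
    frame = np.random.randn(39)
    idx = shortlist_gaussians(frame, centers, assign)
    print(f"scoring {len(idx)} of {len(means)} Gaussians for this frame")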
SP-11.2
Discriminative Mixture Weight Estimation for Large Gaussian Mixture Models
Francoise Beaufays (Speech Technology and Research Laboratory, SRI International, Menlo Park, CA.),
Mitchel Weintraub,
Yochai Konig
This paper describes a new approach to acoustic modeling for large
vocabulary continuous speech recognition (LVCSR) systems. Each phone
is modeled with a large Gaussian mixture model (GMM) whose
context-dependent mixture weights are estimated with a sentence-level
discriminative training criterion. The estimation problem is cast in
a neural network framework, which enables the incorporation of the
appropriate constraints on the mixture weight vectors and allows
a straightforward training procedure based on steepest descent.
Experiments conducted on the Callhome-English and Switchboard
databases show a significant improvement in acoustic model
performance, and a somewhat smaller improvement with the combined
acoustic and language models.
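As a sketch of how casting the problem in a neural network framework handles
the constraints, consider the following: a softmax parameterization keeps the
mixture weights positive and summing to one, so plain steepest ascent can be
run on unconstrained parameters. The objective here is an ordinary
log-likelihood used purely for illustration, not the sentence-level
discriminative criterion of the paper, and all names and sizes are made up.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    # toy setup: one GMM with K components, with precomputed per-frame
    # component likelihoods lik[t, k] = N(x_t; mu_k, Sigma_k)
    rng = np.random.default_rng(0)
    K, T = 8, 50
    lik = rng.random((T, K)) + 1e-3

    logits = np.zeros(K)           # unconstrained parameters
    lr = 0.1
    for step in range(500):
        w = softmax(logits)        # valid mixture weights at every step
        p = lik @ w                # per-frame mixture likelihoods
        # gradient of the total log-likelihood w.r.t. the logits,
        # via the softmax Jacobian: dL/dz_k = sum_t w_k (lik[t,k]/p_t - 1)
        g = (w * (lik / p[:, None] - 1.0)).sum(0)
        logits += lr * g           # steepest ascent on the objective
    print("learned weights:", np.round(softmax(logits), 3))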
SP-11.3
Modeling disfluency and background events in ASR for a natural language understanding task
Richard C. Rose,
Giuseppe Riccardi (AT&T Labs - Research, 180 Park Ave., Florham Park, NJ 07932)
This paper investigates techniques for minimizing the impact of
non-speech sounds on the performance of
large vocabulary continuous speech recognition (LVCSR) systems.
An experimental study is presented that examines whether
the careful manual labeling of disfluency and background events in conversational
speech can be used to provide an additional level of supervision in
training HMM acoustic models and statistical language models.
First, techniques are investigated for incorporating explicitly labeled
disfluency and background events directly into the acoustic HMM.
Second, phrase-based statistical language models are trained from
utterance transcriptions which include labeled instances of these events.
Finally, it is shown that significant word accuracy and run-time
performance improvements are obtained for both sets of techniques
on a telephone-based spoken language understanding task.
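On the language modeling side, a small sketch of the idea under an
assumption the abstract only implies: if disfluency and background events
appear as explicit tokens in the training transcriptions, they can be
counted as ordinary vocabulary items when estimating an n-gram model, so
the model learns how events predict surrounding words. The transcriptions
and event labels below are invented for illustration.

    from collections import Counter

    # hypothetical transcriptions with labeled disfluency/background events
    transcripts = [
        "i want [um] a flight to boston [breath]",
        "[laughter] show me [um] the fares",
        "a flight [breath] to boston",
    ]

    bigrams, unigrams = Counter(), Counter()
    for line in transcripts:
        words = ["<s>"] + line.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    def bigram_prob(w1, w2):
        # Maximum likelihood bigram estimate; events are ordinary tokens.
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

    print(bigram_prob("[um]", "a"))          # events predict following words
    print(bigram_prob("flight", "[breath]"))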
SP-11.4
Decision Tree State Tying Based on Penalized Bayesian Information Criterion
Wu Chou,
Wolfgang Reichl (Bell Labs., Lucent Technologies)
In this paper, an approach to decision tree state tying based on a penalized
Bayesian information criterion (pBIC) is described. The pBIC is applied in two
important applications. First, it is used as a decision tree growing criterion in place of
the conventional approach of using a heuristic constant threshold. It is found
that the original BIC penalty is too low and does not lead to a compact decision tree
state tying model. Based on Wolfe's modification to the asymptotic null distribution,
it is derived that twice the BIC penalty should be used for decision tree state tying
based on pBIC. Second, pBIC is studied as a model compression criterion for decision
tree state tying based acoustic modeling. Experimental results on a large vocabulary
(Wall Street Journal) speech recognition task indicate that a compact decision tree
can be achieved with almost no loss in speech recognition performance.
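The growing criterion can be sketched as follows: a node is split only when
the training likelihood gain exceeds a penalty, and per the abstract the
penalty is twice the ordinary BIC term of (delta-parameters/2)*log N. The
single-Gaussian modeling and all numbers here are illustrative assumptions,
not the paper's exact formulation.

    import numpy as np

    def gauss_loglik(x):
        # Log-likelihood of data under one ML-fitted diagonal Gaussian.
        n, d = x.shape
        var = x.var(0) + 1e-8
        return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

    def should_split(x_left, x_right, penalty_factor=2.0):
        # pBIC-style test: split if the likelihood gain beats the penalty.
        # penalty_factor=1.0 would be ordinary BIC; the abstract argues for 2.0.
        x_all = np.vstack([x_left, x_right])
        gain = gauss_loglik(x_left) + gauss_loglik(x_right) - gauss_loglik(x_all)
        d = x_all.shape[1]
        delta_params = 2 * d          # one extra mean + variance vector
        penalty = penalty_factor * 0.5 * delta_params * np.log(len(x_all))
        return gain > penalty

    rng = np.random.default_rng(1)
    a = rng.normal(0.0, 1.0, (500, 13))
    b = rng.normal(2.0, 1.0, (500, 13))   # clearly separated data
    print(should_split(a, b))             # True: split is worth the parameters
    print(should_split(a, rng.normal(0.0, 1.0, (500, 13))))  # likely False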
SP-11.5
A 2D Extended HMM for Speech Recognition
Jiayu Li (Department of Statistics, The University of Chicago),
Alejandro Murua (Department of Statistics, University of Washington)
A two-dimensional extension of Hidden Markov Models (HMMs) is introduced,
aiming at improving the modeling of speech signals. The extended model
(a) focuses on the conditional joint distribution of state durations given
the length of utterances, rather than on state transition probabilities;
(b) extends the dependency of observation densities to current, as well as
neighboring, states; and (c) introduces a local averaging procedure to
smooth the outcome associated with transitions between successive states.
A set of efficient iterative algorithms for implementing the extended
model, based on segmental K-means and Iterated Conditional Modes, is also
presented. In applications to the recognition of segmented digits spoken
over the telephone, the extended model achieved about a 23% reduction in
the recognition error rate, when compared to the performance of HMMs.
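To make the segmental K-means ingredient concrete, here is a stripped-down
sketch that alternates between a dynamic programming alignment of frames to
a left-to-right state sequence and re-estimation of the state means. The
single-template, squared-distance setting is an assumption for illustration,
not the model of the paper.

    import numpy as np

    def best_segmentation(x, means):
        # DP over monotone alignments of frames to left-to-right states.
        T, S = len(x), len(means)
        cost = ((x[:, None, :] - means[None, :, :]) ** 2).sum(-1)  # (T, S)
        D = np.full((T, S), np.inf)
        D[0, 0] = cost[0, 0]
        for t in range(1, T):
            for s in range(S):
                stay = D[t - 1, s]
                move = D[t - 1, s - 1] if s > 0 else np.inf
                D[t, s] = cost[t, s] + min(stay, move)
        seg = np.empty(T, dtype=int)          # backtrack the best path
        seg[-1] = S - 1
        for t in range(T - 2, -1, -1):
            s = seg[t + 1]
            seg[t] = s if s == 0 or D[t, s] <= D[t, s - 1] else s - 1
        return seg

    def segmental_kmeans(x, num_states=3, iters=5):
        means = x[np.linspace(0, len(x) - 1, num_states).astype(int)].copy()
        for _ in range(iters):
            seg = best_segmentation(x, means)
            for s in range(num_states):
                if (seg == s).any():
                    means[s] = x[seg == s].mean(0)
        return means, seg

    x = np.concatenate([np.full((10, 2), 0.0), np.full((12, 2), 3.0),
                        np.full((8, 2), -2.0)]) + 0.1 * np.random.randn(30, 2)
    means, seg = segmental_kmeans(x)
    print(seg)   # frames grouped into three contiguous state segments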
SP-11.6
Probabilistic Classification of HMM States for
Large Vocabulary Continuous Speech Recognition
Xiaoqiang Luo,
Frederick Jelinek (CLSP, The Johns Hopkins University)
In state-of-the-art large vocabulary continuous speech
recognition (LVCSR) systems, HMM state tying is
often used to achieve a good balance between model
resolution and robustness. In this paradigm, tied
HMM states share a single set of parameters and are
indistinguishable. To capture the fine differences
among tied HMM states, the probabilistic classification
of HMM states (PCHMM) is proposed in this paper for
LVCSR. In particular, a distribution from an HMM state
to classes is introduced. It is shown that the
state-to-class distribution can be estimated together
with the conventional HMM parameters within the EM
framework. Compared with HMM state tying, probabilistic
classification of HMM states makes more efficient use
of model parameters. It also makes the acoustic model
more robust against possible mismatch or variation
between training and test data. The viability of this
approach is verified by a significant reduction in
word error rate (WER) on the Switchboard task.
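A minimal sketch of the sharing scheme, under assumptions the abstract does
not spell out (fixed class-conditional densities, a fixed state alignment,
EM updates for the state-to-class distribution only): each state's emission
likelihood becomes a state-specific mixture of shared class densities,
p(x|s) = sum_c P(c|s) p(x|c).

    import numpy as np

    rng = np.random.default_rng(0)
    S, C, T, D = 4, 3, 200, 2          # states, shared classes, frames, dims

    # fixed shared class densities (spherical Gaussians, for illustration)
    class_means = rng.normal(0, 3, (C, D))
    def class_lik(x):                   # lik[t, c] = N(x_t; mu_c, I)
        sq = ((x[:, None, :] - class_means[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq) / (2 * np.pi) ** (D / 2)

    P = np.full((S, C), 1.0 / C)        # state-to-class distribution P(c|s)
    x = rng.normal(0, 3, (T, D))
    state = rng.integers(0, S, T)       # assumed fixed state alignment

    for _ in range(20):                 # EM updates for P(c|s) only
        lik = class_lik(x)              # (T, C)
        post = P[state] * lik           # E-step: class posteriors per frame
        post /= post.sum(1, keepdims=True)
        for s in range(S):              # M-step: average class posteriors
            if (state == s).any():
                P[s] = post[state == s].mean(0)
    print(np.round(P, 3))               # rows sum to one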
SP-11.7
The HDM: A Segmental Hidden Dynamic Model of Coarticulation
Hywel B Richards,
John S Bridle (Dragon Systems UK)
This paper introduces a new approach to acoustic-phonetic modelling,
the Hidden Dynamic Model (HDM), which explicitly accounts for the
coarticulation and transitions between neighbouring phones.
Inspired by the fact that speech is produced by an underlying
dynamic system, the HDM consists of a single target vector per phone
in a hidden dynamic space in which speech trajectories are produced by
a simple dynamic system.
The hidden space is mapped to the surface acoustic representation via
a non-linear mapping in the form of a multilayer perceptron (MLP).
Algorithms are presented for training all the parameters
(target vectors and MLP weights) from segmented and labelled acoustic
observations alone, with no special initialisation.
The model captures the dynamic structure of speech, and
appears to aid a speech recognition task based on the Switchboard corpus.
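The generative structure can be sketched as follows. The specific choice of
dynamics (a first-order filter pulling the hidden state toward each phone's
target) and the tiny randomly initialised MLP are assumptions for
illustration; the point is that slow hidden dynamics produce coarticulation
at phone boundaries for free.

    import numpy as np

    rng = np.random.default_rng(0)
    H, A = 3, 13            # hidden-space and acoustic-space dimensions

    # one target vector per phone in the hidden space
    targets = {"ah": rng.normal(0, 1, H), "s": rng.normal(0, 1, H)}

    # illustrative MLP mapping hidden state -> acoustic observation
    W1, b1 = rng.normal(0, 0.5, (16, H)), np.zeros(16)
    W2, b2 = rng.normal(0, 0.5, (A, 16)), np.zeros(A)
    def mlp(h):
        return W2 @ np.tanh(W1 @ h + b1) + b2

    def synthesize(phone_seq, dur=10, rate=0.3):
        # Hidden trajectory moves smoothly toward each phone's target, so
        # transitions between neighbouring phones are gradual, not abrupt.
        h = np.zeros(H)
        frames = []
        for phone in phone_seq:
            t = targets[phone]
            for _ in range(dur):
                h = h + rate * (t - h)   # first-order dynamics toward target
                frames.append(mlp(h))
        return np.array(frames)

    obs = synthesize(["ah", "s", "ah"])
    print(obs.shape)        # (30, 13): smooth trajectories across boundaries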
SP-11.8
Maximum Likelihood Estimates for Exponential Type Density Families
Sankar Basu,
Charles A Micchelli,
Peder A Olsen (IBM)
We consider a parametric family of density functions
of the type exp(-|x|^alpha) for modeling acoustic
feature vectors used in automatic recognition of
speech. The parameter "alpha" is a measure of the
impulsiveness as well as the non-Gaussian nature of
the data. While previous work has focused on
estimating the mean and the variance of the data, here
we attempt to estimate the impulsiveness "alpha" from
the data on a maximum likelihood basis. We show that
there is a balance between "alpha" and the number of
data points "N" that must be satisfied before maximum
likelihood estimation is carried out. Numerical
experiments are performed on multidimensional vectors
obtained from speech data.
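In one dimension the estimate can be sketched directly. Since the integral
of exp(-|x|^alpha) over the real line is 2*Gamma(1+1/alpha), the per-sample
negative log-likelihood is log(2*Gamma(1+1/alpha)) + mean(|x|^alpha), which
can be minimized numerically over alpha. The scale parameter is omitted for
simplicity, and this is an illustration rather than the paper's procedure.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import gammaln

    def neg_loglik(alpha, x):
        # Average NLL of f(x) = exp(-|x|^alpha) / (2*Gamma(1+1/alpha)).
        log_norm = np.log(2.0) + gammaln(1.0 + 1.0 / alpha)
        return log_norm + np.mean(np.abs(x) ** alpha)

    # toy data drawn from a Laplace density, i.e. the alpha = 1 member
    rng = np.random.default_rng(0)
    x = rng.laplace(0.0, 1.0, 5000)

    res = minimize_scalar(lambda a: neg_loglik(a, x), bounds=(0.2, 4.0),
                          method="bounded")
    print(f"estimated alpha = {res.x:.2f}")   # close to 1 for Laplace data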