Authors:
Dan Povey,
Philip C Woodland,
Page (NA) Paper number 2315
Abstract:
This paper describes the application of a discriminative HMM parameter
estimation technique called Frame Discrimination (FD), to medium and
large vocabulary continuous speech recognition. Previous work showed
that FD training gave better results than maximum mutual information
(MMI) training for small tasks. The use of FD for much larger tasks
required the development of a technique to rapidly find
the most likely set of Gaussians for each frame in the system. Experiments
on the Resource Management and North American Business tasks show that
FD training can give comparable improvements to MMI, but is less computationally
intensive.
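The abstract does not spell out the fast Gaussian-selection scheme; as a point of reference, the exhaustive computation it accelerates (score every diagonal-covariance Gaussian on a frame and keep the k most likely) can be sketched as follows, with all names illustrative:

```python
import numpy as np

def topk_gaussians(frame, means, inv_vars, log_consts, k):
    """Return indices of the k highest-likelihood diagonal-covariance
    Gaussians for one feature frame.  This is the naive exhaustive
    scoring; the paper's contribution is a faster approximate scheme,
    not shown here."""
    # log N(x; mu, Sigma) for diagonal Sigma: constant term minus
    # half the variance-weighted squared distance
    diff = frame[None, :] - means                       # (G, D)
    logp = log_consts - 0.5 * np.sum(diff * diff * inv_vars, axis=1)
    top = np.argpartition(-logp, k - 1)[:k]             # unordered top-k
    return top[np.argsort(-logp[top])]                  # sorted by likelihood
```

For G Gaussians of dimension D this costs O(GD) per frame, which is exactly the expense the paper's selection technique avoids.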
Authors:
Francoise Beaufays,
Mitchel Weintraub,
Yochai Konig,
Page (NA) Paper number 1466
Abstract:
This paper describes a new approach to acoustic modeling for large
vocabulary continuous speech recognition (LVCSR) systems. Each phone
is modeled with a large Gaussian mixture model (GMM) whose context-dependent
mixture weights are estimated with a sentence-level discriminative
training criterion. The estimation problem is cast in a neural network
framework, which enables the incorporation of the appropriate constraints
on the mixture weight vectors and allows a straightforward training
procedure, based on steepest descent. Experiments conducted on the
Callhome-English and Switchboard databases show a significant improvement
of the acoustic model performance, and a somewhat lesser improvement
with the combined acoustic and language models.
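The key mechanics described here, mixture weights kept on the probability simplex via a softmax parameterization and updated by steepest descent, can be sketched as below. The objective used is plain data log-likelihood as a stand-in for the paper's sentence-level discriminative criterion, and all names are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_mixture_weights(comp_loglik, theta, lr=0.1, steps=200):
    """Steepest-descent training of GMM mixture weights w = softmax(theta).
    comp_loglik: (N, M) log-likelihood of each frame under each (fixed)
    Gaussian component.  The softmax enforces nonnegativity and
    sum-to-one, as in the neural network formulation."""
    n = comp_loglik.shape[0]
    for _ in range(steps):
        w = softmax(theta)
        log_wp = comp_loglik + np.log(w)          # log of w_m * p_m(x_n)
        gamma = np.exp(log_wp - np.logaddexp.reduce(log_wp, axis=1,
                                                    keepdims=True))
        # gradient of the log-likelihood w.r.t. theta via softmax chain rule
        theta = theta + lr * (gamma - w).sum(axis=0) / n
    return softmax(theta)
```

The softmax reparameterization is what removes the need for explicit constrained optimization: any unconstrained step in theta still yields a valid weight vector.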
Authors:
Richard C. Rose,
Giuseppe Riccardi,
Page (NA) Paper number 1709
Abstract:
This paper investigates techniques for minimizing the impact of non-speech
sounds on the performance of large vocabulary continuous speech recognition
(LVCSR) systems. An experimental study is presented that investigates
whether the careful manual labeling of disfluency and background events
in conversational speech can be used to provide an additional level
of supervision in training HMM acoustic models and statistical language
models. First, techniques are investigated for incorporating explicitly
labeled disfluency and background events directly into the acoustic
HMM model. Second, phrase-based statistical language models are trained
from utterance transcriptions which include labeled instances of these
events. Finally, it is shown that significant word accuracy and run-time
performance improvements are obtained for both sets of techniques on
a telephone-based spoken language understanding task.
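The language-model side of this idea, keeping labeled disfluency and background tokens in the training transcriptions so the model learns their contexts, can be illustrated with a minimal bigram model (the paper uses phrase-based models; this sketch and its token names are illustrative only):

```python
from collections import Counter

def train_bigram(transcripts):
    """Bigram counts over transcripts that keep disfluency/background
    tokens (e.g. [um], [laugh]) as ordinary vocabulary items, so their
    contexts are modeled instead of being discarded as noise."""
    uni, bi = Counter(), Counter()
    for sent in transcripts:
        toks = ["<s>"] + sent.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    def prob(w, prev):      # ML estimate, no smoothing (sketch only)
        return bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
    return prob

prob = train_bigram(["i want [um] a flight", "i want a ticket"])
```

Here `prob("[um]", "want")` is estimated from data rather than treated as an out-of-vocabulary event, which is the extra level of supervision the abstract refers to.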
Authors:
Wu Chou,
Wolfgang Reichl,
Page (NA) Paper number 2481
Abstract:
In this paper, an approach of penalized Bayesian information criterion
(pBIC) for decision tree state tying is described. The pBIC is applied
to two important applications. First, it is used as a decision tree
growing criterion in place of the conventional approach of using a
heuristic constant threshold. It is found that the original BIC penalty
is too low and does not lead to a compact decision tree state tying model.
Based on Wolfe's modification to the asymptotic null distribution,
it is derived that twice the BIC penalty should be used for decision
tree state tying based on pBIC. Secondly, pBIC is studied as a model
compression criterion for decision tree state tying based acoustic
modeling. Experimental results on a large vocabulary (Wall Street Journal)
speech recognition task indicate that a compact decision tree can be
achieved with almost no loss of speech recognition performance.
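The tree-growing criterion described here amounts to accepting a node split only when the likelihood gain exceeds a penalized BIC term. A one-dimensional Gaussian sketch (illustrative, not the paper's exact formulation) with the doubled penalty factor the abstract reports:

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of data under its own ML-fit 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-8)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def accept_split(left, right, penalty_factor=2.0):
    """pBIC-style test: split only if the likelihood gain beats
    penalty_factor * (delta_params / 2) * log N.  Per the abstract,
    the standard BIC penalty (factor 1) is too low for compact
    state-tied models, hence the default factor of 2."""
    n = len(left) + len(right)
    gain = (gauss_loglik(left) + gauss_loglik(right)
            - gauss_loglik(left + right))
    delta_params = 2            # one extra (mean, variance) pair
    return gain > penalty_factor * (delta_params / 2.0) * math.log(n)
```

Growing the tree with this test replaces the heuristic constant threshold: the penalty scales with the amount of data at the node instead of being tuned by hand.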
Authors:
Jiayu Li,
Alejandro Murua,
Page (NA) Paper number 1841
Abstract:
A two-dimensional extension of Hidden Markov Models (HMM) is introduced,
aiming at improving the modeling of speech signals. The extended model
(a) focuses on the conditional joint distribution of state durations
given the length of utterances, rather than on state transition probabilities;
(b) extends the dependency of observation densities to current, as
well as neighboring states; and (c) introduces a local averaging procedure
to smooth the outcome associated with transitions from successive states.
A set of efficient iterative algorithms, based on segmental K-means
and Iterative Conditional Modes, for the implementation of the extended
model, is also presented. In applications to the recognition of segmented
digits spoken over the telephone, the extended model achieved about
23% reduction in the recognition error rate, when compared to the performance
of HMMs.
Authors:
Xiaoqiang Luo,
Frederick Jelinek,
Page (NA) Paper number 2044
Abstract:
In state-of-the-art large vocabulary continuous speech recognition (LVCSR)
systems, HMM state-tying is often used to achieve good balance between
the model resolution and robustness. In this paradigm, tied HMM states
share a single set of parameters and are non-distinguishable. To capture
the fine differences among tied HMM states, the probabilistic classification
of HMM states (PCHMM) is proposed in this paper for LVCSR. In particular,
a distribution from an HMM state to classes is introduced. It is shown
that the state-to-class distribution can be estimated together with
conventional HMM parameters within the EM framework. Compared with
HMM state-tying, probabilistic classification of HMM states makes more
efficient use of model parameters. It also makes the acoustic model
more robust against the possible mismatch or variation between training
and test data. The viability of this approach is verified by the significant
reduction of word error rate (WER) on the Switchboard task.
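The core estimation step, re-estimating each state's distribution over shared classes within EM, can be sketched as follows. The class densities are held fixed here for simplicity, whereas the paper re-estimates them jointly with the other HMM parameters; all names are illustrative:

```python
import numpy as np

def em_state_to_class(frames_per_state, class_loglik, n_classes, iters=5):
    """EM re-estimation of the state-to-class distributions
    p(class | state) of a PCHMM-style model.  class_loglik maps a
    frame x to the (C,) vector of log-likelihoods under the shared
    class densities."""
    n_states = len(frames_per_state)
    p = np.full((n_states, n_classes), 1.0 / n_classes)   # uniform start
    for _ in range(iters):
        new_p = np.zeros_like(p)
        for s, frames in enumerate(frames_per_state):
            for x in frames:
                # E-step: class posterior for this frame under state s
                log_post = np.log(np.maximum(p[s], 1e-12)) + class_loglik(x)
                new_p[s] += np.exp(log_post - np.logaddexp.reduce(log_post))
            new_p[s] /= new_p[s].sum()                    # M-step
        p = new_p
    return p
```

Because the classes are shared across states while each state keeps its own distribution over them, tied states can differ through their class weights, which is the extra resolution the abstract describes.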
Authors:
Hywel B Richards, Dragon Systems UK (U.K.)
John S Bridle, Dragon Systems UK (U.K.)
Page (NA) Paper number 1930
Abstract:
This paper introduces a new approach to acoustic-phonetic modelling,
the Hidden Dynamic Model (HDM), which explicitly accounts for the coarticulation
and transitions between neighbouring phones. Inspired by the fact that
speech is really produced by an underlying dynamic system, the HDM
consists of a single vector target per phone in a hidden dynamic space
in which speech trajectories are produced by a simple dynamic system.
The hidden space is mapped to the surface acoustic representation via
a non-linear mapping in the form of a multilayer perceptron (MLP).
Algorithms are presented for training of all the parameters (target
vectors and MLP weights) from segmented and labelled acoustic observations
alone, with no special initialisation. The model captures the dynamic
structure of speech, and appears to aid performance on a speech recognition
task based on the SwitchBoard corpus.
Authors:
Sankar Basu,
Charles A Micchelli,
Peder A Olsen,
Page (NA) Paper number 2066
Abstract:
We consider a parametric family of density functions of the type exp(-|x|^(alpha)/2)
for modeling acoustic feature vectors used in automatic recognition
of speech. The parameter "alpha" is a measure of the impulsiveness
as well as the non-Gaussian nature of the data. While previous work
has focused on estimating the mean and the variance of the data, here
we attempt to estimate the impulsiveness "alpha" from the data on a
maximum likelihood basis. We show that there is a balance between "alpha"
and the number of data points "N" that must be satisfied before maximum
likelihood estimation is carried out. Numerical experiments are performed
on multidimensional vectors obtained from speech data.