Abstract: Session SP-11
SP-11.1
Frame Discrimination Training of HMMs for Large Vocabulary Speech Recognition
Dan Povey,
Philip C Woodland (Cambridge University Engineering Dept)
This paper describes the application of a discriminative HMM parameter
estimation technique called Frame Discrimination (FD) to medium and
large vocabulary continuous speech recognition. Previous work showed
that FD training gave better results than maximum mutual information
(MMI) training for small tasks. The use of FD for much larger tasks
required the development of a technique for rapidly finding the
most likely set of Gaussians in the system for each frame.
Experiments on the Resource Management and North American Business tasks
show that FD training can give improvements comparable to MMI while being
less computationally intensive.
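To make the shortlisting idea concrete, here is a minimal sketch of one
standard way to find the most likely Gaussians for a frame quickly: cluster
the Gaussian means with k-means, then score a frame only against Gaussians
belonging to its nearest clusters. This is an illustration under assumed
details, not the technique developed in the paper; all function names and
parameter values are hypothetical.

    import numpy as np

    def build_clusters(means, num_clusters=64, iters=10, seed=0):
        # Plain k-means over the Gaussian mean vectors (illustrative only).
        rng = np.random.default_rng(seed)
        centers = means[rng.choice(len(means), num_clusters, replace=False)].copy()
        for _ in range(iters):
            d = ((means[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(1)
            for c in range(num_clusters):
                if (assign == c).any():
                    centers[c] = means[assign == c].mean(0)
        return centers, assign

    def shortlist_gaussians(frame, centers, assign, num_nearest=4):
        # Score the frame against cluster centers only, then return the
        # indices of all Gaussians living in the closest few clusters.
        d = ((centers - frame) ** 2).sum(-1)
        near = np.argsort(d)[:num_nearest]
        return np.flatnonzero(np.isin(assign, near))

    # toy usage: 2000 Gaussian means in 39 dimensions
    means = np.random.randn(2000, 39)
    centers, assign = build_clusters(means)
    frame = np.random.randn(39)
    idx = shortlist_gaussians(frame, centers, assign)
    print(f"scoring {len(idx)} of {len(means)} Gaussians for this frame")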
SP-11.2
Discriminative Mixture Weight Estimation for Large Gaussian Mixture Models
Francoise Beaufays (Speech Technology and Research Laboratory, SRI International, Menlo Park, CA.),
Mitchel Weintraub,
Yochai Konig
This paper describes a new approach to acoustic modeling for large
vocabulary continuous speech recognition (LVCSR) systems. Each phone
is modeled with a large Gaussian mixture model (GMM) whose
context-dependent mixture weights are estimated with a sentence-level
discriminative training criterion. The estimation problem is cast in
a neural network framework, which enables the incorporation of the
appropriate constraints on the mixture weight vectors and allows
a straightforward training procedure based on steepest descent.
Experiments conducted on the Callhome-English and Switchboard
databases show a significant improvement in acoustic model
performance, and a somewhat smaller improvement with the combined
acoustic and language models.
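As a sketch of how casting the problem in a neural network framework handles
the constraints, consider the following: a softmax parameterization keeps the
mixture weights positive and summing to one, so plain steepest ascent can be
run on unconstrained parameters. The objective here is an ordinary
log-likelihood used purely for illustration, not the sentence-level
discriminative criterion of the paper, and all names and sizes are made up.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    # toy setup: one GMM with K components, with precomputed per-frame
    # component likelihoods lik[t, k] = N(x_t; mu_k, Sigma_k)
    rng = np.random.default_rng(0)
    K, T = 8, 50
    lik = rng.random((T, K)) + 1e-3

    logits = np.zeros(K)           # unconstrained parameters
    lr = 0.1
    for step in range(500):
        w = softmax(logits)        # valid mixture weights at every step
        p = lik @ w                # per-frame mixture likelihoods
        # gradient of the total log-likelihood w.r.t. the logits,
        # via the softmax Jacobian: dL/dz_k = sum_t w_k (lik[t,k]/p_t - 1)
        g = (w * (lik / p[:, None] - 1.0)).sum(0)
        logits += lr * g           # steepest ascent on the objective
    print("learned weights:", np.round(softmax(logits), 3))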
SP-11.3
Modeling disfluency and background events in ASR for a natural language understanding task
Richard C. Rose,
Giuseppe Riccardi (AT&T Labs - Research, 180 Park Ave., Florham Park, NJ 07932)
This paper investigates techniques for minimizing the impact of
non-speech sounds on the performance of
large vocabulary continuous speech recognition (LVCSR) systems.
An experimental study is presented that examines whether
the careful manual labeling of disfluency and background events in conversational
speech can be used to provide an additional level of supervision in
training HMM acoustic models and statistical language models.
First, techniques are investigated for incorporating explicitly labeled
disfluency and background events directly into the acoustic HMM.
Second, phrase-based statistical language models are trained from
utterance transcriptions which include labeled instances of these events.
Finally, it is shown that significant word accuracy and run-time
performance improvements are obtained for both sets of techniques
on a telephone-based spoken language understanding task.
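On the language modeling side, a small sketch of the idea under an
assumption the abstract only implies: if disfluency and background events
appear as explicit tokens in the training transcriptions, they can be
counted as ordinary vocabulary items when estimating an n-gram model, so
the model learns how events predict surrounding words. The transcriptions
and event labels below are invented for illustration.

    from collections import Counter

    # hypothetical transcriptions with labeled disfluency/background events
    transcripts = [
        "i want [um] a flight to boston [breath]",
        "[laughter] show me [um] the fares",
        "a flight [breath] to boston",
    ]

    bigrams, unigrams = Counter(), Counter()
    for line in transcripts:
        words = ["<s>"] + line.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    def bigram_prob(w1, w2):
        # Maximum likelihood bigram estimate; events are ordinary tokens.
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

    print(bigram_prob("[um]", "a"))          # events predict following words
    print(bigram_prob("flight", "[breath]"))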
SP-11.4
Decision Tree State Tying Based on Penalized Bayesian Information Criterion
Wu Chou,
Wolfgang Reichl (Bell Labs., Lucent Technologies)
In this paper, an approach to decision tree state tying based on a penalized
Bayesian information criterion (pBIC) is described. The pBIC is applied in two
important applications. First, it is used as a decision tree growing criterion in place of
the conventional approach of using a heuristic constant threshold. It is found
that the original BIC penalty is too low and does not lead to a compact decision tree
state tying model. Based on Wolfe's modification to the asymptotic null distribution,
it is derived that twice the BIC penalty should be used for decision tree state tying
based on pBIC. Second, pBIC is studied as a model compression criterion for decision
tree state tying based acoustic modeling. Experimental results on a large vocabulary
(Wall Street Journal) speech recognition task indicate that a compact decision tree
can be achieved with almost no loss in speech recognition performance.
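The growing criterion can be sketched as follows: a node is split only when
the training likelihood gain exceeds a penalty, and per the abstract the
penalty is twice the ordinary BIC term of (delta-parameters/2)*log N. The
single-Gaussian modeling and all numbers here are illustrative assumptions,
not the paper's exact formulation.

    import numpy as np

    def gauss_loglik(x):
        # Log-likelihood of data under one ML-fitted diagonal Gaussian.
        n, d = x.shape
        var = x.var(0) + 1e-8
        return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

    def should_split(x_left, x_right, penalty_factor=2.0):
        # pBIC-style test: split if the likelihood gain beats the penalty.
        # penalty_factor=1.0 would be ordinary BIC; the abstract argues for 2.0.
        x_all = np.vstack([x_left, x_right])
        gain = gauss_loglik(x_left) + gauss_loglik(x_right) - gauss_loglik(x_all)
        d = x_all.shape[1]
        delta_params = 2 * d          # one extra mean + variance vector
        penalty = penalty_factor * 0.5 * delta_params * np.log(len(x_all))
        return gain > penalty

    rng = np.random.default_rng(1)
    a = rng.normal(0.0, 1.0, (500, 13))
    b = rng.normal(2.0, 1.0, (500, 13))   # clearly separated data
    print(should_split(a, b))             # True: split is worth the parameters
    print(should_split(a, rng.normal(0.0, 1.0, (500, 13))))  # likely False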
SP-11.5
A 2D Extended HMM for Speech Recognition
Jiayu Li (Department of Statistics, The University of Chicago),
Alejandro Murua (Department of Statistics, University of Washington)
A two-dimensional extension of Hidden Markov Models (HMMs) is introduced,
aiming at improving the modeling of speech signals. The extended model
(a) focuses on the conditional joint distribution of state durations given
the length of utterances, rather than on state transition probabilities;
(b) extends the dependency of observation densities to current, as well as
neighboring, states; and (c) introduces a local averaging procedure to
smooth the outcome associated with transitions between successive states.
A set of efficient iterative algorithms for implementing the extended
model, based on segmental K-means and Iterated Conditional Modes, is also
presented. In applications to the recognition of segmented digits spoken
over the telephone, the extended model achieved about a 23% reduction in
the recognition error rate, when compared to the performance of HMMs.
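To make the segmental K-means ingredient concrete, here is a stripped-down
sketch that alternates between a dynamic programming alignment of frames to
a left-to-right state sequence and re-estimation of the state means. The
single-template, squared-distance setting is an assumption for illustration,
not the model of the paper.

    import numpy as np

    def best_segmentation(x, means):
        # DP over monotone alignments of frames to left-to-right states.
        T, S = len(x), len(means)
        cost = ((x[:, None, :] - means[None, :, :]) ** 2).sum(-1)  # (T, S)
        D = np.full((T, S), np.inf)
        D[0, 0] = cost[0, 0]
        for t in range(1, T):
            for s in range(S):
                stay = D[t - 1, s]
                move = D[t - 1, s - 1] if s > 0 else np.inf
                D[t, s] = cost[t, s] + min(stay, move)
        seg = np.empty(T, dtype=int)          # backtrack the best path
        seg[-1] = S - 1
        for t in range(T - 2, -1, -1):
            s = seg[t + 1]
            seg[t] = s if s == 0 or D[t, s] <= D[t, s - 1] else s - 1
        return seg

    def segmental_kmeans(x, num_states=3, iters=5):
        means = x[np.linspace(0, len(x) - 1, num_states).astype(int)].copy()
        for _ in range(iters):
            seg = best_segmentation(x, means)
            for s in range(num_states):
                if (seg == s).any():
                    means[s] = x[seg == s].mean(0)
        return means, seg

    x = np.concatenate([np.full((10, 2), 0.0), np.full((12, 2), 3.0),
                        np.full((8, 2), -2.0)]) + 0.1 * np.random.randn(30, 2)
    means, seg = segmental_kmeans(x)
    print(seg)   # frames grouped into three contiguous state segments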
SP-11.6
Probabilistic Classification of HMM States for
Large Vocabulary Continuous Speech Recognition
Xiaoqiang Luo,
Frederick Jelinek (CLSP, The Johns Hopkins University)
In state-of-the-art large vocabulary continuous speech
recognition (LVCSR) systems, HMM state tying is
often used to achieve a good balance between model
resolution and robustness. In this paradigm, tied
HMM states share a single set of parameters and are
indistinguishable. To capture the fine differences
among tied HMM states, the probabilistic classification
of HMM states (PCHMM) is proposed in this paper for
LVCSR. In particular, a distribution from an HMM state
to classes is introduced. It is shown that the
state-to-class distribution can be estimated together
with the conventional HMM parameters within the EM
framework. Compared with HMM state tying, probabilistic
classification of HMM states makes more efficient use
of model parameters. It also makes the acoustic model
more robust against possible mismatch or variation
between training and test data. The viability of this
approach is verified by a significant reduction in
word error rate (WER) on the Switchboard task.
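A minimal sketch of the sharing scheme, under assumptions the abstract does
not spell out (fixed class-conditional densities, a fixed state alignment,
EM updates for the state-to-class distribution only): each state's emission
likelihood becomes a state-specific mixture of shared class densities,
p(x|s) = sum_c P(c|s) p(x|c).

    import numpy as np

    rng = np.random.default_rng(0)
    S, C, T, D = 4, 3, 200, 2          # states, shared classes, frames, dims

    # fixed shared class densities (spherical Gaussians, for illustration)
    class_means = rng.normal(0, 3, (C, D))
    def class_lik(x):                   # lik[t, c] = N(x_t; mu_c, I)
        sq = ((x[:, None, :] - class_means[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq) / (2 * np.pi) ** (D / 2)

    P = np.full((S, C), 1.0 / C)        # state-to-class distribution P(c|s)
    x = rng.normal(0, 3, (T, D))
    state = rng.integers(0, S, T)       # assumed fixed state alignment

    for _ in range(20):                 # EM updates for P(c|s) only
        lik = class_lik(x)              # (T, C)
        post = P[state] * lik           # E-step: class posteriors per frame
        post /= post.sum(1, keepdims=True)
        for s in range(S):              # M-step: average class posteriors
            if (state == s).any():
                P[s] = post[state == s].mean(0)
    print(np.round(P, 3))               # rows sum to one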
SP-11.7
The HDM: A Segmental Hidden Dynamic Model of Coarticulation
Hywel B Richards,
John S Bridle (Dragon Systems UK)
This paper introduces a new approach to acoustic-phonetic modelling,
the Hidden Dynamic Model (HDM), which explicitly accounts for the
coarticulation and transitions between neighbouring phones.
Inspired by the fact that speech is produced by an underlying
dynamic system, the HDM consists of a single target vector per phone
in a hidden dynamic space in which speech trajectories are produced by
a simple dynamic system.
The hidden space is mapped to the surface acoustic representation via
a non-linear mapping in the form of a multilayer perceptron (MLP).
Algorithms are presented for training all the parameters
(target vectors and MLP weights) from segmented and labelled acoustic
observations alone, with no special initialisation.
The model captures the dynamic structure of speech, and
appears to aid a speech recognition task based on the Switchboard corpus.
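The generative structure can be sketched as follows. The specific choice of
dynamics (a first-order filter pulling the hidden state toward each phone's
target) and the tiny randomly initialised MLP are assumptions for
illustration; the point is that slow hidden dynamics produce coarticulation
at phone boundaries for free.

    import numpy as np

    rng = np.random.default_rng(0)
    H, A = 3, 13            # hidden-space and acoustic-space dimensions

    # one target vector per phone in the hidden space
    targets = {"ah": rng.normal(0, 1, H), "s": rng.normal(0, 1, H)}

    # illustrative MLP mapping hidden state -> acoustic observation
    W1, b1 = rng.normal(0, 0.5, (16, H)), np.zeros(16)
    W2, b2 = rng.normal(0, 0.5, (A, 16)), np.zeros(A)
    def mlp(h):
        return W2 @ np.tanh(W1 @ h + b1) + b2

    def synthesize(phone_seq, dur=10, rate=0.3):
        # Hidden trajectory moves smoothly toward each phone's target, so
        # transitions between neighbouring phones are gradual, not abrupt.
        h = np.zeros(H)
        frames = []
        for phone in phone_seq:
            t = targets[phone]
            for _ in range(dur):
                h = h + rate * (t - h)   # first-order dynamics toward target
                frames.append(mlp(h))
        return np.array(frames)

    obs = synthesize(["ah", "s", "ah"])
    print(obs.shape)        # (30, 13): smooth trajectories across boundaries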
SP-11.8
Maximum Likelihood Estimates for Exponential Type Density Families
Sankar Basu,
Charles A Micchelli,
Peder A Olsen (IBM)
We consider a parametric family of density functions
of the type exp(-|x|^alpha) for modeling acoustic
feature vectors used in automatic recognition of
speech. The parameter "alpha" is a measure of the
impulsiveness as well as the non-Gaussian nature of
the data. While previous work has focused on
estimating the mean and the variance of the data, here
we attempt to estimate the impulsiveness "alpha" from
the data on a maximum likelihood basis. We show that
there is a balance between "alpha" and the number of
data points "N" that must be satisfied before maximum
likelihood estimation is carried out. Numerical
experiments are performed on multidimensional vectors
obtained from speech data.
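In one dimension the estimate can be sketched directly. Since the integral
of exp(-|x|^alpha) over the real line is 2*Gamma(1+1/alpha), the per-sample
negative log-likelihood is log(2*Gamma(1+1/alpha)) + mean(|x|^alpha), which
can be minimized numerically over alpha. The scale parameter is omitted for
simplicity, and this is an illustration rather than the paper's procedure.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import gammaln

    def neg_loglik(alpha, x):
        # Average NLL of f(x) = exp(-|x|^alpha) / (2*Gamma(1+1/alpha)).
        log_norm = np.log(2.0) + gammaln(1.0 + 1.0 / alpha)
        return log_norm + np.mean(np.abs(x) ** alpha)

    # toy data drawn from a Laplace density, i.e. the alpha = 1 member
    rng = np.random.default_rng(0)
    x = rng.laplace(0.0, 1.0, 5000)

    res = minimize_scalar(lambda a: neg_loglik(a, x), bounds=(0.2, 4.0),
                          method="bounded")
    print(f"estimated alpha = {res.x:.2f}")   # close to 1 for Laplace data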