Abstract: Session SP-4
SP-4.1
Using a large vocabulary continuous speech recognizer for a constrained domain with limited training
Man-hung Siu,
Michael Jonas,
Herbert Gish (BBN Technologies/GTE Internetworking)
How to train a speech recognizer with a limited amount of training data
is of interest to many researchers. In this paper, we describe how we
use BBN's Byblos large vocabulary continuous speech recognition
(LVCSR) system for the military air-traffic-control domain, where we
have less than an hour of training data. We investigate three ways to
deal with the limited training data: 1) re-configuring the LVCSR system
to use fewer parameters, 2) incorporating out-of-domain data, and 3)
using pragmatic information, such as speaker identity and controller
function, to improve recognition performance. We compare the LVCSR
performance to that of a tied-mixture recognizer designed for
limited-vocabulary tasks. We show that the reconfigured LVCSR system
outperforms the tied-mixture system by 10% in absolute word error
rate. When enough data is available per speaker, vocal tract length
normalization and supervised adaptation can further improve
performance by 6%, even in this limited-training domain. We
also show that the use of out-of-domain data and pragmatic
information, if available, can each further improve performance by
1-3%.
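The abstract does not spell out how the out-of-domain data is incorporated; one common technique (an assumption here, not necessarily BBN's method) is linear interpolation of in-domain and out-of-domain language-model estimates, with the weight favoring the scarce in-domain data. A minimal unigram sketch with hypothetical toy distributions:

```python
def interpolate_lm(p_in, p_out, lam=0.8):
    """Interpolated probability P(w) = lam*P_in(w) + (1-lam)*P_out(w).
    p_in / p_out map words to probabilities; lam weights the scarce
    in-domain estimates against the plentiful out-of-domain ones."""
    vocab = set(p_in) | set(p_out)
    return {w: lam * p_in.get(w, 0.0) + (1 - lam) * p_out.get(w, 0.0)
            for w in vocab}

# hypothetical toy distributions
p_in = {"roger": 0.6, "contact": 0.4}    # in-domain (ATC-like)
p_out = {"contact": 0.5, "hello": 0.5}   # out-of-domain
p = interpolate_lm(p_in, p_out, lam=0.8)  # p["contact"] = 0.42
```

Because both inputs are proper distributions over the merged vocabulary, the interpolated result still sums to one.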
SP-4.2
Initial Evaluation of Hidden Dynamic Models on Conversational Speech
Joseph Picone (Institute for Signal and Information Processing, MS State),
Sandi Pike (Brown University),
Roland Reagan (Carnegie Mellon University),
Terri Kamm (Department of Defense),
John Bridle (Dragon Systems U.K.),
Li Deng,
Jeff Ma (University of Waterloo),
Hywel Richards (Dragon Systems U.K.),
Mike Schuster (ATR)
Conversational speech recognition is a challenging
problem primarily because speakers rarely fully
articulate sounds. A successful speech recognition
approach must infer intended spectral targets
from the speech data, or develop a method of
dealing with large variances in the data.
Hidden Dynamic Models (HDMs) attempt to automatically
learn such targets in a hidden feature space using
models that integrate linguistic information with
constrained temporal trajectory models. HDMs are
a radical departure from conventional hidden Markov
models (HMMs), which simply account for variation
in the observed data. In this paper, we present an
initial evaluation of such models on a conversational
speech recognition task involving a subset of the
SWITCHBOARD corpus. We show that in an N-Best
rescoring paradigm, HDMs are capable of delivering
performance competitive with HMMs.
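The N-best rescoring paradigm mentioned above can be sketched generically: a second-pass model scores the first-pass hypotheses, and the two scores are combined log-linearly. The weight and score values below are illustrative assumptions, not the paper's numbers:

```python
def rescore_nbest(nbest, second_pass_scores, alpha=0.5):
    """Pick the best hypothesis from an N-best list after log-linearly
    combining the first-pass (e.g. HMM) log-probability with a
    second-pass (e.g. HDM) log-probability.
    nbest: list of (hypothesis, first_pass_logprob)
    second_pass_scores: dict hypothesis -> second_pass_logprob"""
    combined = [(hyp, (1 - alpha) * s + alpha * second_pass_scores[hyp])
                for hyp, s in nbest]
    return max(combined, key=lambda x: x[1])[0]

nbest = [("the cat sat", -120.0), ("the cat sad", -118.0)]
hdm = {"the cat sat": -50.0, "the cat sad": -60.0}
best = rescore_nbest(nbest, hdm, alpha=0.5)  # "the cat sat"
```

Here the second pass overturns the first-pass ranking: the HDM strongly prefers the first hypothesis, outweighing its slightly worse HMM score.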
SP-4.3
Convolutional Density Estimation in Hidden Markov Models for Speech Recognition
Spyros Matsoukas,
George Zavaliagkos (BBN Technologies, GTE Internetworking)
In continuous density Hidden Markov Models (HMMs) for speech recognition, the probability density function (pdf) of each state is usually expressed as a mixture of Gaussians. In this paper, we present a model in which the pdf is expressed as the convolution of two densities. We focus on the special case where one of the convolved densities is an M-Gaussian mixture and the other is a mixture of N impulses. We present the reestimation formulae for the parameters of the MxN convolutional model and suggest two ways of initializing them: a residual K-Means approach, and deconvolution from a standard HMM with MN Gaussians per state, using a genetic algorithm to search for the optimal assignment of Gaussians. Both methods result in a compact representation that requires only O(M + N) storage space for the model parameters and O(MN) time for training and decoding. We explain how the decoding time can be reduced to O(M + kN), where k < M. Finally, results are shown on the 1996 Hub-4 Development test, demonstrating that a 32x2 convolutional model can achieve performance comparable to that of a standard model with 64 Gaussians per state.
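The convolutional construction can be illustrated directly: convolving an M-Gaussian mixture with an N-impulse mixture yields an MN-component Gaussian mixture whose component means are the pairwise sums of Gaussian means and impulse offsets, while only O(M + N) parameters are stored. A minimal sketch of evaluating such a density (parameter values are illustrative):

```python
import math

def conv_density(x, gauss, impulses):
    """Evaluate an MxN convolutional density at scalar x.
    gauss:    list of (weight c, mean mu, variance v)  -- M Gaussians
    impulses: list of (weight d, offset delta)         -- N impulses
    Convolving N(mu, v) with an impulse at delta shifts the Gaussian,
    so the result is the MN-mixture sum_{m,n} c_m d_n N(x; mu_m+delta_n, v_m)."""
    p = 0.0
    for c, mu, v in gauss:
        for d, delta in impulses:
            z = x - (mu + delta)
            p += c * d * math.exp(-0.5 * z * z / v) / math.sqrt(2 * math.pi * v)
    return p
```

With a single unit-weight standard Gaussian and a single impulse at zero, this reduces to the ordinary Gaussian pdf, which is a quick sanity check.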
SP-4.4
AUTOMATIC CLUSTERING AND GENERATION OF CONTEXTUAL QUESTIONS FOR TIED STATES IN HIDDEN MARKOV MODELS
Rita Singh,
Bhiksha Raj,
Richard M Stern (Carnegie Mellon University)
Most current automatic speech recognition systems
based on HMMs cluster or tie together subsets of
the subword units with which speech is represented.
This tying improves recognition accuracy when systems
are trained with limited data, and is performed by
classifying the sub-phonetic units using a series of
binary tests based on speech production, called
"linguistic questions". This paper describes a new
method for automatically determining the best
combinations of subword units to form these questions.
The proposed hybrid algorithm clusters the state
distributions of context-independent phones to
obtain questions for triphonetic contexts.
Experiments confirm that the questions thus generated
can replace manually generated questions and can
provide improved recognition accuracy.
Automatic generation of questions has the additional
important advantage of extensibility to languages for
which the phonetic structure is not well understood
by the system designer, and can be effectively used
in situations where the subword units are not
phonetically motivated.
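The paper's exact hybrid algorithm is not reproduced here; a generic bottom-up sketch of the idea — cluster context-independent phone models and emit every intermediate cluster as a candidate question — might look as follows. One Gaussian per phone and a symmetric KL distance are both simplifying assumptions:

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL(N(m1,v1) || N(m2,v2)) for 1-D Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def sym_kl(a, b):
    return kl_gauss(*a, *b) + kl_gauss(*b, *a)

def cluster_questions(models):
    """Agglomeratively cluster context-independent phone models
    (phone -> (mean, var), one Gaussian each for simplicity).
    Every cluster produced along the way becomes a candidate
    'linguistic question': a set of phones a tree may ask about."""
    clusters = [({p}, (m, v)) for p, (m, v) in models.items()]
    questions = [frozenset(c[0]) for c in clusters]
    while len(clusters) > 1:
        # merge the closest pair under symmetric KL
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: sym_kl(clusters[ij[0]][1],
                                         clusters[ij[1]][1]))
        (si, gi), (sj, gj) = clusters[i], clusters[j]
        # crude merged Gaussian: unweighted average of the two
        merged = (si | sj, ((gi[0] + gj[0]) / 2, (gi[1] + gj[1]) / 2))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        questions.append(frozenset(merged[0]))
    return questions
```

On three toy phones where "p" and "b" have nearly identical models, the first question generated beyond the singletons is the set {p, b}, mirroring how hand-written question sets group acoustically similar phones.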
SP-4.5
Partly Hidden Markov Model and its Application to Speech Recognition
Tetsunori Kobayashi,
Junko Furuyama,
Ken Masumitsu (Waseda University)
A new pattern matching method, the Partly Hidden Markov
Model, is proposed and applied to speech recognition.
The Hidden Markov Model, which is widely used for speech
recognition, can deal only with piecewise stationary
stochastic processes. We address this problem by
introducing a modified second-order Markov model,
in which the first state is hidden and the second
one is observable. In this model, not only the feature
parameter observations but also the state transitions
depend on the previous feature observation.
Therefore, even complicated transients can be
modeled precisely.
Simulation experiments showed the high
potential of the proposed model.
In word recognition tests, the error
rate was reduced by 39% compared with a normal HMM.
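A discrete-symbol sketch of the idea — a forward pass in which both the state transition and the emission are conditioned on the previous, observable symbol. The exact parameterization in the paper differs, so this is only an illustration:

```python
def phmm_forward(obs, pi, b0, A, B):
    """Forward likelihood for a toy 'partly hidden' Markov model over
    discrete symbols, where transition and emission depend on the
    previous observation:
        P(s_t | s_{t-1}, o_{t-1}) = A[o_{t-1}][s_{t-1}][s_t]
        P(o_t | s_t, o_{t-1})     = B[o_{t-1}][s_t][o_t]
    pi[s] is the initial state prior, b0[s][o] the initial emission."""
    S = len(pi)
    alpha = [pi[s] * b0[s][obs[0]] for s in range(S)]
    for t in range(1, len(obs)):
        prev = obs[t - 1]
        alpha = [sum(alpha[s] * A[prev][s][sp] for s in range(S))
                 * B[prev][sp][obs[t]] for sp in range(S)]
    return sum(alpha)
```

In a degenerate single-state model with uniform emissions the likelihood collapses to 0.5 per symbol, which makes the recursion easy to verify by hand.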
SP-4.6
HIDDEN MARKOV MODELS WITH DIVERGENCE BASED VECTOR QUANTIZED VARIANCES
Jae H Kim,
Raziel Haimi-Cohen,
Frank K Soong (Lucent Technologies)
This paper describes a method to significantly reduce the complexity of continuous density HMMs with only a small degradation in performance. The proposed method is noise-robust and may even outperform the standard algorithm when training and testing noise conditions are not matched. The method is based on approximating the variance vectors of the Gaussian kernels by a vector quantization (VQ) codebook of small size. The quantization of the variance vectors is done using an information-theoretic distortion measure. Closed-form expressions are given for the computation of the VQ codebook, and the superiority of the proposed distortion measure over the Euclidean distance is demonstrated. The effectiveness of the proposed method is shown using the connected TI digits database and a noisy version of it. For this database, the proposed method shows that by quantizing the variances to 16 levels we can keep recognition performance within 1% degradation of the original system. In comparison, with Euclidean distortion a codebook of size 256 is needed for a similar error rate.
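As a rough illustration (the paper's exact distortion measure and closed-form reestimation are not reproduced here), variance vectors can be quantized with a Lloyd-style VQ under an assumed KL-type distortion between diagonal Gaussians of equal means; for this particular distortion the optimal cell centroid works out to the per-dimension arithmetic mean:

```python
import math

def kl_distortion(v, u):
    """KL-type distortion between a data variance vector v and a
    codeword u: sum(v/u - log(v/u) - 1), zero iff v == u.
    An assumed stand-in for the paper's information-theoretic measure."""
    return sum(vi / ui - math.log(vi / ui) - 1.0 for vi, ui in zip(v, u))

def vq_variances(data, codebook, iters=10):
    """Lloyd iteration: nearest-codeword assignment under the KL-type
    distortion, then centroid update (arithmetic mean of each cell,
    which minimizes this distortion over the cell)."""
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in data:
            k = min(range(len(codebook)),
                    key=lambda j: kl_distortion(v, codebook[j]))
            cells[k].append(v)
        codebook = [[sum(v[i] for v in cell) / len(cell)
                     for i in range(len(cell[0]))] if cell else cw
                    for cell, cw in zip(cells, codebook)]
    return codebook
```

Unlike Euclidean distance, this distortion penalizes underestimating a variance more than overestimating it by the same ratio's reciprocal, which is one intuition for why an information-theoretic measure needs fewer codewords.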
SP-4.7
HMM Training Based on Quality Measurement
Yuqing Gao,
Ea-Ee Jan,
Mukund Padmanabhan,
Michael Picheny (IBM T.J. Watson Research Center)
Two discriminant measures for HMM states to improve the effectiveness
of HMM training are presented in this paper. In HMM-based speech
recognition, the context-dependent states are usually modeled by Gaussian
mixture distributions. In general, the number of Gaussian mixtures for
each state is fixed or proportional to the amount of training data.
Our study shows that some states are ``non-aggressive'' compared to others
and require a higher acoustic resolution. Two methods are
presented in this paper to identify those non-aggressive states.
The first approach uses the recognition accuracy of the states; the
second is based on a rank distribution of the states. Baseline
systems, trained with a fixed number of Gaussian mixtures per state,
with 33K and 120K Gaussians, yield 14.57% and 13.04% word error
rates, respectively. Using our approaches, a 38K-Gaussian system was
constructed that reduces the error rate to 13.95%. The average ranks of
non-aggressive states in the rank lists of the test data also improve
dramatically compared to the baseline systems.
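The rank-based criterion might be sketched as follows; the scoring and threshold details are illustrative assumptions, not the paper's exact recipe:

```python
def nonaggressive_states(frame_scores, labels, rank_threshold=1.2):
    """frame_scores: per-frame dict state -> log-likelihood;
    labels: the correct state for each frame. A state's rank on a
    frame is the position of its likelihood among all states
    (1 = best). States whose average rank over their own frames
    exceeds rank_threshold are flagged as 'non-aggressive' and
    would be given additional Gaussians."""
    ranks = {}
    for scores, correct in zip(frame_scores, labels):
        order = sorted(scores, key=scores.get, reverse=True)
        ranks.setdefault(correct, []).append(order.index(correct) + 1)
    return {s for s, r in ranks.items()
            if sum(r) / len(r) > rank_threshold}
```

A state that frequently loses to other states on its own frames accumulates a poor average rank and is selected for a larger mixture, matching the abstract's notion of needing higher acoustic resolution.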
SP-4.8
Prosodic Word Boundary Detection Using Statistical Modeling of Moraic Fundamental Frequency Contours and Its Use for Continuous Speech Recognition
Koji Iwano,
Keikichi Hirose (University of Tokyo)
A new method for prosodic word boundary detection in continuous speech
was developed, based on the statistical modeling of moraic transitions
of fundamental frequency (F0) contours formerly proposed by the authors.
In the developed method, the F0 contours of prosodic words are modeled
separately according to accent type. An input utterance is matched
against the models and divided into its constituent prosodic words,
yielding the prosodic word boundaries. The method was first applied to
boundary detection experiments on the ATR continuous speech corpus.
With the mora boundary locations given in the corpus, the total
detection rate reached 91.5%. The method was then integrated into a
continuous speech recognition scheme with unlimited vocabulary, and an
improvement of a few percentage points was observed in mora recognition
for the above corpus. Although all the experiments were done in closed
conditions due to corpus availability, the results indicate the
usefulness of the proposed method.