Session: SPEECH-P12
Time: 9:30 - 11:30, Friday, May 11, 2001
Location: Exhibit Hall Area 7
Title: Acoustic Modeling 2
Chair: Qiang Huo

9:30, SPEECH-P12.1
SPEECH RECOGNITION FOR DARPA COMMUNICATOR
A. AARON, S. CHEN, P. COHEN, S. DHARANIPRAGADA, E. EIDE, M. FRANZ, J. LE ROUX, X. LUO, B. MAISON, L. MANGU, T. MATHES, M. NOVAK, P. OLSEN, M. PICHENY, H. PRINTZ, B. RAMABHADRAN, A. SAKRAJDA, G. SAON, B. TYDLITAT, K. VISWESWARIAH, D. YUK
We report the results of investigations in acoustic modeling, language modeling, and decoding techniques for DARPA Communicator, a speaker-independent, telephone-based dialog system. By a combination of methods, including enlarging the acoustic model, augmenting the recognizer vocabulary, conditioning the language model upon dialog state, and applying a post-processing decoding method, we lowered the overall word error rate from 21.9% to 15.0%, a gain of 6.9% absolute and 31.5% relative.
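
The quoted gains can be verified directly: 21.9 - 15.0 = 6.9 points absolute, and 6.9 / 21.9 is about 31.5% relative. A minimal check in Python (illustrative only, not from the paper):

    # Hypothetical check of the absolute and relative WER reductions quoted above.
    baseline_wer = 21.9   # % word error rate before the improvements
    final_wer = 15.0      # % word error rate with all methods combined
    absolute_gain = baseline_wer - final_wer                # 6.9 points
    relative_gain = 100.0 * absolute_gain / baseline_wer    # ~31.5%
    print(f"absolute: {absolute_gain:.1f} pts, relative: {relative_gain:.1f}%")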

9:30, SPEECH-P12.2
EFFICIENT MIXTURE GAUSSIAN SYNTHESIS FOR DECISION TREE BASED STATE TYING
T. KATO, S. KUROIWA, T. SHIMIZU, N. HIGUCHI
We propose an efficient mixture Gaussian synthesis method for decision tree based state tying that produces better context-dependent models in a short training time. The method makes it possible to handle mixture Gaussian HMMs in the decision tree based state tying algorithm, and provides higher recognition performance than the conventional HMM training procedure, which applies decision tree based state tying to single Gaussian HMMs. It also shortens the HMM training procedure because the mixture incrementing process is not necessary. We applied this method to the training of telephone-speech triphones and evaluated its effect on Japanese phonetically balanced sentence tasks. Our method achieved a 1 to 2 point improvement in phoneme accuracy and a 67% reduction in training time.
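
The abstract does not spell out the synthesis step; one plausible reading, sketched below under that assumption (names and data layout are illustrative, not the authors' implementation), is that the tied state's mixture is assembled from the single Gaussians of the context-dependent states pooled into one tree leaf, weighted by their occupation counts:

    # Hedged sketch of mixture Gaussian synthesis for a tied state.
    def synthesize_mixture(leaf_states):
        """leaf_states: list of (occupancy, mean, variance) tuples, one per
        context-dependent state clustered into the same decision-tree leaf."""
        total = sum(occ for occ, _, _ in leaf_states)
        # Each pooled single Gaussian becomes one mixture component, weighted
        # by how often its original state was occupied in training.
        return [(occ / total, mean, var) for occ, mean, var in leaf_states]

Because the mixture is obtained directly from already-trained single Gaussians, no separate mixture-incrementing pass is needed, which is where the training-time saving comes from.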

9:30, SPEECH-P12.3
DISCRIMINATIVE TRAINING OF HMM USING MAXIMUM NORMALIZED LIKELIHOOD ALGORITHM
K. MARKOV, S. NAKAGAWA, S. NAKAMURA
In this paper, we present the Maximum Normalized Likelihood Estimation (MNLE) algorithm and its application to discriminative training of HMMs for continuous speech recognition. The objective of this algorithm is to maximize the normalized frame likelihood of the training data. In contrast to other discriminative algorithms such as Minimum Classification Error/Generalized Probabilistic Descent (MCE/GPD) and Maximum Mutual Information (MMI), the objective function is optimized using a modified Expectation-Maximization (EM) algorithm, which greatly simplifies and speeds up the training procedure. Evaluation experiments showed better recognition rates compared to both the Maximum Likelihood (ML) training method and the MCE/GPD discriminative method. In addition, the MNLE algorithm showed better generalization abilities and was faster than MCE/GPD.
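
The objective is defined only verbally in the abstract; a hedged reading of "normalized frame likelihood" (notation here is illustrative, not necessarily the authors') is

    F(\Lambda) = \sum_{t} \log \frac{p(x_t \mid \lambda_{c(t)})}{\sum_{j} p(x_t \mid \lambda_j)}

where x_t is the t-th training frame, \lambda_{c(t)} is the model labelled as correct for that frame, and the denominator sums over all competing models. The paper maximizes such a criterion with a modified EM procedure rather than gradient descent, which is where the reported simplification and speed-up over MCE/GPD come from.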

9:30, SPEECH-P12.4
MODELING UNCERTAINTY OF DATA OBSERVATION
A. WENDEMUTH
An approach is presented, both theoretically and experimentally, which overcomes a number of existing conceptual and performance problems in density estimation. The theoretical approach shows methods for incorporating or estimating uncertainties in speech recognition. In the MMI and ML cases, precise formulae are given for the estimation of densities when the uncertainty variances are small compared to the curvature of the posteriors. For implementation, the theoretical formulae are presented in such a way that the additional computational effort grows linearly with the number of densities. Experiments on car digits show relative improvements in word error rate of up to 4.8%. Uncertainty modelling is shown to help remedy effects of the sparse data problem in density estimation.
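
For Gaussian densities, the standard way to fold an observation uncertainty into the likelihood (a textbook identity, not necessarily the paper's exact formulation) is to integrate over the uncertain observation, which simply adds the covariances:

    \int \mathcal{N}(o \mid x, \Sigma_u)\, \mathcal{N}(x \mid \mu, \Sigma)\, dx = \mathcal{N}(o \mid \mu, \Sigma + \Sigma_u)

For uncertainty variances \Sigma_u small compared to the curvature of the posteriors, the correction is therefore a per-density variance adjustment, consistent with the abstract's claim that the extra computation grows only linearly with the number of densities.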

9:30, SPEECH-P12.5
MODULAR NEURAL NETWORKS EXPLOIT LARGE ACOUSTIC CONTEXT THROUGH BROAD-CLASS POSTERIORS FOR CONTINUOUS SPEECH RECOGNITION
C. ANTONIOU
Traditionally, neural networks such as multi-layer perceptrons handle acoustic context by increasing the dimensionality of the observation vector in order to include information from the neighbouring acoustic vectors on either side of the current frame. As a result, the monolithic network is trained on a high-dimensional input space. The trend is to use the same fixed-size observation vector across the one network that estimates the posterior probabilities for all phones simultaneously. We propose a decomposition of the network into modular components, where each component estimates a phone posterior. The size of the observation vector we use is not fixed across the modularised networks, but rather depends on the phone that each network is trained to classify. For each observation vector, we estimate very large acoustic context through broad-class posteriors. The use of the broad-class posteriors along with the phone posteriors greatly enhances acoustic modelling. We report significant improvements in phone classification and word recognition on the TIMIT corpus. Our results are also better than those of the best context-dependent system reported in the literature.
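
A minimal sketch of the modular arrangement described above (the broad-class inventory, context widths and scorers below are illustrative placeholders, not the paper's configuration):

    import numpy as np

    # One scorer per phone, each with its own context width, fed acoustic frames
    # plus broad-class posteriors for the current frame.
    BROAD_CLASSES = ["vowel", "stop", "fricative", "nasal", "silence"]

    def module_input(frames, broad_post, t, context):
        """Stack 2*context+1 frames around t (assumes t is away from the edges)
        and append that frame's broad-class posterior vector."""
        window = frames[t - context:t + context + 1].ravel()
        return np.concatenate([window, broad_post[t]])

    # phone -> (context width in frames, trained scorer returning P(phone | input));
    # the scorers here are stand-ins for the trained per-phone networks.
    modules = {"aa": (8, lambda x: 0.5), "t": (3, lambda x: 0.5)}

    def phone_posteriors(frames, broad_post, t):
        return {ph: scorer(module_input(frames, broad_post, t, ctx))
                for ph, (ctx, scorer) in modules.items()}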

9:30, SPEECH-P12.6
MULTIPLE MIXTURE SEGMENTAL HMM AND ITS APPLICATIONS
B. XIANG, T. BERGER
In this paper a multiple mixture segmental hidden Markov model (MMSHMM) is presented. The model extends the linear probabilistic-trajectory segmental HMM. Each segment is characterized by multiple linear trajectories with slope and mid-point parameters, together with the residual error covariances around the trajectories, so that both extra-segmental and intra-segmental variations are represented. Instead of modeling a single distribution for each model parameter, as in earlier work, we use multiple mixture components to represent the variability due to variation within each speaker as well as differences between speakers. The model is evaluated on two applications. One is a phonetic classification task on the TIMIT corpus, which shows advantages over the conventional HMM. The other is a speaker-independent keyword-spotting task on the Road Rally database. By rescoring putative events hypothesized by a primary HMM keyword spotter, the experiments show that performance is improved by better distinguishing true hits from false alarms.
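
In the linear-trajectory reading suggested by the abstract, a segment of length T is scored against M candidate trajectories; a hedged sketch of the segment likelihood (notation illustrative, not the authors') is

    p(o_1, \dots, o_T \mid s) = \sum_{m=1}^{M} w_m \prod_{t=1}^{T} \mathcal{N}\big(o_t \mid c_m + b_m (t - \tfrac{T+1}{2}),\; \Sigma_m\big)

with mid-point c_m, slope b_m and residual covariance \Sigma_m per mixture component, so that intra-segmental variation is carried by the residual term and extra-segmental variation by the choice among components.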

9:30, SPEECH-P12.7
MULTIPLE-REGRESSION HIDDEN MARKOV MODEL
K. FUJINAGA, M. NAKAI, H. SHIMODAIRA, S. SAGAYAMA
This paper proposes a new class of hidden Markov model (HMM), called the multiple-regression HMM (MR-HMM), that utilizes auxiliary features, such as fundamental frequency (F0) and speaking style, that affect the spectral parameters, in order to better model the acoustic features of phonemes. Though such auxiliary features are often considered factors that degrade the performance of speech recognizers, the proposed MR-HMM adapts its model parameters, i.e., the mean vectors of the output probability distributions, to this auxiliary information in order to improve recognition accuracy. A formulation for parameter re-estimation of the MR-HMM based on the EM algorithm is given in the paper. Experiments on speaker-dependent isolated word recognition demonstrated that MR-HMMs using F0 as an auxiliary feature reduced the error rates by more than 18% compared with conventional HMMs.
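
A hedged sketch of the adaptation step as described (notation illustrative): with auxiliary feature vector z_t (e.g. F0) at frame t, the output distribution of state j uses a regressed mean,

    b_j(o_t) = \mathcal{N}\big(o_t \mid \mu_j + B_j z_t,\; \Sigma_j\big)

where B_j is a state-dependent regression matrix estimated, together with \mu_j and \Sigma_j, by the EM re-estimation formulas given in the paper.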

9:30, SPEECH-P12.8
TANDEM ACOUSTIC MODELING IN LARGE-VOCABULARY RECOGNITION
D. ELLIS, R. SINGH, S. SIVADAS
In the tandem approach to modeling the acoustic signal, a neural-net preprocessor is discriminatively trained to estimate posterior probabilities across a phone set. These are then used as feature inputs for a conventional HMM speech recognizer, which relearns the associations to sub-word units. In our previous experience with a small-vocabulary, high-noise digits task, the tandem approach achieved error-rate reductions of over 50% relative to the HMM baseline. In this paper, we apply the tandem approach to SPINE1, a larger task involving spontaneous speech. We find that, when context-independent models are used, the tandem features continue to result in large reductions in word-error rates relative to those achieved by systems using standard MFC or PLP features. However, these improvements do not carry over to context-dependent models. This may be attributable to several factors which are discussed in the paper.
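
A minimal sketch of the tandem front end as described, assuming the net's log posteriors are decorrelated by PCA before being passed to the HMM (a common tandem choice; the exact post-processing is not spelled out in the abstract, and the names below are illustrative):

    import numpy as np

    def tandem_features(frames, net_posteriors, n_keep=24):
        """net_posteriors stands in for the trained discriminative net and must
        return an array of shape (T, n_phones) whose rows sum to one."""
        post = net_posteriors(frames)
        logp = np.log(post + 1e-10)          # log compresses the dynamic range
        logp = logp - logp.mean(axis=0)      # center before decorrelation
        # PCA via SVD; keep the leading components as the HMM feature vector.
        _, _, vt = np.linalg.svd(logp, full_matrices=False)
        return logp @ vt[:n_keep].T          # shape (T, n_keep)

The conventional HMM recognizer is then trained on these vectors exactly as it would be on MFC or PLP features.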

9:30, SPEECH-P12.9
TOWARDS TASK-INDEPENDENT SPEECH RECOGNITION
F. LEFEVRE, J. GAUVAIN, L. LAMEL
Despite the considerable progress made in the last decade, speech recognition is far from a solved problem. For instance, porting a recognition system to a new task (or language) still requires a substantial investment of time and money, as well as the expertise of speech recognition technologists. This paper takes a first step toward evaluating to what extent a generic state-of-the-art speech recognizer can reduce the manual effort required for system development. We investigate the genericity of wide-domain models, such as broadcast news acoustic and language models, and techniques to achieve a higher degree of genericity, such as transparent methods to adapt such models for a given task. In this work, three tasks are targeted using commonly available corpora: small vocabulary recognition (TI-digits), text dictation (WSJ), and goal-oriented spoken dialog (ATIS).

9:30, SPEECH-P12.10
NONLINEAR DYNAMICAL SYSTEM BASED ACOUSTIC MODELING FOR ASR
N. WARAKAGODA, M. JOHNSEN
The work presented here is centered on a speech production model called the Chained Dynamical System Model (CDSM), which is motivated by the fundamental limitations of mainstream ASR approaches. The CDSM is essentially a smoothly time-varying, continuous-state nonlinear dynamical system, consisting of two sub-dynamical systems coupled in a chain so that one system controls the parameters of the other. The speech recognition problem is posed as inverting the CDSM, for which we propose a solution based on the theory of embedding. The resulting architecture, which we call the Inverted CDSM (ICDSM), is evaluated in a set of experiments involving a speaker-independent continuous speech recognition task on the TIMIT database. Results of these experiments, which compare favorably with corresponding results in the literature, confirm the feasibility and advantages of the approach.
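
A hedged reading of the chained structure (notation illustrative, not the authors'): a slowly varying system drives the parameters of a faster one,

    s_{t+1} = f(s_t), \quad \theta_t = g(s_t), \quad x_{t+1} = h(x_t; \theta_t), \quad o_t = k(x_t)

so recognition amounts to inverting the chain, i.e. recovering the slowly varying control trajectory \theta_t, and hence the phonetic content, from the observations o_t, with the hidden state reconstructed from delay vectors of the observed signal in the spirit of embedding theorems.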