9:30, SPEECH-P12.1
SPEECH RECOGNITION FOR DARPA COMMUNICATOR
A. AARON, S. CHEN, P. COHEN, S. DHARANIPRAGADA, E. EIDE, M. FRANZ, J. LE ROUX, X. LUO, B. MAISON, L. MANGU, T. MATHES, M. NOVAK, P. OLSEN, M. PICHENY, H. PRINTZ, B. RAMABHADRAN, A. SAKRAJDA, G. SAON, B. TYDLITAT, K. VISWESWARIAH, D. YUK
We report the results of investigations in acoustic modeling, language
modeling, and decoding techniques, for DARPA Communicator, a speaker-
independent, telephone-based dialog system. By a combination of
methods, including enlarging the acoustic model, augmenting the
recognizer vocabulary, conditioning the language model upon dialog
state, and applying a post-processing decoding method, we lowered
the overall word error rate from 21.9% to 15.0%, a reduction of 6.9%
absolute and 31.5% relative.
9:30, SPEECH-P12.2
EFFICIENT MIXTURE GAUSSIAN SYNTHESIS FOR DECISION TREE BASED STATE TYING
T. KATO, S. KUROIWA, T. SHIMIZU, N. HIGUCHI
We propose an efficient mixture Gaussian synthesis method for decision tree based state tying that produces better context-dependent models in a short training time. This method makes it possible to handle mixture Gaussian HMMs in the decision tree based state tying algorithm, and provides higher recognition performance than the conventional HMM training procedure, which applies decision tree based state tying to single Gaussian HMMs. It also shortens the HMM training procedure because the mixture incrementing process is not necessary. We applied this method to the training of telephone-speech triphones and evaluated its effect on Japanese phonetically balanced sentence tasks. Our method achieved a 1 to 2 point improvement in phoneme accuracy and a 67% reduction in training time.
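The splitting criterion at the heart of decision tree based tying can be illustrated with a toy sketch (the 1-D, single-Gaussian simplification and all names are ours, not from the paper): a phonetic yes/no question is worth asking when splitting the node raises the training-data likelihood.

```python
import math

def node_loglik(frames):
    # Log-likelihood of the frames reaching a tree node under a single
    # Gaussian fitted to them (1-D for simplicity; variance is floored).
    n = len(frames)
    mean = sum(frames) / n
    var = max(sum((x - mean) ** 2 for x in frames) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def split_gain(frames, answers):
    # Likelihood gain of splitting a node on a phonetic question:
    # the two children's log-likelihoods minus the parent's.
    yes = [x for x, a in zip(frames, answers) if a]
    no = [x for x, a in zip(frames, answers) if not a]
    return node_loglik(yes) + node_loglik(no) - node_loglik(frames)
```

A question that cleanly separates two acoustic clusters yields a much larger gain than one that mixes them, which is what drives the greedy tree-growing procedure.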
9:30, SPEECH-P12.3
DISCRIMINATIVE TRAINING OF HMM USING MAXIMUM NORMALIZED LIKELIHOOD ALGORITHM
K. MARKOV, S. NAKAGAWA, S. NAKAMURA
In this paper, we present the Maximum Normalized Likelihood Estimation (MNLE) algorithm and its application to discriminative training of HMMs for continuous speech recognition. The objective of this algorithm is to maximize the normalized frame likelihood of the training data. In contrast to other discriminative algorithms such as Minimum Classification Error/Generalized Probabilistic Descent (MCE/GPD) and Maximum Mutual Information (MMI), the objective function is optimized using a modified Expectation-Maximization (EM) algorithm, which greatly simplifies and speeds up the training procedure. Evaluation experiments showed better recognition rates compared to both the Maximum Likelihood (ML) training method and the MCE/GPD discriminative method. In addition, the MNLE algorithm showed better generalization abilities and was faster than MCE/GPD.
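The normalized frame likelihood that MNLE maximizes can be sketched per frame as follows (a toy illustration under our own simplifications; the averaging over frames is our assumption, not a detail taken from the paper):

```python
def normalized_frame_likelihood(state_likelihoods, correct_state):
    # Per-frame quantity: likelihood of the correct state normalized
    # by the sum over all competing states.
    return state_likelihoods[correct_state] / sum(state_likelihoods)

def mnle_objective(frames, alignment):
    # Training objective: average normalized likelihood over all
    # frames, given a state alignment of the training data.
    return sum(normalized_frame_likelihood(f, s)
               for f, s in zip(frames, alignment)) / len(frames)
```

Because the normalizer involves the competing states, raising this quantity is discriminative, yet it remains a likelihood-like objective amenable to an EM-style update.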
9:30, SPEECH-P12.4
MODELING UNCERTAINTY OF DATA OBSERVATION
A. WENDEMUTH
We present an approach, both theoretically and experimentally, that
overcomes a number of existing conceptual and performance problems in density estimation. The theoretical approach shows methods for incorporating or estimating uncertainties in speech recognition. For both the MMI and ML cases, precise formulae are given for the estimation of densities when the uncertainty variances are small compared to the curvature of the posteriors.
For implementation, the theoretical formulae are presented in such a
way that the additional computational effort scales linearly with the number of densities.
Experiments on car digits show word error rate improvements of up to 4.8% relative. Uncertainty modelling is shown to help remedy effects of the sparse data problem in density estimation.
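One standard way to fold a small Gaussian observation uncertainty into a Gaussian density is to add its variance to the model variance; a minimal 1-D sketch of that idea (function names are ours, and this is only an illustration of the general principle, not the paper's precise formulae):

```python
import math

def gauss_logpdf(x, mean, var):
    # Log-density of a 1-D Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gauss_logpdf_uncertain(x, mean, var, obs_var):
    # Marginalizing a Gaussian observation uncertainty over the true
    # value adds its variance to the model variance; a reasonable
    # approximation when obs_var is small compared to the curvature
    # of the posterior, as assumed in the abstract.
    return gauss_logpdf(x, mean, var + obs_var)
```

The effect is intuitive: an uncertain observation flattens the density, lowering the score at the mean but raising it in the tails.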
9:30, SPEECH-P12.5
MODULAR NEURAL NETWORKS EXPLOIT LARGE ACOUSTIC CONTEXT THROUGH BROAD-CLASS POSTERIORS FOR CONTINUOUS SPEECH RECOGNITION
C. ANTONIOU
Traditionally, neural networks such as multi-layer perceptrons handle acoustic context by increasing the dimensionality of the observation vector to include information from the neighbouring acoustic vectors on either side of the current frame. As a result, the monolithic network is trained on a high-dimensional space. The trend is to use the same fixed-size observation vector across the one network that estimates the posterior probabilities for all phones simultaneously. We propose a decomposition of the network into modular components, where each component estimates a single phone posterior. The size of the observation vector is not fixed across the modularised networks, but rather depends on the phone that each network is trained to classify. For each observation vector, we estimate very large acoustic context through broad-class posteriors. The use of the broad-class posteriors along with the phone posteriors greatly enhances acoustic modelling. We report significant improvements in phone classification and word recognition on the TIMIT corpus. Our results are also better than those of the best context-dependent system in the literature.
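The variable-size observation described above might be assembled roughly as follows (a hypothetical sketch reflecting our reading of the abstract; the per-phone context width and the concatenation with broad-class posteriors are assumptions):

```python
def modular_input(frames, center, context, broad_class_post):
    # Observation for one phone's network: a phone-specific context
    # window of acoustic frames, concatenated with broad-class
    # posteriors that summarize a much wider acoustic context.
    lo = max(0, center - context)
    hi = min(len(frames), center + context + 1)
    vec = []
    for f in frames[lo:hi]:
        vec.extend(f)
    return vec + broad_class_post
```

Each modular network can thus receive a window sized for its own phone, rather than the single fixed window a monolithic network imposes on all phones.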
9:30, SPEECH-P12.6
MULTIPLE MIXTURE SEGMENTAL HMM AND ITS APPLICATIONS
B. XIANG, T. BERGER
In this paper a multiple mixture segmental hidden Markov model (MMSHMM) is presented. This model extends the linear probabilistic-trajectory segmental HMM. Each segment is characterized by multiple linear trajectories with slope and mid-point parameters, together with the residual error covariances around the trajectories, so that both extra-segmental and intra-segmental variations are represented. Instead of modeling a single distribution for each model parameter as in earlier work, we use multiple mixture components to represent the variability within each speaker as well as the differences between speakers. The model is evaluated on two applications. One is a phonetic classification task on the TIMIT corpus, which shows advantages over the conventional HMM. The other is a speaker-independent keyword spotting task on the Road Rally database. By rescoring putative events hypothesized by a primary HMM keyword spotter, the experiments show that performance is improved by distinguishing true hits from false alarms.
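The slope/mid-point parameterization of a linear trajectory can be sketched as follows (a 1-D toy illustration; names are ours):

```python
def trajectory_means(slope, midpoint, n_frames):
    # Mean of the observation at each frame of a segment: a straight
    # line passing through `midpoint` at the segment centre, with the
    # given slope. Residual covariances around this line (and multiple
    # such trajectories per segment) complete the segmental model.
    c = (n_frames - 1) / 2.0
    return [midpoint + slope * (t - c) for t in range(n_frames)]
```

A segment's likelihood is then evaluated against these frame-wise means; with several (slope, midpoint) mixture components per segment, both within-speaker and between-speaker variation can be captured.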
9:30, SPEECH-P12.7
MULTIPLE-REGRESSION HIDDEN MARKOV MODEL
K. FUJINAGA, M. NAKAI, H. SHIMODAIRA, S. SAGAYAMA
This paper proposes a new class of hidden Markov model (HMM) called
multiple-regression HMM (MR-HMM) that utilizes auxiliary features
such as fundamental frequency (F0) and speaking styles that affect
spectral parameters to better model the acoustic features of phonemes.
Though such auxiliary features are considered to be the factors that
degrade the performance of speech recognizers, the proposed MR-HMM
adapts its model parameters, i.e. the mean vectors of the output
probability distributions, depending on this auxiliary information,
to improve recognition accuracy.
A formulation for parameter reestimation of the MR-HMM based on the
EM algorithm is given in the paper.
Experiments on speaker-dependent isolated word recognition
demonstrated that MR-HMMs using F0 as an auxiliary feature
reduced the error rates by more than 18% compared with the
conventional HMMs.
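The mean adaptation step can be sketched as a linear regression on the auxiliary feature vector (a toy illustration; the exact parameterization in the paper may differ):

```python
def adapted_mean(base_mean, regression_matrix, aux):
    # mu' = mu + B * xi : shift each component of the output
    # distribution's mean by a linear regression on the auxiliary
    # feature vector xi (e.g. log F0 for the current frame).
    return [m + sum(b * a for b, a in zip(row, aux))
            for m, row in zip(base_mean, regression_matrix)]
```

At recognition time the observed auxiliary feature moves the Gaussian means toward the acoustics expected under that F0 or speaking style, rather than treating the variation as noise.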
9:30, SPEECH-P12.8
TANDEM ACOUSTIC MODELING IN LARGE-VOCABULARY RECOGNITION
D. ELLIS, R. SINGH, S. SIVADAS
In the tandem approach to modeling the acoustic signal, a neural-net preprocessor is discriminatively trained to estimate posterior probabilities across a phone set. These are then used as feature inputs for a conventional HMM speech recognizer, which relearns the associations to sub-word units. In our previous experience with a small-vocabulary, high-noise digits task, the tandem approach achieved error-rate reductions of over 50% relative to the HMM baseline. In this paper, we apply the tandem approach to SPINE1, a larger task involving spontaneous speech. We find that, when context-independent models are used, the tandem features continue to result in large reductions in word-error rates relative to those achieved by systems using standard MFC or PLP features. However, these improvements do not carry over to context-dependent models. This may be attributable to several factors which are discussed in the paper.
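The tandem feature pipeline can be sketched as follows (a minimal illustration; the decorrelation/PCA stage usually applied before the HMM is omitted, and the function names are ours):

```python
import math

def phone_posteriors(activations):
    # Softmax over the net's output layer gives per-frame phone
    # posteriors (computed with the usual max-shift for stability).
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    s = sum(exps)
    return [e / s for e in exps]

def tandem_features(activations, eps=1e-10):
    # Log-posteriors become the observation vector for a conventional
    # GMM-HMM recognizer, replacing MFC or PLP features.
    return [math.log(p + eps) for p in phone_posteriors(activations)]
```

The HMM back end then relearns the mapping from these discriminatively trained features to its own sub-word units, which is the essence of the tandem arrangement.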
9:30, SPEECH-P12.9
TOWARDS TASK-INDEPENDENT SPEECH RECOGNITION
F. LEFEVRE, J. GAUVAIN, L. LAMEL
Despite the considerable progress made in the last decade, speech
recognition is far from a solved problem. For instance, porting a
recognition system to a new task (or language) still requires
substantial investment of time and money and also the expertise of
speech recognition technologists. This paper takes a first step
toward evaluating the extent to which a generic state-of-the-art
speech recognizer can reduce the manual effort required for system
development.
We investigate the genericity of wide domain models, such as broadcast
news acoustic and language models, and techniques to achieve a higher
degree of genericity, such as transparent methods to adapt such models
for a given task. In this work, three tasks are targeted using
commonly available corpora: small vocabulary recognition (TI-digits),
text dictation (WSJ), and goal-oriented spoken dialog (ATIS).
9:30, SPEECH-P12.10
NONLINEAR DYNAMICAL SYSTEM BASED ACOUSTIC MODELING FOR ASR
N. WARAKAGODA, M. JOHNSEN
The work presented here is centered on a speech production model
called the Chained Dynamical System Model (CDSM), which is motivated
by the fundamental limitations of mainstream ASR approaches. The CDSM
is essentially a smoothly time-varying, continuous-state nonlinear
dynamical system, consisting of two dynamical sub-systems coupled as
a chain so that one system controls the parameters of the next. The
speech recognition problem is posed as inverting the CDSM, for which
we propose a solution based on the theory of embedding. The resulting
architecture, which we call the Inverted CDSM (ICDSM), is evaluated
in a set of experiments on a speaker-independent, continuous speech
recognition task using the TIMIT database. Results of these
experiments, which compare favorably with corresponding results in
the literature, confirm the feasibility and advantages of the
approach.
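Embedding-based inversion of a dynamical system rests on delay-coordinate reconstruction of its state; a minimal sketch of a Takens-style delay embedding (our illustration of the general technique, not code from the paper):

```python
def delay_embed(signal, dim, tau):
    # Takens-style delay embedding: map a scalar time series into
    # dim-dimensional state vectors built from samples spaced tau
    # steps apart, reconstructing the underlying state space.
    return [signal[i:i + dim * tau:tau]
            for i in range(len(signal) - (dim - 1) * tau)]
```

With a suitable dimension and delay, the reconstructed vectors recover the geometry of the hidden state trajectory, which is what makes inverting the forward model tractable.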