Abstract: Session SP-4
SP-4.1
Using a large vocabulary continuous speech recognizer for a constrained domain with limited training
Man-hung Siu,
Michael Jonas,
Herbert Gish (BBN Technologies/GTE Internetworking)
How to train a speech recognizer with a limited amount of training data
is of interest to many researchers. In this paper, we describe how we
use BBN's Byblos large vocabulary continuous speech recognition
(LVCSR) system for the military air-traffic-control domain, where we
have less than an hour of training data. We investigate three ways to
deal with the limited training data: 1) re-configuring the LVCSR system
to use fewer parameters, 2) incorporating out-of-domain data, and 3)
using pragmatic information, such as speaker identity and controller
function, to improve recognition performance. We compare the LVCSR
performance to that of a tied-mixture recognizer designed for
limited-vocabulary tasks. We show that the reconfigured LVCSR system
outperforms the tied-mixture system by 10% in absolute word error
rate. When enough data is available per speaker, vocal tract length
normalization and supervised adaptation can further improve
performance by 6%, even in this limited-training domain. We
also show that the use of out-of-domain data and pragmatic
information, if available, can each further improve performance by
1-3%.
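The abstract does not spell out how the out-of-domain data is incorporated; one common technique (an assumption here, not necessarily BBN's method) is linear interpolation of in-domain and out-of-domain language-model estimates, with the weight favoring the scarce in-domain data. A minimal unigram sketch with hypothetical toy distributions:

```python
def interpolate_lm(p_in, p_out, lam=0.8):
    """Interpolated probability P(w) = lam*P_in(w) + (1-lam)*P_out(w).
    p_in / p_out map words to probabilities; lam weights the scarce
    in-domain estimates against the plentiful out-of-domain ones."""
    vocab = set(p_in) | set(p_out)
    return {w: lam * p_in.get(w, 0.0) + (1 - lam) * p_out.get(w, 0.0)
            for w in vocab}

# hypothetical toy distributions
p_in = {"roger": 0.6, "contact": 0.4}    # in-domain (ATC-like)
p_out = {"contact": 0.5, "hello": 0.5}   # out-of-domain
p = interpolate_lm(p_in, p_out, lam=0.8)  # p["contact"] = 0.42
```

Because both inputs are proper distributions over the merged vocabulary, the interpolated result still sums to one.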
SP-4.2
Initial Evaluation of Hidden Dynamic Models on Conversational Speech
Joseph Picone (Institute for Signal and Information Processing, MS State),
Sandi Pike (Brown University),
Roland Reagan (Carnegie Mellon University),
Terri Kamm (Department of Defense),
John Bridle (Dragon Systems U.K.),
Li Deng,
Jeff Ma (University of Waterloo),
Hywel Richards (Dragon Systems U.K.),
Mike Schuster (ATR)
Conversational speech recognition is a challenging
problem primarily because speakers rarely fully
articulate sounds. A successful speech recognition
approach must infer intended spectral targets
from the speech data, or develop a method of
dealing with large variances in the data.
Hidden Dynamic Models (HDMs) attempt to automatically
learn such targets in a hidden feature space using
models that integrate linguistic information with
constrained temporal trajectory models. HDMs are
a radical departure from conventional hidden Markov
models (HMMs), which simply account for variation
in the observed data. In this paper, we present an
initial evaluation of such models on a conversational
speech recognition task involving a subset of the
SWITCHBOARD corpus. We show that in an N-Best
rescoring paradigm, HDMs are capable of delivering
performance competitive with HMMs.
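The N-best rescoring paradigm mentioned above can be sketched generically: a second-pass model scores the first-pass hypotheses, and the two scores are combined log-linearly. The weight and score values below are illustrative assumptions, not the paper's numbers:

```python
def rescore_nbest(nbest, second_pass_scores, alpha=0.5):
    """Pick the best hypothesis from an N-best list after log-linearly
    combining the first-pass (e.g. HMM) log-probability with a
    second-pass (e.g. HDM) log-probability.
    nbest: list of (hypothesis, first_pass_logprob)
    second_pass_scores: dict hypothesis -> second_pass_logprob"""
    combined = [(hyp, (1 - alpha) * s + alpha * second_pass_scores[hyp])
                for hyp, s in nbest]
    return max(combined, key=lambda x: x[1])[0]

nbest = [("the cat sat", -120.0), ("the cat sad", -118.0)]
hdm = {"the cat sat": -50.0, "the cat sad": -60.0}
best = rescore_nbest(nbest, hdm, alpha=0.5)  # "the cat sat"
```

Here the second pass overturns the first-pass ranking: the HDM strongly prefers the first hypothesis, outweighing its slightly worse HMM score.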
SP-4.3
Convolutional Density Estimation in Hidden Markov Models for Speech Recognition
Spyros Matsoukas,
George Zavaliagkos (BBN Technologies, GTE Internetworking)
In continuous density Hidden Markov Models (HMMs) for speech recognition, the probability density function (pdf) of each state is usually expressed as a mixture of Gaussians. In this paper, we present a model in which the pdf is expressed as the convolution of two densities. We focus on the special case where one of the convolved densities is an M-Gaussian mixture and the other is a mixture of N impulses. We present the reestimation formulae for the parameters of the MxN convolutional model and suggest two ways of initializing them: a residual K-Means approach, and deconvolution from a standard HMM with MN Gaussians per state, using a genetic algorithm to search for the optimal assignment of Gaussians. Both methods result in a compact representation that requires only O(M + N) storage space for the model parameters and O(MN) time for training and decoding. We explain how the decoding time can be reduced to O(M + kN), where k < M. Finally, results are shown on the 1996 Hub-4 Development test, demonstrating that a 32x2 convolutional model can achieve performance comparable to that of a standard model with 64 Gaussians per state.
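The convolutional construction can be illustrated directly: convolving an M-Gaussian mixture with an N-impulse mixture yields an MN-component Gaussian mixture whose component means are the pairwise sums of Gaussian means and impulse offsets, while only O(M + N) parameters are stored. A minimal sketch of evaluating such a density (parameter values are illustrative):

```python
import math

def conv_density(x, gauss, impulses):
    """Evaluate an MxN convolutional density at scalar x.
    gauss:    list of (weight c, mean mu, variance v)  -- M Gaussians
    impulses: list of (weight d, offset delta)         -- N impulses
    Convolving N(mu, v) with an impulse at delta shifts the Gaussian,
    so the result is the MN-mixture sum_{m,n} c_m d_n N(x; mu_m+delta_n, v_m)."""
    p = 0.0
    for c, mu, v in gauss:
        for d, delta in impulses:
            z = x - (mu + delta)
            p += c * d * math.exp(-0.5 * z * z / v) / math.sqrt(2 * math.pi * v)
    return p
```

With a single unit-weight standard Gaussian and a single impulse at zero, this reduces to the ordinary Gaussian pdf, which is a quick sanity check.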
SP-4.4
AUTOMATIC CLUSTERING AND GENERATION OF CONTEXTUAL QUESTIONS FOR TIED STATES IN HIDDEN MARKOV MODELS
Rita Singh,
Bhiksha Raj,
Richard M Stern (Carnegie Mellon University)
Most current automatic speech recognition systems
based on HMMs cluster or tie together subsets of
the subword units with which speech is represented.
This tying improves recognition accuracy when systems
are trained with limited data, and is performed by
classifying the sub-phonetic units using a series of
binary tests based on speech production, called
"linguistic questions". This paper describes a new
method for automatically determining the best
combinations of subword units to form these questions.
The proposed hybrid algorithm clusters the state
distributions of context-independent phones to
obtain questions for triphonetic contexts.
Experiments confirm that the questions thus generated
can replace manually generated questions and can
provide improved recognition accuracy.
Automatic generation of questions has the additional
important advantage of extensibility to languages for
which the phonetic structure is not well understood
by the system designer, and can be effectively used
in situations where the subword units are not
phonetically motivated.
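The paper's exact hybrid algorithm is not reproduced here; a generic bottom-up sketch of the idea — cluster context-independent phone models and emit every intermediate cluster as a candidate question — might look as follows. One Gaussian per phone and a symmetric KL distance are both simplifying assumptions:

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL(N(m1,v1) || N(m2,v2)) for 1-D Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def sym_kl(a, b):
    return kl_gauss(*a, *b) + kl_gauss(*b, *a)

def cluster_questions(models):
    """Agglomeratively cluster context-independent phone models
    (phone -> (mean, var), one Gaussian each for simplicity).
    Every cluster produced along the way becomes a candidate
    'linguistic question': a set of phones a tree may ask about."""
    clusters = [({p}, (m, v)) for p, (m, v) in models.items()]
    questions = [frozenset(c[0]) for c in clusters]
    while len(clusters) > 1:
        # merge the closest pair under symmetric KL
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: sym_kl(clusters[ij[0]][1],
                                         clusters[ij[1]][1]))
        (si, gi), (sj, gj) = clusters[i], clusters[j]
        # crude merged Gaussian: unweighted average of the two
        merged = (si | sj, ((gi[0] + gj[0]) / 2, (gi[1] + gj[1]) / 2))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        questions.append(frozenset(merged[0]))
    return questions
```

On three toy phones where "p" and "b" have nearly identical models, the first question generated beyond the singletons is the set {p, b}, mirroring how hand-written question sets group acoustically similar phones.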
SP-4.5
Partly Hidden Markov Model and its Application to Speech Recognition
Tetsunori Kobayashi,
Junko Furuyama,
Ken Masumitsu (Waseda University)
A new pattern matching method, the Partly Hidden Markov
Model, is proposed and applied to speech recognition.
The Hidden Markov Model, which is widely used for speech
recognition, can deal only with piecewise stationary
stochastic processes. We address this problem by
introducing a modified second-order Markov model,
in which the first state is hidden and the second
one is observable. In this model, not only the feature
parameter observations but also the state transitions
depend on the previous feature observation.
Therefore, even complicated transients can be
modeled precisely.
Simulation experiments showed the high
potential of the proposed model.
In word recognition tests, the error
rate was reduced by 39% compared with a normal HMM.
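A discrete-symbol sketch of the idea — a forward pass in which both the state transition and the emission are conditioned on the previous, observable symbol. The exact parameterization in the paper differs, so this is only an illustration:

```python
def phmm_forward(obs, pi, b0, A, B):
    """Forward likelihood for a toy 'partly hidden' Markov model over
    discrete symbols, where transition and emission depend on the
    previous observation:
        P(s_t | s_{t-1}, o_{t-1}) = A[o_{t-1}][s_{t-1}][s_t]
        P(o_t | s_t, o_{t-1})     = B[o_{t-1}][s_t][o_t]
    pi[s] is the initial state prior, b0[s][o] the initial emission."""
    S = len(pi)
    alpha = [pi[s] * b0[s][obs[0]] for s in range(S)]
    for t in range(1, len(obs)):
        prev = obs[t - 1]
        alpha = [sum(alpha[s] * A[prev][s][sp] for s in range(S))
                 * B[prev][sp][obs[t]] for sp in range(S)]
    return sum(alpha)
```

In a degenerate single-state model with uniform emissions the likelihood collapses to 0.5 per symbol, which makes the recursion easy to verify by hand.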
SP-4.6
HIDDEN MARKOV MODELS WITH DIVERGENCE BASED VECTOR QUANTIZED VARIANCES
Jae H Kim,
Raziel Haimi-Cohen,
Frank K Soong (Lucent Technologies)
This paper describes a method to significantly reduce the complexity of continuous density HMMs with only a small degradation in performance. The proposed method is noise-robust and may even outperform the standard algorithm when training and testing noise conditions are not matched. The method is based on approximating the variance vectors of the Gaussian kernels by a vector quantization (VQ) codebook of small size. The quantization of the variance vectors is done using an information-theoretic distortion measure. Closed-form expressions are given for the computation of the VQ codebook, and the superiority of the proposed distortion measure over the Euclidean distance is demonstrated. The effectiveness of the proposed method is shown using the connected TI digits database and a noisy version of it. For this database, the proposed method shows that by quantizing the variances to 16 levels we can keep recognition performance within 1% degradation of the original system. In comparison, with Euclidean distortion a codebook of size 256 is needed for a similar error rate.
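As a rough illustration (the paper's exact distortion measure and closed-form reestimation are not reproduced here), variance vectors can be quantized with a Lloyd-style VQ under an assumed KL-type distortion between diagonal Gaussians of equal means; for this particular distortion the optimal cell centroid works out to the per-dimension arithmetic mean:

```python
import math

def kl_distortion(v, u):
    """KL-type distortion between a data variance vector v and a
    codeword u: sum(v/u - log(v/u) - 1), zero iff v == u.
    An assumed stand-in for the paper's information-theoretic measure."""
    return sum(vi / ui - math.log(vi / ui) - 1.0 for vi, ui in zip(v, u))

def vq_variances(data, codebook, iters=10):
    """Lloyd iteration: nearest-codeword assignment under the KL-type
    distortion, then centroid update (arithmetic mean of each cell,
    which minimizes this distortion over the cell)."""
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in data:
            k = min(range(len(codebook)),
                    key=lambda j: kl_distortion(v, codebook[j]))
            cells[k].append(v)
        codebook = [[sum(v[i] for v in cell) / len(cell)
                     for i in range(len(cell[0]))] if cell else cw
                    for cell, cw in zip(cells, codebook)]
    return codebook
```

Unlike Euclidean distance, this distortion penalizes underestimating a variance more than overestimating it by the same ratio's reciprocal, which is one intuition for why an information-theoretic measure needs fewer codewords.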
SP-4.7
HMM Training Based on Quality Measurement
Yuqing Gao,
Ea-Ee Jan,
Mukund Padmanabhan,
Michael Picheny (IBM T.J. Watson Research Center)
Two discriminant measures for HMM states to improve the effectiveness
of HMM training are presented in this paper. In HMM-based speech
recognition, the context-dependent states are usually modeled by Gaussian
mixture distributions. In general, the number of Gaussian mixtures for
each state is fixed or proportional to the amount of training data.
Our study shows that some states are ``non-aggressive'' compared to others
and require a higher acoustic resolution. Two methods are
presented in this paper to identify those non-aggressive states.
The first approach uses the recognition accuracy of the states; the
second is based on a rank distribution of the states. Baseline
systems, trained with a fixed number of Gaussian mixtures per state,
with 33K and 120K Gaussians, yield 14.57% and 13.04% word error
rates, respectively. Using our approaches, a 38K-Gaussian system was
constructed that reduces the error rate to 13.95%. The average ranks of
non-aggressive states in the rank lists of the test data also improve
dramatically compared to the baseline systems.
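The rank-based criterion might be sketched as follows; the scoring and threshold details are illustrative assumptions, not the paper's exact recipe:

```python
def nonaggressive_states(frame_scores, labels, rank_threshold=1.2):
    """frame_scores: per-frame dict state -> log-likelihood;
    labels: the correct state for each frame. A state's rank on a
    frame is the position of its likelihood among all states
    (1 = best). States whose average rank over their own frames
    exceeds rank_threshold are flagged as 'non-aggressive' and
    would be given additional Gaussians."""
    ranks = {}
    for scores, correct in zip(frame_scores, labels):
        order = sorted(scores, key=scores.get, reverse=True)
        ranks.setdefault(correct, []).append(order.index(correct) + 1)
    return {s for s, r in ranks.items()
            if sum(r) / len(r) > rank_threshold}
```

A state that frequently loses to other states on its own frames accumulates a poor average rank and is selected for a larger mixture, matching the abstract's notion of needing higher acoustic resolution.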
SP-4.8
Prosodic Word Boundary Detection Using Statistical Modeling of Moraic Fundamental Frequency Contours and Its Use for Continuous Speech Recognition
Koji Iwano,
Keikichi Hirose (University of Tokyo)
A new method for prosodic word boundary detection in continuous speech
was developed, based on the statistical modeling of moraic transitions
of fundamental frequency (F0) contours formerly proposed by the authors.
In the developed method, the F0 contours of prosodic words are modeled
separately according to accent type. An input utterance is matched
against the models and divided into its constituent prosodic words,
yielding the prosodic word boundaries. The method was first applied to
boundary detection experiments on the ATR continuous speech corpus.
With the mora boundary locations given in the corpus, the total
detection rate reached 91.5%. The method was then integrated into a
continuous speech recognition scheme with unlimited vocabulary, and an
improvement of a few percentage points was observed in mora recognition
for the above corpus. Although all the experiments were done in closed
conditions due to corpus availability, the results indicate the
usefulness of the proposed method.