Authors:
Man-Hung Siu,
Michael Jonas,
Herbert Gish,
Page (NA) Paper number 2472
Abstract:
How to train a speech recognizer with a limited amount of training data
is of interest to many researchers. In this paper, we describe how we
use BBN's Byblos large vocabulary continuous speech recognition (LVCSR)
system for the military air-traffic-control domain where we have less
than an hour of training data. We investigate three ways to deal with
the limited training data: 1) re-configure the LVCSR system to use
fewer parameters, 2) incorporate out-of-domain data, and 3) use pragmatic
information, such as speaker identity and controller function, to improve
recognition performance. We compare the LVCSR performance to that of
the tied-mixture recognizer that is designed for limited vocabulary.
We show that the reconfigured LVCSR system out-performs the tied-mixture
system by 10% in absolute word error rate. When enough data is available
per speaker, vocal tract length normalization and supervised adaptation
techniques can further improve performance by 6% even for this domain
with limited training. We also show that the use of out-of-domain data
and pragmatic information, if available, can each further improve performance
by 1-3%.
Authors:
Joseph Picone,
Sandi Pike,
Roland Reagan,
Terri Kamm,
John S Bridle, Dragon Systems U.K. (U.K.)
Li Deng,
Jeff Ma,
Hywel B Richards, Dragon Systems U.K. (U.K.)
Mike Schuster,
Page (NA) Paper number 2339
Abstract:
Conversational speech recognition is a challenging problem primarily
because speakers rarely fully articulate sounds. A successful speech
recognition approach must infer intended spectral targets from the
speech data, or develop a method of dealing with large variances in
the data. Hidden Dynamic Models (HDMs) attempt to automatically learn
such targets in a hidden feature space using models that integrate
linguistic information with constrained temporal trajectory models.
HDMs are a radical departure from conventional hidden Markov models
(HMMs), which simply account for variation in the observed data. In
this paper, we present an initial evaluation of such models on a conversational
speech recognition task involving a subset of the SWITCHBOARD corpus.
We show that in an N-Best rescoring paradigm, HDMs are capable of delivering
performance competitive with HMMs.
Authors:
Spyros Matsoukas,
George Zavaliagkos,
Page (NA) Paper number 2379
Abstract:
In continuous density Hidden Markov Models (HMMs) for speech recognition,
the probability density function (pdf) for each state is usually expressed
as a mixture of Gaussians. In this paper, we present a model in which
the pdf is expressed as the convolution of two densities. We focus
on the special case where one of the convolved densities is an M-Gaussian
mixture and the other is a mixture of N impulses. We present the reestimation
formulae for the parameters of the MxN convolutional model, and suggest
two ways of initializing them: the residual K-Means approach, and
deconvolution from a standard HMM with MN Gaussians per state, using
a genetic algorithm to search for the optimal assignment of Gaussians.
Both methods result in a compact representation that requires only
O(M + N) storage space for the model parameters, and O(MN) time for
training and decoding. We explain how the decoding time can be reduced
to O(M + kN), where k < M. Finally, results are shown on the 1996 Hub-4
Development test, demonstrating that a 32x2 convolutional model can
achieve performance comparable to that of a standard 64-Gaussian per
state model.
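
The storage argument above can be illustrated with a small sketch. Convolving an M-Gaussian mixture with a mixture of N impulses yields MN Gaussians whose means are the original means shifted by the impulse locations, while the variances are shared across shifts. The function and parameter names below are hypothetical, diagonal covariances are assumed, and this is only an illustration of the density, not the reestimation or initialization procedure of the paper.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def convolutional_pdf(x, weights, means, variances, impulse_weights, impulse_shifts):
    """Evaluate the MxN density directly from its M + N stored components.

    Each Gaussian m is paired with each impulse n; the pair contributes a
    Gaussian with mean means[m] + impulse_shifts[n] and variance variances[m].
    """
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        for v, shift in zip(impulse_weights, impulse_shifts):
            shifted_mean = [m + s for m, s in zip(mu, shift)]
            total += w * v * gaussian_pdf(x, shifted_mean, var)
    return total

def expand_to_mixture(weights, means, variances, impulse_weights, impulse_shifts):
    """Write out the equivalent MN-component mixture explicitly (for comparison)."""
    components = []
    for w, mu, var in zip(weights, means, variances):
        for v, shift in zip(impulse_weights, impulse_shifts):
            shifted_mean = [m + s for m, s in zip(mu, shift)]
            components.append((w * v, shifted_mean, list(var)))
    return components
```

Only the M Gaussians and N impulses are stored; `expand_to_mixture` is included solely to show the equivalent MN-Gaussian mixture that the compact model stands in for.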
Authors:
Rita Singh,
Bhiksha Raj,
Richard M Stern,
Page (NA) Paper number 2487
Abstract:
Most current automatic speech recognition systems based on HMMs cluster
or tie together subsets of the subword units with which speech is represented.
This tying improves recognition accuracy when systems are trained with
limited data, and is performed by classifying the sub-phonetic units
using a series of binary tests based on speech production, called "linguistic
questions". This paper describes a new method for automatically determining
the best combinations of subword units to form these questions. The
proposed hybrid algorithm clusters state distributions of context-independent
phones to obtain questions for triphonetic contexts. Experiments confirm
that the questions thus generated can replace manually generated questions
and can provide improved recognition accuracy. Automatic generation
of questions has the additional important advantage of extensibility
to languages for which the phonetic structure is not well understood
by the system designer, and can be effectively used in situations where
the subword units are not phonetically motivated.
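
The general idea of deriving questions from clustered phone distributions can be sketched as follows. This is a minimal illustration under strong assumptions, not the hybrid algorithm of the paper: each context-independent phone is summarized by a single mean vector, clusters are merged bottom-up by squared Euclidean distance between centroids, and every merged set of phones is emitted as a candidate question. The function name and the toy phone statistics are hypothetical.

```python
def generate_questions(phone_means):
    """Bottom-up clustering of phones; each merged phone set is a candidate question.

    phone_means maps a phone label to its mean vector (an assumed summary of
    the phone's state distribution).
    """
    clusters = {frozenset([p]): list(m) for p, m in phone_means.items()}
    total = len(clusters)
    questions = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        # Find the closest pair of clusters by squared distance between centroids.
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                dist = sum((a - b) ** 2
                           for a, b in zip(clusters[keys[i]], clusters[keys[j]]))
                if best is None or dist < best[0]:
                    best = (dist, keys[i], keys[j])
        _, a, b = best
        merged = a | b
        # Centroid of the merged cluster, weighted by the number of phones.
        mean = [(len(a) * x + len(b) * y) / (len(a) + len(b))
                for x, y in zip(clusters[a], clusters[b])]
        del clusters[a]
        del clusters[b]
        clusters[merged] = mean
        # A set containing every phone would be a trivially true question.
        if len(merged) < total:
            questions.append(merged)
    return questions
```

A question produced this way is simply a set of phones, used as the binary test "is the context phone in this set?", so the procedure needs no phonetic knowledge of the language.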
Authors:
Tetsunori Kobayashi,
Junko Furuyama,
Ken Masumitsu,
Page (NA) Paper number 2323
Abstract:
A new pattern matching method, the Partly Hidden Markov Model, is proposed
and applied to speech recognition. The Hidden Markov Model, which is widely
used for speech recognition, can deal only with piecewise stationary
stochastic processes. We solved this problem by introducing a modified
second-order Markov model, in which the first state is hidden and the
second one is observable. In this model, not only the feature parameter
observations but also the state transitions depend on the previous
feature observation. Therefore, even complicated transients can
be modeled precisely. Simulation experiments showed the high
potential of the proposed model. In a word recognition
test, the error rate was reduced by 39% compared with a standard HMM.
Authors:
Jae H Kim,
Raziel Haimi-Cohen,
Frank K Soong,
Page (NA) Paper number 2337
Abstract:
This paper describes a method to significantly reduce the complexity
of continuous density HMMs with only a small degradation in performance.
The proposed method is noise-robust and may perform even better than
the standard algorithm if training and testing noise conditions are
not matched. The method is based on approximating the variance vectors
of the Gaussian kernels by a vector quantization (VQ) codebook of a
small size. The quantization of the variance vectors is done using
an information theoretic distortion measure. Closed form expressions
are given for the computation of the VQ codebook and the superiority
of the proposed distortion measure over the Euclidean distance is demonstrated.
The effectiveness of the proposed method is shown using the connected
TI digits database and a noisy version of it. On the connected TI
digits database, quantizing the variance to 16 levels keeps
recognition performance within 1% degradation
of the original VR system. By comparison, with Euclidean distortion,
a codebook of size 256 is needed for a similar error rate.
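
The codebook design can be sketched as below. The abstract does not state which information theoretic distortion measure is used, so this sketch assumes one plausible choice: the Kullback-Leibler divergence between two zero-mean diagonal Gaussians that differ only in their variance vectors. Under that assumption the centroid of a cell has a closed form (the arithmetic mean of the variances assigned to it). All function names are hypothetical.

```python
import math
import random

def kl_distortion(var, code):
    """KL divergence between zero-mean diagonal Gaussians N(0, var) and N(0, code).

    This particular measure is an assumption made for illustration.
    """
    return 0.5 * sum(v / c - 1.0 - math.log(v / c) for v, c in zip(var, code))

def nearest(var, codebook):
    """Index of the codeword with the smallest distortion to var."""
    return min(range(len(codebook)), key=lambda k: kl_distortion(var, codebook[k]))

def train_codebook(variances, size, iterations=20, seed=0):
    """Lloyd-style training of a variance codebook under kl_distortion."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(variances, size)]
    for _ in range(iterations):
        cells = [[] for _ in range(size)]
        for var in variances:
            cells[nearest(var, codebook)].append(var)
        for k, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                dim = len(cell[0])
                # Closed-form centroid under the assumed distortion.
                codebook[k] = [sum(v[d] for v in cell) / len(cell)
                               for d in range(dim)]
    return codebook
```

After training, each Gaussian stores only the index returned by `nearest` instead of a full variance vector, which is where the reduction in model complexity comes from.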
Authors:
Yuqing Gao,
Ea-Ee Jan,
Mukund Padmanabhan,
Michael Picheny,
Page (NA) Paper number 2483
Abstract:
Two discriminant measures for HMM states are presented in this paper to
make HMM training more effective. In HMM-based speech recognition,
the context-dependent states are usually modeled by Gaussian mixture
distributions. In general, the number of Gaussian mixture components for each
state is fixed or proportional to the amount of training data. In
our study, some of the states are "non-aggressive" compared to others,
and a higher acoustic resolution is required for them. Two methods
are presented in this paper to determine those non-aggressive states.
The first approach uses the recognition accuracy of the states and
the second method is based on a rank distribution of states. Baseline
systems, trained with a fixed number of Gaussians for each state,
with 33K and 120K Gaussians yield 14.57% and 13.04% word error rates,
respectively. Using our approaches, a 38K-Gaussian system was constructed
that reduces the error rate to 13.95%. The average ranks of non-aggressive
states in rank lists of the test data also appear to improve dramatically
compared to the baseline systems.
Authors:
Koji Iwano,
Keikichi Hirose,
Page (NA) Paper number 2237
Abstract:
A new method for prosodic word boundary detection in continuous speech
was developed based on the statistical modeling of moraic transitions
of fundamental frequency (F0) contours, previously proposed by the authors.
In the developed method, F0 contours of prosodic words were modeled
separately according to the accent types. An input utterance was matched
against the models and was divided into constituent prosodic words.
By doing so, prosodic word boundaries can be obtained. The method was
first applied to boundary detection experiments on the ATR continuous
speech corpus. With mora boundary locations given in the corpus, the total
detection rate reached 91.5%. Then the method was integrated into a
continuous speech recognition scheme with unlimited vocabulary. An
improvement of a few percent was observed in mora recognition for the above
corpus. Although all the experiments were done in closed conditions due
to corpus availability, the results indicate the usefulness of
the proposed method.