Authors:
Tatsuya Kawahara,
Shuji Doshita,
Paper number 1687
Abstract:
A topic-independent lexical and language model for robust key-phrase
detection and verification is presented. Instead of assuming a
domain-specific lexicon and language model, our model is designed to
characterize filler phrases according to speaking style, and can
therefore be trained with large corpora of different topics but the
same style. A mutual information criterion is used to select
topic-independent filler words, and their N-gram model is used to
verify key-phrase hypotheses. A dialogue-style-dependent filler model
improves key-phrase detection in different dialogue applications. A
lecture-style-dependent model is trained on transcriptions of various
oral presentations by filtering out topic-specific words. It verifies
key-phrases uttered in lectures on different topics much better than
the conventional syllable-based model and large-vocabulary model.
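As an illustration of the selection step, here is a minimal sketch of
scoring words by the mutual information between their occurrence and the
topic label, keeping low-MI words as topic-independent filler candidates.
The toy corpora, counts, and threshold are illustrative assumptions, not
data or code from the paper.

```python
# Sketch: rank words by I(W; Topic); low-MI words are filler candidates.
import math
from collections import Counter

def mutual_information(word, corpora):
    """MI between the occurrence of `word` in a token and the topic label,
    estimated from per-topic unigram counts."""
    total = sum(sum(c.values()) for c in corpora.values())
    mi = 0.0
    for counts in corpora.values():
        p_t = sum(counts.values()) / total
        for present in (True, False):
            n = counts[word] if present else sum(counts.values()) - counts[word]
            p_joint = n / total
            p_w = sum(c[word] if present else sum(c.values()) - c[word]
                      for c in corpora.values()) / total
            if p_joint > 0:
                mi += p_joint * math.log(p_joint / (p_w * p_t))
    return mi

# Toy per-topic unigram counts; real input would be corpus transcriptions.
corpora = {
    "travel":  Counter({"well": 40, "um": 30, "flight": 25, "the": 100}),
    "banking": Counter({"well": 38, "um": 33, "balance": 28, "the": 95}),
}
# Low MI: usage barely depends on topic, so the word is a filler candidate.
fillers = [w for w in ["well", "um", "flight", "balance"]
           if mutual_information(w, corpora) < 0.01]
print(fillers)   # ['well', 'um']
```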
Authors:
Kwok Leung Lam,
Pascale Fung,
Paper number 2359
Abstract:
We propose a new confidence score for decoding and verification. Since
the traditional log likelihood ratio (LLR) is borrowed from speaker
verification, it may not be appropriate for decoding: there is no good
modelling and definition of LLR for decoding or utterance verification.
We propose a new formulation of LLR that can be used for both decoding
and verification tasks. Experimental results show that our proposed LLR
performs as well as a maximum-likelihood baseline in a decoding task,
and we obtain a 5% improvement in decoding over the traditional LLR.
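For readers unfamiliar with the baseline the abstract builds on, the
following is a minimal sketch of a conventional LLR verification score
(target model against an anti-model); the paper's new LLR formulation is
not reproduced here. The diagonal-Gaussian models, feature dimensions,
and zero threshold are illustrative assumptions.

```python
# Sketch: average per-frame log likelihood ratio, thresholded to verify.
import numpy as np

def gaussian_loglik(frames, mean, var):
    """Frame-wise log likelihood under a diagonal Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var).sum(axis=1)

def llr_score(frames, target, anti):
    """Average per-frame LLR; threshold it to accept/reject a hypothesis."""
    ll_target = gaussian_loglik(frames, *target)
    ll_anti = gaussian_loglik(frames, *anti)
    return (ll_target - ll_anti).mean()

rng = np.random.default_rng(0)
frames = rng.normal(0.0, 1.0, size=(50, 13))   # 50 frames, 13-dim features
target = (np.zeros(13), np.ones(13))           # (mean, var) of the claimed unit
anti = (np.full(13, 0.5), np.ones(13))         # competing anti-model
print("accept" if llr_score(frames, target, anti) > 0.0 else "reject")
```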
Authors:
Katrin Kirchhoff,
Jeff A Bilmes,
Paper number 2395
Abstract:
A recent development in the hybrid HMM/ANN speech recognition paradigm
is the use of several subword classifiers, each of which provides different
information about the speech signal. Although such combination methods
have obtained promising results, the strategies proposed so far have
been relatively simple: in most cases, frame-level subword unit
probabilities are combined using an unweighted product or sum rule. In this paper,
we argue and empirically demonstrate that the classifier combination
approach can benefit from a dynamically weighted combination rule,
where the weights are derived from higher-than-frame-level confidence
values.
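A minimal sketch of such a dynamically weighted sum rule follows, with
per-stream weights derived from an entropy-based confidence smoothed over
a window of frames (i.e., higher than frame level). The confidence
function, window length, and synthetic posteriors are illustrative
assumptions, not the paper's exact scheme.

```python
# Sketch: confidence-weighted combination of two classifiers' posteriors.
import numpy as np

def stream_confidence(posteriors, window=10):
    """Per-frame confidence: 1 - normalized entropy, smoothed over `window`
    frames so the weight reflects more than frame-level evidence."""
    eps = 1e-12
    ent = -(posteriors * np.log(posteriors + eps)).sum(axis=1)
    conf = 1.0 - ent / np.log(posteriors.shape[1])
    return np.convolve(conf, np.ones(window) / window, mode="same")

def combine(streams, window=10):
    """Confidence-weighted sum of per-stream posteriors, renormalized."""
    confs = np.stack([stream_confidence(p, window) for p in streams])
    weights = confs / confs.sum(axis=0, keepdims=True)      # (S, T)
    combined = sum(w[:, None] * p for w, p in zip(weights, streams))
    return combined / combined.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
# Two subword classifiers' posteriors over 40 phone classes for 100 frames.
streams = [rng.dirichlet(np.ones(40), size=100) for _ in range(2)]
print(combine(streams).shape)   # (100, 40)
```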
Authors:
Yeou-Jiunn Chen,
Chung-Hsien Wu,
Gwo-Lang Yan,
Paper number 1366
Abstract:
In this paper, prosodic information, a distinctive and important feature
of Mandarin speech, is used for Mandarin telephone speech utterance
verification. A two-stage strategy, recognition followed by verification,
is adopted. For keyword recognition, 59 context-independent subsyllables,
i.e., 22 INITIALs and 37 FINALs in Mandarin speech, plus one
background/silence model, are used as the basic recognition units. For
utterance verification, 12 anti-subsyllable HMMs, 175 context-dependent
prosodic HMMs, and five anti-prosodic HMMs are constructed. A keyword
verification function combining phonetic-phase and prosodic-phase
verification is investigated. On a test set of 2400 conversational speech
utterances from 20 speakers (12 male and 8 female), at an 8.5% false
rejection rate, the proposed verification method yields a 17.8% false
alarm rate. Furthermore, the method correctly rejects 90.4% of
non-keywords. Comparison with a baseline system without prosodic-phase
verification shows that prosodic information benefits verification
performance.
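The following is a minimal sketch of a two-phase verification function in
this spirit: a phonetic-phase score and a prosodic-phase score are fused
and thresholded. The linear fusion, weight, and threshold are illustrative
assumptions; the paper's actual verification function is not reproduced.

```python
# Sketch: fuse phonetic- and prosodic-phase LLR scores, then threshold.
def verify_keyword(phonetic_llr, prosodic_llr, weight=0.3, threshold=0.0):
    """Accept the keyword hypothesis iff the fused score clears the threshold."""
    fused = (1.0 - weight) * phonetic_llr + weight * prosodic_llr
    return fused > threshold

# A hypothesis with a marginal phonetic score can be rejected (or rescued)
# once prosodic evidence is taken into account.
print(verify_keyword(phonetic_llr=0.4, prosodic_llr=-2.0))   # False
print(verify_keyword(phonetic_llr=0.4, prosodic_llr=1.0))    # True
```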
Authors:
Hiroshi Matsuo,
Masaaki Ishigame,
Paper number 1609
Abstract:
We propose an error correction technique for speaker-independent isolated
word recognition that compensates a word's likelihood with a likelihood
calculated by a phonetic bigram, a phoneme model expressing frame
correlation within an utterance. A speaker-independent isolated word
recognition experiment showed that the proposed technique reduces
recognition errors compared to conventional techniques. Without any
speaker adaptation, it achieves performance almost equal to that of a
conventional phoneme model adapted using several words.
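A minimal sketch of the compensation step follows, under the assumption
that it can be approximated as interpolating the HMM word likelihood with
the phonetic-bigram likelihood; the interpolation weight and candidate
scores are illustrative, and the paper's exact compensation formula is not
reproduced.

```python
# Sketch: re-rank isolated-word candidates with a compensated likelihood.
def compensated_score(acoustic_ll, phonetic_bigram_ll, lam=0.2):
    """Interpolate the HMM word likelihood with the phonetic-bigram likelihood."""
    return acoustic_ll + lam * phonetic_bigram_ll

# Hypothetical (acoustic, phonetic-bigram) log likelihoods per candidate.
candidates = {"tokyo": (-410.0, -55.0), "kyoto": (-408.0, -80.0)}
best = max(candidates, key=lambda w: compensated_score(*candidates[w]))
print(best)   # "tokyo": the bigram term corrects the acoustic near-miss
```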
Authors:
Andreas M Wendemuth, Philips Research Labs Aachen Germany (Germany)
Georg Rose, Philips Research Labs Aachen Germany (Germany)
J.G.A. Dolfing, Philips Research Labs Aachen Germany (Germany)
Paper number 1664
Abstract:
This paper addresses the correct choice and combination of confidence
measures in large vocabulary speech recognition tasks. We classify single
words within continuous as well as large-vocabulary utterances into two
categories: words within the vocabulary that are recognized correctly,
and other words, namely misrecognized words and (less frequent)
out-of-vocabulary (OOV) words. To this end, we investigate the confidence
error rate (CER) for several classes of confidence measures and
transformations. In particular, we employ data-independent and
data-dependent measures. The transformations we investigate include
mappings to single confidence measures and linear combinations of these
measures, computed by means of neural networks trained with
Bayes-optimal and with Gardner-Derrida-optimal criteria. Compared to a
recognition system without confidence measures, selecting suitable
combinations of confidence measures, neural network architectures, and
training methods consistently improves the CER.
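As a sketch of the linear-combination case, the following trains a
single-layer network (logistic regression) to map several confidence
measures to a correct/incorrect decision under a cross-entropy criterion,
whose minimizer approximates the Bayes posterior; the Gardner-Derrida
criterion and the paper's actual measures are not reproduced, and the
data here is synthetic.

```python
# Sketch: linear combination of confidence measures via a one-layer network.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 3))                   # 3 confidence measures per word
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + rng.normal(size=n) > 0).astype(float)  # 1 = word correct

w, b = np.zeros(3), 0.0
for _ in range(500):                          # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # sigmoid output
    grad_w, grad_b = X.T @ (p - y) / n, (p - y).mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# Confidence error rate (CER): fraction of words whose accept/reject
# decision (p > 0.5) disagrees with the true label.
pred = 1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5
print("CER:", (pred != y.astype(bool)).mean())
```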
Authors:
Denis Jouvet, France Télécom, CNET (France)
Katarina Bartkova, France Télécom, CNET (France)
Guy Mercier, France Télécom, CNET (France)
Paper number 1663
Abstract:
An efficient rejection procedure is necessary to reject out-of-vocabulary
words and noise tokens that occur in voice activated vocal services.
Garbage or filler models are very useful for such a task. However,
post-processing the recognized hypothesis with a statistical likelihood
ratio test can refine the decision and improve performance. These tests
can be applied either to acoustic parameters or to phonetic or prosodic
parameters that are not taken into account by the HMM-based decoder.
This paper focuses on the post-processing procedure and shows that
making the likelihood ratio decision threshold dependent on the
recognized hypothesis largely improves the efficiency of the rejection
procedure. Models and anti-models are one of the key points of such an
approach; their training and usage are discussed, as well as the
contextual modeling involved. Finally, results are reported on a field
database collected from a 2000-word directory task using various
phonetic and prosodic parameters.
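A minimal sketch of the hypothesis-dependent decision rule follows: the
likelihood ratio threshold is looked up per recognized word rather than
fixed globally. The words and threshold values are hypothetical
placeholders, not taken from the field database.

```python
# Sketch: rejection with a threshold that depends on the recognized word.
DEFAULT_THRESHOLD = 0.0
word_thresholds = {"dupont": 0.8, "durand": -0.2}   # hypothetical values

def accept(hypothesis, llr):
    """Hypothesis-dependent decision: the threshold varies with the word."""
    return llr >= word_thresholds.get(hypothesis, DEFAULT_THRESHOLD)

print(accept("dupont", 0.5))   # False: confusable word, stricter threshold
print(accept("durand", 0.5))   # True: easy word, lenient threshold
```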
Authors:
Jeff A Bilmes,
Paper number 2105
Abstract:
Good HMM-based speech recognition performance requires that the HMM's
conditional independence assumptions introduce at most minimal
inaccuracies. In this work, those assumptions are relaxed in a principled
way. For each hidden state value, additional dependencies are added
between observation elements to increase both accuracy and
discriminability. These additional dependencies are chosen according to
natural statistical dependencies extant in the training data that are not
well modeled by an HMM. The result is called a buried Markov model (BMM)
because the underlying Markov chain in an HMM is further hidden (buried)
by specific cross-observation dependencies. Gaussian mixture HMMs are
extended to represent BMM dependencies, and new EM update equations are
derived. In preliminary experiments on a large-vocabulary isolated-word
speech database, BMMs achieve an 11% improvement in WER with only a 9.5%
increase in the number of parameters, in a recognition system using a
single state per monophone.
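As a sketch of the dependency-selection idea, the following ranks
candidate cross-observation dependencies (feature j at frame t-1 feeding
feature i at frame t) for one state by the strength of their correlation
in that state's training frames, keeping the top few. Correlation stands
in here for the paper's dependency measure, the data is synthetic, and
the BMM's extended Gaussians and EM updates are not reproduced.

```python
# Sketch: pick cross-observation dependencies from training-data statistics.
import numpy as np

def select_dependencies(frames, k=3):
    """frames: (T, D) observations aligned to one state. Returns the k
    (prev_feature, cur_feature) pairs with the largest |correlation|."""
    prev, cur = frames[:-1], frames[1:]
    d = frames.shape[1]
    corr = np.corrcoef(prev.T, cur.T)[:d, d:]      # corr(prev_j, cur_i)
    pairs = [(abs(corr[j, i]), j, i) for j in range(d) for i in range(d)]
    return [(j, i) for _, j, i in sorted(pairs, reverse=True)[:k]]

rng = np.random.default_rng(3)
frames = rng.normal(size=(500, 4))
frames[1:, 0] += 0.9 * frames[:-1, 2]      # plant a dependency: 2 -> 0
print(select_dependencies(frames))          # (2, 0) ranks first
```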