Abstract: Session SP-22

SP-22.1
Topic Independent Language Model for Key-Phrase Detection and Verification
Tatsuya Kawahara,
Shuji Doshita (Kyoto University)
Topic-independent lexical and language modeling for robust
key-phrase detection and verification is presented. Instead of
assuming a domain-specific lexicon and language model, our model is
designed to characterize filler phrases according to the
speaking style, and thus can be trained on large corpora of different
topics but the same style. A mutual information criterion is used to
select topic-independent filler words, and their N-gram model is used
to verify key-phrase hypotheses. A dialogue-style-dependent
filler model improves key-phrase detection in different dialogue
applications. A lecture-style-dependent model is trained on
transcriptions of various oral presentations by filtering out
topic-specific words. It verifies key-phrases uttered in lectures on
different topics much better than the conventional syllable-based
model and a large-vocabulary model.
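As a concrete illustration of the selection step, here is a minimal
Python sketch that ranks words by their mutual information with topic
labels and keeps the least topic-dependent ones as filler candidates.
The count format, function names, and the summed pointwise-MI
approximation are illustrative assumptions, not the paper's exact
procedure.

    import math
    from collections import defaultdict

    def word_topic_mutual_information(counts):
        """counts[(topic, word)] -> frequency in style-matched corpora.
        Returns each word's summed pointwise mutual-information
        contribution, a common approximation of I(word; topic)."""
        total = sum(counts.values())
        p_topic = defaultdict(float)
        p_word = defaultdict(float)
        for (t, w), c in counts.items():
            p_topic[t] += c / total
            p_word[w] += c / total
        mi = defaultdict(float)
        for (t, w), c in counts.items():
            p_tw = c / total
            mi[w] += p_tw * math.log(p_tw / (p_topic[t] * p_word[w]))
        return mi

    def select_filler_words(counts, k=100):
        """Keep the k words least informative about the topic; these
        are the topic-independent filler candidates."""
        mi = word_topic_mutual_information(counts)
        return sorted(mi, key=mi.get)[:k]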

SP-22.2
A more efficient and optimal LLR for decoding and verification
Kwok Leung LAM,
Pascale FUNG (University of Science and Technology (HKUST))
We propose a new confidence score for decoding and verification.
Since the traditional log likelihood ratio (LLR) is borrowed from speaker
verification, it may not be appropriate for decoding, because there is no
good modelling and definition of the LLR for decoding/utterance
verification. We propose a new formulation of the LLR that can be used for
both decoding and verification tasks. Experimental results show that the
proposed LLR performs as well as a maximum-likelihood criterion in a
decoding task, and yields a 5% improvement in decoding over the
traditional LLR.
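For reference, the traditional LLR that the abstract contrasts against
is usually formulated, as in speaker verification, as a log ratio
between a target model and an anti-model; the notation below is this
generic formulation, not the authors' new one:

    \[
    \mathrm{LLR}(O) = \log \frac{P(O \mid \lambda)}{P(O \mid \bar{\lambda})}
                    = \log P(O \mid \lambda) - \log P(O \mid \bar{\lambda}),
    \]

where $O$ is the observation sequence, $\lambda$ the target model, and
$\bar{\lambda}$ the anti-model; a hypothesis is accepted when
$\mathrm{LLR}(O)$ exceeds a decision threshold $\tau$.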

SP-22.3
Dynamic Classifier Combination in Hybrid Speech Recognition Systems using Utterance-Level Confidence Values
Katrin Kirchhoff (Technische Fakultaet, Universitaet Bielefeld),
Jeff A Bilmes (ICSI/U.C. Berkeley)
A recent development in the hybrid HMM/ANN speech recognition paradigm
is the use of several subword classifiers, each of which provides
different information about the speech signal. Although these combination
methods have obtained promising results, the strategies proposed so far
have been relatively simple. In most cases, frame-level
subword unit probabilities are combined using an unweighted product or
sum rule. In this paper, we argue and empirically demonstrate that the
classifier combination approach can benefit from a dynamically
weighted combination rule, where the weights are derived from
higher-than-frame-level confidence values.
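A minimal Python/NumPy sketch of a dynamically weighted sum rule,
assuming each classifier supplies frame-level posteriors plus one
utterance-level confidence; normalizing the confidences into weights is
an illustrative choice, not necessarily the paper's exact scheme.

    import numpy as np

    def combine_streams(posteriors, confidences):
        """posteriors: list of (frames, classes) arrays, one per classifier.
        confidences: one utterance-level confidence per classifier.
        Returns the dynamically weighted combination of the posteriors."""
        w = np.asarray(confidences, dtype=float)
        w = w / w.sum()                       # confidences -> weights
        combined = sum(wi * p for wi, p in zip(w, posteriors))
        return combined / combined.sum(axis=1, keepdims=True)  # renormalize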

SP-22.4
Utterance Verification Using Prosodic Information for Mandarin Telephone Speech Keyword Spotting
Yeou-Jiunn Chen,
Chung-Hsien Wu,
Gwo-Lang Yan (Department of Computer Science and Information Engineering, National Cheng Kung University)
In this paper, prosodic information, a special and important feature of Mandarin speech, is used for Mandarin telephone speech utterance verification. A two-stage strategy, with recognition followed by verification, is adopted. For keyword recognition, 59 context-independent subsyllables, i.e., 22 INITIALs and 37 FINALs in Mandarin speech, and one background/silence model are used as the basic recognition units. For utterance verification, 12 anti-subsyllable HMMs, 175 context-dependent prosodic HMMs, and five anti-prosodic HMMs are constructed. A keyword verification function combining phonetic-phase and prosodic-phase verification is investigated. Using a test set of 2400 conversational speech utterances from 20 speakers (12 male and 8 female), at an 8.5% false rejection rate, the proposed verification method yielded a 17.8% false alarm rate. Furthermore, this method was able to correctly reject 90.4% of nonkeywords. Comparison with a baseline system without prosodic-phase verification shows that prosodic information benefits verification performance.
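A minimal sketch of the two-stage accept/reject decision, assuming the
phonetic-phase and prosodic-phase scores are combined linearly; the
interpolation weight and threshold are hypothetical parameters, not the
paper's exact verification function.

    def verify_keyword(phonetic_score, prosodic_score, alpha=0.5, threshold=0.0):
        """Accept a keyword hypothesis when the combined phonetic-phase
        and prosodic-phase verification score clears the threshold."""
        score = alpha * phonetic_score + (1.0 - alpha) * prosodic_score
        return score >= threshold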

SP-22.5
Error Correction for Speaker-Independent Isolated Word Recognition through Likelihood Compensation Using Phonetic Bigram
Hiroshi Matsuo (Akita University),
Masaaki Ishigame (Iwate Prefectural University)
We propose an error-correction technique for speaker-independent isolated word recognition that compensates a word's likelihood.
The word likelihood is compensated with a likelihood calculated from a phonetic bigram.
The phonetic bigram is a phoneme model expressing frame correlation within an utterance.
A speaker-independent isolated word recognition experiment showed that the proposed technique reduces recognition errors compared to conventional techniques.
Without speaker adaptation, the proposed technique achieves performance almost equal to that of a conventional phoneme model adapted using several words.
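A minimal sketch of the rescoring idea, assuming each candidate word
carries an HMM log-likelihood and a phonetic-bigram log-likelihood; the
interpolation weight is a hypothetical parameter rather than the
paper's setting.

    def compensated_score(word_loglik, bigram_loglik, weight=0.3):
        """Compensate the word likelihood with the phonetic-bigram
        likelihood, which models frame correlation within the utterance."""
        return (1.0 - weight) * word_loglik + weight * bigram_loglik

    def recognize(candidates):
        """candidates: {word: (word_loglik, bigram_loglik)}.
        Returns the word with the best compensated likelihood."""
        return max(candidates, key=lambda w: compensated_score(*candidates[w]))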

SP-22.6
Advances in confidence measures for large vocabulary
Andreas M Wendemuth,
Georg Rose,
Hans Dolfing (Philips Research Labs Aachen Germany)
This paper addresses the correct choice and combination
of confidence measures in large vocabulary speech
recognition tasks. We classify single words within
continuous as well as large vocabulary utterances
into two categories: words within the vocabulary
which are recognized correctly, and all other words,
namely misrecognized words or (less frequent)
out-of-vocabulary (OOV) words.
To this end, we investigate the confidence error rate
(CER) for several classes of confidence measures and
transformations. In particular, we employ data-independent
and data-dependent measures. The transformations we
investigate include mappings of single confidence
measures and linear combinations of these measures. These
combinations are computed by means of neural networks
trained with Bayes-optimal and with
Gardner-Derrida-optimal criteria.
Compared to a recognition system without confidence measures,
selecting (various combinations of) confidence
measures, suitable neural network
architectures, and training methods consistently reduces
the CER.
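A minimal Python sketch of learning a linear combination of single
confidence measures; a logistic output unit trained with a
cross-entropy criterion stands in here for the networks described, and
is an assumption rather than the authors' exact architecture or
training criteria.

    import numpy as np

    def train_confidence_combiner(X, y, lr=0.1, epochs=200):
        """X: (words, measures) array of confidence features;
        y: 1 for correctly recognized words, 0 otherwise.
        Learns a linear combination through a logistic output unit."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(correct)
            grad = p - y                             # cross-entropy gradient
            w -= lr * (X.T @ grad) / len(y)
            b -= lr * grad.mean()
        return w, b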

SP-22.7
Hypothesis Dependent Threshold Setting for Improved Out-of-Vocabulary Data Rejection
Denis Jouvet,
Katarina Bartkova,
Guy Mercier (France Télécom, CNET)
An efficient rejection procedure is necessary to reject out-of-vocabulary words and noise tokens that occur in voice-activated vocal services. Garbage or filler models are very useful for such a task. However, post-processing of the recognized hypothesis, based on a likelihood ratio statistical test, can refine the decision and improve performance. These tests can be applied either to acoustic parameters or to phonetic or prosodic parameters that are not taken into account by the HMM-based decoder.
This paper focuses on the post-processing procedure and shows that making the likelihood ratio decision threshold dependent on the recognized hypothesis greatly improves the efficiency of the rejection procedure. Models and anti-models are one of the key points of such an approach. Their training and usage are also discussed, as well as the contextual modeling involved. Finally, results are reported on a field database collected from a 2000-word directory task, using various phonetic and prosodic parameters.
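A minimal sketch of the decision rule, assuming a per-hypothesis
threshold table; the table and fallback value are illustrative, and the
paper's thresholds would come from the models and anti-models discussed
above.

    def accept_hypothesis(hypothesis, llr, thresholds, default=0.0):
        """Apply a hypothesis-dependent decision threshold to the
        likelihood ratio instead of a single global operating point."""
        return llr >= thresholds.get(hypothesis, default)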

SP-22.8
Buried Markov Models for Speech Recognition
Jeff A Bilmes (U.C. Berkeley/ICSI)
Good HMM-based speech recognition performance requires that the
conditional independence assumptions made by the HMM introduce at most
minimal inaccuracies. In this work, HMM conditional independence
assumptions are relaxed in a principled way. For each hidden state
value, additional dependencies are added between observation elements
to increase both accuracy and discriminability. These additional
dependencies are chosen according to natural statistical dependencies
extant in the training data that are not well modeled by an HMM. The
result is called a buried Markov model (BMM) because the
underlying Markov chain in an HMM is further hidden (buried) by
specific cross-observation dependencies. Gaussian mixture HMMs are
extended to represent BMM dependencies, and new EM update equations are
derived. In preliminary experiments on a large-vocabulary
isolated-word speech database, BMMs achieve an 11%
improvement in WER with only a 9.5% increase in the number of
parameters, using a single-state-per-monophone speech recognition
system.
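One way to picture the added cross-observation dependencies is a
conditional-Gaussian sketch in which selected elements of past
observation vectors shift the mean of the state-conditional output
density; this is an illustration consistent with the abstract, not
necessarily the paper's exact parameterization:

    \[
    p(x_t \mid q_t = q, z_t) = \mathcal{N}\bigl(x_t;\ \mu_q + B_q z_t,\ \Sigma_q\bigr),
    \]

where $x_t$ is the current observation, $q$ the hidden state value, and
$z_t$ the selected past observation elements; the selection favors
dependencies with high conditional mutual information
$I(x_t; z_t \mid q)$ that the baseline HMM leaves unmodeled.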