
Abstract: Session SP-22


SP-22.1  

Topic Independent Language Model for Key-Phrase Detection and Verification
Tatsuya Kawahara, Shuji Doshita (Kyoto University)

Topic-independent lexical and language modeling for robust key-phrase detection and verification is presented. Instead of assuming a domain-specific lexicon and language model, our model is designed to characterize filler phrases that depend on the speaking style, and can therefore be trained with large corpora of different topics but the same style. A mutual information criterion is used to select topic-independent filler words, and their N-gram model is used to verify key-phrase hypotheses. A dialogue-style-dependent filler model improves key-phrase detection across different dialogue applications. A lecture-style-dependent model is trained on transcriptions of various oral presentations by filtering out topic-specific words. It verifies key-phrases uttered in lectures on different topics much better than the conventional syllable-based model and a large-vocabulary model.
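The mutual information criterion described above can be illustrated with a minimal sketch: score each (word, topic) pair by how far its joint frequency departs from independence, and treat words whose scores stay near zero across all topics as candidate topic-independent fillers. The function name and the count representation are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import Counter

def mutual_information(word_topic_counts):
    """Pointwise mutual-information-style score for each (word, topic) pair.

    word_topic_counts: dict mapping (word, topic) -> occurrence count.
    Words whose scores are close to zero for every topic are distributed
    almost independently of topic, making them filler-word candidates.
    """
    total = sum(word_topic_counts.values())
    word_counts = Counter()
    topic_counts = Counter()
    for (w, t), c in word_topic_counts.items():
        word_counts[w] += c
        topic_counts[t] += c
    scores = {}
    for (w, t), c in word_topic_counts.items():
        p_wt = c / total                       # joint probability
        p_w = word_counts[w] / total           # word marginal
        p_t = topic_counts[t] / total          # topic marginal
        scores[(w, t)] = math.log(p_wt / (p_w * p_t))
    return scores
```

A word such as a discourse filler occurs evenly across topics, so its score stays low; a content word concentrated in one topic scores high for that topic and would be filtered out.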


SP-22.2  

A more efficient and optimal LLR for decoding and verification
Kwok Leung LAM, Pascale FUNG (University of Science and Technology (HKUST))

We propose a new confidence score for decoding and verification. Since the traditional log likelihood ratio (LLR) is borrowed from speaker verification, it may not be appropriate for decoding, because a good model and definition of LLR for decoding/utterance verification has been lacking. We propose a new formulation of LLR that can be used for both decoding and verification. Experimental results show that the proposed LLR performs as well as maximum likelihood in a decoding task. We also obtain a 5% improvement in decoding over the traditional LLR.
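For context, the traditional LLR that the abstract contrasts against can be sketched as the difference of log-likelihoods under the hypothesized model and an anti (alternative) model, normalized by utterance length. The function names and the per-frame representation are illustrative assumptions, not the paper's new formulation.

```python
def frame_normalized_llr(frame_log_p_target, frame_log_p_anti):
    """Length-normalized log likelihood ratio.

    frame_log_p_target / frame_log_p_anti: per-frame log-likelihoods
    under the hypothesized model and an anti-model. Dividing by the
    frame count keeps the score comparable across utterances of
    different durations.
    """
    n = len(frame_log_p_target)
    return (sum(frame_log_p_target) - sum(frame_log_p_anti)) / n

def verify(frame_log_p_target, frame_log_p_anti, threshold):
    # accept the hypothesis when the normalized LLR clears the threshold
    return frame_normalized_llr(frame_log_p_target, frame_log_p_anti) >= threshold
```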


SP-22.3  

Dynamic Classifier Combination in Hybrid Speech Recognition Systems using Utterance-Level Confidence Values
Katrin Kirchhoff (Technische Fakultaet, Universitaet Bielefeld), Jeff A Bilmes (ICSI/U.C. Berkeley)

A recent development in the hybrid HMM/ANN speech recognition paradigm is the use of several subword classifiers, each of which provides different information about the speech signal. Although these combination methods have obtained promising results, the strategies proposed so far have been relatively simple: in most cases, frame-level subword unit probabilities are combined using an unweighted product or sum rule. In this paper, we argue and empirically demonstrate that the classifier combination approach can benefit from a dynamically weighted combination rule, where the weights are derived from higher-than-frame-level confidence values.
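The contrast between the unweighted sum rule and a dynamically weighted one can be sketched as follows. Here each classifier contributes a posterior vector over subword classes, and its weight is derived from an utterance-level confidence; the equal-confidence case reduces to the unweighted sum rule. The function name and normalization are illustrative assumptions.

```python
def combine_posteriors(streams, confidences):
    """Confidence-weighted combination of classifier posteriors.

    streams: list of per-class posterior vectors, one per classifier.
    confidences: one utterance-level confidence per classifier,
    normalized here to weights summing to one. Passing equal
    confidences recovers the plain (unweighted) sum rule.
    """
    z = sum(confidences)
    weights = [c / z for c in confidences]
    n_classes = len(streams[0])
    return [sum(w * s[k] for w, s in zip(weights, streams))
            for k in range(n_classes)]
```

With streams `[0.9, 0.1]` and `[0.2, 0.8]`, equal confidences leave the two classifiers deadlocked near `[0.55, 0.45]`, while trusting the first classifier three times as much pushes the decision firmly toward class 0.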


SP-22.4  

Utterance Verification Using Prosodic Information for Mandarin Telephone Speech Keyword Spotting
Yeou-Jiunn Chen, Chung-Hsien Wu, Gwo-Lang Yan (Department of Computer Science and Information Engineering, National Cheng Kung University)

In this paper, prosodic information, a special and important feature of Mandarin speech, is used for Mandarin telephone speech utterance verification. A two-stage strategy, with recognition followed by verification, is adopted. For keyword recognition, 59 context-independent subsyllables, i.e., 22 INITIALs and 37 FINALs in Mandarin speech, and one background/silence model are used as the basic recognition units. For utterance verification, 12 anti-subsyllable HMMs, 175 context-dependent prosodic HMMs, and five anti-prosodic HMMs are constructed. A keyword verification function combining phonetic-phase and prosodic-phase verification is investigated. Using a test set of 2400 conversational speech utterances from 20 speakers (12 males and 8 females), at an 8.5% false rejection rate, the proposed verification method resulted in a 17.8% false alarm rate. Furthermore, this method was able to correctly reject 90.4% of nonkeywords. Comparison with a baseline system without prosodic-phase verification shows that the prosodic information benefits verification performance.
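A verification function that combines the two phases can be sketched as a convex combination of the phonetic-phase and prosodic-phase scores, compared against a rejection threshold. The weighting form, parameter names, and values below are illustrative assumptions, not the paper's tuned function.

```python
def combined_verification(phonetic_score, prosodic_score, alpha, threshold):
    """Two-phase keyword verification sketch.

    phonetic_score / prosodic_score: verification scores (e.g. LLRs)
    from the phonetic and prosodic phases. alpha in [0, 1] trades the
    two phases off against each other; both alpha and threshold are
    illustrative tuning parameters.
    """
    score = alpha * phonetic_score + (1.0 - alpha) * prosodic_score
    return score >= threshold
```

Setting `alpha = 1.0` recovers a phonetic-only baseline, which is the comparison the abstract reports against.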


SP-22.5  

Error Correction for Speaker-Independent Isolated Word Recognition through Likelihood Compensation Using Phonetic Bigram
Hiroshi Matsuo (Akita University), Masaaki Ishigame (Iwate Prefectural University)

We propose an error correction technique for speaker-independent isolated word recognition that compensates a word's likelihood with a likelihood calculated by a phonetic bigram. The phonetic bigram is a phoneme model expressing frame correlation within an utterance. A speaker-independent isolated word recognition experiment showed that the proposed technique reduces recognition errors compared to conventional techniques. Without speaker adaptation, it achieves performance almost equal to that of a conventional phoneme model adapted using several words.
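The likelihood-compensation step can be sketched as an interpolation of the word-level HMM score with the phonetic-bigram score, with recognition picking the word whose compensated score is best. The interpolation form and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
def compensated_likelihood(hmm_log_likelihood, bigram_log_likelihood, lam):
    """Compensate a word's HMM log-likelihood with a phonetic-bigram
    log-likelihood. lam is an illustrative interpolation weight."""
    return hmm_log_likelihood + lam * bigram_log_likelihood

def recognize(word_scores, bigram_scores, lam):
    """Pick the word with the best compensated likelihood.

    word_scores / bigram_scores: dicts mapping candidate word ->
    log-likelihood under the HMM and the phonetic bigram respectively.
    """
    return max(word_scores,
               key=lambda w: compensated_likelihood(word_scores[w],
                                                    bigram_scores[w], lam))
```

With `lam = 0` the bigram is ignored and the decision reduces to the conventional HMM-only recognizer; a positive `lam` lets the bigram overturn near-tie decisions.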


SP-22.6  

Advances in confidence measures for large vocabulary
Andreas M Wendemuth, Georg Rose, Hans Dolfing (Philips Research Labs Aachen Germany)

This paper addresses the correct choice and combination of confidence measures in large-vocabulary speech recognition tasks. We classify single words within continuous, large-vocabulary utterances into two categories: words within the vocabulary that are recognized correctly, and all others, namely misrecognized words and (less frequent) out-of-vocabulary (OOV) words. To this end, we investigate the confidence error rate (CER) for several classes of confidence measures and transformations. In particular, we employ data-independent and data-dependent measures. The transformations we investigate include mappings to single confidence measures and linear combinations of these measures, computed by neural networks trained with Bayes-optimal and Gardner-Derrida-optimal criteria. Compared to a recognition system without confidence measures, selecting suitable (combinations of) confidence measures, neural network architectures, and training methods consistently improves the CER.
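The evaluation metric named above, the confidence error rate, can be sketched as the fraction of words whose accept/reject decision (confidence above a threshold) disagrees with whether the word was actually recognized correctly. The function name and threshold-based decision rule are illustrative assumptions.

```python
def confidence_error_rate(confidences, labels, threshold):
    """CER sketch: fraction of words whose accept/reject decision
    disagrees with the truth.

    confidences: one confidence value per recognized word.
    labels: True if the word was recognized correctly, False for a
    misrecognition or OOV word.
    A word is accepted when its confidence >= threshold; every
    false accept and false reject counts as one confidence error.
    """
    errors = sum(1 for c, ok in zip(confidences, labels)
                 if (c >= threshold) != ok)
    return errors / len(labels)
```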


SP-22.7  

Hypothesis Dependent Threshold Setting for Improved Out-of-Vocabulary Data Rejection
Denis Jouvet, Katarina Bartkova, Guy Mercier (France Télécom, CNET)

An efficient rejection procedure is necessary to reject out-of-vocabulary words and noise tokens that occur in voice-activated vocal services. Garbage or filler models are very useful for this task. However, post-processing of the recognized hypothesis, based on a likelihood ratio statistical test, can refine the decision and improve performance. These tests can be applied either to acoustic parameters or to phonetic or prosodic parameters that are not taken into account by the HMM-based decoder. This paper focuses on the post-processing procedure and shows that making the likelihood ratio decision threshold dependent on the recognized hypothesis largely improves the efficiency of the rejection procedure. Models and anti-models are one of the key points of this approach; their training and usage are discussed, as well as the contextual modeling involved. Finally, results are reported on a field database collected from a 2000-word directory task using various phonetic and prosodic parameters.
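The core idea, a rejection threshold that depends on the recognized hypothesis rather than one global value, can be sketched as a per-hypothesis threshold lookup with a global fallback. The dictionary representation and function name are illustrative assumptions.

```python
def reject(hypothesis, llr, thresholds, default_threshold):
    """Hypothesis-dependent rejection sketch.

    hypothesis: the recognized word or phrase.
    llr: its likelihood-ratio verification score.
    thresholds: per-hypothesis decision thresholds (illustrative);
    hypotheses without a tuned threshold fall back to the default,
    which is what a single global threshold would apply everywhere.
    """
    t = thresholds.get(hypothesis, default_threshold)
    return llr < t
```

Easily confusable or acoustically short words can thus be held to a stricter threshold than long, distinctive ones, which is what makes the rejection more efficient than a single global cutoff.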


SP-22.8  

Buried Markov Models for Speech Recognition
Jeff A Bilmes (U.C. Berkeley/ICSI)

Good HMM-based speech recognition performance requires that the HMM's conditional independence assumptions introduce at most minimal inaccuracies. In this work, these conditional independence assumptions are relaxed in a principled way. For each hidden state value, additional dependencies are added between observation elements to increase both accuracy and discriminability. These additional dependencies are chosen according to natural statistical dependencies extant in the training data that are not well modeled by an HMM. The result is called a buried Markov model (BMM) because the underlying Markov chain in an HMM is further hidden (buried) by specific cross-observation dependencies. Gaussian mixture HMMs are extended to represent BMM dependencies, and new EM update equations are derived. In preliminary experiments on a large-vocabulary isolated-word speech database, BMMs achieve an 11% improvement in WER with only a 9.5% increase in the number of parameters, using a single state per monophone speech recognition system.
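The dependency-selection step can be sketched as follows: pool the observation frames assigned to one hidden state and pick the element pairs with the strongest statistical dependence as candidate cross-observation dependencies. The paper selects dependencies with discriminative statistical criteria; plain Pearson correlation below is a simplified stand-in, and the function names are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def select_dependencies(frames, k):
    """Pick the k observation-element pairs with the strongest absolute
    correlation across the frames pooled for one hidden state, as
    candidate cross-observation dependencies (simplified stand-in for
    the paper's criterion)."""
    d = len(frames[0])
    cols = [[f[i] for f in frames] for i in range(d)]
    pairs = [(abs(pearson(cols[i], cols[j])), i, j)
             for i in range(d) for j in range(i + 1, d)]
    pairs.sort(reverse=True)
    return [(i, j) for _, i, j in pairs[:k]]
```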




Last Update:  February 4, 1999         Ingo Höntsch