Chair: Jay Wilpon, AT&T Bell Laboratories (USA)
R.C. Rose, AT&T Bell Laboratories (USA)
B.H. Juang, AT&T Bell Laboratories (USA)
C.H. Lee, AT&T Bell Laboratories (USA)
A procedure is proposed for verifying the occurrence of string hypotheses produced by a hidden Markov model (HMM) based continuous speech recognizer. Most existing procedures verify word hypotheses through likelihood ratio scores computed using ad hoc approximations for the density of the alternative hypothesis in the denominator of the likelihood ratio statistic. The discriminative training procedure described in this paper adjusts the parameters of both the null hypothesis and the alternative hypothesis models to increase the power of a hypothesis test for utterance verification. The training procedure was evaluated on its ability to detect a twenty-word vocabulary in a subset of the Switchboard conversational speech corpus. Experimental results show that the procedure yields a significant improvement in the word verification operating characteristic, as well as an improvement in overall system performance.
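As an illustration of the likelihood-ratio test underlying such verification procedures, the following minimal sketch scores an utterance against a null (target) model and an alternative model and accepts the word hypothesis when the log-likelihood ratio clears a threshold. The diagonal-Gaussian densities and function names are illustrative stand-ins for the HMM-based models described in the abstract, not the paper's actual implementation.

    import numpy as np

    def diag_gauss_loglik(obs, mean, var):
        # Frame-wise log-likelihood under a diagonal-covariance Gaussian;
        # a stand-in for the HMM likelihoods used in the paper.
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (obs - mean) ** 2 / var, axis=-1)

    def log_likelihood_ratio(obs, null_model, alt_model):
        # obs: (T, D) acoustic frames; each model is a (mean, var) pair.
        return (diag_gauss_loglik(obs, *null_model).sum()
                - diag_gauss_loglik(obs, *alt_model).sum())

    def verify(obs, null_model, alt_model, threshold=0.0):
        # Accept the hypothesis when the utterance-level log-likelihood ratio
        # exceeds the threshold; sweeping the threshold traces out the
        # verification operating characteristic.
        return log_likelihood_ratio(obs, null_model, alt_model) >= threshold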
Mazin G. Rahim, AT&T Bell Laboratories (USA)
Chin-Hui Lee, AT&T Bell Laboratories (USA)
Biing-Hwang Juang, AT&T Bell Laboratories (USA)
Utterance verification represents an important technology in the design of user-friendly speech recognition systems. This paper addresses the issue of robustness in utterance verification. Four different approaches to robustness have been investigated: a string-based likelihood measure for the detection of non-vocabulary words and "putative" errors, a signal bias removal method for channel normalization, an on-line adaptation technique for achieving a desirable trade-off between false rejections and false alarms, and a discriminative training method for minimizing the expected string error rate. When these techniques were all integrated into a state-of-the-art connected digit recognition system, the string error rate was found to decrease by up to 57% at a rejection rate of 5%. For non-vocabulary word strings, the proposed utterance verification system rejected over 99.9% of extraneous speech.
Hiroshi Kanazawa, Toshiba Corporation (JAPAN)
Mitsuyoshi Tachimori, Toshiba Corporation (JAPAN)
Yoichi Takebayashi, Toshiba Corporation (JAPAN)
We propose a new wordspotting method that combines word-based pattern matching and phoneme-based HMMs. Word-based pattern matching, based on the time-frequency representation of a whole word pattern, is robust against pattern variations and background noise, while the phoneme-based HMM, which represents phonemic features within a word pattern, is flexible enough to allow vocabulary expansion. Because they operate at different scopes, the two approaches have complementary characteristics in terms of robustness and accuracy. To take advantage of both, we integrate the two types of wordspotting results under a unified criterion. A syntactic and semantic parser is also used to prune the wordspotting results for spontaneous speech understanding. Experimental results indicate the effectiveness of the proposed method.
T. Schultz, University of Karlsruhe (GERMANY) and Carnegie Mellon University (USA)
I. Rogina, University of Karlsruhe (GERMANY) and Carnegie Mellon University (USA)
In this paper, several improvements of our speech-to-speech translation system JANUS-2 on spontaneous human-to-human dialogs are presented. Common phenomena in spontaneous speech are described, followed by a classification of different types of noises. To handle the variety of spontaneous effects in human-to-human dialogs, special noise models are introduced representing both human and nonhuman noises, as well as word fragments. It is shown that both the acoustic and the language modeling of these noises increase recognition performance significantly. In the experiments, a clustering of the noise classes is performed and the resulting cluster variants are compared, making it possible to determine the best trade-off between sensitivity and trainability of the models.
Mitchel Weintraub, SRI International (USA)
A new scoring algorithm has been developed for generating wordspotting hypotheses and their associated scores. This technique uses a large-vocabulary continuous speech recognition (LVCSR) system to generate the N-best answers along with their Viterbi alignments. The score for a putative hit is computed by summing the likelihoods of all hypotheses that contain the keyword and normalizing by the sum of all hypothesis likelihoods in the N-best list. Using a test set of conversational speech from Switchboard Credit Card conversations, we achieved an 81% figure of merit (FOM). Our word recognition error rate on this same test set is 54.7%.
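The scoring rule described above amounts to a keyword posterior estimated from the N-best list. A minimal sketch follows, assuming the list is available as (word sequence, log-likelihood) pairs; this representation is hypothetical, not the recognizer's actual interface.

    import numpy as np

    def keyword_score(nbest, keyword):
        # nbest: list of (word_sequence, log_likelihood) pairs from the LVCSR decoder.
        logliks = np.array([ll for _, ll in nbest])
        contains = np.array([keyword in words for words, _ in nbest])
        if not contains.any():
            return 0.0
        # Work in the log domain (log-sum-exp) so raw likelihoods do not underflow.
        m = logliks.max()
        total = np.exp(logliks - m).sum()
        keyword_mass = np.exp(logliks[contains] - m).sum()
        # Fraction of N-best probability mass assigned to hypotheses containing the keyword.
        return keyword_mass / total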
Chakib Tadj, Telecom Paris (FRANCE)
Franck Poirier, Telecom Paris (FRANCE)
In this paper, we present a novel hybrid keyword spotting system that combines supervised and unsupervised competitive learning algorithms. The first stage is a SOFM module specifically designed to discriminate between keywords (KWs) and non-keywords (NKWs). The second stage is an FDVQ (Fuzzy Dynamic Vector Quantization) module that discriminates among the KWs passed on by the first stage. Because FDVQ was not designed to represent acoustic garbage models, our standard FDVQ-based keyword spotter relied on threshold criteria to reject NKWs; this led us to introduce the upstream SOFM module designed for that task. The results show an improvement of about 9% in the accuracy of the system compared to our standard one.
Stephen V. Kosonocky, IBM T.J. Watson Research Center (USA)
Richard J. Mammone, Rutgers University (USA)
A new classifier is described that combines the discriminatory ability of the neural tree network (NTN) with the Gaussian mixture model to create a continuous density neural tree network (CDNTN). The CDNTN is used within a hidden Markov model (HMM), along with a nonparametric state duration model, to construct a continuous word spotting system for real-time applications. The new word spotting system does not use a general background model, allowing construction of independent keyword models whose performance does not depend on the number of models in the recognition system and supporting a direct parallel implementation. Although HMM word spotting systems are shown to provide good performance when sufficient training data is available, the CDNTN word spotting system is shown to outperform comparable HMM systems in applications where background speech data is not available or only a limited number of training tokens is available.
G.J.F. Jones, Cambridge University (UK)
J.T. Foote, Cambridge University (UK)
K. Sparck Jones, Cambridge University (UK)
S.J. Young, Cambridge University (UK)
The goal of the Video Mail Retrieval project is to integrate state-of-the-art document retrieval methods with high accuracy word spotting to yield a robust and efficient retrieval system. This paper describes a preliminary study to determine the extent to which retrieval precision is affected by word spotting performance. It includes a description of the database design, the word spotting algorithm, and the information retrieval method used. Results are presented which show audio retrieval performance very close to that of text.
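One simple way to couple spotted keywords with standard document retrieval is to rank messages by a term-weighted sum over putative keyword hits, discarding hits whose spotting score falls below a threshold. The sketch below uses inverse document frequency weighting over hypothetical hit lists; it illustrates the general combination only and is not the project's actual retrieval method.

    import math
    from collections import defaultdict

    def rank_messages(hits_per_message, query_keywords, score_threshold=0.5):
        # hits_per_message: {message_id: [(keyword, spotting_score), ...]}
        # Count, per keyword, how many messages contain at least one confident hit.
        doc_freq = defaultdict(int)
        n_messages = len(hits_per_message)
        for hits in hits_per_message.values():
            for kw in {k for k, s in hits if s >= score_threshold}:
                doc_freq[kw] += 1

        # Score each message by summing idf weights of confidently spotted query keywords.
        ranking = {}
        for msg_id, hits in hits_per_message.items():
            spotted = {k for k, s in hits if s >= score_threshold}
            ranking[msg_id] = sum(
                math.log((1 + n_messages) / (1 + doc_freq[kw]))
                for kw in query_keywords if kw in spotted
            )
        return sorted(ranking, key=ranking.get, reverse=True)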
Jerry H. Wright, Ensigma Limited (UK)
Michael J. Carey, Ensigma Limited (UK)
Eluned S. Parris, Ensigma Limited (UK)
Keywords are chosen on the basis of their usefulness for discriminating a topic from background speech. Good topic recognition can be achieved with a small set of well-chosen keywords, but particular combinations of keywords often achieve better discrimination than can be accounted for by regarding them as independent. This paper describes a higher-order statistical approach involving models of keyword-topic interdependence. A linear-logistic model brings some improvement in performance, but better results are obtained using log-linear contingency table models. Although the potential number of these is very large, good models tend to be simple and are suggested by heuristic measures inferred from the training data. The approach is tested using a broadcast radio database.
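For illustration, a linear-logistic topic score over keyword presence indicators might take the following form, with optional pairwise interaction terms capturing the keyword interdependence discussed above. The coefficients and data structures are hypothetical; the paper's preferred log-linear contingency table models are not reproduced here.

    import math

    def topic_probability(keyword_counts, weights, bias=0.0, interactions=None):
        # keyword_counts: {keyword: count in the transcript}
        # weights: {keyword: coefficient}; interactions: {(kw_a, kw_b): coefficient}
        present = {k for k, c in keyword_counts.items() if c > 0}
        score = bias + sum(w for k, w in weights.items() if k in present)
        for (a, b), w in (interactions or {}).items():
            if a in present and b in present:
                score += w  # reward (or penalize) co-occurring keyword pairs
        return 1.0 / (1.0 + math.exp(-score))  # logistic link: probability the topic is present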
Takeshi Kawabata, NTT Basic Research Labs (JAPAN)
This paper describes a new stochastic topic focusing mechanism for reducing the perplexity of natural spoken languages. In this mechanism, a predictive context-free grammar (CFG) parser analyzes input speech and generates grammar-rule sequences. These rule sequences drive a hidden Markov model (HMM), and the current topic is estimated as the HMM state distribution. The CFG rule probabilities are dynamically changed according to this topic state distribution. Evaluation of this mechanism using a large dialog text database confirms that it can effectively reduce the task perplexity.
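A minimal sketch of the topic-tracking step: the topic-state distribution is updated by a standard HMM forward recursion over the grammar-rule sequence emitted by the parser, and focused CFG rule probabilities are obtained by mixing per-topic rule distributions with that posterior. The matrices and function names below are illustrative assumptions, not taken from the paper.

    import numpy as np

    def update_topic_state(state_dist, rule_id, trans, emit):
        # state_dist: (S,) current topic-state distribution
        # trans: (S, S) topic transition matrix; emit: (S, R) rule probabilities per topic
        predicted = state_dist @ trans            # predict the next topic state
        posterior = predicted * emit[:, rule_id]  # weight by likelihood of the observed rule
        return posterior / posterior.sum()        # renormalize to a distribution

    def focused_rule_probabilities(state_dist, emit):
        # Topic-dependent CFG rule probabilities: a mixture of per-topic rule
        # distributions weighted by the current topic-state posterior.
        return state_dist @ emit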