Chair: Roni Rosenfeld, Carnegie Mellon University (USA)
P. Srinivasa Rao, IBM T. J. Watson Research Center (USA)
Michael D. Monkowski, IBM T. J. Watson Research Center (USA)
Salim Roukos, IBM T. J. Watson Research Center (USA)
Statistical language models improve the performance of speech recognition systems by providing estimates of the a priori probabilities of word sequences. The commonly used trigram language models estimate the conditional probability of a word given the previous two words from a large corpus of text. The text corpus is often a collection of several small, diverse segments such as newspaper articles or conversations on different topics. Knowledge of the current topic can be used to adapt the general trigram language model to match that topic closely; for example, the general language model can be interpolated with one built on the topic data. We first discuss the adaptation of general trigram language models to a known topic using the minimum discrimination information (MDI) method. We then present results on the Switchboard corpus, which consists of telephone conversations on several topics.
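A minimal sketch of the interpolation idea mentioned above; the weight, probability tables and floor value are illustrative assumptions, not the paper's MDI formulation:

```python
# Hypothetical linear interpolation of a general trigram estimate with a
# topic-specific one; `lam` and the toy tables are assumptions for the
# example, not the MDI adaptation described in the paper.
def interpolated_prob(word, history, p_general, p_topic, lam=0.8, floor=1e-7):
    """P(word | history) as a mixture of general and topic estimates."""
    return (lam * p_general.get((history, word), floor)
            + (1.0 - lam) * p_topic.get((history, word), floor))

# Toy probability tables keyed by (two-word history, next word).
p_general = {(("of", "the"), "court"): 0.010, (("of", "the"), "game"): 0.020}
p_topic   = {(("of", "the"), "game"): 0.150}   # e.g. built from sports-topic text
print(interpolated_prob("game", ("of", "the"), p_general, p_topic))
```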
Masafumi Tamoto, NTT Basic Research Labs (JAPAN)
Takeshi Kawabata, NTT Basic Research Labs (JAPAN)
This paper describes a word clustering technique for stochastic language modeling and reports experimental evidence for its validity. The Binomial Posteriori Distribution (BPD) distance measure between words is introduced; it is based on word co-occurrence and reliability. We plan to consider a practical application of this clustering technique by using each cluster as a Markov state in the construction of a word prediction model.
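The BPD measure itself is not reproduced here; as a rough illustration of distance-based word clustering from co-occurrence statistics, the sketch below uses Jensen-Shannon divergence between toy next-word distributions as a stand-in:

```python
# Stand-in for a co-occurrence-based word distance: Jensen-Shannon divergence
# between toy next-word distributions (the paper's BPD measure, which also
# models reliability, is not reproduced here).
import math

def js_divergence(p, q):
    """Symmetric divergence between two next-word distributions."""
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}
    def kl(a):
        return sum(pa * math.log(pa / m[k]) for k, pa in a.items() if pa > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy next-word distributions; the closest pair is a candidate cluster.
contexts = {
    "monday":  {"morning": 0.6, "night": 0.4},
    "tuesday": {"morning": 0.5, "night": 0.5},
    "pizza":   {"slice": 0.7, "box": 0.3},
}
pairs = [(js_divergence(contexts[a], contexts[b]), a, b)
         for a in contexts for b in contexts if a < b]
print(min(pairs))
```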
Sabine Deligne, Telecom Paris (FRANCE)
Frederic Bimbot, Telecom Paris (FRANCE)
The multigram model assumes that language can be described as the output of a memoryless source that emits variable-length sequences of words. The estimation of the model parameters can be formulated as a Maximum Likelihood estimation problem from incomplete data. We show that estimates of the model parameters can be computed through an iterative Expectation-Maximization algorithm and we describe a forward-backward procedure for its implementation. We report the results of a systematic evaluation of multigrams for language modeling on the ATIS database. The objective performance measure is the test set perplexity. Our results show that multigrams outperform conventional n-grams for this task.
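A minimal sketch of the forward recursion implied by the model, with toy sequence probabilities rather than estimates from ATIS:

```python
# Forward recursion for a multigram model: the likelihood of a word string
# is the sum over all segmentations into sequences of at most `max_len`
# words, each emitted independently by a memoryless source.  The sequence
# probabilities below are toy values, not estimates from ATIS.
def multigram_likelihood(words, seq_prob, max_len=3):
    n = len(words)
    alpha = [0.0] * (n + 1)      # alpha[t]: likelihood of the first t words
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for l in range(1, min(max_len, t) + 1):
            seq = tuple(words[t - l:t])
            alpha[t] += alpha[t - l] * seq_prob.get(seq, 0.0)
    return alpha[n]

seq_prob = {("show", "me"): 0.020, ("show",): 0.010, ("me",): 0.030,
            ("flights",): 0.005, ("me", "flights"): 0.001}
print(multigram_likelihood(["show", "me", "flights"], seq_prob))
```

The EM re-estimation step, not shown, would accumulate expected sequence counts over all segmentations via a matching backward pass and renormalise them into new sequence probabilities.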
Harvey Lloyd-Thomas, Ensigma Limited
Jerry H. Wright, Ensigma Limited
Gareth J.F. Jones, University of Cambridge (UK)
This paper describes a language model in which context-free grammar rules are integrated into an n-gram framework, complementing it instead of attempting to replace it. This releases the grammar from the aim of parsing sentences overall (which is often undesirable as well as unrealistic), enabling it to be employed selectively in modelling phrases that are identifiable within a flow of speech. Algorithms for model training, and for sentence scoring and interpretation are described. All are based on the principle of summing over paths that span the sentence, but implementation is node-based for efficiency. Perplexity results for this system (using a hierarchy of grammars from empty to full-coverage) are compared with those for n-gram models, and the system is used for re-scoring N-best sentence lists for a speaker-independent recogniser.
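As an illustration of the path-summing principle (not the paper's node-based implementation), the sketch below sums over segmentations in which spans covered by toy grammar phrases use a phrase probability while remaining words fall back to a simple word model:

```python
# Path-summing sketch: spans recognised as (toy) grammar phrases contribute a
# phrase probability, all other words fall back to a simple word probability
# (a unigram here for brevity, rather than the n-gram used in the paper).
def path_sum_score(words, phrase_prob, word_prob):
    n = len(words)
    score = [0.0] * (n + 1)
    score[0] = 1.0
    for t in range(1, n + 1):
        # paths in which word t comes from the fallback word model
        score[t] += score[t - 1] * word_prob.get(words[t - 1], 1e-6)
        # paths in which a grammar phrase ends at position t
        for phrase, p in phrase_prob.items():
            l = len(phrase)
            if t >= l and tuple(words[t - l:t]) == phrase:
                score[t] += score[t - l] * p
    return score[n]

phrase_prob = {("at", "three", "pm"): 0.004}
word_prob = {"arriving": 0.001, "at": 0.020, "three": 0.005, "pm": 0.003}
print(path_sum_score(["arriving", "at", "three", "pm"], phrase_prob, word_prob))
```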
Sheryl R. Young, Carnegie Mellon University (USA)
In real spoken language applications, where speakers interact spontaneously, there is much apparent unpredictability that makes recognition difficult. Multi-speaker spontaneous dialog, where two speakers interact verbally to cooperatively solve a mutual, shared problem, is more varied than human-computer interaction. Spontaneous speech is not well structured, exhibiting mid-utterance corrections and restarts. Discourse contains digressions, clarifications, corrections and topic changes. Multi-speaker discourse is even more varied, with initiative effects and speakers interacting, planning and responding. This makes it extremely difficult to develop grammars and language models with adequate coverage and reliable stochastic parameters. Perplexity increases and recognition degrades considerably vis-a-vis human-database dialog. In spite of all this, multi-speaker dialogs are structured and predictable when the discourse is appropriately modelled. We have developed heuristics to model spontaneous speech and multi-speaker dialogs [4,8]. The underlying heuristics have been evaluated and shown to adequately and accurately predict discourse phenomena on a corpus of more than 10,000 utterances. The heuristics for computing discourse structure and deriving constraints from it are rule based. We have used these rules to develop a set of stochastic recursive transition networks (RTNs) that capture both the rules and the corpus probabilities. The resulting language model can be used predictively to dynamically generate stochastic utterance predictions, or it can be incorporated into any recognition/understanding system where a single prior state is maintained.
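A much-simplified, flat stand-in for such predictive use of a stochastic transition network; the discourse states, moves and probabilities are invented for the example:

```python
# Simplified stand-in for a stochastic transition network over discourse
# states: each state carries a distribution over the next dialogue move,
# which can be used to re-weight recogniser hypotheses.  States, moves and
# probabilities are invented, not taken from the paper.
transitions = {
    "propose_step": {"accept": 0.55, "clarify": 0.25, "counter": 0.20},
    "clarify":      {"answer": 0.70, "restate": 0.30},
}

def predict_next(state):
    """Return predicted next moves, most probable first."""
    return sorted(transitions[state].items(), key=lambda kv: -kv[1])

print(predict_next("propose_step"))
```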
Reinhard Kneser, Philips GmbH Research Laboratories
Hermann Ney, RWTH Aachen University of Technology (GERMANY)
In stochastic language modeling, backing-off is a widely used method to cope with the sparse-data problem. In the case of unseen events, this method backs off to a less specific distribution. In this paper we propose to use distributions which are especially optimized for the task of backing-off. Two different theoretical derivations lead to distributions which are quite different from the probability distributions usually used for backing-off. Experiments show an improvement of about 10% in terms of perplexity and 5% in terms of word error rate.
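One way to build a lower-order distribution specifically for backing-off is to base it on how many distinct contexts a word completes rather than on its raw frequency; the toy sketch below illustrates that idea only, and omits the discounting details and the paper's derivations:

```python
# Toy illustration: a backing-off distribution based on the number of distinct
# contexts a word completes.  The bigram list is invented and the discounting
# of the higher-order model is omitted.
from collections import defaultdict

bigrams = [("san", "francisco"), ("los", "angeles"), ("new", "york"),
           ("the", "house"), ("a", "house"), ("my", "house")]

contexts_of = defaultdict(set)
for h, w in bigrams:
    contexts_of[w].add(h)

total_types = sum(len(c) for c in contexts_of.values())

def backoff_prob(w):
    """Fraction of distinct bigram types that end in w."""
    return len(contexts_of[w]) / total_types

# "francisco" occurs only after "san", so it receives little backing-off
# mass even if it is frequent; "house" follows many different contexts.
print(backoff_prob("francisco"), backoff_prob("house"))
```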
G. Bordel, Universidad del Pais Vasco
I. Torres, Universidad del Pais Vasco
E. Vidal, Universidad Politecnica de Valencia (SPAIN)
N-grams have been extensively and successfully used for language modelling in continuous speech recognition tasks. On the other hand, it has recently been shown that k-testable stochastic languages (k-TS) are strictly equivalent to N-grams. A major problem to be solved when using a language model is the estimation of the probabilities of events not represented in the training corpus, i.e. unseen events. The aim of this work is to improve other well-established smoothing procedures by interpolating models with different levels of complexity (Quality Weighted Interpolation, QWI). The effect of QWI was experimentally evaluated over a set of back-off smoothed k-TS language models. These experiments were carried out over several corpora using the test-set perplexity as an evaluation criterion. In all cases the introduction of QWI resulted in a reduction of the test-set perplexity.
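For reference, the test-set perplexity used as the evaluation criterion here is the inverse geometric mean of the probabilities the model assigns to the test words; a minimal sketch with a placeholder model:

```python
# Test-set perplexity: the inverse geometric mean of the probabilities a
# model assigns to the test words.  The model below is a placeholder.
import math

def perplexity(test_words, prob_fn):
    log_sum = sum(math.log(prob_fn(w)) for w in test_words)
    return math.exp(-log_sum / len(test_words))

# A uniform model over a 1000-word vocabulary gives perplexity 1000.
print(perplexity(["any", "test", "text"], lambda w: 1.0 / 1000))
```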
Daniel Jurafsky, University of California at Berkeley
Chuck Wooters, Department of Defense
Jonathan Segal, University of California at Berkeley
Andreas Stolcke, SRI International
Eric Fosler, University of California at Berkeley
Gary Tajchman, Voice Processing Corporation
Nelson Morgan, University of California at Berkeley (USA)
This paper describes a number of experiments in adding new grammatical knowledge to the Berkeley Restaurant Project (BeRP), our medium-vocabulary (1,300 word), speaker-independent, spontaneous continuous-speech understanding system (Jurafsky et al., 1994). We describe an algorithm for using a probabilistic Earley parser and a stochastic context-free grammar (SCFG) to generate word transition probabilities at each frame for a Viterbi decoder. We show that using an SCFG as a language model improves word error rate from 34.6% (bigram) to 29.6% (SCFG), and semantic sentence recognition error from 39.0% (bigram) to 34.1% (SCFG). In addition, we obtain a further reduction to 28.8% word error by mixing the bigram and SCFG LMs. We also report preliminary results from using discourse-context information in the LM.
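A toy sketch of how per-word language-model transition probabilities can enter a Viterbi-style update at a word boundary; the scores, the LM weight and the candidate set are invented, and the Earley-based SCFG prediction itself is not shown:

```python
# Combining acoustic scores with language-model transition probabilities at a
# word boundary.  All values are invented for illustration.
import math

def best_word_transition(prev_score, acoustic_logp, lm_prob, lm_weight=8.0):
    """Pick the best next word given acoustic and language-model evidence."""
    best = None
    for w, lp in lm_prob.items():
        score = prev_score + acoustic_logp[w] + lm_weight * math.log(lp)
        if best is None or score > best[1]:
            best = (w, score)
    return best

lm_prob = {"fee": 0.01, "three": 0.20}          # e.g. from the bigram/SCFG mix
acoustic_logp = {"fee": -40.0, "three": -42.0}  # acoustics slightly prefer "fee"
print(best_word_transition(0.0, acoustic_logp, lm_prob))
```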
Klaus Ries, University of Karlsruhe (GERMANY)
Finn Dag Buø, University of Karlsruhe (GERMANY)
Ye-Yi Wang, Carnegie Mellon University (USA)
Alex Waibel, University of Karlsruhe (GERMANY)
A new method for the unsupervised acquisition of structural text models typically reduces corpus perplexity by more than 30% compared to advanced n-gram models. The method is based on new algorithms for the classification of words and phrases from context and on new sequence-finding procedures. These procedures are designed to work fast and accurately on small and large corpora. They are iterated to build a structural model of a corpus. The structural model can be applied to rescore the hypotheses of a speech recognizer and improves the word accuracy. Further applications that exploit the structure-finding capabilities of this model, such as preprocessing for neural networks and (hidden) Markov models in language processing, are proposed.
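One ingredient of such structure finding, sketched with a stand-in association measure (frequency-weighted pointwise mutual information) rather than the paper's exact procedures: merge the most strongly associated adjacent word pair into a phrase token, then iterate.

```python
# Find the most strongly associated adjacent word pair (frequency-weighted
# pointwise mutual information, a stand-in measure) and merge it into a
# single phrase token; repeating this builds up phrase structure.
import math
from collections import Counter

def best_phrase(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    def score(pair):
        a, b = pair
        return bigrams[pair] * math.log(bigrams[pair] * n / (unigrams[a] * unigrams[b]))
    return max(bigrams, key=score)

corpus = "i want to fly to new york from new york tonight".split()
pair = best_phrase(corpus)
merged = " ".join(corpus).replace(" ".join(pair), "_".join(pair)).split()
print(pair, merged)
```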
Claudia Pateras, McGill University (CANADA)
Gregory Dudek, McGill University (CANADA)
Renato De Mori, McGill University (CANADA)
In the domain of mobile robotic task execution under dialogue control, a primary goal is to identify the task target which is specified by a natural language description. A number of concepts are expressed in the user spoken language by vague terms like "the big box" and "very close to the door." We use fuzzy logic to map these vague terms onto the quantitative data collected by system sensors. Fuzziness may cause uncertainty in interpretation and, in particular, in understanding references. This uncertainty is abated by collecting additional information through queries to the user and autonomous sensing. Entropy is used to select the queries having the greatest discriminatory power among referent candidates. In addition, we examine the trade-off between querying, sensing and uncertainty. A framework to deal with each of these issues has been developed and will be presented.
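A small sketch of entropy-guided query selection over referent candidates with fuzzy membership values; the candidates, attributes and values are invented for the example:

```python
# Pick the attribute to ask about whose yes/no answer is expected to reduce
# the entropy over referent candidates the most.  All values are invented.
import math

candidates = {
    "box_1": {"big": 0.9, "near_door": 0.2},
    "box_2": {"big": 0.8, "near_door": 0.9},
    "box_3": {"big": 0.1, "near_door": 0.8},
}

def entropy(weights):
    total = sum(weights)
    return -sum((w / total) * math.log(w / total, 2) for w in weights if w > 0)

def expected_entropy_after(attribute):
    """Average candidate-set entropy over a yes/no answer about `attribute`."""
    yes = [c[attribute] for c in candidates.values()]
    no = [1.0 - v for v in yes]
    p_yes = sum(yes) / len(yes)
    return p_yes * entropy(yes) + (1.0 - p_yes) * entropy(no)

attributes = next(iter(candidates.values())).keys()
print(min(attributes, key=expected_entropy_after))   # most discriminative query
```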