Session: SPEECH-P13
Time: 1:00 - 3:00, Friday, May 11, 2001
Location: Exhibit Hall Area 7
Title: Language Modeling and Identification
Chair: Ciprian Chelba

1:00, SPEECH-P13.1
COMPOSITE BACKGROUND MODELS AND SCORE STANDARDIZATION FOR LANGUAGE IDENTIFICATION SYSTEMS
T. GLEASON, M. ZISSMAN
This paper describes two enhancements to our language identification system. Composite background (CBG) modeling allows us to identify target-language speech in an environment where labeled background training data is unavailable or limited. Instead of separate models for each of the background languages, a single composite background model is created from all the non-target training speech. Generally, the CBG system performed about as well as a baseline system containing a separate model per background language. The average equal error rate for 12 CBG tests was 13.6% versus 13.4% for the baseline. We have also developed and tested a standardized confidence scoring function based on a single-layer perceptron, which has proven capable of robustly modeling score distributions.
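A minimal sketch of the pooling idea (not the authors' actual system): one mixture model per target language, and a single composite background model trained on all pooled non-target speech. The scikit-learn GMMs and the feature arrays named in the comments are assumptions for illustration only.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_gmm(features, n_components=8, seed=0):
        """Fit a diagonal-covariance GMM to a (frames x dims) feature matrix."""
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        gmm.fit(features)
        return gmm

    def cbg_llr(test_feats, target_gmm, cbg_gmm):
        """Average log-likelihood ratio of the target model versus the single
        composite background model trained on pooled non-target speech."""
        return target_gmm.score(test_feats) - cbg_gmm.score(test_feats)

    # target_gmm = train_gmm(target_feats)                  # hypothetical arrays
    # cbg_gmm = train_gmm(np.vstack(nontarget_feats_list))  # pool all non-target data
    # llr = cbg_llr(test_feats, target_gmm, cbg_gmm)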

1:00, SPEECH-P13.2
IMPROVING TRIGRAM LANGUAGE MODELING WITH THE WORLD WIDE WEB
X. ZHU, R. ROSENFELD
We propose a novel method for using the World Wide Web to acquire trigram estimates for statistical language modeling. We submit an N-gram as a phrase query to web search engines. The search engines return the number of web pages containing the phrase, from which the N-gram count is estimated. The N-gram counts are then used to form web-based trigram probability estimates. We discuss the properties of such estimates, and methods to interpolate them with traditional corpus-based trigram estimates. We show that the interpolated models improve speech recognition word error rate significantly over a small test set.
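The interpolation step might look roughly as follows; this is a sketch under assumptions, not the authors' exact smoothing. The callables web_phrase_count and corpus_trigram_prob are hypothetical stand-ins for the search-engine query and the conventional corpus model.

    def web_trigram_prob(w1, w2, w3, web_phrase_count):
        """Estimate P(w3 | w1 w2) from web page counts of the two phrases."""
        tri = web_phrase_count(f"{w1} {w2} {w3}")
        bi = web_phrase_count(f"{w1} {w2}")
        return tri / bi if bi > 0 else 0.0

    def interpolated_prob(w1, w2, w3, lam, web_phrase_count, corpus_trigram_prob):
        """Linear interpolation of the web-based and corpus-based estimates."""
        p_web = web_trigram_prob(w1, w2, w3, web_phrase_count)
        p_corpus = corpus_trigram_prob(w3, w1, w2)
        return lam * p_web + (1.0 - lam) * p_corpus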

1:00, SPEECH-P13.3
DIALOG-CONTEXT DEPENDENT LANGUAGE MODELING COMBINING N-GRAMS AND STOCHASTIC CONTEXT FREE GRAMMARS
K. HACIOGLU, W. WARD
In this paper, we present our research on dialog-dependent language modeling. In accordance with a speech (or sentence) production model in a discourse, we split language modeling into two components, namely dialog-dependent concept modeling and syntactic modeling. The concept model is conditioned on the last question prompted by the dialog system and is structured using n-grams. The syntactic model, which consists of a collection of stochastic context-free grammars, one for each concept, describes word sequences that may be used to express the concepts. The resulting LM is evaluated by rescoring N-best lists. We report a significant perplexity improvement with a moderate word error rate reduction within the context of the CU Communicator system, a dialog system for making travel plans by accessing information about flights, hotels, and car rentals.
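A toy sketch of the two-level factorization described above: a prompt-conditioned bigram over concepts combined with a per-concept word-sequence score standing in for the stochastic CFG inside probability. The tables and callables here are hypothetical illustrations, not the CU Communicator components.

    import math

    def score_hypothesis(concepts, word_spans, prompt,
                         concept_bigram, concept_scfg_logprob):
        """log P(W, C | prompt) = sum_i [ log P(c_i | c_{i-1}, prompt)
                                          + log P(words_i | c_i) ]."""
        logp = 0.0
        prev = "<s>"
        for concept, words in zip(concepts, word_spans):
            logp += math.log(concept_bigram[prompt][(prev, concept)])
            logp += concept_scfg_logprob(concept, words)  # SCFG score in the paper
            prev = concept
        return logp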

1:00, SPEECH-P13.4
USE OF NON-NEGATIVE MATRIX FACTORIZATION FOR LANGUAGE MODEL ADAPTATION IN LECTURE TRANSCRIPTION TASK
M. NOVAK, R. MAMMONE
The use of non-negative matrix factorization (NMF) for language model adaptation is presented. This is an alternative to latent semantic analysis (LSA)-based language modeling, which relies on singular value decomposition (SVD). Potential benefits are discussed. A new method, which does not require an explicit document segmentation of the training corpus, is also presented. This method resulted in a perplexity reduction of 16% on a database of biology lecture transcriptions.
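A rough sketch of NMF-based adaptation of a unigram distribution; the paper's segmentation-free training and its interpolation with the background n-gram are omitted, and the count matrix X (documents x vocabulary) and the adaptation count vector are hypothetical inputs.

    import numpy as np
    from sklearn.decomposition import NMF

    def adapted_unigram(X, adapt_counts, n_topics=20):
        """Factor X ~ W @ H, infer the topic mixture of the adaptation text,
        and return an adapted unigram distribution over the vocabulary."""
        nmf = NMF(n_components=n_topics, init="nndsvda", max_iter=500)
        nmf.fit(X)
        H = nmf.components_                         # topic-by-word weights
        p_w_given_t = H / H.sum(axis=1, keepdims=True)
        theta = nmf.transform(adapt_counts.reshape(1, -1))[0]
        theta = theta / max(theta.sum(), 1e-12)     # normalized topic mixture
        return theta @ p_w_given_t                  # adapted P(w)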

1:00, SPEECH-P13.5
PORTABILITY OF SYNTACTIC STRUCTURE FOR LANGUAGE MODELING
C. CHELBA
The paper presents a study on the portability of statistical syntactic knowledge in the framework of the structured language model (SLM). We investigate the impact of porting SLM statistics from the Wall Street Journal (WSJ) to the Air Travel Information System (ATIS) domain. We compare this approach to applying the Microsoft rule-based parser to the ATIS data and to using a small amount of data manually parsed at UPenn for gathering the initial SLM statistics. Surprisingly, despite its modest perplexity performance, the model initialized on WSJ parses outperforms the other initialization methods based on in-domain annotated data, achieving a significant 0.4% absolute (7% relative) reduction in word error rate (WER) over a baseline system whose WER is 5.8%; measured relative to the minimum WER achievable on the N-best lists we worked with, the improvement is 12%.

1:00, SPEECH-P13.6
EFFICIENT CLASS-BASED LANGUAGE MODELLING FOR VERY LARGE VOCABULARIES
E. WHITTAKER, P. WOODLAND
This paper investigates the perplexity and word error rate performance of two different forms of class model and the respective data-driven algorithms for obtaining automatic word classifications. The computational complexity of the algorithm for the 'conventional' two-sided class model is found to be unsuitable for very large vocabularies (>100k) or large numbers of classes (>2000). A one-sided class model is therefore investigated and the complexity of its algorithm is found to be substantially less in such situations. Perplexity results are reported on both English and Russian data. For the latter both 65k and 430k vocabularies are used. Lattice rescoring experiments are also performed on an English language broadcast news task. These experimental results show that both models, when interpolated with a word model, perform similarly well. Moreover, classifications are obtained for the one-sided model in a fraction of the time required by the two-sided model, especially for very large vocabularies.
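For concreteness, a toy contrast of the two model forms; the probability tables are hypothetical and cls maps each word to its automatically derived class.

    def two_sided_prob(w, prev_w, cls, p_w_given_c, p_c_given_prevc):
        # 'Conventional' two-sided model:
        # P(w | prev_w) ~= P(w | c(w)) * P(c(w) | c(prev_w))
        return p_w_given_c[(w, cls[w])] * p_c_given_prevc[(cls[w], cls[prev_w])]

    def one_sided_prob(w, prev_w, cls, p_w_given_prevc):
        # One-sided model: the word is predicted directly from the previous
        # word's class, which is what makes the clustering algorithm far
        # cheaper for very large vocabularies.
        return p_w_given_prevc[(w, cls[prev_w])]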

1:00, SPEECH-P13.7
DATA AUGMENTATION AND LANGUAGE MODEL ADAPTATION
D. JANISZEK, R. DE MORI, F. BECHET
A method is presented for augmenting word n-gram counts in a matrix which represents a 2-gram Language Model (LM). This method is based on numerical distances in a reduced space obtained by Singular Value Decomposition (SVD). Rescoring word lattices in a spoken dialogue application using an LM containing augmented counts has led to a Word Error Rate (WER) reduction of 6.5%. By further interpolating augmented counts with the counts extracted from a very large newspaper corpus, but only for selected histories, a total WER reduction of 11.7% was obtained. We show that this approach gives better results than a global count interpolation for all histories of the LM.
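A minimal numpy sketch of the reduced-space idea; the paper's distance-based augmentation and its selective interpolation with the newspaper counts are more refined than this plain low-rank reconstruction.

    import numpy as np

    def augment_bigram_counts(C, rank=50):
        """C is a (histories x words) bigram count matrix; rare or unseen
        pairs receive extra mass from the rank-r reconstruction obtained
        in the reduced SVD space."""
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        C_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
        # Keep the observed counts; add non-negative reconstructed mass elsewhere.
        return np.maximum(C, np.clip(C_hat, 0.0, None))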

1:00, SPEECH-P13.8
USING SEMANTIC CLASS INFORMATION FOR RAPID DEVELOPMENT OF LANGUAGE MODELS WITHIN ASR DIALOGUE SYSTEMS
E. FOSLER-LUSSIER, H. KUO
When dialogue system developers tackle a new domain, much effort is required; the development of different parts of the system usually proceeds independently. Yet it may be profitable to coordinate development efforts between different modules. Here, we focus our efforts on extending small amounts of language model training data by integrating semantic classes that were created for a natural language understanding module. By converting finite-state parses of a training corpus into a probabilistic context-free grammar and subsequently generating artificial data from that grammar, we can significantly reduce perplexity and ASR word error rate in situations with little training data. Experiments are presented using data from the ATIS and DARPA Communicator travel corpora.
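A toy PCFG sampler illustrating the artificial-data step; the grammar below is a hypothetical placeholder, not the ATIS/Communicator semantic grammars used in the paper.

    import random

    PCFG = {
        "S":       [(["I", "want", "to", "fly", "FROMLOC", "TOLOC"], 1.0)],
        "FROMLOC": [(["from", "CITY"], 1.0)],
        "TOLOC":   [(["to", "CITY"], 1.0)],
        "CITY":    [(["boston"], 0.5), (["denver"], 0.5)],
    }

    def sample(symbol="S"):
        """Recursively expand nonterminals according to production probabilities."""
        if symbol not in PCFG:
            return [symbol]                      # terminal word
        rules, weights = zip(*PCFG[symbol])
        rhs = random.choices(rules, weights=weights, k=1)[0]
        return [w for sym in rhs for w in sample(sym)]

    # Generate artificial LM training sentences from the grammar:
    # corpus = [" ".join(sample()) for _ in range(10000)]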

1:00, SPEECH-P13.9
ON-LINE LEARNING OF LANGUAGE MODELS WITH WORD ERROR PROBABILITY DISTRIBUTIONS
R. GRETTER, G. RICCARDI
We are interested in the problem of learning stochastic language models on-line (without speech transcriptions) for adaptive speech recognition and understanding. In this paper we propose an algorithm to adapt to variations in the language model distributions based on the speech input only and without its true transcription. The on-line probability estimate is defined as a function of the prior and word error distributions. We show the effectiveness of word-lattice based error probability distributions in terms of Receiver Operating Characteristics (ROC) curves and word accuracy. We apply the new estimates P_{adapt}(w) to the task of adapting on-line an initial large vocabulary trigram language model and show improvement in word accuracy with respect to the baseline speech recognizer.
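A heuristic sketch of confidence-weighted on-line counting; this only approximates the paper's P_adapt(w), which combines the prior with word error distributions estimated from word lattices. The error_prob and prior callables and the (word, posterior) lattice pairs are hypothetical.

    from collections import defaultdict

    def update_counts(counts, lattice_words, error_prob):
        """Accumulate fractional counts, discounting words likely to be errors."""
        for word, posterior in lattice_words:
            counts[word] += posterior * (1.0 - error_prob(word, posterior))
        return counts

    def p_adapt(word, counts, prior, beta=0.5):
        """Mix the prior LM estimate with the on-line relative-frequency estimate."""
        total = sum(counts.values()) or 1.0
        return beta * prior(word) + (1.0 - beta) * counts[word] / total

    # counts = update_counts(defaultdict(float), lattice_words, error_prob)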

1:00, SPEECH-P13.10
CLASSES FOR FAST MAXIMUM ENTROPY TRAINING
J. GOODMAN
Maximum entropy models are considered by many to be one of the most promising avenues of language modeling research. Unfortunately, long training times make maximum entropy research difficult. We present a novel speedup technique: we change the form of the model to use classes. Our speedup works by creating two maximum entropy models, the first of which predicts the class of each word, and the second of which predicts the word itself. This factoring of the model leads to fewer non-zero indicator functions and faster normalization, achieving speedups of up to a factor of 35 over one of the best previous techniques. It also typically results in slightly lower perplexities. The same trick can be used to speed the training of other machine learning techniques, e.g. neural networks, applied to any problem with a large number of outputs, such as language modeling.
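A sketch of the factorization behind the speedup, with hypothetical model objects standing in for the two maximum entropy models: each factor is normalized over a much smaller set (the classes, or the members of one class) than the full vocabulary.

    def class_factored_prob(word, history, cls, class_model, word_models):
        """P(word | history) = P(cls(word) | history)
                               * P(word | cls(word), history)."""
        c = cls[word]
        # class_model normalizes over all classes; word_models[c] normalizes
        # only over the words belonging to class c.
        return class_model.prob(c, history) * word_models[c].prob(word, history)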