Abstracts: Session SP-17
SP-17.1
Discriminative Estimation of Interpolation Parameters for Language Model Classifiers
Volker Warnke,
Stefan Harbeck,
Elmar Noeth,
Heinrich Niemann,
Michael Levit (University Erlangen-Nuremberg)
In this paper we present a new approach for estimating the interpolation parameters of language models (LMs) which are used as classifiers. With classical maximum likelihood (ML) estimation, one theoretically needs a huge amount of data, and the underlying density assumption has to be correct. Usually at least one of these conditions is violated, so different optimization techniques such as maximum mutual information (MMI) and minimum classification error (MCE) can be used instead, where the interpolation parameters are not optimized on their own but jointly, taking all models into account. In this paper we show how MCE and MMI techniques can be applied to two different kinds of interpolation strategies: linear interpolation, the standard interpolation method, and rational interpolation. We compare ML, MCE, and MMI on the German part of the VERBMOBIL corpus, where we obtain a 3% reduction in classification error when discriminating between 18 dialog act classes.
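As a rough illustration of the classification setup only (not the authors' implementation), the sketch below builds one linearly interpolated bigram LM per dialog-act class and labels an utterance with the class whose LM assigns it the highest likelihood. The interpolation weights are exactly the parameters an ML, MMI, or MCE criterion would estimate; here they, and all data, are invented.

    # Sketch: linearly interpolated bigram LMs used as a classifier.
    # The lambdas are the interpolation parameters that ML, MMI, or MCE
    # estimation would tune; here they are fixed by hand for demonstration.
    from collections import Counter

    def train_lm(sentences):
        unigrams, bigrams = Counter(), Counter()
        for s in sentences:
            toks = ["<s>"] + s.split()
            unigrams.update(toks)
            bigrams.update(zip(toks, toks[1:]))
        return unigrams, bigrams

    def interp_prob(w, prev, unigrams, bigrams, lambdas, vocab_size):
        uni = unigrams[w] / max(sum(unigrams.values()), 1)
        bi = bigrams[(prev, w)] / max(unigrams[prev], 1)
        zerogram = 1.0 / vocab_size          # uniform floor
        return lambdas[0] * zerogram + lambdas[1] * uni + lambdas[2] * bi

    def score(sentence, lm, lambdas, vocab_size):
        unigrams, bigrams = lm
        toks = ["<s>"] + sentence.split()
        p = 1.0
        for prev, w in zip(toks, toks[1:]):
            p *= interp_prob(w, prev, unigrams, bigrams, lambdas, vocab_size)
        return p

    # One LM per dialog-act class; the class with the highest likelihood wins.
    train = {"question": ["is that ok", "can we meet monday"],
             "statement": ["we meet monday", "that is ok"]}
    lms = {c: train_lm(sents) for c, sents in train.items()}
    lambdas = (0.1, 0.3, 0.6)   # hand-picked; subject to ML/MMI/MCE estimation
    vocab = {w for s in sum(train.values(), []) for w in s.split()}
    utt = "can we meet"
    best = max(lms, key=lambda c: score(utt, lms[c], lambdas, len(vocab)))
    print(best)
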
SP-17.2
Combination of Words and Word Categories in Varigram Histories
Reinhard Blasig (Philips Research Laboratories)
This paper presents a new kind of language model: the category/word varigram. This model type permits a tight integration of word-based and category-based modeling of word sequences: any succession of words and word categories may be employed to describe a given word history. This provides much greater flexibility than previous combinations of word-based and category-based language models.
Experiments on the WSJ0 corpus and the 1994 ARPA
evaluation data indicate that the category/word
varigram yields a perplexity reduction of up to 10
percent as compared to a word varigram of the same
size, and improves the word error rate (WER) by 7
percent. Compared to a linear interpolation of a
word-based and a category-based n-gram, the WER
improvement is about 4 percent.
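To make the notion of a mixed history concrete, here is a small sketch (invented categories and data, not the Philips system) that enumerates the word/category variants of a history; a category/word varigram can choose among descriptions of this kind when predicting the next word.

    # Illustrative sketch: a "category/word" history is any mixture of
    # surface words and category labels. For a given history we enumerate
    # the mixed variants; a varigram model could pick the variant with the
    # most training evidence.
    from itertools import product

    # Hypothetical category assignments.
    category_of = {"monday": "DAY", "tuesday": "DAY", "two": "NUM", "three": "NUM"}

    def mixed_histories(words):
        """All word/category variants of a history, e.g. ('on', 'monday') ->
        ('on', 'monday') and ('on', 'DAY')."""
        options = [(w, category_of[w]) if w in category_of else (w,) for w in words]
        return list(product(*options))

    print(mixed_histories(("meet", "on", "monday")))
    # [('meet', 'on', 'monday'), ('meet', 'on', 'DAY')]
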
SP-17.3
Multi-Class Composite N-gram Based on Connection Direction
Hirofumi Yamamoto,
Yoshinori Sagisaka (ATR-ITL)
A new word-clustering technique is proposed to
efficiently build statistically salient class
2-grams from language corpora. By splitting word
neighboring characteristics into word-preceding and
following directions, multiple (two-dimensional) word
classes are assigned to each word. On each side, word classes are merged into larger clusters independently, according to the preceding or following word distributions. This word clustering can provide more efficient and statistically reliable word clusters. Further, we extend this to the Multi-Class Composite N-gram, whose units are Multi-Class 2-grams and joined words. The Multi-Class Composite N-gram showed better performance in both perplexity and recognition rates, at roughly one thousandth the size of conventional word 2-grams.
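A minimal sketch of the two-sided class idea, with invented class assignments and toy probability tables (not the ATR model itself): each word carries one class determined by its preceding context and one by its following context, and the class 2-gram conditions the left class of the next word on the right class of the previous word.

    # Hypothetical two-dimensional class assignments.
    left_class  = {"monday": "DAY_L", "tuesday": "DAY_L", "meet": "VERB_L"}
    right_class = {"on": "PREP_R", "every": "DET_R", "meet": "VERB_R"}

    def class_bigram_prob(w_prev, w, p_class, p_word_given_class):
        """P(w | w_prev) ~= P(left_class(w) | right_class(w_prev)) *
                            P(w | left_class(w))."""
        c_prev = right_class.get(w_prev, w_prev)   # back off to the word itself
        c      = left_class.get(w, w)
        return p_class.get((c_prev, c), 1e-6) * p_word_given_class.get((w, c), 1e-6)

    # Toy probability tables, chosen by hand for the example.
    p_class = {("PREP_R", "DAY_L"): 0.4}
    p_word_given_class = {("monday", "DAY_L"): 0.5}
    print(class_bigram_prob("on", "monday", p_class, p_word_given_class))  # 0.2
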
SP-17.4
A Class-based Language Model for Large-vocabulary Speech Recognition Extracted from Part-of-Speech Statistics
Christer Samuelsson,
Wolfgang Reichl (Lucent Technologies)
A novel approach is presented to class-based language
modeling based on part-of-speech statistics.
It uses a deterministic word-to-class mapping,
which handles words with alternative part-of-speech
assignments through the use of ambiguity classes.
The predictive power of word-based language models
and the generalization capability of class-based
language models are combined using both linear
interpolation and word-to-class backoff, and both
methods are evaluated. Since each word belongs to
precisely one ambiguity class, an exact word-to-class
backoff model can easily be constructed. Empirical
evaluations on large-vocabulary speech-recognition
tasks show perplexity improvements and significant
reductions in word error rate.
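A minimal sketch of the word-to-class backoff idea under these assumptions: every word maps deterministically to one ambiguity class, so backing off from an unseen word n-gram to the class n-gram is exact. All tables are toy values and renormalization is omitted for brevity; this is not the evaluated Lucent model.

    # Each word's ambiguity class is the set of POS tags it can take.
    ambiguity_class = {"run": "NN|VB", "walk": "NN|VB", "the": "DT", "dog": "NN"}

    word_bigram  = {("the", "dog"): 0.2}                       # observed word 2-grams
    class_bigram = {("DT", "NN"): 0.5, ("DT", "NN|VB"): 0.3}   # class 2-grams
    p_word_in_class = {"dog": 0.4, "run": 0.1, "walk": 0.1}    # P(w | class(w))

    def backoff_prob(prev, w):
        if (prev, w) in word_bigram:                   # word-level estimate exists
            return word_bigram[(prev, w)]
        c_prev, c = ambiguity_class[prev], ambiguity_class[w]
        return class_bigram.get((c_prev, c), 1e-6) * p_word_in_class.get(w, 1e-6)

    print(backoff_prob("the", "dog"))   # 0.2  (seen word bigram)
    print(backoff_prob("the", "run"))   # 0.3 * 0.1 = 0.03  (class backoff)
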
SP-17.5
Improved Topic-Dependent Language Modeling using Information Retrieval Techniques
Milind Mahajan (Microsoft Research),
Doug Beeferman (Carnegie Mellon University),
X.D. Huang (Microsoft Research)
N-gram language models are frequently used by speech recognition systems to constrain and guide the search. N-gram models use only the last N-1 words to predict the next word; typical values of N range from 2 to 4. N-gram language models thus lack long-term context information. We show that the predictive power of N-gram language models can be improved by using long-term context information about the topic of discussion. We use information retrieval techniques to generalize the available context information for topic-dependent language modeling. We demonstrate the effectiveness of this technique by performing experiments on the Wall Street Journal text corpus, which is a relatively difficult task for topic-dependent language modeling since the text is relatively homogeneous. The proposed method can reduce the perplexity of the baseline language model by 37%, indicating the predictive power of the topic-dependent language model.
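A small sketch of the general recipe, under the assumption of plain TF-IDF retrieval over a toy corpus (the paper's actual system and weighting are not reproduced here): the recognition history serves as an IR query, the most similar training documents are retrieved, and their text would feed a topic LM interpolated with the baseline model.

    import math
    from collections import Counter

    docs = ["the fed raised interest rates again",
            "the team won the final game last night",
            "bond markets fell as rates climbed"]

    def tfidf(text, df, n_docs):
        tf = Counter(text.split())
        return {w: c * math.log(n_docs / df[w]) for w, c in tf.items() if df[w]}

    def cosine(a, b):
        num = sum(a[w] * b.get(w, 0.0) for w in a)
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    df = Counter(w for d in docs for w in set(d.split()))
    doc_vecs = [tfidf(d, df, len(docs)) for d in docs]

    history = "interest rates are rising"          # words recognized so far
    query = tfidf(history, df, len(docs))
    best = max(range(len(docs)), key=lambda i: cosine(query, doc_vecs[i]))
    print(docs[best])   # most topically similar document; its counts would feed
                        # a topic LM, then P = a*P_topic + (1-a)*P_baseline
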
SP-17.6
Smoothing Methods in Maximum Entropy Language Modeling
Sven C Martin,
Hermann Ney,
Joerg Zaplo (Lehrstuhl fuer Informatik VI, RWTH Aachen University of Technology, D-52056 Aachen, Germany)
This paper discusses various aspects of smoothing techniques in maximum entropy language modeling, a topic not sufficiently covered by previous publications. We show (1) that straightforward maximum entropy models with nested features, e.g. tri-, bi-, and unigrams, result in unsmoothed relative-frequency models; and (2) that maximum entropy models with nested features and discounted feature counts approximate backing-off smoothed relative-frequency models with Kneser's advanced marginal back-off distribution, which explains some of the reported success of maximum entropy models in the past. We also present (3) perplexity results for nested and non-nested features, e.g. trigrams and distance trigrams, on a 4-million-word subset of the Wall Street Journal corpus; these show that the smoothing method has more effect on perplexity than the method used to combine information.
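As a very small illustration of point (2), the sketch below forms maximum entropy constraint targets from discounted rather than raw feature counts; absolute discounting with an invented d = 0.5 stands in here for whatever discounting scheme is actually used, so this is only the shape of the modification, not the paper's method.

    from collections import Counter

    bigram_counts = Counter({("of", "the"): 12, ("in", "the"): 7, ("the", "cat"): 1})
    d = 0.5   # discount value, chosen arbitrarily for the example

    def discounted_targets(counts, discount):
        """Constraint target for each feature: max(N(f) - d, 0) instead of N(f)."""
        return {f: max(n - discount, 0.0) for f, n in counts.items()}

    print(discounted_targets(bigram_counts, d))
    # {('of', 'the'): 11.5, ('in', 'the'): 6.5, ('the', 'cat'): 0.5}
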
SP-17.7
Efficient Sampling and Feature Selection in Whole Sentence Maximum Entropy Language Models
Stanley Chen,
Ronald Rosenfeld (Carnegie Mellon University)
Conditional Maximum Entropy models have been successfully applied to
estimating language model probabilities of the form p(w|h), but are
often too demanding computationally. Furthermore, the conditional
framework does not lend itself to expressing global sentential
phenomena. We have recently introduced a non-conditional Maximum
Entropy language model which directly models the probability of an
entire sentence or utterance. The model treats each utterance as a
"bag of features," where features are arbitrary computable
properties of the sentence. Using the model is computationally
straightforward since it does not require normalization. Training the
model requires efficient sampling of sentences from an exponential
distribution.
In this paper, we further develop the model and demonstrate its
feasibility and power. We compare the efficiency of several sampling
techniques, implement smoothing to accommodate rare features, and
suggest an efficient algorithm for improving the convergence rate. We
then present a novel procedure for feature selection, which exploits
discrepancies between the existing model and the training corpus. We
demonstrate our ideas by constructing and analyzing competitive models
in the Switchboard domain.
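The model form and one standard way to sample from it can be sketched compactly; the base model, the two features, and the weights below are toy stand-ins, not the authors' setup. The target is p(s) proportional to p0(s) * exp(sum_i lambda_i f_i(s)), and independence Metropolis-Hastings with p0 as the proposal reduces the acceptance ratio to a ratio of feature weights.

    import math, random

    random.seed(0)
    VOCAB = ["yeah", "i", "think", "so", "really"]

    def sample_from_base():
        """Base model p0: random length-2..4 sentence, uniform over the vocab."""
        return [random.choice(VOCAB) for _ in range(random.randint(2, 4))]

    def features(sentence):
        """Arbitrary computable sentence-level properties ("bag of features")."""
        return {"has_yeah": float("yeah" in sentence),
                "is_short": float(len(sentence) <= 3)}

    lambdas = {"has_yeah": 0.7, "is_short": -0.3}   # toy weights

    def weight(sentence):
        """Unnormalized exp(sum_i lambda_i * f_i(s)); p(s) ~ p0(s) * weight(s)."""
        f = features(sentence)
        return math.exp(sum(lambdas[k] * f[k] for k in f))

    def metropolis_samples(n):
        cur = sample_from_base()
        out = []
        for _ in range(n):
            prop = sample_from_base()                 # proposal drawn from p0
            if random.random() < min(1.0, weight(prop) / weight(cur)):
                cur = prop
            out.append(cur)
        return out

    samples = metropolis_samples(2000)
    print(sum("yeah" in s for s in samples) / len(samples))  # boosted by lambda
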
SP-17.8
A Maximum Entropy Language Model Integrating N-Gram and Topic Dependencies for Conversational Speech Recognition
Sanjeev P Khudanpur (Johns Hopkins University),
Jun Wu (Johns Hopkins University)
A compact language model which incorporates local dependencies in the
form of N-grams and long distance dependencies through dynamic topic
conditional constraints is presented. These constraints are
integrated using the maximum entropy principle. Issues in assigning a
topic to a test utterance are investigated. Recognition results on
the Switchboard corpus are presented showing that with a very small
increase in the number of model parameters, reductions in word error
rate and language model perplexity are achieved over trigram models.
Some analysis follows, demonstrating that the gains are even larger on
content-bearing words. The results are compared with those obtained
by interpolating topic-independent and topic-specific N-gram models.
The framework presented here extends easily to incorporate other forms
of statistical dependencies such as syntactic word-pair relationships
or hierarchical topic constraints.
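To show the shape of such a model, here is a toy sketch (not the actual Switchboard system): a conditional maximum entropy model whose features are standard n-gram indicators plus topic-conditional unigram indicators, so the topic assigned to the test utterance switches its features on. All weights and the vocabulary are invented.

    import math

    lambdas = {("bigram", "interest", "rates"): 1.2,
               ("unigram", "rates"): 0.3,
               ("topic_unigram", "FINANCE", "rates"): 0.9}

    VOCAB = ["rates", "game", "the"]

    def active_features(prev_word, word, topic):
        return [("bigram", prev_word, word),
                ("unigram", word),
                ("topic_unigram", topic, word)]

    def p_word(word, prev_word, topic):
        """P(w | h, topic) = exp(sum of active lambdas) / Z(h, topic)."""
        def score(w):
            return math.exp(sum(lambdas.get(f, 0.0)
                                for f in active_features(prev_word, w, topic)))
        z = sum(score(w) for w in VOCAB)          # normalization over the vocab
        return score(word) / z

    print(p_word("rates", "interest", topic="FINANCE"))
    print(p_word("rates", "interest", topic="SPORTS"))   # lower: topic feature off
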