Authors:
Volker Warnke,
Stefan Harbeck,
Elmar Nöth,
Heinrich Niemann,
Michael Levit,
Paper number 1233
Abstract:
In this paper we present a new approach for estimating the interpolation
parameters of language models (LMs) which are used as classifiers. Classical
maximum likelihood (ML) estimation is, in theory, only justified if a huge
amount of training data is available and the underlying density assumption
is correct. Usually at least one of these conditions is violated, so
optimization techniques such as maximum mutual information (MMI) and minimum
classification error (MCE) can be used instead, where the interpolation
parameters are not optimized on their own but jointly, taking all models
into account. We show how MCE and MMI techniques can be applied to two
different kinds of interpolation strategies: linear interpolation, the
standard interpolation method, and rational interpolation. We compare ML,
MCE and MMI on the German part of the VERBMOBIL corpus, where we obtain a
3% reduction in classification error when discriminating between 18 dialog
act classes.
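For reference, and not quoted from the paper, the two interpolation schemes
can be written as follows; the history-dependent weights g_i(h) in the
rational form are an illustrative assumption:

    p_{\mathrm{lin}}(w \mid h) = \sum_i \lambda_i \, p_i(w \mid h), \qquad \lambda_i \ge 0, \; \sum_i \lambda_i = 1

    p_{\mathrm{rat}}(w \mid h) = \frac{\sum_i \lambda_i \, g_i(h) \, p_i(w \mid h)}{\sum_i \lambda_i \, g_i(h)}

MMI and MCE training then adjust the lambda_i (and, in the rational case,
the g_i) of all class-specific models jointly rather than model by model.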
Authors:
Reinhard Blasig,
Paper number 1520
Abstract:
This paper presents a new kind of language model: the category/word varigram.
This model type permits a tight integration of word-based and category-based
modeling of word sequences: any succession of words and word categories may
be employed to describe a given word history. This provides much greater
flexibility than previous combinations
of word-based and category-based language models. Experiments on the
WSJ0 corpus and the 1994 ARPA evaluation data indicate that the category/word
varigram yields a perplexity reduction of up to 10 percent as compared
to a word varigram of the same size, and improves the word error rate
(WER) by 7 percent. Compared to a linear interpolation of a word-based
and a category-based n-gram, the WER improvement is about 4 percent.
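For illustration only (the abstract gives no formal notation), a category/word
varigram conditions each word on a variable-length mixed history in which
every position may be either a word or its category:

    p(w_n \mid w_1 \ldots w_{n-1}) \approx p(w_n \mid x_{n-k} \ldots x_{n-1}), \qquad x_j \in \{\, w_j, \; C(w_j) \,\}

where C(w) denotes the category of w; the word/category choice at each
position and the history length k are determined when the varigram is built.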
Authors:
Hirofumi Yamamoto,
Yoshinori Sagisaka,
Paper number 1646
Abstract:
A new word-clustering technique is proposed to efficiently build statistically
salient class 2-grams from language corpora. By splitting a word's neighboring
characteristics into the preceding and the following direction, multiple
(two-dimensional) word classes are assigned to each word. On each side, word
classes are merged independently into larger clusters according to the
preceding- or following-word distributions. This clustering scheme provides
more efficient and statistically more reliable word clusters. Further, we
extend it to a Multi-Class Composite N-gram whose units are Multi-Class
2-grams and joined words. The Multi-Class Composite N-gram showed better
performance in both perplexity and recognition rate at one-thousandth the
size of conventional word 2-grams.
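A plausible rendering of the resulting Multi-Class 2-gram, assumed here from
the description above, predicts a word through its preceding-direction class
while conditioning on the predecessor's following-direction class:

    p(w_i \mid w_{i-1}) \approx p\bigl(w_i \mid C_{\mathrm{prec}}(w_i)\bigr)\, p\bigl(C_{\mathrm{prec}}(w_i) \mid C_{\mathrm{foll}}(w_{i-1})\bigr)

where C_prec(w) is the class of w induced from its preceding-word
distribution and C_foll(w) the class induced from its following-word
distribution.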
Authors:
Christer Samuelsson,
Wolfgang Reichl,
Paper number 1781
Abstract:
A novel approach to class-based language modeling based on part-of-speech
statistics is presented. It uses a deterministic word-to-class
mapping, which handles words with alternative part-of-speech assignments
through the use of ambiguity classes. The predictive power of word-based
language models and the generalization capability of class-based language
models are combined using both linear interpolation and word-to-class
backoff, and both methods are evaluated. Since each word belongs to
precisely one ambiguity class, an exact word-to-class backoff model
can easily be constructed. Empirical evaluations on large-vocabulary
speech-recognition tasks show perplexity improvements and significant
reductions in word error-rate.
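As a sketch (the exact backoff formula is not given in the abstract), the
deterministic mapping c(w) to a single ambiguity class permits an exact
word-to-class backoff of the form

    p(w \mid h) =
      \begin{cases}
        \tilde{p}(w \mid h) & \text{if the word } n\text{-gram } (h, w) \text{ was observed} \\
        \alpha(h)\, p\bigl(w \mid c(w)\bigr)\, p\bigl(c(w) \mid c(h)\bigr) & \text{otherwise}
      \end{cases}

where c(h) is the class sequence of the history and alpha(h) the usual
backoff weight; because c(.) is deterministic, no summation over alternative
part-of-speech taggings is needed and the model normalizes exactly.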
Authors:
Milind Mahajan,
Doug Beeferman,
X.D. Huang,
Paper number 2391
Abstract:
N-gram language models are frequently used by speech recognition systems
to constrain and guide the search. N-gram models use only the last N-1 words
to predict the next word, with typical values of N ranging from 2 to 4;
they therefore lack long-term context information. We show that the
predictive power of N-gram
language models can be improved by using long-term context information
about the topic of discussion. We use information retrieval techniques
to generalize the available context information for topic-dependent
language modeling. We demonstrate the effectiveness of this technique
by performing experiments on the Wall Street Journal text corpus, which
is a relatively difficult task for topic-dependent language modeling
since the text is quite homogeneous. The proposed method can reduce
the perplexity of the baseline language model by 37%, indicating the
predictive power of the topic-dependent language model.
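A minimal sketch of the general idea described above; the function names,
the TF-IDF weighting, and the interpolation weight are illustrative
assumptions, not details taken from the paper:

    from collections import Counter
    import math

    def build_index(docs):
        """Document frequencies, corpus size, and TF-IDF vectors for tokenized documents."""
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        n = len(docs)
        vecs = [{w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
                for doc in docs]
        return df, n, vecs

    def vectorize(tokens, df, n):
        """TF-IDF vector for the recognition history, using training-set IDF values."""
        return {w: tf * math.log(n / df[w]) for w, tf in Counter(tokens).items() if w in df}

    def cosine(u, v):
        dot = sum(x * v.get(w, 0.0) for w, x in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def topic_unigram(history, docs, df, n, vecs, k=5):
        """Unigram model estimated from the k training documents most similar to the history."""
        q = vectorize(history, df, n)
        top = sorted(range(len(docs)), key=lambda i: cosine(q, vecs[i]), reverse=True)[:k]
        counts = Counter(w for i in top for w in docs[i])
        total = sum(counts.values())
        return lambda w: counts[w] / total if total else 0.0

    def adapted_prob(w, p_baseline, p_topic, lam=0.3):
        """Linear interpolation of the baseline and topic models (lam is a free parameter)."""
        return (1.0 - lam) * p_baseline(w) + lam * p_topic(w)

In a recognizer, the decoded words of the recent history would be passed to
topic_unigram and the adapted probabilities used to rescore hypotheses.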
Authors:
Sven C Martin, Lehrstuhl fuer Informatik VI, RWTH Aachen, University of Technology, D-52056 Aachen, Germany
Hermann Ney, Lehrstuhl fuer Informatik VI, RWTH Aachen, University of Technology, D-52056 Aachen, Germany
Joerg Zaplo, Lehrstuhl fuer Informatik VI, RWTH Aachen, University of Technology, D-52056 Aachen, Germany
Paper number 1703
Abstract:
This paper discusses various aspects of smoothing techniques in maximum
entropy language modeling, a topic not sufficiently covered by previous
publications. We show (1) that straightforward maximum entropy models
with nested features, e.g. tri-, bi-, and unigrams, result in unsmoothed
relative-frequency models; (2) that maximum entropy models with nested
features and discounted feature counts approximate backing-off smoothed
relative-frequency models with Kneser's advanced marginal back-off
distribution, which explains some of the reported success of maximum
entropy models in the past; and (3) perplexity results for nested and
non-nested features, e.g. trigrams and distance-trigrams, on a 4-million-word
subset of the Wall Street Journal corpus, showing that the smoothing method
has a greater effect on perplexity than the method used to combine
information.
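For concreteness, the conditional maximum entropy model with n-gram features
f_i referred to above has the standard exponential form

    p_\Lambda(w \mid h) = \frac{\exp\bigl(\sum_i \lambda_i f_i(h, w)\bigr)}{\sum_{w'} \exp\bigl(\sum_i \lambda_i f_i(h, w')\bigr)}

and is trained so that the model's expectation of each feature matches its
training target; replacing the raw target N(f_i)/N by a discounted value
such as (N(f_i) - d_i)/N (notation assumed here) is what yields the smoothing
effect discussed in point (2).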
Authors:
Stanley F. Chen,
Ronald Rosenfeld,
Paper number 2189
Abstract:
Conditional Maximum Entropy models have been successfully applied to
estimating language model probabilities of the form p(w|h), but are
often too demanding computationally. Furthermore, the conditional framework
does not lend itself to expressing global sentential phenomena. We
have recently introduced a non-conditional Maximum Entropy language
model which directly models the probability of an entire sentence or
utterance. The model treats each utterance as a "bag of features,"
where features are arbitrary computable properties of the sentence.
Using the model is computationally straightforward since it does not
require normalization. Training the model requires efficient sampling
of sentences from an exponential distribution. In this paper, we further
develop the model and demonstrate its feasibility and power. We compare
the efficiency of several sampling techniques, implement smoothing
to accommodate rare features, and suggest an efficient algorithm for
improving the convergence rate. We then present a novel procedure for feature
selection, which exploits discrepancies between the existing model
and the training corpus. We demonstrate our ideas by constructing and
analyzing competitive models in the Switchboard domain.
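The whole-sentence model referred to above is commonly written as
p(s) = (1/Z)\, p_0(s) \exp\bigl(\sum_i \lambda_i f_i(s)\bigr), with p_0 a
baseline (e.g. n-gram) sentence distribution and f_i arbitrary sentence
features. The sketch below shows one of the sampling options such a model
admits, a Metropolis independence sampler that uses p_0 itself as the
proposal; the interfaces and names are illustrative assumptions rather than
the authors' implementation:

    import math
    import random

    def sample_sentences(propose, sentence_features, lambdas, num_samples):
        """Metropolis independence sampler for p(s) proportional to
        p0(s) * exp(sum_i lambda_i * f_i(s)), using the baseline model p0
        (via propose()) as the proposal distribution."""
        def log_weight(sentence):
            # log p(s) - log p0(s) = sum_i lambda_i * f_i(s); the p0 terms
            # cancel in the acceptance ratio when the proposal is p0 itself.
            feats = sentence_features(sentence)
            return sum(lambdas.get(name, 0.0) * value for name, value in feats.items())

        current = propose()
        current_lw = log_weight(current)
        samples = []
        for _ in range(num_samples):
            candidate = propose()
            candidate_lw = log_weight(candidate)
            # Accept with probability min(1, exp(candidate_lw - current_lw)).
            if math.log(random.random() + 1e-300) < candidate_lw - current_lw:
                current, current_lw = candidate, candidate_lw
            samples.append(current)
        return samples

Sentences drawn this way can be used to estimate the feature expectations
needed for the parameter updates.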
Authors:
Sanjeev P Khudanpur,
Jun Wu,
Paper number 2192
Abstract:
A compact language model which incorporates local dependencies in the
form of N-grams and long distance dependencies through dynamic topic
conditional constraints is presented. These constraints are integrated
using the maximum entropy principle. Issues in assigning a topic to
a test utterance are investigated. Recognition results on the Switchboard
corpus are presented, showing that, with a very small increase in the
number of model parameters, reductions in word error rate and language
model perplexity are achieved over trigram models. Some analysis follows,
demonstrating that the gains are even larger on content-bearing words.
The results are compared with those obtained by interpolating topic-independent
and topic-specific N-gram models. The framework presented here extends
easily to incorporate other forms of statistical dependencies such
as syntactic word-pair relationships or hierarchical topic constraints.
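One way to write such a model (the feature set here is an illustrative
assumption) is a conditional exponential model whose features are the usual
n-grams plus topic-dependent unigrams:

    p(w_i \mid w_{i-2}, w_{i-1}, t) = \frac{\exp\bigl(\lambda_{w_i} + \lambda_{w_{i-1} w_i} + \lambda_{w_{i-2} w_{i-1} w_i} + \lambda_{t, w_i}\bigr)}{Z(w_{i-2}, w_{i-1}, t)}

where t is the topic assigned to the test utterance; only the lambda_{t,w}
terms are added on top of the trigram parameters, which is why the growth
in model size is small.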