Authors:
Cristina Chesta,
Pietro Laface,
Franco Ravera,
Page (NA) Paper number 1465
Abstract:
In this paper we show that accurate HMMs for connected word recognition
can be obtained without context dependent modeling and discriminative
training. We train two HMMs for each word that have the same, standard,
left to right topology with the possibility of skipping one state,
but each model has a different number of states, automatically selected.
The two models account for different speaking rates that occur not
only in different utterances of the speakers, but also within a connected
word utterance of the same speaker. This simple modeling technique
has been applied to connected digit recognition using the adult speaker
portion of the TI/NIST corpus giving the best results reported so far
for this database. It has also been tested on telephone speech using
long sequences of Italian digits (credit card numbers), giving better
results than classical models with a larger number of densities.
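The topology described in this abstract can be illustrated with a minimal sketch. It builds a left-to-right transition matrix with a self-loop, a step to the next state, and a skip over one state, for two models of different length. The function name, the state counts (4 and 7), and the initial probabilities are illustrative assumptions, not values from the paper, and the exit transition from the last state is omitted for brevity.

```python
def left_to_right_transitions(num_states, p_stay=0.5, p_next=0.4, p_skip=0.1):
    """Row-stochastic matrix allowing i -> i, i -> i+1 and i -> i+2 only.

    The probabilities are hypothetical starting values; in practice they
    would be re-estimated during training.
    """
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        weights = {i: p_stay, i + 1: p_next, i + 2: p_skip}
        # Drop transitions that would leave the model, then renormalize.
        allowed = {j: w for j, w in weights.items() if j < num_states}
        total = sum(allowed.values())
        for j, w in allowed.items():
            matrix[i][j] = w / total
    return matrix


# Two models for the same word: a short one for fast speech and a long one
# for slow speech (state counts chosen arbitrarily for this example).
short_model = left_to_right_transitions(4)
long_model = left_to_right_transitions(7)
```

Both matrices share the same structure and differ only in size, which is the property the abstract relies on to cover different speaking rates.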
Authors:
Kishore A Papineni,
Page (NA) Paper number 2204
Abstract:
This paper presents a linear programming approach to discriminative
training. We first define a measure of discrimination of an arbitrary
conditional probability model on a set of labeled training data. We
consider maximizing discrimination on a parametric family of exponential
models that arises naturally in the maximum entropy framework. We show
that this optimization problem is globally convex in R^n, and is moreover
piece-wise linear on R^n. We propose a solution that involves solving
a series of linear programming problems. We provide a characterization
of global optimizers. We compare this framework with those of minimum
classification error and maximum entropy.
Authors:
Daniel Willett,
Christoph Neukirchen,
Jörg Rottland,
Gerhard Rigoll,
Page (NA) Paper number 1633
Abstract:
Decision tree-based state clustering has emerged in recent years as
the most popular approach for clustering the states of context dependent
hidden Markov model based speech recognizers. The application of sets
of phones, mainly phonetically motivated, that limit the possible clusters,
results in reasonably good modeling of unseen phones while still
allowing specific phones to be modeled very precisely whenever this is necessary
and enough training data is available. Formal Concept Analysis, a young
mathematical discipline, provides means for the treatment of sets and
sets of sets that are well suited for further improving tree-based
state clustering. The possible refinements are outlined and evaluated
in this paper. The main contributions are procedures for adapting
the number of sets used for clustering to the amount
of available training data, and a method that generates suitable
sets automatically without the incorporation of additional knowledge.
Authors:
Stavros Tsakalidis,
Vassilis Digalakis,
Leonardo G Neumeyer,
Page (NA) Paper number 2012
Abstract:
This paper introduces a new form of observation distributions for hidden
Markov models (HMMs), combining subvector quantization and mixtures
of discrete distributions. We present efficient training and decoding
algorithms for the discrete-mixture HMMs (DMHMMs). Our experimental
results in the air-travel information domain show that the high level
of recognition accuracy of continuous mixture-density HMMs (CDHMMs)
can be maintained at significantly faster decoding speeds. Moreover,
we show that when the same number of mixture components is used in
DMHMMs and CDHMMs, the new models exhibit superior recognition performance.
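As a rough illustration of how subvector quantization and a mixture of discrete distributions can combine, the sketch below assumes the observation probability takes the form b(x) = sum_k w_k * prod_l P_kl(q_l(x_l)), where q_l maps the l-th subvector to its nearest codeword. The abstract does not state this formula, so the form, the function names, and all numbers are assumptions made for the example.

```python
def quantize(subvector, codebook):
    """Return the index of the codeword closest to the subvector."""
    def squared_distance(codeword):
        return sum((a - b) ** 2 for a, b in zip(subvector, codeword))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def dmhmm_observation_prob(codeword_indices, mixture_weights, discrete_tables):
    """Mixture over components of a product of per-subvector discrete probabilities."""
    total = 0.0
    for weight, tables in zip(mixture_weights, discrete_tables):
        component = weight
        for table, index in zip(tables, codeword_indices):
            component *= table[index]  # table lookup replaces a Gaussian evaluation
        total += component
    return total


# Hypothetical setup: two subvectors, two codewords each, two mixture components.
codebooks = [[(0.0, 0.0), (1.0, 1.0)], [(0.0,), (2.0,)]]
subvectors = [(0.1, -0.2), (1.7,)]
indices = [quantize(v, cb) for v, cb in zip(subvectors, codebooks)]

weights = [0.6, 0.4]
tables = [
    [[0.9, 0.1], [0.2, 0.8]],  # component 0: one table per subvector
    [[0.3, 0.7], [0.5, 0.5]],  # component 1
]
prob = dmhmm_observation_prob(indices, weights, tables)
```

Under this assumed form, each component costs one table lookup per subvector at decoding time, which is consistent with the faster decoding the abstract reports.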
Authors:
Wolfgang Reichl,
Wu Chou,
Page (NA) Paper number 2377
Abstract:
In this paper, a unified maximum likelihood framework for incorporating
phonetic and non-phonetic features in decision tree based acoustic
modeling is proposed. Unlike phonetic features, non-phonetic features
in this context are those features that cannot be derived from the
phoneme identities. Although non-phonetic features are used in speech
recognition, they are often treated separately and based on various
heuristics. In our approach, non-phonetic features are included as
additional tags to the decision tree clustering. Moreover, the proposed
tagged decision tree is based on the full training data, and therefore,
it alleviates the problem of training data depletion in building specific
feature dependent acoustic models. Experimental results indicate that
up to 10% word error rate reduction can be achieved in a large vocabulary
(Wall Street Journal) speech recognition task based on the proposed
approach.
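The idea of treating a non-phonetic tag as one more question in a maximum likelihood decision tree can be sketched as follows. This toy version scores each candidate question by the log-likelihood gain of splitting the data into two single Gaussians over a one-dimensional feature. The sample fields, the tag values "A" and "B", and the questions are hypothetical; the paper's actual features and likelihood computation are not given in the abstract.

```python
import math


def gaussian_log_likelihood(values):
    """Log-likelihood of the values under their own ML-fitted 1-D Gaussian."""
    n = len(values)
    mean = sum(values) / n
    variance = max(sum((v - mean) ** 2 for v in values) / n, 1e-6)  # variance floor
    return -0.5 * n * (math.log(2 * math.pi * variance) + 1.0)


def best_question(samples, questions):
    """Return (question name, gain) for the split with the largest likelihood gain."""
    parent = gaussian_log_likelihood([s["value"] for s in samples])
    best = None
    for name, predicate in questions.items():
        yes = [s["value"] for s in samples if predicate(s)]
        no = [s["value"] for s in samples if not predicate(s)]
        if not yes or not no:
            continue  # a question that does not split the data is useless
        gain = gaussian_log_likelihood(yes) + gaussian_log_likelihood(no) - parent
        if best is None or gain > best[1]:
            best = (name, gain)
    return best


# All data pooled in one tree; the tag is just another attribute to ask about.
samples = [
    {"left": "m", "tag": "A", "value": 1.0},
    {"left": "n", "tag": "A", "value": 1.2},
    {"left": "m", "tag": "B", "value": 3.0},
    {"left": "n", "tag": "B", "value": 3.1},
]
questions = {
    "left context is m?": lambda s: s["left"] == "m",
    "tag is A?": lambda s: s["tag"] == "A",
}
chosen = best_question(samples, questions)
```

Because phonetic and tag questions compete on the same criterion over the full data set, the tree only splits on a tag where doing so actually increases likelihood, rather than partitioning the training data by tag in advance.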
Authors:
Qiang Huo, Department of Computer Science and Information Systems, The University of Hong Kong, Pokfulam Road, Hong Kong (Hong Kong)
Bin Ma, Department of Computer Science and Information Systems, The University of Hong Kong, Pokfulam Road, Hong Kong (Hong Kong)
Page (NA) Paper number 1825
Abstract:
We propose to apply the concept of irrelevant variability normalization
to the general problem of learning structure from data. Because of
the problems of a diversified training data set and/or possible acoustic
mismatches between training and testing conditions, the structure learned
from the training data by using a maximum likelihood training method
will not necessarily generalize well on mismatched tasks. We apply
the above concept to the structural learning problem of phonetic decision-tree
based hidden Markov model (HMM) state tying. We present a new method
that integrates a linear-transformation based normalization mechanism
into the decision-tree construction process to make the learned structure
have a better modeling capability and generalizability. The viability
and efficacy of the proposed method are confirmed in a series of experiments
for continuous speech recognition of Mandarin Chinese.
Authors:
Philip McMahon, The Queen's University of Belfast, Northern Ireland (Ireland)
Naomi Harte, The Queen's University of Belfast, Northern Ireland (Ireland)
Saeed Vaseghi, The Queen's University of Belfast, Northern Ireland (Ireland)
Paul McCourt, The Queen's University of Belfast, Northern Ireland (Ireland)
Page (NA) Paper number 1649
Abstract:
Multi-resolution features, which are based on the premise that there
may be more cues for phonetic discrimination in a given sub-band than
in another, have been shown to outperform the standard MFCC feature
set for both classification and recognition tasks on the TIMIT database
[5]. This paper presents an investigation into possible strategies
to extend these ideas from the spectral domain into both the spectral
and temporal domains. Experimental work on the integration of segmental
models, which are better at capturing the longer term phonetic correlation
of a phonetic unit, into the discriminative multi-resolution framework
is presented. Results are presented which show that including this
supplementary temporal information offers improved performance
for the phoneme classification task over the standard multi-resolution
MFCC feature set with time derivatives appended. Possible strategies
for the extension of these techniques into the area of continuous
speech recognition are discussed.
Authors:
Philip R Clarkson,
Pedro J Moreno,
Page (NA) Paper number 2104
Abstract:
Support Vector Machines (SVMs) represent a new approach to pattern
classification which has recently attracted a great deal of interest
in the machine learning community. Their appeal lies in their strong
connection to the underlying statistical learning theory, in particular
the theory of Structural Risk Minimization. SVMs have been shown to
be particularly successful in fields such as image identification and
face recognition; in many problems SVM classifiers have been shown
to perform much better than other non-linear classifiers such as artificial
neural networks and k-nearest neighbors. This paper explores the issues
involved in applying SVMs to phonetic classification as a first step
to speech recognition. We present results on several standard vowel
and phonetic classification tasks and show better performance than
Gaussian mixture classifiers. We also present an analysis of the difficulties
we foresee in applying SVMs to continuous speech recognition problems.