Abstract: Session SP-18

SP-18.1
Connected Digit Recognition Using Short and Long Duration Models
Cristina Chesta,
Pietro Laface (Politecnico di Torino),
Franco Ravera (CSELT, Torino)
In this paper we show that accurate HMMs for connected word recognition
can be obtained without context dependent modeling and discriminative training.
We train two HMMs for each word that have the same standard left-to-right
topology, with the possibility of skipping one state, but each model has a
different, automatically selected number of states.
The two models account for different speaking rates that
occur not only in different utterances of the speakers,
but also within a connected word utterance of the same speaker.
This simple modeling technique has been
applied to connected digit recognition using the adult speaker portion of the
TI/NIST corpus giving the best results reported so far for this database.
It has also been tested on telephone speech using long sequences of Italian
digits (credit card numbers), giving better results than classical
models with a larger number of densities.
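As a rough illustration of the topology described above, the following sketch builds the transition structure of a left-to-right HMM with an optional one-state skip, and instantiates a short and a long model per word. This is not the authors' implementation; the uniform probabilities and the state counts are illustrative placeholders.

```python
import numpy as np

def left_to_right_transitions(n_states, skip=True):
    """Transition matrix for a standard left-to-right HMM topology.

    Each state allows a self-loop, a step to the next state and,
    optionally, a skip over one state.  Probabilities here are uniform
    placeholders; in training they would be re-estimated from data.
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        successors = {i, min(i + 1, n_states - 1)}
        if skip and i + 2 < n_states:
            successors.add(i + 2)
        for j in successors:
            A[i, j] = 1.0
        A[i] /= A[i].sum()
    return A

# Two models per word: a "short" and a "long" one, whose state counts
# would be selected automatically from duration statistics of the
# training tokens (the counts below are made up for the example).
short_model = left_to_right_transitions(6)
long_model = left_to_right_transitions(12)
```

During recognition both models compete for each word, so fast and slow renditions each find a well-matched duration structure.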

SP-18.2
Discriminative Training Via Linear Programming
Kishore A Papineni (IBM T. J. Watson Research Center)
This paper presents a linear programming approach to discriminative
training. We first define a measure of discrimination of an arbitrary
conditional probability model on a set of labeled training data. We
consider maximizing discrimination on a parametric family of
exponential models that arises naturally in the maximum entropy
framework. We show that this optimization problem is globally convex
in $R^n$, and is moreover piecewise linear on $R^n$. We propose a
solution that involves solving a series of linear programming
problems. We provide a characterization of global optimizers. We
compare this framework with those of minimum classification error and
maximum entropy.
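The piecewise-linear structure can be made concrete with a toy sketch (not the paper's exact formulation): for a log-linear model p(y|x) ∝ exp(w·f(x,y)), the worst-case log-odds margin over the training data is concave and piecewise linear in w, so maximizing it reduces to a linear program. The box constraint on w below is an assumption added to keep the LP bounded.

```python
import numpy as np
from scipy.optimize import linprog

def max_min_margin(deltas):
    """Maximize the minimum margin via an LP.

    deltas: one row per constraint, each the feature difference
    f(x_i, y_i) - f(x_i, y) for a wrong label y.  We solve
        max t  s.t.  w . delta_row >= t,  -1 <= w_k <= 1,
    which linprog handles after negating the objective.
    """
    m, n = deltas.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                      # minimize -t  ==  maximize t
    A_ub = np.hstack([-deltas, np.ones((m, 1))])   # -w.delta + t <= 0
    b_ub = np.zeros(m)
    bounds = [(-1, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n], res.x[-1]

# Hypothetical feature differences for three training constraints.
deltas = np.array([[1.0, 0.5], [0.8, -0.2], [0.3, 1.0]])
w, margin = max_min_margin(deltas)
```

A positive optimal margin certifies that the exponential model separates the training labels; the LP's vertex solutions correspond to the global optimizers the paper characterizes.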

SP-18.3
Refining Tree-Based Clustering by Means of Formal Concept Analysis, Balanced Decision Trees and Automatically Generated Model-Sets
Daniel Willett,
Christoph Neukirchen,
Jörg Rottland,
Gerhard Rigoll (Gerhard-Mercator-University Duisburg)
Decision tree-based state clustering has emerged in recent years as the
most popular approach for clustering the states of context dependent hidden
Markov model based speech recognizers.
Using sets of phones, mostly phonetically motivated, to limit the possible
clusters results in reasonably good modeling of
unseen phones, while still making it possible to model specific phones very
precisely whenever this is necessary and enough training data is available.
Formal Concept Analysis, a young mathematical discipline,
provides means for the treatment of sets and sets of sets that are well suited
for further improving tree-based state clustering.
The possible refinements are outlined and evaluated in this
paper. The major merit is the proposal of procedures for the adaptation of
the number of sets used for clustering to the amount of available
training data, and of a method that generates suitable sets automatically
without the incorporation of additional knowledge.
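For context, the tree-based clustering being refined here chooses, at each node, the phone-set question with the largest likelihood gain. The sketch below shows that standard gain computation under the usual single-Gaussian approximation; the data and the question mask are synthetic stand-ins.

```python
import numpy as np

def gauss_loglik(frames):
    """Log-likelihood of frames under a diagonal Gaussian fit to them.

    The standard clustering approximation scores a cluster S as
    L(S) = -0.5 * N * (d*log(2*pi) + sum(log var) + d).
    """
    n, d = frames.shape
    var = frames.var(axis=0) + 1e-8
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(frames, in_set_mask):
    """Likelihood gain of splitting a cluster by a phone-set question.

    in_set_mask marks the frames whose phone context answers "yes";
    the best question maximizes this gain.
    """
    yes, no = frames[in_set_mask], frames[~in_set_mask]
    return gauss_loglik(yes) + gauss_loglik(no) - gauss_loglik(frames)

rng = np.random.default_rng(0)
# Two underlying acoustic populations mixed into one cluster.
frames = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
mask = np.arange(100) < 50         # a question that separates them
gain = split_gain(frames, mask)
```

The refinements in the paper act on the *question sets* fed into this procedure: adapting how many sets are offered to the available data, and generating suitable sets automatically.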

SP-18.4
EFFICIENT SPEECH RECOGNITION USING SUBVECTOR QUANTIZATION AND DISCRETE-MIXTURE HMMS
Stavros Tsakalidis (Technical University of Crete),
Vassilios Digalakis (Technical University of Crete / SRI International),
Leonardo G Neumeyer (SRI International)
This paper introduces a new form of observation distributions for hidden Markov models (HMMs), combining subvector quantization and mixtures of discrete distributions.
We present efficient training and decoding algorithms for the discrete-mixture HMMs (DMHMMs). Our experimental results in the air-travel information domain show that
the high-level of recognition accuracy of continuous mixture-density HMMs (CDHMMs) can be maintained at significantly faster decoding speeds. Moreover, we show that
when the same number of mixture components is used in DMHMMs and CDHMMs, the new models exhibit superior recognition performance.
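The core idea of subvector quantization can be sketched as follows: split each feature vector into a few subvectors, build a small codebook per subvector, and replace Gaussian evaluation with a table lookup per codeword. The subvector layout, codebook sizes, and the tiny k-means below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Minimal k-means, enough to build a per-subvector codebook."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        dists = ((data[:, None, :] - codebook[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                codebook[j] = data[labels == j].mean(0)
    return codebook

def subvector_quantize(features, splits, codebooks):
    """Quantize each subvector of each frame independently.

    A DMHMM observation model would then look up a discrete mixture
    probability per (subvector stream, codeword) instead of
    evaluating continuous Gaussian densities.
    """
    indices = []
    for (lo, hi), cb in zip(splits, codebooks):
        d = ((features[:, None, lo:hi] - cb[None]) ** 2).sum(-1)
        indices.append(d.argmin(1))
    return np.stack(indices, axis=1)

rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 13))       # stand-in for cepstral vectors
splits = [(0, 4), (4, 8), (8, 13)]       # illustrative subvector layout
codebooks = [kmeans(feats[:, lo:hi], 8) for lo, hi in splits]
codes = subvector_quantize(feats, splits, codebooks)
```

Because each stream's codebook is small, decoding replaces per-frame Gaussian arithmetic with a handful of nearest-codeword searches and table lookups, which is where the speedup comes from.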

SP-18.5
A UNIFIED APPROACH OF INCORPORATING GENERAL FEATURES IN DECISION TREE BASED ACOUSTIC MODELING
Wolfgang Reichl,
Wu Chou (Bell Laboratories, Lucent Technologies)
In this paper, a unified maximum likelihood framework
of incorporating phonetic and non-phonetic features
in decision tree based acoustic modeling is proposed.
Unlike phonetic features, non-phonetic features in
this context are those features which cannot be
derived from the phoneme identities. Although
non-phonetic features are used in speech recognition,
they are often treated separately and based on various
heuristics. In our approach, non-phonetic features are
included as additional tags to the decision tree
clustering. Moreover, the proposed tagged decision
tree is based on the full training data, and therefore,
it alleviates the problem of training data depletion
in building specific feature dependent acoustic models.
Experimental results indicate that up to 10% word error
rate reduction can be achieved in a large vocabulary
(Wall Street Journal) speech recognition task based on
the proposed approach.

SP-18.6
Irrelevant Variability Normalization in Learning HMM State Tying From Data Based on Phonetic Decision-Tree
Qiang Huo,
Bin Ma (Department of Computer Science and Information Systems, The University of Hong Kong, Pokfulam Road, Hong Kong)
We propose to apply the concept of irrelevant variability normalization
to the general problem of learning structure from data. Because of the
problems of a diversified training data set and/or possible acoustic
mismatches between training and testing conditions, the structure learned
from the training data by using a maximum likelihood training method will
not necessarily generalize well on mismatched tasks. We apply the above
concept to the structural learning problem of phonetic decision-tree based
hidden Markov model (HMM) state tying. We present a new method that
integrates a linear-transformation based normalization mechanism into the
decision-tree construction process to make the learned structure have a
better modeling capability and generalizability. The viability and efficacy of the
proposed method are confirmed in a series of experiments for continuous
speech recognition of Mandarin Chinese.

SP-18.7
DISCRIMINATIVE SPECTRAL-TEMPORAL MULTI-RESOLUTION FEATURES FOR SPEECH RECOGNITION
Philip McMahon,
Naomi Harte,
Saeed Vaseghi,
Paul McCourt (The Queen’s University of Belfast, Northern Ireland)
Multi-resolution features, which are based on the premise that there may be more cues for phonetic discrimination in a given sub-band than in another, have been shown to outperform the standard MFCC feature set for both classification and recognition tasks on the TIMIT database [5]. This paper presents an investigation into possible strategies to extend these ideas from the spectral domain into both the spectral and temporal domains. Experimental work on the integration of segmental models, which are better at capturing the longer-term correlation within a phonetic unit, into the discriminative multi-resolution framework is presented. Results are presented which show that including this supplementary temporal information offers an improvement in performance for the phoneme classification task over the standard multi-resolution MFCC feature set with time derivatives appended. Possible strategies for the extension of these techniques into the area of continuous speech recognition are discussed.
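The spectral side of the multi-resolution idea can be sketched simply: instead of one DCT over the full log filterbank, apply a separate cepstral transform per sub-band. The band layout and coefficient counts below are illustrative assumptions, and random numbers stand in for real filterbank output.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """DCT-II along the last axis (the MFCC-style decorrelating transform)."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T

def subband_cepstra(log_fbank, bands, n_coeffs=4):
    """Multi-resolution features: a separate cepstrum per spectral sub-band.

    log_fbank: (frames, n_filters) log filterbank energies.
    bands: list of (lo, hi) filter-index ranges; the layout is an
    assumption for illustration, not the paper's configuration.
    """
    return np.hstack([dct_ii(log_fbank[:, lo:hi], n_coeffs)
                      for lo, hi in bands])

rng = np.random.default_rng(0)
log_fbank = rng.normal(size=(100, 24))   # stand-in for real filterbank output
feats = subband_cepstra(log_fbank, [(0, 8), (8, 16), (16, 24)])
# 3 sub-bands x 4 coefficients = 12 features per frame
```

Keeping the sub-band cepstra separate preserves localized discriminative cues that a single full-band DCT would smear across all coefficients; the paper's temporal extension adds segment-level modeling on top of such features.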

SP-18.8
On the use of Support Vector Machines for Phonetic Classification
Philip R Clarkson (Cambridge University Engineering Department.),
Pedro J Moreno (Compaq Computer Corporation, Cambridge Research Lab)
Support Vector Machines (SVMs) represent a new approach to pattern classification which has recently attracted a great deal of interest in the machine learning community. Their appeal lies in their strong connection to the underlying statistical learning theory, in particular the theory of Structural Risk Minimization. SVMs have been shown to be particularly successful in fields such as image identification and
face recognition; in many problems SVM classifiers have been shown to perform much better than other non-linear classifiers such as artificial neural networks and $k$-nearest neighbors.
This paper explores the issues involved in applying SVMs to phonetic classification as a first step to speech recognition. We present results on several standard vowel and phonetic classification tasks and show better performance than Gaussian mixture classifiers. We also present an analysis of the difficulties we foresee in applying SVMs to continuous speech recognition problems.
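A minimal sketch of the SVM idea on a phonetic task, under stated assumptions: a linear SVM trained by stochastic subgradient descent on the hinge loss (Pegasos-style), applied to synthetic two-vowel formant-like data. The vowel labels, formant values, and hyperparameters are hypothetical; a kernel SVM, as typically used for this task, would replace the dot product with a kernel evaluation.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Linear SVM via Pegasos-style subgradient descent; y in {-1, +1}.

    Each step shrinks w (the L2 regularizer) and, when the example
    violates the margin, moves w toward correct classification.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b, t = 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * (X[i] @ w + b)
            w *= (1 - eta * lam)
            if margin < 1:
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

# Hypothetical two-vowel task on (F1, F2)-like features in kHz.
rng = np.random.default_rng(0)
iy = rng.normal([0.3, 2.3], 0.05, (100, 2))   # /iy/-like tokens
aa = rng.normal([0.7, 1.2], 0.05, (100, 2))   # /aa/-like tokens
X = np.vstack([iy, aa])
y = np.array([1] * 100 + [-1] * 100)
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

On such well-separated classes the learned hyperplane classifies essentially all tokens correctly; the interesting cases for speech are the overlapping vowel distributions where the margin-maximizing property matters.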