Abstract: Session SP-18

SP-18.1
Connected Digit Recognition Using Short and Long Duration Models
Cristina Chesta,
Pietro Laface (Politecnico di Torino),
Franco Ravera (CSELT, Torino)
In this paper we show that accurate HMMs for connected word recognition
can be obtained without context dependent modeling and discriminative training.
We train two HMMs for each word that have the same standard left-to-right
topology, with the possibility of skipping one state, but each model has a
different, automatically selected number of states.
The two models account for different speaking rates that
occur not only in different utterances of the speakers,
but also within a connected word utterance of the same speaker.
This simple modeling technique has been
applied to connected digit recognition using the adult speaker portion of the
TI/NIST corpus giving the best results reported so far for this database.
It has also been tested on telephone speech using long sequences of Italian
digits (credit card numbers), giving better results than classical
models with a larger number of densities.
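As a rough illustration of the topology described above, the following sketch builds the transition structure of a left-to-right HMM with an optional one-state skip, and instantiates a short and a long model per word. This is not the authors' implementation; the uniform probabilities and the state counts are illustrative placeholders.

```python
import numpy as np

def left_to_right_transitions(n_states, skip=True):
    """Transition matrix for a standard left-to-right HMM topology.

    Each state allows a self-loop, a step to the next state and,
    optionally, a skip over one state.  Probabilities here are uniform
    placeholders; in training they would be re-estimated from data.
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        successors = {i, min(i + 1, n_states - 1)}
        if skip and i + 2 < n_states:
            successors.add(i + 2)
        for j in successors:
            A[i, j] = 1.0
        A[i] /= A[i].sum()
    return A

# Two models per word: a "short" and a "long" one, whose state counts
# would be selected automatically from duration statistics of the
# training tokens (the counts below are made up for the example).
short_model = left_to_right_transitions(6)
long_model = left_to_right_transitions(12)
```

During recognition both models compete for each word, so fast and slow renditions each find a well-matched duration structure.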

SP-18.2
Discriminative Training Via Linear Programming
Kishore A Papineni (IBM T. J. Watson Research Center)
This paper presents a linear programming approach to discriminative
training. We first define a measure of discrimination of an arbitrary
conditional probability model on a set of labeled training data. We
consider maximizing discrimination on a parametric family of
exponential models that arises naturally in the maximum entropy
framework. We show that this optimization problem is globally convex
in $R^n$, and is moreover piecewise linear on $R^n$. We propose a
solution that involves solving a series of linear programming
problems. We provide a characterization of global optimizers. We
compare this framework with those of minimum classification error and
maximum entropy.
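The piecewise-linear structure can be made concrete with a toy sketch (not the paper's exact formulation): for a log-linear model p(y|x) ∝ exp(w·f(x,y)), the worst-case log-odds margin over the training data is concave and piecewise linear in w, so maximizing it reduces to a linear program. The box constraint on w below is an assumption added to keep the LP bounded.

```python
import numpy as np
from scipy.optimize import linprog

def max_min_margin(deltas):
    """Maximize the minimum margin via an LP.

    deltas: one row per constraint, each the feature difference
    f(x_i, y_i) - f(x_i, y) for a wrong label y.  We solve
        max t  s.t.  w . delta_row >= t,  -1 <= w_k <= 1,
    which linprog handles after negating the objective.
    """
    m, n = deltas.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                      # minimize -t  ==  maximize t
    A_ub = np.hstack([-deltas, np.ones((m, 1))])   # -w.delta + t <= 0
    b_ub = np.zeros(m)
    bounds = [(-1, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n], res.x[-1]

# Hypothetical feature differences for three training constraints.
deltas = np.array([[1.0, 0.5], [0.8, -0.2], [0.3, 1.0]])
w, margin = max_min_margin(deltas)
```

A positive optimal margin certifies that the exponential model separates the training labels; the LP's vertex solutions correspond to the global optimizers the paper characterizes.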

SP-18.3
Refining Tree-Based Clustering by Means of Formal Concept Analysis, Balanced Decision Trees and Automatically Generated Model-Sets
Daniel Willett,
Christoph Neukirchen,
Jörg Rottland,
Gerhard Rigoll (Gerhard-Mercator-University Duisburg)
Decision tree-based state clustering has emerged in recent years as the
most popular approach for clustering the states of context dependent hidden
Markov model based speech recognizers.
Using sets of phones, mostly phonetically motivated, to limit the possible
clusters results in reasonably good modeling of
unseen phones, while still making it possible to model specific phones very
precisely whenever this is necessary and enough training data is available.
Formal Concept Analysis, a young mathematical discipline,
provides means for the treatment of sets and sets of sets that are well suited
for further improving tree-based state clustering.
The possible refinements are outlined and evaluated in this
paper. The major merit is the proposal of procedures for the adaptation of
the number of sets used for clustering to the amount of available
training data, and of a method that generates suitable sets automatically
without the incorporation of additional knowledge.
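For context, the tree-based clustering being refined here chooses, at each node, the phone-set question with the largest likelihood gain. The sketch below shows that standard gain computation under the usual single-Gaussian approximation; the data and the question mask are synthetic stand-ins.

```python
import numpy as np

def gauss_loglik(frames):
    """Log-likelihood of frames under a diagonal Gaussian fit to them.

    The standard clustering approximation scores a cluster S as
    L(S) = -0.5 * N * (d*log(2*pi) + sum(log var) + d).
    """
    n, d = frames.shape
    var = frames.var(axis=0) + 1e-8
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(frames, in_set_mask):
    """Likelihood gain of splitting a cluster by a phone-set question.

    in_set_mask marks the frames whose phone context answers "yes";
    the best question maximizes this gain.
    """
    yes, no = frames[in_set_mask], frames[~in_set_mask]
    return gauss_loglik(yes) + gauss_loglik(no) - gauss_loglik(frames)

rng = np.random.default_rng(0)
# Two underlying acoustic populations mixed into one cluster.
frames = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
mask = np.arange(100) < 50         # a question that separates them
gain = split_gain(frames, mask)
```

The refinements in the paper act on the *question sets* fed into this procedure: adapting how many sets are offered to the available data, and generating suitable sets automatically.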

SP-18.4
EFFICIENT SPEECH RECOGNITION USING SUBVECTOR QUANTIZATION AND DISCRETE-MIXTURE HMMS
Stavros Tsakalidis (Technical University of Crete),
Vassilios Digalakis (Technical University of Crete / SRI International),
Leonardo G Neumeyer (SRI International)
This paper introduces a new form of observation distributions for hidden Markov models (HMMs), combining subvector quantization and mixtures of discrete distributions.
We present efficient training and decoding algorithms for the discrete-mixture HMMs (DMHMMs). Our experimental results in the air-travel information domain show that
the high-level of recognition accuracy of continuous mixture-density HMMs (CDHMMs) can be maintained at significantly faster decoding speeds. Moreover, we show that
when the same number of mixture components is used in DMHMMs and CDHMMs, the new models exhibit superior recognition performance.
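The core idea of subvector quantization can be sketched as follows: split each feature vector into a few subvectors, build a small codebook per subvector, and replace Gaussian evaluation with a table lookup per codeword. The subvector layout, codebook sizes, and the tiny k-means below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Minimal k-means, enough to build a per-subvector codebook."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        dists = ((data[:, None, :] - codebook[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                codebook[j] = data[labels == j].mean(0)
    return codebook

def subvector_quantize(features, splits, codebooks):
    """Quantize each subvector of each frame independently.

    A DMHMM observation model would then look up a discrete mixture
    probability per (subvector stream, codeword) instead of
    evaluating continuous Gaussian densities.
    """
    indices = []
    for (lo, hi), cb in zip(splits, codebooks):
        d = ((features[:, None, lo:hi] - cb[None]) ** 2).sum(-1)
        indices.append(d.argmin(1))
    return np.stack(indices, axis=1)

rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 13))       # stand-in for cepstral vectors
splits = [(0, 4), (4, 8), (8, 13)]       # illustrative subvector layout
codebooks = [kmeans(feats[:, lo:hi], 8) for lo, hi in splits]
codes = subvector_quantize(feats, splits, codebooks)
```

Because each stream's codebook is small, decoding replaces per-frame Gaussian arithmetic with a handful of nearest-codeword searches and table lookups, which is where the speedup comes from.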

SP-18.5
A UNIFIED APPROACH OF INCORPORATING GENERAL FEATURES IN DECISION TREE BASED ACOUSTIC MODELING
Wolfgang Reichl,
Wu Chou (Bell Laboratories, Lucent Technologies)
In this paper, a unified maximum likelihood framework
of incorporating phonetic and non-phonetic features
in decision tree based acoustic modeling is proposed.
Unlike phonetic features, non-phonetic features in
this context are those features which cannot be
derived from the phoneme identities. Although
non-phonetic features are used in speech recognition,
they are often treated separately and based on various
heuristics. In our approach, non-phonetic features are
included as additional tags to the decision tree
clustering. Moreover, the proposed tagged decision
tree is based on the full training data, and therefore,
it alleviates the problem of training data depletion
in building specific feature dependent acoustic models.
Experimental results indicate that up to 10% word error
rate reduction can be achieved in a large vocabulary
(Wall Street Journal) speech recognition task based on
the proposed approach.

SP-18.6
Irrelevant Variability Normalization in Learning HMM State Tying From Data Based on Phonetic Decision-Tree
Qiang Huo,
Bin Ma (Department of Computer Science and Information Systems, The University of Hong Kong, Pokfulam Road, Hong Kong)
We propose to apply the concept of irrelevant variability normalization
to the general problem of learning structure from data. Because of the
problems of a diversified training data set and/or possible acoustic
mismatches between training and testing conditions, the structure learned
from the training data by using a maximum likelihood training method will
not necessarily generalize well on mismatched tasks. We apply the above
concept to the structural learning problem of phonetic decision-tree based
hidden Markov model (HMM) state tying. We present a new method that
integrates a linear-transformation based normalization mechanism into the
decision-tree construction process to make the learned structure have a
better modeling capability and generalizability. The viability and efficacy of the
proposed method are confirmed in a series of experiments for continuous
speech recognition of Mandarin Chinese.

SP-18.7
DISCRIMINATIVE SPECTRAL-TEMPORAL MULTI-RESOLUTION FEATURES FOR SPEECH RECOGNITION
Philip McMahon,
Naomi Harte,
Saeed Vaseghi,
Paul McCourt (The Queen’s University of Belfast, Northern Ireland)
Multi-resolution features, which are based on the premise that there may be more cues for phonetic discrimination in a given sub-band than in another, have been shown to outperform the standard MFCC feature set for both classification and recognition tasks on the TIMIT database [5]. This paper presents an investigation into possible strategies to extend these ideas from the spectral domain into both the spectral and temporal domains. Experimental work on the integration of segmental models, which are better at capturing the longer-term correlation within a phonetic unit, into the discriminative multi-resolution framework is presented. Results are presented which show that including this supplementary temporal information offers an improvement in performance for the phoneme classification task over the standard multi-resolution MFCC feature set with time derivatives appended. Possible strategies for the extension of these techniques into the area of continuous speech recognition are discussed.
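The spectral side of the multi-resolution idea can be sketched simply: instead of one DCT over the full log filterbank, apply a separate cepstral transform per sub-band. The band layout and coefficient counts below are illustrative assumptions, and random numbers stand in for real filterbank output.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """DCT-II along the last axis (the MFCC-style decorrelating transform)."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T

def subband_cepstra(log_fbank, bands, n_coeffs=4):
    """Multi-resolution features: a separate cepstrum per spectral sub-band.

    log_fbank: (frames, n_filters) log filterbank energies.
    bands: list of (lo, hi) filter-index ranges; the layout is an
    assumption for illustration, not the paper's configuration.
    """
    return np.hstack([dct_ii(log_fbank[:, lo:hi], n_coeffs)
                      for lo, hi in bands])

rng = np.random.default_rng(0)
log_fbank = rng.normal(size=(100, 24))   # stand-in for real filterbank output
feats = subband_cepstra(log_fbank, [(0, 8), (8, 16), (16, 24)])
# 3 sub-bands x 4 coefficients = 12 features per frame
```

Keeping the sub-band cepstra separate preserves localized discriminative cues that a single full-band DCT would smear across all coefficients; the paper's temporal extension adds segment-level modeling on top of such features.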

SP-18.8
On the use of Support Vector Machines for Phonetic Classification
Philip R Clarkson (Cambridge University Engineering Department.),
Pedro J Moreno (Compaq Computer Corporation, Cambridge Research Lab)
Support Vector Machines (SVMs) represent a new approach to pattern classification which has recently attracted a great deal of interest in the machine learning community. Their appeal lies in their strong connection to the underlying statistical learning theory, in particular the theory of Structural Risk Minimization. SVMs have been shown to be particularly successful in fields such as image identification and
face recognition; in many problems SVM classifiers have been shown to perform much better than other non-linear classifiers such as artificial neural networks and $k$-nearest neighbors.
This paper explores the issues involved in applying SVMs to phonetic classification as a first step to speech recognition. We present results on several standard vowel and phonetic classification tasks and show better performance than Gaussian mixture classifiers. We also present an analysis of the difficulties we foresee in applying SVMs to continuous speech recognition problems.
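A minimal sketch of the SVM idea on a phonetic task, under stated assumptions: a linear SVM trained by stochastic subgradient descent on the hinge loss (Pegasos-style), applied to synthetic two-vowel formant-like data. The vowel labels, formant values, and hyperparameters are hypothetical; a kernel SVM, as typically used for this task, would replace the dot product with a kernel evaluation.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Linear SVM via Pegasos-style subgradient descent; y in {-1, +1}.

    Each step shrinks w (the L2 regularizer) and, when the example
    violates the margin, moves w toward correct classification.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b, t = 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * (X[i] @ w + b)
            w *= (1 - eta * lam)
            if margin < 1:
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

# Hypothetical two-vowel task on (F1, F2)-like features in kHz.
rng = np.random.default_rng(0)
iy = rng.normal([0.3, 2.3], 0.05, (100, 2))   # /iy/-like tokens
aa = rng.normal([0.7, 1.2], 0.05, (100, 2))   # /aa/-like tokens
X = np.vstack([iy, aa])
y = np.array([1] * 100 + [-1] * 100)
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

On such well-separated classes the learned hyperplane classifies essentially all tokens correctly; the interesting cases for speech are the overlapping vowel distributions where the margin-maximizing property matters.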