RECOGNITION: TRAINING TECHNIQUES

Chair: Robin Rohlicek, BBN, Inc. (USA)


Speaker-independent Phone Modeling Based on Speaker-dependent HMMs' Composition and Clustering

Authors:

Tetsuo Kosaka, ATR Interpreting Telecommunications Research Laboratories
Shoichi Matsunaga, ATR Interpreting Telecommunications Research Laboratories
Mikio Kuraoka, Toyohashi University of Technology (JAPAN)

Volume 1, Page 441

Abstract:

This paper proposes a novel method for speaker-independent phone modeling based on the composition and clustering of speaker-dependent HMMs. In general, HMM phone models are trained with the Baum-Welch (B-W) algorithm. Instead, we propose a speaker-independent phone modeling method in which speaker-dependent (SD) HMMs are combined to form speaker-independent (SI) HMMs without parameter re-estimation. Furthermore, using this method, we investigate how different kinds of reference speakers influence the development of the SI models. The method is evaluated in Japanese phoneme and phrase recognition experiments. Results show that its performance is comparable to that of the conventional B-W algorithm despite a greatly reduced computational cost.
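The composition idea in the abstract can be illustrated with a toy sketch (not the paper's exact formulation): per-speaker single-Gaussian state models are pooled into one speaker-independent mixture, with mixture weights taken from each speaker's data count, so no Baum-Welch re-estimation is needed. The 1-D `Gaussian` class and count-based weighting here are illustrative assumptions.

```python
# Illustrative sketch: pool speaker-dependent (SD) Gaussian state models
# into one speaker-independent (SI) mixture without re-estimation.
from dataclasses import dataclass

@dataclass
class Gaussian:
    mean: float
    var: float

def compose_si_state(sd_states, counts):
    """Combine SD Gaussian states into one SI mixture; each component
    keeps its SD parameters and gets a count-proportional weight."""
    total = sum(counts)
    return [(c / total, g) for c, g in zip(counts, sd_states)]

si = compose_si_state([Gaussian(0.0, 1.0), Gaussian(2.0, 1.5)], [300, 100])
# weights 0.75 and 0.25; component parameters are untouched
```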

300dpi TIFF Images of pages:

441 442 443 444

Acrobat PDF file of whole paper:

ic950441.pdf




Using Morphology Towards Better Large-Vocabulary Speech Recognition Systems

Authors:

P. Geutner, Universität Karlsruhe (GERMANY)

Volume 1, Page 445

Abstract:

To guarantee unrestricted natural language processing, state-of-the-art speech recognition systems require huge dictionaries that increase the search space and degrade performance. This applies especially to languages with many inflections and compound words, e.g. German. One way to maintain good recognition results as the vocabulary grows is to use base units other than whole words. In this paper, different decomposition methods for German, originally based on morphological decomposition, are compared. They not only counteract the immense vocabulary growth that comes with an increasing amount of training data, but also reduce the rate of out-of-vocabulary words, which significantly degrades recognition performance. A smaller dictionary also yields a 30% speed improvement during recognition. Moreover, even a large amount of training data is often not enough to guarantee robust language model estimates, whereas morpheme-based models achieve this.
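One simple way to decompose compounds into smaller base units, in the spirit of the morphological decomposition discussed above, is a greedy longest-match against a morpheme inventory. The inventory, the greedy strategy, and the ASCII spellings below are illustrative assumptions, not one of the paper's actual methods.

```python
# Toy sketch: greedy longest-match decomposition of a compound word
# against a small morpheme inventory (here, parts of the German compound
# "Haustuerschluessel", written without umlauts).
def decompose(word, morphemes):
    """Split word into the longest matching morphemes, left to right;
    if no full decomposition exists, keep the word intact."""
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in morphemes:
                parts.append(word[i:j])
                i = j
                break
        else:
            return [word]        # no full decomposition found
    return parts

morphemes = {"haus", "tuer", "schluessel"}
print(decompose("haustuerschluessel", morphemes))
# -> ['haus', 'tuer', 'schluessel']
```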

300dpi TIFF Images of pages:

445 446 447 448

Acrobat PDF file of whole paper:

ic950445.pdf




Optimal Splitting of HMM Gaussian Mixture Components with MMIE Training

Authors:

Yves Normandin, Centre de Recherche Informatique de Montréal (CANADA)

Volume 1, Page 449

Abstract:

A novel approach to splitting Gaussian mixture components based on the use of MMIE training is proposed. The idea is to increase acoustic resolution only in those distributions where discrimination problems are identified. Problem mixture components are determined by looking at each mixture weight counter; a large positive counter value indicates both that the component often tends not to be recognized correctly (i.e., is not part of the best path when it should be) and that there is sufficient training data to split the component. Results in a connected digit recognition experiment on the TIDIGITS corpus indicate that much better results can be obtained with such MMIE trained digit models than with MLE trained models that use several times more mixture components.
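The selection rule described in the abstract can be sketched in a few lines: the mixture component with the largest positive weight counter is split, offsetting the two copies along its standard deviation. The perturbation scheme, `eps`, and the 1-D setting are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: split the component whose MMIE weight counter is largest
# and positive (often missed in recognition yet well supported by data).
import math

def split_component(means, vars_, counters, eps=0.2):
    """Duplicate the component most in need of extra acoustic resolution."""
    k = max(range(len(counters)), key=lambda i: counters[i])
    if counters[k] <= 0:
        return means, vars_              # no discrimination problem found
    sd = math.sqrt(vars_[k])
    means = means[:k] + [means[k] - eps * sd, means[k] + eps * sd] + means[k + 1:]
    vars_ = vars_[:k] + [vars_[k], vars_[k]] + vars_[k + 1:]
    return means, vars_

m, v = split_component([0.0, 5.0], [4.0, 1.0], [3.5, -1.0])
# component 0 (counter 3.5) is split into means -0.4 and +0.4
```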

300dpi TIFF Images of pages:

449 450 451 452

Acrobat PDF file of whole paper:

ic950449.pdf




Dictionary Learning: Performance through Consistency

Authors:

Tilo Sloboda, Universität Karlsruhe (GERMANY)

Volume 1, Page 453

Abstract:

We present first results from our efforts to automatically extend and adapt phonetic dictionaries for spontaneous speech recognition. For phonetic dictionaries (especially for spontaneous speech) it is important to choose the pronunciations of a word according to their frequency in the database rather than the ``correct'' pronunciation as it might be found in a lexicon. Modifications of the dictionary should not lead to higher phoneme confusability. We therefore propose a data-driven approach that adds new pronunciations to a given phonetic dictionary such that they model the observed occurrences of words in the database. We show how even a simple approach can lead to significant improvements in recognition performance. Experiments were performed on the German Spontaneous Scheduling Task (GSST), using the speech recognition engine of JANUS-2, the spontaneous speech-to-speech translation system of the Interactive Systems Laboratories at Carnegie Mellon and Karlsruhe University.
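The frequency-driven selection of pronunciation variants can be illustrated with a toy example: variants of a word observed in the training data are kept only if they are frequent enough. The relative-frequency threshold and the phone strings are assumptions for this sketch, not the paper's actual criterion.

```python
# Illustrative sketch: keep the pronunciation variants of a word that
# occur often enough in the (toy) training data.
from collections import Counter

def learn_variants(observed, min_rel_freq=0.3):
    """observed: phone-string pronunciations seen for one word.
    Returns the variants whose relative frequency meets the threshold."""
    counts = Counter(observed)
    total = sum(counts.values())
    return sorted(p for p, c in counts.items() if c / total >= min_rel_freq)

obs = ["ax n d", "ax n d", "ax n", "ax n", "ah n d"]
print(learn_variants(obs))   # -> ['ax n', 'ax n d']; rare 'ah n d' is dropped
```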

300dpi TIFF Images of pages:

453 454 455 456

Acrobat PDF file of whole paper:

ic950453.pdf




Incremental MAP Estimation of HMMs for Efficient Training and Improved Performance

Authors:

Yoshihiko Gotoh, Brown University (USA)
Michael M. Hochberg, Cambridge University (UK)
Daniel J. Mashao, Brown University (USA)
Harvey F. Silverman, Brown University (USA)

Volume 1, Page 457

Abstract:

Continuous density observation hidden Markov models (CD-HMMs) have been shown to perform better than their discrete counterparts. However, because the observation distribution is usually represented with a mixture of multivariate normal densities, the training time for a CD-HMM can be prohibitively long. This paper presents a new approach to speed up the convergence of CD-HMM training using a stochastic, incremental variant of the EM algorithm. The algorithm randomly selects a subset of data from the training set, updates the model using maximum a posteriori estimation, and then iterates until convergence. Experimental results show that the convergence of this approach is nearly an order of magnitude faster than the standard batch training algorithm. In addition, incremental learning of the model parameters improved recognition performance compared with the batch version.
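The subset-then-MAP-update loop can be sketched for the simplest possible case: a 1-D Gaussian mean with a conjugate normal prior. The real paper updates full CD-HMM mixture densities; the function name, `tau`, and the batch sizes below are illustrative assumptions.

```python
# Hedged sketch of incremental MAP estimation: draw random subsets,
# accumulate sufficient statistics, and re-apply the MAP estimate
#   mu = (tau * prior_mean + sum x) / (tau + n)
# after each subset.
import random

def incremental_map_mean(data, prior_mean=0.0, tau=10.0,
                         batch=50, sweeps=20, seed=0):
    rng = random.Random(seed)
    s, n = 0.0, 0
    mu = prior_mean
    for _ in range(sweeps):
        subset = rng.sample(data, min(batch, len(data)))
        s += sum(subset)
        n += len(subset)
        mu = (tau * prior_mean + s) / (tau + n)
    return mu

rng = random.Random(1)
data = [rng.gauss(3.0, 1.0) for _ in range(500)]
mu = incremental_map_mean(data)   # close to the sample mean of 3.0
```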

300dpi TIFF Images of pages:

457 458 459 460

Acrobat PDF file of whole paper:

ic950457.pdf




Discrete MMI Probability Models for HMM Speech Recognition

Authors:

J.T. Foote, Cambridge University (UK)

Volume 1, Page 461

Abstract:

This paper presents a method of non-parametrically modeling HMM output probabilities. Discrete output probabilities are estimated from a tree-based MMI partition of the feature space, rather than the usual vector quantization. One advantage of a decision-tree method is that very high-dimensional spaces can be partitioned. Time variation can then be explicitly modeled by concatenating time-adjacent vectors, which is shown to improve recognition performance. Though the model is discrete, it provides recognition performance better than 1-component Gaussian mixture HMMs on the ARPA Resource Management (RM) task. This method is not without drawbacks: because of its non-parametric nature, a large number of parameters are needed for a good model and the available RM training data is probably not sufficient. Besides the computational advantages of a discrete model, this method has promising applications in talker identification, adaptation, and clustering.
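One node of a mutual-information-driven partition can be sketched in 1-D: choose the threshold that maximizes the mutual information between the induced split and the class labels. The paper grows a full tree over high-dimensional feature vectors; this single-split, scalar version is an illustrative assumption.

```python
# Toy sketch of an MMI split: pick the 1-D threshold maximizing mutual
# information between the indicator [x > t] and the class labels.
import math
from collections import Counter

def mutual_information(xs, ys, t):
    """MI (in nats) between the partition [x > t] and the labels ys."""
    n = len(xs)
    joint = Counter((x > t, y) for x, y in zip(xs, ys))
    px = Counter(x > t for x in xs)
    py = Counter(ys)
    return sum(c / n * math.log((c / n) / ((px[b] / n) * (py[y] / n)))
               for (b, y), c in joint.items())

def best_split(xs, ys):
    """Observed value whose threshold gives maximal mutual information."""
    candidates = sorted(set(xs))[:-1]    # splitting above the max is useless
    return max(candidates, key=lambda t: mutual_information(xs, ys, t))

xs = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))   # -> 0.3, the split that separates the classes
```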

300dpi TIFF Images of pages:

461 462 463 464

Acrobat PDF file of whole paper:

ic950461.pdf




Global Discrimination for Neural Predictive Systems Based on N-Best Algorithm

Authors:

Abdelhamid Mellouk, LRI UA 410 CNRS
Patrick Gallinari, LAFORIA UA CNRS 1095 (FRANCE)

Volume 1, Page 465

Abstract:

We describe a general formalism for training neural predictive systems. We then introduce discrimination at the frame level and show how it relates to maximum mutual information training. Finally, we propose an approach for performing discrimination in predictive systems at the sequence level, which makes use of N-best sequence selection. Acoustic-phonetic decoding reaches 77.4% phone accuracy on the 1988 version of TIMIT.

300dpi TIFF Images of pages:

465 466 467 468

Acrobat PDF file of whole paper:

ic950465.pdf




Enhancement of Discriminative Capabilities of HMM Based Recognizer through Modification of Viterbi Algorithm

Authors:

Jianming Song, The University of Wollongong (AUSTRALIA)

Volume 1, Page 469

Abstract:

The algorithm proposed in this paper integrates the concepts of variable frame rate and discriminative analysis based on the Tanimoto ratio to modify the conventional Viterbi algorithm, in such a way that steady or stationary signal segments are compressed while transitional or non-stationary segments are emphasized during the frame-by-frame search. The usefulness of each frame is decided entirely within the Viterbi process and need not be the same for different models. To evaluate this algorithm, we tested on a speech database of the 9 highly confusable E-set English letters. With 5 states and 6 mixture components, the conventional HMM baseline system delivered a recognition accuracy of only 73.9%. In a preliminary experiment using the proposed algorithm, the recognition accuracy increased to 82.5%.

300dpi TIFF Images of pages:

469 470 471 472

Acrobat PDF file of whole paper:

ic950469.pdf




A Generalization of the Baum Algorithm to Functions on Non-linear Manifolds

Authors:

D. Kanevsky, IBM T.J. Watson Research Center (USA)

Volume 1, Page 473

Abstract:

The well-known Baum-Eagon inequality provides an effective iterative scheme for maximizing homogeneous polynomials with positive coefficients over a domain of probability values. However, in many applications (e.g. corrective training) we are interested in maximizing an objective function over a domain that is different from the domain of probability values and may be defined by non-linear constraints. In this paper we show how to extend the basic Baum-Eagon inequality to (not necessarily rational) functions defined on general manifolds. We describe an effective iterative scheme based on this inequality and its application to estimation problems via minimum information discrimination.
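The classical Baum-Eagon growth transform that the paper generalizes is short enough to state directly: for a homogeneous polynomial P with positive coefficients over the probability simplex, the update x_i ← x_i (∂P/∂x_i) / Σ_j x_j (∂P/∂x_j) never decreases P. The example polynomial below is an illustrative choice, not one from the paper.

```python
# Sketch of the classical Baum-Eagon growth transform on the simplex.
def growth_transform(x, grad):
    """One Baum-Eagon update step for a point x on the probability simplex."""
    g = grad(x)
    z = sum(xi * gi for xi, gi in zip(x, g))
    return [xi * gi / z for xi, gi in zip(x, g)]

# Example: P(x) = x0^2 * x1 is maximized on the simplex at (2/3, 1/3).
grad = lambda x: [2 * x[0] * x[1], x[0] ** 2]
x = [0.5, 0.5]
for _ in range(100):
    x = growth_transform(x, grad)
# x is now approximately [2/3, 1/3], and stays on the simplex
```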

300dpi TIFF Images of pages:

473 474 475 476

Acrobat PDF file of whole paper:

ic950473.pdf




Data-Driven Codebook Adaptation in Phonetically Tied SCHMMs

Authors:

Thomas Kemp, Universität Karlsruhe (GERMANY)

Volume 1, Page 477

Abstract:

This paper reports the results of our experiments on automatically optimizing the number of parameters in the semi-continuous, phonetically tied HMM-based speech recognition system that is part of the speech-to-speech translation system JANUS-2. We propose different algorithms devised to determine the optimal number of model parameters. In recognition experiments on a spontaneous human-to-human dialog database, we show that automatic optimization of the acoustic model size with the proposed algorithm improves recognition performance without increasing the required computing power and memory.

300dpi TIFF Images of pages:

477 478 479

Acrobat PDF file of whole paper:

ic950477.pdf
