Chair: Hsiao-Wuen Hon, Apple ISS Research Centre, University of Singapore (SINGAPORE)
Satoshi Takahashi, NTT Human Interface Laboratories (JAPAN)
Shigeki Sagayama, NTT Human Interface Laboratories (JAPAN)
One of the problems with context-dependent HMMs is that a large number of model parameters must be estimated from a limited amount of training data. Parameters that share the same properties should be tied in order to represent acoustic models efficiently. This paper proposes tying at four levels: 1) the model level, 2) the state level, 3) the distribution level, and 4) the feature parameter level. Although techniques have been proposed for the first three levels, feature parameter tying at the fourth level is newly proposed in this paper. We found that feature parameter tying makes it possible to represent 1,600 mean vectors of multivariate Gaussian mixture HMMs using combinations of 16 representative mean values in each dimension. Experimental results show that feature parameter tying reduces the amount of computation required for recognition without significantly degrading performance. Furthermore, we found that feature parameter tying is also effective for model training.
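As a rough illustration of the feature-parameter-tying idea (not the authors' exact algorithm), the mean values in each feature dimension can be scalar-quantized into a small per-dimension codebook, e.g. 16 representative values, so that every mean vector is stored as a vector of codebook indices. The sketch below assumes diagonal-covariance Gaussians; all function and parameter names are hypothetical.

    import numpy as np

    def tie_feature_parameters(means, n_codes=16, n_iter=20):
        """Scalar-quantize each dimension of a set of Gaussian mean vectors.

        means   : (n_vectors, n_dims) array of mean vectors
        n_codes : number of representative values per dimension
        Returns (codebooks, indices) such that means[i, d] is approximated
        by codebooks[d, indices[i, d]].
        """
        n_vectors, n_dims = means.shape
        codebooks = np.empty((n_dims, n_codes))
        indices = np.empty((n_vectors, n_dims), dtype=np.int32)
        for d in range(n_dims):
            x = means[:, d]
            # initialise codewords from quantiles of this dimension
            codes = np.quantile(x, np.linspace(0.0, 1.0, n_codes))
            for _ in range(n_iter):  # one-dimensional k-means (Lloyd iterations)
                assign = np.abs(x[:, None] - codes[None, :]).argmin(axis=1)
                for k in range(n_codes):
                    if np.any(assign == k):
                        codes[k] = x[assign == k].mean()
            codebooks[d] = codes
            indices[:, d] = assign
        return codebooks, indices

    def reconstruct_means(codebooks, indices):
        """Rebuild the tied mean vectors from the per-dimension codebooks."""
        n_dims = indices.shape[1]
        return np.stack([codebooks[d, indices[:, d]] for d in range(n_dims)], axis=1)

Under such a representation, the per-dimension distances between an input frame and the 16 representative values can be computed once per frame and reused for all 1,600 tied mean vectors, which is one way the reported reduction in recognition-time computation can arise.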
Christian Dugast, Philips Research Laboratories (GERMANY)
Peter Beyerlein, Philips Research Laboratories (GERMANY)
Reinhold Haeb-Umbach, Philips Research Laboratories (GERMANY)
Clustering techniques have been integrated at different levels into the training procedure of a continuous-density hidden Markov model (HMM) speech recognizer. These clustering techniques can be used in two ways. First, acoustically similar states are tied together; this helps to reduce the number of parameters and also allows otherwise rarely seen states to be trained together with more robust ones (state tying). Second, densities are clustered across states; this reduces the number of densities while preserving the best performance of our recognizer (density clustering). We have applied these techniques to both word-based small-vocabulary and phoneme-based large-vocabulary recognition tasks. On the WSJ task, we achieved a reduction of the word error rate by 7%. On the TI/NIST connected-digit task, the number of parameters was reduced by a factor of 2-3 while keeping the same string error rate.
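The density-clustering step can be pictured as bottom-up merging of similar Gaussians until a target codebook size is reached. The following is a minimal sketch of that idea, not the Philips implementation; the distance measure, the diagonal-covariance assumption, and all names are assumptions made here for illustration.

    import numpy as np

    def cluster_densities(means, variances, target_count):
        """Greedy agglomerative clustering of diagonal Gaussian densities.

        means, variances : lists of (n_dims,) arrays, one pair per density
        target_count     : number of densities to keep
        """
        means = [np.asarray(m, dtype=float) for m in means]
        variances = [np.asarray(v, dtype=float) for v in variances]
        counts = [1.0] * len(means)   # how many original densities each cluster holds
        while len(means) > target_count:
            best, best_d = None, np.inf
            for i in range(len(means)):
                for j in range(i + 1, len(means)):
                    # variance-normalised distance between means (a stand-in
                    # for whatever divergence a real system would use)
                    d = np.sum((means[i] - means[j]) ** 2 / (variances[i] + variances[j]))
                    if d < best_d:
                        best, best_d = (i, j), d
            i, j = best
            w = counts[i] + counts[j]
            merged_mean = (counts[i] * means[i] + counts[j] * means[j]) / w
            # pooled second moment gives the merged variance
            merged_var = (counts[i] * (variances[i] + means[i] ** 2)
                          + counts[j] * (variances[j] + means[j] ** 2)) / w - merged_mean ** 2
            means[i], variances[i], counts[i] = merged_mean, merged_var, w
            del means[j], variances[j], counts[j]
        return means, variances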
Michael D. Monkowski, IBM - TJ Watson Research Center (USA)
Michael A. Picheny, IBM - TJ Watson Research Center (USA)
P. Srinivasa Rao, IBM - TJ Watson Research Center (USA)
Phonetic context was used to predict the durations of phones using a decision tree. These predictions were used to calculate context-dependent HMM transition probabilities for the phone models, which were then used to decode telephone conversations from the SwitchBoard corpus. We observed that the duration models do not appreciably improve the word error rate and that more can be gained by modeling phone durations within words than by adjusting for local average speaking rates; we conclude that local or global variations in speaking rate are not major contributors to the observed high error rates on SwitchBoard.
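As a simple worked example of how a predicted duration can be turned into a transition probability (not necessarily the mapping used in the paper): a single HMM state with self-loop probability a has a geometric duration distribution with expected duration 1/(1 - a), so a predicted mean duration of D frames corresponds to a = 1 - 1/D.

    def self_loop_from_duration(mean_frames):
        """Self-loop probability whose implied geometric duration model has the
        given expected duration in frames: E[d] = 1/(1 - a)  =>  a = 1 - 1/E[d]."""
        assert mean_frames >= 1.0
        return 1.0 - 1.0 / mean_frames

    # e.g. a phone predicted by the tree to last 8 frames gets a
    # self-loop probability of 0.875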
Jun He, Faculte Polytechnique de Mons (BELGIUM)
Henri Leich, Faculte Polytechnique de Mons (BELGIUM)
There are two major approaches to speech recognition: the frame-based and the segment-based approach. The frame-based approach, e.g. the HMM, assumes that the observations within each state are statistically independent and identically distributed. In addition, it incorporates only weak duration constraints. The segment-based approach is computationally expensive, and rough modelling easily occurs if too few 'templates' are stored. This paper presents a new framework that incorporates segmental features and a segmental model in a unified way into a frame-based HMM, in order to exploit the advantages of both methods. In the modified Viterbi algorithm, frame-based information selects the most probable paths at each segment level, to which the segmental model can then be applied with dramatically reduced computational load; at the same time, the segmental score refines the score obtained by the frame-based model at each level. In this way, the best path found at the end of the Viterbi algorithm is optimal in both senses.
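The prune-then-refine idea at a single segment level can be sketched as follows; this is a deliberate simplification (the paper integrates it into the Viterbi recursion itself), and the scoring functions, beam width, and weighting are assumptions.

    def rescore_segments(segments, frame_score, segmental_score, beam=10.0, weight=0.5):
        """Prune segment hypotheses with cheap frame-based scores, then refine
        the survivors with an expensive segmental score.

        segments        : list of (start, end, model_id) hypotheses
        frame_score     : frame-based log-likelihood, f(start, end, model_id)
        segmental_score : segmental log-likelihood, same signature
        """
        scored = [(seg, frame_score(*seg)) for seg in segments]
        best = max(s for _, s in scored)
        survivors = [(seg, s) for seg, s in scored if s >= best - beam]  # beam pruning
        return sorted(((seg, s + weight * segmental_score(*seg)) for seg, s in survivors),
                      key=lambda x: -x[1])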
Wendy J. Holmes, DRA Malvern (UK)
Martin J. Russell, DRA Malvern (UK)
The aim of the research described in this paper is to overcome important speech-modeling limitations of conventional HMMs by developing a dynamic segmental HMM which models the changing pattern of speech over the duration of some phoneme-type unit. As a first step towards this goal, a static segmental HMM has been implemented and tested. This model reduces the influence of the independence assumption by using two processes to model variability due to long-term factors separately from local variability within a segment. Experiments have demonstrated that the performance of segmental HMMs relative to conventional HMMs depends on the quality of the system in which they are embedded. On a connected-digit recognition task, for example, static segmental HMMs outperformed conventional HMMs for triphone systems but not for a vocabulary-independent monophone system. It is concluded that static segmental HMMs improve performance as long as the independence assumption is a major limiting factor in the system.
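A common formulation of the static segmental HMM (used here for illustration; the paper's exact parameterisation may differ) scores a segment with two Gaussian processes: an extra-segmental distribution over a segment target c, and an intra-segmental distribution of frames around that target. With diagonal covariances and the usual maximisation over c, the segment score has a closed form:

    import numpy as np

    def segmental_log_likelihood(frames, target_mean, extra_var, intra_var):
        """Log-likelihood of one segment under a static segmental HMM
        (diagonal Gaussians, max approximation over the segment target c).

        frames      : (T, D) observation vectors of the segment
        target_mean : (D,) mean of the extra-segmental (target) distribution
        extra_var   : (D,) variance of the target (long-term variability)
        intra_var   : (D,) variance of frames around the target (local variability)
        """
        frames = np.asarray(frames, dtype=float)
        T = frames.shape[0]
        # the target c maximising the joint density has a closed form
        c = ((target_mean / extra_var + frames.sum(axis=0) / intra_var)
             / (1.0 / extra_var + T / intra_var))

        def log_gauss(x, mean, var):
            return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

        ll = log_gauss(c, target_mean, extra_var)               # extra-segmental term
        ll += sum(log_gauss(f, c, intra_var) for f in frames)   # intra-segmental terms
        return ll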
Helmut Lucke, ATR/ITL (JAPAN)
This paper argues that many HMM modeling inaccuracies are a direct consequence of the fact that the HMM is a one-dimensional stochastic model applied to a two-dimensional process. We therefore argue that a two-dimensional stochastic process, known as a Markov Random Field (MRF), should perform better. We describe a training method for MRFs and analyze its convergence behavior.
F. Wolfertstetter, Munich University of Technology (GERMANY)
G. Ruske, Munich University of Technology (GERMANY)
This paper proposes a new model of the structure of speech units as a graph consisting of base functions and a transition network. A clustering algorithm that takes into account the actual temporal context of the feature vectors is used to generate the base functions, which are approximated by normal distributions. The subsequent maximum-likelihood training procedure establishes the transition network and adjusts the transition probabilities. The resulting graphs for the speech units are structures of branching and recombining trajectory segments that describe statistical dependencies in the feature vector sequence within the speech units as well as in the transition regions between them. A speaker-independent evaluation shows the superiority of the proposed modeling compared to mixture-state HMMs, even for an equal number of model parameters.
David Burshtein, Tel-Aviv University (ISRAEL)
We address the problem of explicit state and word duration modeling in hidden Markov models (HMMs). A major weakness of conventional HMMs is that they implicitly model state durations by a geometric distribution, which is usually inappropriate. Using explicit modeling of state and word durations, it is possible to significantly enhance the performance of speech recognition systems. The main outcome of this work is a modified Viterbi algorithm that, by incorporating both state and word duration modeling, reduces the string error rate of the conventional Viterbi algorithm by 29% and 43% for known and unknown string lengths, respectively, on a speaker-independent connected-digit string task. The uniqueness of the algorithm is that, unlike alternative approaches, it adds the duration metric at each frame transition (rather than at the end of a state, word, or sentence), thus enhancing performance.
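One way to see how a duration score can be charged at every frame transition rather than at segment ends (a general decomposition, not necessarily the paper's exact metric) is to factor the duration distribution into hazard rates: P(D = d) = h(d) * prod_{k<d} (1 - h(k)), with h(k) = P(D = k | D >= k). Each factor is then added, in the log domain, at the frame transition where the path either stays in or leaves the state.

    import numpy as np

    def duration_increments(duration_pmf):
        """Per-frame log-score increments for an explicit duration model.

        duration_pmf : (Dmax,) array with duration_pmf[d-1] = P(D = d)
        Returns (log_stay, log_leave): for a path that has spent k frames in a
        state, log_stay[k-1] is added if it stays another frame and
        log_leave[k-1] if it leaves.  The increments along a stay of d frames
        sum exactly to log P(D = d), but the cost is spread over every frame
        transition instead of being applied only when the state ends.
        """
        pmf = np.asarray(duration_pmf, dtype=float)
        survivor = np.concatenate(([1.0], 1.0 - np.cumsum(pmf)))  # survivor[k-1] = P(D >= k)
        hazard = pmf / np.maximum(survivor[:-1], 1e-300)          # h(k) = P(D = k | D >= k)
        log_leave = np.log(np.maximum(hazard, 1e-300))
        log_stay = np.log(np.maximum(1.0 - hazard, 1e-300))
        return log_stay, log_leave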
Roland Kuhn, CRIM (CANADA)
Ariane Lazarides, CRIM (CANADA)
Yves Normandin, CRIM (CANADA)
Julie Brousseau, CRIM (CANADA)
Bahl et al. (1991) employed decision trees to specify the acoustic realization of a phone in context. They used a computationally cheap Poisson-based criterion to evaluate a large number of questions about the context quickly. We extend this work in four ways: 1. We employ the Poisson criterion to find the M best questions at a node during tree expansion, then use an HMM-based MLE criterion to make the final choice from these. 2. Bahl et al. use stopping criteria to halt the growth of a tree; we apply the Gelfand et al. (1991) iterative expansion-pruning algorithm. 3. In addition to the "YES" and "NO" children of each question, we grow a "DON'T KNOW" subtree to be used if a question is currently unanswerable. 4. We experiment with questions based on phonetic features in the context, as well as questions about the presence of specific phones.
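The first extension amounts to a two-stage question selection, which can be sketched generically as follows (the scoring functions and the shortlist size m are placeholders, not the paper's actual criteria):

    def choose_question(questions, cheap_score, expensive_score, m=5):
        """Two-stage question selection for decision-tree growing.

        cheap_score     : fast criterion (e.g. a Poisson-based likelihood gain)
        expensive_score : accurate criterion (e.g. an HMM-based ML gain)
        The cheap criterion shortlists the m best candidates; the expensive
        criterion makes the final choice among those only.
        """
        shortlist = sorted(questions, key=cheap_score, reverse=True)[:m]
        return max(shortlist, key=expensive_score)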
Takao Watanabe, NEC Corporation (JAPAN)
Koichi Shinoda, NEC Corporation (JAPAN)
Keizaburo Takagi, NEC Corporation (JAPAN)
Ken-ichi Iso, NEC Corporation (JAPAN)
This paper proposes a new speech recognition method using tree-structured probability density functions (pdfs) to realize high-speed HMM-based speech recognition. To reduce the likelihood calculation for a pdf set composed of the Gaussian pdfs for all mixture components, all states, and all recognition units, the calculation is done only coarsely for element pdfs whose likelihood is unlikely to be large. The pdf set is expressed in a tree-structured form. In recognition, the likelihood set is calculated by searching the tree: the likelihood of the cluster pdf at each node is computed, and the nodes with the largest likelihoods are traversed from the root downward. Experimental results showed that the amount of computation was drastically reduced with little loss of recognition accuracy, in both speaker-independent and speaker-adaptive cases. The algorithm was applied to speech recognition software running on a personal computer without special hardware.
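A rough sketch of the tree-search idea follows; the node class, the single-Gaussian cluster pdfs, and the one-best ("beam") expansion rule are assumptions made here for illustration and may differ from the paper's actual pruning scheme.

    import numpy as np

    class PdfNode:
        """Node of a tree-structured pdf set: a cluster Gaussian plus children.
        Leaves correspond to the original element pdfs."""
        def __init__(self, mean, var, children=(), pdf_id=None):
            self.mean, self.var = np.asarray(mean, float), np.asarray(var, float)
            self.children, self.pdf_id = list(children), pdf_id

    def log_gauss(x, mean, var):
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

    def tree_likelihoods(root, x, beam=1):
        """Approximate log-likelihoods of all element pdfs for observation x.

        Only the `beam` children with the largest cluster likelihoods are
        expanded at each node; every leaf below an unexpanded child inherits
        the coarse likelihood of that cluster node.
        """
        scores = {}

        def assign_coarse(node, score):
            if not node.children:
                scores[node.pdf_id] = score
            for child in node.children:
                assign_coarse(child, score)

        def descend(node):
            if not node.children:
                scores[node.pdf_id] = log_gauss(x, node.mean, node.var)
                return
            ranked = sorted(((log_gauss(x, c.mean, c.var), c) for c in node.children),
                            key=lambda s: -s[0])
            for s, child in ranked[:beam]:
                descend(child)              # refine the most promising branches
            for s, child in ranked[beam:]:
                assign_coarse(child, s)     # coarse score for pruned branches

        descend(root)
        return scores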