Chair: Hsiao-Wuen Hon, Apple ISS Research Centre, University of Singapore (SINGAPORE)
Satoshi Takahashi, NTT Human Interface Laboratories (JAPAN)
Shigeki Sagayama, NTT Human Interface Laboratories (JAPAN)
One of the problems with context-dependent HMMs is that a large number of model parameters must be estimated from a limited amount of training data. Parameters that share the same properties should be tied in order to represent acoustic models efficiently. This paper proposes tying at four levels: 1) the model level, 2) the state level, 3) the distribution level, and 4) the feature parameter level. Although techniques have been proposed for the first three levels, feature parameter tying at the fourth level is newly proposed in this paper. We found that feature parameter tying makes it possible to represent 1,600 mean vectors of multivariate Gaussian mixture HMMs using combinations of 16 representative mean values in each dimension. Experimental results show that feature parameter tying reduces the amount of computation required for recognition without significantly degrading performance. Furthermore, we found that feature parameter tying is also effective for model training.
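As a rough illustration of the feature-parameter-tying idea (not the authors' exact algorithm), the mean values in each feature dimension can be scalar-quantized into a small per-dimension codebook, e.g. 16 representative values, so that every mean vector is stored as a vector of codebook indices. The sketch below assumes diagonal-covariance Gaussians; all function and parameter names are hypothetical.

    import numpy as np

    def tie_feature_parameters(means, n_codes=16, n_iter=20):
        """Scalar-quantize each dimension of a set of Gaussian mean vectors.

        means   : (n_vectors, n_dims) array of mean vectors
        n_codes : number of representative values per dimension
        Returns (codebooks, indices) such that means[i, d] is approximated
        by codebooks[d, indices[i, d]].
        """
        n_vectors, n_dims = means.shape
        codebooks = np.empty((n_dims, n_codes))
        indices = np.empty((n_vectors, n_dims), dtype=np.int32)
        for d in range(n_dims):
            x = means[:, d]
            # initialise codewords from quantiles of this dimension
            codes = np.quantile(x, np.linspace(0.0, 1.0, n_codes))
            for _ in range(n_iter):  # one-dimensional k-means (Lloyd iterations)
                assign = np.abs(x[:, None] - codes[None, :]).argmin(axis=1)
                for k in range(n_codes):
                    if np.any(assign == k):
                        codes[k] = x[assign == k].mean()
            codebooks[d] = codes
            indices[:, d] = assign
        return codebooks, indices

    def reconstruct_means(codebooks, indices):
        """Rebuild the tied mean vectors from the per-dimension codebooks."""
        n_dims = indices.shape[1]
        return np.stack([codebooks[d, indices[:, d]] for d in range(n_dims)], axis=1)

Under such a representation, the per-dimension distances between an input frame and the 16 representative values can be computed once per frame and reused for all 1,600 tied mean vectors, which is one way the reported reduction in recognition-time computation can arise.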
Christian Dugast, Philips Research Laboratories (GERMANY)
Peter Beyerlein, Philips Research Laboratories (GERMANY)
Reinhold Haeb-Umbach, Philips Research Laboratories (GERMANY)
Clustering techniques have been integrated at different levels into the training procedure of a continuous-density hidden Markov model (HMM) speech recognizer. These clustering techniques can be used in two ways. First, acoustically similar states are tied together; this helps to reduce the number of parameters and also allows otherwise rarely seen states to be trained together with more robust ones (state tying). Second, densities are clustered across states; this reduces the number of densities while preserving the best performance of our recognizer (density clustering). We have applied these techniques to both word-based small-vocabulary and phoneme-based large-vocabulary recognition tasks. On the WSJ task, we achieved a reduction of the word error rate by 7%. On the TI/NIST connected-digit task, the number of parameters was reduced by a factor of 2-3 while keeping the same string error rate.
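The density-clustering step can be pictured as bottom-up merging of similar Gaussians until a target codebook size is reached. The following is a minimal sketch of that idea, not the Philips implementation; the distance measure, the diagonal-covariance assumption, and all names are assumptions made here for illustration.

    import numpy as np

    def cluster_densities(means, variances, target_count):
        """Greedy agglomerative clustering of diagonal Gaussian densities.

        means, variances : lists of (n_dims,) arrays, one pair per density
        target_count     : number of densities to keep
        """
        means = [np.asarray(m, dtype=float) for m in means]
        variances = [np.asarray(v, dtype=float) for v in variances]
        counts = [1.0] * len(means)   # how many original densities each cluster holds
        while len(means) > target_count:
            best, best_d = None, np.inf
            for i in range(len(means)):
                for j in range(i + 1, len(means)):
                    # variance-normalised distance between means (a stand-in
                    # for whatever divergence a real system would use)
                    d = np.sum((means[i] - means[j]) ** 2 / (variances[i] + variances[j]))
                    if d < best_d:
                        best, best_d = (i, j), d
            i, j = best
            w = counts[i] + counts[j]
            merged_mean = (counts[i] * means[i] + counts[j] * means[j]) / w
            # pooled second moment gives the merged variance
            merged_var = (counts[i] * (variances[i] + means[i] ** 2)
                          + counts[j] * (variances[j] + means[j] ** 2)) / w - merged_mean ** 2
            means[i], variances[i], counts[i] = merged_mean, merged_var, w
            del means[j], variances[j], counts[j]
        return means, variances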
Michael D. Monkowski, IBM - TJ Watson Research Center (USA)
Michael A. Picheny, IBM - TJ Watson Research Center (USA)
P. Srinivasa Rao, IBM - TJ Watson Research Center (USA)
Phonetic context was used to predict the durations of phones using a decision tree. These predictions were used to calculate context-dependent HMM transition probabilities for the phone models, which were then used to decode telephone conversations from the SwitchBoard corpus. We observed that the duration models do not appreciably improve the word error rate and that more can be gained by modeling phone durations within words than by adjusting for local average speaking rates; we conclude that local or global variations in speaking rate are not major contributors to the observed high error rates on SwitchBoard.
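As a simple worked example of how a predicted duration can be turned into a transition probability (not necessarily the mapping used in the paper): a single HMM state with self-loop probability a has a geometric duration distribution with expected duration 1/(1 - a), so a predicted mean duration of D frames corresponds to a = 1 - 1/D.

    def self_loop_from_duration(mean_frames):
        """Self-loop probability whose implied geometric duration model has the
        given expected duration in frames: E[d] = 1/(1 - a)  =>  a = 1 - 1/E[d]."""
        assert mean_frames >= 1.0
        return 1.0 - 1.0 / mean_frames

    # e.g. a phone predicted by the tree to last 8 frames gets a
    # self-loop probability of 0.875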
Jun He, Faculte Polytechnique de Mons (BELGIUM)
Henri Leich, Faculte Polytechnique de Mons (BELGIUM)
There are two major approaches to speech recognition: the frame-based and the segment-based approach. The frame-based approach, e.g. the HMM, assumes that the observations within each state are statistically independent and identically distributed. In addition, it incorporates only weak duration constraints. The segment-based approach is computationally expensive, and rough modelling easily occurs if too few 'templates' are stored. This paper presents a new framework that incorporates segmental features and a segmental model in a unified way into a frame-based HMM, in order to exploit the advantages of both methods. In the modified Viterbi algorithm, frame-based information selects the most probable paths at each segment level, to which the segmental model can then be applied with dramatically reduced computational load; at the same time, the segmental score refines the score obtained by the frame-based model at each level. In this way, the best path found at the end of the Viterbi algorithm is optimal in both senses.
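The prune-then-refine idea at a single segment level can be sketched as follows; this is a deliberate simplification (the paper integrates it into the Viterbi recursion itself), and the scoring functions, beam width, and weighting are assumptions.

    def rescore_segments(segments, frame_score, segmental_score, beam=10.0, weight=0.5):
        """Prune segment hypotheses with cheap frame-based scores, then refine
        the survivors with an expensive segmental score.

        segments        : list of (start, end, model_id) hypotheses
        frame_score     : frame-based log-likelihood, f(start, end, model_id)
        segmental_score : segmental log-likelihood, same signature
        """
        scored = [(seg, frame_score(*seg)) for seg in segments]
        best = max(s for _, s in scored)
        survivors = [(seg, s) for seg, s in scored if s >= best - beam]  # beam pruning
        return sorted(((seg, s + weight * segmental_score(*seg)) for seg, s in survivors),
                      key=lambda x: -x[1])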
Wendy J. Holmes, DRA Malvern (UK)
Martin J. Russell, DRA Malvern (UK)
The aim of the research described in this paper is to overcome important speech-modeling limitations of conventional HMMs by developing a dynamic segmental HMM which models the changing pattern of speech over the duration of some phoneme-type unit. As a first step towards this goal, a static segmental HMM has been implemented and tested. This model reduces the influence of the independence assumption by using two processes to model variability due to long-term factors separately from local variability within a segment. Experiments have demonstrated that the performance of segmental HMMs relative to conventional HMMs depends on the quality of the system in which they are embedded. On a connected-digit recognition task, for example, static segmental HMMs outperformed conventional HMMs for triphone systems but not for a vocabulary-independent monophone system. It is concluded that static segmental HMMs improve performance as long as the independence assumption is a major limiting factor in the system.
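A common formulation of the static segmental HMM (used here for illustration; the paper's exact parameterisation may differ) scores a segment with two Gaussian processes: an extra-segmental distribution over a segment target c, and an intra-segmental distribution of frames around that target. With diagonal covariances and the usual maximisation over c, the segment score has a closed form:

    import numpy as np

    def segmental_log_likelihood(frames, target_mean, extra_var, intra_var):
        """Log-likelihood of one segment under a static segmental HMM
        (diagonal Gaussians, max approximation over the segment target c).

        frames      : (T, D) observation vectors of the segment
        target_mean : (D,) mean of the extra-segmental (target) distribution
        extra_var   : (D,) variance of the target (long-term variability)
        intra_var   : (D,) variance of frames around the target (local variability)
        """
        frames = np.asarray(frames, dtype=float)
        T = frames.shape[0]
        # the target c maximising the joint density has a closed form
        c = ((target_mean / extra_var + frames.sum(axis=0) / intra_var)
             / (1.0 / extra_var + T / intra_var))

        def log_gauss(x, mean, var):
            return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

        ll = log_gauss(c, target_mean, extra_var)               # extra-segmental term
        ll += sum(log_gauss(f, c, intra_var) for f in frames)   # intra-segmental terms
        return ll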
Helmut Lucke, ATR/ITL (JAPAN)
This paper argues that many HMM modeling inaccuracies are a direct consequence of the fact that the HMM is a one-dimensional stochastic model applied to a two-dimensional process. We therefore argue that a two-dimensional stochastic process, known as a Markov Random Field (MRF), should perform better. We describe a training method for MRFs and analyze its convergence behavior.
F. Wolfertstetter, Munich University of Technology (GERMANY)
G. Ruske, Munich University of Technology (GERMANY)
This paper proposes a new model of the structure of speech units as a graph consisting of base functions and a transition network. A clustering algorithm that takes into account the actual temporal context of the feature vectors is used to generate the base functions, which are approximated by normal distributions. The subsequent maximum-likelihood training procedure establishes the transition network and adjusts the transition probabilities. The resulting graphs for the speech units are structures of branching and recombining trajectory segments that describe statistical dependencies in the feature vector sequence within the speech units as well as in the transition regions between them. A speaker-independent evaluation shows the superiority of the proposed modeling compared to mixture-state HMMs, even for an equal number of model parameters.
David Burshtein, Tel-Aviv University (ISRAEL)
We address the problem of explicit state and word duration modeling in hidden Markov models (HMMs). A major weakness of conventional HMMs is that they implicitly model state durations by a geometric distribution, which is usually inappropriate. Using explicit modeling of state and word durations, it is possible to significantly enhance the performance of speech recognition systems. The main outcome of this work is a modified Viterbi algorithm that, by incorporating both state and word duration modeling, reduces the string error rate of the conventional Viterbi algorithm by 29% and 43% for known and unknown string lengths, respectively, on a speaker-independent connected-digit string task. The uniqueness of the algorithm is that, unlike alternative approaches, it adds the duration metric at each frame transition (rather than at the end of a state, word, or sentence), thus enhancing performance.
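One way to see how a duration score can be charged at every frame transition rather than at segment ends (a general decomposition, not necessarily the paper's exact metric) is to factor the duration distribution into hazard rates: P(D = d) = h(d) * prod_{k<d} (1 - h(k)), with h(k) = P(D = k | D >= k). Each factor is then added, in the log domain, at the frame transition where the path either stays in or leaves the state.

    import numpy as np

    def duration_increments(duration_pmf):
        """Per-frame log-score increments for an explicit duration model.

        duration_pmf : (Dmax,) array with duration_pmf[d-1] = P(D = d)
        Returns (log_stay, log_leave): for a path that has spent k frames in a
        state, log_stay[k-1] is added if it stays another frame and
        log_leave[k-1] if it leaves.  The increments along a stay of d frames
        sum exactly to log P(D = d), but the cost is spread over every frame
        transition instead of being applied only when the state ends.
        """
        pmf = np.asarray(duration_pmf, dtype=float)
        survivor = np.concatenate(([1.0], 1.0 - np.cumsum(pmf)))  # survivor[k-1] = P(D >= k)
        hazard = pmf / np.maximum(survivor[:-1], 1e-300)          # h(k) = P(D = k | D >= k)
        log_leave = np.log(np.maximum(hazard, 1e-300))
        log_stay = np.log(np.maximum(1.0 - hazard, 1e-300))
        return log_stay, log_leave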
Roland Kuhn, CRIM (CANADA)
Ariane Lazarides, CRIM (CANADA)
Yves Normandin, CRIM (CANADA)
Julie Brousseau, CRIM (CANADA)
Bahl et al. (1991) employed decision trees to specify the acoustic realization of a phone in context. They used a computationally cheap Poisson-based criterion to evaluate a large number of questions about the context quickly. We extend this work in four ways: 1. We employ the Poisson criterion to find the M best questions at a node during tree expansion, then use an HMM-based MLE criterion to make the final choice from these. 2. Bahl et al. use stopping criteria to halt the growth of a tree; we apply the Gelfand et al. (1991) iterative expansion-pruning algorithm. 3. In addition to the "YES" and "NO" children of each question, we grow a "DON'T KNOW" subtree to be used if a question is currently unanswerable. 4. We experiment with questions based on phonetic features in the context, as well as questions about the presence of specific phones.
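The first extension amounts to a two-stage question selection, which can be sketched generically as follows (the scoring functions and the shortlist size m are placeholders, not the paper's actual criteria):

    def choose_question(questions, cheap_score, expensive_score, m=5):
        """Two-stage question selection for decision-tree growing.

        cheap_score     : fast criterion (e.g. a Poisson-based likelihood gain)
        expensive_score : accurate criterion (e.g. an HMM-based ML gain)
        The cheap criterion shortlists the m best candidates; the expensive
        criterion makes the final choice among those only.
        """
        shortlist = sorted(questions, key=cheap_score, reverse=True)[:m]
        return max(shortlist, key=expensive_score)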
Takao Watanabe, NEC Corporation (JAPAN)
Koichi Shinoda, NEC Corporation (JAPAN)
Keizaburo Takagi, NEC Corporation (JAPAN)
Ken-ichi Iso, NEC Corporation (JAPAN)
This paper proposes a new speech recognition method using tree-structured probability density functions (pdfs) to realize high-speed HMM-based speech recognition. To reduce the likelihood calculation for a pdf set composed of the Gaussian pdfs for all mixture components, all states, and all recognition units, the calculation is done only coarsely for element pdfs whose likelihood is unlikely to be large. The pdf set is expressed in a tree-structured form. In recognition, the likelihood set is calculated by searching the tree: the likelihood of the cluster pdf at each node is computed, and the nodes with the largest likelihoods are traversed from the root downward. Experimental results showed that the amount of computation was drastically reduced with little loss of recognition accuracy, in both speaker-independent and speaker-adaptive cases. The algorithm was applied to speech recognition software running on a personal computer without special hardware.
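A rough sketch of the tree-search idea follows; the node class, the single-Gaussian cluster pdfs, and the one-best ("beam") expansion rule are assumptions made here for illustration and may differ from the paper's actual pruning scheme.

    import numpy as np

    class PdfNode:
        """Node of a tree-structured pdf set: a cluster Gaussian plus children.
        Leaves correspond to the original element pdfs."""
        def __init__(self, mean, var, children=(), pdf_id=None):
            self.mean, self.var = np.asarray(mean, float), np.asarray(var, float)
            self.children, self.pdf_id = list(children), pdf_id

    def log_gauss(x, mean, var):
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

    def tree_likelihoods(root, x, beam=1):
        """Approximate log-likelihoods of all element pdfs for observation x.

        Only the `beam` children with the largest cluster likelihoods are
        expanded at each node; every leaf below an unexpanded child inherits
        the coarse likelihood of that cluster node.
        """
        scores = {}

        def assign_coarse(node, score):
            if not node.children:
                scores[node.pdf_id] = score
            for child in node.children:
                assign_coarse(child, score)

        def descend(node):
            if not node.children:
                scores[node.pdf_id] = log_gauss(x, node.mean, node.var)
                return
            ranked = sorted(((log_gauss(x, c.mean, c.var), c) for c in node.children),
                            key=lambda s: -s[0])
            for s, child in ranked[:beam]:
                descend(child)              # refine the most promising branches
            for s, child in ranked[beam:]:
                assign_coarse(child, s)     # coarse score for pruned branches

        descend(root)
        return scores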