USE OF KNOWLEDGE IN ASR

Chair: Jim Glass, Massachusetts Institute of Technology, LCS (USA)



Analysis of Acoustic-Phonetic Variations in Fluent Speech Using TIMIT

Authors:

Don X. Sun, State University of New York (USA)
Li Deng, University of Waterloo (CANADA)

Volume 1, Page 201

Abstract:

In this paper, we propose a hierarchically structured Analysis of Variance (ANOVA) method to quantify the contributions of various identifiable factors to the overall acoustic variability exhibited in the fluent speech data of TIMIT, processed in the form of Mel-Frequency Cepstral Coefficients. The analysis shows that the greatest acoustic variability in the TIMIT data is explained by differences among distinct phonetic labels, followed by differences in phonetic context given a fixed phonetic label. The variability among sequential sub-segments within each TIMIT-defined phonetic segment is found to be significantly greater than that attributable to the gender, dialect-region, and speaker factors.
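The core of an ANOVA-style decomposition is comparing between-group to total variance. The sketch below (my own illustration, not the authors' code; the synthetic "cepstral" data and phone labels are invented) shows how the share of acoustic variance explained by a single factor, such as the phone label, could be quantified:

```python
# One-factor ANOVA-style variance decomposition (illustrative sketch).
import numpy as np

def variance_explained(values, labels):
    """Fraction of the total sum of squares explained by the grouping factor."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    ss_between = sum(
        len(g) * (g.mean() - grand_mean) ** 2
        for lab in np.unique(labels)
        for g in [values[labels == lab]]
    )
    return ss_between / ss_total

# Synthetic 1-D "cepstral coefficient" with a strong phone-label effect:
rng = np.random.default_rng(0)
labels = ["aa"] * 50 + ["iy"] * 50
values = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(5.0, 1.0, 50)])
eta2 = variance_explained(values, labels)  # near 1 when the factor dominates
```

A hierarchical version would apply the same decomposition recursively: first by phone label, then by context within each label, then by sub-segment, and so on.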

300dpi TIFF Images of pages:

201 202 203 204

Acrobat PDF file of whole paper:

ic950201.pdf




Analyzing Weaknesses of Language Models for Speech Recognition

Authors:

Joerg P. Ueberla, DRA Malvern (UK)

Volume 1, Page 205

Abstract:

In this paper, we propose to analyse the weaknesses of language models for speech recognition, in order to subsequently improve the models. First, a definition of a weakness of a language model that is applicable to almost all currently used models is given. This definition is then applied to a class-based bigram model. The results show that considerable insight into a model can be gained by analysing its weaknesses. Moreover, when the model was modified to avoid one of its weaknesses, the modelling of unknown words, its performance improved significantly.
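For readers unfamiliar with the class-based bigram analysed here: the word probability factors into a class-transition term and a word-within-class term. The toy classes and counts below are invented for illustration, not taken from the paper:

```python
# Class-based bigram: P(w | w') = P(C(w) | C(w')) * P(w | C(w)).
from collections import Counter

word_class = {"monday": "DAY", "tuesday": "DAY", "flies": "VERB", "runs": "VERB"}
corpus = ["monday", "flies", "tuesday", "runs", "monday", "runs"]

class_seq = [word_class[w] for w in corpus]
class_bigrams = Counter(zip(class_seq, class_seq[1:]))
class_counts = Counter(class_seq)
word_counts = Counter(corpus)

def p_class_bigram(prev_word, word):
    """Maximum-likelihood estimate of the class-based bigram probability."""
    c_prev, c = word_class[prev_word], word_class[word]
    p_trans = class_bigrams[(c_prev, c)] / sum(
        n for (a, _), n in class_bigrams.items() if a == c_prev)
    p_emit = word_counts[word] / class_counts[c]
    return p_trans * p_emit
```

Analysing "weaknesses" then amounts to finding contexts where such a factored estimate systematically misassigns probability, e.g. to unknown words.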

300dpi TIFF Images of pages:

205 206 207 208

Acrobat PDF file of whole paper:

ic950205.pdf




A Hidden Markov Model with Optimized Inter-Frame Dependence

Authors:

F. J. Smith, The Queen's University (N. IRELAND)
J. Ming, The Queen's University (N. IRELAND)
P. O'Boyle, The Queen's University (N. IRELAND)
A.D. Irvine, The Queen's University (N. IRELAND)

Volume 1, Page 209

Abstract:

An optimized hidden Markov model (HMM) with two kinds of inter-frame dependent observation structures, both built on the observation densities of a first-order dependent form, is presented to account for the statistical dependence between successive frames. In the first model, the dependence relation among the frames is determined optimally by maximizing the likelihood of the observations in both training and testing. In the second model, the dependence structure associated with each frame is described by a weighted sum of the conditional densities of the frame given individual previous frames. The segmental K-means and the forward-backward algorithms are implemented, respectively, for the estimation of the parameters of the two models. Experimental comparisons for an isolated word recognition task show that these models achieve better performance than both the standard continuous HMM and the bigram-constrained HMM.
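The second model's observation density can be sketched as follows (my own minimal reconstruction, assuming scalar observations and Gaussian conditionals; the weights, regression coefficient, and variance are invented):

```python
# Weighted sum of conditional densities of the current frame given
# individual previous frames (illustrative sketch of the second model).
import math

def gauss(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def interframe_density(frames, t, weights, a, var):
    """p(o_t) = sum_l w_l * N(o_t; a * o_{t-l}, var),  l = 1..L."""
    return sum(
        w * gauss(frames[t], a * frames[t - l], var)
        for l, w in enumerate(weights, start=1)
    )

frames = [0.0, 0.1, 0.2, 0.3]
p = interframe_density(frames, 3, weights=[0.7, 0.3], a=1.0, var=0.04)
```

In the paper, the weights and the conditional densities are estimated per state, with the segmental K-means and forward-backward algorithms respectively.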

300dpi TIFF Images of pages:

209 210 211 212

Acrobat PDF file of whole paper:

ic950209.pdf




On the Use of Scalar Quantization for Fast HMM Computation

Authors:

Shigeki Sagayama, NTT Human Interface Laboratories (JAPAN)
Satoshi Takahashi, NTT Human Interface Laboratories (JAPAN)

Volume 1, Page 213

Abstract:

This paper describes an algorithm for reducing the amount of arithmetic in the likelihood computation of continuous-mixture HMMs (CMHMMs) with diagonal covariance matrices while retaining high performance. The key points are scalar quantization of the input observation vector components and table look-up. Together these make multiplication, squaring, and division operations entirely unnecessary in the whole HMM computation (i.e. output probability calculation and trellis/Viterbi computation). It is experimentally shown in a large-vocabulary isolated word recognition task that scalar quantization into no fewer than 16 levels does not cause significant degradation in speech recognition performance. Scalar quantization is also used to truncate the computation for unlikely distributions; the total number of distribution likelihood computations can be reduced by 66% with only a slight performance degradation. This "multiplication-free" HMM algorithm has high potential for speech recognition applications on personal computers.
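The table-lookup idea can be sketched as follows (a toy reconstruction of the technique, not NTT's implementation; the quantization range and Gaussian parameters are invented): each observation component is quantized to one of 16 levels, and the per-dimension Gaussian log-likelihoods are precomputed at the level centers, so decode-time evaluation needs only lookups and additions.

```python
# Scalar quantization + table look-up for a diagonal-covariance Gaussian.
import math

LEVELS = 16

def build_tables(means, variances, lo, hi):
    """tables[d][q] = log N(x_q; mean_d, var_d) at quantizer level centers."""
    step = (hi - lo) / LEVELS
    centers = [lo + (q + 0.5) * step for q in range(LEVELS)]
    tables = []
    for m, v in zip(means, variances):
        const = -0.5 * math.log(2 * math.pi * v)
        tables.append([const - (x - m) ** 2 / (2 * v) for x in centers])
    return tables, lo, step

def quantize(x, lo, step):
    q = int((x - lo) / step)
    return min(max(q, 0), LEVELS - 1)

def loglik(obs, tables, lo, step):
    # Only table lookups and additions at runtime: no multiply or divide.
    return sum(t[quantize(x, lo, step)] for t, x in zip(tables, obs))

tables, lo0, step0 = build_tables([0.0], [1.0], -4.0, 4.0)
```

The truncation mentioned in the abstract would additionally skip distributions whose looked-up partial scores fall below a threshold.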

300dpi TIFF Images of pages:

213 214 215 216

Acrobat PDF file of whole paper:

ic950213.pdf




Large-vocabulary Speech Recognition in Specialized Domains

Authors:

Haakon Chevalier, Dragon Systems Inc. (USA)
Chuck Ingold, Dragon Systems Inc. (USA)
Carol Kunz, Dragon Systems Inc. (USA)
Chip Moore, Dragon Systems Inc. (USA)
Crispen Roven, Dragon Systems Inc. (USA)
Jon Yamron, Dragon Systems Inc. (USA)
Bradley Baker, Dragon Systems Inc. (USA)
Paul Bamberg, Dragon Systems Inc. (USA)
Sarah Bridle, Dragon Systems Inc. (USA)
Tracy Bruce, Dragon Systems Inc. (USA)
Amy Weader, Dragon Systems Inc. (USA)

Volume 1, Page 217

Abstract:

In this summary, we report on research into the discrete-word speech recognition performance of several specialized language models optimized for four large domains of professional discourse. We describe the construction of these models and report perplexity and recognition results for each of the specialized domains. The data indicate that such specialization may significantly improve performance, both before and after adaptation.
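For reference, the perplexity figure reported per domain is the exponential of the average negative log-probability the model assigns to each word of a test text; a sketch:

```python
# Perplexity of a language model over a test text (illustrative).
import math

def perplexity(probs):
    """probs: the model's probability for each successive word of the text."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))
```

Lower perplexity on in-domain text is the signal that a specialized model fits its domain better than a general one.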

300dpi TIFF Images of pages:

217 218 219 220

Acrobat PDF file of whole paper:

ic950217.pdf




Understanding and Improving Speech Recognition Performance through the Use of Diagnostic Tools

Authors:

Ellen Eide, BBN Systems and Technologies (USA)
Herbert Gish, BBN Systems and Technologies (USA)
Philippe Jeanrenaud, BBN Systems and Technologies (USA)
Angela Mielke, BBN Systems and Technologies (USA)

Volume 1, Page 221

Abstract:

The goal of this work is to highlight aspects of an experiment other than the word error rate. When a speech recognition experiment is performed, the word error rate provides no insight into the factors responsible for the recognition errors. We begin this paper by describing an experiment which contrasts the language of conversational speech with that of text from the Wall Street Journal. The remainder of the paper is devoted to the description of a more general approach to performance diagnosis which identifies significant sources of error in a given experiment. The technique is based on the use of binary classification trees; we refer to the results of our analyses as diagnostic trees. Beyond providing understanding, diagnostic trees allow for improvements in the performance of a recognizer through the use of feedback provided by quantifying confidence in the recognition.
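A minimal sketch of the diagnostic-tree idea (my own toy version, not BBN's tool; the word attributes and labels are invented): given binary attributes of each recognized word and a correct/error label, a one-level classification tree picks the single split that best separates errors from correct words, exposing a dominant error source.

```python
# One-level binary classification tree over word-level attributes.
def best_split(examples, labels):
    """examples: list of dicts of binary features; labels: 1 = error, 0 = correct.
    Returns (feature, error_rate_if_true, error_rate_if_false)."""
    best = None
    for f in examples[0].keys():
        true_lab = [l for e, l in zip(examples, labels) if e[f]]
        false_lab = [l for e, l in zip(examples, labels) if not e[f]]
        if not true_lab or not false_lab:
            continue
        # Impurity: misclassifications if each leaf predicts its majority label.
        mis = sum(min(sum(g), len(g) - sum(g)) for g in (true_lab, false_lab))
        if best is None or mis < best[0]:
            best = (mis, f,
                    sum(true_lab) / len(true_lab),
                    sum(false_lab) / len(false_lab))
    _, f, r_true, r_false = best
    return f, r_true, r_false

examples = [
    {"short_word": 1, "function_word": 1},
    {"short_word": 1, "function_word": 0},
    {"short_word": 0, "function_word": 1},
    {"short_word": 0, "function_word": 0},
]
labels = [1, 1, 0, 0]  # in this toy data, errors concentrate on short words
```

A full diagnostic tree applies this split recursively; the leaf error rates also serve as the confidence estimates mentioned in the abstract.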

300dpi TIFF Images of pages:

221 222 223 224

Acrobat PDF file of whole paper:

ic950221.pdf




Phrase Bigrams for Continuous Speech Recognition

Authors:

Egidio P. Giachin, CSELT (ITALY)

Volume 1, Page 225

Abstract:

In some speech recognition tasks, such as man-machine dialogue systems, the spoken sentences include several recurrent phrases. A bigram language model does not adequately represent these phrases because it underestimates their probability. A better approach consists of modeling phrases as if they were individual dictionary elements. They are inserted as additional entries into the word lexicon, over which bigrams are then computed. This paper discusses two procedures for automatically determining frequent phrases in an unlabeled training set of written sentences. One procedure is optimal in the sense that it minimizes the training-set perplexity. The other, based on information-theoretic criteria, ensures that the resulting model has high statistical robustness. The two procedures are tested on a 762-word spontaneous speech recognition task. They give similar results and provide a moderate improvement over standard bigrams.
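The phrase-lexicon idea can be sketched with a simple frequency criterion (my own illustration; the paper's two procedures are perplexity-minimizing and information-theoretic, not this threshold rule): frequent adjacent word pairs are merged into single lexicon entries, and bigrams are then computed over the extended lexicon.

```python
# Merge frequent adjacent word pairs into single phrase tokens.
from collections import Counter

def merge_frequent_pairs(sentences, min_count=2):
    pair_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    phrases = {p for p, n in pair_counts.items() if n >= min_count}
    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) in phrases:
                out.append(s[i] + "_" + s[i + 1])  # one new lexicon entry
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged

sents = [["i", "would", "like", "a", "flight"],
         ["i", "would", "prefer", "a", "flight"]]
```

Standard bigram estimation then runs unchanged on the merged token sequences.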

300dpi TIFF Images of pages:

225 226 227 228

Acrobat PDF file of whole paper:

ic950225.pdf




Using Explicit Segmentation to Improve HMM Phone Recognition

Authors:

Carl D. Mitchell, Purdue University (USA)
Mary P. Harper, Purdue University (USA)
Leah H. Jamieson, Purdue University (USA)

Volume 1, Page 229

Abstract:

We show that many of the errors in a context-dependent phone recognition system are due to poor segmentation. We then suggest a method to incorporate explicit segmentation information directly into the HMM paradigm. The utility of explicit segmentation information is illustrated with experiments involving five types of segmentation information and three methods of smoothing.

300dpi TIFF Images of pages:

229 230 231 232

Acrobat PDF file of whole paper:

ic950229.pdf




Viterbi Algorithm for Acoustic Vectors Generated by a Linear Stochastic Differential Equation on Each State

Authors:

Marco Saerens, Universite Libre de Bruxelles (BELGIUM)

Volume 1, Page 233

Abstract:

When using hidden Markov models for speech recognition, it is usually assumed that the probability that a particular acoustic vector is emitted at a given time depends only on the current state and the current acoustic vector observed. In this paper, we introduce another idea: we assume that, in a given state, the acoustic vectors are generated by a linear stochastic differential equation. This extends our previous model, in which we assumed that the acoustic vectors are generated by a continuous Markov process. This work is motivated by the fact that the time evolution of the acoustic vector is inherently dynamic and continuous, so that the modelling can be performed in the continuous-time domain instead of the discrete-time domain. Moreover, the link between the discrete-time model obtained after sampling and the original continuous-time signal is not trivial. In particular, the relationship between the coefficients of a continuous-time linear process and the coefficients of the discrete-time linear process obtained after sampling is nonlinear. We assign a probability density to the continuous-time trajectory of the acoustic vector inside the state, reflecting the probability that this particular path has been generated by the stochastic differential equation associated with this state. This allows us to compute the likelihood of the uttered word. Reestimation formulae for the parameters of the process, based on the maximization of the likelihood, can be derived for the Viterbi algorithm. As usual, the segmentation can be obtained by sampling the continuous process and applying dynamic programming to find the best path over all possible sequences of states.
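The nonlinear relationship between continuous- and discrete-time coefficients can be made concrete in the scalar case (my own illustration, not the paper's notation):

```latex
% Continuous-time state model within a state (scalar, illustrative):
dx(t) = a\,x(t)\,dt + \sigma\,dW(t)
% Sampling at interval \Delta yields a discrete-time AR(1) process:
x_{k+1} = \phi\,x_k + \varepsilon_k, \qquad
\phi = e^{a\Delta}, \qquad
\operatorname{Var}(\varepsilon_k) = \frac{\sigma^2}{2a}\bigl(e^{2a\Delta}-1\bigr)
% The discrete coefficient \phi depends exponentially (nonlinearly) on the
% continuous drift a, which is the nontrivial link noted in the abstract.
```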

300dpi TIFF Images of pages:

233 234 235 236

Acrobat PDF file of whole paper:

ic950233.pdf




Non-Deterministic Stochastic Language Models for Speech Recognition

Authors:

G. Riccardi, AT&T Bell Laboratories (USA)
E. Bocchieri, AT&T Bell Laboratories (USA)
R. Pieraccini, AT&T Bell Laboratories (USA)

Volume 1, Page 237

Abstract:

Traditional stochastic language models for speech recognition (i.e. $n$-grams) are deterministic, in the sense that there is one and only one derivation for each given sentence. Moreover, a fixed temporal window is always assumed in the estimation of traditional stochastic language models. This paper shows how non-determinism is introduced to effectively approximate a back-off n-gram language model through a finite-state network formalism. It also shows that a new, flexible, and powerful network formalization can be obtained by relaxing the assumption of a fixed history size. As a result, a class of automata for language modeling (Variable N-gram Stochastic Automata, VNSAs) is obtained, for which we propose methods for the estimation of the transition probabilities. VNSAs have been used in a spontaneous speech recognizer for the ATIS task, and accuracy on a standard test set is presented in this paper.
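For context, the back-off bigram that the automaton approximates behaves as follows (a hedged sketch with invented toy probabilities; in the automaton, the fallback branch corresponds to an epsilon/back-off transition, which is the source of the non-determinism):

```python
# Back-off bigram: use the bigram estimate if seen, else a scaled unigram.
def backoff_bigram(prev, word, bigram_p, unigram_p, alpha):
    """alpha[prev] is the back-off weight for history `prev`."""
    if (prev, word) in bigram_p:
        return bigram_p[(prev, word)]
    return alpha[prev] * unigram_p[word]

# Toy parameters, invented for illustration:
bigram_p = {("show", "me"): 0.5}
unigram_p = {"me": 0.2, "flights": 0.1}
alpha = {"show": 0.6}
```

A deterministic automaton must decide up front which branch encodes each word pair; admitting both the bigram arc and the back-off arc, and summing or maximizing over derivations, is what makes the VNSA representation compact.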

300dpi TIFF Images of pages:

237 238 239 240

Acrobat PDF file of whole paper:

ic950237.pdf
