RECOGNITION: FEATURE ANALYSIS

Chair: Shigeki Sagayama, NTT (JAPAN)


Statistical Modeling of Speech Feature Vector Trajectories Based on a Piecewise Continuous Mean Path

Authors:

Mark M. Thomson, University of Auckland (NEW ZEALAND)

Volume 1, Page 361

Abstract:

This paper presents a new statistical model of the trajectories of speech feature vectors. In this model each vector is assumed to correspond to a point on a mean path that consists of a number of concatenated straight line segments. The model characterizes both the deviation of the trajectory from the mean path and the deviation from the mean rate at which the vectors move through the vector space in a way that avoids the conditional independence assumption implicit in hidden Markov modeling. The model is formulated using a state space approach in which the state vector consists of only two elements. These represent the position on the mean path corresponding to the present observation vector and the rate at which points on the mean path are moving through the vector space. A method for estimating the parameters of the model using the Expectation Maximization algorithm is presented.
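
The two-element state described in the abstract (a position on the mean path and a rate of movement along it) can be illustrated with a minimal sketch. This is not the paper's formulation: the names `mean_path_point` and `trace_mean_path`, the convention that each segment spans one unit of position, and the constant rate are all assumptions made for illustration.

```python
def mean_path_point(vertices, position):
    """Return the point on a piecewise-linear path at a given position.

    Assumption (not from the paper): segment i covers positions [i, i + 1],
    so `position` runs from 0 to len(vertices) - 1.
    """
    last = len(vertices) - 1
    if last < 1:
        raise ValueError("need at least two vertices")
    # Clamp so positions outside the path map to its end points.
    position = min(max(float(position), 0.0), float(last))
    seg = min(int(position), last - 1)   # index of the active segment
    t = position - seg                   # fraction travelled along that segment
    a, b = vertices[seg], vertices[seg + 1]
    return [x + t * (y - x) for x, y in zip(a, b)]


def trace_mean_path(vertices, start, rate, steps):
    """Advance the position by a constant rate and collect mean-path points.

    A constant rate is a simplification; the abstract describes a model in
    which the rate itself is part of the state and may deviate from its mean.
    """
    points = []
    position = start
    for _ in range(steps):
        points.append(mean_path_point(vertices, position))
        position += rate
    return points


if __name__ == "__main__":
    path = [[0.0, 0.0], [2.0, 0.0], [2.0, 4.0]]
    for point in trace_mean_path(path, start=0.0, rate=0.5, steps=5):
        print(point)
```

In a full model the observed feature vector would be the mean-path point plus a deviation term; that part is omitted here.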

Acrobat PDF file of whole paper:

ic950361.pdf




Trace-Segmentation of Isolated Utterances for Speech Recognition

Authors:

Euvaldo F. Cabral Jr., University of Sao Paulo (BRAZIL)
Graham D. Tattersall, University of East Anglia (UK)

Volume 1, Page 365

Abstract:

Trace-segmentation (TS) is a method for non-linear time-normalization of a sequence of speech representation frames prior to recognition of the sequence. Numerous attempts to perform speech recognition using trace-segmentation have been made in the past, but these attempts have failed to match the performance of DTW or HMM recognition. This failure may be due to the use of inappropriate distance metrics to perform the segmentation or the use of an inappropriate spatial sampling interval along the trace. This paper describes an investigation into these problems, in which the appropriate Nyquist sample rate of the spatial trace is determined by analyzing the frequency of the temporal variation of the speech frames. It is also shown that separate segmentation of the trajectory described by each individual coefficient in the speech frame leads to much improved recognition, which exceeds the performance provided by DTW recognition of the same database.
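
As a rough illustration of the basic idea, the sketch below resamples a frame sequence at equal steps of distance along its trace, so that stationary stretches contribute few output points. It assumes a Euclidean distance and linear interpolation between frames; the function name `trace_segment` is hypothetical, and the distance metric, sampling interval, and per-coefficient segmentation investigated in the paper are not reproduced here.

```python
import math


def trace_segment(frames, num_points):
    """Resample `frames` at `num_points` equally spaced positions along the trace.

    Assumptions (not from the paper): Euclidean distance between consecutive
    frames and linear interpolation between them.
    """
    if len(frames) < 2 or num_points < 2:
        raise ValueError("need at least two frames and two output points")

    # Cumulative distance travelled along the trace up to each frame.
    cum = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        cum.append(cum[-1] + math.dist(prev, cur))
    total = cum[-1]
    if total == 0.0:
        # The trace never moves: every output point is the same frame.
        return [list(frames[0]) for _ in range(num_points)]

    resampled = []
    seg = 0
    for k in range(num_points):
        target = total * k / (num_points - 1)
        # Move to the segment of the trace that contains the target distance.
        while seg < len(frames) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0.0 else (target - cum[seg]) / span
        a, b = frames[seg], frames[seg + 1]
        resampled.append([x + t * (y - x) for x, y in zip(a, b)])
    return resampled


if __name__ == "__main__":
    # The repeated frame (a stationary stretch) adds no length to the trace.
    frames = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
    for point in trace_segment(frames, 5):
        print(point)
```

Every utterance is mapped to the same number of points regardless of its duration, which is what makes the result usable as a fixed-length pattern.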

Acrobat PDF file of whole paper:

ic950365.pdf




Optimal Linear Feature Transformations for Semi-Continuous Hidden Markov Models

Authors:

E. Gunter Schukat-Talamazzini, Universitat Erlangen-Nurnberg (GERMANY)
Joachim Hornegger, Universitat Erlangen-Nurnberg (GERMANY)
Heinrich Niemann, Universitat Erlangen-Nurnberg (GERMANY)

Volume 1, Page 369

Abstract:

Linear discriminant or Karhunen-Loeve transforms are established techniques for mapping features into a lower dimensional subspace. This paper introduces a uniform statistical framework, where the computation of the optimal feature reduction is formalized as a Maximum-Likelihood estimation problem. Reestimation techniques for the parameters of the desired rotation matrix are obtained by a suitable extension of the Baum-Welch algorithm. The rotation-dependent objective function together with the respective partial derivatives are deduced from the Kullback-Leibler statistic of the model. Finally, a preliminary experimental evaluation is presented which shows a slight improvement of phone recognition accuracy when the suggested linear selection method is applied.

Acrobat PDF file of whole paper:

ic950369.pdf




Use of Generalized Dynamic Feature Parameters for Speech Recognition: Maximum Likelihood and Minimum Classification Error Approaches

Authors:

C. Rathinavelu, University of Waterloo (CANADA)
L. Deng, University of Waterloo (CANADA)

Volume 1, Page 373

Abstract:

In this study we implemented a speech recognizer based on the integrated view, first proposed in [Deng94], of the speech pre-processing and speech modeling problems in recognizer design. The integrated model we developed generalizes the conventional, currently widely used delta-parameter technique, which has been confined strictly to the pre-processing domain, in two significant ways. First, the new model contains state-dependent weighting functions responsible for transforming static speech features into dynamic ones in a slowly time-varying manner. Second, novel maximum-likelihood and minimum-classification-error based learning algorithms are developed for the model that allow joint optimization of the state-dependent weighting functions and the remaining conventional HMM parameters. The experimental results obtained from a standard TIMIT phonetic classification task provide preliminary evidence for the effectiveness of our new, general approaches to the use of the dynamic characteristics of speech spectra.
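
For context, the conventional delta-parameter technique that this paper generalizes can be sketched as a fixed weighted difference of neighbouring static frames. The sketch assumes the commonly used regression form of the delta formula and edge handling by repeating the first and last frames; the function name `delta_features` and the `window` parameter are illustrative. The state-dependent, trainable weights proposed in the paper are not shown.

```python
def delta_features(frames, window=2):
    """Compute conventional delta features with fixed regression weights.

    Assumed formula (a common convention, not taken from the paper):
        d[t] = sum_{k=1..K} k * (c[t+k] - c[t-k]) / (2 * sum_{k=1..K} k^2)
    Frames beyond either end are replaced by the first or last frame.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]   # clamp at the last frame
                behind = frames[max(t - k, 0)][d]      # clamp at the first frame
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


if __name__ == "__main__":
    # A feature that rises by one unit per frame has a slope of 1 mid-sequence.
    ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    print(delta_features(ramp, window=2))
```

Here the weights `k / denom` are the same for every frame and every model state, which is the restriction the abstract describes as being confined to the pre-processing domain.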

Acrobat PDF file of whole paper:

ic950373.pdf




A Statistical Pattern Recognition Approach to Robust Recursive Identification of Non-stationary AR Model of Speech Production System

Authors:

Milan Z. Markovic, Institute of Applied Math and Electronics
Branko D. Kovacevic, University of Belgrade
Milan M. Milosavljevic, Institute of Applied Math and Electronics (YUGOSLAVIA)

Volume 1, Page 377

Abstract:

In this work, we propose a robust recursive procedure, based on the WRLS algorithm with VFF and a frame-based quadratic classifier, for identification of a non-stationary AR model of the speech production system. Two versions of the frame-based quadratic classifier design procedure are also considered: the iterative quadratic classification procedure (CIQC) and its real-time modification (RTQC). A comparative experimental analysis is carried out on speech signals with voiced and mixed excitation segments. The experimental results indicate that two main problems of LPC speech analysis, the non-stationarity of the LPC parameters and the inappropriateness of AR modeling of speech (particularly on voiced frames), can be solved by applying the proposed robust procedure. In comparing the CIQC and RTQC algorithms, superior results were obtained using the proposed method with the RTQC algorithm, which is therefore recommended for non-stationary AR speech model identification.

Acrobat PDF file of whole paper:

ic950377.pdf




The NP Speech Activity Detection Algorithm

Authors:

Joseph Pencak, Department of Defense (USA)
Douglas Nelson, Department of Defense (USA)

Volume 1, Page 381

Abstract:

This paper describes a new algorithm, the NP algorithm, for detecting speech signals of varying quality and gain level. NP operates in the frequency domain and renders speech/no-speech decisions based on a signal-to-noise ratio (SNR) derived from a sorted power spectrum. In addition to the SNR estimates, a spectral whitening process and an estimate of the variance in the ratio of the signal power to the total energy are also used to identify and reject signals that are stationary or nearly stationary. The key features of this algorithm are: 1. Detection is based on a single FFT. 2. Decisions are independent of signal gain. 3. The process has a 3 dB/octave processing gain from the transform. 4. Frequency domain processing permits exploitation of the signal structure.
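
One way an SNR could be derived from a sorted power spectrum is sketched below. The abstract does not give the actual rule, so the estimator here is an assumption: the weakest bins are treated as noise and the strongest as signal, with the split set by a hypothetical `noise_fraction` parameter. The spectral whitening and variance tests are omitted.

```python
import math


def sorted_spectrum_snr_db(power_spectrum, noise_fraction=0.5):
    """Estimate an SNR in dB from the sorted bins of one power spectrum.

    Assumption (not from the paper): the lowest `noise_fraction` of the
    sorted bins estimates the noise power, the remaining bins the signal.
    """
    if len(power_spectrum) < 2:
        raise ValueError("need at least two spectral bins")
    if not 0.0 < noise_fraction < 1.0:
        raise ValueError("noise_fraction must lie strictly between 0 and 1")

    bins = sorted(power_spectrum)
    # Keep at least one bin on each side of the split.
    split = max(1, min(len(bins) - 1, int(len(bins) * noise_fraction)))
    noise = sum(bins[:split]) / split
    signal = sum(bins[split:]) / (len(bins) - split)
    if noise <= 0.0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


if __name__ == "__main__":
    spectrum = [1.0, 100.0, 1.0, 100.0, 1.0, 100.0, 1.0, 100.0]
    print(sorted_spectrum_snr_db(spectrum))
    # Scaling every bin by the same gain leaves the ratio unchanged.
    print(sorted_spectrum_snr_db([7.0 * p for p in spectrum]))
```

Because the estimate is a ratio of bins taken from the same spectrum, it does not depend on the overall signal gain, consistent with the second key feature listed in the abstract.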

Acrobat PDF file of whole paper:

ic950381.pdf




Improved Speech Modeling and Recognition Using Multi-dimensional Articulatory States as Primitive Speech Units

Authors:

L. Deng, University of Waterloo (CANADA)
J. Wu, University of Waterloo (CANADA)
H. Sameti, University of Waterloo (CANADA)

Volume 1, Page 385

Abstract:

In this paper we provide a formal description of a speech recognizer designed on the basis of elaborate articulatory timing that is asynchronous across the multiple articulatory-feature dimensions. Three recently improved critical components of the recognizer are described in detail. Evaluation results, obtained from a standard TIMIT phonetic recognition task confined within the N-best rescoring scenario, are reported on comparative performances between the new feature-based recognizer and a recognizer using the conventional context-dependent triphone units. The results demonstrate an overall superior quality of the rescored N-best list from the feature-based recognizer over that from the triphone-based recognizer. Greater performance improvements are observed as the top number of candidate sentences increases.

Acrobat PDF file of whole paper:

ic950385.pdf




Speech Analysis Based on Malvar Wavelet Transform

Authors:

Christophe Ris, Faculte Polytechnique de Mons (BELGIUM)
Vincent Fontaine, Faculte Polytechnique de Mons (BELGIUM)
Henri Leich, Faculte Polytechnique de Mons (BELGIUM)

Volume 1, Page 389

Abstract:

This paper presents a new pre-processing method developed with the objective of representing the relevant information of a signal with a minimum number of parameters. The originality of this work lies in a new, efficient pre-processing algorithm that produces acoustical vectors at a variable frame rate. The length of the speech frames is no longer fixed a priori to a constant value but results from a study of the signal stationarity. Both segmentation and signal analysis are based on Malvar wavelets, since the orthogonality properties of this transform are the key to the problem of comparing measurements made on frames of different lengths.

Acrobat PDF file of whole paper:

ic950389.pdf




Magnitude Spectral Estimation via Poisson Moments with Application to Speech Recognition

Authors:

Samel Celebi, University of Florida (USA)
Jose C. Principe, University of Florida (USA)

Volume 1, Page 393

Abstract:

We propose to use the Gamma filter as a continuous time spectral feature extractor for the preprocessing of speech signals. The Gamma filter is a simple analog structure which can be implemented as a cascade of identical first order lowpass filters. The filter generates the Poisson moments of the input signal at its taps. These moments carry spectral information about the recent history of the input signal and in turn can be used to construct a time-frequency representation as an alternative to the conventional methods of short-term Fourier transform, cepstrum, etc. The appeal of the proposed method comes from the fact that in the analog domain the Poisson moments are readily available as a continuous time electrical signal and can be physically measured, rather than computed offline by a digital computer. With this convenience, the speed of the discrete time processor following the preprocessor is independent of the highest frequency of the input signal, but is constrained by the stationary interval of the signal. The moments can be directly fed into artificial neural networks (ANNs) for tasks like classification and identification of time-varying signals like speech.
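
The cascade structure can be illustrated with a discrete-time simulation. This is only a sketch: the paper describes an analog, continuous-time filter, whereas the code below assumes a discrete-time approximation in which each tap is an identical one-pole lowpass stage fed by the previous tap. The function name `gamma_filter` and the parameter `mu` are illustrative.

```python
def gamma_filter(signal, num_taps, mu):
    """Simulate a cascade of identical first-order lowpass stages.

    Assumed discrete-time recursion (an approximation of the analog filter):
        x_0[n] = input[n]
        x_k[n] = (1 - mu) * x_k[n - 1] + mu * x_{k-1}[n - 1],  k = 1..num_taps
    Returns, for each time step, the list of tap outputs x_1 .. x_num_taps.
    """
    if num_taps < 1:
        raise ValueError("need at least one tap")
    if not 0.0 < mu <= 1.0:
        raise ValueError("mu must lie in (0, 1]")

    state = [0.0] * (num_taps + 1)   # tap values at the previous time step
    history = []
    for sample in signal:
        new_state = [float(sample)]
        for k in range(1, num_taps + 1):
            # Each stage leaks toward the previous stage's earlier output.
            new_state.append((1.0 - mu) * state[k] + mu * state[k - 1])
        state = new_state
        history.append(state[1:])
    return history


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
    for taps in gamma_filter(impulse, num_taps=2, mu=0.5):
        print(taps)
```

Deeper taps respond later and more smoothly to the same input, which is how the tap outputs come to summarize progressively older portions of the signal history.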

Acrobat PDF file of whole paper:

ic950393.pdf




Stochastic Perceptual Models of Speech

Authors:

Nelson Morgan, University of California at Berkeley (USA)
Herve Bourlard, Faculte Polytechnique (BELGIUM)
Steven Greenberg, University of California at Berkeley
Hynek Hermansky, Oregon Graduate Institute
Su-Lin Wu, University of California at Berkeley (USA)

Volume 1, Page 397

Abstract:

We have recently developed a statistical model of speech that avoids a number of current constraining assumptions for statistical speech recognition systems, particularly the model of speech as a sequence of stationary segments consisting of uncorrelated acoustic vectors. We further wish to focus statistical modeling power on perceptually-dominant and information-rich portions of the speech signal, which may also be the parts of the speech signal with a better chance to withstand adverse acoustical conditions. We describe here some of the theory, along with some preliminary experiments. These experiments suggest that the regions of acoustic signal containing significant spectral change are critical to the recognition of continuous speech.

Acrobat PDF file of whole paper:

ic950397.pdf
