Abstract: Session SP-13


SP-13.1  

Investigations on Inter-Speaker Variability in the Feature Space
Reinhold Haeb-Umbach (Philips Research Laboratories)

We apply Fisher variate analysis to measure the effectiveness of speaker normalization techniques. A trace criterion, which measures the ratio of the variation due to different phonemes to the variation due to different speakers, serves as a first assessment of a feature set without the need for recognition experiments. Using this measure, together with recognition experiments, we demonstrate that cepstral mean normalization has a speaker normalization effect in addition to its well-known channel normalization effect. Similarly, vocal tract normalization (VTN) is shown to remove inter-speaker variability. For VTN we show that normalization on a per-sentence basis performs better than normalization on a per-speaker basis. Recognition results are given on the Wall Street Journal and Hub-4 databases.
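
The sketch below is a rough, self-contained illustration of the two ideas in this abstract: per-utterance cepstral mean normalization and a trace-style ratio of between-phoneme to between-speaker scatter. The function names, data shapes, and the exact scatter formulation are assumptions made for the example, not the paper's implementation.

```python
# Minimal sketch (not the paper's code): cepstral mean normalization and a
# trace-style phoneme-vs-speaker separability measure on made-up data.
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-utterance mean from a (frames x coeffs) cepstral matrix."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def trace_criterion(features, phoneme_labels, speaker_labels):
    """Ratio of between-phoneme scatter to between-speaker scatter (trace form).

    Larger values suggest the features vary more across phonemes than speakers.
    """
    overall_mean = features.mean(axis=0)

    def between_class_scatter(labels):
        s = np.zeros((features.shape[1], features.shape[1]))
        for lab in np.unique(labels):
            group = features[labels == lab]
            d = (group.mean(axis=0) - overall_mean)[:, None]
            s += len(group) * (d @ d.T)
        return s

    s_phoneme = between_class_scatter(np.asarray(phoneme_labels))
    s_speaker = between_class_scatter(np.asarray(speaker_labels))
    return np.trace(s_phoneme) / np.trace(s_speaker)

# Toy example: 200 frames of 12-dimensional features, 5 phonemes, 4 speakers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
phones = rng.integers(0, 5, 200)
speakers = rng.integers(0, 4, 200)
print(trace_criterion(cepstral_mean_normalization(X), phones, speakers))
```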


SP-13.2  

LSP Weighting Functions Based on Spectral Sensitivity and Mel-Frequency Warping for Speech Recognition in Digital Communication
Seung Ho Choi (Dept. of Electrical Eng., Korea Advanced Institute of Science and Technology, 373-1 Kusong-Dong, Yusong-Ku, Taejon 305-701, Korea), Hong Kook Kim (AT&T Labs Research, Rm. E148, 180 Park Avenue, Florham Park NJ 07932, USA), Hwang Soo Lee (Central Research Laboratory, SK Telecom, 58-4 Hwaam-Dong, Yusong-Gu, Taejon 305-348, Korea)

In digital communication networks, a speech recognition system extracts feature parameters after reconstructing the speech signal. In this paper, we consider an approach that incorporates speech coding parameters directly into the speech recognizer. Most speech coders employ line spectrum pairs (LSPs) to represent spectral parameters. We introduce weighted distance measures to improve the recognition performance of an LSP-based speech recognizer. Experiments on speaker-independent connected-digit recognition showed that the weighted distance measures provide better recognition accuracy than unweighted distance measures. Compared with a conventional method employing mel-frequency cepstral coefficients, the proposed method achieved higher recognition accuracy.
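
As a hypothetical illustration of a weighted LSP distance, the sketch below weights each LSP by the inverse of the gap to its nearest neighbour, a common spectral-sensitivity heuristic (closely spaced LSPs correspond to spectral peaks). The weighting function and vector shapes are assumptions for this example, not the weighting proposed in the paper.

```python
# Illustrative weighted Euclidean distance between two LSP vectors (radians,
# ascending in (0, pi)). The inverse-gap weighting is an assumed heuristic.
import numpy as np

def lsp_sensitivity_weights(lsp):
    """Weight each LSP by the inverse distance to its nearest neighbour."""
    padded = np.concatenate(([0.0], lsp, [np.pi]))
    left_gap = np.diff(padded)[:-1]
    right_gap = np.diff(padded)[1:]
    return 1.0 / np.maximum(np.minimum(left_gap, right_gap), 1e-6)

def weighted_lsp_distance(lsp_ref, lsp_test):
    w = lsp_sensitivity_weights(lsp_ref)
    return np.sqrt(np.sum(w * (lsp_ref - lsp_test) ** 2))

# Example: a 10th-order LSP vector and a slightly perturbed copy.
rng = np.random.default_rng(1)
lsp_a = np.sort(rng.uniform(0.05, np.pi - 0.05, 10))
lsp_b = np.sort(lsp_a + rng.normal(0, 0.01, 10))
print(weighted_lsp_distance(lsp_a, lsp_b))
```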


SP-13.3  

Two-Dimensional Multi-Resolution Analysis of Speech Signals and Its Application to Speech Recognition
Chun-ping Chan, Yiu-wing Wong, Tan Lee, Pak-chung Ching (Department of Electronic Engineering, The Chinese University of Hong Kong)

This paper describes a novel approach to using multi-resolution analysis (MRA) for automatic speech recognition. Two-dimensional MRA is applied to the short-time log spectrum of the speech signal to extract the slowly varying spectral envelope, which contains the most important articulatory and phonetic information. After passing through a standard cepstral analysis process, the MRA features are used for speech recognition in the same way as conventional short-time features such as MFCCs and PLPs. Preliminary experiments on both clean connected speech and noisy conversational telephone speech show that the use of MRA cepstra results in a significant reduction in insertion errors when compared with MFCCs.
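
The sketch below shows one way a two-dimensional wavelet decomposition can isolate the slowly varying part of a log-spectrogram: keep only the coarse 2-D approximation coefficients and reconstruct. The wavelet family, decomposition level, and toy input are assumptions made for this example (using PyWavelets), not the feature set evaluated in the paper.

```python
# Minimal 2-D multi-resolution envelope sketch using PyWavelets.
import numpy as np
import pywt

def smooth_envelope(log_spectrogram, wavelet="haar", level=2):
    """Keep only the coarse 2-D approximation sub-band and reconstruct."""
    coeffs = pywt.wavedec2(log_spectrogram, wavelet, level=level)
    # Zero all detail sub-bands; the approximation carries the slow envelope.
    coeffs = [coeffs[0]] + [tuple(np.zeros_like(d) for d in detail)
                            for detail in coeffs[1:]]
    rec = pywt.waverec2(coeffs, wavelet)
    return rec[:log_spectrogram.shape[0], :log_spectrogram.shape[1]]

# Toy "log spectrum": 100 frames x 64 frequency bins of random values.
envelope = smooth_envelope(np.random.randn(100, 64))
print(envelope.shape)  # (100, 64); ready for a cepstral analysis stage
```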


SP-13.4  

Hierarchical Subband Linear Predictive Cepstral (HSLPC) Features for HMM-Based Speech Recognition
Rathinavelu Chengalvarayan (Lucent Technologies Inc.)

In this paper, a new approach to linear prediction (LP) analysis is explored, in which the predictor is computed from mel-warped, subband-based autocorrelation functions obtained from the power spectrum. For the spectral representation, a set of multi-resolution cepstral features is proposed. The general idea is to divide the full frequency band into several subbands, perform the IDFT on the mel power spectrum of each subband, and then apply Durbin's algorithm and the standard conversion from LP to cepstral coefficients. This approach can be extended to several levels of different resolutions. Multi-resolution feature vectors, formed by concatenating the subband cepstral features into an extended feature vector, are shown to yield better performance than conventional mel-warped LPCCs over the full voice bandwidth for a connected-digit recognition task.
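
The per-subband core of this pipeline, autocorrelation to LP coefficients to LP-derived cepstra, can be sketched as below. scipy's solve_toeplitz (which performs the Levinson recursion) stands in for Durbin's algorithm, and the random "mel power spectrum" and chosen orders are assumptions for the example, not the paper's configuration.

```python
# Sketch of one subband's processing: power spectrum -> autocorrelation (IDFT)
# -> LP coefficients (Levinson recursion) -> LP-derived cepstra.
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_from_autocorrelation(r, order):
    """Solve the Yule-Walker equations; Levinson recursion stands in for Durbin."""
    return solve_toeplitz(r[:order], r[1:order + 1])

def lp_to_cepstrum(a, n_ceps):
    """Standard LP-to-cepstrum recursion (predictor sign convention)."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# Toy subband: a random power spectrum whose inverse FFT gives the autocorrelation.
power = np.abs(np.fft.rfft(np.random.randn(256))) ** 2
autocorr = np.fft.irfft(power)
a = lp_from_autocorrelation(autocorr, order=10)
print(lp_to_cepstrum(a, n_ceps=12))
```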


SP-13.5  

Towards a Robust/Fast Continuous Speech Recognition System Using a Voiced-Unvoiced Decision
Douglas O'Shaughnessy, Hesham Tolba (INRS-Telecommunications)

In this paper, we show that the concept of a voiced-unvoiced (V-U) classification of speech sounds is useful not only in speech analysis and speech enhancement, but also in recognition. That is, incorporating such a classification into a continuous speech recognition (CSR) system not only improves its performance in low-SNR environments, but also limits the time and memory needed to carry out recognition. The proposed V-U classification of speech sounds has two principal functions: (1) it allows the voiced and unvoiced parts of speech to be enhanced separately; (2) it limits the Viterbi search space, so that recognition can be carried out in real time without degrading the performance of the system. We show experimentally that such a system outperforms the baseline HTK system when a V-U decision is included in both the front end and the far end of the HTK-based recognizer.
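
A generic frame-level V-U decision of the kind this front end relies on can be sketched with short-time energy and zero-crossing rate, as below. The features, thresholds, and frame sizes are assumptions for illustration only, not the classifier used in the paper.

```python
# Simple voiced/unvoiced frame decision from energy and zero-crossing rate.
import numpy as np

def voiced_unvoiced(frames, energy_thresh=0.01, zcr_thresh=0.25):
    """Return True for frames classified as voiced.

    frames: array of shape (n_frames, frame_len).
    """
    energy = np.mean(frames ** 2, axis=1)
    zero_crossings = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # Voiced speech tends to show high energy and a low zero-crossing rate.
    return (energy > energy_thresh) & (zero_crossings < zcr_thresh)

# Example: one second of low-level noise at 8 kHz, cut into 25 ms frames.
signal = 0.05 * np.random.randn(8000)
frames = signal[:8000 - 8000 % 200].reshape(-1, 200)
print(int(voiced_unvoiced(frames).sum()), "frames classified as voiced")
```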


SP-13.6  

A C/V Segmentation Algorithm for Mandarin Speech Signal Based on Wavelet Transforms
Jhing-Fa Wang (Department of Electrical Engineering & Department of Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.), Shi-Huang Chen (Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.)

This paper proposes a new consonant/vowel (C/V) segmentation algorithm for Mandarin speech signals. Since the Mandarin phoneme structure is a combination of a consonant (which may be null) followed by a vowel, C/V segmentation is an important part of a Mandarin speech recognition system. Based on the wavelet transform, the proposed method searches directly for the C/V segmentation point using a product function and an energy profile. The product function is generated from appropriate wavelet and scaling coefficients of the input speech signal and indicates the C/V segmentation point. With this product function and the additional verification provided by the energy profile, the C/V segmentation point can be located accurately with low computational complexity. Experiments demonstrate the superior performance of the proposed algorithm, with an overall accuracy rate of 97.2%. The algorithm is thus well suited to Mandarin speech recognition tasks.
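
As a loose illustration of a product function built from wavelet and scaling coefficients, the sketch below multiplies same-length approximation and detail coefficients from a stationary wavelet transform and pairs the result with a frame energy profile. The wavelet, level, and peak-picking idea are assumptions for this example; the paper's construction may differ.

```python
# Hypothetical product function from SWT scaling (approximation) and wavelet
# (detail) coefficients, plus a frame energy profile for verification.
import numpy as np
import pywt

def product_function(signal, wavelet="db4", level=3):
    # pywt.swt requires the length to be a multiple of 2**level.
    n = len(signal) - len(signal) % (2 ** level)
    coeffs = pywt.swt(signal[:n], wavelet, level=level)
    c_approx, c_detail = coeffs[0]  # coarsest-scale pair, same length as input
    return np.abs(c_approx * c_detail)

def energy_profile(signal, frame_len=160):
    n = len(signal) - len(signal) % frame_len
    return np.mean(signal[:n].reshape(-1, frame_len) ** 2, axis=1)

# A candidate C/V boundary could be taken near a peak of the product function
# and cross-checked against a rise in the energy profile.
sig = np.random.randn(4000)
print(int(np.argmax(product_function(sig))), energy_profile(sig).shape)
```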


SP-13.7  

Feature Extraction for Speech Recognition Based on Orthogonal Acoustic Feature Planes and LDA
Tsuneo Nitta (Toyohashi University of Technology)

This paper describes an attempt to extract multiple topological structures hidden in time-spectrum (TS) patterns by using multiple mapping functions, and to incorporate these functions into the feature extractor of a speech recognition system. In previous work, the author proposed a novel feature extraction method based on MAFP/KLT (MAFP: multiple acoustic feature planes), in which 3x3 derivative operators were used as the mapping functions, and showed that the method achieved significant improvements in preliminary experiments. In this paper, the mapping functions are first extracted directly, in the form of a 3x3 orthogonal basis, from a speech database. Next, these functions are evaluated together with 3x3 simplified operators modeled on the orthogonal basis. Finally, after comparing the experimental results, the author proposes an effective feature extraction method based on MAFP/LDA, in which a Sobel operator is used for the mapping functions.
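
The sketch below runs a 3x3 Sobel operator over a time-spectrum pattern in the time and frequency directions to form two derivative feature planes, a rough stand-in for the MAFP idea; the toy input size and the downstream KLT/LDA stage are assumptions, not the paper's setup.

```python
# Rough sketch: 3x3 Sobel derivative planes over a time-spectrum pattern.
import numpy as np
from scipy import ndimage

def acoustic_feature_planes(ts_pattern):
    """Return time- and frequency-direction Sobel planes stacked together."""
    d_time = ndimage.sobel(ts_pattern, axis=0)  # derivative along time frames
    d_freq = ndimage.sobel(ts_pattern, axis=1)  # derivative along frequency bins
    return np.stack([d_time, d_freq])

# Toy pattern: 100 frames x 24 filterbank channels.
planes = acoustic_feature_planes(np.random.randn(100, 24))
print(planes.shape)  # (2, 100, 24); a KLT or LDA stage would reduce these
```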


SP-13.8  

Distinctive Feature Detection Using Support Vector Machines
Partha Niyogi, Chris Burges, Padma Ramesh (Bell Labs, Lucent Technologies, USA.)

An important aspect of distinctive-feature-based approaches to automatic speech recognition is the formulation of a framework for robust detection of these features. We discuss the application of the support vector machines (SVMs) that arise when the structural risk minimization principle is applied to such feature detection problems. In particular, we describe the problem of detecting stop consonants in continuous speech and discuss an SVM framework for detecting these sounds. We use both linear and nonlinear SVMs for stop detection and present experimental results showing that they perform better than a hidden Markov model (HMM) system based on cepstral features on the same task.
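
A minimal sketch of the kind of frame-level SVM detector described here is given below, using scikit-learn with mocked cepstral features and labels; the feature extraction, kernel choice, and data are assumptions for illustration, not the authors' experimental setup.

```python
# Toy nonlinear SVM "stop vs. non-stop" frame detector on mocked features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_frames, n_feats = 500, 13                    # e.g. 13 cepstral coefficients per frame
features = rng.normal(size=(n_frames, n_feats))
is_stop = rng.integers(0, 2, size=n_frames)    # placeholder labels (1 = stop consonant)

detector = SVC(kernel="rbf", C=1.0, gamma="scale")  # SVMs follow structural risk minimization
detector.fit(features[:400], is_stop[:400])
print("held-out accuracy:", detector.score(features[400:], is_stop[400:]))
```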



