SP-9.1

The Teager Energy Based Feature Parameters for Robust Speech Recognition in Car Noise
Firas Jabloun (INRS-telecommunications (Canada)), Enis A Cetin (Bilkent University (Turkey))

In this paper, a new set of speech feature parameters based on multirate signal processing and the Teager Energy Operator is developed. The speech signal is first divided into nonuniform subbands in mel-scale using a multirate filter-bank, then the Teager energies of the subsignals are estimated. Finally, the feature vector is constructed by log-compression and inverse DCT computation. The new feature parameters give robust speech recognition performance in car engine noise, which is low-pass in nature.
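As a rough illustration of the Teager energy step, the sketch below applies the standard discrete Teager Energy Operator, psi[n] = x[n]^2 - x[n-1]*x[n+1], to one subband signal and log-compresses the result. The function names, the averaging of absolute values, and the floor constant are assumptions made for this example; the abstract does not specify how the authors estimate or average the subband energies, and the mel-scale filter-bank and inverse DCT are not shown.

```python
import math

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns one value per interior sample (the two end samples are skipped).
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def log_mean_teager_energy(x, floor=1e-12):
    """Log-compressed average Teager energy of one subband signal.

    Averaging absolute values and flooring before the log are assumptions
    made for this sketch, not details taken from the paper.
    """
    psi = teager_energy(x)
    mean_abs = sum(abs(v) for v in psi) / len(psi)
    return math.log(max(mean_abs, floor))

# For a pure tone A*cos(omega*n), the operator returns the constant
# A^2 * sin(omega)^2, so it tracks both amplitude and frequency.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(64)]
psi = teager_energy(tone)
feature = log_mean_teager_energy(tone)
```

Because the operator's output grows with frequency as well as amplitude, low-frequency components contribute less energy than they would under a plain squared-amplitude measure, which is one plausible reading of why such features help in low-pass car noise.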

SP-9.2

AVOIDING DISTORTIONS DUE TO SPEECH CODING AND TRANSMISSION ERRORS IN GSM ASR TASKS
Ascensión Gallardo-Antolín, Fernando Díaz-de-María, Francisco Valverde-Albacete (Departamento de Tecnologías de las Comunicaciones. EPS-Universidad Carlos III de Madrid)

In this paper, we have extended our previous research on a new approach to ASR in the GSM environment. Instead of recognizing from the decoded speech signal, our system works from the digital speech representation used by the GSM encoder. We have compared the performance of a conventional system and the one we propose on a speaker-independent, isolated-digit ASR task. Our results for the half-rate and full-rate GSM codecs show that the proposed approach is much more effective in coping with coding distortion and transmission errors. Furthermore, in clean speech conditions, our approach does not degrade recognition performance, even though it recognizes from GSM digital speech, in comparison with a conventional system working on unencoded speech.

SP-9.3

Binaural Bark subband Preprocessing of nonstationary Signals for noise robust Speech Feature Extraction
Mike Peters (BMW AG Research and Development 80788 Munich, Germany)

A two-channel approach to noise-robust feature extraction for speech recognition in the car is proposed. The coherence function within the Bark subbands of the MFCC transform is calculated to estimate the spectral similarity of two stochastic processes. It is illustrated how the coherence of speech in binaural signals is used to increase the robustness against incoherent noise. The introduced preprocessing of nonstationary signals from two microphones results in an additive correction term for the Mel-Frequency Cepstral Coefficients.
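To make the coherence idea concrete, the following sketch estimates a magnitude-squared coherence per subband from two microphone signals by accumulating cross- and auto-spectra over non-overlapping frames. It is only an illustration under stated assumptions: the function names, the rectangular framing, the naive DFT, the band edges, and the way bins are pooled within a band are all hypothetical choices, and the mapping from coherence to the cepstral correction term described in the abstract is not reproduced.

```python
import cmath
import random

def dft(frame):
    """Naive DFT returning the non-negative frequency bins of a real frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def band_coherence(x, y, frame_len, bands):
    """Magnitude-squared coherence of x and y per band.

    bands is a list of (first_bin, last_bin) pairs, inclusive. Spectra are
    accumulated over non-overlapping frames and pooled over the bins of
    each band (an assumed pooling rule, chosen for this sketch).
    """
    n_bins = frame_len // 2 + 1
    sxy = [0j] * n_bins
    sxx = [0.0] * n_bins
    syy = [0.0] * n_bins
    for start in range(0, min(len(x), len(y)) - frame_len + 1, frame_len):
        fx = dft(x[start:start + frame_len])
        fy = dft(y[start:start + frame_len])
        for k in range(n_bins):
            sxy[k] += fx[k] * fy[k].conjugate()
            sxx[k] += abs(fx[k]) ** 2
            syy[k] += abs(fy[k]) ** 2
    result = []
    for lo, hi in bands:
        cross = sum(sxy[lo:hi + 1])
        px = sum(sxx[lo:hi + 1])
        py = sum(syy[lo:hi + 1])
        result.append(abs(cross) ** 2 / (px * py) if px > 0 and py > 0 else 0.0)
    return result

random.seed(0)
bands = [(1, 4), (5, 9), (10, 16)]  # illustrative bin ranges, not true Bark bands
common = [random.gauss(0.0, 1.0) for _ in range(512)]
noise_left = [random.gauss(0.0, 1.0) for _ in range(512)]
noise_right = [random.gauss(0.0, 1.0) for _ in range(512)]

coherent = band_coherence(common, common, 32, bands)            # same source in both channels
incoherent = band_coherence(noise_left, noise_right, 32, bands)  # independent noise
```

A source that reaches both microphones gives coherence near one in every band, while independent noise in the two channels gives values near zero, which is the contrast the abstract exploits.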

SP-9.4

Speaker Normalized Spectral Subband Parameters for Noise Robust Speech Recognition
Satoru Tsuge, Toshiaki Fukada, Harald Singer (ATR-ITL, JAPAN)

This paper proposes speaker-normalized spectral subband centroids (SSCs) as supplementary features for speech recognition in noisy environments. SSCs are computed as frequency centroids for each subband from the power spectrum of the speech signal. Since the conventional SSCs depend on the formant frequencies of a speaker, we introduce a speaker normalization technique into SSC computation to reduce the speaker variability. Experimental results on spontaneous speech recognition show that the speaker-normalized SSCs are more useful as supplementary features for improving the recognition performance than the conventional SSCs.
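The conventional SSC computation mentioned here can be sketched as a power-weighted mean frequency per subband. The function name, the `gamma` exponent, the band edges, and the fallback for an empty band are assumptions made for this example; the speaker normalization proposed in the paper is not described in the abstract and is not shown.

```python
def subband_centroids(freqs, power, edges, gamma=1.0):
    """Spectral subband centroids: power-weighted mean frequency per subband.

    freqs and power are parallel lists (bin frequency in Hz, power at that bin).
    edges lists the subband boundaries in Hz; band i covers [edges[i], edges[i+1]).
    gamma compresses the power spectrum before weighting (assumed parameter).
    """
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        num = 0.0
        den = 0.0
        for f, p in zip(freqs, power):
            if lo <= f < hi:
                w = p ** gamma
                num += f * w
                den += w
        # Fall back to the band midpoint when the band carries no power.
        centroids.append(num / den if den > 0 else 0.5 * (lo + hi))
    return centroids

# Toy spectrum: one peak at 700 Hz, two equal peaks at 2300 Hz and 2500 Hz.
freqs = [100.0 * k for k in range(40)]
power = [0.0] * 40
power[7] = 1.0
power[23] = 1.0
power[25] = 1.0
centroids = subband_centroids(freqs, power, edges=[0.0, 1000.0, 2000.0, 4000.0])
```

In this toy case the centroids land on the spectral peaks within each band, which illustrates why conventional SSCs follow a speaker's formant positions.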

SP-9.5

TempoRAl Patterns (TRAPs) In ASR Of Noisy Speech
Hynek Hermansky, Sangita Sharma (Oregon Graduate Institute of Science and Technology)

In this paper we study a new approach to processing temporal information for automatic speech recognition (ASR). Specifically, we study the use of rather long-time TempoRAl Patterns (TRAPs) of spectral energies in place of the conventional spectral patterns for ASR. The proposed Neural TRAPs are found to yield a significant amount of information complementary to that of the conventional spectral-feature-based ASR system. A combination of these two ASR systems is shown to result in improved robustness to several types of additive and convolutive environmental degradations.
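The contrast between a spectral pattern (all bands at one frame) and a temporal pattern (one band across many frames) can be sketched as follows. The function name, the edge handling by repeating the first or last frame, and the mean removal are assumptions for illustration; the abstract does not give the pattern length or the normalization, and the neural classifiers that consume these vectors are not shown.

```python
def trap_vector(log_energies, band, center, half_width):
    """Temporal trajectory of one band's log energy around a center frame.

    log_energies is a list of frames, each a list of per-band log energies.
    Frames beyond the utterance edges are replaced by the nearest valid frame,
    and the trajectory mean is removed (both are assumed choices).
    """
    last = len(log_energies) - 1
    trajectory = [log_energies[min(max(t, 0), last)][band]
                  for t in range(center - half_width, center + half_width + 1)]
    mean = sum(trajectory) / len(trajectory)
    return [v - mean for v in trajectory]

# Five frames, two bands: band 0 rises over time, band 1 falls.
frames = [[float(t), 10.0 - t] for t in range(5)]
middle = trap_vector(frames, band=0, center=2, half_width=1)
at_edge = trap_vector(frames, band=0, center=0, half_width=1)
```

Each band yields its own vector of length 2 * half_width + 1, so a system built this way sees how energy in a single band evolves over time rather than the shape of the spectrum at one instant.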

SP-9.6

Signal Modeling for Isolated Word Recognition
Montri Karnjanadecha, Stephen A Zahorian (Old Dominion University)

This paper presents speech signal modeling techniques which are well suited to high-performance and robust isolated word recognition. Speech is encoded by a discrete cosine transform of its spectra, after several preprocessing steps. Temporal information is then also explicitly encoded into the feature set. We present a new technique for incorporating this temporal information as a function of temporal position within each word. We tested features computed with this method using an alphabet recognition task based on the ISOLET database. The HTK toolkit was used to implement the isolated word recognizer with whole-word HMM models. The best result obtained based on 50 features and speaker-independent alphabet recognition was 98.0%. Gaussian noise was added to the original speech to simulate a noisy environment. We achieved a recognition accuracy of 95.8% at an SNR of 15 dB. We also tested our recognizer with simulated telephone-quality speech by adding noise and band-limiting the original speech. For this "telephone" speech, our recognizer achieved 89.6% recognition accuracy. The recognizer was also tested in a speaker-dependent mode, resulting in 97.4% accuracy on test data.

SP-9.7

Transforming HMMs For Speaker-Independent Hands-Free Speech Recognition in the Car
Y. Gong, John J. Godfrey (Texas Instruments Inc.)

In the absence of HMMs trained with speech collected in the target environment, one may use HMMs trained with a large amount of speech collected in another recording condition (e.g., a quiet office with a high-quality microphone). However, this may result in poor performance because of the mismatch between the two acoustic conditions. We propose a linear regression-based model adaptation procedure to reduce such a mismatch. With some adaptation utterances collected for the target environment, the procedure transforms the HMMs trained in a quiet condition to maximize the likelihood of observing the adaptation utterances. The transformation must be designed to maintain speaker-independence of the HMM. Our speaker-independent test results show that with this procedure about 1% digit error rate can be achieved for hands-free recognition, using target environment speech from only 20 speakers.

SP-9.8

Channel and Noise Adaptation via HMM Mixture Mean Transform and Stochastic Matching
Shuen Kong Wong (Department of EEE/HK Univ. of Sci. and Tech.), Bertram Shi (Department of EEE/HK Univ. of Sci. and Tech.)

We present a non-linear model transformation for adapting Gaussian Mixture HMMs using both static and dynamic MFCC observation vectors to additive noise and constant system tilt. This transformation depends upon a few compensation coefficients which can be estimated from channel distorted speech via Maximum-Likelihood stochastic matching. Experimental results validate the effectiveness of the adaptation. We also provide an adaptation strategy which can result in improved performance at reduced computational cost compared with a straightforward implementation of stochastic matching.

Last Update: February 4, 1999 Ingo Höntsch