SP-14.1

Time-Varying Noise Compensation Using Multiple Kalman Filters
Nam Soo Kim (School of Electrical Engineering, Seoul National University)

Speech recognition systems usually operate in nonstationary environmental conditions. We present an approach to compensate for the effects of time-varying noise using a bank of Kalman filters. The method is based on the interacting multiple model (IMM) technique, which is well known in multiple target tracking. We also propose a way to obtain fixed-interval smoothed estimates of the environmental parameters. The proposed approaches are evaluated in continuous digit recognition experiments in which both slowly evolving and rapidly varying noise sources are added to simulate noisy environments.
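To illustrate the IMM idea mentioned in this abstract, here is a minimal sketch, assuming a scalar noise level that follows a random walk and is tracked by two Kalman filters that differ only in their process noise (one "slow", one "fast"). The function names `imm_step` and `run_imm`, the parameter values, and the scalar state model are all illustrative assumptions; this is not the paper's formulation, and it omits the fixed-interval smoothing step.

```python
import numpy as np

def imm_step(y, x, p, mu, trans, q, r):
    """One IMM cycle for scalar random-walk models (illustrative only).

    x, p, mu : per-model state means, variances, and model probabilities
    trans    : trans[i, j] = P(model j now | model i at previous step)
    q, r     : per-model process noise variances, observation noise variance
    """
    # 1. Mixing: predicted model probabilities and mixing weights.
    c = trans.T @ mu                           # c[j] = sum_i trans[i, j] * mu[i]
    mix = trans * mu[:, None] / c[None, :]     # mix[i, j] = P(model i before | model j now)
    x0 = mix.T @ x                             # mixed initial means
    spread = (x[:, None] - x0[None, :]) ** 2
    p0 = np.einsum("ij,ij->j", mix, p[:, None] + spread)

    # 2. Model-matched Kalman filtering (observation y = x + noise).
    p_pred = p0 + q
    innov = y - x0
    s = p_pred + r
    k = p_pred / s
    x_new = x0 + k * innov
    p_new = (1.0 - k) * p_pred
    lik = np.exp(-0.5 * innov ** 2 / s) / np.sqrt(2.0 * np.pi * s)

    # 3. Model probability update.
    mu_new = lik * c
    mu_new /= mu_new.sum()

    # 4. Combined estimate across the filter bank.
    x_hat = mu_new @ x_new
    return x_new, p_new, mu_new, x_hat

def run_imm(obs, q=(1e-4, 1.0), r=0.09, stay=0.95):
    """Run the filter bank over a sequence; all default values are assumed."""
    q = np.asarray(q, dtype=float)
    m = len(q)
    trans = np.full((m, m), (1.0 - stay) / (m - 1))
    np.fill_diagonal(trans, stay)
    x, p, mu = np.zeros(m), np.ones(m), np.full(m, 1.0 / m)
    estimates, probs = [], []
    for y in obs:
        x, p, mu, x_hat = imm_step(y, x, p, mu, trans, q, r)
        estimates.append(x_hat)
        probs.append(mu)
    return np.array(estimates), np.array(probs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    level = np.concatenate([np.zeros(50), np.full(50, 5.0)])  # abrupt noise change
    obs = level + 0.3 * rng.standard_normal(level.size)
    est, probs = run_imm(obs)
    print("fast-model probability at the jump:", probs[50, 1])
    print("final estimate:", est[-1])
```

In this toy setup the slow model dominates while the level is steady, and the fast model takes over when the level jumps, which is the behavior that makes a filter bank attractive for rapidly varying noise.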

SP-14.2

A Segment-based C0 Adaptation Scheme for PMC-based Noisy Mandarin Speech Recognition
Wei-Tyng Hong, Sin-Horng Chen (Dep. of Communication Eng., National Chiao Tung University)

This paper proposes a segment-based C0 (zeroth-order cepstral coefficient) adaptation scheme for PMC-based Mandarin speech recognition. It incorporates a new C0 model of the speech signal into the PMC method to improve the gain matching between the clean-speech HMMs and the current noise model. The C0 model is constructed in the training phase by jointly modeling the normalized C0 with the other MFCC recognition features to form C0-normalized HMMs. In the testing phase, the scheme pre-segments the input utterance into syllable-like segments, performs C0-denormalization operations to expand the C0-normalized HMMs, and uses them in the PMC method. Compared with the conventional PMC method, the proposed method achieves much better noise compensation because of the more precise gain matching in the PMC model combination. Experimental results showed that the base-syllable accuracy rate improved significantly for continuous noisy Mandarin speech recognition.

SP-14.3

Improved Parallel Model Combination Techniques With Split Gaussian Mixtures For Speech Recognition Under Noisy Conditions
Jeih-weih Hung (Dept of Electrical Engineering, National Taiwan University), Jia-lin Shen (Institute of Information Science, Academia Sinica), Lin-shan Lee (Dept of Electrical Engineering, National Taiwan University)

The parallel model combination (PMC) technique has been very successful and is frequently used to improve the performance of speech recognition systems in noisy environments. This approach assumes that the log spectrum of speech signals is Gaussian-distributed, which is not always valid, especially when the HMMs have few mixtures. In this paper, a simple approach is proposed to improve the PMC method by splitting the mixtures before the domain transformation in PMC is performed and merging them back to the original number after the PMC processing is completed. Preliminary experimental results show that increasing the number of mixtures during PMC processing can in fact provide significant improvements in recognition accuracy over the original PMC method, especially when the SNR is low.
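The Gaussian assumption on the log spectrum enters PMC at the domain transformation. The sketch below shows the commonly used log-normal approximation for that step, assuming diagonal covariances and working directly in the log-spectral domain (the cepstral transform is left out). The function names and the `gain` parameter are illustrative assumptions, and the mixture splitting and merging proposed in the abstract are not shown.

```python
import numpy as np

def log_to_linear(mu_log, var_log):
    """Moments of exp(X) when X is Gaussian with the given mean and variance."""
    mu_lin = np.exp(mu_log + 0.5 * var_log)
    var_lin = mu_lin ** 2 * (np.exp(var_log) - 1.0)
    return mu_lin, var_lin

def linear_to_log(mu_lin, var_lin):
    """Inverse mapping: log-normal moments back to Gaussian log-domain parameters."""
    var_log = np.log(var_lin / mu_lin ** 2 + 1.0)
    mu_log = np.log(mu_lin) - 0.5 * var_log
    return mu_log, var_log

def pmc_combine(mu_s, var_s, mu_n, var_n, gain=1.0):
    """Combine a speech Gaussian and a noise Gaussian (log-spectral, diagonal).

    Speech and noise are assumed independent and additive in the linear
    spectral domain; `gain` is a hypothetical scalar level-matching term.
    """
    ms, vs = log_to_linear(mu_s, var_s)
    mn, vn = log_to_linear(mu_n, var_n)
    mu_lin = gain * ms + mn
    var_lin = gain ** 2 * vs + vn
    return linear_to_log(mu_lin, var_lin)

if __name__ == "__main__":
    mu_s, var_s = np.array([1.0, 2.0]), np.array([0.5, 0.2])
    mu_n, var_n = np.array([0.5, 0.5]), np.array([0.1, 0.1])
    print(pmc_combine(mu_s, var_s, mu_n, var_n))
```

Because the sum of two log-normal variables is only approximately log-normal, the result is an approximation whose quality depends on how well each single Gaussian fits the log spectrum, which is the limitation the abstract addresses.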

SP-14.4

Speech Recognition and Enhancement by A Nonstationary AR HMM with Gain Adaptation under Unknown Noise
Ki Yong Lee (School of Electronic Engineering, Soongsil University, 1-1 Sangdo-5Dong, Dongjak-Ku, Seoul, 156-743 Korea), Joohun Lee (Dept. of Information and Telecommunication, Dong-Ah College, Ansung, Korea), Gunther Ruske (Inst. for Human-Machine-Communication, Munich University of Technology, Germany)

In this paper, a gain-adapted speech recognition method for unknown noise is developed in the time domain. The noise is assumed to be colored. A nonstationary autoregressive (NAR) hidden Markov model (HMM) is used to model clean speech. The nonstationary AR model is represented by polynomial functions formed as a linear combination of M known basis functions. Enhancement using multiple Kalman filters is performed, along with estimation of the speech gain contour and the noise model, when only the noisy signal is available.

SP-14.5

Database and Online Adaptation for improved Speech Recognition in Car Environments
Alexander Fischer (Philips Research Laboratories, Aachen, Germany), Volker Stahl (Philips Research Laboratories, Aachen, Germany)

Data collection in the car environment requires much more effort in terms of cost and time than in the telephone or office environment. We therefore apply supervised database adaptation from the telephone environment to the car environment to allow quick setup of car environment recognizers. A further reduction in word error rate is obtained by unsupervised online adaptation during recognition. We investigate the common techniques MLLR and MAP for that purpose. We give results on command word recognition in the car environment for all combinations of database and online adaptation in task-dependent and task-independent scenarios. The results demonstrate that speech recognizers for the car environment can be set up from telephone data and a limited amount of adaptation material from the car environment.

SP-14.6

Training of HMM with Filtered Speech Material for Hands-free Recognition
Diego Giuliani, Marco Matassoni, Maurizio Omologo, Piergiorgio Svaizer (ITC-IRST Centro per la Ricerca Scientifica e Tecnologica)

This paper addresses the problem of hands-free speech recognition in a noisy office environment. An array of six omnidirectional microphones and a corresponding time delay compensation module provide a beamformed signal as input to an HMM-based recognizer. HMMs are trained using either a clean speech database or a filtered version of the same database. The filtering consists of a convolution with the acoustic impulse response between speaker and microphone, which reproduces the reverberation effect. Background noise is then added to obtain the desired SNR. The paper shows that the new models trained on these data perform better than the baseline models. Furthermore, the paper investigates MLLR adaptation of the new models. Adaptation yields a further performance improvement, reaching a 98.7% WRR in a connected digit recognition task when the talker is 1.5 m from the array.
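The filtering described in this abstract (convolution with an impulse response, then noise added at a chosen SNR) can be sketched as follows. This is a minimal single-channel illustration under assumed inputs: the function name `simulate_hands_free`, the synthetic impulse response, and the random signals are hypothetical, and the six-microphone beamforming stage is not modeled.

```python
import numpy as np

def simulate_hands_free(clean, impulse_response, noise, snr_db):
    """Return a reverberant, noisy version of `clean` at the requested SNR.

    The SNR is measured between the reverberant speech and the scaled noise.
    """
    # Reverberation: convolve with the speaker-to-microphone impulse response
    # and truncate to the original length.
    reverberant = np.convolve(clean, impulse_response)[: len(clean)]
    if len(noise) < len(reverberant):
        raise ValueError("noise must be at least as long as the speech signal")
    noise = noise[: len(reverberant)]

    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + scale * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)              # stand-in for 1 s of speech
    # Toy impulse response: exponentially decaying random tail (assumed shape).
    h = rng.standard_normal(800) * np.exp(-np.arange(800) / 100.0)
    noise = rng.standard_normal(16000)
    noisy = simulate_hands_free(clean, h, noise, snr_db=10.0)
    print(noisy.shape)
```

In practice the impulse response would be measured in the target room rather than synthesized, and the noise would be recorded background noise.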

SP-14.7

Incremental Enrollment of Speech Recognizers
Chafic E Mokbel (France Telecom - CNET - DIH/DIPS (Currently at IDIAP)), Olivier Collin (France Telecom - CNET - DIH/DIPS)

Classical adaptation approaches generally allow a reliably trained model to match a particular condition. In this paper, we define an incremental version of the segmental EM algorithm. This method makes it possible to incrementally enrich a model first trained with a limited amount of data. Memory constraints allow only the statistics of the initial data to be stored. The proposed method uses these statistics by fixing the optimal paths found for the initial data when the segmental EM algorithm is applied to both the initial and the new data. We prove theoretically that this is equivalent to segmental MAP adaptation with a specific choice of priors. In experiments on two speaker-dependent telephone databases, the approach made it possible to integrate new conditions of use incrementally. Performance was slightly lower than that obtained with classical training on the whole data set. As expected from the MAP interpretation of the algorithm, the characteristics of the initial data strongly influence the evolution of the model.

SP-14.8

Automatic Speech Recognition: A Communication Perspective
Bishnu S Atal (AT&T Labs, Florham Park, NJ 07932, USA)

Speech recognition is usually regarded as a problem in the field of pattern recognition, where one first estimates the probability density function of each pattern to be recognized and then uses Bayes' theorem to identify the pattern that provides the highest likelihood for the observed speech data. In this paper, we take a different approach to this problem. In speech recognition, the goal is communication of information by voice, and we discuss the basics of speech recognition from a communication perspective. The speech signal at the acoustic level has a bit rate of 64 kb/s, but the underlying sound patterns have an information rate of less than 100 b/s. What is the role of this high bit rate at the acoustic level? We discuss the principles of decoding patterns that are submerged in an ocean of seemingly irrelevant information.

Last Update: February 4, 1999 Ingo Höntsch