Session: SPEECH-P14
Time: 1:00 - 3:00, Friday, May 11, 2001
Location: Exhibit Hall Area 8
Title: Robust Speech Recognition 1
Chair: Yifan Gong

1:00, SPEECH-P14.1
A ROBUST, REAL-TIME ENDPOINT DETECTOR WITH ENERGY NORMALIZATION FOR ASR IN ADVERSE ENVIRONMENTS
Q. LI, J. ZHENG, Q. ZHOU, C. LEE
When automatic speech recognition (ASR) is applied to hands-free or other adverse acoustic environments, endpoint detection and energy normalization can be crucial to the entire system. In low signal-to-noise ratio (SNR) situations, conventional approaches to endpointing and energy normalization often fail, and ASR performance usually degrades dramatically. The goal of this paper is to find a fast, accurate, and robust endpointing algorithm for real-time ASR. We propose a novel approach using a special filter plus a 3-state decision logic for endpoint detection. The filter was designed under several criteria to ensure the accuracy and robustness of detection. The detected endpoints are then applied to energy normalization simultaneously. Evaluation results showed that the proposed algorithm significantly reduced the string error rates on 7 of the 12 tested databases; on two of them the reduction even exceeded 50%. The algorithm uses only one-dimensional energy with a 24-frame lookahead; it therefore has low complexity and is suitable for real-time ASR.
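The general idea of a filtered-energy endpoint detector with a small state machine can be sketched as follows; the smoothing filter, noise-floor estimate, thresholds, and 24-frame hangover here are illustrative assumptions, not the paper's actual filter design or decision logic.

```python
import numpy as np

def detect_endpoints(log_energy, on_thresh=3.0, off_thresh=1.0, hangover=24):
    """Toy 3-state endpoint detector over a 1-D log-energy contour.

    States: 0 = silence, 1 = in-speech, 2 = leaving-speech (hangover).
    Thresholds are offsets above a crude noise-floor estimate; all
    constants and the filter shape are illustrative, not the paper's.
    """
    # Smooth the energy contour with a short moving-average "filter".
    kernel = np.ones(5) / 5.0
    e = np.convolve(log_energy, kernel, mode="same")
    floor = np.percentile(e, 10)          # crude noise-floor estimate
    state, count = 0, 0
    start = end = None
    for t, v in enumerate(e):
        if state == 0 and v > floor + on_thresh:
            state, start = 1, t           # speech onset detected
        elif state == 1 and v < floor + off_thresh:
            state, count = 2, 0           # candidate offset: start hangover
        elif state == 2:
            count += 1
            if v > floor + on_thresh:
                state = 1                 # energy came back: still speech
            elif count >= hangover:
                end = t - hangover        # confirmed end of utterance
                break
    return start, end
```

The hangover counter is what keeps short intra-word pauses from being mistaken for the utterance end.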

1:00, SPEECH-P14.2
ROBUST SPEECH/NON-SPEECH DETECTION USING LDA APPLIED TO MFCC
A. MARTIN, L. MAUUARY, D. CHARLET
In speech recognition, speech/non-speech detection must be robust to noise. In this work, a new method for speech/non-speech detection using Linear Discriminant Analysis (LDA) applied to Mel Frequency Cepstrum Coefficients (MFCC) is presented. Energy is the most discriminant parameter between noise and speech, but with this single parameter the speech/non-speech detection system detects too many noise segments. LDA applied to MFCC, together with the associated test, reduces the detection of noise segments. This new algorithm is compared to one based on the signal-to-noise ratio (SNR).
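For two classes (speech vs. non-speech), LDA reduces to Fisher's discriminant: project each MFCC frame onto the single axis that best separates the class means relative to within-class scatter. A minimal numpy sketch under that standard formulation (not the authors' exact training setup or test statistic):

```python
import numpy as np

def lda_direction(speech_mfcc, noise_mfcc):
    """Fisher discriminant direction for speech vs. non-speech frames.

    Inputs are (n_frames, n_coeffs) arrays of MFCC vectors; the returned
    unit vector w projects a frame onto the most discriminant axis.
    """
    mu_s, mu_n = speech_mfcc.mean(0), noise_mfcc.mean(0)
    # Pooled within-class scatter (small ridge term for invertibility).
    Sw = np.cov(speech_mfcc, rowvar=False) + np.cov(noise_mfcc, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), mu_s - mu_n)
    return w / np.linalg.norm(w)

def is_speech(frame_mfcc, w, threshold):
    """Label a frame as speech when its LDA projection exceeds a threshold."""
    return frame_mfcc @ w > threshold
```

Thresholding the scalar projection then plays the role the abstract assigns to "the associated test".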

1:00, SPEECH-P14.3
FEATURE ENHANCEMENT FOR A BITSTREAM-BASED FRONT-END IN WIRELESS SPEECH RECOGNITION
H. KIM, R. COX
In this paper, we propose a feature enhancement algorithm for wireless speech recognition in adverse acoustic environments. A speech recognition system is realized at the receiver side of a wireless communications system, and feature parameters are extracted directly from the bitstream of the speech coder employed in the system. The feature parameters are composed of spectral envelope and coder-specific information. The proposed feature enhancement algorithm incorporates feature parameters obtained from the decoded speech, and from an enhanced version of it, into the bitstream-based feature parameters. Moreover, the coder-specific parameters are improved by reestimating the codebook gains and residual energy from the enhanced residual signal. HMM-based connected digit recognition experiments show that the proposed feature enhancement algorithm significantly improves recognition accuracy at low SNR without degrading performance at high SNR.

1:00, SPEECH-P14.4
CONTINUOUS SPEECH RECOGNITION WITHOUT END-POINT DETECTION
O. SEGAWA, K. TAKEDA, F. ITAKURA
A new continuous speech recognition method that does not require explicit speech end-point detection is proposed. A one-pass decoding algorithm is modified to decode input speech of unbounded length so that, with appropriate non-speech models for silence and ambient noise, continuous speech recognition can be executed without explicit end-point detection. The basic algorithm is to 1) decode a processing block of predetermined length, 2) trace back and find the boundary within the processing block at which the word histories of the preceding processing block merge into one, and 3) restart decoding from that boundary frame with the merged word history. The effectiveness of the method is verified in two dictation experiments. With 100 consecutive sentences of newspaper utterances, the degradation in recognition accuracy due to the modified decoder is about 5% compared with the results when the correct end-points are given. With a 30-minute dialogue in a moving car, scores of 75% correct and 69% accuracy are obtained.

1:00, SPEECH-P14.5
ROBUST END-OF-UTTERANCE DETECTION FOR REAL-TIME SPEECH RECOGNITION APPLICATIONS
R. HARIHARAN, J. HÄKKINEN, K. LAURILA
In this paper we propose a sub-band energy based end-of-utterance algorithm that detects the time instant at which the user has stopped speaking. The proposed algorithm finds the time instant at which a sufficient number of sub-band spectral energy trajectories fall below adaptive thresholds and stay there for a pre-defined fixed time, i.e. a non-speech period is detected after the end of the utterance. With the proposed algorithm, a practical speech recognition system can give the user timely feedback, making its behaviour more predictable and consistent across different usage environments and noise conditions. The proposed algorithm is shown to be more accurate and noise-robust than previously proposed approaches. Experiments with both isolated command word recognition and continuous digit recognition in various noise conditions verify the viability of the approach, with an average correct end-of-utterance detection rate of around 94% in both cases, representing a 43% error rate reduction over the most competitive previously published method.
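The core condition ("enough bands below adaptive thresholds for a fixed time") can be sketched as below; the noise-floor adaptation rule and all constants are illustrative assumptions, not the authors' design.

```python
import numpy as np

def end_of_utterance(band_energies, alpha=0.99, margin=2.0,
                     min_bands=3, hold_frames=30):
    """Flag end-of-utterance when enough sub-band energy trajectories
    fall below adaptive thresholds and stay there for a fixed time.

    band_energies: (n_frames, n_bands) log-energies. Returns the frame
    at which the qualifying non-speech hold began, or None.
    """
    floor = band_energies[0].copy()          # per-band noise-floor estimate
    speech_seen, below_for = False, 0
    for t, e in enumerate(band_energies):
        # Track the noise floor: follow drops immediately, rise slowly.
        floor = np.minimum(e, alpha * floor + (1 - alpha) * e)
        below = np.sum(e < floor + margin)
        if below < min_bands:
            speech_seen = True               # enough bands above threshold
            below_for = 0
        elif speech_seen:
            below_for += 1
            if below_for >= hold_frames:
                return t - hold_frames + 1   # frame where the hold began
    return None
```

Gating on `speech_seen` keeps leading silence from triggering a detection before any speech has occurred.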

1:00, SPEECH-P14.6
MULTI-STREAM ASR TRAINED WITH HETEROGENEOUS REVERBERANT ENVIRONMENTS
M. SHIRE
A common problem with current automatic speech recognition (ASR) systems is that performance degrades when the system is presented with speech from a different acoustic environment than the one used during training. An important cause is that the feature distribution to which the ASR system is trained no longer matches that of the new environment. Reverberant environments can be especially harmful. In this work, we test a multi-stream system in which the constituent streams are each trained in separate acoustic environments. When training the acoustic modeling stages of the streams separately with clean data and heavily reverberated data, we find that the combined system can improve ASR performance on unseen reverberated test data.

1:00, SPEECH-P14.7
ADAPTIVE ML-WEIGHTING IN MULTI-BAND RECOMBINATION OF GAUSSIAN MIXTURE ASR
A. HAGEN, H. BOURLARD, A. MORRIS
Multi-band speech recognition is powerful in band-limited noise, since the recognizer for the noisy, less reliable band can be given less weight in the recombination process. However, an accurate decision on which bands are reliable and which are corrupted by noise is usually hard to make. In this article, we investigate a maximum-likelihood (ML) approach to adapting the combination weights of a multi-band system. The Gaussian mixture model parameters are kept constant, while the combination weights are iteratively updated to maximize the data likelihood. Unsupervised offline and online weight adaptation are compared to the use of equal weights, `cheating' weights where the noisy band is known, and the full-band system. Initial tests show that both ML-weighting strategies yield a robustness gain in band-limited noise.
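One simple formulation of iterative ML weight estimation with fixed acoustic models treats the combined likelihood as a mixture of per-band likelihoods and applies EM updates to the weights. This is an illustration of the idea, not the authors' exact recombination rule:

```python
import numpy as np

def adapt_weights(band_lik, n_iter=20):
    """EM-style maximum-likelihood adaptation of band combination weights.

    band_lik: (n_frames, n_bands) per-frame likelihoods from each band
    recognizer. Models the combined likelihood as the weighted mixture
    sum_b w_b * p_b(x); the acoustic models stay fixed, only w moves.
    """
    n_frames, n_bands = band_lik.shape
    w = np.full(n_bands, 1.0 / n_bands)
    for _ in range(n_iter):
        # E-step: responsibility of each band for each frame.
        joint = band_lik * w
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: new weights are the average responsibilities.
        w = resp.mean(axis=0)
    return w
```

Run unsupervised on test data, the update drives weight away from a band whose likelihoods are consistently poor, which is the desired behaviour under band-limited noise.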

1:00, SPEECH-P14.8
ROBUST SPEECH RECOGNITION IN BURST-LIKE PACKET LOSS
B. MILNER
This paper examines problems associated with performing speech recognition over mobile and IP networks. The main problems are identified as codec-based distortion and the loss of speech feature vectors caused by packet loss in the network. A realistic model of packet loss, based on a three-state Markov model, is developed and shown to be capable of simulating the burst-like nature of packet loss. A two-stage packet loss detection and estimation scheme is proposed and shown to improve recognition performance when feature vectors are lost. Results on the Aurora database show that burst-like packet loss reduces digit accuracy from 99% to 57% at a 50% packet loss rate. Estimation of lost packets recovers performance to 77%.
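A small Markov chain produces burst-like loss because a "bad" state with a strong self-transition keeps consecutive packets lost. The three-state topology and probabilities below are generic illustrations; the paper's actual model parameters are not reproduced here.

```python
import random

def simulate_packet_loss(n_packets, trans, loss_in_state, seed=0):
    """Simulate burst-like packet loss with a small Markov chain.

    trans: list of per-state transition-probability rows (each sums to 1);
    loss_in_state: probability a packet is lost while in each state.
    Returns a list of booleans, True where the packet was lost.
    """
    rng = random.Random(seed)
    state = 0
    lost = []
    for _ in range(n_packets):
        lost.append(rng.random() < loss_in_state[state])
        # Sample the next state from this row of the transition matrix.
        r, cum = rng.random(), 0.0
        for nxt, p in enumerate(trans[state]):
            cum += p
            if r < cum:
                state = nxt
                break
    return lost

# Hypothetical good / degraded / burst-loss states.
TRANS = [[0.95, 0.05, 0.0],
         [0.30, 0.40, 0.3],
         [0.20, 0.00, 0.8]]
LOSS = [0.0, 0.5, 1.0]
```

The self-loop of 0.8 in the third state is what clusters losses into runs rather than scattering them independently.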

1:00, SPEECH-P14.9
AUTOMATIC TRANSCRIPTION OF COMPRESSED BROADCAST AUDIO
C. BARRAS, L. LAMEL, J. GAUVAIN
With increasing volumes of audio and video data broadcast over the web, it is of interest to assess the performance of state-of-the-art automatic transcription systems on compressed audio data for media indexing applications. In this paper the performance of the LIMSI 10x French broadcast news transcription system is measured on a two-hour audio set for a range of MP3 and RealAudio codecs at various bitrates, and for the GSM codec used in European cellular phone communications. The word error rates are compared with those obtained on high-quality PCM recordings prior to compression. For a 6.5 kbps audio bit rate (the most commonly used on the web), word error rates under 40% can be achieved, which makes automatic media monitoring over the web a realistic task.

1:00, SPEECH-P14.10
SOFT-FEATURE DECODING FOR SPEECH RECOGNITION OVER WIRELESS CHANNELS
A. POTAMIANOS, V. WEERACKODY
A distributed automatic speech recognition (ASR) system is considered in which features of the speech signal are extracted at the wireless terminal and transmitted to a centralized ASR server. An unequal error protection scheme is used for the quantized ASR feature stream. At the receiver, coherent demodulation is performed and the probability of error for each bit is computed using the Max-Log MAP algorithm. A `soft-feature' decoding strategy is introduced at the ASR server that uses the marginal distribution of only the reliable features during likelihood computation. Alternatively, the confidence of each feature is computed from the bit error probabilities and each feature in the probability computation is weighted as a function of its confidence. The performance of the proposed soft-feature algorithms is evaluated over typical cellular wireless channels, and it is shown to reduce the ASR error rate by up to 30% for certain channels at no additional computational cost.
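The two strategies in the abstract, marginalizing out unreliable dimensions versus confidence-weighting every dimension, can both be sketched on a toy diagonal-Gaussian likelihood; the server's actual HMM state likelihoods are more involved, so this is only an illustration of the two weighting rules.

```python
import numpy as np

def soft_log_likelihood(x, mean, var, confidence, threshold=0.5):
    """Per-dimension Gaussian log-likelihood with soft-feature handling.

    confidence[i] in [0, 1] would come from the channel decoder's bit
    error probabilities. Strategy (a) drops (marginalizes out) dimensions
    whose confidence is below a threshold; strategy (b) scales each
    dimension's log-likelihood by its confidence.
    """
    ll = -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    # (a) Marginalization: unreliable dimensions contribute nothing.
    marginal = np.where(confidence >= threshold, ll, 0.0).sum()
    # (b) Confidence weighting: each dimension scaled by its reliability.
    weighted = (confidence * ll).sum()
    return marginal, weighted
```

Either way, a badly corrupted feature dimension no longer drags down the likelihood of the correct hypothesis.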