Session: SPEECH-P15
Time: 3:30 - 5:30, Friday, May 11, 2001
Location: Exhibit Hall Area 8
Title: Robust Speech Recognition 2
Chair: Rich Stern

3:30, SPEECH-P15.1
SPEECH IN NOISY ENVIRONMENTS: ROBUST AUTOMATIC SEGMENTATION, FEATURE EXTRACTION, AND HYPOTHESIS COMBINATION
R. SINGH, M. SELTZER, B. RAJ, R. STERN
The first evaluation for Speech in Noisy Environments (SPINE1) was conducted by the Naval Research Laboratory (NRL) in August 2000. The purpose of this evaluation was to test existing core speech recognition technologies on speech in the presence of varying types and levels of noise, in this case taken from military settings. Among the strategies used by Carnegie Mellon University's successful systems for this task were session-adaptive segmentation, robust mel-scale filtering for the computation of cepstra, the use of parallel front-end features and noise-compensation algorithms, and parallel hypothesis combination through word graphs. This paper describes the motivations behind the design decisions taken for these components, supported by observations and experiments.
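
As a rough illustration of the mel-scale filtering step mentioned above, the following sketch computes filterbank cepstra from a frame of speech. It is a generic MFCC-style front end under assumed parameters (16 kHz sampling, 40 filters, 13 cepstra), not the robust filtering actually used in the CMU systems.

```python
# Minimal sketch of mel-scale filterbank cepstra; illustrative only.
import numpy as np

def mel(f):      # Hz -> mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):  # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, fs=16000):
    # triangular filters with edges equally spaced on the mel scale
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def cepstra(frame, fb, n_ceps=13):
    spec = np.abs(np.fft.rfft(frame, n=2 * (fb.shape[1] - 1))) ** 2
    logmel = np.log(fb @ spec + 1e-10)
    # DCT-II of the log filterbank energies gives the cepstra
    n = fb.shape[0]
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n) + 0.5) / n)
    return dct @ logmel
```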

3:30, SPEECH-P15.2
ADAPTIVE TRANSITION BIAS FOR ROBUST LOW COMPLEXITY SPEECH RECOGNITION
K. KOUMPIS, S. RIIS
The basis for all methods described in this paper is the application of an adaptive transition bias to the sequences of phoneme models that represent spoken utterances. This offers significantly improved accuracy in phoneme-based speaker-independent recognition while adding very little to the overall system complexity. The algorithms were tested using the low-complexity hybrid recognizer known as the Hidden Neural Network (HNN) on US English and Japanese speaker-independent name-dialing tasks. Experimental results show that our approach provides a relative error rate reduction of up to 47% over the baseline system.
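
The sketch below shows one plausible place such a transition bias could enter: a Viterbi pass in which a constant log-domain bias beta is applied to every cross-phoneme transition. How the bias is adapted, which is the paper's contribution, is not reproduced here; beta is simply a parameter.

```python
# Sketch of Viterbi decoding with a transition bias on phoneme changes.
import numpy as np

def viterbi_with_bias(log_emit, log_trans, phone_of_state, beta):
    """log_emit: (T, S) emission log-likelihoods
       log_trans: (S, S) transition log-probabilities
       phone_of_state: length-S np.ndarray mapping states to phoneme ids
       beta: log-domain bias added to cross-phoneme transitions"""
    T, S = log_emit.shape
    cross = phone_of_state[:, None] != phone_of_state[None, :]
    biased = log_trans + beta * cross            # bias only phoneme changes
    delta = log_emit[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + biased         # (pred, state) scores
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emit[t]
    # backtrace the best state sequence
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```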

3:30, SPEECH-P15.3
EXPERIMENTS WITH AN EXTENDED ADAPTIVE SVD ENHANCEMENT SCHEME FOR SPEECH RECOGNITION IN NOISE
C. UHL, M. LIEB
An extension to adaptive signal subspace methods is presented, based on singular value decomposition (SVD) with an online estimate of the noise variance. With this approach, which targets automatic speech recognition (ASR) in adverse environmental conditions, no speech detection has to be performed. Different SVD approaches and nonlinear spectral subtraction are compared within ASR experiments on different applications for weakly correlated noise scenarios. Signal subspace speech enhancement is reported to perform better in terms of both recognition accuracy and robustness of parameter tuning.
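
The following sketch illustrates the general signal subspace idea under a simplifying assumption: the noise variance sigma is given, rather than estimated online as in the paper. A Hankel trajectory matrix of the noisy frame is rank-reduced via SVD and mapped back to a waveform by diagonal averaging.

```python
# Sketch of SVD-based signal subspace enhancement with a known noise level.
import numpy as np

def svd_enhance(frame, order=20, sigma=0.01):
    n = len(frame) - order + 1
    # trajectory (Hankel) matrix of the noisy frame
    H = np.stack([frame[i:i + order] for i in range(n)])
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    # shrink singular values toward the estimated noise floor;
    # components below the floor are removed entirely
    s_clean = np.sqrt(np.maximum(s ** 2 - n * sigma ** 2, 0.0))
    Hc = (U * s_clean) @ Vt
    # diagonal averaging maps the rank-reduced matrix back to a signal
    out = np.zeros(len(frame))
    cnt = np.zeros(len(frame))
    for i in range(n):
        out[i:i + order] += Hc[i]
        cnt[i:i + order] += 1
    return out / cnt
```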

3:30, SPEECH-P15.4
ACOUSTIC SYNTHESIS OF TRAINING DATA FOR SPEECH RECOGNITION IN LIVING ROOM ENVIRONMENTS
V. STAHL, A. FISCHER, R. BIPPUS
Despite continuous progress in robust automatic speech recognition in recent years, acoustic mismatch between training and test conditions is still a major problem. Consequently, large speech collections must be conducted in many environments. An alternative approach is to generate training data synthetically by filtering clean speech with impulse responses and adding noise signals from the target domain. We compare the performance of a speech recognizer trained on speech recorded in the target domain with a system trained on suitably transformed clean speech. To obtain comparable results, our experiments are based on two-channel recordings with a close-talk and a distant microphone, which produce the clean signal and the target-domain signal, respectively. By filtering and adding noise, we obtain error rates that are only 10 percent higher for natural number recognition and 30 percent higher for a command recognition task, compared with training on target-domain data.
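
A minimal sketch of the synthesis step, assuming a measured room impulse response and a noise recording from the target domain are available; the SNR target and function names here are placeholders, not details from the paper.

```python
# Sketch: synthesize "target domain" training data from clean speech.
import numpy as np
from scipy.signal import fftconvolve

def synthesize(clean, rir, noise, snr_db=10.0):
    # filter clean speech with the impulse response (distant microphone)
    x = fftconvolve(clean, rir)[:len(clean)]
    noise = noise[:len(x)]
    # scale the noise so the mixture has the requested overall SNR
    g = np.sqrt(np.mean(x ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return x + g * noise
```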

3:30, SPEECH-P15.5
MAXIMUM-LIKELIHOOD COMPENSATION OF ZERO-MEMORY NONLINEARITIES IN SPEECH SIGNALS
R. MORRIS, M. CLEMENTS
In this paper, an algorithm to blindly compensate zero-memory nonlinear distortions of speech waveforms is derived and analyzed. The method finds a maximum-likelihood estimate of the distortion, without a priori knowledge of the microphone characteristics, using the expectation-maximization algorithm. The autoregressive signal model coefficients are estimated jointly with the nonlinearity estimate produced by an extended Kalman filter. A new family of nonlinear functions is also developed for use with this algorithm, although the method can estimate the shape of any parametric zero-memory nonlinearity. Such nonlinear distortions can degrade speech recognition rates while lowering perceptual quality only slightly. The compensation algorithm improves automatic speech recognition of distorted speech for a variety of such nonlinearities.
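
The sketch below is a much-simplified stand-in for the paper's method: a tanh-style saturation represents the family of zero-memory nonlinearities, and its parameter is found by a crude grid search minimizing the AR prediction residual of the inverted signal, rather than by the EM and extended-Kalman-filter estimation the authors derive.

```python
# Sketch: compensate a zero-memory saturation by inverting an estimated
# parametric nonlinearity; the grid search is a stand-in for ML/EM.
import numpy as np

def distort(x, a):
    return np.tanh(a * x) / a          # saturation with unit slope at 0

def invert(y, a):
    return np.arctanh(np.clip(a * y, -0.999, 0.999)) / a

def ar_residual_power(x, p=12):
    # fit AR(p) by least squares; low residual suggests undistorted speech
    X = np.stack([x[p - k - 1:len(x) - k - 1] for k in range(p)], axis=1)
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ coef) ** 2)

def estimate_and_compensate(y, grid=np.linspace(0.5, 8.0, 40)):
    best = min(grid, key=lambda a: ar_residual_power(invert(y, a)))
    return invert(y, best), best
```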

3:30, SPEECH-P15.6
IMPROVED NOISE ROBUSTNESS BY CORRECTIVE AND RIVAL TRAINING
C. MEYER, G. ROSE
We show that discriminative training methods have the potential to improve noise robustness even for high-resolution acoustic models trained on noisy data. To this end, we compare the performance of acoustic models trained on noisy data using maximum likelihood (ML), corrective training (CT), and rival training (RT). Experiments are performed on German and Dutch continuous digit string recognition tasks, yielding relative improvements in the range of 12% to 35%.
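
As a loose illustration of the corrective idea, the sketch below nudges Gaussian means toward frames aligned to the correct transcription and away from frames aligned to a rival hypothesis. The update rule and learning rate are assumptions for illustration; the paper's exact CT and RT criteria are not reproduced here.

```python
# Sketch of a corrective-style update on acoustic-model Gaussian means.
import numpy as np

def corrective_update(means, frames, correct_align, rival_align, eta=0.01):
    """means: (S, D) Gaussian means; frames: (T, D) feature vectors;
       correct_align / rival_align: length-T state indices per frame."""
    for t, x in enumerate(frames):
        sc, sr = correct_align[t], rival_align[t]
        if sc != sr:                              # only where they disagree
            means[sc] += eta * (x - means[sc])    # pull toward correct state
            means[sr] -= eta * (x - means[sr])    # push away from rival
    return means
```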

3:30, SPEECH-P15.7
CONTINUOUS SPEECH RECOGNITION UNDER NON-STATIONARY MUSICAL ENVIRONMENTS BASED ON SPEECH STATE TRANSITION MODEL
M. FUJIMOTO, Y. ARIKI
In this paper, we propose a non-stationary noise reduction method based on a speech state transition model. The proposed method estimates the speech signal under non-stationary noisy environments, such as musical backgrounds, by incorporating the speech state transition model into Kalman filtering. The speech state transition model represents the state transitions of the speech component in non-stationary noisy speech and is derived using a Taylor expansion; the state transition of the noise component is estimated using linear prediction. To evaluate the proposed method, we carried out large-vocabulary continuous speech recognition experiments under three types of music and compared the results, in terms of word accuracy, with the conventional Parallel Model Combination (PMC) method. The proposed method achieved word accuracy superior to PMC.
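
The sketch below shows the underlying Kalman recursion with an AR(p) speech model in companion (state-space) form. Fixed AR coefficients and a fixed observation-noise variance stand in for the paper's Taylor-expansion state transition model and linear-predictive noise estimation.

```python
# Sketch of Kalman-filter speech estimation with an AR(p) speech model.
import numpy as np

def kalman_denoise(y, a, q, r):
    """y: noisy samples; a: AR coefficients [a1..ap] of the speech model;
       q: driving-noise variance; r: observation-noise variance."""
    p = len(a)
    F = np.zeros((p, p))
    F[0, :] = a                  # s_t = a1*s_{t-1} + ... + ap*s_{t-p}
    F[1:, :-1] = np.eye(p - 1)   # shift register for past samples
    H = np.zeros(p)
    H[0] = 1.0                   # we observe the current speech sample
    x = np.zeros(p)
    P = np.eye(p)
    out = np.empty(len(y))
    for t, yt in enumerate(y):
        x = F @ x                              # predict
        P = F @ P @ F.T
        P[0, 0] += q                           # process noise on s_t only
        k = P @ H / (H @ P @ H + r)            # Kalman gain
        x = x + k * (yt - H @ x)               # update with observation
        P = P - np.outer(k, H @ P)
        out[t] = x[0]
    return out
```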

3:30, SPEECH-P15.8
HIGH-PERFORMANCE ROBUST SPEECH RECOGNITION USING STEREO TRAINING DATA
L. DENG, A. ACERO, X. HUANG, J. DROPPO, L. JIANG
We describe SPLICE, a novel technique for high-performance robust speech recognition. It is an efficient noise reduction and channel distortion compensation technique that makes effective use of stereo training data. In this paper, we present a new version of SPLICE using the minimum-mean-square-error decision, and describe an extension that trains clusters of HMMs with SPLICE processing. Comprehensive results on a Wall Street Journal large-vocabulary recognition task with a wide range of noise types demonstrate that SPLICE outperforms even training under noisy matched conditions (a 19% word error rate reduction). The new technique is also shown to consistently outperform spectral-subtraction noise reduction, and it is currently being integrated into Microsoft MiPad, a new-generation PDA prototype.
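
The MMSE form of the SPLICE correction, x_hat = y + sum_k p(k|y) r_k, lends itself to a short sketch: bias vectors r_k are learned per component of a GMM on noisy cepstra from stereo (clean, noisy) pairs, then applied as a posterior-weighted correction. The GMM is assumed pre-trained, and unit covariances are an assumption made for brevity.

```python
# Sketch of SPLICE training and application in its MMSE form.
import numpy as np

def posteriors(Y, means, log_priors):
    # p(k|y) for unit-covariance Gaussian components (assumption)
    d2 = ((Y[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    logp = log_priors - 0.5 * d2
    logp -= logp.max(axis=1, keepdims=True)
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def train_splice(X_clean, Y_noisy, means, log_priors):
    # per-component bias r_k = posterior-weighted mean of (clean - noisy)
    P = posteriors(Y_noisy, means, log_priors)          # (N, K)
    num = P.T @ (X_clean - Y_noisy)                     # (K, D)
    return num / P.sum(axis=0)[:, None]

def apply_splice(Y, means, log_priors, r):
    P = posteriors(Y, means, log_priors)
    return Y + P @ r                                    # MMSE correction
```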

3:30, SPEECH-P15.9
SNR-DEPENDENT WAVEFORM PROCESSING FOR IMPROVING THE ROBUSTNESS OF ASR FRONT-END
D. MACHO, Y. CHENG
In this paper, we introduce a new concept for advancing the noise robustness of the speech recognition front-end. The presented method, called SNR-dependent Waveform Processing (SWP), exploits the SNR variability within a speech period, enhancing the high-SNR portion and attenuating the low-SNR portion of each period directly in the time domain. In this way, the overall SNR of noisy speech is increased and, at the same time, the periodicity of voiced speech is enhanced. This approach differs significantly from well-known speech enhancement techniques, which are mostly frequency-domain based, and we use it in this work as a technique complementary to them. Tests with SWP show significant recognition performance gains on both clean and noisy speech using the AURORA 2 database and the recognition system defined by ETSI for the robust front-end standardization process. Moreover, the algorithm is very simple and attractive in terms of computational load.
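
A crude sketch of the SWP idea follows: segments with high estimated SNR are emphasized and low-SNR segments attenuated. Fixed short windows stand in for the pitch-synchronous period portions used in the paper, and the noise floor is taken from the weakest windows of the utterance; both are assumptions.

```python
# Sketch: SNR-dependent weighting of the waveform in the time domain.
import numpy as np

def swp_weight(x, win=40, alpha=0.5):
    n = len(x) // win
    seg = x[:n * win].reshape(n, win)
    e = (seg ** 2).mean(axis=1)                  # short-time energies
    noise = np.percentile(e, 10) + 1e-12         # crude noise-floor estimate
    snr = e / noise
    w = (snr / snr.mean()) ** alpha              # emphasize high-SNR parts
    y = x.copy()
    y[:n * win] = (seg * w[:, None]).reshape(-1)
    return y
```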

3:30, SPEECH-P15.10
MVDR BASED FEATURE EXTRACTION FOR ROBUST SPEECH RECOGNITION
S. DHARANIPRAGADA, B. RAO
This paper describes a robust feature extraction method for continuous speech recognition. Central to the method are the Minimum Variance Distortionless Response (MVDR) method of spectrum estimation and a feature trajectory smoothing technique for reducing the variance of the feature vectors. When evaluated on continuous speech recognition tasks in stationary and moving cars, the method gave an average relative improvement in word error rate (WER) of more than 30%.
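
The MVDR (Capon) spectrum of a frame can be sketched directly from its definition, P(w) = 1 / (v(w)^H R^{-1} v(w)), with R the order-(M+1) Toeplitz autocorrelation matrix and v(w) the complex sinusoid steering vector; scaling conventions vary across the literature. The model order here is an arbitrary choice, and the paper's feature trajectory smoothing is not reproduced.

```python
# Sketch of MVDR spectrum estimation for a single frame.
import numpy as np
from scipy.linalg import toeplitz

def mvdr_spectrum(frame, order=20, n_freqs=257):
    # biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)]) / len(frame)
    Rinv = np.linalg.inv(toeplitz(r))
    w = np.linspace(0.0, np.pi, n_freqs)
    V = np.exp(-1j * np.outer(np.arange(order + 1), w))  # steering vectors
    denom = np.einsum('kf,kl,lf->f', V.conj(), Rinv, V).real
    return 1.0 / denom
```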