Chair: Yariv Ephraim, George Mason University (USA)
P.D. Green, University of Sheffield (UK)
M.P. Cooke, University of Sheffield (UK)
M.D. Crawford, University of Sheffield (UK)
We describe a novel paradigm for automatic speech recognition in noisy environments in which an initial stage of auditory scene analysis separates out the evidence for the speech to be recognised from the evidence for other sounds. In general, this evidence will be incomplete, since intruding sound sources will dominate some spectro-temporal regions. We generalise continuous-density hidden Markov model recognition to this `occluded speech' case. The technique is based on estimating the probability that a Gaussian mixture density for an auditory firing-rate map will generate an observation such that the separated components take their observed values and the remaining components are no greater than their values in the acoustic mixture. Experiments on isolated digit recognition in noise demonstrate the potential of the new approach to yield performance comparable to that of human listeners.
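The bounded-marginalisation idea described in this abstract can be sketched as follows, assuming a diagonal-covariance Gaussian mixture per state: components attributed to the speech contribute an ordinary Gaussian density, while masked components only require the (hidden) speech value to lie below the observed mixture value, giving a Gaussian CDF term. The function name and the diagonal-covariance assumption are ours, not the authors'.

```python
import numpy as np
from scipy.stats import norm

def occluded_log_likelihood(x, present, weights, means, stds):
    """Log-likelihood of a spectral frame under a diagonal-covariance GMM
    when only some components are reliably attributed to the speech.

    x        : observed firing-rate / spectral values (upper bounds where masked)
    present  : boolean mask, True where the value is attributed to the speech
    weights, means, stds : GMM parameters, shapes (M,), (M, D), (M, D)
    """
    log_mix = []
    for w, mu, sd in zip(weights, means, stds):
        # Present components contribute an ordinary Gaussian density ...
        lp = norm.logpdf(x[present], mu[present], sd[present]).sum()
        # ... masked components only require the speech to lie at or below
        # the observed mixture value, i.e. a Gaussian CDF term.
        lp += norm.logcdf(x[~present], mu[~present], sd[~present]).sum()
        log_mix.append(np.log(w) + lp)
    # Log-sum-exp over mixture components for numerical stability.
    m = max(log_mix)
    return m + np.log(np.sum(np.exp(np.array(log_mix) - m)))
```

With everything marked present this reduces to the usual full-observation GMM likelihood, which is a useful sanity check.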
Hynek Hermansky, Oregon Graduate Institute of Science & Technology (USA)
Eric A. Wan, Oregon Graduate Institute of Science & Technology (USA)
Carlos Avendano, Oregon Graduate Institute of Science & Technology (USA)
Finite Impulse Response (FIR) Wiener-like filters are applied to the time trajectories of the cube-root-compressed short-term power spectrum of noisy speech recorded over cellular telephone channels. Informal listening tests indicate that the technique brings a noticeable improvement to the quality of processed noisy speech without causing any significant degradation to clean speech. Alternative filter structures are being investigated, as well as other potential applications in cellular channel compensation and narrowband-to-wideband speech mapping.
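The processing pipeline described here, filtering each frequency channel's trajectory over time in the compressed-power domain, can be sketched as below. The actual Wiener-like filter taps would be designed from noisy/clean training data; the function name and the clipping choice are our assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def filter_trajectories(power_spec, fir_taps):
    """Filter the time trajectory of each frequency channel.

    power_spec : (frames, channels) short-term power spectrum
    fir_taps   : FIR coefficients (e.g. a Wiener-like noise-reduction filter)
    """
    compressed = np.cbrt(power_spec)          # cube-root amplitude compression
    # Each column (frequency channel) is filtered independently along time.
    filtered = lfilter(fir_taps, [1.0], compressed, axis=0)
    return np.maximum(filtered, 0.0) ** 3     # undo compression, clip negatives
```

With a unit-impulse filter the spectrum passes through unchanged, which makes the domain conversions easy to verify.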
Sumeet Sandhu, AT&T Bell Laboratories (USA)
Oded Ghitza, AT&T Bell Laboratories (USA)
The performance of current large-vocabulary automatic speech recognition (ASR) systems deteriorates severely in mismatched training and testing conditions. Signal processing techniques based on the human auditory system have been proposed to improve ASR performance, especially under adverse acoustic conditions. This paper compares one such scheme, the Ensemble Interval Histogram (EIH), with conventional mel cepstral analysis (MEL). These two spectral feature extraction methods were implemented as front ends to a state-of-the-art continuous speech recognizer and evaluated on the TIMIT database (male). To characterize the influence of signal distortion on the representation of different sounds, phone classification experiments were conducted for three acoustic conditions: clean speech, speech through a telephone channel, and speech under room reverberation (the last two simulated). Classification was performed for static features alone and for static and dynamic features together, to observe the relative contribution of time derivatives. Performance is reported as the percentage of phones correctly classified. Confusion matrices were also derived from phone classification to provide diagnostic information.
Khaled T. Assaleh, Motorola GSTG (USA)
A new set of LP-derived features is introduced. The concept behind these features is motivated by the power-sum formulation of the LP cepstrum. Because the poles of the LP model are either real or occur in complex conjugate pairs, the power sum of the poles is equal to the power sum of their real components. The LP cepstrum is therefore associated with the power sum of the real components of the LP poles. This fact is utilized in deriving a new set of features associated with the imaginary components of the LP poles. We refer to this new set of features as the sepstral coefficients. We have found that the sepstral and cepstral coefficients are relatively uncorrelated. Hence, they can be used jointly to improve performance in pattern classification applications where cepstral features are usually used. In this paper we present some preliminary results on speaker identification experiments.
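The power-sum formulation referred to above is the identity c_n = (1/n) Σ_i p_i^n for the minimum-phase LP cepstrum, which is real because conjugate pole pairs cancel in the imaginary part. A minimal sketch follows; the real-part sums recover the LP cepstrum exactly, while the sepstral definition shown (sums of absolute imaginary parts) is only an illustrative guess at the paper's feature, not its exact formula.

```python
import numpy as np

def lp_pole_features(lp_coeffs, n_feats):
    """Power-sum features from LP poles.

    lp_coeffs : [a_1, ..., a_p] from A(z) = 1 + a_1 z^-1 + ... + a_p z^-p
    Returns (cepstral, sepstral-like) feature vectors of length n_feats.
    """
    poles = np.roots(np.concatenate(([1.0], lp_coeffs)))
    cep, sep = [], []
    for n in range(1, n_feats + 1):
        pw = poles ** n
        cep.append(pw.real.sum() / n)          # equals the LP cepstrum c_n
        sep.append(np.abs(pw.imag).sum() / n)  # imaginary-part counterpart (our guess)
    return np.array(cep), np.array(sep)
```

As a check, Newton's identities give c_1 = -a_1 and c_2 = -a_2 + a_1^2/2 for the LP cepstrum, which the real-part sums reproduce.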
Engin Erzin, Bilkent University
A. Enis Cetin, Koç University
Yasemin Yardimci, Boğaziçi University (TURKEY)
In this paper, a new set of speech feature representations for robust speech recognition in the presence of car noise is proposed. These parameters are based on subband analysis of the speech signal. Line Spectral Frequency (LSF) representations of the Linear Prediction (LP) analysis in subbands and cepstral coefficients derived from subband analysis (SUBCEP) are introduced, and the performance of the new feature representations is compared to mel-scale cepstral coefficients (MELCEP) in the presence of car noise. The subband-analysis-based parameters are observed to be more robust than the commonly employed MELCEP representations.
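A subband-derived cepstrum of the kind this abstract names (SUBCEP) can be sketched as log subband energies followed by a DCT, by analogy with the mel cepstrum. The filterbank design, band edges, and function name below are our assumptions, not the paper's specification.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import butter, sosfilt

def subband_cepstrum(frame, fs, edges, n_cep):
    """Cepstral coefficients from subband energies (SUBCEP-style sketch).

    edges : subband boundary frequencies in Hz, e.g. [100, 500, 1000, 2000]
    """
    log_energy = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bandpass each subband, then take its log energy.
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, frame)
        log_energy.append(np.log(np.mean(band ** 2) + 1e-12))
    # DCT of log subband energies, analogous to the mel cepstrum computation.
    return dct(np.array(log_energy), norm="ortho")[:n_cep]
```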
Shoji Kajita, Nagoya University (JAPAN)
Fumitada Itakura, Nagoya University (JAPAN)
This paper describes the extent to which subband-autocorrelation (SBCOR) analysis is robust against waveform distortion and noise. SBCOR analysis, proposed previously, is a signal processing technique based on subband processing and autocorrelation analysis that extracts the periodicities present in speech signals. First, it is shown that SBCOR is robust against severe waveform distortions such as zero-crossing distortion. Although zero-crossing distortion deteriorates the performance of conventional recognition systems, such distorted signals remain intelligible to humans. Experimental results using DTW word recognition show that SBCOR (Q=1.0) achieves about 19% higher accuracy than the smoothed group delay spectrum (SGDS) when the test signals are distorted by zero-crossing. Second, it is shown that SBCOR is more robust than SGDS against multiplicative signal-dependent white noise, Gaussian white noise, and human speech noise. The advantage of SBCOR is greater when the noise is white than when it is human speech.
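One common reading of subband autocorrelation is sketched below: each constant-Q subband is bandpass-filtered and the normalized autocorrelation is evaluated at the lag of that band's center period, so strong periodicity in a band yields a value near one. The filter design and the single-lag choice are our assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def sbcor(frame, fs, center_freqs, q=1.0):
    """Subband-autocorrelation (SBCOR) sketch: for each subband, the
    normalized autocorrelation at the lag of the band's center period.
    """
    feats = []
    for fc in center_freqs:
        bw = fc / q                              # constant-Q bandwidth
        lo = max(fc - bw / 2, 1.0)
        hi = min(fc + bw / 2, fs / 2 - 1.0)
        sos = butter(2, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, frame)
        lag = int(round(fs / fc))                # lag of the center period
        r0 = np.dot(band, band) + 1e-12
        rt = np.dot(band[:-lag], band[lag:])
        feats.append(rt / r0)                    # normalized autocorrelation
    return np.array(feats)
```

Because the feature depends on periodicity rather than waveform shape, it is largely preserved under zero-crossing distortion, which is the robustness property the abstract reports.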
Alfred Hauenstein, Siemens AG (GERMANY)
Erwin Marschall, Siemens AG (GERMANY)
Robust modelling and fast adaptation to changes in transmission channels have yielded significant improvements in speech recognition over telephone lines. Robust modelling is achieved by using a special version of the LDA transformation that includes a two-frame context and subtraction of the mean channel seen in training. A fast maximum-likelihood channel adaptation copes with variations in the characteristics of the transmission channel and speaker during real-world operation. Evaluation of these techniques on different databases demonstrates reductions in word error rate of up to 70%, suggesting that significant improvements in recognition performance may be achieved by better acoustic-phonetic modelling and fast adaptation.
Asuncion Moreno, Universitat Politecnica de Catalunya (SPAIN)
Sergio Tortola, Universitat Politecnica de Catalunya (SPAIN)
Josep Vidal, Universitat Politecnica de Catalunya (SPAIN)
Jose A. R. Fonollosa, Universitat Politecnica de Catalunya (SPAIN)
In this paper the problem of recognition in noisy environments is addressed. Often a recognition system must be used in a noisy environment with no possibility of training it on noisy samples. Classical speech analysis techniques are based on second-order statistics, and their performance decreases dramatically when noise is present in the signal under analysis. In this paper new methods based on Higher-Order Statistics (HOS) are applied in a recognition system and compared against the autocorrelation method. Cumulant-based methods show better performance than autocorrelation-based methods at low SNR.
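The second-order baseline the paper compares against is the autocorrelation (Yule-Walker) method of LP analysis, sketched below. Additive noise inflates r(0) by the noise variance and biases the estimates; the cumulant-based alternatives replace the correlation sequence with higher-order cumulant slices, which vanish for Gaussian noise. The sketch shows only the baseline, not the paper's HOS estimators.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def yule_walker_ar(x, order):
    """Autocorrelation-method AR estimate.

    Returns [a_1, ..., a_p] of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    Additive noise biases r(0) upward by the noise variance, degrading
    this estimate; higher-order cumulants of Gaussian noise are zero,
    which motivates the cumulant-based variants.
    """
    # Biased autocorrelation estimates r(0..p).
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    r /= len(x)
    # Solve the Yule-Walker normal equations  T a = -[r(1), ..., r(p)].
    return solve_toeplitz(r[:order], -r[1:order + 1])
```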
Ruikang Yang, Nokia Research Center (FINLAND)
Petri Haavisto, Nokia Research Center (FINLAND)
Achieving reliable performance in noisy environments is an important issue for speech recognition systems. One application is the hands-free mobile phone in a car, where the user can access telephone functions through voice. In this paper, a noise compensation algorithm for HMM-based recognition systems is presented. The algorithm was tested using the TIDIGITS database with additive car noise, and very promising results were obtained: at -10 dB SNR, recognition accuracy improved from 34% to 89%. The noise compensation algorithm was also applied to a car speech database recorded in a parking lot, downtown, and on a highway, where improved performance was again obtained.
S.V. Vaseghi, University of East Anglia (UK)
B.P. Milner, University of East Anglia (UK)
This paper presents experimental results on the use of noise compensation schemes with hidden Markov model (HMM) speech recognition systems operating in the presence of impulsive noise. A measure of signal-to-impulsive-noise ratio is introduced, and the effects on speech recognition of varying the percentage of impulsive noise contamination and the power of the impulsive noise are investigated. For modelling an impulsive noise process, an amplitude-modulated binary sequence model and a binary-state HMM are considered. For impulsive noise compensation, a front-end method and a noise-adaptive method are evaluated. Experiments demonstrate that the noise compensation methods achieve a substantial improvement in speech recognition accuracy across a wide range of signal-to-impulsive-noise ratios.
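The amplitude-modulated binary sequence model named above can be sketched directly: the noise is a Bernoulli on/off sequence multiplied by a continuous amplitude process. The Gaussian amplitude density and the whole-record SINR definition below are illustrative choices of ours; the abstract does not fix either.

```python
import numpy as np

def impulsive_noise(n_samples, impulse_prob, amplitude_std, rng=None):
    """Amplitude-modulated binary-sequence model of impulsive noise:
    n(m) = b(m) * a(m), where b(m) is a Bernoulli switching sequence
    and a(m) is a continuous amplitude process (Gaussian here)."""
    rng = np.random.default_rng() if rng is None else rng
    b = rng.random(n_samples) < impulse_prob          # binary on/off sequence
    a = rng.normal(0.0, amplitude_std, n_samples)     # amplitude modulation
    return b * a

def signal_to_impulsive_noise_ratio(signal, noise):
    """SINR in dB over the whole record (one plausible definition)."""
    return 10.0 * np.log10(np.sum(signal ** 2) / (np.sum(noise ** 2) + 1e-12))
```

Varying `impulse_prob` controls the percentage of contamination and `amplitude_std` the impulse power, the two quantities the experiments sweep.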