Abstract: Session SP-25
SP-25.1
SUBSPACE STATE SPACE MODEL IDENTIFICATION FOR SPEECH ENHANCEMENT
Eric J Grivel,
Marcel G Gabrea,
Mohamed Najim (Equipe Signal et Image, B.P. 99, F-33 402 Talence Cedex, France)
This paper deals with Kalman filter-based enhancement of a speech signal contaminated by white noise, using a single-microphone system. The problem can be stated as a realization issue within the framework of identification. For this purpose we propose to identify the state space model using non-iterative subspace algorithms based on orthogonal projections. Unlike Estimate-Maximize (EM)-based algorithms, this approach provides, in a single iteration from the noisy observations, the state space model matrices and the covariance matrices that are necessary to perform Kalman filtering. In addition, unlike existing methods, no voice activity detector is required. Both methods proposed here are compared with classical approaches.
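To make the filtering stage concrete, here is a minimal sketch assuming the companion-form state-space model and noise variances are already known; the paper's actual contribution, identifying these quantities from the noisy observations alone via orthogonal projections, is not reproduced, and all names below are ours.

```python
# Kalman enhancement of a noisy AR(p) signal with a known companion-form
# state-space model. The paper's subspace identification step is assumed
# to have already produced the AR coefficients and noise variances.
import numpy as np

def kalman_enhance(y, a, q, r):
    """y: noisy samples (float array); a: AR coefficients [a1..ap] of the
    clean speech; q: driving-noise variance; r: observation-noise variance."""
    a = np.asarray(a, float)
    p = len(a)
    A = np.zeros((p, p))             # companion (transition) matrix
    A[0, :] = a
    A[1:, :-1] = np.eye(p - 1)
    C = np.zeros(p); C[0] = 1.0      # observe the newest sample
    Q = np.zeros((p, p)); Q[0, 0] = q
    x, P = np.zeros(p), np.eye(p)    # state estimate and covariance
    out = np.empty(len(y))
    for t, yt in enumerate(y):
        x = A @ x                    # time update (predict)
        P = A @ P @ A.T + Q
        s = C @ P @ C + r            # innovation variance
        k = P @ C / s                # Kalman gain
        x = x + k * (yt - C @ x)     # measurement update
        P = P - np.outer(k, C @ P)
        out[t] = x[0]                # enhanced sample
    return out
```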
SP-25.2
Using AR HMM State-Dependent Filtering for Speech Enhancement
Driss Matrouf (LIMSI-CNRS (France)),
Jean-Luc Gauvain (LIMSI-CNRS (France))
In this paper we address the problem of enhancing speech which has
been degraded by additive noise. As proposed by Ephraim et al.,
autoregressive hidden Markov models (AR-HMM) for the clean speech and
an autoregressive Gaussian for the noise are used. The filter applied
to a given frame of noisy speech is estimated using the noise model
and the autoregressive Gaussian having the highest a posteriori
probability given the decoded state sequence. The success of this
technique is highly dependent on accurate estimation of
the best state sequence. A new strategy combining the use of
cepstral-based HMMs, autoregressive HMMs, and a model combination
technique is proposed. The intelligibility of the enhanced speech is
indirectly assessed via speech recognition, by comparing performance
on noisy speech with compensated models to performance on the enhanced
speech with clean-speech models. The results on enhanced speech are as
good as our best results obtained with noise compensated models.
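A minimal sketch of the state-dependent filtering step, under simplifications of ours (this is not LIMSI's system): once a state has been decoded for a frame, a Wiener filter is formed from that state's AR spectrum and the noise AR spectrum and applied in the frequency domain. The AR coefficients and gains would come from trained models.

```python
# Build a per-frame Wiener filter from the decoded state's AR model and
# the noise AR model, then filter the frame. Parameters are placeholders.
import numpy as np

def ar_spectrum(a, g2, nfft):
    """Power spectrum g2 / |1 - sum_k a_k e^{-jwk}|^2 of an AR model."""
    w = np.fft.rfftfreq(nfft) * 2 * np.pi
    denom = np.abs(1 - sum(ak * np.exp(-1j * w * (k + 1))
                           for k, ak in enumerate(a))) ** 2
    return g2 / denom

def enhance_frame(frame, a_state, g2_state, a_noise, g2_noise):
    nfft = len(frame)
    S = ar_spectrum(a_state, g2_state, nfft)   # clean PSD, decoded state
    N = ar_spectrum(a_noise, g2_noise, nfft)   # noise PSD
    H = S / (S + N)                            # Wiener gain
    return np.fft.irfft(np.fft.rfft(frame) * H, nfft)
```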
SP-25.3
Tracking Speech-Presence Uncertainty to Improve Speech Enhancement In Non-Stationary Noise Environments
David Malah,
Richard V Cox,
Anthony J Accardi (AT&T Labs - Research, Florham Park, NJ 07932)
Speech enhancement algorithms which are based on estimating the
short-time spectral amplitude of the clean speech have better
performance when a soft-decision gain modification, depending on
the a priori probability of speech absence, is used. In previously
reported work a fixed probability, q, is assumed. Since speech is
non-stationary and may not be present in every frequency bin when
voiced, we propose a method for estimating distinct values of q
for different bins which are tracked in time. The estimation is
based on a decision-theoretic approach for setting a threshold
in each bin followed by short-time averaging. The estimated q's
are used to control both the gain and the update of the estimated
noise spectrum during speech presence in a modified MMSE
log-spectral amplitude estimator. Subjective tests resulted in
higher scores than for the IS-127 standard enhancement algorithm,
when pre-processing noisy speech for a coding application.
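A simplified sketch of the per-bin tracking idea, with the threshold and smoothing constants invented for illustration; a plain Wiener gain stands in for the paper's MMSE log-spectral amplitude estimator.

```python
# Track a per-bin speech-absence probability q by thresholding the
# a posteriori SNR and short-time averaging the binary decisions; q then
# softens the gain and gates the noise-PSD update. Constants are assumed.
import numpy as np

class PresenceTracker:
    def __init__(self, nbins, thresh=2.0, alpha_q=0.95, alpha_n=0.98):
        self.q = np.full(nbins, 0.5)         # per-bin P(speech absent)
        self.noise_psd = np.ones(nbins)
        self.thresh, self.aq, self.an = thresh, alpha_q, alpha_n

    def step(self, noisy_mag2):
        gamma = noisy_mag2 / self.noise_psd              # a posteriori SNR
        absent = (gamma < self.thresh).astype(float)     # per-bin decision
        self.q = self.aq * self.q + (1 - self.aq) * absent  # time average
        # update the noise estimate mainly where speech is judged absent
        upd = self.q * noisy_mag2 + (1 - self.q) * self.noise_psd
        self.noise_psd = self.an * self.noise_psd + (1 - self.an) * upd
        wiener = np.maximum(1 - 1 / np.maximum(gamma, 1e-6), 0.05)
        return (1 - self.q) * wiener + self.q * 0.05     # soft-decision gain
```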
SP-25.4
Adaptive Two-Band Spectral Subtraction with Multi-window Spectral Estimation
Chuang He (Los Alamos National Laboratory, T Division, MS B276, Los Alamos, NM 87545),
George Zweig (Signition, Inc., 901 18th Street, Los Alamos, NM 87544)
An improved spectral subtraction algorithm for enhancing speech corrupted by
additive wideband noise is described. The artifactual noise introduced by
spectral subtraction that is perceived as musical noise is 7 dB less than that
introduced by the classical spectral subtraction algorithm of Berouti et
al. Speech is decomposed into voiced and unvoiced sections. Since voiced speech
is primarily stochastic at high frequencies, the voiced speech is high-pass
filtered to extract its stochastic component. The cut-off frequency is
estimated adaptively. Multi-window spectral estimation is used to estimate the
spectra of the stochastic component of voiced speech and of unvoiced speech, thereby reducing the
spectral variance. A low-pass filter is used to extract the deterministic
component of voiced speech. Its spectrum is estimated with a single
window. Spectral subtraction is performed with the classical algorithm using
the estimated spectra. Informal listening tests confirm that the new algorithm
creates significantly less musical noise than the classical algorithm.
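The two main ingredients can be sketched as follows, with illustrative parameters (the voiced/unvoiced split and the adaptive cut-off estimation are not reproduced): a multi-window spectrum estimate over orthogonal sine tapers to reduce variance, feeding a Berouti-style over-subtraction rule with a spectral floor.

```python
# Multi-window (sine-taper) PSD estimate plus over-subtraction.
import numpy as np

def multiwindow_psd(x, K=6):
    """Average periodograms over K orthogonal sine tapers."""
    n = len(x)
    t = np.arange(1, n + 1)
    psds = []
    for k in range(1, K + 1):
        taper = np.sqrt(2.0 / (n + 1)) * np.sin(np.pi * k * t / (n + 1))
        psds.append(np.abs(np.fft.rfft(x * taper)) ** 2)
    return np.mean(psds, axis=0)           # lower-variance PSD estimate

def spectral_subtract(noisy, noise_psd, alpha=4.0, beta=0.01):
    """noise_psd: noise PSD estimate, length len(noisy)//2 + 1."""
    spec = np.fft.rfft(noisy)
    noisy_psd = multiwindow_psd(noisy)
    clean_psd = np.maximum(noisy_psd - alpha * noise_psd,  # over-subtract
                           beta * noise_psd)               # spectral floor
    gain = np.sqrt(clean_psd / np.maximum(noisy_psd, 1e-12))
    return np.fft.irfft(spec * gain, len(noisy))
```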
SP-25.5
Speech Enhancement Using Voice Source Models
Anisa Yasmin (Department of Electrical and Computer Engineering, University of Waterloo),
Paul W Fieguth (Dept. of Systems Design - University of Waterloo),
Li Deng (Dept. of Electrical and Computer Engineering, University of Waterloo)
Autoregressive (AR) models have been shown to be effective models of
the human vocal tract during voicing. However, the most common model
of speech for enhancement purposes, an AR process excited by white noise,
fails to capture the periodic nature of voiced speech.
Speech synthesis researchers have long recognized this problem and
have developed a variety of sophisticated excitation models; however,
these models have yet to make an impact in speech enhancement. We have
chosen one of the most common excitation models, the four-parameter
LF model of Fant, Liljencrants and Lin, and applied it to the
enhancement of individual voiced phonemes. Comparing the performance of
the conventional white-noise-driven AR, an impulse-driven AR, and an
AR based on the LF model shows that the LF model yields a substantial
improvement, on the order of 1.3 dB.
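For orientation, an LF-style glottal flow derivative pulse can be sketched as below. The true four-parameter LF model ties its constants together through implicit area-balance equations, which are not solved here: alpha and the return-phase constant are simply taken as given, so the pulse is only qualitatively LF-shaped, and every numeric value is invented.

```python
# One pitch period of an LF-like excitation: an exponentially growing
# sinusoid up to the instant of maximum closing (Te), then an
# exponential return phase back toward zero.
import numpy as np

def lf_pulse(fs=16000, f0=120.0, tp=0.45, te=0.6, ta=0.05,
             alpha=60.0, Ee=1.0):
    """tp, te, ta are fractions of the period; alpha shapes the open phase."""
    T0 = 1.0 / f0
    t = np.arange(int(fs * T0)) / fs
    Tp, Te, Ta = tp * T0, te * T0, ta * T0
    wg = np.pi / Tp                       # glottal flow peaks at t = Tp
    eps = 1.0 / Ta                        # return-phase time constant
    e = np.zeros_like(t)
    open_ph = t <= Te
    e[open_ph] = np.exp(alpha * t[open_ph]) * np.sin(wg * t[open_ph])
    e[open_ph] *= -Ee / e[open_ph][-1]    # force e(Te) = -Ee
    ret = ~open_ph
    e[ret] = -(Ee / (eps * Ta)) * (np.exp(-eps * (t[ret] - Te))
                                   - np.exp(-eps * (T0 - Te)))
    return e
```

A train of such pulses, filtered through an all-pole vocal-tract model (e.g. scipy.signal.lfilter([1.0], a, excitation)), supplies the periodic excitation that the white-noise-driven AR model lacks.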
SP-25.6
Adaptive Decorrelation Filtering for Separation of Co-Channel Speech Signals from M > 2 Sources
Kuan-Chieh Yen (University of Illinois at Urbana-Champaign (USA)),
Yunxin Zhao (University of Missouri - Columbia (USA))
The ADF algorithm for separating two signal sources by Weinstein,
Feder, and Oppenheim is generalized for separation of co-channel
speech signals from more than two sources. The system configuration,
its accompanying ADF algorithm, and the choice of adaptation gain are
derived. The applicability and limitations of the derived algorithm
are also discussed. Experiments were conducted for separation of
three speech sources with the acoustic paths measured from an office
environment, and the algorithm was shown to improve the average
target-to-interference ratio for the three sources by approximately
15 dB.
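The two-source baseline being generalized can be sketched as a pair of cross-coupled adaptive FIR filters whose coefficients are adapted to drive the cross-correlation of the two outputs toward zero; the filter length and adaptation gain below are illustrative, not the paper's derived choices.

```python
# Two-source adaptive decorrelation filtering: each output subtracts an
# adaptive FIR estimate of the other source's leakage.
import numpy as np

def adf_two_sources(x1, x2, L=64, mu=1e-4):
    """x1, x2: co-channel mixtures (float arrays of equal length)."""
    a = np.zeros(L)                   # estimate of cross path 2 -> 1
    b = np.zeros(L)                   # estimate of cross path 1 -> 2
    y1, y2 = np.zeros_like(x1), np.zeros_like(x2)
    for n in range(L, len(x1)):
        seg2 = y2[n - L:n][::-1]      # y2[n-1], ..., y2[n-L]
        seg1 = y1[n - L:n][::-1]
        y1[n] = x1[n] - a @ seg2      # remove source-2 leakage
        y2[n] = x2[n] - b @ seg1      # remove source-1 leakage
        a += mu * y1[n] * seg2        # push E[y1[n] y2[n-k]] toward zero
        b += mu * y2[n] * seg1
    return y1, y2
```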
SP-25.7
Audio Signal Noise Reduction Using Multi-resolution Sinusoidal Modeling
David V Anderson,
Mark A Clements (Georgia Institute of Technology)
The sinusoidal transform (ST)
provides a sparse representation for speech signals by utilizing
several psychoacoustic phenomena. It is well suited to applications
in signal enhancement because the signal is represented in a
parametric manner that is easy to manipulate. The multi-resolution
sinusoidal transform (MRST) has the
additional advantage that it is both particularly well suited to
typical speech signals and well matched to the human auditory
system. The work reported here discusses the removal of noise from a
noisy signal by applying an adaptive Wiener filter to the MRST
parameters and then conditioning the parameters to eliminate "musical
noise." In informal tests, MRST-based noise reduction was found to
reduce background noise significantly better than traditional Wiener
filtering and to virtually eliminate the "musical noise" often
associated with Wiener filtering.
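A simplified sketch of the parameter-domain idea, not the MRST implementation: apply a Wiener-style gain to each sinusoidal amplitude given a noise estimate near its frequency, then cull components whose gain stays near zero, so that no isolated low-level sinusoids survive to ring as musical noise. The noise_amp callable is an assumed helper.

```python
# Wiener-weight sinusoidal amplitudes, then drop weak components.
import numpy as np

def denoise_sinusoids(amps, freqs, noise_amp, floor=0.1):
    """amps, freqs: arrays for one frame; noise_amp(f): estimated noise
    amplitude near frequency f (assumed available)."""
    n = np.array([noise_amp(f) for f in freqs])
    snr = np.maximum(amps**2 - n**2, 0.0) / np.maximum(n**2, 1e-12)
    gain = snr / (1.0 + snr)               # Wiener gain per component
    keep = gain > floor                    # conditioning: cull weak outliers
    return amps[keep] * gain[keep], freqs[keep]
```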
SP-25.8
Utilizing Interband Acoustical Information for Modeling Stationary Time-Frequency Regions of Noisy Speech
Chang D Yoo (Korea Telecom)
A novel enhancement system is developed that exploits the properties of stationary regions
localized in both time and frequency. This system selects stationary time-frequency regions
and adaptively enhances each region according to its local signal-to-noise
ratio while utilizing both the acoustical knowledge of speech and the masking
properties of the human auditory system. Each region is enhanced for maximum
noise reduction while minimizing distortion. This paper evaluates the proposed
system through informal listening tests and some objective measures.
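As a crude sketch of region-wise processing (the paper's stationarity-based region selection and auditory-masking model are not reproduced), one can tile the STFT, estimate each tile's local SNR against a noise estimate, and attenuate low-SNR tiles more strongly:

```python
# Tile the spectrogram and apply an SNR-dependent gain per tile.
import numpy as np

def enhance_tf_regions(spec, noise_psd, t_tile=8, f_tile=16, floor=0.1):
    """spec: complex STFT (frames x bins); noise_psd: per-bin estimate."""
    out = spec.copy()
    T, F = spec.shape
    for t0 in range(0, T, t_tile):
        for f0 in range(0, F, f_tile):
            tile = spec[t0:t0 + t_tile, f0:f0 + f_tile]
            n = noise_psd[f0:f0 + f_tile].mean()
            snr = max(np.mean(np.abs(tile)**2) / n - 1.0, 0.0)  # local SNR
            gain = max(snr / (1.0 + snr), floor)
            out[t0:t0 + t_tile, f0:f0 + f_tile] = tile * gain
    return out
```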