Session: SPEECH-L11
Time: 3:30 - 5:30, Friday, May 11, 2001
Location: Room 150
Title: Speech Recognition and Enhancement Using Microphone Arrays
Chair: Kiyohiro Shikano

3:30, SPEECH-L11.1
MICROPHONE ARRAY SUB-BAND SPEECH RECOGNITION
I. MCCOWAN, S. SRIDHARAN
This paper proposes the integration of sub-band speech recognition with a microphone array. A broadband beamforming microphone array allows for natural integration with sub-band speech recognition, as the beamformer is typically implemented as a combination of band-limited sub-arrays. Rather than recombining the sub-array outputs to give a single enhanced output, we propose the fusion of separate hidden Markov models trained on each sub-array frequency band. In addition, a dynamic sub-band weighting scheme is proposed in which the cross- and auto-spectral densities of the microphone array inputs are used to estimate the reliability of each frequency band. The microphone array sub-band system is evaluated on an isolated digit recognition task and compared to the standard full-band approach. The results of the proposed dynamic weighting scheme are compared to those obtained using both fixed equal sub-band weights and optimal sub-band weights calculated from a priori knowledge of the correct results.
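
One generic way to turn cross- and auto-spectral densities into per-band reliability weights is the magnitude-squared coherence between two array channels: a band dominated by a spatially coherent (desired) source scores high, while diffuse noise scores low. The sketch below illustrates this general idea only; it is not the authors' implementation, and the function name and band layout are illustrative.

```python
import numpy as np
from scipy.signal import csd, welch

def subband_reliability_weights(x1, x2, fs, bands, nperseg=256):
    # Magnitude-squared coherence between two channels, averaged
    # within each sub-band, as a proxy for band reliability.
    f, Pxy = csd(x1, x2, fs=fs, nperseg=nperseg)
    _, Pxx = welch(x1, fs=fs, nperseg=nperseg)
    _, Pyy = welch(x2, fs=fs, nperseg=nperseg)
    coh = np.abs(Pxy) ** 2 / (Pxx * Pyy + 1e-12)
    w = np.array([coh[(f >= lo) & (f < hi)].mean() for lo, hi in bands])
    return w / w.sum()  # normalized stream weights for model fusion
```

Such weights could then scale the per-band HMM log-likelihoods before fusion.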

3:50, SPEECH-L11.2
SPEECH ENHANCEMENT BY MULTIPLE BEAMFORMING WITH REFLECTION SIGNAL EQUALIZATION
T. NISHIURA, S. NAKAMURA, K. SHIKANO
In real environments, the presence of room reverberation seriously degrades the quality of captured sound. To solve this problem, multiple beamforming, which forms directivity not only in the direction of the desired sound source but also in the directions of its reflection images, was proposed by J. Flanagan et al. However, it is difficult to apply this method practically in real environments, since it requires that the distortion of the reflected signals by wall impedances be equalized. This paper proposes a new multiple beamforming algorithm that equalizes the amplitude and phase spectra of the reflected signals by a cross-spectrum method. Evaluation experiments are conducted in a real environment. In a SDR (Signal to Distortion Ratio) evaluation, the proposed multiple beamformer reduces signal distortion more effectively than both the conventional single beamformer and the conventional multiple beamformer without equalization. In addition, in an ASR (Automatic Speech Recognition) evaluation, the equalized multiple beamformer achieves higher recognition performance than either of the above conventional beamformers.
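
The cross-spectrum step can be pictured with the standard H1 transfer-function estimator: the ratio of the cross-spectral density between a direct-path reference and a reflected signal to the reference auto-spectrum yields the amplitude and phase distortion of the reflection path. This is a minimal sketch of that textbook estimator, not the paper's algorithm; the function name is illustrative.

```python
import numpy as np
from scipy.signal import csd, welch

def estimate_reflection_response(direct, reflected, fs, nperseg=256):
    # H1 estimator: H(f) = S_dr(f) / S_dd(f), capturing the amplitude
    # and phase change imposed on the reflection path (e.g. by wall
    # impedance) relative to the direct-path signal.
    f, Sdr = csd(direct, reflected, fs=fs, nperseg=nperseg)
    _, Sdd = welch(direct, fs=fs, nperseg=nperseg)
    return f, Sdr / (Sdd + 1e-12)
```

Dividing the reflected-path beam output by such an estimate of H(f) would equalize it toward the direct-path signal before the beams are combined.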

4:10, SPEECH-L11.3
A MICROPHONE ARRAY-BASED 3-D N-BEST SEARCH ALGORITHM FOR THE SIMULTANEOUS RECOGNITION OF MULTIPLE SOUND SOURCES IN REAL ENVIRONMENTS
P. HERACLEOUS, S. NAKAMURA, K. SHIKANO
This paper deals with the recognition of distant talking speech and, particularly, with the simultaneous recognition of multiple sound sources. A problem that must be solved in the recognition of distant talking speech is talker localization. In some approaches, the talker is localized by using short- and long-term power. The 3-D Viterbi search based method proposed by Yamada et al. integrates talker localization and speech recognition. This method provides high recognition rates, but its application is restricted to a single talker. In order to deal with multiple talkers, we extended the 3-D Viterbi search method to a 3-D N-best search method enabling the recognition of multiple sound sources. This paper describes our baseline 3-D N-best search-based system and two additional techniques, namely, a likelihood normalization technique and a path distance-based clustering technique. The paper also describes experiments carried out in order to evaluate the performance of the system.
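
The move from Viterbi to N-best search can be illustrated on an ordinary HMM: instead of keeping only the single best partial path per state, the decoder keeps the N highest-scoring ones, so several competing hypotheses (here, several talkers) survive to the end. The sketch below shows this generic N-best extension only; it omits the paper's third (direction) axis and its normalization and clustering techniques, and all names are illustrative.

```python
import numpy as np
from heapq import nlargest

def nbest_viterbi(logA, logB, log_pi, N=3):
    # N-best Viterbi: per state, keep the N highest-scoring partial
    # paths instead of only the best one.  logA: (S, S) transition
    # log-probs, logB: (S, T) observation log-likelihoods.
    S, T = logB.shape
    hyps = [[(log_pi[s] + logB[s, 0], (s,))] for s in range(S)]
    for t in range(1, T):
        new = []
        for s in range(S):
            cands = [(sc + logA[p, s] + logB[s, t], path + (s,))
                     for p in range(S) for sc, path in hyps[p]]
            new.append(nlargest(N, cands))
        hyps = new
    # Merge per-state lists into the global N-best, best first.
    return nlargest(N, [h for per_state in hyps for h in per_state])
```

Keeping N hypotheses per state is sufficient to recover the exact global N-best paths, since extending a path never reorders hypotheses ending in the same state.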

4:30, SPEECH-L11.4
MULTICHANNEL FILTERING FOR OPTIMUM NOISE REDUCTION IN MICROPHONE ARRAYS
D. FLORENCIO, H. MALVAR
This paper introduces a new optimization criterion for the design of microphone arrays, and derives an optimum filter based on this criterion. The algorithm computes two separate correlation matrices for the signal: one for when only background noise is present, and one for when both noise and signal are present. A filter is then computed based on these matrices, optimizing the proposed weighted mean-square error criterion. A block-recursive version of the algorithm is presented, using LMS-like adaptation of the multichannel filters, with a computational complexity under 40 MIPS for a typical application with four microphones. Simulation results with typical office noise show improvements of up to 20 dB in signal-to-noise ratio, even in low-noise environments.
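
The two-matrix construction can be sketched with the classic (unweighted) multichannel Wiener solution, used here as a stand-in for the paper's weighted mean-square error criterion: with R_x measured while speech and noise are both present and R_n during noise-only frames, the signal correlation is estimated as R_s = R_x - R_n and the filter follows directly. This is a generic illustration, not the authors' algorithm or its block-recursive LMS form.

```python
import numpy as np

def wiener_filter_from_correlations(R_x, R_n):
    # Multichannel Wiener solution: with R_s = R_x - R_n the estimated
    # signal correlation, W = R_s @ inv(R_x) minimizes E||s - W x||^2
    # when signal and noise are uncorrelated.
    R_s = R_x - R_n
    return np.linalg.solve(R_x.T, R_s.T).T  # W = R_s @ inv(R_x)
```

In a practical system both matrices would be updated recursively, with a voice-activity decision selecting which matrix each frame contributes to.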

4:50, SPEECH-L11.5
MICROPHONE ARRAY SPEECH DEREVERBERATION USING COARSE CHANNEL MODELING
S. GRIEBEL, M. BRANDSTEIN
This paper presents a model-based method for the enhancement of multi-channel speech acquired under reverberant conditions. A very coarse estimate of the channel responses associated with each source-microphone pair is derived directly from the received data on a short-term basis. These estimates are employed to modify the LPC residuals of the channel data in an effort to deemphasize the effects of reverberant energy in the resulting synthesized signal. The approach is robust to conditions of partial and approximate channel information. Specifically, the incorporated channel model requires only approximate times and amplitudes of the initial multi-path reflections. In practice these impulses are responsible for the bulk of reverberant energy in the received speech signal and can be estimated to a sufficient degree on a time-varying basis.
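
The LPC analysis/synthesis machinery this method operates on is standard: inverse-filter each channel to obtain its excitation residual, modify the residual (here, to de-emphasize reverberant impulses), and resynthesize through the all-pole filter. The sketch below shows only that standard analysis/synthesis step; the residual modification itself, which is the paper's contribution, is omitted, and the function names are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_residual(x, order=12):
    # Autocorrelation-method LPC: solve the normal equations for the
    # predictor coefficients, then inverse-filter for the residual.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    A = np.concatenate(([1.0], -a))          # analysis (inverse) filter
    return A, lfilter(A, [1.0], x)

def resynthesize(A, e):
    # All-pole synthesis from a (possibly modified) residual.
    return lfilter([1.0], A, e)
```

With an unmodified residual the synthesis exactly inverts the analysis; the dereverberation effect comes entirely from how the residual is altered in between.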

5:10, SPEECH-L11.6
A MULTI-MICROPHONE SIGNAL SUBSPACE APPROACH FOR SPEECH ENHANCEMENT
F. JABLOUN, B. CHAMPAGNE
In this paper, we extend the single-microphone signal subspace approach for speech enhancement to a multi-microphone design. In the single-microphone case, the trade-off between speech quality and intelligibility is a handicap which limits performance. This is because the approach is based on a linear speech model which does not usually offer enough degrees of freedom for noise reduction. In our method, we show that additional degrees of freedom can be obtained easily, and at comparable computational complexity, by using signals from more than one microphone. Experimental results show that this leads to improved noise reduction performance.
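
The underlying single-channel signal subspace idea can be sketched in a few lines: eigendecompose the covariance of the noisy frames, subtract the noise floor from the eigenvalues, and apply Wiener-type gains in the eigen-domain so that directions carrying little signal energy are suppressed. This is a minimal single-microphone sketch assuming white noise of known variance; it is not the authors' multi-microphone extension, and the function name is illustrative.

```python
import numpy as np

def subspace_enhance(frames, noise_var):
    # frames: (n_frames, dim) noisy signal vectors.  KLT-domain
    # enhancement: clean speech is assumed confined to the directions
    # whose eigenvalues exceed the white-noise floor.
    R = frames.T @ frames / len(frames)           # empirical covariance
    lam, V = np.linalg.eigh(R)
    lam_s = np.maximum(lam - noise_var, 0.0)      # clean-signal eigenvalues
    gains = lam_s / (lam_s + noise_var)           # eigen-domain Wiener gains
    return ((frames @ V) * gains) @ V.T
```

The linear model limitation the abstract refers to is visible here: the number of usable eigen-directions is capped by the frame dimension, which is what adding microphones relaxes.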