Session: SPEECH-P7
Time: 9:30 - 11:30, Thursday, May 10, 2001
Location: Exhibit Hall Area 7
Title: Audio-Visual Integration for Speech Recognition and Analysis
Chair: Chalapathy Neti

9:30, SPEECH-P7.1
VISUAL SPEECH SYNTHESIS USING QUADTREE SPLINES
X. CHEN, J. YANG
In this paper, we present a method for synthesizing photo-realistic visual speech using a parametric model based on quadtree splines. In an image-based visual speech synthesis system, visemes are used to generate an arbitrary new image sequence. The images between visemes are usually synthesized using a certain mapping, which can be characterized by motion parameters estimated from the training data. With quadtree splines, we can minimize the number of motion parameters for a given synthesis error. The feasibility of the proposed method has been demonstrated by experiments.
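
A minimal sketch of the quadtree idea behind the parameter reduction, assuming a dense 2D motion field is already available; the piecewise-constant motion per block, the error threshold, and all names below are illustrative simplifications of the spline-based model:

import numpy as np

def quadtree_motion(flow, y0, x0, h, w, tol, min_size=4):
    """Recursively split a motion-field block until a single (constant)
    motion vector per block approximates it within tol.
    Returns a list of (y0, x0, h, w, mean_motion) leaves."""
    block = flow[y0:y0 + h, x0:x0 + w]          # (h, w, 2) motion vectors
    mean = block.reshape(-1, 2).mean(axis=0)    # one motion-parameter pair
    err = np.abs(block - mean).max()            # worst-case residual
    if err <= tol or min(h, w) <= min_size:
        return [(y0, x0, h, w, mean)]
    h2, w2 = h // 2, w // 2                     # split into four children
    leaves = []
    for dy, dx, hh, ww in [(0, 0, h2, w2), (0, w2, h2, w - w2),
                           (h2, 0, h - h2, w2), (h2, w2, h - h2, w - w2)]:
        leaves += quadtree_motion(flow, y0 + dy, x0 + dx, hh, ww, tol, min_size)
    return leaves

# Example: a synthetic 64x64 motion field; fewer leaves mean fewer parameters.
flow = np.zeros((64, 64, 2))
flow[32:, 32:] = [1.5, -0.5]                    # one moving quadrant
leaves = quadtree_motion(flow, 0, 0, 64, 64, tol=0.1)
print(len(leaves), "blocks instead of", 64 * 64, "per-pixel vectors")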

9:30, SPEECH-P7.2
NOISE COMPENSATION IN A MULTI-MODAL VERIFICATION SYSTEM
C. SANDERSON, K. PALIWAL
In this paper we propose an adaptive multi-modal verification system comprising a modified Minimum Cost Bayesian Classifier (MCBC) and a method for finding the reliability of the speech expert under various noisy conditions. The modified MCBC takes into account the reliability of each modality expert, allowing the contribution of opinions from a noise-affected expert to be de-emphasized. The reliability of the speech expert is found without directly modeling the noisy speech or determining the reliability a priori for various conditions of the speech signal. Experiments on the Digit Database show the Total Error (TE) to be reduced by 78% compared to a non-adaptive system.
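
An illustrative sketch of reliability-weighted opinion fusion in the spirit described above; the weighted-sum rule, the base_weight parameter, and the toy opinion values are assumptions for illustration, not the modified MCBC itself:

def fuse_opinions(o_speech, o_face, speech_reliability, base_weight=0.5):
    """Weighted-sum fusion of two experts' opinions (e.g. posteriors for the
    true-claimant hypothesis).  The speech weight shrinks as its estimated
    reliability (in [0, 1]) drops, de-emphasizing a noise-affected expert."""
    w_speech = base_weight * speech_reliability
    w_face = 1.0 - w_speech                  # weights sum to one
    return w_speech * o_speech + w_face * o_face

# Clean speech: both experts are trusted roughly equally.
print(fuse_opinions(0.9, 0.6, speech_reliability=1.0))   # 0.75
# Very noisy speech: the decision leans on the face expert.
print(fuse_opinions(0.2, 0.6, speech_reliability=0.1))   # 0.58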

9:30, SPEECH-P7.3
OPTIMAL WEIGHTING OF POSTERIORS FOR AUDIO-VISUAL SPEECH RECOGNITION
M. HECKMANN, F. BERTHOMMIER, K. KROSCHEL
We investigate the fusion of audio and video a posteriori phonetic probabilities in a hybrid ANN/HMM audio-visual speech recognition system. Three basic conditions on the fusion process are stated and implemented in a linear and a geometric weighting scheme: the assumption of conditional independence of the audio and video data, and the sole contribution of the audio path when the SNR is very high and of the video path when it is very low. For the geometric weighting, a new weighting scheme is developed, whereas the linear weighting follows the Full Combination approach usually employed in multi-stream recognition. We compare these two new concepts in audio-visual recognition to a rather standard approach known from the literature. Recognition tests were performed on a continuous number recognition task using a single-speaker database of 1221 utterances with two different types of noise added.
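
The two weighting families compared above can be sketched in a few lines; the weight lam stands for the SNR-dependent quantity the paper estimates, and the toy posterior vectors are illustrative:

import numpy as np

def linear_fusion(p_audio, p_video, lam):
    """Linear weighting: convex combination of the two posterior vectors."""
    return lam * p_audio + (1.0 - lam) * p_video

def geometric_fusion(p_audio, p_video, lam, eps=1e-12):
    """Geometric weighting: exponent-weighted product, renormalized so the
    fused posteriors again sum to one."""
    p = (p_audio + eps) ** lam * (p_video + eps) ** (1.0 - lam)
    return p / p.sum()

# lam -> 1 keeps only the audio path (high SNR); lam -> 0 keeps only the video path.
p_a = np.array([0.7, 0.2, 0.1])   # audio ANN posteriors over 3 phone classes
p_v = np.array([0.3, 0.4, 0.3])   # video ANN posteriors
print(linear_fusion(p_a, p_v, lam=0.8))
print(geometric_fusion(p_a, p_v, lam=0.8))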

9:30, SPEECH-P7.4
HIERARCHICAL DISCRIMINANT FEATURES FOR AUDIO-VISUAL LVCSR
G. POTAMIANOS, J. LUETTIN, C. NETI
We propose the use of a hierarchical, two-stage discriminant transformation for obtaining audio-visual features that improve automatic speech recognition. Linear discriminant analysis (LDA), followed by a maximum likelihood linear transform (MLLT), is first applied to MFCC-based audio-only features, as well as to visual-only features obtained by a discrete cosine transform of the video region of interest. Subsequently, a second stage of LDA and MLLT is applied to the concatenation of the resulting single-modality features. The obtained audio-visual features are used to train a traditional HMM-based speech recognizer. Experiments on the IBM ViaVoice audio-visual database demonstrate that the proposed feature fusion method improves speaker-independent, large-vocabulary, continuous speech recognition for both the clean and noisy audio conditions considered. A 24% relative word error rate reduction over an audio-only system is achieved in the latter case.
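
A rough sketch of the two-stage discriminant cascade using scikit-learn's LDA on synthetic data; the MLLT step that follows each LDA is omitted, and the dimensionalities, class labels (HMM states), and data are placeholders, not the paper's setup:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

rng = np.random.default_rng(0)
n, n_classes = 2000, 40                     # frames, HMM-state classes
y = rng.integers(0, n_classes, n)           # per-frame state labels
audio = rng.normal(size=(n, 60)) + y[:, None] * 0.05   # stand-in MFCC-based features
video = rng.normal(size=(n, 100)) + y[:, None] * 0.02  # stand-in DCT of the ROI

# Stage 1: single-modality discriminant projections (MLLT would follow each).
lda_a = LDA(n_components=30).fit(audio, y)
lda_v = LDA(n_components=30).fit(video, y)
fused_in = np.hstack([lda_a.transform(audio), lda_v.transform(video)])

# Stage 2: discriminant projection of the concatenated audio-visual vector.
lda_av = LDA(n_components=39).fit(fused_in, y)
features_av = lda_av.transform(fused_in)    # features for HMM training
print(features_av.shape)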

9:30, SPEECH-P7.5
ASYNCHRONOUS STREAM MODELING FOR LARGE VOCABULARY AUDIO-VISUAL SPEECH RECOGNITION
J. LUETTIN, G. POTAMIANOS, C. NETI
This paper addresses the problem of audio-visual information fusion to provide highly robust speech recognition. We investigate methods that make different assumptions about asynchrony and conditional dependence across streams and propose a technique based on composite HMMs that can account for stream asynchrony and different levels of information integration. We show how these models can be trained jointly by maximum likelihood estimation. Experiments, performed on a speaker-independent large vocabulary continuous speech recognition task, show that the best performance is obtained by asynchronous stream integration. At 10 dB SNR with additive speech "babble" noise, this system reduces the error rate by 27% relative over audio-only models and by 22% relative over traditional audio-visual models using concatenative feature fusion.
Slide show: http://www.clsp.jhu.edu/ws2000/groups/av_speech/
Report: http://www.clsp.jhu.edu/ws2000/final_reports/avsr/
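
In the notation commonly used for such stream models (the symbols below are generic, not taken from the paper), a composite state pairs an audio state j_a with a video state j_v, and its emission score weights the two stream likelihoods with exponents that sum to one:

    b_{(j_a, j_v)}\!\left(\mathbf{o}^a_t, \mathbf{o}^v_t\right)
      = b_{j_a}\!\left(\mathbf{o}^a_t\right)^{\lambda_a}\,
        b_{j_v}\!\left(\mathbf{o}^v_t\right)^{\lambda_v},
      \qquad \lambda_a + \lambda_v = 1,

with the audio and video state sequences allowed to desynchronize inside a model and forced back into synchrony at its boundaries.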

9:30, SPEECH-P7.6
WEIGHTING SCHEMES FOR AUDIO-VISUAL FUSION IN SPEECH RECOGNITION
H. GLOTIN, D. VERGYRI, C. NETI, G. POTAMIANOS, J. LUETTIN
In this work we demonstrate an improvement in state-of-the-art large vocabulary continuous speech recognition (LVCSR) performance, under clean and noisy conditions, through the use of visual information in addition to the traditional audio information. We take a decision fusion approach to the audio-visual information, in which the single-modality (audio-only and visual-only) HMM classifiers are combined to recognize audio-visual speech. More specifically, we tackle the problem of estimating the appropriate combination weights for each of the modalities. Two different techniques are described: the first uses an automatically extracted estimate of the audio stream reliability to modify the weights for each modality (both clean and noisy audio results are reported), while the second is a discriminative model combination approach in which weights on pre-defined model classes are optimized to minimize the WER (clean audio results only).
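
One way to picture the first technique is a monotone mapping from an automatically extracted audio-reliability measure to the audio stream weight; the sigmoid form, its constants, and the reliability scale below are purely illustrative assumptions, not the paper's estimator or its discriminative weight optimization:

import math

def audio_weight(reliability, midpoint=0.5, slope=8.0):
    """Map an audio reliability estimate in [0, 1] to a stream weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-slope * (reliability - midpoint)))

def combined_log_likelihood(ll_audio, ll_video, reliability):
    lam = audio_weight(reliability)
    return lam * ll_audio + (1.0 - lam) * ll_video   # weighted log-likelihoods

print(combined_log_likelihood(-120.0, -150.0, reliability=0.9))  # leans on audio
print(combined_log_likelihood(-120.0, -150.0, reliability=0.1))  # leans on video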

9:30, SPEECH-P7.7
APPLICATION OF AFFINE-INVARIANT FOURIER DESCRIPTORS TO LIPREADING FOR AUDIO-VISUAL SPEECH RECOGNITION
S. GURBUZ, Z. TUFEKCI, E. PATTERSON, J. GOWDY
This work focuses on a novel affine-invariant lipreading method and its optimal combination with an audio subsystem to implement an audio-visual automatic speech recognition (AV-ASR) system. The lipreading method is based on an outer lip contour description that is transformed to the Fourier domain and normalized there to eliminate dependence on the affine transformation (translation, rotation, scaling, and shear) and on the starting point. The optimal combination algorithm incorporates a signal-to-noise ratio (SNR) based weight selection rule, which leads to a more accurate global likelihood ratio test. Experimental results are presented for an isolated word recognition task with eight different noise types from the NOISEX database at several SNR values.
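
A condensed sketch of Fourier-descriptor normalization for a closed lip contour; this version removes translation, scale, rotation, and starting-point dependence, while the full affine case (including shear) needs the additional normalization developed in the paper, which is not reproduced here:

import numpy as np

def fourier_descriptors(contour_xy, n_coeffs=10):
    """contour_xy: (N, 2) outer-lip points, ordered along the contour."""
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]   # complex contour signal
    Z = np.fft.fft(z)
    Z[0] = 0.0                                     # drop DC: translation invariance
    Z = Z / np.abs(Z[1])                           # scale invariance
    return np.abs(Z[1:n_coeffs + 1])               # magnitudes: rotation and
                                                   # starting-point invariance

# A noisy ellipse as a stand-in lip contour.
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
contour = np.c_[40 * np.cos(t), 20 * np.sin(t)] + np.random.normal(0, 0.5, (64, 2))
print(fourier_descriptors(contour))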

9:30, SPEECH-P7.8
MEASURING THE RELATION BETWEEN SPEECH ACOUSTICS AND 2D FACIAL MOTION
H. YEHIA, A. BARBOSA
This paper presents a quantitative analysis of the relation between speech acoustics and the 2D video signal of the facial motion that occurs simultaneously. 2D facial motion is acquired using an ordinary video camera: after a video sequence is digitized, a search algorithm tracks markers painted on the speaker's face. Facial motion is represented by the 2D marker trajectories, whereas line spectrum pair (LSP) coefficients are used to parameterize the speech acoustics. The LSP coefficients and the marker trajectories are then used to train time-invariant and time-varying linear models, as well as nonlinear (neural network) models. In turn, these models are used to evaluate to what extent 2D facial motion is determined by speech acoustics. The correlation coefficients between measured and estimated trajectories are as high as 0.95. This estimation of facial motion from speech acoustics indicates a way to integrate audio and visual signals for efficient audiovisual speech coding.
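
The time-invariant linear mapping and its correlation-based evaluation can be sketched with ordinary least squares; the LSP extraction and marker tracking are not shown, and the array shapes and synthetic data below are placeholders rather than the authors' corpus:

import numpy as np

# Placeholder data: T frames of 10 LSP coefficients and 2D positions of 12 markers.
rng = np.random.default_rng(1)
T, n_lsp, n_markers = 500, 10, 12
lsp = rng.normal(size=(T, n_lsp))
true_W = rng.normal(size=(n_lsp + 1, 2 * n_markers))
markers = (np.hstack([lsp, np.ones((T, 1))]) @ true_W
           + 0.1 * rng.normal(size=(T, 2 * n_markers)))

# Fit the time-invariant linear model (with bias term) by least squares.
X = np.hstack([lsp, np.ones((T, 1))])
W, *_ = np.linalg.lstsq(X, markers, rcond=None)
est = X @ W

# Per-trajectory correlation between measured and estimated marker motion.
corr = [np.corrcoef(markers[:, k], est[:, k])[0, 1] for k in range(2 * n_markers)]
print(f"mean correlation: {np.mean(corr):.3f}")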