Session: SPEECH-L9
Time: 3:30 - 5:30, Thursday, May 10, 2001
Location: Room 151
Title: Large Vocabulary Recognition 2
Chair: Michael Picheny

3:30, SPEECH-L9.1
IMPROVEMENTS IN LINEAR TRANSFORM BASED SPEAKER ADAPTATION
L. UEBEL, P. WOODLAND
This paper presents three forms of linear transform based speaker adaptation that can give better performance than standard maximum likelihood linear regression (MLLR) adaptation. For unsupervised adaptation, a lattice-based technique is introduced and compared to MLLR using confidence scores. For supervised adaptation, estimation of the adaptation matrices using the maximum mutual information criterion is discussed, leading to the MMILR approach. Recognition experiments show that lattice MLLR can reduce the word error rate on a Switchboard task by 1.4% absolute. For recognition of non-native speech from the Wall Street Journal database, a 10-16% relative reduction in word error rate was obtained using MMILR compared to standard MLLR.
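As a rough illustration of the underlying mechanism (not code from the paper; the shapes and names below are assumptions), an MLLR-style transform is an affine map W = [A b] estimated per speaker or regression class and applied to the Gaussian means of a speaker-independent model:

    import numpy as np

    def apply_mllr_transform(means, A, b):
        # means: (n_gaussians, dim) speaker-independent mean vectors
        # A:     (dim, dim) linear part of the transform
        # b:     (dim,) bias part of the transform
        # Returns the speaker-adapted means A @ mu + b for every mean mu.
        return means @ A.T + b

    # Illustrative usage with random numbers (shapes only, not real model data).
    rng = np.random.default_rng(0)
    dim, n_gauss = 39, 8                      # e.g. 39-dimensional cepstral features
    si_means = rng.normal(size=(n_gauss, dim))
    A = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))
    b = 0.1 * rng.normal(size=dim)
    sa_means = apply_mllr_transform(si_means, A, b)

The contribution of the paper lies in how A and b are estimated (from lattices, with confidence scores, or under the MMI criterion), not in how they are applied.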

3:50, SPEECH-L9.2
INNOVATIVE APPROACHES FOR LARGE VOCABULARY NAME RECOGNITION
Y. GAO, B. RAMABHADRAN, J. CHEN, H. ERDOGAN, M. PICHENY
Automatic name dialing is a practical and interesting application of speech recognition on telephony systems. The IBM name recognition system is a large vocabulary, speaker-independent system currently in use for reaching IBM employees in the United States. In this paper, we present some innovative algorithms that improve name recognition accuracy. Unlike transcription tasks such as Switchboard, recognition of names poses a variety of different problems. Several of these problems arise from the fact that foreign names are hard to pronounce for speakers who are not familiar with them and that there are no standardized methods for pronouncing proper names. Noise robustness is another very important factor, as these calls are typically made in noisy environments (e.g., from a car, cafeteria, or airport) and over different kinds of cellular and land-line telephone channels. We have performed a systematic analysis of the speech recognition errors and tackled the issues separately with techniques ranging from weighted speaker clustering, massive adaptation, and rapid unsupervised adaptation to pronunciation modeling. We find that the decoding accuracy can be improved significantly (28% relative) in this manner.

4:10, SPEECH-L9.3
NEW FEATURES IN THE CU-HTK SYSTEM FOR TRANSCRIPTION OF CONVERSATIONAL TELEPHONE SPEECH
T. HAIN, P. WOODLAND, G. EVERMANN, D. POVEY
This paper discusses new features integrated into the Cambridge University HTK (CU-HTK) system for the transcription of conversational telephone speech. Major improvements have been achieved by the use of maximum mutual information estimation in training as well as maximum likelihood estimation; the use of a full variance transform for adaptation; the inclusion of unigram pronunciation probabilities; and word-level posterior probability estimation using confusion networks for use in minimum word error rate decoding, confidence score estimation and system combination. Improvements are demonstrated via performance on the NIST March 2000 evaluation of English conversational telephone speech transcription (Hub5E). In this evaluation the CU-HTK system gave an overall word error rate of 25.4%, which was the best performance by a statistically significant margin.
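As a rough sketch of the confusion-network (consensus) decoding idea mentioned above, and not the CU-HTK implementation itself, assume the network has already been built as a sequence of word bins with posterior probabilities; consensus decoding then keeps the most probable entry in each bin, and that entry's posterior doubles as a word confidence score:

    def consensus_decode(confusion_network, eps="<eps>"):
        # confusion_network: list of bins; each bin is a dict mapping a word
        # (or eps for 'no word in this slot') to its posterior probability.
        # Returns the consensus hypothesis, dropping eps entries.
        hypothesis = []
        for bin_posteriors in confusion_network:
            best_word = max(bin_posteriors, key=bin_posteriors.get)
            if best_word != eps:
                hypothesis.append(best_word)
        return hypothesis

    # Toy example; the posteriors are made up for illustration.
    cn = [{"i": 0.9, "<eps>": 0.1},
          {"veal": 0.3, "feel": 0.6, "fill": 0.1},
          {"fine": 0.8, "find": 0.2}]
    print(consensus_decode(cn))   # ['i', 'feel', 'fine']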

4:30, SPEECH-L9.4
RECOGNIZE TONE LANGUAGES USING PITCH INFORMATION ON THE MAIN VOWEL OF EACH SYLLABLE
J. CHEN, H. LI, L. SHEN, G. FU
An innovative method for speech recognition of tone languages is reported. By definition, the tone of a syllable is determined by the pitch contour of the entire syllable. We propose that the pitch information on the main vowel of a syllable is sufficient to determine the tone of that syllable; therefore, to recognize tone languages, only the main vowels need to be associated with tones. The number of basic phonetic units required to recognize tone languages is thus greatly reduced. We then report experimental results on Cantonese and Mandarin. In both cases, the main vowel method substantially reduces the number of phonemes and the quantity of training data required while improving decoding accuracy over other methods. Possible applications of the new method to other tone languages, including Thai, Vietnamese, Japanese, Swedish, and Norwegian, are discussed.
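A minimal sketch of this idea (not the authors' system; the feature set, function name, and segmentation inputs are assumptions) is to attach pitch features only to the main-vowel segment of each syllable:

    import numpy as np

    def main_vowel_pitch_features(f0_contour, frame_times, vowel_start, vowel_end):
        # f0_contour:  per-frame F0 values in Hz (0 for unvoiced frames)
        # frame_times: per-frame timestamps in seconds
        # vowel_start, vowel_end: time boundaries of the syllable's main vowel
        # Returns a small feature vector (mean log-F0, slope, range) that a
        # tone-dependent vowel model could be trained on.
        in_vowel = (frame_times >= vowel_start) & (frame_times <= vowel_end)
        f0 = f0_contour[in_vowel]
        voiced = f0[f0 > 0]
        if voiced.size < 2:
            return np.zeros(3)                  # no usable pitch information
        log_f0 = np.log(voiced)
        t = np.arange(voiced.size)
        slope = np.polyfit(t, log_f0, 1)[0]     # rising vs. falling contour
        return np.array([log_f0.mean(), slope, log_f0.max() - log_f0.min()])

Only the vowel models then need tone-dependent variants; consonants and other units remain tone-independent, which is what reduces the phonetic unit inventory.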

4:50, SPEECH-L9.5
THE ISL EVALUATION SYSTEM FOR VERBMOBIL-II
H. SOLTAU, T. SCHAAF, F. METZE, A. WAIBEL
This paper describes the 2000 ISL large vocabulary speech recognition system for fast decoding of conversational speech, which was used in the German Verbmobil-II project. The challenge of this task is to build acoustic models robust to the different dialects, spontaneous effects, and crosstalk that occur in conversational speech. We present speaker-incremental normalization and adaptation experiments under close-to-real-time constraints. To reduce the number of consequential errors caused by out-of-vocabulary (OOV) words, we conducted filler-model experiments to handle unknown proper names. The overall improvements from 1998 to 2000 resulted in a word error rate reduction from 40% to 17% on our development test set.

5:10, SPEECH-L9.6
GAUSSIAN MIXTURE SELECTION USING CONTEXT-INDEPENDENT HMM
A. LEE, T. KAWAHARA, K. SHIKANO
We present a method to efficiently select Gaussian mixtures for fast acoustic likelihood computation. It makes use of context-independent models for selection and back-off of the corresponding triphone models. Specifically, for the k-best phone models found by a preliminary evaluation, triphone models of higher resolution are applied, while all other states are assigned likelihoods from the monophone models. This selection scheme assigns more reliable back-off likelihoods to the unselected states than conventional Gaussian selection based on a VQ codebook. It can also incorporate efficient Gaussian pruning in the preliminary evaluation, which offsets the increased size of the pre-selection model. Experimental results show that the proposed method achieves performance comparable to standard Gaussian selection and performs much better under aggressive pruning conditions. Together with phonetic tied-mixture (PTM) modeling, the acoustic matching cost is reduced to almost 14% with little loss of accuracy.
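As a rough sketch of the selection scheme described above (a hypothetical reimplementation, not the authors' code; the data structures are assumptions), a preliminary pass scores every context-independent phone model, keeps the k best, and then evaluates full triphone mixtures only for states whose centre phone survived, backing off to the monophone score elsewhere:

    def select_and_score(x, monophone_loglik, triphone_loglik, triphone_to_phone, k=8):
        # x:                 one observation (feature vector)
        # monophone_loglik:  dict phone -> callable returning log-likelihood of x
        # triphone_loglik:   dict triphone state -> callable returning log-likelihood of x
        # triphone_to_phone: dict mapping each triphone state to its centre phone
        # Returns per-state log-likelihoods: full triphone scores for states whose
        # centre phone was pre-selected, monophone back-off scores for the rest.
        phone_scores = {p: f(x) for p, f in monophone_loglik.items()}
        best_phones = set(sorted(phone_scores, key=phone_scores.get, reverse=True)[:k])
        scores = {}
        for state, f in triphone_loglik.items():
            phone = triphone_to_phone[state]
            if phone in best_phones:
                scores[state] = f(x)                 # full-resolution triphone score
            else:
                scores[state] = phone_scores[phone]  # back off to the monophone score
        return scores

    # Toy usage with made-up scoring functions standing in for GMM log-likelihoods.
    mono = {"a": lambda x: -1.0, "k": lambda x: -5.0, "t": lambda x: -9.0}
    tri = {"k-a+t": lambda x: -0.5, "t-a+k": lambda x: -0.8, "a-t+a": lambda x: -7.0}
    tri_to_phone = {"k-a+t": "a", "t-a+k": "a", "a-t+a": "t"}
    print(select_and_score(None, mono, tri, tri_to_phone, k=2))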