3:30, SPEECH-L9.1
IMPROVEMENTS IN LINEAR TRANSFORM BASED SPEAKER ADAPTATION
L. UEBEL, P. WOODLAND
This paper presents three forms of linear transform based speaker
adaptation that can give better performance than standard maximum
likelihood linear regression (MLLR) adaptation. For unsupervised
adaptation, a lattice-based technique is introduced, which is
compared to MLLR using confidence scores. For supervised adaptation,
estimation of the adaptation matrices using the maximum mutual
information criterion is discussed, which leads to the MMILR approach.
Recognition experiments show that lattice MLLR can reduce word
error rates on a Switchboard task by 1.4% absolute. For recognition
of non-native speech from the Wall Street Journal database, a
reduction in word error rate of 10-16% relative was obtained using
MMILR compared to standard MLLR.
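For context, standard MLLR adapts each Gaussian mean with an affine
transform shared by many Gaussians, mean' = A * mean + b. The sketch
below only applies such a transform; it does not estimate it, and it
is not the lattice MLLR or MMILR procedure of the paper. The function
name and the matrix values are made up for illustration.

```python
def mllr_adapt_mean(mean, A, b):
    """Return the adapted mean A * mean + b, using plain Python lists."""
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
        for row, b_i in zip(A, b)
    ]


# Hypothetical 2-dimensional transform (illustrative values only).
A = [[1.0, 0.5],
     [0.0, 2.0]]
b = [0.5, -1.0]

adapted = mllr_adapt_mean([2.0, 4.0], A, b)
print(adapted)
```

In practice one transform is typically tied across a regression class
of Gaussians, so a small amount of adaptation data can update many
model parameters at once.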
3:50, SPEECH-L9.2
INNOVATIVE APPROACHES FOR LARGE VOCABULARY NAME RECOGNITION
Y. GAO, B. RAMABHADRAN, J. CHEN, H. ERDOGAN, M. PICHENY
Automatic name dialing is a practical and interesting application of
speech recognition on telephony systems. The IBM name recognition
system is a large vocabulary, speaker independent system currently in
use for reaching IBM employees in the United States. In this paper, we
present some innovative algorithms that improve name recognition
accuracy. Unlike transcription tasks, such as the Switchboard task,
recognition of names poses a variety of different problems. Several of
these problems arise from the fact that foreign names are hard to
pronounce for speakers who are not familiar with the names and that
there are no standardized methods for pronouncing proper names. Noise
robustness is another very important factor as these calls are
typically made in noisy environments, such as from a car, cafeteria,
airport, etc., and over different kinds of cellular and land-line
telephone channels. We have performed a systematic analysis of the
speech recognition errors and tackled the issues separately with
techniques ranging from weighted speaker clustering, massive
adaptation, rapid and unsupervised adaptation methods to
pronunciation modeling methods. We find that the decoding accuracy can
be improved significantly (28% relative) in this manner.
4:10, SPEECH-L9.3
NEW FEATURES IN THE CU-HTK SYSTEM FOR TRANSCRIPTION OF CONVERSATIONAL TELEPHONE SPEECH
T. HAIN, P. WOODLAND, G. EVERMANN, D. POVEY
This paper discusses new features integrated into the
Cambridge University HTK (CU-HTK) system for the transcription of
conversational telephone speech. Major improvements have been
achieved by the use of maximum mutual information estimation in
training as well as maximum likelihood estimation; the use of a full
variance transform for adaptation; the inclusion of unigram
pronunciation probabilities; and word-level posterior probability
estimation using confusion networks for use in minimum word error
rate decoding, confidence score estimation and system
combination. Improvements are demonstrated via performance on the
NIST March 2000 evaluation of English conversational telephone
speech transcription (Hub5E). In this evaluation the CU-HTK system
gave an overall word error rate of 25.4%, which was the best
performance by a statistically significant margin.
4:30, SPEECH-L9.4
RECOGNIZE TONE LANGUAGES USING PITCH INFORMATION ON THE MAIN VOWEL OF EACH SYLLABLE
J. CHEN, H. LI, L. SHEN, G. FU
An innovative method for speech recognition of tone languages is
reported. By definition, the tone of a syllable is determined by the
pitch contour of the entire syllable. We propose that the pitch
information on the main vowel of a syllable is sufficient to determine
the tone of that syllable. Therefore, to recognize tone languages,
only main vowels need to be associated with tones, and the number of
basic phonetic units required to recognize tone languages is greatly
reduced. We then report experimental results on Cantonese and
Mandarin. In both cases, using the main vowel method, the decoding
accuracy is improved over other methods, even though the number of
phonemes and the quantity of training data are substantially reduced.
Possible applications of the new method to other tone languages,
including Thai, Vietnamese, Japanese, Swedish, and Norwegian, are
discussed.
4:50, SPEECH-L9.5
THE ISL EVALUATION SYSTEM FOR VERBMOBIL-II
H. SOLTAU, T. SCHAAF, F. METZE, A. WAIBEL
This paper describes the 2000 ISL large vocabulary speech recognition
system for fast decoding of conversational speech which was used
in the German Verbmobil-II project. The challenge of this task is to
build robust acoustic models to handle different dialects, spontaneous
effects, and crosstalk, as occur in conversational speech.
We present speaker incremental normalization and adaptation
experiments close to real-time constraints. To reduce the number of
consequential errors caused by out-of-vocabulary words (OOV), we conducted
filler-model experiments to handle unknown proper names. The overall
improvements from 1998 to 2000 resulted in a reduction in word error
rate from 40% to 17% on our development test set.
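Taking the two figures above as word error rates before and after the
improvements, the change can be expressed both in absolute points and
as a relative reduction. The helper below is a small illustrative
sketch; its name is not from the paper.

```python
def wer_reduction(baseline, improved):
    """Return (absolute, relative) reduction for two WERs given in percent."""
    absolute = baseline - improved            # percentage points
    relative = 100.0 * absolute / baseline    # percent of the baseline WER
    return absolute, relative


absolute, relative = wer_reduction(40.0, 17.0)
print(absolute, relative)  # 23.0 57.5
```

The same distinction applies wherever a result is quoted as either
"absolute" or "relative": an absolute figure is a difference in
percentage points, a relative figure is that difference divided by the
baseline error rate.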
5:10, SPEECH-L9.6
GAUSSIAN MIXTURE SELECTION USING CONTEXT-INDEPENDENT HMM
A. LEE, T. KAWAHARA, K. SHIKANO
We present a method to efficiently select Gaussian mixtures for fast
acoustic likelihood computation. It makes use of context-independent
models for selection and back-off of corresponding triphone models.
Specifically, for the k-best phone models in the preliminary
evaluation, triphone models of higher resolution are applied, and
the others are assigned likelihoods from the monophone models. This
selection scheme assigns more reliable back-off likelihoods to the
un-selected states than the conventional Gaussian selection based on a
VQ codebook. It can also incorporate efficient Gaussian pruning at
the preliminary evaluation, which offsets the increased size of the
pre-selection model. Experimental results show that the proposed
method achieves performance comparable to standard Gaussian
selection, and performs much better under aggressive pruning
conditions. Together with phonetic tied-mixture (PTM) modeling, the
acoustic matching cost is reduced to almost 14% with little loss of
accuracy.
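The selection scheme described in this abstract can be sketched
roughly as follows. This is a simplified illustration, not the
authors' implementation: each state is assumed to be a single
diagonal-covariance Gaussian rather than a mixture, no Gaussian
pruning is applied, and all names and toy values are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def select_and_score(x, monophones, triphones, k):
    """Score triphone states, backing off to monophone scores.

    monophones: {phone: (mean, var)}
    triphones:  {state_name: (base_phone, mean, var)}
    Triphones whose base phone is among the k best monophones are
    evaluated exactly; all others reuse the monophone log likelihood.
    """
    # Preliminary evaluation with the context-independent models.
    mono_scores = {p: log_gauss(x, m, v) for p, (m, v) in monophones.items()}
    k_best = set(sorted(mono_scores, key=mono_scores.get, reverse=True)[:k])

    scores = {}
    for name, (base, mean, var) in triphones.items():
        if base in k_best:
            scores[name] = log_gauss(x, mean, var)  # full-resolution model
        else:
            scores[name] = mono_scores[base]        # back-off likelihood
    return scores, k_best


# Toy one-dimensional models (illustrative values only).
monophones = {"a": ([0.0], [1.0]), "b": ([5.0], [1.0])}
triphones = {
    "x-a+y": ("a", [0.1], [1.0]),
    "x-b+y": ("b", [5.2], [1.0]),
}
x = [0.2]
scores, k_best = select_and_score(x, monophones, triphones, k=1)
print(k_best, scores)
```

Because the unselected states reuse a likelihood that was already
computed in the preliminary pass, the back-off costs no extra Gaussian
evaluations for those states.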