Chair: Michael Picheny, IBM (USA)
L. R. Bahl, IBM (USA)
S. Balakrishnan-Aiyer, IBM (USA)
J.R. Bellegarda, IBM (USA)
M. Franz, IBM (USA)
P.S. Gopalakrishnan, IBM (USA)
D. Nahamoo, IBM (USA)
M. Novak, IBM (USA)
M. Padmanabhan, IBM (USA)
M.A. Picheny, IBM (USA)
S. Roukos, IBM (USA)
In this paper we discuss various experimental results obtained with our continuous speech recognition system on the Wall Street Journal task. Experiments with different feature extraction methods, varying amounts and types of training data, and different vocabulary sizes are reported.
Douglas B. Paul, MIT Lincoln Laboratory (USA)
The system described here is a large-vocabulary continuous-speech recognition system developed using the ARPA Wall Street Journal and North American Business databases. The recognizer uses a stack decoder-based search strategy with a left-to-right stochastic language model. This decoder has been shown to function effectively on 56K-word recognition of continuous speech. It operates left-to-right and can produce final textual output while continuing to accept additional input. The recognizer also features recognition-time adaptation to the user's voice. The new system showed a 48% reduction in the word error rate over the previously reported November 1992 system.
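As a rough illustration of the stack-decoder search strategy described above, the sketch below runs a best-first search over partial left-to-right word hypotheses held on a priority queue (the "stack"). It is a minimal sketch under stated assumptions only: `successor_words`, `acoustic_score`, and `lm_score` are hypothetical stand-ins for the real lexicon, acoustic models, and stochastic language model, and the pruning is far simpler than in the actual system.

```python
import heapq

# Hypothetical sketch of a best-first (stack) decoder: hypotheses are partial
# word sequences scored by combined acoustic + language-model log-probability.
# successor_words, acoustic_score, and lm_score are illustrative stand-ins.
def stack_decode(frames, successor_words, acoustic_score, lm_score, beam=100):
    # Each stack entry: (negated score, end frame, word sequence so far).
    stack = [(0.0, 0, ())]
    complete = []
    while stack:
        neg_score, t, words = heapq.heappop(stack)
        if t >= len(frames):                      # hypothesis spans all input
            complete.append((-neg_score, words))
            continue
        for word in successor_words(words):       # extend left-to-right
            for t_end in range(t + 1, len(frames) + 1):
                score = (-neg_score
                         + acoustic_score(frames[t:t_end], word)
                         + lm_score(words, word))
                heapq.heappush(stack, (-score, t_end, words + (word,)))
        # Prune: keep only the `beam` most promising partial hypotheses.
        stack = heapq.nsmallest(beam, stack)
        heapq.heapify(stack)
    return max(complete, default=(float("-inf"), ()))[1]
```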
Xavier Aubert, Philips GmbH Research Laboratories - Aachen
Hermann Ney, Aachen University of Technology (GERMANY)
We address the problem of using word graphs (or lattices) to integrate complex knowledge sources, such as long-span language models or acoustic cross-word models, in large vocabulary continuous speech recognition. A method for efficiently constructing a word graph is reviewed and two ways of exploiting it are presented. Under the word-pair approximation, a phrase-level search is possible, while in the other case a general graph decoder is set up. We show that the predecessor-word identity provided by a first bigram decoding can be used to constrain the word graph without impairing the next pass. This procedure has been applied to 64k-word trigram decoding in conjunction with an incremental unsupervised speaker adaptation scheme. Experimental results are given for the North American Business corpus used in the November '94 evaluation.
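As a loose illustration of how a word graph can be exploited by a longer-span model, the sketch below rescores a graph with a trigram by dynamic programming, summarizing the search state at each graph node by the last two words of the path (the same kind of bounded word history that motivates the word-pair approximation). The edge layout and `trigram_logprob` are assumptions for illustration, not the paper's algorithm.

```python
# Illustrative trigram rescoring of a word graph. Nodes are assumed to be
# topologically numbered (every edge satisfies start < end) and `edges` is
# sorted by start node; trigram_logprob is a hypothetical LM callback.
def rescore_graph(edges, n_nodes, trigram_logprob):
    """edges: list of (start_node, end_node, word, acoustic_logprob).
    Returns the highest-scoring word sequence from node 0 to n_nodes-1."""
    # DP state: (node, last two words) -> best (score, full word sequence).
    best = {(0, ("<s>", "<s>")): (0.0, ())}
    for start, end, word, ac in edges:
        for (node, hist), (score, seq) in list(best.items()):
            if node != start:
                continue
            new = (score + ac + trigram_logprob(hist[0], hist[1], word),
                   seq + (word,))
            key = (end, (hist[1], word))
            if key not in best or new[0] > best[key][0]:
                best[key] = new
    finals = [v for (node, _), v in best.items() if node == n_nodes - 1]
    return max(finals, default=(float("-inf"), ()))[1]
```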
P. Jeanrenaud, BBN Systems and Technologies (USA)
E. Eide, BBN Systems and Technologies (USA)
U. Chaudhari, BBN Systems and Technologies (USA)
J. McDonough, BBN Systems and Technologies (USA)
K. Ng, BBN Systems and Technologies (USA)
M. Siu, BBN Systems and Technologies (USA)
H. Gish, BBN Systems and Technologies (USA)
Speech recognition of conversational speech is a difficult task; performance on the Switchboard corpus had been in the vicinity of 70% word error rate. In this paper, we describe the results of applying a variety of modifications to our speech recognition system and show their impact on conversational-speech performance. These modifications include the use of more complex models, trigram language models, and cross-word triphone models. We also show the effect of additional acoustic training data on recognition performance. Finally, we present an approach to dealing with the abundance of short words, and examine how the variable speaking rate found in conversational speech affects performance. The current level of performance is in the vicinity of 50% word error rate, a significant improvement over previous levels.
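For readers unfamiliar with the trigram language models mentioned above, the following sketch shows one common way of building such a model by interpolating trigram, bigram, and unigram relative frequencies. The fixed interpolation weights and training details are illustrative assumptions, not BBN's configuration.

```python
from collections import Counter

# A minimal interpolated trigram language model; weights l1..l3 are
# illustrative and would normally be tuned on held-out data.
def train_trigram(sentences):
    uni, bi, tri = Counter(), Counter(), Counter()
    total = 0
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            uni[words[i]] += 1
            bi[(words[i - 1], words[i])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1
            total += 1

    def prob(w2, w1, w, l1=0.7, l2=0.2, l3=0.1):
        # Back off smoothly from trigram to bigram to unigram estimates.
        p3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
        p2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
        p1 = uni[w] / total if total else 0.0
        return l1 * p3 + l2 * p2 + l3 * p1

    return prob
```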
Ren-Yuan Lyu, National Taiwan University
Lee-Feng Chien, Academia Sinica
Shiao-Hong Hwang, National Taiwan University
Hung-Yun Hsieh, National Taiwan University
Rung-Chiuan Yang, National Taiwan University
Bo-Ren Bai, National Taiwan University
Jia-Chi Weng, National Taiwan University
Yen-Ju Yang, National Taiwan University
Shi-Wei Lin, National Taiwan University
Keh-Jiann Chen, Academia Sinica (REPUBLIC OF CHINA)
Chiu-Yu Tseng, Academia Sinica (REPUBLIC OF CHINA)
Lin-Shan Lee, Academia Sinica (REPUBLIC OF CHINA)
This paper presents a prototype prosodic-segment-based dictation machine for the Chinese language with a very large vocabulary. It accepts utterances that are continuous within a prosodic segment, which is composed of one or a few words. It also possesses various on-line learning capabilities for fast adaptation to a new user at the acoustic, lexical, and linguistic levels. The overall system is implemented on an IBM PC with an additional DSP card containing a Motorola DSP 96002 chip. Word accuracy reaches nearly 90% for a new user after he produces about 10 minutes of speech to train the system, and accuracy can be further improved with the on-line learning functions.
Hsin-min Wang, National Taiwan University
Jia-lin Shen, National Taiwan University
Yen-Ju Yang, National Taiwan University
Chiu-Yu Tseng, Academia Sinica
Lin-Shan Lee, National Taiwan University (REPUBLIC OF CHINA)
This paper presents the first known results for complete recognition of continuous Mandarin speech for the Chinese language with a very large vocabulary but very limited training data. Although some isolated-syllable-based or isolated-word-based large-vocabulary Mandarin speech recognition systems have been developed successfully, a continuous-speech system of this kind has never been reported before. Several techniques important to the successful development of this system are presented in this paper, including acoustic modeling with a set of sub-syllabic models for base syllable recognition and another set of context-dependent models for tone recognition, a multiple-candidate searching technique based on a concatenated syllable matching algorithm to synchronize base syllable and tone recognition, and a word-class-based Chinese language model for linguistic decoding. The best recognition accuracy achieved is 88.69% for finally decoded Chinese characters, with 88.69%, 91.57%, and 81.37% for base syllables, tones, and tonal syllables, respectively.
J.L. Gauvain, LIMSI-CNRS (FRANCE)
L. Lamel, LIMSI-CNRS (FRANCE)
M. Adda-Decker, LIMSI-CNRS (FRANCE)
In this paper we report on our recent development work in large vocabulary, American English continuous speech dictation. We have experimented with (1) alternative analyses for the acoustic front end, (2) the use of an enlarged vocabulary so as to reduce the number of errors due to out-of-vocabulary words, (3) extensions to the lexical representation, (4) the use of additional acoustic training data, and (5) modification of the acoustic models for telephone speech. The recognizer was evaluated on Hubs 1 and 2 of the fall 1994 ARPA NAB CSR Hub and Spoke Benchmark test. Experimental results for development and evaluation test data are given, as well as an analysis of the errors on the development data.
M.M. Hochberg, Cambridge University
S.J. Renals, University of Sheffield
A.J. Robinson, Cambridge University (ENGLAND)
G. D. Cook, Cambridge University (ENGLAND)
ABBOT is the hybrid connectionist-hidden Markov model (HMM) large-vocabulary continuous speech recognition (CSR) system developed at Cambridge University. The system uses a recurrent network to estimate the acoustic observation probabilities within an HMM framework. A major advantage of this approach is that good performance is achieved with context-independent acoustic models and many fewer parameters than comparable HMM systems require. This paper presents substantial performance improvements gained from new approaches to connectionist model combination and phone-duration modeling. Additional capability has been achieved by extending the decoder to handle larger-vocabulary tasks (20,000 words and greater) with a trigram language model. This paper describes the recent modifications to the system, and experimental results are reported for various test and development sets from the November 1992, 1993, and 1994 ARPA evaluations of spoken language systems.
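The observation-probability estimation that such hybrid systems rely on is often implemented by converting the network's per-frame phone posteriors into scaled likelihoods, dividing by the class priors. The sketch below shows that conversion in isolation; the posterior matrix is assumed to come from some frame-level classifier and is not ABBOT's recurrent network.

```python
import numpy as np

# Convert network posteriors P(q|x) into scaled likelihoods P(x|q)/P(x)
# by dividing by class priors P(q), a standard step in hybrid
# connectionist-HMM decoding. Floor values to avoid log(0).
def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """posteriors: (frames, phones) network outputs; priors: (phones,)
    relative class frequencies from the training alignment."""
    post = np.maximum(posteriors, floor)
    return np.log(post) - np.log(np.maximum(priors, floor))
```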
P. C. Woodland, Cambridge University (UK)
C. J. Leggetter, Cambridge University (UK)
J. J. Odell, Cambridge University (UK)
V. Valtchev, Cambridge University (UK)
S.J. Young, Cambridge University (UK)
This paper describes recent work on the HTK large vocabulary speech recognition system. The system uses tied-state cross-word context-dependent mixture Gaussian HMMs and a dynamic network decoder that can operate in a single pass. In the last year the decoder has been extended to produce word lattices to allow flexible and efficient system development, as well as multi-pass operation for use with computationally expensive acoustic and/or language models. The system vocabulary can now be up to 65k words, the final acoustic models have been extended to be sensitive to more acoustic context (quinphones), a 4-gram language model has been used and unsupervised incremental speaker adaptation incorporated. The resulting system gave the lowest error rates on both the H1-P0 and H1-C1 hub tasks in the November 1994 ARPA CSR evaluation.
Yuqing Gao, National University of Singapore (SINGAPORE)
Hsiao-Wuen Hon, National University of Singapore (SINGAPORE)
Zhiwei Lin, National University of Singapore (SINGAPORE)
Gareth Loudon, National University of Singapore (SINGAPORE)
S. Yogananthan, National University of Singapore (SINGAPORE)
Baosheng Yuan, National University of Singapore (SINGAPORE)
Text input for non-alphabetic languages, such as Chinese, has been a decades-long problem. Chinese dictation using large vocabulary speech recognition provides a convenient mode of text entry. In contrast to a character-based dictation system [5], a word-based Mandarin dictation system has been designed [3] (based on Apple's PlainTalk speech recognition technology [4]) for efficient entry of Chinese characters into a computer. In this paper new features and improvements to the dictation system are presented. The new features and improvements have produced an overall reduction in recognition error of 50-80%. The vocabulary has also been increased from 5,000 words to over 11,000 words.