Chair: Michael Picheny, IBM (USA)
L. R. Bahl, IBM (USA)
S. Balakrishnan-Aiyer, IBM (USA)
J.R. Bellegarda, IBM (USA)
M. Franz, IBM (USA)
P.S. Gopalakrishnan, IBM (USA)
D. Nahamoo, IBM (USA)
M. Novak, IBM (USA)
M. Padmanabhan, IBM (USA)
M.A. Picheny, IBM (USA)
S. Roukos, IBM (USA)
In this paper we discuss various experimental results obtained with our continuous speech recognition system on the Wall Street Journal task. Experiments with different feature extraction methods, varying amounts and types of training data, and different vocabulary sizes are reported.
Douglas B. Paul, MIT Lincoln Laboratory (USA)
The system described here is a large-vocabulary continuous-speech recognition system developed using the ARPA Wall Street Journal and North American Business databases. The recognizer uses a stack decoder-based search strategy with a left-to-right stochastic language model. This decoder has been shown to function effectively on 56K-word recognition of continuous speech. It operates left-to-right and can produce final textual output while continuing to accept additional input. The recognizer also features recognition-time adaptation to the user's voice. The new system showed a 48% reduction in the word error rate over the previously reported November 1992 system.
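As a rough illustration of the stack-decoder search strategy described above, the sketch below runs a best-first search over partial left-to-right word hypotheses held on a priority queue (the "stack"). It is a minimal sketch under stated assumptions only: `successor_words`, `acoustic_score`, and `lm_score` are hypothetical stand-ins for the real lexicon, acoustic models, and stochastic language model, and the pruning is far simpler than in the actual system.

```python
import heapq

# Hypothetical sketch of a best-first (stack) decoder: hypotheses are partial
# word sequences scored by combined acoustic + language-model log-probability.
# successor_words, acoustic_score, and lm_score are illustrative stand-ins.
def stack_decode(frames, successor_words, acoustic_score, lm_score, beam=100):
    # Each stack entry: (negated score, end frame, word sequence so far).
    stack = [(0.0, 0, ())]
    complete = []
    while stack:
        neg_score, t, words = heapq.heappop(stack)
        if t >= len(frames):                      # hypothesis spans all input
            complete.append((-neg_score, words))
            continue
        for word in successor_words(words):       # extend left-to-right
            for t_end in range(t + 1, len(frames) + 1):
                score = (-neg_score
                         + acoustic_score(frames[t:t_end], word)
                         + lm_score(words, word))
                heapq.heappush(stack, (-score, t_end, words + (word,)))
        # Prune: keep only the `beam` most promising partial hypotheses.
        stack = heapq.nsmallest(beam, stack)
        heapq.heapify(stack)
    return max(complete, default=(float("-inf"), ()))[1]
```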
Xavier Aubert, Philips GmbH Research Laboratories - Aachen
Hermann Ney, Aachen University of Technology (GERMANY)
We address the problem of using word graphs (or lattices) to integrate complex knowledge sources, such as long-span language models or acoustic cross-word models, in large vocabulary continuous speech recognition. A method for efficiently constructing a word graph is reviewed and two ways of exploiting it are presented. Under the word-pair approximation, a phrase-level search is possible, while in the other case a general graph decoder is set up. We show that the predecessor-word identity provided by a first bigram decoding can be used to constrain the word graph without impairing the next pass. This procedure has been applied to 64k-word trigram decoding in conjunction with an incremental unsupervised speaker adaptation scheme. Experimental results are given for the North American Business corpus used in the November '94 evaluation.
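As a loose illustration of how a word graph can be exploited by a longer-span model, the sketch below rescores a graph with a trigram by dynamic programming, summarizing the search state at each graph node by the last two words of the path (the same kind of bounded word history that motivates the word-pair approximation). The edge layout and `trigram_logprob` are assumptions for illustration, not the paper's algorithm.

```python
# Illustrative trigram rescoring of a word graph. Nodes are assumed to be
# topologically numbered (every edge satisfies start < end) and `edges` is
# sorted by start node; trigram_logprob is a hypothetical LM callback.
def rescore_graph(edges, n_nodes, trigram_logprob):
    """edges: list of (start_node, end_node, word, acoustic_logprob).
    Returns the highest-scoring word sequence from node 0 to n_nodes-1."""
    # DP state: (node, last two words) -> best (score, full word sequence).
    best = {(0, ("<s>", "<s>")): (0.0, ())}
    for start, end, word, ac in edges:
        for (node, hist), (score, seq) in list(best.items()):
            if node != start:
                continue
            new = (score + ac + trigram_logprob(hist[0], hist[1], word),
                   seq + (word,))
            key = (end, (hist[1], word))
            if key not in best or new[0] > best[key][0]:
                best[key] = new
    finals = [v for (node, _), v in best.items() if node == n_nodes - 1]
    return max(finals, default=(float("-inf"), ()))[1]
```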
P. Jeanrenaud, BBN Systems and Technologies (USA)
E. Eide, BBN Systems and Technologies (USA)
U. Chaudhari, BBN Systems and Technologies (USA)
J. McDonough, BBN Systems and Technologies (USA)
K. Ng, BBN Systems and Technologies (USA)
M. Siu, BBN Systems and Technologies (USA)
H. Gish, BBN Systems and Technologies (USA)
Speech recognition of conversational speech is a difficult task; performance on the Switchboard corpus had been in the vicinity of 70% word error rate. In this paper, we describe the results of applying a variety of modifications to our speech recognition system and show their impact on conversational-speech performance. These modifications include the use of more complex models, trigram language models, and cross-word triphone models. We also show the effect of additional acoustic training data on recognition performance. Finally, we present an approach to dealing with the abundance of short words, and examine how the variable speaking rate found in conversational speech affects performance. The current level of performance is in the vicinity of 50% word error rate, a significant improvement over previous levels.
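For readers unfamiliar with the trigram language models mentioned above, the following sketch shows one common way of building such a model by interpolating trigram, bigram, and unigram relative frequencies. The fixed interpolation weights and training details are illustrative assumptions, not BBN's configuration.

```python
from collections import Counter

# A minimal interpolated trigram language model; weights l1..l3 are
# illustrative and would normally be tuned on held-out data.
def train_trigram(sentences):
    uni, bi, tri = Counter(), Counter(), Counter()
    total = 0
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            uni[words[i]] += 1
            bi[(words[i - 1], words[i])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1
            total += 1

    def prob(w2, w1, w, l1=0.7, l2=0.2, l3=0.1):
        # Back off smoothly from trigram to bigram to unigram estimates.
        p3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
        p2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
        p1 = uni[w] / total if total else 0.0
        return l1 * p3 + l2 * p2 + l3 * p1

    return prob
```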
Ren-Yuan Lyu, National Taiwan University
Lee-Feng Chien, Academia Sinica
Shiao-Hong Hwang, National Taiwan University
Hung-Yun Hsieh, National Taiwan University
Rung-Chiuan Yang, National Taiwan University
Bo-Ren Bai, National Taiwan University
Jia-Chi Weng, National Taiwan University
Yen-Ju Yang, National Taiwan University
Shi-Wei Lin, National Taiwan University
Keh-Jiann Chen, Academia Sinica (REPUBLIC OF CHINA)
Chiu-Yu Tseng, Academia Sinica (REPUBLIC OF CHINA)
Lin-Shan Lee, Academia Sinica (REPUBLIC OF CHINA)
This paper presents a prototype prosodic-segment-based dictation machine for the Chinese language with a very large vocabulary. It accepts utterances that are continuous within a prosodic segment, which is composed of one or a few words. It also possesses various on-line learning capabilities for fast adaptation to a new user at the acoustic, lexical, and linguistic levels. The overall system is implemented on an IBM PC with an additional DSP card containing a Motorola DSP 96002 chip. Word accuracy reaches nearly 90% for a new user after he produces about 10 minutes of speech to train the system, and accuracy can be further improved with the on-line learning functions.
Hsin-min Wang, National Taiwan University
Jia-lin Shen, National Taiwan University
Yen-Ju Yang, National Taiwan University
Chiu-Yu Tseng, Academia Sinica
Lin-Shan Lee, National Taiwan University (REPUBLIC OF CHINA)
This paper presents the first known results for complete recognition of continuous Mandarin speech for the Chinese language with a very large vocabulary but very limited training data. Although some isolated-syllable-based or isolated-word-based large-vocabulary Mandarin speech recognition systems have been developed successfully, a continuous-speech system of this kind has never been reported before. Several techniques important to the successful development of this system are presented in this paper, including acoustic modeling with a set of sub-syllabic models for base syllable recognition and another set of context-dependent models for tone recognition, a multiple-candidate searching technique based on a concatenated syllable matching algorithm to synchronize base syllable and tone recognition, and a word-class-based Chinese language model for linguistic decoding. The best recognition accuracy achieved is 88.69% for finally decoded Chinese characters, with 88.69%, 91.57%, and 81.37% for base syllables, tones, and tonal syllables, respectively.
J.L. Gauvain, LIMSI-CNRS (FRANCE)
L. Lamel, LIMSI-CNRS (FRANCE)
M. Adda-Decker, LIMSI-CNRS (FRANCE)
In this paper we report on our recent development work in large vocabulary, American English continuous speech dictation. We have experimented with (1) alternative analyses for the acoustic front end, (2) the use of an enlarged vocabulary so as to reduce the number of errors due to out-of-vocabulary words, (3) extensions to the lexical representation, (4) the use of additional acoustic training data, and (5) modification of the acoustic models for telephone speech. The recognizer was evaluated on Hubs 1 and 2 of the fall 1994 ARPA NAB CSR Hub and Spoke Benchmark test. Experimental results for development and evaluation test data are given, as well as an analysis of the errors on the development data.
M.M. Hochberg, Cambridge University
S.J. Renals, University of Sheffield
A.J. Robinson, Cambridge University (ENGLAND)
G. D. Cook, Cambridge University (ENGLAND)
ABBOT is the hybrid connectionist-hidden Markov model (HMM) large-vocabulary continuous speech recognition (CSR) system developed at Cambridge University. The system uses a recurrent network to estimate the acoustic observation probabilities within an HMM framework. A major advantage of this approach is that good performance is achieved with context-independent acoustic models and many fewer parameters than comparable HMM systems require. This paper presents substantial performance improvements gained from new approaches to connectionist model combination and phone-duration modeling. Additional capability has been achieved by extending the decoder to handle larger-vocabulary tasks (20,000 words and greater) with a trigram language model. This paper describes the recent modifications to the system, and experimental results are reported for various test and development sets from the November 1992, 1993, and 1994 ARPA evaluations of spoken language systems.
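The observation-probability estimation that such hybrid systems rely on is often implemented by converting the network's per-frame phone posteriors into scaled likelihoods, dividing by the class priors. The sketch below shows that conversion in isolation; the posterior matrix is assumed to come from some frame-level classifier and is not ABBOT's recurrent network.

```python
import numpy as np

# Convert network posteriors P(q|x) into scaled likelihoods P(x|q)/P(x)
# by dividing by class priors P(q), a standard step in hybrid
# connectionist-HMM decoding. Floor values to avoid log(0).
def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """posteriors: (frames, phones) network outputs; priors: (phones,)
    relative class frequencies from the training alignment."""
    post = np.maximum(posteriors, floor)
    return np.log(post) - np.log(np.maximum(priors, floor))
```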
P. C. Woodland, Cambridge University (UK)
C. J. Leggetter, Cambridge University (UK)
J. J. Odell, Cambridge University (UK)
V. Valtchev, Cambridge University (UK)
S.J. Young, Cambridge University (UK)
This paper describes recent work on the HTK large vocabulary speech recognition system. The system uses tied-state cross-word context-dependent mixture Gaussian HMMs and a dynamic network decoder that can operate in a single pass. In the last year the decoder has been extended to produce word lattices to allow flexible and efficient system development, as well as multi-pass operation for use with computationally expensive acoustic and/or language models. The system vocabulary can now be up to 65k words, the final acoustic models have been extended to be sensitive to more acoustic context (quinphones), a 4-gram language model has been used and unsupervised incremental speaker adaptation incorporated. The resulting system gave the lowest error rates on both the H1-P0 and H1-C1 hub tasks in the November 1994 ARPA CSR evaluation.
Yuqing Gao, National University of Singapore (SINGAPORE)
Hsiao-Wuen Hon, National University of Singapore (SINGAPORE)
Zhiwei Lin, National University of Singapore (SINGAPORE)
Gareth Loudon, National University of Singapore (SINGAPORE)
S. Yogananthan, National University of Singapore (SINGAPORE)
Baosheng Yuan, National University of Singapore (SINGAPORE)
Text input for non-alphabetic languages, such as Chinese, has been a decades-long problem. Chinese dictation using large vocabulary speech recognition provides a convenient mode of text entry. In contrast to a character-based dictation system [5], a word-based Mandarin dictation system has been designed [3] (based on Apple's PlainTalk speech recognition technology [4]) for efficient entry of Chinese characters into a computer. In this paper new features and improvements to the dictation system are presented. The new features and improvements have produced an overall reduction in recognition error of 50-80%. The vocabulary has also been increased from 5,000 words to over 11,000 words.