Chair: John Bridle, Diagem (USA)
Tony Robinson, Cambridge University (UK)
Jeroen Fransen, Cambridge University (UK)
David Pye, Cambridge University (UK)
Jonathan Foote, Cambridge University (UK)
Steve Renals, Cambridge University (UK)
A significant new speech corpus of British English has been recorded at Cambridge University. Derived from the Wall Street Journal text corpus, WSJCAM0 constitutes one of the largest corpora of spoken British English currently in existence. It has been specifically designed for the construction and evaluation of speaker-independent speech recognition systems. The database consists of 140 speakers, each speaking about 110 utterances. This paper describes the motivation for the corpus, the processes undertaken in its construction, and the utilities needed as support tools. All utterance transcriptions have been verified, and a phonetic dictionary has been developed to cover the training data and evaluation tasks. Two evaluation tasks have been defined using standard 5,000-word bigram and 20,000-word trigram language models. The paper concludes with comparative results on these tasks for British and American English.
Yeshwant Muthusamy, Texas Instruments
Edward Holliman, Texas Instruments
Barbara Wheatley, Texas Instruments
Joseph Picone, Mississippi State University
John Godfrey, University of Pennsylvania (USA)
As part of the Polyphone project, Texas Instruments is in the process of collecting and developing a corpus of telephone speech in American Spanish. The corpus, called Voice Across Hispanic America (VAHA), will attempt to provide balanced phonetic coverage of the language, in addition to containing widely used vocabulary items such as digits, letter strings, yes/no responses, proper names, and selected command words and phrases used in automated telephone service applications. The speakers are native speakers of Spanish living in the United States. The collection and development of the corpus is expected to be completed by June 1995. So far, we have collected speech from about 500 speakers in various parts of the U.S. In this paper, we describe the design issues in various aspects of the project, such as subject recruitment, corpus and prompt sheet design, the data acquisition system, and validation and transcription. We conclude with a brief statistical profile of the data collected.
Yeonja Lim, ETRI (KOREA)
Youngjik Lee, ETRI (KOREA)
This paper proposes the concept of the POW (phonetically optimized words) set. To collect a speech database, all possible phonological phenomena should be included. In addition, it is preferable to have the same phonological distribution as general speech. For this purpose, we suggest a new algorithm for selecting a word set with the properties that (1) it includes all phonological events, (2) it has the minimal number of words, and (3) the phonological similarity between the POW set and the high-frequency word set is maximized. We extract the Korean POW set from the 50,000 highest-frequency words of a three-million-word text corpus. The POW set is much more similar to the high-frequency word set than the PBW (phonetically balanced words) set, while using fewer words.
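The selection problem described above is a weighted set-cover problem, and a natural baseline is a greedy pass that always takes the word covering the most still-uncovered phonological units, breaking ties in favor of high-frequency words. The sketch below is illustrative only (the names `select_pow`, `units_of`, and `freq` are assumptions, not the paper's algorithm):

```python
# Hypothetical greedy sketch of POW-style word selection: repeatedly pick
# the word that covers the most not-yet-covered phonological units, and
# prefer higher-frequency words on ties so the selected set stays close
# to the high-frequency distribution.
def select_pow(words, units_of, freq):
    """words: candidate word list; units_of(w): set of phonological units
    occurring in w; freq: dict mapping word -> corpus frequency."""
    uncovered = set().union(*(units_of(w) for w in words))
    chosen = []
    while uncovered:
        # Rank by (newly covered units, frequency); max picks the best pair.
        best = max(words, key=lambda w: (len(units_of(w) & uncovered), freq[w]))
        chosen.append(best)
        uncovered -= units_of(best)
        words = [w for w in words if w != best]
    return chosen
```

A real POW selection would also score distributional similarity against the high-frequency set rather than using frequency only as a tie-breaker; the loop structure, however, stays the same.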
Xuedong Huang, Microsoft Corporation (USA)
Alex Acero, Microsoft Corporation (USA)
Fil Alleva, Microsoft Corporation (USA)
Mei-Yuh Hwang, Microsoft Corporation (USA)
Li Jiang, Microsoft Corporation (USA)
Milind Mahajan, Microsoft Corporation (USA)
Since January 1993, we have been working to refine and extend Sphinx-II technologies in order to develop practical speech recognition at Microsoft. The result of that work is Whisper (the Windows Highly Intelligent Speech Recognizer). Whisper offers significantly improved recognition efficiency, usability, and accuracy compared with the Sphinx-II system. In addition, Whisper offers speech input capabilities for Microsoft Windows and can be scaled to meet different PC platform configurations. It provides features such as continuous speech recognition, speaker independence, on-line adaptation, noise robustness, dynamic vocabularies, and grammars. For typical Windows command-and-control applications (fewer than 1,000 words), Whisper provides a software-only solution on PCs equipped with a 486DX, 4 MB of memory, a standard sound card, and a desktop microphone.
L. Mayfield, Carnegie Mellon University (USA)
M. Gavalda, Carnegie Mellon University (USA)
W. Ward, Carnegie Mellon University (USA)
A. Waibel, Carnegie Mellon University (USA)
As part of the JANUS speech-to-speech translation project, we have developed a robust translation system based on the information structures inherent to the task being performed. The basic premise is that the structure of the information to be transmitted is largely independent of the language used to encode it. Our system performs no syntactic analysis; speaker utterances are parsed into semantic chunks, which can be strung together without grammatical rules, and passed through a simple template-based translation module. We have achieved encouraging coverage rates on English, German, and Spanish input with English, German, and Spanish output.
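A minimal sketch of this style of translation, assuming a toy chunk inventory (the patterns and templates below are invented for illustration and are not the JANUS grammar): each semantic chunk that matches the utterance maps directly to a target-language template, and the templates are concatenated with no syntactic analysis.

```python
# Illustrative template-based chunk translation (not the JANUS system).
# Hypothetical inventory: surface pattern -> (chunk label, English template).
CHUNKS = {
    "am dienstag": ("day", "on Tuesday"),
    "um zwei":     ("time", "at two o'clock"),
    "geht es":     ("query-ok", "is it possible"),
}

def translate(utterance):
    """Emit the template of every chunk found in the utterance, in
    inventory order, with no grammatical rules joining them."""
    out = []
    for pattern, (label, template) in CHUNKS.items():
        if pattern in utterance.lower():
            out.append(template)
    return " ".join(out)
```

Because the information structure, not the syntax, drives the mapping, the same chunk inventory can front templates for several target languages at once.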
John F. Pitrelli, NYNEX Science & Technology Inc. (USA)
Cynthia Fong, NYNEX Science & Technology Inc. (USA)
Suk H. Wong, NYNEX Science & Technology Inc. (USA)
Judith R. Spitz, NYNEX Science & Technology Inc. (USA)
Hong C. Leung, NYNEX Science & Technology Inc. (USA)
We describe a phonetically-rich isolated-word telephone-speech database, PhoneBook, which was collected because of (1) the lack of available large-vocabulary isolated-word data, (2) anticipated continued importance of isolated-word recognition to speech-based applications over the telephone, and (3) findings that continuous-speech training data is inferior to isolated-word training for isolated-word recognition. PhoneBook has 8000 distinct words, selected for complete coverage of phoneme contexts enumerated using both triphones and a novel method which takes into account syllable position, lexical stress, and non-adjacent-phoneme coarticulatory effects. PhoneBook consists of more than 92,000 utterances, from over 1300 native speakers of American English reading lists of 75 words. This paper describes the word list design, talker enrollment procedure, recording procedure and equipment, utterance verification method, and summary statistics for PhoneBook, which will be made available through the Linguistic Data Consortium.
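Coverage accounting of the kind used for such word-list design can be sketched with plain triphone counting; the boundary-padding convention and function names below are assumptions for illustration, not the PhoneBook procedure (which additionally tracks syllable position, stress, and non-adjacent coarticulation):

```python
# Minimal sketch of triphone coverage accounting for word-list selection.
def triphones(phones):
    """Return the set of phone triples in one word's pronunciation,
    padded with '#' word-boundary markers."""
    seq = ["#"] + phones + ["#"]
    return {tuple(seq[i:i + 3]) for i in range(len(seq) - 2)}

def coverage(lexicon):
    """lexicon: word -> phone list. Return all distinct triphones covered."""
    covered = set()
    for phones in lexicon.values():
        covered |= triphones(phones)
    return covered
```

Selecting words to maximize the growth of this covered set is what yields complete phoneme-context coverage from a finite vocabulary.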
Kathy L. Brown, Lockheed Sanders Inc. (USA)
E. Bryan George, Lockheed Sanders Inc. (USA)
This paper will report on techniques used in the generation of a continuous-speech, multi-speaker, cellular-bandwidth database and describe its application to automatic speech recognition in the cellular environment. CTIMIT (cellular TIMIT) has been generated by transmitting the TIMIT speech database over the cellular network. We will describe the preliminary collection of the CTIMIT database and report on several studies designed to test the utility of the database in a phoneme recognition task. Two HMM-based phoneme recognizers were trained using utterances drawn from the TIMIT database and the CTIMIT database, respectively. Each recognizer was then tested using the test utterances from CTIMIT. Phoneme recognition accuracy for the TIMIT-trained recognizer dropped 58% from its baseline performance on TIMIT test utterances. By comparison, phoneme recognition accuracy of the CTIMIT-trained recognizer was 82% higher than that of the TIMIT-trained recognizer.
Paul Duchnowski, University of Karlsruhe (GERMANY)
Martin Hunke, Carnegie Mellon University (USA)
Dietrich Busching, University of Karlsruhe (GERMANY)
Uwe Meier, University of Karlsruhe (GERMANY)
Alex Waibel, Carnegie Mellon University (USA)
We present the development of a modular system for flexible human-computer interaction via speech. The speech recognition component integrates acoustic and visual information (automatic lip-reading), improving overall recognition, especially in noisy environments. The image of the lips, constituting the visual input, is automatically extracted from the camera picture of the speaker's face by the lip locator module. Finally, the face tracker sub-system automatically acquires and follows the speaker's face. Integration of the three functions results in the first bi-modal speech recognizer allowing the speaker reasonable freedom of movement within a possibly noisy room while continuing to communicate with the computer via voice. Compared to audio-alone recognition, the combined system achieves a 20 to 50 percent error rate reduction for various signal/noise conditions.
V.M. Jimenez, Universidad Politecnica de Valencia (SPAIN)
A. Castellanos, Universidad Politecnica de Valencia (SPAIN)
E. Vidal, Universidad Politecnica de Valencia (SPAIN)
The problems of limited-domain spoken language translation and understanding are considered. A standard continuous speech recognizer is extended to use automatically learned finite-state transducers as translation models. Understanding is treated as a particular case of translation in which the target language is a formal language. Of the different approaches compared, the best results are obtained with a fully integrated approach, in which the acoustic and lexical models of the input language, and (N-gram) language models of the input and output languages, are embedded into the learned transducers. An optimal search through this global network yields the best translation for a given input acoustic signal.
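The search at the core of this approach can be sketched as Viterbi-style dynamic programming over a weighted transducer whose arcs carry (input word, output word) pairs; the toy transducer and the function name `best_translation` below are assumptions for illustration, standing in for the learned models and the acoustic front end of the paper:

```python
# Hedged sketch of translation by best-path search through a finite-state
# transducer. Each arc is (src_state, input_word, output_word, dst_state, prob);
# an empty output_word emits nothing. Beams keep only the best hypothesis
# per state, as in Viterbi decoding.
import math

def best_translation(arcs, start, finals, sentence):
    """Return the highest-probability output word string for the given
    input word sequence, or None if no path accepts it."""
    beams = {start: (0.0, [])}  # state -> (log-prob, output words so far)
    for word in sentence:
        nxt = {}
        for (src, inp, out, dst, p) in arcs:
            if inp == word and src in beams:
                lp, outs = beams[src]
                cand = (lp + math.log(p), outs + ([out] if out else []))
                if dst not in nxt or cand[0] > nxt[dst][0]:
                    nxt[dst] = cand  # keep only the best path into dst
        beams = nxt
    done = [(lp, outs) for s, (lp, outs) in beams.items() if s in finals]
    return " ".join(max(done)[1]) if done else None
```

In the fully integrated system the same network also scores acoustics and N-gram language-model probabilities on its arcs, so this one search delivers recognition and translation jointly.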
Nam-Yong Han, ETRI (KOREA)
Hoi-Rin Kim, ETRI (KOREA)
Kyu-Woong Hwang, ETRI (KOREA)
Young-Mok Ahn, ETRI (KOREA)
Joon-Hyung Ryoo, ETRI (KOREA)
This paper describes a Korean continuous speech recognition system for automatic interpretation, using a phone-based Semi-Continuous Hidden Markov Model (SCHMM) method. The task domain is hotel reservation. The system has the following three features. First, an embedded bootstrapping training method is used, which enables us to train each phone model without a phoneme-segmented database. Second, a hybrid estimation method composed of the forward-backward algorithm and the Viterbi algorithm is proposed for HMM parameter estimation. Third, a between-word modeling technique is used at function word boundaries. In speaker-independent experiments, continuous speech recognition accuracy is 89.1% for Version 1 and 97.6% for Version 2.