Chair: John Bridle, Diagem (USA)
Tony Robinson, Cambridge University (UK)
Jeroen Fransen, Cambridge University (UK)
David Pye, Cambridge University (UK)
Jonathan Foote, Cambridge University (UK)
Steve Renals, Cambridge University (UK)
A significant new speech corpus of British English has been recorded at Cambridge University. Derived from the Wall Street Journal text corpus, WSJCAM0 constitutes one of the largest corpora of spoken British English currently in existence. It has been specifically designed for the construction and evaluation of speaker-independent speech recognition systems. The database consists of 140 speakers, each speaking about 110 utterances. This paper describes the motivation for the corpus, the processes undertaken in its construction, and the utilities needed as support tools. All utterance transcriptions have been verified, and a phonetic dictionary has been developed to cover the training data and evaluation tasks. Two evaluation tasks have been defined using standard 5,000-word bigram and 20,000-word trigram language models. The paper concludes with comparative results on these tasks for British and American English.
Yeshwant Muthusamy, Texas Instruments
Edward Holliman, Texas Instruments
Barbara Wheatley, Texas Instruments
Joseph Picone, Mississippi State University
John Godfrey, University of Pennsylvania (USA)
As part of the Polyphone project, Texas Instruments is in the process of collecting and developing a corpus of telephone speech in American Spanish. The corpus, called Voice Across Hispanic America (VAHA), will attempt to provide balanced phonetic coverage of the language, in addition to containing widely used vocabulary items such as digits, letter strings, yes/no responses, proper names, and selected command words and phrases used in automated telephone service applications. The speakers are native speakers of Spanish living in the United States. The collection and development of the corpus is expected to be completed by June 1995. So far, we have collected speech from about 500 speakers in various parts of the U.S. In this paper, we describe the design issues in various aspects of the project, such as subject recruitment, corpus and prompt sheet design, the data acquisition system, and validation and transcription. We conclude with a brief statistical profile of the data collected.
Yeonja Lim, ETRI (KOREA)
Youngjik Lee, ETRI (KOREA)
This paper proposes the concept of the POW (phonetically optimized words) set. To collect a speech database, all possible phonological phenomena should be included. In addition, it is preferable to have the same phonological distribution as general speech. For this purpose, we suggest a new algorithm for selecting a word set with the properties that (1) it includes all phonological events, (2) it has the minimal number of words, and (3) the phonological similarity between the POW set and the high-frequency word set is maximized. We extract the Korean POW set from the 50,000 highest-frequency words of a three-million-word text corpus. The POW set is much more similar to the high-frequency word set than the PBW (phonetically balanced words) set, while using fewer words.
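The selection problem described above is a weighted set-cover problem, and a natural baseline is a greedy pass that always takes the word covering the most still-uncovered phonological units, breaking ties in favor of high-frequency words. The sketch below is illustrative only (the names `select_pow`, `units_of`, and `freq` are assumptions, not the paper's algorithm):

```python
# Hypothetical greedy sketch of POW-style word selection: repeatedly pick
# the word that covers the most not-yet-covered phonological units, and
# prefer higher-frequency words on ties so the selected set stays close
# to the high-frequency distribution.
def select_pow(words, units_of, freq):
    """words: candidate word list; units_of(w): set of phonological units
    occurring in w; freq: dict mapping word -> corpus frequency."""
    uncovered = set().union(*(units_of(w) for w in words))
    chosen = []
    while uncovered:
        # Rank by (newly covered units, frequency); max picks the best pair.
        best = max(words, key=lambda w: (len(units_of(w) & uncovered), freq[w]))
        chosen.append(best)
        uncovered -= units_of(best)
        words = [w for w in words if w != best]
    return chosen
```

A real POW selection would also score distributional similarity against the high-frequency set rather than using frequency only as a tie-breaker; the loop structure, however, stays the same.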
Xuedong Huang, Microsoft Corporation (USA)
Alex Acero, Microsoft Corporation (USA)
Fil Alleva, Microsoft Corporation (USA)
Mei-Yuh Hwang, Microsoft Corporation (USA)
Li Jiang, Microsoft Corporation (USA)
Milind Mahajan, Microsoft Corporation (USA)
Since January 1993, we have been working to refine and extend Sphinx-II technologies in order to develop practical speech recognition at Microsoft. The result of that work is Whisper (the Windows Highly Intelligent Speech Recognizer). Whisper offers significantly improved recognition efficiency, usability, and accuracy compared with the Sphinx-II system. In addition, Whisper offers speech input capabilities for Microsoft Windows and can be scaled to meet different PC platform configurations. It provides features such as continuous speech recognition, speaker independence, on-line adaptation, noise robustness, dynamic vocabularies, and grammars. For typical Windows command-and-control applications (fewer than 1,000 words), Whisper provides a software-only solution on PCs equipped with a 486DX, 4 MB of memory, a standard sound card, and a desktop microphone.
L. Mayfield, Carnegie Mellon University (USA)
M. Gavalda, Carnegie Mellon University (USA)
W. Ward, Carnegie Mellon University (USA)
A. Waibel, Carnegie Mellon University (USA)
As part of the JANUS speech-to-speech translation project, we have developed a robust translation system based on the information structures inherent to the task being performed. The basic premise is that the structure of the information to be transmitted is largely independent of the language used to encode it. Our system performs no syntactic analysis; speaker utterances are parsed into semantic chunks, which can be strung together without grammatical rules, and passed through a simple template-based translation module. We have achieved encouraging coverage rates on English, German, and Spanish input with English, German, and Spanish output.
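A minimal sketch of this style of translation, assuming a toy chunk inventory (the patterns and templates below are invented for illustration and are not the JANUS grammar): each semantic chunk that matches the utterance maps directly to a target-language template, and the templates are concatenated with no syntactic analysis.

```python
# Illustrative template-based chunk translation (not the JANUS system).
# Hypothetical inventory: surface pattern -> (chunk label, English template).
CHUNKS = {
    "am dienstag": ("day", "on Tuesday"),
    "um zwei":     ("time", "at two o'clock"),
    "geht es":     ("query-ok", "is it possible"),
}

def translate(utterance):
    """Emit the template of every chunk found in the utterance, in
    inventory order, with no grammatical rules joining them."""
    out = []
    for pattern, (label, template) in CHUNKS.items():
        if pattern in utterance.lower():
            out.append(template)
    return " ".join(out)
```

Because the information structure, not the syntax, drives the mapping, the same chunk inventory can front templates for several target languages at once.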
John F. Pitrelli, NYNEX Science & Technology Inc. (USA)
Cynthia Fong, NYNEX Science & Technology Inc. (USA)
Suk H. Wong, NYNEX Science & Technology Inc. (USA)
Judith R. Spitz, NYNEX Science & Technology Inc. (USA)
Hong C. Leung, NYNEX Science & Technology Inc. (USA)
We describe a phonetically-rich isolated-word telephone-speech database, PhoneBook, which was collected because of (1) the lack of available large-vocabulary isolated-word data, (2) anticipated continued importance of isolated-word recognition to speech-based applications over the telephone, and (3) findings that continuous-speech training data is inferior to isolated-word training for isolated-word recognition. PhoneBook has 8000 distinct words, selected for complete coverage of phoneme contexts enumerated using both triphones and a novel method which takes into account syllable position, lexical stress, and non-adjacent-phoneme coarticulatory effects. PhoneBook consists of more than 92,000 utterances, from over 1300 native speakers of American English reading lists of 75 words. This paper describes the word list design, talker enrollment procedure, recording procedure and equipment, utterance verification method, and summary statistics for PhoneBook, which will be made available through the Linguistic Data Consortium.
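Coverage accounting of the kind used for such word-list design can be sketched with plain triphone counting; the boundary-padding convention and function names below are assumptions for illustration, not the PhoneBook procedure (which additionally tracks syllable position, stress, and non-adjacent coarticulation):

```python
# Minimal sketch of triphone coverage accounting for word-list selection.
def triphones(phones):
    """Return the set of phone triples in one word's pronunciation,
    padded with '#' word-boundary markers."""
    seq = ["#"] + phones + ["#"]
    return {tuple(seq[i:i + 3]) for i in range(len(seq) - 2)}

def coverage(lexicon):
    """lexicon: word -> phone list. Return all distinct triphones covered."""
    covered = set()
    for phones in lexicon.values():
        covered |= triphones(phones)
    return covered
```

Selecting words to maximize the growth of this covered set is what yields complete phoneme-context coverage from a finite vocabulary.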
Kathy L. Brown, Lockheed Sanders Inc. (USA)
E. Bryan George, Lockheed Sanders Inc. (USA)
This paper will report on techniques used in the generation of a continuous-speech, multi-speaker, cellular-bandwidth database and describe its application to automatic speech recognition in the cellular environment. CTIMIT (cellular TIMIT) has been generated by transmitting the TIMIT speech database over the cellular network. We will describe the preliminary collection of the CTIMIT database and report on several studies designed to test the utility of the database in a phoneme recognition task. Two HMM-based phoneme recognizers were trained using utterances drawn from the TIMIT database and the CTIMIT database, respectively. Each recognizer was then tested using the test utterances from CTIMIT. Phoneme recognition accuracy for the TIMIT-trained recognizer dropped 58% from its baseline performance on TIMIT test utterances. By comparison, phoneme recognition accuracy of the CTIMIT-trained recognizer was 82% higher than that of the TIMIT-trained recognizer.
Paul Duchnowski, University of Karlsruhe (GERMANY)
Martin Hunke, Carnegie Mellon University (USA)
Dietrich Busching, University of Karlsruhe (GERMANY)
Uwe Meier, University of Karlsruhe (GERMANY)
Alex Waibel, Carnegie Mellon University (USA)
We present the development of a modular system for flexible human-computer interaction via speech. The speech recognition component integrates acoustic and visual information (automatic lip-reading), improving overall recognition, especially in noisy environments. The image of the lips, constituting the visual input, is automatically extracted from the camera picture of the speaker's face by the lip locator module. Finally, the face tracker sub-system automatically acquires and follows the speaker's face. Integration of the three functions results in the first bi-modal speech recognizer allowing the speaker reasonable freedom of movement within a possibly noisy room while continuing to communicate with the computer via voice. Compared to audio-alone recognition, the combined system achieves a 20 to 50 percent error rate reduction for various signal/noise conditions.
V.M. Jimenez, Universidad Politecnica de Valencia (SPAIN)
A. Castellanos, Universidad Politecnica de Valencia (SPAIN)
E. Vidal, Universidad Politecnica de Valencia (SPAIN)
The problems of limited-domain spoken language translation and understanding are considered. A standard continuous speech recognizer is extended to use automatically learned finite-state transducers as translation models. Understanding is treated as a particular case of translation in which the target language is a formal language. Of the different approaches compared, the best results are obtained with a fully integrated approach, in which the acoustic and lexical models of the input language, and (N-gram) language models of the input and output languages, are embedded into the learned transducers. An optimal search through this global network yields the best translation for a given input acoustic signal.
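The search at the core of this approach can be sketched as Viterbi-style dynamic programming over a weighted transducer whose arcs carry (input word, output word) pairs; the toy transducer and the function name `best_translation` below are assumptions for illustration, standing in for the learned models and the acoustic front end of the paper:

```python
# Hedged sketch of translation by best-path search through a finite-state
# transducer. Each arc is (src_state, input_word, output_word, dst_state, prob);
# an empty output_word emits nothing. Beams keep only the best hypothesis
# per state, as in Viterbi decoding.
import math

def best_translation(arcs, start, finals, sentence):
    """Return the highest-probability output word string for the given
    input word sequence, or None if no path accepts it."""
    beams = {start: (0.0, [])}  # state -> (log-prob, output words so far)
    for word in sentence:
        nxt = {}
        for (src, inp, out, dst, p) in arcs:
            if inp == word and src in beams:
                lp, outs = beams[src]
                cand = (lp + math.log(p), outs + ([out] if out else []))
                if dst not in nxt or cand[0] > nxt[dst][0]:
                    nxt[dst] = cand  # keep only the best path into dst
        beams = nxt
    done = [(lp, outs) for s, (lp, outs) in beams.items() if s in finals]
    return " ".join(max(done)[1]) if done else None
```

In the fully integrated system the same network also scores acoustics and N-gram language-model probabilities on its arcs, so this one search delivers recognition and translation jointly.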
Nam-Yong Han, ETRI (KOREA)
Hoi-Rin Kim, ETRI (KOREA)
Kyu-Woong Hwang, ETRI (KOREA)
Young-Mok Ahn, ETRI (KOREA)
Joon-Hyung Ryoo, ETRI (KOREA)
This paper describes a Korean continuous speech recognition system for automatic interpretation, using a phone-based Semi-Continuous Hidden Markov Model (SCHMM) method. The task domain is hotel reservation. The system has the following three features. First, an embedded bootstrapping training method is used, which enables us to train each phone model without a phoneme-segmented database. Second, a hybrid estimation method composed of the forward-backward algorithm and the Viterbi algorithm is proposed for HMM parameter estimation. Third, a between-word modeling technique is used at function word boundaries. In speaker-independent experiments, continuous speech recognition accuracy is 89.1% for Version 1 and 97.6% for Version 2.