Session: SPEECH-L12
Time: 3:30 - 5:30, Friday, May 11, 2001
Location: Room 151
Title: Acoustic and Lexical Modeling
Chair: Rick Rose

3:30, SPEECH-L12.1
AUTOMATIC GENERATION AND SELECTION OF MULTIPLE PRONUNCIATIONS FOR DYNAMIC VOCABULARIES
S. DELIGNE, B. MAISON, R. GOPINATH
In this paper, we present a new scheme for the acoustic modeling of speech recognition applications requiring dynamic vocabularies. It applies especially to the acoustic modeling of out-of-vocabulary words which need to be added to a recognition lexicon based on the observation of a few (say one or two) speech utterances of these words. Standard approaches to this problem derive a single pronunciation from each speech utterance by combining acoustic and phone transition scores. In our scheme, multiple pronunciations are generated from each speech utterance of a word to enroll by varying the relative weights assigned to the acoustic and phone transition models. In our experiments, the use of these multiple baseforms dramatically outperforms the standard approach, with a relative decrease in word error rate ranging from 20% to 40% on all our test sets.
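The weight-varying idea the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the authors' system: the candidate set and the `acoustic_score` / `phone_lm_score` functions are hypothetical stand-ins for a real phone decoder.

```python
# Hedged sketch: derive multiple baseforms from one enrollment utterance
# by sweeping the relative weight between the acoustic model and the
# phone transition (phone-LM) model. Scoring functions are assumptions.

def decode(utterance, candidates, acoustic_score, phone_lm_score, w):
    """Return the best candidate pronunciation under the interpolated
    score w * acoustic + (1 - w) * phone-LM."""
    return max(candidates,
               key=lambda p: w * acoustic_score(utterance, p)
                             + (1 - w) * phone_lm_score(p))

def multiple_baseforms(utterance, candidates, acoustic_score,
                       phone_lm_score, weights=(0.3, 0.5, 0.7, 0.9)):
    """Collect the distinct 1-best pronunciations across a grid of
    acoustic/phone-LM weights; each distinct output becomes a baseform."""
    seen = []
    for w in weights:
        best = decode(utterance, candidates, acoustic_score,
                      phone_lm_score, w)
        if best not in seen:
            seen.append(best)
    return seen
```

Different weights favor acoustically likely versus phonotactically likely phone strings, so the sweep naturally yields a small set of plausible pronunciation variants.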

3:50, SPEECH-L12.2
AUTOMATIC GENERATION OF PRONUNCIATION LEXICONS FOR MANDARIN SPONTANEOUS SPEECH
V. VENKATARAMANI, W. BYRNE, T. KAMM, T. ZHENG, Z. SONG, P. FUNG, L. YI, U. RUHI
Pronunciation modeling for large vocabulary speech recognition attempts to improve recognition accuracy by identifying and modeling pronunciations that are not in the ASR system's pronunciation lexicon. Pronunciation variability in spontaneous Mandarin is studied using the newly created CASS corpus of phonetically annotated spontaneous speech. Pronunciation modeling techniques developed for English are applied to this corpus to train pronunciation models which are then used for Mandarin Broadcast News transcription.

4:10, SPEECH-L12.3
SUB-LEXICAL MODELLING USING A FINITE STATE TRANSDUCER FRAMEWORK
X. MOU, V. ZUE
The finite state transducer (FST) approach has recently been widely used as an effective and flexible framework for speech systems. In this framework, a speech recognizer is represented as the composition of a series of FSTs combining various knowledge sources across sub-lexical and high-level linguistic layers. In this paper, we use this FST framework to explore some sub-lexical modelling approaches, and propose a hybrid model that combines an ANGIE morpho-phonemic model with a lexicon-based phoneme network model. These sub-lexical models are converted to FST representations and can be conveniently composed to build the recognizer. Our preliminary perplexity experiments show that the proposed hybrid model has the advantage of imposing strong constraints on in-vocabulary words as well as providing detailed sub-lexical syllabification and morphology analysis of out-of-vocabulary (OOV) words. Thus it has the potential to offer good performance and can better handle the OOV problem in speech recognition.
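The core operation this framework relies on is transducer composition, where the output symbols of one FST feed the input of the next. A minimal sketch for epsilon-free transducers, using an illustrative tuple representation rather than any real FST toolkit's API:

```python
# Hedged sketch: composition of two epsilon-free finite-state transducers,
# the operation used to cascade sub-lexical models into one recognizer
# search space. The (start, finals, arcs) representation is illustrative.

def compose(t1, t2):
    """Compose transducers given as (start_state, final_states, arcs),
    where each arc is (src, in_sym, out_sym, dst).
    The output alphabet of t1 must match the input alphabet of t2."""
    s1, f1, a1 = t1
    s2, f2, a2 = t2
    arcs = []
    for (p, i, m, q) in a1:
        for (r, j, o, s) in a2:
            if m == j:  # t1's output symbol matches t2's input symbol
                arcs.append(((p, r), i, o, (q, s)))
    finals = {(x, y) for x in f1 for y in f2}
    return ((s1, s2), finals, arcs)
```

In a real system each layer (morpho-phonemic rules, phoneme network, lexicon) is one such transducer, and composition with epsilon handling and weights is done by an optimized library rather than this pairwise loop.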

4:30, SPEECH-L12.4
WHAT KIND OF PRONUNCIATION VARIATION IS HARD FOR TRIPHONES TO MODEL?
D. JURAFSKY, W. WARD, J. ZHANG, K. HEROLD, X. YU, S. ZHANG
In order to help understand why gains in pronunciation modeling have proven so elusive, we investigated which kinds of pronunciation variation are well captured by triphone models, and which are not. We do this by examining the change in behavior of a recognizer as it receives further triphone training. We show that many of the kinds of variation which previous pronunciation models attempt to capture, including phone substitution and phone reduction, are in fact already well captured by triphones. Our analysis suggests new areas where future pronunciation models should focus, including syllable deletion.

4:50, SPEECH-L12.5
A BOOTSTRAP METHOD FOR CHINESE NEW WORDS EXTRACTION
S. HE, J. ZHU
A bootstrap approach for extracting unknown words from a Chinese text corpus is proposed in this paper. Instead of using a non-iterative segmentation-detection approach, the proposed method iteratively extracts the new words and adds them to the lexicon. The augmented dictionary, which includes potential unknown words in addition to known words, is then used in the next iteration to re-segment the input corpus until stopping conditions are reached. Experiments show that both the precision and recall rates of segmentation are improved.
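The iterate-until-convergence loop described above can be sketched generically. The `segment` and `detect_new_words` callables stand in for the paper's actual segmenter and new-word detector, which are not specified here:

```python
# Hedged sketch: the bootstrap loop for new-word extraction. Segment the
# corpus with the current lexicon, harvest candidate new words, augment
# the lexicon, and re-segment until no new words appear (or max_iter).

def bootstrap_extract(corpus, lexicon, segment, detect_new_words,
                      max_iter=10):
    """Return the lexicon augmented with words extracted iteratively.
    `segment(corpus, lexicon)` -> token list;
    `detect_new_words(tokens)` -> set of candidate words."""
    lexicon = set(lexicon)
    for _ in range(max_iter):
        tokens = segment(corpus, lexicon)
        new = detect_new_words(tokens) - lexicon
        if not new:  # stopping condition: nothing new extracted
            break
        lexicon |= new
    return lexicon
```

The key point of the bootstrap is that words found in one pass change the segmentation in the next pass, which can in turn expose further unknown words.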

5:10, SPEECH-L12.6
KANJI-TO-HIRAGANA CONVERSION BASED ON A LANGUAGE MODEL
W. CHANG
In speech recognition systems, a common problem is transcription of new additions to the recognition lexicon into their phonetic symbols. Specific to the Japanese language, such a problem can be dealt with in two steps. In this paper, we focus on the first step, in which the new lexical entry is converted into a set of hiragana syllabaries, which is almost a phonetic transcription. We propose a conversion scheme which yields the most likely hiragana syllabaries, based on a language model. Results from our evaluations on three test sets are also reported. Although the study is conducted on Japanese only, our approach has applications to Chinese.
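Choosing the most likely hiragana reading with a language model can be sketched as a Viterbi search over per-character candidate readings. The bigram scoring function and reading dictionary below are illustrative assumptions, not the paper's model:

```python
# Hedged sketch: pick the most likely hiragana reading for a kanji string
# by Viterbi search over candidate readings, scored with a bigram model.
# `readings` maps each kanji to its candidate readings; `bigram_logp`
# returns log P(next_reading | previous_reading). Both are assumptions.

def best_reading(chars, readings, bigram_logp, start="<s>"):
    """Return the highest-scoring sequence of readings for `chars`."""
    # beams: last reading -> (log-prob, reading sequence so far)
    beams = {start: (0.0, [])}
    for ch in chars:
        nxt = {}
        for cand in readings[ch]:
            # best way to reach `cand` from any previous reading
            score, seq = max(
                (lp + bigram_logp(prev, cand), seq)
                for prev, (lp, seq) in beams.items()
            )
            nxt[cand] = (score, seq + [cand])
        beams = nxt
    return max(beams.values())[1]
```

Because many kanji have several readings, the language model is what disambiguates between locally plausible but globally unlikely reading sequences.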