3:30, SPEECH-L12.1
AUTOMATIC GENERATION AND SELECTION OF MULTIPLE PRONUNCIATIONS FOR DYNAMIC VOCABULARIES
S. DELIGNE, B. MAISON, R. GOPINATH
In this paper, we present a new scheme for the acoustic modeling of speech recognition applications requiring dynamic vocabularies. It applies especially to the acoustic modeling of out-of-vocabulary words which need to be added to a recognition lexicon based on the observation of a few (say one or two) speech utterances of these words. Standard approaches to this problem derive a single pronunciation from each speech utterance by combining acoustic and phone transition scores. In our scheme, multiple pronunciations are generated from each speech utterance of a word to be enrolled by varying the relative weights assigned to the acoustic and phone transition models. In our experiments, the use of these multiple baseforms dramatically outperforms the standard approach, with a relative decrease of the word error rate ranging from 20% to 40% on all our test sets.
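A minimal sketch of the weight-sweep idea described above, assuming a toy phone decoder (the functions acoustic_logprob and transition_logprob and the brute-force search are illustrative placeholders, not the authors' implementation):

import math
from itertools import product

PHONES = ["d", "t", "ey", "ay", "dx"]

def acoustic_logprob(phone, frame):
    """Placeholder frame-level acoustic score; a real system would use HMM/GMM or NN scores."""
    return -abs(hash((phone, frame)) % 7) / 3.0  # arbitrary values so the example runs end to end

def transition_logprob(prev, cur):
    """Placeholder phone-bigram (phonotactic) score."""
    return -abs(hash((prev, cur)) % 5) / 2.0

def decode(frames, weight):
    """Return the best phone sequence when transition scores are scaled by `weight`."""
    best_seq, best_score = None, -math.inf
    # brute-force over short candidate sequences for clarity; a real decoder uses Viterbi search
    for seq in product(PHONES, repeat=len(frames)):
        score = sum(acoustic_logprob(p, f) for p, f in zip(seq, frames))
        score += weight * sum(transition_logprob(a, b) for a, b in zip(("sil",) + seq, seq))
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq

def multiple_baseforms(frames, weights=(0.0, 0.5, 1.0, 2.0, 4.0)):
    """Collect the distinct pronunciations produced across the weight sweep."""
    return {decode(frames, w) for w in weights}

print(multiple_baseforms(frames=[0, 1, 2]))

Each weight setting can yield a different best baseform, and the set of distinct results is what gets added to the lexicon for the enrolled word.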
3:50, SPEECH-L12.2
AUTOMATIC GENERATION OF PRONUNCIATION LEXICONS FOR MANDARIN SPONTANEOUS SPEECH
V. VENKATARAMANI, W. BYRNE, T. KAMM, T. ZHENG, Z. SONG, P. FUNG, L. YI, U. RUHI
Pronunciation modeling for large vocabulary speech recognition attempts to improve recognition accuracy by identifying and modeling pronunciations that are not in the ASR system's pronunciation lexicon. Pronunciation variability in spontaneous Mandarin is studied using the newly created CASS corpus of phonetically annotated spontaneous speech. Pronunciation modeling techniques developed for English are applied to this corpus to train pronunciation models, which are then used for Mandarin Broadcast News transcription.
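As a rough illustration of one common way such pronunciation models are trained (an assumption about the general approach, not the authors' exact method), the sketch below estimates per-word surface-pronunciation probabilities from annotated (word, pronunciation) tokens; the data are invented placeholders:

from collections import Counter, defaultdict

# (word, observed surface pronunciation) tokens, as a phonetically annotated corpus would provide
annotated_tokens = [
    ("ni3hao3", "n i3 h aw3"),
    ("ni3hao3", "n i3 aw3"),
    ("ni3hao3", "n i3 h aw3"),
    ("xie4xie4", "sy e4 sy e4"),
    ("xie4xie4", "sy e4 sy e"),
]

def build_pron_lexicon(tokens, min_prob=0.2):
    """Keep each word's surface pronunciations whose relative frequency exceeds min_prob."""
    counts = defaultdict(Counter)
    for word, pron in tokens:
        counts[word][pron] += 1
    lexicon = {}
    for word, pron_counts in counts.items():
        total = sum(pron_counts.values())
        lexicon[word] = {p: c / total for p, c in pron_counts.items() if c / total >= min_prob}
    return lexicon

for word, prons in build_pron_lexicon(annotated_tokens).items():
    print(word, prons)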
4:10, SPEECH-L12.3
SUB-LEXICAL MODELLING USING A FINITE STATE TRANSDUCER FRAMEWORK
X. MOU, V. ZUE
The finite state transducer (FST) approach has recently been widely used as an effective and flexible framework for speech systems. In this framework, a speech recognizer is represented as the composition of a series of FSTs combining various knowledge sources across sub-lexical and high-level linguistic layers. In this paper, we use this FST framework to explore some sub-lexical modelling approaches, and propose a hybrid model that combines an ANGIE morpho-phonemic model with a lexicon-based phoneme network model. These sub-lexical models are converted to FST representations and can be conveniently composed to build the recognizer. Our preliminary perplexity experiments show that the proposed hybrid model has the advantage of imposing strong constraints on in-vocabulary words while providing detailed sub-lexical syllabification and morphological analysis of out-of-vocabulary (OOV) words. It therefore has the potential to offer good performance and to better handle the OOV problem in speech recognition.
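A toy illustration of the composition step (real systems would use an FST toolkit such as OpenFST; this product construction over two small deterministic transducers is only a sketch of the idea, with invented phoneme-to-syllable and syllable-to-word tables):

def compose(t1, t2):
    """Product construction for two deterministic FSTs given as
    {(state, in_symbol): (out_symbol, next_state)} with start state 0."""
    composed = {}
    frontier = [(0, 0)]
    seen = set(frontier)
    while frontier:
        s1, s2 = frontier.pop()
        for (state1, a), (b, n1) in t1.items():
            if state1 != s1:
                continue
            hop = t2.get((s2, b))
            if hop is None:
                continue
            c, n2 = hop
            composed[((s1, s2), a)] = (c, (n1, n2))
            if (n1, n2) not in seen:
                seen.add((n1, n2))
                frontier.append((n1, n2))
    return composed

# phoneme -> syllable transducer (toy): consumes a phoneme, emits a syllable label
phone_to_syl = {(0, "k"): ("K", 1), (1, "ae"): ("KAE", 0), (0, "t"): ("T", 2), (2, "uw"): ("TUW", 0)}
# syllable -> word-fragment transducer (toy)
syl_to_word = {(0, "KAE"): ("cat-", 0), (0, "TUW"): ("two", 0), (0, "K"): ("", 0), (0, "T"): ("", 0)}

for (state, inp), (out, nxt) in compose(phone_to_syl, syl_to_word).items():
    print(state, inp, "->", out, nxt)

The composed machine maps phoneme-level input directly to word-level output, which is the convenience the FST framework exploits when stacking sub-lexical and lexical layers.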
4:30, SPEECH-L12.4
WHAT KIND OF PRONUNCIATION VARIATION IS HARD FOR TRIPHONES TO MODEL?
D. JURAFSKY, W. WARD, J. ZHANG, K. HEROLD, X. YU, S. ZHANG
In order to help understand why gains in pronunciation modeling have proven so elusive, we investigated which kinds of pronunciation variation are well captured by triphone models, and which are not. We did this by examining the change in behavior of a recognizer as it receives further triphone training. We show that many of the kinds of variation which previous pronunciation models attempt to capture, including phone substitution and phone reduction, are in fact already well captured by triphones. Our analysis suggests new areas where future pronunciation models should focus, including syllable deletion.
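The analysis might be tallied along these lines (a hedged sketch with invented tokens, not the paper's actual scoring code), comparing error rates per hand-labeled variation type for a recognizer before and after additional triphone training:

from collections import defaultdict

# (variation_type, correct with less triphone training, correct with more triphone training)
tokens = [
    ("phone_substitution", False, True),
    ("phone_substitution", True, True),
    ("phone_reduction", False, True),
    ("syllable_deletion", False, False),
    ("syllable_deletion", False, False),
]

def error_rate_by_type(tokens):
    """Error rate per variation type for the two systems; a drop means the extra training absorbed it."""
    stats = defaultdict(lambda: [0, 0, 0])  # [count, errors_less, errors_more]
    for vtype, ok_less, ok_more in tokens:
        stats[vtype][0] += 1
        stats[vtype][1] += not ok_less
        stats[vtype][2] += not ok_more
    return {v: (e1 / n, e2 / n) for v, (n, e1, e2) in stats.items()}

for vtype, (err_less, err_more) in error_rate_by_type(tokens).items():
    print(f"{vtype}: {err_less:.2f} -> {err_more:.2f}")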
4:50, SPEECH-L12.5
A BOOTSTRAP METHOD FOR CHINESE NEW WORDS EXTRACTION
S. HE, J. ZHU
A bootstrap approach for extracting unknown words from a Chinese text corpus is proposed in this paper. Instead of using a non-iterative segmentation-detection approach, the proposed method iteratively extracts new words and adds them to the lexicon. The augmented dictionary, which includes potential unknown words in addition to known words, is then used in the next iteration to re-segment the input corpus, until a stopping condition is reached. Experiments show that both the precision and recall rates of segmentation are improved.
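The loop can be sketched as follows (a generic reconstruction under simple assumptions, using greedy forward maximum matching as the segmenter and adjacent-segment frequency as the new-word criterion, not the authors' exact implementation):

from collections import Counter

def segment(text, lexicon, max_len=4):
    """Greedy forward maximum-matching segmentation with a single-character fallback."""
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out

def bootstrap_extract(corpus, lexicon, min_count=2, max_iters=10):
    """Iteratively segment, promote frequent adjacent-segment pairs to new words, and re-segment."""
    lexicon = set(lexicon)
    for _ in range(max_iters):
        segments = segment(corpus, lexicon)
        bigrams = Counter(a + b for a, b in zip(segments, segments[1:]))
        candidates = {w for w, c in bigrams.items() if c >= min_count and w not in lexicon}
        if not candidates:        # stopping condition: no new words extracted in this pass
            break
        lexicon |= candidates     # augment the dictionary before re-segmenting
    return lexicon

corpus = "北京大学北京大学欢迎你北京大学"
print(bootstrap_extract(corpus, {"北京", "大学", "欢迎", "你"}))

In this toy run the unknown word 北京大学 is promoted in the first pass, and the second pass segments it as a single unit, which is how the augmented lexicon improves segmentation precision and recall.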
5:10, SPEECH-L12.6
KANJI-TO-HIRAGANA CONVERSION BASED ON A LANGUAGE MODEL
W. CHANG
In speech recognition systems, a common problem is the transcription of new additions to the recognition lexicon into phonetic symbols. For the Japanese language, this problem can be dealt with in two steps. In this paper, we focus on the first step, in which the new lexical entry is converted into a set of hiragana syllabaries, which is almost a phonetic transcription. We propose a conversion scheme which yields the most likely hiragana syllabaries, based on a language model. Results from our evaluations on three test sets are also reported. Although the study is conducted on Japanese only, our approach also has applications to Chinese.
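A minimal sketch of such a language-model-based conversion (the reading dictionary and bigram probabilities are toy placeholders, and the exhaustive search stands in for whatever decoding the paper actually uses):

import math
from itertools import product

# candidate hiragana readings per kanji (toy dictionary)
READINGS = {
    "東": ["とう", "ひがし"],
    "京": ["きょう", "けい"],
}

# toy bigram language model over readings, P(next | prev); "<s>" marks the start
BIGRAM = {
    ("<s>", "とう"): 0.6, ("<s>", "ひがし"): 0.4,
    ("とう", "きょう"): 0.8, ("とう", "けい"): 0.2,
    ("ひがし", "きょう"): 0.3, ("ひがし", "けい"): 0.7,
}

def best_reading(word, floor=1e-4):
    """Return the hiragana sequence maximizing the bigram LM score over candidate readings."""
    best, best_lp = None, -math.inf
    for seq in product(*(READINGS[ch] for ch in word)):
        lp = sum(math.log(BIGRAM.get(pair, floor))
                 for pair in zip(("<s>",) + seq, seq))
        if lp > best_lp:
            best, best_lp = seq, lp
    return "".join(best)

print(best_reading("東京"))  # expected: とうきょう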