Chair: Yoshinori Sagisaka, AT&T Bell Laboratories (USA)
Douglas O'Shaughnessy, Université du Québec (CANADA)
We examine and model global speaking rate and how it varies in both fluent and disfluent spontaneous speech, in terms of the linguistic content of the utterances. Speakers tend to maintain a fixed speaking rate during most utterances, but often adopt a faster or slower rate depending on cognitive load (e.g., slowing down when having to make unanticipated choices, or accelerating when repeating some words). Such a model can find application in automatic speech synthesis and recognition, because most synthesizers maintain a constant (and unnatural) speaking rate and most recognizers cannot adapt their templates or probabilistic models to reflect global changes in speaking rate.
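A minimal sketch (not from the paper) of the kind of measurement such a model relies on: syllables per second per utterance, with utterances that deviate from a speaker's habitual rate flagged. The tuple layout and the relative threshold are assumptions for illustration only.

```python
# Toy sketch: estimate global speaking rate per utterance and flag utterances
# that deviate from the speaker's habitual rate (e.g. slowing at hesitations).
# Word timings and syllable counts are assumed inputs, not the paper's data.

def speaking_rate(words):
    """words: list of (syllable_count, start_sec, end_sec) tuples."""
    total_syl = sum(w[0] for w in words)
    duration = words[-1][2] - words[0][1]
    return total_syl / duration if duration > 0 else 0.0

def rate_deviations(utterances, threshold=0.2):
    """Return indices of utterances whose rate differs from the speaker's
    mean rate by more than `threshold` (relative)."""
    rates = [speaking_rate(u) for u in utterances]
    mean_rate = sum(rates) / len(rates)
    return [i for i, r in enumerate(rates)
            if abs(r - mean_rate) / mean_rate > threshold]
```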
Shigeru Fujio, ATR-ITL (JAPAN)
Yoshinori Sagisaka, ATR-ITL (JAPAN)
Norio Higuchi, ATR-ITL (JAPAN)
In this paper, we propose a model for predicting pause insertion from an input part-of-speech sequence using a stochastic context-free grammar (SCFG). The model uses word attributes and stochastic phrasing information obtained from an SCFG trained on phrase-dependency bracketings and on bracketings based on pause locations. Using the Inside-Outside algorithm, the SCFG is first trained from scratch on corpora with phrase-dependency brackets. Next, this SCFG is re-trained on the same corpora with bracketings based on pause locations. Then, the probability of each bracketing structure is computed with the SCFG, and these probabilities are used as parameters in the prediction of pause locations. Experiments were carried out to confirm the effectiveness of the stochastic model for predicting pause locations. In tests on open data, 85.2% of the pause boundaries and 90.9% of the no-pause boundaries were correctly predicted.
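For illustration only, a toy sketch of the final prediction step described above, assuming the per-boundary bracketing probabilities from a trained SCFG have already been computed; the threshold and the split of the evaluation into pause and no-pause boundaries are placeholders, not the paper's procedure.

```python
# Illustrative sketch: given, for each word boundary, an assumed precomputed
# probability that a phrase bracket closes there under a trained SCFG,
# threshold it to predict pauses and score the two boundary classes.

def predict_pauses(boundary_probs, threshold=0.5):
    """boundary_probs: list of P(bracket closes here | SCFG) per word boundary."""
    return [p >= threshold for p in boundary_probs]

def boundary_accuracy(predicted, reference):
    """Return (pause-boundary accuracy, no-pause-boundary accuracy)."""
    pause_hits = sum(p == r for p, r in zip(predicted, reference) if r)
    nopause_hits = sum(p == r for p, r in zip(predicted, reference) if not r)
    n_pause = sum(reference)
    n_nopause = len(reference) - n_pause
    return (pause_hits / max(n_pause, 1), nopause_hits / max(n_nopause, 1))
```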
Louis F. M. ten Bosch, Institute for Perception Research (THE NETHERLANDS)
In this paper, we study to what extent pitch movements in utterances can be classified automatically using acoustic information and an intonation grammar. It is shown that pitch movements can be classified into six categories with an agreement of about 80 percent with human transcriptions, on the basis of the pitch contour and the times of vowel onsets. These six categories cover about 90 percent of all pitch movements in a database of elicited speech. Results involving an intonation grammar are also presented.
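A rough, simplified sketch in the spirit of the contour-based classification above: each pitch movement is represented by F0 samples around a vowel onset and assigned to the nearest category template. The template set and the distance measure are assumptions, not the paper's classifier.

```python
import numpy as np

# Simplified sketch: nearest-template classification of a pitch movement,
# using an F0 window centred on a vowel onset. Category templates (e.g.
# rise, fall, rise-fall, ...) are assumed to be given and length-matched.

def classify_movement(f0_window, templates):
    """f0_window: 1-D array of F0 values centred on a vowel onset.
    templates: dict mapping category name -> template array of same length."""
    f0 = np.asarray(f0_window, dtype=float)
    f0 = f0 - f0.mean()                      # remove pitch register
    best, best_dist = None, np.inf
    for name, tmpl in templates.items():
        tmpl = np.asarray(tmpl, dtype=float)
        d = np.linalg.norm(f0 - (tmpl - tmpl.mean()))
        if d < best_dist:
            best, best_dist = name, d
    return best
```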
Matthew A. Siegler, Carnegie Mellon University (USA)
Richard M. Stern, Carnegie Mellon University (USA)
It is well known that fast speech increases the recognition error rate in large vocabulary automatic speech recognition systems. In this paper we attempt to identify and correct for errors due to fast speech. We first suggest that phone rate is a more meaningful measure of speech rate than word rate. We find that when data sets are clustered according to the phone-rate metric, recognition errors increase when the phone rate is more than one standard deviation above the mean. We then propose three methods to improve the recognition of fast speech. The first is an implementation of Baum-Welch codebook adaptation. The second is based on adaptation of the HMM state-transition probabilities. In the third, dictionaries are modified using rule-based techniques and compound words are added. Adaptation of the HMM state-transition probabilities improves recognition of the fastest speech by 4 to 6 percent relative.
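A minimal sketch of the phone-rate measure discussed above: phones per second per utterance, with utterances more than one standard deviation above the corpus mean flagged as fast (the region where errors were found to increase). The input format is an assumption.

```python
import statistics

# Sketch: phone rate per utterance and a simple "fast speech" flag based on
# the corpus mean and standard deviation of the rate distribution.

def phone_rate(n_phones, duration_sec):
    """Phones per second for one utterance."""
    return n_phones / duration_sec

def flag_fast(utterances):
    """utterances: list of (n_phones, duration_sec); returns a fast/not-fast
    flag per utterance (rate more than one standard deviation above the mean)."""
    rates = [phone_rate(n, d) for n, d in utterances]
    mu, sigma = statistics.mean(rates), statistics.stdev(rates)
    return [r > mu + sigma for r in rates]
```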
Shaw-Hwa Hwang, National Chiao Tung University (REPUBLIC OF CHINA)
Sin-Horng Chen, National Chiao Tung University (REPUBLIC OF CHINA)
A prosodic model of Mandarin speech is proposed to simulate the human pronunciation mechanism and to explore the hidden pronunciation states embedded in the input text. Parameters representing these pronunciation states are then used to assist prosody information generation. A multirate recurrent neural network (MRNN) is employed to realize the prosodic model. Two learning methods are proposed to train the MRNN. One is an indirect method that first uses an additional SRNN to track the dynamics of the prosodic information of the utterance and then takes the outputs of its hidden layer as training targets for the MRNN. The other is a direct method that integrates the MRNN and the following MLP prosody synthesizers to learn the relation between the input linguistic features and the output prosody information directly. Simulation results confirmed the effectiveness of the approach: most synthesized prosodic parameter sequences match their original counterparts quite well.
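A heavily reduced sketch, for orientation only: a single-rate recurrent layer mapping linguistic feature vectors to prosodic parameters. The paper's multirate architecture and its two training schemes are not reproduced; all dimensions and initializations here are arbitrary assumptions.

```python
import numpy as np

# Very reduced sketch: one recurrent layer mapping a sequence of linguistic
# feature vectors to prosodic parameter vectors (e.g. pitch, duration, energy).
# This is not the paper's MRNN; it only illustrates the input/output mapping.

class TinyRNN:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W_rec = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        self.W_out = rng.normal(0.0, 0.1, (n_out, n_hidden))

    def forward(self, features):
        """features: sequence of input vectors; returns one prosody vector
        per input frame."""
        h = np.zeros(self.W_rec.shape[0])
        outputs = []
        for x in features:
            h = np.tanh(self.W_in @ x + self.W_rec @ h)   # hidden state update
            outputs.append(self.W_out @ h)                # prosodic parameters
        return outputs
```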
Karen Ward, Oregon Graduate Institute of Science & Technology (USA)
David G. Novick, Oregon Graduate Institute of Science & Technology (USA)
In this study we examined prosodic characteristics of a word used in several distinct senses in a task-oriented corpus of spontaneous speech. We compared the pitch characteristics of the word "right" used in three different senses: as an acknowledgment, as a direction, and as an affirmative answer to a question. Significant differences in intonation for different classes of usage were found, although the differences are not reliable enough to allow systems to use prosody alone to distinguish between usages. These results suggest that pitch change as reported by a pitch tracker could serve as a confirming cue when analyzing ambiguous speech recognizer output, or could serve as input to a probabilistic parser to aid in disambiguating senses of homonyms.
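An illustrative sketch (not the study's analysis) of the basic comparison: group tokens of "right" by sense label and compare the mean F0 change, assuming each token carries start and end F0 estimates from a pitch tracker.

```python
import statistics

# Sketch: mean F0 change of tokens of "right" grouped by sense label.
# Token format (sense, f0_start, f0_end) is an assumption for illustration.

def pitch_change(f0_start, f0_end):
    return f0_end - f0_start

def mean_change_by_sense(tokens):
    """tokens: list of (sense_label, f0_start, f0_end)."""
    by_sense = {}
    for sense, f0s, f0e in tokens:
        by_sense.setdefault(sense, []).append(pitch_change(f0s, f0e))
    return {sense: statistics.mean(vals) for sense, vals in by_sense.items()}
```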
Mitsuru Nakai, Tohoku University
Harald Singer, ATR-ITL
Yoshinori Sagisaka, ATR-ITL
Hiroshi Shimodaira, JAIST (JAPAN)
In this paper, we propose an automatic method for detecting accent phrase boundaries in Japanese continuous speech using F_0 information. In the training phase, hand-labeled accent patterns are parameterized according to the superpositional model proposed by Fujisaki and grouped by a clustering method, with an accent template calculated as the centroid of each cluster. In the segmentation phase, automatic N-best extraction of boundaries is performed by One-Stage DP matching between the reference templates and the target F_0 contour. About 90% of accent phrase boundaries were correctly detected in speaker-independent experiments on the ATR Japanese continuous speech database.
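A greatly simplified sketch of matching candidate segments against cluster-centroid templates; the One-Stage DP search over all candidate segmentations and the Fujisaki parameterization are not reproduced. The length normalization and distance measure are assumptions.

```python
import numpy as np

# Simplified sketch: score a candidate accent-phrase segment of an F0 contour
# against cluster-centroid templates; a low distance at a segment end suggests
# an accent phrase boundary there. Templates are assumed to be arrays of
# length 20 (the resampling length used below).

def resample(x, n=20):
    """Length-normalize a segment so it can be compared with templates."""
    x = np.asarray(x, dtype=float)
    idx = np.linspace(0, len(x) - 1, n)
    return np.interp(idx, np.arange(len(x)), x)

def segment_score(f0_segment, templates):
    """Distance of the (length- and level-normalized) segment to the nearest
    accent template; smaller is a better match."""
    seg = resample(f0_segment)
    seg = seg - seg.mean()
    return min(np.linalg.norm(seg - (t - t.mean())) for t in templates)
```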
Anastasios Anastasakos, Northeastern University
Richard Schwartz, BBN Systems and Technologies (USA)
Han Shu, BBN Systems and Technologies (USA)
This paper presents a study of different methods for phoneme duration modeling in large vocabulary speech recognition. We investigate the use of phoneme duration and the effects of context, speaking rate, and lexical stress on the duration of phoneme segments in a large vocabulary speech recognition system. The duration models are used in a postprocessing phase of BYBLOS, our baseline HMM-based recognition system, to rescore the N-best hypotheses. We describe experiments with the 5K-word ARPA Wall Street Journal (WSJ) corpus. The results show that integrating duration models that take into account context and speaking rate can improve the word accuracy of the baseline recognition system.
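A sketch, under assumed Gaussian duration models, of rescoring N-best hypotheses with an added duration term; the weighting and score combination are placeholders rather than the BYBLOS postprocessing described above.

```python
import math

# Sketch: rescore N-best hypotheses by adding a duration log-likelihood term
# to each hypothesis score. Duration models here are assumed Gaussians per
# phone; the interpolation weight is arbitrary.

def duration_logprob(phone, dur, models):
    """models: dict phone -> (mean_sec, std_sec) of segment duration."""
    mean, std = models[phone]
    return (-0.5 * ((dur - mean) / std) ** 2
            - math.log(std * math.sqrt(2.0 * math.pi)))

def rescore(nbest, models, weight=1.0):
    """nbest: list of (acoustic_plus_lm_score, [(phone, duration_sec), ...]).
    Returns the best hypothesis after adding the weighted duration score."""
    rescored = []
    for base_score, segments in nbest:
        dur_score = sum(duration_logprob(p, d, models) for p, d in segments)
        rescored.append((base_score + weight * dur_score, segments))
    return max(rescored, key=lambda item: item[0])
```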
Siripong Potisuk, Purdue University (USA)
Mary P. Harper, Purdue University (USA)
Jackson T. Gandour, Purdue University (USA)
Tone classification is a crucial component of any automatic speech recognition system for tone languages. It is imperative that tonal information be incorporated into the word hypothesization process because patterns of pitch (tones) contribute to the lexical identification of individual words. In this paper, we present a novel algorithm for automatically classifying Thai tones in connected speech using an analysis-synthesis method based on an extension of Fujisaki's model. We have successfully incorporated into the model two major factors affecting the phonetic realization of tones in connected speech: tonal coarticulation and declination. Also addressed is an F0 normalization procedure for achieving speaker independence. In a preliminary experiment, we achieved 89.1% classification accuracy.
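A simplified sketch of the speaker-normalization idea followed by nearest-template tone matching; the extended Fujisaki analysis-synthesis and the handling of tonal coarticulation and declination are not reproduced. The z-score normalization and template shapes are assumptions.

```python
import numpy as np

# Sketch: z-score F0 normalization per speaker, then nearest-template matching
# of a syllable's normalized contour against the five Thai tone shapes
# (templates assumed given and length-matched to the input contour).

def normalize_f0(f0, speaker_mean, speaker_std):
    """Speaker-level z-score normalization of an F0 contour."""
    return (np.asarray(f0, dtype=float) - speaker_mean) / speaker_std

def classify_tone(f0_norm, tone_templates):
    """tone_templates: dict tone name -> normalized contour of same length.
    Returns the tone whose template is nearest in Euclidean distance."""
    return min(tone_templates,
               key=lambda tone: np.linalg.norm(f0_norm - tone_templates[tone]))
```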