Chair: Yoshinori Sagisaka, AT&T Bell Laboratories (USA)
Douglas O'Shaughnessy, Université du Québec (CANADA)
We examine and model global speaking rate and how it varies in both fluent and disfluent spontaneous speech, in terms of the linguistic content of the utterances. Speakers tend to maintain a fixed speaking rate during most utterances, but often adopt a faster or slower rate depending on cognitive load (e.g., slowing down when having to make unanticipated choices, or accelerating when repeating some words). Such a model can find application in automatic speech synthesis and recognition, because most synthesizers maintain a constant (and unnatural) speaking rate and most recognizers cannot adapt their templates or probabilistic models to reflect global changes in speaking rate.
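A minimal sketch (not from the paper) of the kind of measurement such a model relies on: syllables per second per utterance, with utterances that deviate from a speaker's habitual rate flagged. The tuple layout and the relative threshold are assumptions for illustration only.

```python
# Toy sketch: estimate global speaking rate per utterance and flag utterances
# that deviate from the speaker's habitual rate (e.g. slowing at hesitations).
# Word timings and syllable counts are assumed inputs, not the paper's data.

def speaking_rate(words):
    """words: list of (syllable_count, start_sec, end_sec) tuples."""
    total_syl = sum(w[0] for w in words)
    duration = words[-1][2] - words[0][1]
    return total_syl / duration if duration > 0 else 0.0

def rate_deviations(utterances, threshold=0.2):
    """Return indices of utterances whose rate differs from the speaker's
    mean rate by more than `threshold` (relative)."""
    rates = [speaking_rate(u) for u in utterances]
    mean_rate = sum(rates) / len(rates)
    return [i for i, r in enumerate(rates)
            if abs(r - mean_rate) / mean_rate > threshold]
```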
Shigeru Fujio, ATR-ITL (JAPAN)
Yoshinori Sagisaka, ATR-ITL (JAPAN)
Norio Higuchi, ATR-ITL (JAPAN)
In this paper, we propose a model for predicting pause insertion from an input part-of-speech sequence using a stochastic context-free grammar (SCFG). The model uses word attributes and stochastic phrasing information obtained from an SCFG trained on phrase-dependency bracketings and on bracketings based on pause locations. Using the Inside-Outside algorithm, the SCFG is first trained from scratch on corpora with phrase-dependency brackets. Next, this SCFG is re-trained on the same corpora with bracketings based on pause locations. Then, the probability of each bracketing structure is computed with the SCFG, and these probabilities are used as parameters in the prediction of pause locations. Experiments were carried out to confirm the effectiveness of the stochastic model for predicting pause locations. In tests on open data, 85.2% of the pause boundaries and 90.9% of the no-pause boundaries were correctly predicted.
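For illustration only, a toy sketch of the final prediction step described above, assuming the per-boundary bracketing probabilities from a trained SCFG have already been computed; the threshold and the split of the evaluation into pause and no-pause boundaries are placeholders, not the paper's procedure.

```python
# Illustrative sketch: given, for each word boundary, an assumed precomputed
# probability that a phrase bracket closes there under a trained SCFG,
# threshold it to predict pauses and score the two boundary classes.

def predict_pauses(boundary_probs, threshold=0.5):
    """boundary_probs: list of P(bracket closes here | SCFG) per word boundary."""
    return [p >= threshold for p in boundary_probs]

def boundary_accuracy(predicted, reference):
    """Return (pause-boundary accuracy, no-pause-boundary accuracy)."""
    pause_hits = sum(p == r for p, r in zip(predicted, reference) if r)
    nopause_hits = sum(p == r for p, r in zip(predicted, reference) if not r)
    n_pause = sum(reference)
    n_nopause = len(reference) - n_pause
    return (pause_hits / max(n_pause, 1), nopause_hits / max(n_nopause, 1))
```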
Louis F. M. ten Bosch, Institute for Perception Research (THE NETHERLANDS)
In this paper, we study to what extent pitch movements in utterances can be classified automatically using acoustic information and an intonation grammar. It is shown that pitch movements can be classified into six categories with an agreement of about 80 percent with human transcriptions, on the basis of the pitch contour and the times of vowel onsets. These six categories cover about 90 percent of all pitch movements in a database of elicited speech. Results involving an intonation grammar are also presented.
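A rough, simplified sketch in the spirit of the contour-based classification above: each pitch movement is represented by F0 samples around a vowel onset and assigned to the nearest category template. The template set and the distance measure are assumptions, not the paper's classifier.

```python
import numpy as np

# Simplified sketch: nearest-template classification of a pitch movement,
# using an F0 window centred on a vowel onset. Category templates (e.g.
# rise, fall, rise-fall, ...) are assumed to be given and length-matched.

def classify_movement(f0_window, templates):
    """f0_window: 1-D array of F0 values centred on a vowel onset.
    templates: dict mapping category name -> template array of same length."""
    f0 = np.asarray(f0_window, dtype=float)
    f0 = f0 - f0.mean()                      # remove pitch register
    best, best_dist = None, np.inf
    for name, tmpl in templates.items():
        tmpl = np.asarray(tmpl, dtype=float)
        d = np.linalg.norm(f0 - (tmpl - tmpl.mean()))
        if d < best_dist:
            best, best_dist = name, d
    return best
```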
Matthew A. Siegler, Carnegie Mellon University (USA)
Richard M. Stern, Carnegie Mellon University (USA)
It is well known that fast speech increases the recognition error rate in large vocabulary automatic speech recognition systems. In this paper we attempt to identify and correct for errors due to fast speech. We first suggest that phone rate is a more meaningful measure of speech rate than word rate. We find that when data sets are clustered according to the phone-rate metric, recognition errors increase when the phone rate is more than one standard deviation above the mean. We then propose three methods to improve the recognition of fast speech. The first is an implementation of Baum-Welch codebook adaptation. The second is based on adaptation of the HMM state-transition probabilities. In the third, dictionaries are modified using rule-based techniques and compound words are added. Adaptation of the HMM state-transition probabilities improves recognition of the fastest speech by 4 to 6 percent relative.
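A minimal sketch of the phone-rate measure discussed above: phones per second per utterance, with utterances more than one standard deviation above the corpus mean flagged as fast (the region where errors were found to increase). The input format is an assumption.

```python
import statistics

# Sketch: phone rate per utterance and a simple "fast speech" flag based on
# the corpus mean and standard deviation of the rate distribution.

def phone_rate(n_phones, duration_sec):
    """Phones per second for one utterance."""
    return n_phones / duration_sec

def flag_fast(utterances):
    """utterances: list of (n_phones, duration_sec); returns a fast/not-fast
    flag per utterance (rate more than one standard deviation above the mean)."""
    rates = [phone_rate(n, d) for n, d in utterances]
    mu, sigma = statistics.mean(rates), statistics.stdev(rates)
    return [r > mu + sigma for r in rates]
```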
Shaw-Hwa Hwang, National Chiao Tung University (REPUBLIC OF CHINA)
Sin-Horng Chen, National Chiao Tung University (REPUBLIC OF CHINA)
A prosodic model of Mandarin speech is proposed to simulate the human pronunciation mechanism and to explore the hidden pronunciation states embedded in the input text. Parameters representing these pronunciation states are then used to assist prosody information generation. A multirate recurrent neural network (MRNN) is employed to realize the prosodic model. Two learning methods are proposed to train the MRNN. One is an indirect method that first uses an additional SRNN to track the dynamics of the prosodic information of the utterance and then takes the outputs of its hidden layer as training targets for the MRNN. The other is a direct method that integrates the MRNN and the following MLP prosody synthesizers to learn the relation between the input linguistic features and the output prosody information directly. Simulation results confirmed the effectiveness of the approach: most synthesized prosodic parameter sequences match their original counterparts quite well.
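A heavily reduced sketch, for orientation only: a single-rate recurrent layer mapping linguistic feature vectors to prosodic parameters. The paper's multirate architecture and its two training schemes are not reproduced; all dimensions and initializations here are arbitrary assumptions.

```python
import numpy as np

# Very reduced sketch: one recurrent layer mapping a sequence of linguistic
# feature vectors to prosodic parameter vectors (e.g. pitch, duration, energy).
# This is not the paper's MRNN; it only illustrates the input/output mapping.

class TinyRNN:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W_rec = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        self.W_out = rng.normal(0.0, 0.1, (n_out, n_hidden))

    def forward(self, features):
        """features: sequence of input vectors; returns one prosody vector
        per input frame."""
        h = np.zeros(self.W_rec.shape[0])
        outputs = []
        for x in features:
            h = np.tanh(self.W_in @ x + self.W_rec @ h)   # hidden state update
            outputs.append(self.W_out @ h)                # prosodic parameters
        return outputs
```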
Karen Ward, Oregon Graduate Institute of Science & Technology (USA)
David G. Novick, Oregon Graduate Institute of Science & Technology (USA)
In this study we examined prosodic characteristics of a word used in several distinct senses in a task-oriented corpus of spontaneous speech. We compared the pitch characteristics of the word "right" used in three different senses: as an acknowledgment, as a direction, and as an affirmative answer to a question. Significant differences in intonation for different classes of usage were found, although the differences are not reliable enough to allow systems to use prosody alone to distinguish between usages. These results suggest that pitch change as reported by a pitch tracker could serve as a confirming cue when analyzing ambiguous speech recognizer output, or could serve as input to a probabilistic parser to aid in disambiguating senses of homonyms.
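An illustrative sketch (not the study's analysis) of the basic comparison: group tokens of "right" by sense label and compare the mean F0 change, assuming each token carries start and end F0 estimates from a pitch tracker.

```python
import statistics

# Sketch: mean F0 change of tokens of "right" grouped by sense label.
# Token format (sense, f0_start, f0_end) is an assumption for illustration.

def pitch_change(f0_start, f0_end):
    return f0_end - f0_start

def mean_change_by_sense(tokens):
    """tokens: list of (sense_label, f0_start, f0_end)."""
    by_sense = {}
    for sense, f0s, f0e in tokens:
        by_sense.setdefault(sense, []).append(pitch_change(f0s, f0e))
    return {sense: statistics.mean(vals) for sense, vals in by_sense.items()}
```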
Mitsuru Nakai, Tohoku University
Harald Singer, ATR-ITL
Yoshinori Sagisaka, ATR-ITL
Hiroshi Shimodaira, JAIST (JAPAN)
In this paper, we propose an automatic method for detecting accent phrase boundaries in Japanese continuous speech using F_0 information. In the training phase, hand-labeled accent patterns are parameterized according to the superpositional model proposed by Fujisaki and grouped by a clustering method, with an accent template calculated as the centroid of each cluster. In the segmentation phase, automatic N-best extraction of boundaries is performed by One-Stage DP matching between the reference templates and the target F_0 contour. About 90% of accent phrase boundaries were correctly detected in speaker-independent experiments on the ATR Japanese continuous speech database.
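A greatly simplified sketch of matching candidate segments against cluster-centroid templates; the One-Stage DP search over all candidate segmentations and the Fujisaki parameterization are not reproduced. The length normalization and distance measure are assumptions.

```python
import numpy as np

# Simplified sketch: score a candidate accent-phrase segment of an F0 contour
# against cluster-centroid templates; a low distance at a segment end suggests
# an accent phrase boundary there. Templates are assumed to be arrays of
# length 20 (the resampling length used below).

def resample(x, n=20):
    """Length-normalize a segment so it can be compared with templates."""
    x = np.asarray(x, dtype=float)
    idx = np.linspace(0, len(x) - 1, n)
    return np.interp(idx, np.arange(len(x)), x)

def segment_score(f0_segment, templates):
    """Distance of the (length- and level-normalized) segment to the nearest
    accent template; smaller is a better match."""
    seg = resample(f0_segment)
    seg = seg - seg.mean()
    return min(np.linalg.norm(seg - (t - t.mean())) for t in templates)
```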
Anastasios Anastasakos, Northeastern University
Richard Schwartz, BBN Systems and Technologies (USA)
Han Shu, BBN Systems and Technologies (USA)
This paper presents a study of different methods for phoneme duration modeling in large vocabulary speech recognition. We investigate the use of phoneme duration and the effects of context, speaking rate, and lexical stress on the duration of phoneme segments in a large vocabulary speech recognition system. The duration models are used in a postprocessing phase of BYBLOS, our baseline HMM-based recognition system, to rescore the N-best hypotheses. We describe experiments with the 5K-word ARPA Wall Street Journal (WSJ) corpus. The results show that integrating duration models that take into account context and speaking rate can improve the word accuracy of the baseline recognition system.
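A sketch, under assumed Gaussian duration models, of rescoring N-best hypotheses with an added duration term; the weighting and score combination are placeholders rather than the BYBLOS postprocessing described above.

```python
import math

# Sketch: rescore N-best hypotheses by adding a duration log-likelihood term
# to each hypothesis score. Duration models here are assumed Gaussians per
# phone; the interpolation weight is arbitrary.

def duration_logprob(phone, dur, models):
    """models: dict phone -> (mean_sec, std_sec) of segment duration."""
    mean, std = models[phone]
    return (-0.5 * ((dur - mean) / std) ** 2
            - math.log(std * math.sqrt(2.0 * math.pi)))

def rescore(nbest, models, weight=1.0):
    """nbest: list of (acoustic_plus_lm_score, [(phone, duration_sec), ...]).
    Returns the best hypothesis after adding the weighted duration score."""
    rescored = []
    for base_score, segments in nbest:
        dur_score = sum(duration_logprob(p, d, models) for p, d in segments)
        rescored.append((base_score + weight * dur_score, segments))
    return max(rescored, key=lambda item: item[0])
```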
Siripong Potisuk, Purdue University (USA)
Mary P. Harper, Purdue University (USA)
Jackson T. Gandour, Purdue University (USA)
Tone classification is a crucial component of any automatic speech recognition system for tone languages. It is imperative that tonal information be incorporated into the word hypothesization process because patterns of pitch (tones) contribute to the lexical identification of individual words. In this paper, we present a novel algorithm for automatically classifying Thai tones in connected speech using an analysis-synthesis method based on an extension of Fujisaki's model. We have successfully incorporated into the model two major factors affecting the phonetic realization of tones in connected speech: tonal coarticulation and declination. Also addressed is an F0 normalization procedure for achieving speaker independence. In a preliminary experiment, we achieved 89.1% classification accuracy.
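A simplified sketch of the speaker-normalization idea followed by nearest-template tone matching; the extended Fujisaki analysis-synthesis and the handling of tonal coarticulation and declination are not reproduced. The z-score normalization and template shapes are assumptions.

```python
import numpy as np

# Sketch: z-score F0 normalization per speaker, then nearest-template matching
# of a syllable's normalized contour against the five Thai tone shapes
# (templates assumed given and length-matched to the input contour).

def normalize_f0(f0, speaker_mean, speaker_std):
    """Speaker-level z-score normalization of an F0 contour."""
    return (np.asarray(f0, dtype=float) - speaker_mean) / speaker_std

def classify_tone(f0_norm, tone_templates):
    """tone_templates: dict tone name -> normalized contour of same length.
    Returns the tone whose template is nearest in Euclidean distance."""
    return min(tone_templates,
               key=lambda tone: np.linalg.norm(f0_norm - tone_templates[tone]))
```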