Session: SPEECH-L2
Time: 3:30 - 5:30, Tuesday, May 8, 2001
Location: Room 151
Title: Speech Synthesis and Production
Chair: Juergen Schroeter

3:30, SPEECH-L2.1
JOINT PROSODY PREDICTION AND UNIT SELECTION FOR CONCATENATIVE SPEECH SYNTHESIS
I. BULYKO, M. OSTENDORF
In this paper we describe how prosody prediction can be efficiently integrated with the unit selection process in a concatenative speech synthesizer under a weighted finite-state transducer (WFST) architecture. WFSTs representing prosody prediction and unit selection can be composed during synthesis, thus effectively expanding the space of possible prosodic targets. We implemented a symbolic prosody prediction module and a unit selection database as the synthesis components of a travel planning system. Results of perceptual experiments show that by combining the steps of prosody prediction and unit selection we are able to achieve improved naturalness of synthetic speech compared to the sequential implementation.
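[Editor's illustration] The composition idea above can be pictured, very loosely, as a joint search over prosodic targets and database units rather than two sequential decisions. The Python sketch below is illustrative only: the candidate lists and the cost callables (target_cost, join_cost) are hypothetical, and the dynamic-programming pass merely stands in for the WFST composition; it is not the authors' implementation.

    # Illustrative only: joint prosody/unit search as a Viterbi-style pass over
    # (prosodic target, unit candidate) pairs, standing in for WFST composition.
    def joint_select(syllables, prosody_candidates, unit_candidates,
                     target_cost, join_cost):
        # prosody_candidates[s] : candidate prosodic targets for syllable s
        # unit_candidates[s]    : candidate database units for syllable s
        # target_cost(t, u)     : how well unit u matches prosodic target t
        # join_cost(u_prev, u)  : mismatch penalty at the join between units
        # candidates are assumed hashable (e.g. ids)
        best = {(t, u): (target_cost(t, u), [(t, u)])
                for t in prosody_candidates[0] for u in unit_candidates[0]}
        for s in range(1, len(syllables)):
            new_best = {}
            for t in prosody_candidates[s]:
                for u in unit_candidates[s]:
                    cands = [(cost + join_cost(prev_u, u) + target_cost(t, u),
                              path + [(t, u)])
                             for (prev_t, prev_u), (cost, path) in best.items()]
                    new_best[(t, u)] = min(cands, key=lambda c: c[0])
            best = new_best
        return min(best.values(), key=lambda c: c[0])[1]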

3:50, SPEECH-L2.2
SELECTING NON-UNIFORM UNITS FROM A VERY LARGE CORPUS FOR CONCATENATIVE SPEECH SYNTHESIZER
M. CHU, H. PENG, H. YANG, E. CHANG
This paper proposes a two-module TTS structure, which bypasses the prosody model that predicts numerical prosodic parameters for synthetic speech. Instead, many instances of each basic unit from a large speech corpus are classified into categories by a CART, in which the expectation of the weighted sum of squared regression errors of prosodic features is used as the splitting criterion. Better prosody is achieved by retaining a small degree of diversity in the prosodic features of instances belonging to the same class. A multi-tier non-uniform unit selection method is presented. It makes the best decision on unit selection by minimizing the concatenation cost of the whole utterance. Since the largest available and suitable units are selected for concatenation, distortion caused by mismatches at concatenation points is minimized. According to informal listening tests, very natural and fluent speech is synthesized.
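[Editor's illustration] As a rough picture of the splitting criterion mentioned above, here is a minimal sketch (not the paper's code) of an occupancy-weighted, per-feature-weighted sum of squared regression errors over prosodic features; the array names and weights are assumptions.

    import numpy as np

    def split_criterion(prosody, left_idx, right_idx, weights):
        # prosody: (n_instances, n_features) array, e.g. F0, duration, energy
        # weights: per-feature weights; both children are assumed non-empty
        def wsse(idx):
            node = prosody[idx]
            return float(np.sum(weights * (node - node.mean(axis=0)) ** 2))
        # expected within-node regression error; when growing the tree,
        # the candidate split that minimizes this value is chosen
        return (wsse(left_idx) + wsse(right_idx)) / len(prosody)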

4:10, SPEECH-L2.3
A QUANTITATIVE METHOD FOR MODELING CONTEXT IN CONCATENATIVE SYNTHESIS USING LARGE SPEECH DATABASE
W. HAMZA, M. RASHWAN, M. AFIFY
Modeling phonetic context is one of the keys to achieving natural-sounding output in concatenative speech synthesis. In this paper, a new quantitative method for modeling context is proposed. In the proposed method, context is measured as the distance between leaves of the top-down, likelihood-based decision trees grown during the construction of the acoustic inventory. Unlike other context modeling methods, this method allows the unit selection algorithm to borrow unit occurrences from other contexts when their context distances are small. This is done by incorporating the measured distance as an element of the unit selection cost function. The motivation behind this method is that it reduces the required speech modification by using better unit occurrences from nearby contexts. The method also makes it easy to use long synthesis units, e.g. syllables or words, within the same unit selection framework.
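[Editor's illustration] One way to picture "incorporating the measured distance as an element of the cost function" is the hedged Python sketch below; the field names, weights, and callables are assumptions for illustration, not the paper's formulation.

    def selection_cost(candidate, target, prev_candidate,
                       context_distance, concat_cost,
                       w_context=1.0, w_concat=1.0):
        # candidate["leaf"], target["leaf"]: decision-tree leaves of the
        # candidate's recorded context and of the desired target context
        # concat_cost(prev, cur): mismatch penalty at the join with the
        # previously selected unit; all names and weights are assumed
        return (w_context * context_distance(candidate["leaf"], target["leaf"])
                + w_concat * concat_cost(prev_candidate, candidate))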

4:30, SPEECH-L2.4
DURATION MODELING IN A RESTRICTED-DOMAIN FEMALE-VOICE SYNTHESIS IN SPANISH USING NEURAL NETWORKS
R. CÓRDOBA, J. MONTERO, J. GUTIERREZ-ARRIOLA, J. PARDO
The objective of this paper is the accurate prediction of segmental duration in a Spanish text-to-speech system. Many parameters affect duration, but not all of them are always relevant. We present a complete environment in which to decide which parameters are most relevant and how best to code them. This work is the continuation of [1], where all efforts were dedicated to an unrestricted-domain database for a male voice. In this case, we consider a female voice in a restricted-domain environment. The restricted domain offers several advantages for modeling: the variation among the different patterns is reduced, so most of our decisions about the parameters are now based on more significant results. The conclusions we present thus show clearly which parameters perform best. The system is based on a fully configurable neural network.
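[Editor's illustration] For readers unfamiliar with this kind of setup, a minimal sketch of a configurable feed-forward duration predictor follows; it uses synthetic data and an assumed feature coding, not the paper's female-voice corpus or network configuration.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    # synthetic stand-in data: 12 coded contextual parameters per segment
    X = rng.normal(size=(500, 12))
    # synthetic durations in ms, only so the example runs end to end
    y = 80.0 + 30.0 * X[:, 0] + rng.normal(scale=10.0, size=500)

    model = MLPRegressor(hidden_layer_sizes=(20,), activation="logistic",
                         max_iter=2000, random_state=0)
    model.fit(X[:400], y[:400])
    print(model.predict(X[400:405]))   # predicted segmental durations (ms)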

4:50, SPEECH-L2.5
A FUNCTIONAL ARTICULATORY DYNAMIC MODEL FOR SPEECH PRODUCTION
L. LEE, P. FIEGUTH, L. DENG
This paper introduces a new statistical speech production model. The model synthesizes natural speech by modeling some key dynamic properties of the vocal articulators in a linear/nonlinear state-space framework. The goal-oriented movements of the articulators (tongue tip, tongue dorsum, upper lip, lower lip, and jaw) are described by a linear dynamic state equation. The resulting articulatory trajectories, combined with the effects of the velum and larynx, are nonlinearly mapped into the acoustic feature space (MFCCs). The key challenges in this model are the development of a nonlinear parameter estimation methodology and the incorporation of appropriate prior assumptions into the articulatory dynamic structure. Such a model can also be applied directly to speech recognition, to better account for coarticulation and phonetic reduction phenomena with considerably fewer parameters than HMM-based approaches.
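[Editor's illustration] The general shape of such a linear/nonlinear state-space model can be sketched as below; the matrices, targets, and nonlinear mapping are random placeholders chosen only so the sketch runs, not the quantities estimated in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n_artic, n_acoustic, T = 5, 12, 100        # 5 articulator dims, MFCC-like output
    Phi = 0.9 * np.eye(n_artic)                # pulls the state toward its target
    target = rng.normal(size=n_artic)          # phone-dependent articulatory goal
    W = rng.normal(size=(n_acoustic, n_artic)) # placeholder for a learned mapping

    def articulatory_to_acoustic(x):
        # stand-in nonlinear mapping; the paper estimates this from data
        return np.tanh(W @ x)

    x = np.zeros(n_artic)
    observations = []
    for t in range(T):
        # linear, goal-directed state equation:
        # x_{t+1} = Phi x_t + (I - Phi) target + state noise
        x = Phi @ x + (np.eye(n_artic) - Phi) @ target \
            + 0.01 * rng.normal(size=n_artic)
        # nonlinear observation equation mapping articulation to acoustics
        observations.append(articulatory_to_acoustic(x)
                            + 0.01 * rng.normal(size=n_acoustic))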

5:10, SPEECH-L2.6
THE EFFECT OF LANGUAGE MODEL PROBABILITY ON PRONUNCIATION REDUCTION
D. JURAFSKY, A. BELL, M. GREGORY, W. RAYMOND
We investigate how the probability of a word affects its pronunciation. We examined 5618 tokens of the 10 most frequent (function) words in Switchboard: I, and, the, that, a, you, to, of, it, and in, and 2042 tokens of content words whose lexical form ends in a t or d. Our observations were drawn from the phonetically hand-transcribed subset of the Switchboard corpus, enabling us to code each word with its pronunciation and duration. Using linear and logistic regression to control for contextual factors, we show that words which have a high unigram, bigram, or reverse bigram (given the following word) probability are shorter, more likely to have a reduced vowel, and more likely to have a deleted final t or d. These results suggest that pronunciation models in speech recognition and synthesis should take into account the probability of words given both the previous and following words.
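[Editor's illustration] A hedged sketch of this style of analysis (a logistic regression of a binary reduction outcome on log probabilities, controlling for a contextual factor) is given below; the column names and synthetic data are assumptions for illustration, not the Switchboard measurements.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 2000
    df = pd.DataFrame({
        "log_unigram": rng.normal(size=n),       # log word probability
        "log_bigram_next": rng.normal(size=n),   # log p(word | following word)
        "speech_rate": rng.normal(size=n),       # one of several contextual controls
    })
    # synthetic outcome so the example runs; the study uses Switchboard tokens
    p = 1.0 / (1.0 + np.exp(-(0.8 * df.log_unigram + 0.5 * df.log_bigram_next)))
    df["deleted"] = rng.binomial(1, p)           # e.g. deleted final t/d (0/1)

    fit = smf.logit("deleted ~ log_unigram + log_bigram_next + speech_rate",
                    data=df).fit()
    print(fit.summary())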