3:30, SPEECH-L2.1
JOINT PROSODY PREDICTION AND UNIT SELECTION FOR CONCATENATIVE SPEECH SYNTHESIS
I. BULYKO, M. OSTENDORF
In this paper we describe how prosody prediction can be efficiently
integrated with the unit selection process in a concatenative speech
synthesizer under a weighted finite-state transducer (WFST)
architecture. WFSTs representing prosody prediction and unit selection
can be composed during synthesis, thus effectively expanding the
space of possible prosodic targets. We implemented a symbolic prosody
prediction module and a unit selection database as the
synthesis components of a travel planning system. Results of perceptual experiments show that, by combining the steps of prosody prediction and unit selection, we achieve improved naturalness of synthetic speech compared with a sequential implementation.
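A minimal sketch of the mechanism the abstract relies on: composing an epsilon-free prosody-prediction WFST with a unit-selection WFST over the tropical (min, +) semiring, so that alternative prosodic targets stay available until unit costs are known. This illustrates the general technique, not the authors' implementation; all symbols, states, and costs below are invented.

from collections import defaultdict

# A WFST is (start_state,
#            {state: [(in_label, out_label, weight, next_state)]},
#            {final_state: final_weight}); path weights combine by addition.

def compose(t1, t2):
    """Compose two epsilon-free WFSTs over the tropical semiring."""
    start1, arcs1, finals1 = t1
    start2, arcs2, finals2 = t2
    start = (start1, start2)
    arcs, finals = defaultdict(list), {}
    stack, seen = [start], {start}
    while stack:
        s1, s2 = stack.pop()
        if s1 in finals1 and s2 in finals2:
            finals[(s1, s2)] = finals1[s1] + finals2[s2]
        for i1, o1, w1, n1 in arcs1.get(s1, []):
            for i2, o2, w2, n2 in arcs2.get(s2, []):
                if o1 == i2:  # T1's output symbol must match T2's input symbol
                    arcs[(s1, s2)].append((i1, o2, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
    return start, dict(arcs), finals

# Toy prosody predictor: a word maps to alternative accent targets.
prosody = (0, {0: [("flight", "H*", 0.2, 1), ("flight", "L*", 0.9, 1)]}, {1: 0.0})
# Toy unit selector: accent targets map to database units with unit costs.
units = (0, {0: [("H*", "unit_17", 1.5, 1), ("L*", "unit_42", 0.1, 1)]}, {1: 0.0})

start, arcs, finals = compose(prosody, units)
# The joint machine maps flight -> unit_42 at total cost 1.0 and
# flight -> unit_17 at 1.7; a sequential pipeline that first committed to
# the locally cheaper target H* would have been locked into the 1.7 path.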
3:50, SPEECH-L2.2
SELECTING NON-UNIFORM UNITS FROM A VERY LARGE CORPUS FOR CONCATENATIVE SPEECH SYNTHESIZER
M. CHU, H. PENG, H. YANG, E. CHANG
This paper proposes a two-module TTS structure that bypasses the prosody model predicting numerical prosodic parameters for synthetic speech. Instead, the many instances of each basic unit in a large speech corpus are classified into categories by a CART, in which the expectation of the weighted sum of squared regression errors of prosodic features is used as the splitting criterion. Better prosody is achieved by preserving a modest diversity in the prosodic features of instances belonging to the same class. A multi-tier non-uniform unit selection method is presented, which makes the best unit selection decision by minimizing the concatenation cost of the whole utterance. Since the largest available and suitable units are selected for concatenation, distortion caused by mismatches at concatenation points is minimized. According to informal listening tests, very natural and fluent speech is synthesized.
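One concrete reading of the splitting criterion, sketched below: a candidate CART question is scored by the expected weighted sum-of-squares regression error of the prosodic feature vectors (e.g. f0, duration, energy) in the two child nodes. The weighting, normalization, and example data are assumptions for illustration, not values from the paper.

import numpy as np

def weighted_sse(F, w):
    """Weighted sum of squared deviations of feature rows F from their mean."""
    if len(F) == 0:
        return 0.0
    d = F - F.mean(axis=0)
    return float(np.sum(w * d * d))

def split_score(F, mask, w):
    """Expected weighted SSE after splitting instances by a boolean mask;
    each child's error is weighted by the fraction of instances it holds.
    Lower is better; the tree keeps the question with the minimum score."""
    n = len(F)
    return sum((len(c) / n) * weighted_sse(c, w) for c in (F[mask], F[~mask]))

# Invented instances of one base unit; columns = (f0 Hz, duration ms, energy).
F = np.array([[200., 90., 1.0], [210., 95., 1.1], [180., 120., 0.8],
              [175., 125., 0.7], [205., 92., 1.0]])
w = np.array([1.0, 0.5, 2.0])                        # assumed feature weights
mask = np.array([True, True, False, False, True])    # one candidate question
print(split_score(F, mask, w))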
4:10, SPEECH-L2.3
A QUANTITATIVE METHOD FOR MODELING CONTEXT IN CONCATENATIVE SYNTHESIS USING LARGE SPEECH DATABASE
W. HAMZA, M. RASHWAN, M. AFIFY
Modeling phonetic context is one of the keys to achieving natural-sounding output in concatenative speech synthesis. In this paper, a new quantitative method for modeling context is proposed. In the proposed method, context is measured as the distance between leaves of the top-down, likelihood-based decision trees grown during construction of the acoustic inventory. Unlike other context modeling methods, this method allows the unit selection algorithm to borrow unit occurrences from other contexts when their context distances are small. This is done by incorporating the measured distance as a term in the unit selection cost function. The motivation behind this method is that it reduces the required speech modification by using better unit occurrences from nearby contexts. This method also makes it easy to use long synthesis units, e.g. syllables or words, within the same unit selection framework.
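A minimal sketch of how such a leaf-to-leaf distance could enter a standard Viterbi unit-selection search: the target cost of a candidate is inflated by a weighted distance between the candidate's tree leaf and the target context's leaf, so borrowing from a nearby context stays cheap. The distance table, cost terms, and weight w_ctx are illustrative stand-ins, not the paper's actual formulation.

def select_units(targets, candidates, context_distance, join_cost, w_ctx=1.0):
    """targets[t]: decision-tree leaf of the target context at slot t.
    candidates[t]: list of (unit_id, leaf_id, base_target_cost) tuples.
    context_distance[(leaf_a, leaf_b)]: distance between two tree leaves."""
    best = []  # best[t][j] = (accumulated cost, backpointer into slot t-1)
    for t, cands in enumerate(candidates):
        row = []
        for unit_id, leaf, base in cands:
            # Context term: units from nearby contexts are cheap to borrow.
            local = base + w_ctx * context_distance[(leaf, targets[t])]
            if t == 0:
                row.append((local, None))
            else:
                cost, bp = min(
                    (best[t-1][i][0] + join_cost(candidates[t-1][i][0], unit_id), i)
                    for i in range(len(candidates[t-1])))
                row.append((cost + local, bp))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda i: best[-1][i][0])
    path = []
    for t in range(len(best) - 1, -1, -1):
        path.append(candidates[t][j][0])
        j = best[t][j][1]
    return list(reversed(path))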
4:30, SPEECH-L2.4
DURATION MODELING IN A RESTRICTED-DOMAIN FEMALE-VOICE SYNTHESIS IN SPANISH USING NEURAL NETWORKS
R. CÓRDOBA, J. MONTERO, J. GUTIERREZ-ARRIOLA, J. PARDO
The objective of this paper is the accurate prediction of segmental duration in a Spanish text-to-speech system. Many parameters affect duration, but not all of them are always relevant. We present a complete environment in which to decide which parameters are most relevant and how best to code them. This work is a continuation of [1], where all efforts were dedicated to an unrestricted-domain database for a male voice. Here, we consider a female voice in a restricted-domain environment. The restricted domain offers several advantages for modeling: variation across the different patterns is reduced, so most of our decisions about the parameters are now based on more significant results. The conclusions we present therefore show clearly which parameters work best. The system is based on a fully configurable neural network.
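A minimal sketch of a configurable neural duration model in the spirit of the abstract, not the authors' system: one-hot phone identity plus binary contextual flags as inputs, segment duration in milliseconds as the target, with scikit-learn's MLPRegressor standing in for their network. The feature set and data are invented placeholders.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows: (phone, stressed?, word-final?, duration_ms).
rows = [("a", 1, 0, 95.0), ("s", 0, 1, 110.0), ("a", 0, 0, 70.0),
        ("t", 0, 1, 85.0), ("e", 1, 1, 120.0), ("t", 0, 0, 60.0)]

enc = OneHotEncoder(handle_unknown="ignore")
phones = enc.fit_transform([[r[0]] for r in rows]).toarray()
X = np.hstack([phones, [[r[1], r[2]] for r in rows]])
y = [r[3] for r in rows]

# hidden_layer_sizes is the knob that makes the topology configurable;
# a real study would sweep it together with the input coding, which is
# the parameter-relevance question the paper investigates.
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X, y)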
4:50, SPEECH-L2.5
A FUNCTIONAL ARTICULATORY DYNAMIC MODEL FOR SPEECH PRODUCTION
L. LEE, P. FIEGUTH, L. DENG
This paper introduces a new statistical speech production model. The model synthesizes natural speech by modeling some key dynamic properties of vocal articulators in a linear/nonlinear state-space
framework. The goal-oriented movements of the articulators (tongue
tip, tongue dorsum, upper lip, lower lip, and jaw) are described in a
linear dynamic state equation. The resulting articulatory
trajectories, combined with the effects of the velum and larynx, are
nonlinearly mapped into the acoustic feature space (MFCCs). The key
challenges in this model are the development of a nonlinear parameter estimation methodology and the incorporation of appropriate prior assumptions into the articulatory dynamic structure. Such a model can also be applied directly to speech recognition, where it better accounts for coarticulation and phonetic reduction phenomena with considerably fewer parameters than HMM-based approaches.
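The generative structure described above can be made concrete in a few lines: a first-order, target-directed linear state equation for the five articulators, followed by a fixed nonlinear map into MFCC-like features. The specific forms (first-order dynamics, a tanh observation map) and all constants below are assumptions for illustration, not the paper's estimated model.

import numpy as np

rng = np.random.default_rng(0)
D_ART, D_ACO, T = 5, 12, 50  # articulators, acoustic (MFCC) dimension, frames

target = rng.normal(size=D_ART)   # phone-dependent articulatory goal
Phi = 0.2 * np.eye(D_ART)         # rate of movement toward the goal
W = rng.normal(scale=0.3, size=(D_ACO, D_ART))  # stand-in observation weights
b = rng.normal(scale=0.1, size=D_ACO)

x = np.zeros((T, D_ART))   # articulator positions (tip, dorsum, lips, jaw)
y = np.zeros((T, D_ACO))   # acoustic feature trajectory
for t in range(1, T):
    # Linear dynamic state equation: move a fraction of the way to the goal.
    x[t] = x[t-1] + Phi @ (target - x[t-1]) + rng.normal(scale=0.01, size=D_ART)
    # Nonlinear mapping from articulatory state into the acoustic space.
    y[t] = np.tanh(W @ x[t]) + b + rng.normal(scale=0.05, size=D_ACO)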
5:10, SPEECH-L2.6
THE EFFECT OF LANGUAGE MODEL PROBABILITY ON PRONUNCIATION REDUCTION
D. JURAFSKY, A. BELL, M. GREGORY, W. RAYMOND
We investigate how the probability of a word affects its
pronunciation. We examined 5618 tokens of the 10 most frequent
(function) words in Switchboard: I, and, the, that, a, you, to, of, it,
and in, and 2042 tokens of content words whose lexical form ends in a t
or d. Our observations were drawn from the phonetically
hand-transcribed subset of the Switchboard corpus, enabling us to code
each word with its pronunciation and duration. Using linear and
logistic regression to control for contextual factors, we show that
words which have a high unigram, bigram, or reverse bigram (given the
following word) probability are shorter, more likely to have a reduced
vowel, and more likely to have a deleted final t or d. These results
suggest that pronunciation models in speech recognition and synthesis
should take into account the probability of words given both the
previous and following words.
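To make the regression setup concrete, the sketch below fits a logistic regression predicting final t/d deletion from a word's log bigram probability while controlling for two contextual factors. All data are simulated placeholders wired to reproduce the reported direction of effect; they are not the Switchboard measurements, and the control variables are examples of my choosing.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
log_bigram = rng.normal(-6, 2, n)     # log P(word | previous word)
rate = rng.normal(5, 1, n)            # speaking rate, a contextual control
post_cons = rng.integers(0, 2, n)     # following segment consonantal, a control

# Simulated ground truth: higher probability -> more final t/d deletion.
logit = 0.6 * (log_bigram + 6) + 0.4 * (rate - 5) + 0.8 * post_cons - 0.5
deleted = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([log_bigram, rate, post_cons])
model = LogisticRegression().fit(X, deleted)
print(dict(zip(["log_bigram", "rate", "post_cons"], model.coef_[0])))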