Session: SPEECH-P8
Time: 1:00 - 3:00, Thursday, May 10, 2001
Location: Exhibit Hall Area 8
Title: Speech Synthesis
Chair: Jose Manuel Pardo

1:00, SPEECH-P8.1
AN RNN-BASED ALGORITHM TO DETECT PROSODIC PHRASE FOR CHINESE TTS
Z. YING, X. SHI
The goal of the work presented here is to automatically predict prosodic phrase boundaries from text for Chinese TTS (text-to-speech) using an RNN (recurrent neural network) whose input is the POS (part-of-speech) trigram together with the break information of the two preceding word pairs. Prosodic phrase boundaries are very important to a Chinese TTS system because they influence the prosodic model for speech synthesis. In this paper, the algorithm uses the RNN to learn a mapping between the POS sequence and prosodic phrase boundaries, with the aim of improving the naturalness of synthesized speech.
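A minimal sketch of the kind of predictor described, assuming an Elman-style RNN over one-hot POS-trigram features concatenated with the break flags of the two preceding word pairs; the names, dimensions, and feature encoding are illustrative assumptions, not the authors' implementation:

    import numpy as np

    N_POS, HIDDEN = 30, 16          # hypothetical tagset size and hidden width
    rng = np.random.default_rng(0)

    # Elman-style RNN parameters (random here; training is not shown).
    W_xh = rng.normal(scale=0.1, size=(HIDDEN, 3 * N_POS + 2))
    W_hh = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
    w_out = rng.normal(scale=0.1, size=HIDDEN)

    def one_hot(tag):
        v = np.zeros(N_POS)
        v[tag] = 1.0
        return v

    def predict_boundaries(pos_tags, threshold=0.5):
        """At each word junction, feed the surrounding POS trigram plus the
        break decisions made at the two previous junctions."""
        h = np.zeros(HIDDEN)
        breaks = [0.0, 0.0]
        out = []
        for i in range(1, len(pos_tags) - 1):
            x = np.concatenate([one_hot(pos_tags[i - 1]),
                                one_hot(pos_tags[i]),
                                one_hot(pos_tags[i + 1]),
                                breaks[-2:]])
            h = np.tanh(W_xh @ x + W_hh @ h)
            p = 1.0 / (1.0 + np.exp(-(w_out @ h)))   # boundary probability
            b = float(p > threshold)
            out.append(b)
            breaks.append(b)
        return out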

1:00, SPEECH-P8.2
DESIGN AND EVALUATION OF A VOICE CONVERSION ALGORITHM BASED ON SPECTRAL ENVELOPE MAPPING AND RESIDUAL PREDICTION
A. KAIN, M. MACON
The purpose of a voice conversion (VC) system is to change the perceived speaker identity of a speech signal. In this paper, we propose a new algorithm based on converting the LPC spectrum and predicting the residual as a function of the target envelope parameters. We conduct listening tests based on speaker discrimination of same/different pairs to measure the accuracy with which the converted voices match the desired target voices. To establish the level of human performance as a baseline, we first measure the ability of listeners to discriminate between original speech utterances under three conditions: normal, fundamental frequency and duration normalized, and LPC coded. Additionally, the spectral parameter conversion function is tested in isolation by listening to source, target, and converted speakers as LPC coded speech. The results show that speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech. However, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
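For orientation, spectral envelope conversion of this kind is commonly expressed as GMM-based regression from source envelope parameters x to target parameters y; a standard form of the mapping function (standard in the voice conversion literature; the paper's exact formulation may differ) is

    \hat{y} = F(x) = \sum_{i=1}^{m} h_i(x) \left[ \mu_i^{y}
        + \Sigma_i^{yx} \left( \Sigma_i^{xx} \right)^{-1} \left( x - \mu_i^{x} \right) \right],
    \qquad
    h_i(x) = \frac{\alpha_i \, \mathcal{N}(x;\, \mu_i^{x}, \Sigma_i^{xx})}
                  {\sum_{j=1}^{m} \alpha_j \, \mathcal{N}(x;\, \mu_j^{x}, \Sigma_j^{xx})},

where the means, covariances, and weights come from a GMM over joint source-target feature vectors; the residual predictor is then a second mapping conditioned on the converted, target-domain envelope parameters.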

1:00, SPEECH-P8.3
GENERATION OF F0 CONTOURS USING A MODEL-CONSTRAINED DATA-DRIVEN METHOD
A. SAKURAI, K. HIROSE, N. MINEMATSU
This paper introduces a novel model-constrained, data-driven method for generating fundamental frequency contours in Japanese text-to-speech synthesis. In the training phase, the parameters of a command-response F0 contour generation model are learned by a prediction model, which can be a neural network or a set of binary regression trees. The input features consist of linguistic information related to accentual phrases that can be automatically extracted from text, such as the position of the accentual phrase in the utterance, number of morae, accent type, and parts-of-speech. In the synthesis phase, the prediction module is used to generate appropriate values of model parameters. The use of the parametric model restricts the degrees of freedom of the problem, facilitating data-driven learning. Experimental results show that the method makes it possible to generate quite natural F0 contours with a relatively small training database.
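The command-response model referred to here is usually written as follows (a standard formulation of the Fujisaki model; symbols follow common usage rather than this particular paper):

    \ln F_0(t) = \ln F_b
        + \sum_{i=1}^{I} A_{pi} \, G_p(t - T_{0i})
        + \sum_{j=1}^{J} A_{aj} \left[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \right],

    G_p(t) = \begin{cases} \alpha^2 t \, e^{-\alpha t}, & t \ge 0 \\ 0, & t < 0 \end{cases}
    \qquad
    G_a(t) = \begin{cases} \min\!\left[ 1 - (1 + \beta t) e^{-\beta t},\, \gamma \right], & t \ge 0 \\ 0, & t < 0 \end{cases}

where F_b is the baseline frequency and A_{pi}, A_{aj} are the phrase and accent command magnitudes; the prediction module's job is to output these command parameters for each accentual phrase.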

1:00, SPEECH-P8.4
NEW RULE-BASED AND DATA-DRIVEN STRATEGY TO INCORPORATE FUJISAKI'S F0 MODEL TO A TEXT-TO-SPEECH SYSTEM IN CASTILLIAN SPANISH
J. GUTIÉRREZ-ARRIOLA, J. MONTERO, D. SAIZ, J. PARDO
We present an analysis of a Spanish prosody database carried out by estimating the parameters of Fujisaki's model for F0 contours. These parameters are classified according to linguistic features and form the analysis database. When synthesizing F0 contours, we extract the linguistic features from the text and perform a k-Nearest Neighbour search. The linguistic feature comparison distance is trained using data from the prosody database. To avoid artifacts, we perform rule-based filtering on the synthesis parameters. The results of our evaluation test show that the proposed system is significantly better than the previous neural network approach. This evaluation confirms the ability of Fujisaki's model to represent prosodic information based on linguistic features.
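A minimal sketch of the synthesis-time lookup, assuming per-feature weights for the trained comparison distance; the feature encoding, weights, and stored parameter names are illustrative assumptions:

    import numpy as np

    def weighted_knn(query, database, weights, k=3):
        """Return the Fujisaki parameter sets of the k database entries whose
        linguistic feature vectors are closest to the query under a
        per-feature weighted Euclidean distance."""
        feats = np.array([entry["features"] for entry in database])
        d = np.sqrt((weights * (feats - query) ** 2).sum(axis=1))
        idx = np.argsort(d)[:k]
        return [database[i]["fujisaki_params"] for i in idx]

    # Illustrative entries: features might encode stress pattern, syllable
    # count, and position in the sentence.
    db = [{"features": np.array([1.0, 4.0, 0.2]),
           "fujisaki_params": {"Aa": 0.35, "T1": 0.10, "T2": 0.42}},
          {"features": np.array([0.0, 2.0, 0.8]),
           "fujisaki_params": {"Aa": 0.21, "T1": 0.05, "T2": 0.30}}]
    w = np.array([2.0, 0.5, 1.0])       # trained feature weights (assumed)
    print(weighted_knn(np.array([1.0, 3.0, 0.3]), db, w, k=1))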

1:00, SPEECH-P8.5
SEGMENTING UNRESTRICTED CHINESE TEXT INTO PROSODIC WORDS INSTEAD OF LEXICAL WORDS
M. CHU, H. PENG, Y. QIAN
This paper stresses the importance of converting a string of lexical words into a string of prosodic words in TTS systems by presenting the surface and perceptual differences between them. A statistical-rule-based method and a CART-based method are proposed as solutions. Although the ComplicatedSet-based CART method performs best, this gain comes at the cost of the heavy computational workload required by a parser. The statistical-rule-based method yields higher recall but lower precision than the SimpleSet-based CART method; it is difficult to say which is better, since we do not know whether precision or recall affects naturalness more. Both methods require only lexical word segmentation and POS tagging in the preprocessing stage and are easily realized in TTS systems. Results of a preference test show that significant improvements in naturalness are perceived when lexical word strings are converted into prosodic word strings by our approach.
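Recall and precision here are the usual boundary-level measures; a small sketch with hypothetical boundary sets illustrates the trade-off the authors describe:

    def boundary_precision_recall(predicted, reference):
        """Precision/recall over sets of prosodic-word boundary positions
        within one sentence."""
        predicted, reference = set(predicted), set(reference)
        tp = len(predicted & reference)
        precision = tp / len(predicted) if predicted else 1.0
        recall = tp / len(reference) if reference else 1.0
        return precision, recall

    # Hypothetical case: one method finds every true boundary but inserts a
    # spurious one (high recall, lower precision); the other misses one but
    # inserts none (high precision, lower recall).
    ref = {2, 5, 9}
    print(boundary_precision_recall({2, 4, 5, 9}, ref))  # (0.75, 1.0)
    print(boundary_precision_recall({2, 5}, ref))        # (1.0, 0.667)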

1:00, SPEECH-P8.6
TRAINABLE SPEECH SYNTHESIS WITH TRENDED HIDDEN MARKOV MODELS
J. DINES, S. SRIDHARAN
In this paper we present a trainable speech synthesis system that uses the trended Hidden Markov Model to generate the trajectories of the spectral features of synthesis units. The synthesis units are trained from a transcribed continuous speech corpus, making the speech more natural than that produced by conventional diphone synthesisers, which are generally trained from a highly articulated speech database and require a large investment of time and effort to train a new voice. The overall system has been incorporated into a PSOLA synthesiser to produce speech that is natural sounding and preserves the identity of the source speaker.
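In a trended HMM, the observation in each state is modeled as a deterministic polynomial trend plus a stationary residual; a common formulation (following the trended-HMM literature, and possibly differing in detail from this paper) is

    o_t = \sum_{r=0}^{R} B_r(i) \, (t - \tau_i)^r + e_t(i),
    \qquad e_t(i) \sim \mathcal{N}(0, \Sigma_i),

where \tau_i is the time at which state i was entered and the B_r(i) are state-dependent regression coefficients, so that the generated spectral trajectories vary smoothly within a state instead of staying piecewise constant.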

1:00, SPEECH-P8.7
PERCEPTUAL AND OBJECTIVE DETECTION OF DISCONTINUITIES IN CONCATENATIVE SPEECH SYNTHESIS
Y. STYLIANOU, A. SYRDAL
Concatenative speech synthesis systems attempt to minimize audible signal discontinuities between two successive concatenated units. An objective distance measure that can predict audible discontinuities is therefore very important, particularly in unit selection synthesis, where units are selected from a large inventory at run time. In this paper, we describe a perceptual test to measure the rate at which humans detect concatenation discontinuities, and we then evaluate 13 different objective distance measures based on their ability to predict the human results. Criteria used to classify these distances include the detection rate, the Bhattacharyya measure of separability of two distributions, and Receiver Operating Characteristic (ROC) curves. Results show that the Kullback-Leibler distance on power spectra has the highest detection rate, followed by the Euclidean distance on Mel-Frequency Cepstral Coefficients (MFCCs).
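A minimal sketch of the two best-performing distances, assuming power spectra and MFCC vectors taken from the frames on either side of a concatenation point (function names and the normalization are illustrative):

    import numpy as np

    def sym_kl_power_spectra(p, q, eps=1e-12):
        """Symmetric Kullback-Leibler distance between two power spectra,
        each normalized to sum to one so it can be treated as a
        probability distribution over frequency."""
        p = p / p.sum()
        q = q / q.sum()
        return float(np.sum((p - q) * np.log((p + eps) / (q + eps))))

    def euclidean_mfcc(c1, c2):
        """Euclidean distance between two MFCC vectors."""
        return float(np.linalg.norm(np.asarray(c1) - np.asarray(c2)))

A large value at the join between the last frame of one unit and the first frame of the next would then be taken to predict an audible discontinuity.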

1:00, SPEECH-P8.8
VOICE CONVERSION ALGORITHM BASED ON GAUSSIAN MIXTURE MODEL WITH DYNAMIC FREQUENCY WARPING OF STRAIGHT SPECTRUM
T. TODA, H. SARUWATARI, K. SHIKANO
In voice conversion based on a Gaussian Mixture Model (GMM) applied to STRAIGHT, the quality of converted speech is degraded because the converted spectrum is excessively smoothed. In this paper, we propose a GMM-based algorithm with dynamic frequency warping to avoid this over-smoothing. We also propose adding a weighted residual spectrum, defined as the difference between the GMM-converted spectrum and the frequency-warped spectrum, to avoid degrading the conversion accuracy on speaker individuality. Evaluation experiments show that, with a properly weighted residual spectrum, the proposed method produces converted speech of higher quality than the GMM-based algorithm while retaining the same conversion accuracy on speaker individuality.
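Dynamic frequency warping can be computed with the same dynamic program as dynamic time warping, only along the frequency axis; a minimal sketch (illustrative, not the authors' exact procedure) that aligns a source log-magnitude spectrum to the GMM-converted one:

    import numpy as np

    def dfw_path(src_spec, tgt_spec):
        """DTW-style alignment of two log-magnitude spectra along
        frequency; returns the warping path as (src_bin, tgt_bin) pairs."""
        src_spec = np.asarray(src_spec, dtype=float)
        tgt_spec = np.asarray(tgt_spec, dtype=float)
        n, m = len(src_spec), len(tgt_spec)
        cost = np.abs(src_spec[:, None] - tgt_spec[None, :])
        acc = np.full((n, m), np.inf)
        acc[0, 0] = cost[0, 0]
        for i in range(n):
            for j in range(m):
                if i == 0 and j == 0:
                    continue
                prev = min(acc[i - 1, j] if i else np.inf,
                           acc[i, j - 1] if j else np.inf,
                           acc[i - 1, j - 1] if i and j else np.inf)
                acc[i, j] = cost[i, j] + prev
        # Backtrack from the top-right corner of the accumulated-cost grid.
        path, i, j = [(n - 1, m - 1)], n - 1, m - 1
        while i or j:
            cands = [(a, b) for a, b in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                     if a >= 0 and b >= 0]
            i, j = min(cands, key=lambda p: acc[p])
            path.append((i, j))
        return path[::-1]

The warped spectrum keeps the detailed structure of the source while following the converted envelope's frequency axis; the weighted residual spectrum is then added on top, as described above.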

1:00, SPEECH-P8.9
ADAPTATION OF PITCH AND SPECTRUM FOR HMM-BASED SPEECH SYNTHESIS USING MLLR
M. TAMURA, T. MASUKO, K. TOKUDA, T. KOBAYASHI
This paper describes a technique for synthesizing speech with arbitrary speaker characteristics using speaker-independent speech units, which we call "average voice" units. The technique is based on an HMM-based text-to-speech (TTS) system and the MLLR (maximum likelihood linear regression) adaptation algorithm. In the HMM-based TTS system, speech synthesis units are modeled by multi-space probability distribution (MSD) HMMs, which can model spectrum and pitch simultaneously in a unified framework. We derive an extension of the MLLR algorithm to apply it to MSD-HMMs. We demonstrate that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features. Synthetic speech generated from models adapted using only four sentences is very close to that from speaker-dependent models trained using 450 sentences.
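In MLLR, each Gaussian mean of the average-voice models is re-estimated through a shared affine transform learned from the adaptation data; in standard notation,

    \hat{\mu} = A \mu + b = W \xi, \qquad \xi = \left[ 1, \mu^{\top} \right]^{\top},
    \quad W = \left[\, b \;\; A \,\right],

where W is chosen to maximize the likelihood of the target speaker's adaptation utterances; the extension described here applies such transforms within the MSD framework so that the pitch stream is adapted alongside the spectrum.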

1:00, SPEECH-P8.10
SPEECH SYNTHESIS USING STOCHASTIC MARKOV GRAPHS
M. EICHNER, M. WOLFF, S. OHNEWALD, R. HOFFMANN
Speech synthesis systems based on concatenation of natural speech segments achieve a high quality in terms of naturalness and intelligibility. However, in many applications such systems are not easy to apply because of the huge demand for storage capacity. Speech synthesis systems based on HMMs could be an alternative to concatenative speech synthesis systems but do not yet achieve the quality needed for use in applications. In this context we are examining the suitability of Stochastic Markov Graphs instead of HMMs to improve the performance of such synthesis systems. This paper describes the training procedure we used to train the SMGs, explains the synthesis process and introduces an algorithm for state selection and state duration modeling. We focus particularly on issues which arise using SMGs instead of HMMs.