SPEECH SYNTHESIS & PRODUCTION

Chair: Kathleen Cummings, Georgia Institute of Technology (USA)


Speech Synthesis System Based on a Variable Decimation/Interpolation Factor

Authors:

F. M. Gimenez de los Galanes, Universidad Politecnica de Madrid
M. H. Savoji, Universidad de Cantabria
J. M. Pardo, Universidad Politecnica de Madrid (SPAIN)

Volume 1, Page 636

Abstract:

In this paper we present a modification of the usual decimation-interpolation steps for resampling of speech signals, especially adapted to arbitrary modification of the fundamental frequency and duration of speech segments. The modification is intended to overcome the time- and frequency-domain limitations that such a resampling scheme imposes, so that it can be used in a speech synthesis system. The performance of this resampling method for prosody modification is better than that of the equivalent PSOLA (Pitch-Synchronous Overlap-Add) method when working at a sampling frequency of 8 or 10 kHz, where the source spectrum of the voiced allophones can be considered completely harmonic. An optimization of the proposed algorithm that allows a real-time implementation is also discussed.
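
As a rough illustration of the core idea (a sketch only, not the authors' algorithm): resampling a voiced frame by a variable factor and replaying it at the original rate rescales its fundamental frequency. A minimal Python example using scipy:

    import numpy as np
    from scipy.signal import resample_poly

    fs = 8000                                # 8 kHz, one of the paper's test rates
    t = np.arange(0, 0.04, 1.0 / fs)
    frame = np.sin(2 * np.pi * 100 * t)      # toy voiced frame, F0 = 100 Hz

    def rescale(frame, up, down):
        # Polyphase resampling by up/down changes the frame length;
        # played back at the original fs, all frequencies (F0 included)
        # scale by down/up.
        return resample_poly(frame, up, down)

    raised = rescale(frame, 10, 11)          # ~10% shorter, F0 up ~10%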

300dpi TIFF Images of pages:

636 637 638 639

Acrobat PDF file of whole paper:

ic950636.pdf




Automatic Speech Synthesiser Parameter Estimation Using HMMs

Authors:

R.E. Donovan, Cambridge University (UK)
P.C. Woodland, Cambridge University (UK)

Volume 1, Page 640

Abstract:

This paper presents a new approach to speech synthesis which uses a set of decision-tree state-clustered triphone HMMs to automatically segment a single-speaker speech database into sub-word units suitable for use in a synthesiser. Parameters are then obtained for each of these sub-word units from the segmented database, enabling a basic synthesis system to be constructed. This automatic generation of synthesis parameters means that the system can easily be retrained on a new speaker, whose voice it then mimics. It also means that a very large number of sub-word units can be used, which enables more precise context modelling than was previously possible.
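
The segmentation step rests on forced alignment; a minimal sketch of left-to-right Viterbi alignment of a feature sequence to HMM states (toy log-likelihoods, with the decision-tree clustering and triphone modelling omitted):

    import numpy as np

    def force_align(log_probs):
        # log_probs[t, s]: frame t's log-likelihood under state s.
        # States are visited strictly left to right, with no skips.
        T, S = log_probs.shape
        dp = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        dp[0, 0] = log_probs[0, 0]
        for t in range(1, T):
            for s in range(S):
                stay = dp[t - 1, s]
                enter = dp[t - 1, s - 1] if s > 0 else -np.inf
                back[t, s] = s if stay >= enter else s - 1
                dp[t, s] = max(stay, enter) + log_probs[t, s]
        path = [S - 1]                       # backtrace from the final state
        for t in range(T - 1, 0, -1):
            path.append(back[t, path[-1]])
        return path[::-1]                    # state index for each frame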

300dpi TIFF Images of pages:

640 641 642 643

Acrobat PDF file of whole paper:

ic950640.pdf




Speaker Modification with LPC Pole Analysis

Authors:

Janet Slifka, Systems Research Laboratories
Timothy R. Anderson, Armstrong Laboratory (USA)

Volume 1, Page 644

Abstract:

Speaker modification is the ability to change the perceived speaker identity of a recorded utterance. Basic to this is the capability to alter the vowel segments of speech. Not only do these segments comprise the majority of the voiced portion of speech, but they are dominated by clearly defined acoustic parameters: formant frequencies and pitch. A method of altering the formant frequencies of vowel segments using LPC analysis/synthesis was investigated. Pole location modification based on statistical references provided individual control over formant frequencies and bandwidths but, in some transformations, led to artifacts in the reconstructed speech.
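
A minimal sketch of the pole-rotation idea (illustrative only; the paper's statistically derived targets are not reproduced): find the LPC poles, rotate the complex-pole pair nearest a formant to a new angle, and rebuild the predictor polynomial.

    import numpy as np

    def shift_formant(a, fs, f_old, f_new, tol_hz=100.0):
        # a: LPC polynomial coefficients [1, a1, ..., ap].
        # Rotate complex poles within tol_hz of f_old to the angle of
        # f_new, keeping the radius (and hence, roughly, the bandwidth).
        poles = np.roots(a)
        moved = []
        for p in poles:
            f = np.angle(p) * fs / (2 * np.pi)
            if p.imag != 0 and abs(abs(f) - f_old) < tol_hz:
                p = np.abs(p) * np.exp(1j * np.sign(f) * 2 * np.pi * f_new / fs)
            moved.append(p)
        return np.poly(moved).real           # modified LPC polynomial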

300dpi TIFF Images of pages:

644 645 646 647

Acrobat PDF file of whole paper:

ic950644.pdf




Synthesizing Styled Speech Using the Klatt Synthesizer

Authors:

Janet C. Rutledge, Northwestern University
Kathleen E. Cummings, Georgia Institute of Technology (USA)
Daniel A. Lambert, Georgia Institute of Technology (USA)
Mark A. Clements, Georgia Institute of Technology (USA)

Volume 1, Page 648

Abstract:

This paper reports the implementation of high-quality synthesis of speech with varying speaking styles using the Klatt synthesizer. This research builds on previously reported work which determined that the glottal waveforms of various styles of speech are significantly and identifiably different. Given the parameter tracks that control the synthesis of a normal version of an utterance, the parameters that control known acoustic correlates of speaking style are varied appropriately, relative to normal, to synthesize styled speech. In addition to the parameters that control the glottal waveshape, phoneme duration, phoneme intensity, and pitch contour are also varied. Listening tests were performed which demonstrate that the synthetic speech is perceptibly and appropriately styled, and that it sounds natural; the results are presented in this paper.
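
A minimal sketch of this control strategy, assuming invented style scale factors (the paper's measured, style-specific values are not reproduced): scale each Klatt parameter track relative to its normal version and time-warp tracks to change duration.

    import numpy as np

    STYLE_SCALES = {                 # placeholder factors, not measured values
        "angry": {"f0": 1.3, "av": 1.2, "oq": 0.8},
        "soft":  {"f0": 0.9, "av": 0.8, "oq": 1.2},
    }

    def style_tracks(tracks, style, dur_factor=1.0):
        # tracks: dict of per-frame Klatt parameter arrays for the
        # normal utterance, e.g. {"f0": ..., "av": ..., "oq": ...}.
        s = STYLE_SCALES[style]
        out = {}
        for name, track in tracks.items():
            track = np.asarray(track, float) * s.get(name, 1.0)
            if dur_factor != 1.0:    # stretch or compress in time
                n = max(2, int(round(len(track) * dur_factor)))
                track = np.interp(np.linspace(0, len(track) - 1, n),
                                  np.arange(len(track)), track)
            out[name] = track
        return out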

300dpi TIFF Images of pages:

648 649 650 651

Acrobat PDF file of whole paper:

ic950648.pdf




Acoustical Measurements of the Vocal-Tract Area Function: Sensitivity Analysis and Experiments

Authors:

Hani Yehia, Nagoya University
Masaaki Honda, NTT Basic Research Laboratories
Fumitada Itakura, Nagoya University (JAPAN)

Volume 1, Page 652

Abstract:

A method to determine the vocal-tract cross-sectional area function from acoustical measurements at the lips is analyzed here. Within the framework described by Sondhi and Gopinath (1971) and implemented by Sondhi and Resnick (1983), a sensitivity analysis of the vocal-tract area function, derived from the impedance or reflectance at the lips, is performed. It indicates that, in the ideal case, the area function is not heavily affected by random distortions of the impulse response at the lips. Simulations and real measurements show that the method works relatively well, except for regions behind narrow constrictions; in that case, an excitation pulse with high energy, as well as fine sampling, proved to be important. The excitation used here is a time-stretched pulse, which delivers high energy without requiring a high-power sound generator.
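
The discrete analogue of such an inversion is compact: in a lossless-tube model, each junction's reflection coefficient fixes the ratio of adjacent section areas. A minimal sketch (coefficients invented; the paper's continuous-domain treatment and sensitivity analysis are not shown):

    import numpy as np

    def areas_from_reflections(k, a_lips=1.0):
        # k[i]: reflection coefficient at junction i, ordered from the
        # lips inward, with k = (A_front - A_back) / (A_front + A_back).
        areas = [a_lips]
        for ki in k:
            areas.append(areas[-1] * (1 - ki) / (1 + ki))
        return np.array(areas)       # section areas, lips to glottis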

300dpi TIFF Images of pages:

652 653 654 655

Acrobat PDF file of whole paper:

ic950652.pdf




Shape-Invariant Pitch-Synchronous Text-to-Speech Conversion

Authors:

Eduardo R. Banga, Universidad de Vigo (SPAIN)
Carmen Garcia-Mateo, Universidad de Vigo (SPAIN)

Volume 1, Page 656

Abstract:

Text-to-speech (TTS) systems based on the concatenation of speech units need a prosodic modification algorithm to adjust the prosodic features of the stored speech units to the desired output values. In this paper, we discuss the application of a sinusoidal shape-invariant model to a TTS system for Spanish, paying special attention to concatenation issues and phase treatment. As a result, the synthetic speech waveform resembles the waveform of its constituent units, without sounding reverberant as in other sinusoidal implementations.
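
A minimal sketch of the shape-invariance idea (toy amplitudes and phases, not the paper's analysis/synthesis system): harmonic phases are reapplied relative to a common onset time, so the pulse shape survives pitch modification.

    import numpy as np

    def synth_frame(f0, amps, phases, fs, n, t_onset=0.0):
        # Sum of harmonics of f0; `phases` are offsets measured at a
        # glottal onset and reapplied relative to t_onset, keeping the
        # waveform shape consistent when f0 is changed.
        t = np.arange(n) / fs
        out = np.zeros(n)
        for h, (a, ph) in enumerate(zip(amps, phases), start=1):
            out += a * np.cos(2 * np.pi * h * f0 * (t - t_onset) + ph)
        return out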

300dpi TIFF Images of pages:

656 657 658 659

Acrobat PDF file of whole paper:

ic950656.pdf




Speech Parameter Generation from HMM Using Dynamic Features

Authors:

Keiichi Tokuda, Tokyo Institute of Technology (JAPAN)
Takao Kobayashi, Tokyo Institute of Technology (JAPAN)
Satoshi Imai, Tokyo Institute of Technology (JAPAN)

Volume 1, Page 660

Abstract:

This paper proposes an algorithm for speech parameter generation from HMMs which include dynamic features. The performance of HMM-based speech recognition has been improved by introducing the dynamic features of speech; we therefore surmise that a method for generating speech parameters from HMMs which include dynamic features would be useful for speech synthesis by rule. It is shown that parameter generation from HMMs using dynamic features reduces to searching for the optimum state sequence and solving a set of linear equations for each possible state sequence. We derive a fast algorithm for the solution by analogy with the RLS algorithm for adaptive filtering, and we show the effect of incorporating dynamic features through an example of speech parameter generation.
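
For a fixed state sequence this is a linear problem; a minimal one-dimensional sketch with a simple delta window (toy means and variances, and without the fast RLS-like recursion the paper derives):

    import numpy as np

    def generate(mu_s, var_s, mu_d, var_d):
        # Solve (W' U^-1 W) c = W' U^-1 m for the static trajectory c,
        # where m stacks the static and delta means, U their variances,
        # and W maps c to [statics; deltas], delta = 0.5*(c[t+1] - c[t-1]).
        T = len(mu_s)
        W = np.zeros((2 * T, T))
        for t in range(T):
            W[t, t] = 1.0                    # static rows
            if 0 < t < T - 1:                # interior delta rows
                W[T + t, t - 1] = -0.5
                W[T + t, t + 1] = 0.5
        m = np.concatenate([mu_s, mu_d])
        u_inv = np.concatenate([1.0 / np.asarray(var_s),
                                1.0 / np.asarray(var_d)])
        A = W.T @ (u_inv[:, None] * W)
        b = W.T @ (u_inv * m)
        return np.linalg.solve(A, b)         # ML static parameter track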

300dpi TIFF Images of pages:

660 661 662 663

Acrobat PDF file of whole paper:

ic950660.pdf




A Source Generator Based Modeling Framework for Synthesis of Speech Under Stress

Authors:

Sahar E. Bou-Ghazale, Duke University (USA)
John H. L. Hansen, Duke University (USA)

Volume 1, Page 664

Abstract:

The objective of this paper is to formulate an algorithm to generate stressed synthetic speech from neutral speech using a source generator framework previously employed for stressed speech recognition [2,3]. The following goals are addressed: (i) identify the most salient indicators of stress as perceived by the listener in stressed speaking styles such as loud, Lombard effect, and angry; (ii) develop a mathematical model for representing speech production under stressed conditions; and (iii) employ the above model to produce emotional/stressed synthetic speech from neutral speech. The stress modeling scheme is applied to an existing low-bit-rate CELP speech coder in order to investigate (i) the coder's ability and limitations in reproducing stressed synthetic speech, and (ii) our ability to perturb coded neutral-speech parameters at the synthesis stage so that the resulting speech is perceived as being under stress. Two stress perturbation algorithms are proposed and evaluated. Results from formal listener evaluations show that 87% of the perturbed neutral speech was indeed perceived as stressed.
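
A minimal sketch of the second idea, perturbing decoded neutral parameters at the synthesis stage (scale factors invented, standing in for the paper's trained perturbation models):

    import numpy as np

    def stress_perturb(pitch, gain, f0_scale=1.25, f0_range=1.5, gain_scale=1.4):
        # pitch/gain: per-frame contours from a CELP decoder.
        p = np.asarray(pitch, float)
        p = p.mean() + (p - p.mean()) * f0_range   # widen pitch excursions
        p = p * f0_scale                           # raise the mean pitch
        g = np.asarray(gain, float) * gain_scale   # increase intensity
        return p, g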

300dpi TIFF Images of pages:

664 665 666 667

Acrobat PDF file of whole paper:

ic950664.pdf




MBE Synthesis of Speech Coded in LPC Format

Authors:

K.F. Lam, City Polytechnic of Hong Kong (HONG KONG)
C.F. Chan, City Polytechnic of Hong Kong (HONG KONG)

Volume 1, Page 668

Abstract:

A method to produce high-quality speech from signals coded in linear predictive coding (LPC) format is proposed. In this method, speech signals coded in LPC format are synthesized via a multiband excitation (MBE) model. Since the MBE model demands more parameters than the LPC model can provide, techniques have been developed to regenerate the necessary information for MBE synthesis from LPC-coded speech. A V/UV regeneration scheme for MBE synthesis is proposed that extracts the V/UV decisions from spectral envelopes using a long-term statistical training approach. Informal listening shows that the method provides a significant improvement in speech quality over conventional LPC synthesis, and simulation shows that the synthetic speech produced by the proposed method is comparable to speech synthesized using MBE.
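
A minimal sketch of a band-wise V/UV decision made from an LPC spectral envelope (thresholds are placeholders, standing in for the paper's statistically trained values):

    import numpy as np
    from scipy.signal import freqz

    def band_vuv(lpc, gain, n_bands=8, thresholds=None, nfft=512):
        # Evaluate the LPC envelope, split it into bands, and declare a
        # band voiced if its mean level exceeds that band's threshold.
        _, h = freqz([gain], lpc, worN=nfft)
        env = 20 * np.log10(np.abs(h) + 1e-12)
        bands = np.array_split(env, n_bands)
        if thresholds is None:
            thresholds = np.full(n_bands, env.mean())   # placeholder
        return [b.mean() > th for b, th in zip(bands, thresholds)]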

300dpi TIFF Images of pages:

668 669 670 671

Acrobat PDF file of whole paper:

ic950668.pdf




Modeling Speech Production Using Yee's Finite Difference Method

Authors:

Kathleen E. Cummings, Georgia Institute of Technology
James G. Maloney, Georgia Tech Research Institute
Mark A. Clements, Georgia Institute of Technology (USA)

Volume 1, Page 672

Abstract:

This paper describes a model of speech production based on solving for acoustic wave propagation in the vocal tract using a finite-difference time-domain (FDTD) technique. This FDTD technique was first developed by Yee and utilizes a discretization scheme in which pressure and velocity components are interleaved in both space and time. The specific implementation of this model of speech production, including discretization of the coupled acoustic wave equations, boundary conditions, stability criteria, values of model constants, and the method of excitation, is presented in this paper. The accuracy of the model is verified by comparing the FDTD results to the theoretically expected results for a well-known acoustics problem. The FDTD model of speech production has been used in a variety of experiments, and several results, including comparisons of several common glottal models used as excitation, are presented here.
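
A minimal one-dimensional sketch of the Yee scheme on a uniform tube (toy constants and source; the paper's vocal-tract geometry, boundary treatment, and glottal models are not reproduced):

    import numpy as np

    c, rho = 343.0, 1.2      # sound speed (m/s), air density (kg/m^3)
    dx = 1e-3                # spatial step (m)
    dt = 0.5 * dx / c        # satisfies the CFL stability condition
    n = 200
    p = np.zeros(n)          # pressure at integer grid points
    u = np.zeros(n + 1)      # velocity at half-integer points, offset dt/2

    for step in range(1000):
        # Leapfrog updates: momentum equation, then continuity equation.
        u[1:-1] -= (dt / (rho * dx)) * (p[1:] - p[:-1])
        p -= (rho * c ** 2 * dt / dx) * (u[1:] - u[:-1])
        p[0] = np.sin(2 * np.pi * 100.0 * step * dt)   # toy glottal drive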

300dpi TIFF Images of pages:

672 673 674 675

Acrobat PDF file of whole paper:

ic950672.pdf
