Abstract: Session SP-3
SP-3.1
Template-driven generation of prosodic information for Chinese concatenative synthesis
Chung-Hsien Wu,
Jau-Hung Chen (Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.)
In this paper, a template-driven generation of prosodic information is proposed for Chinese text-to-speech conversion. A set of monosyllable-based synthesis units is selected from a large continuous speech database. The speech database is employed to establish a word-prosody-based template tree according to four linguistic features: tone combination, word length, part-of-speech (POS) of the word, and word position in the sentence. This template tree stores the prosodic features of a word, including pitch contour, average energy, and syllable duration, for the possible combinations of linguistic features. Two modules, for sentence intonation and template selection, are proposed to generate the target prosodic templates. Experimental results for the TTS conversion system showed that the synthesized prosodic features closely resembled their original counterparts for most syllables in the inside test. Subjective evaluation also confirmed the satisfactory performance of these approaches.
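As a rough illustration of the template lookup described above (the tree layout, key order, and back-off strategy below are assumptions for the sketch, not details from the paper):

```python
from dataclasses import dataclass

@dataclass
class ProsodyTemplate:
    pitch_contour: list[float]  # normalized F0 samples for the word
    avg_energy: float           # average energy of the word
    durations: list[float]      # per-syllable durations (ms)

# Keys are the four linguistic features named in the abstract.
Key = tuple[str, int, str, str]  # (tone_combination, word_length, POS, position)
template_tree: dict[Key, ProsodyTemplate] = {}

def select_template(tone_comb: str, word_len: int,
                    pos: str, position: str) -> ProsodyTemplate | None:
    """Return the stored word-prosody template, backing off to coarser
    keys (a hypothetical strategy) when the exact combination is unseen."""
    for key in [(tone_comb, word_len, pos, position),
                (tone_comb, word_len, pos, 'any'),
                (tone_comb, word_len, 'any', 'any')]:
        if key in template_tree:
            return template_tree[key]
    return None
```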
SP-3.2
Speech Enhancement Using Nonlinear Microphone Array with Complementary Beamforming
Hiroshi Saruwatari (Graduate School of Engineering, Nagoya University),
Shoji Kajita (Center for Information Media Studies, Nagoya University),
Kazuya Takeda (Graduate School of Engineering, Nagoya University),
Fumitada Itakura (Center for Information Media Studies, Nagoya University)
This paper describes an improved spectral subtraction method that uses a complementary beamforming microphone array to enhance noisy speech signals for speech recognition. The complementary beamforming is based on two beamformers designed to have mutually complementary directivity patterns. It is shown that nonlinear subtraction processing with complementary beamforming yields a form of spectral subtraction that requires no speech-pause detection. The design of the optimization algorithm for the directivity pattern is also described. To evaluate its effectiveness, speech enhancement and speech recognition experiments are performed in computer simulations. Compared with an optimized conventional delay-and-sum array, the proposed array improves the signal-to-noise ratio of degraded speech by about 2 dB and achieves about 10% higher word recognition rates under heavy noise conditions.
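The abstract does not give the exact subtraction rule, but the core idea, subtracting the power of a complementary beam from the power of the primary beam so that no speech-pause detection is needed, can be sketched as follows; the weight design, the `beta` factor, and the STFT layout are all assumptions:

```python
import numpy as np

def complementary_ss(frames: np.ndarray, w_primary: np.ndarray,
                     w_compl: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Schematic nonlinear subtraction with two complementary beamformers.

    frames: (n_frames, n_mics, n_bins) multichannel STFT.
    w_primary, w_compl: (n_mics, n_bins) beamformer weights designed to
    have mutually complementary directivity patterns (design not shown).
    """
    primary = np.einsum('fmk,mk->fk', frames, w_primary.conj())
    compl_beam = np.einsum('fmk,mk->fk', frames, w_compl.conj())
    # The complementary beam's power acts as a running noise estimate,
    # which is what removes the need for speech-pause detection.
    power = np.maximum(np.abs(primary)**2 - beta * np.abs(compl_beam)**2, 0.0)
    # Resynthesize with the primary beam's phase, as in spectral subtraction.
    return np.sqrt(power) * np.exp(1j * np.angle(primary))
```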
SP-3.3
A Multivariate Speech Activity Detector Based on the Syllable Rate
David C Smith,
Jeffrey Townsend,
Douglas J Nelson,
Dan Richman (U.S. Department of Defense)
Computationally efficient speech extraction algorithms have significant potential economic benefit, since they automate an extremely tedious manual process. Previously developed algorithms discriminate between speech and one specific other signal type, and often fail when that non-speech signal is replaced by a different signal type. Moreover, several such signal-specific discriminators have been combined, with predictably poor results. When the number of discriminating features is large, compression methods such as Principal Components Analysis have been applied to reduce dimension, even though information may be lost in the process. In this paper, graphical tools are applied to determine a set of features that produces excellent speech vs. non-speech clustering. This cluster structure provides the basis for a general speech vs. non-speech discriminator, which significantly outperforms the TALKATIVE speech extraction algorithm.
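The abstract does not list the features, but the title points to the syllable rate. A minimal sketch of one such feature, the dominant energy-envelope modulation frequency (speech typically peaks near 3-5 Hz), is given below; the frame length and modulation band are assumptions:

```python
import numpy as np

def syllable_rate_feature(x: np.ndarray, fs: int) -> float:
    """Return the dominant modulation frequency (Hz) of the energy envelope.
    Illustrative only; the paper's actual multivariate feature set is not
    described in the abstract."""
    frame = int(0.02 * fs)                   # 20 ms analysis frames
    n = len(x) // frame
    env = np.abs(x[:n * frame]).reshape(n, frame).mean(axis=1)
    env -= env.mean()                        # remove DC before the FFT
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(n, d=frame / fs)
    band = (freqs >= 1.0) & (freqs <= 16.0)  # plausible modulation band
    return float(freqs[band][np.argmax(spec[band])])
```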
SP-3.4
Discriminating Speakers With Vocal Nodules Using Aerodynamic And Acoustic Features
Jeff Kuo (Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974),
Eva B. Holmberg,
Robert E. Hillman (Massachusetts Eye and Ear Infirmary, Boston, MA 02114)
This paper demonstrates that linear discriminant analysis using aerodynamic and acoustic features is effective in discriminating speakers with vocal-fold nodules from normal speakers. Simultaneous aerodynamic and acoustic measurements of vocal function were taken from 14 women with bilateral vocal-fold nodules and 12 women with normal voice production. Features were extracted from the glottal airflow waveform and from peaks in the acoustic spectrum for the vowel /æ/. Results show that subglottal pressure, airflow, and open quotient are increased in the nodules group. Estimated first-formant bandwidths are also increased, but produce minimal change in the first-formant amplitudes, and there is no appreciable decrease in high-frequency energy. Speakers with nodules may be compensating for them by increasing subglottal pressure, resulting in relatively good acoustics but increased airflow. The two best features for discrimination are open quotient and subglottal pressure.
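A rough sketch of the discrimination step, assuming the per-speaker features have already been extracted; the arrays below are placeholders, and scikit-learn's LDA stands in for whatever implementation the authors used:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# X: one row per speaker, e.g. [open_quotient, subglottal_pressure];
# random placeholder values here, since the study's data are not public.
X = np.random.rand(26, 2)
y = np.array([1] * 14 + [0] * 12)   # 1 = nodules group, 0 = normal voice

lda = LinearDiscriminantAnalysis()
scores = cross_val_score(lda, X, y, cv=5)   # simple cross-validated accuracy
print(f"mean CV accuracy: {scores.mean():.2f}")
```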
SP-3.5
Enhancement of Esophageal Speech Using Formant Synthesis
Kenji Matsui,
Noriyo Hara (Central Research Labs. Matsushita Electric Ind. Co., Ltd.)
The feasibility of using a formant analysis-synthesis approach to replace the voicing sources of esophageal speech was explored. The voicing sources were generated from inverse-filtered signals extracted from normal speakers. Several pitch extraction methods were tested, and a simple auto-correlation method was chosen. A special hardware unit was designed to perform the analysis-synthesis process in real time. Results of a subjective test showed that the synthesized speech was significantly improved.
SP-3.6
Development of Rules for Controlling the HLsyn Speech Synthesizer
Helen M Hanson,
Richard S McGowan,
Kenneth N Stevens (Sensimetrics Corp.),
Robert E Beaudoin (Sensimetrics Corporation)
In this paper we describe the development of rules to drive a quasi-articulatory speech synthesizer, HLsyn. HLsyn has 13 parameters, which are mapped to the parameters of a formant synthesizer. Its small number of parameters, combined with the computational simplicity of a formant synthesizer, makes it a good basis for a text-to-speech system. An overview of the rule-driven system, called VHLsyn, is presented. The system takes a phonetic string as input and produces HLsyn parameter tracks as output, which HLsyn then uses to produce synthesized speech. Recent work to improve the synthesis of consonants and suprasegmental effects is described and shown to improve the quality of the output speech. The improvements include refinement of the release characteristics of stop consonants, methods for controlling vocal-fold parameters for voiced and voiceless obstruent consonants, and rules for timing and intonation.
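A highly simplified sketch of the rule-driven idea, mapping a phone string to frame-level parameter tracks; the phone targets, frame rate, and two-parameter subset below are invented for illustration and stand in for VHLsyn's much richer rule set:

```python
import numpy as np

FRAME_MS = 5  # assumed frame rate: one parameter value per 5 ms

# Hypothetical per-phone targets: (duration ms, f0 Hz, glottal area).
# HLsyn's real 13-parameter set and the VHLsyn rules are far richer.
TARGETS = {'AA': (120, 120.0, 4.0),   # vowel: modal voicing
           'S':  (80, 0.0, 20.0)}     # voiceless fricative: open glottis

def phone_string_to_tracks(phones: list[str]) -> dict[str, np.ndarray]:
    """Hold each phone's target for its duration; real rules add boundary
    transitions, stop releases, and intonation on top of this."""
    f0, ag = [], []
    for p in phones:
        dur_ms, f0_t, ag_t = TARGETS[p]
        n = dur_ms // FRAME_MS
        f0 += [f0_t] * n
        ag += [ag_t] * n
    return {'f0': np.array(f0), 'ag': np.array(ag)}
```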
SP-3.7
On the characteristics and effects of loudness during utterance production in continuous speech recognition
Daniel Tapias,
Carlos Garcia (Telefonica Investigacion y Desarrollo, S.A. Unipersonal),
Christophe Cazassus (ENST (Bretagne))
We have verified that, in speech-recognition-based telephone applications, the loudness with which the speech signal is produced degrades word accuracy when it is lower or higher than normal. For this reason, we have carried out research with three goals: (a) to gain a better understanding of the Speech Production Loudness (SPL) phenomenon, (b) to find out which parameters of the speech recognizer are most affected by loudness variations, and (c) to quantify the effects of SPL and whispery speech on Large Vocabulary Continuous Speech Recognition (LVCSR). In this paper we report the results of this study for three loudness levels (low, normal, and high) and for whispery speech. We also report the word-accuracy degradation of a continuous speech recognition system when the speech production loudness differs from normal, as well as the degradation for whispery speech. The study was done using the TRESVOL Spanish database, which was designed to study, evaluate, and compensate for the effects of loudness and whispery speech in LVCSR systems.
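The abstract does not define the loudness measure used to label utterances as low, normal, or high; one plausible stand-in is a frame-RMS level averaged over active frames:

```python
import numpy as np

def utterance_level_db(x: np.ndarray, fs: int, frame_ms: float = 25.0) -> float:
    """Crude per-utterance loudness proxy: mean RMS level (dB re full scale)
    over active frames. Frame length and the 30 dB activity threshold are
    illustrative, not values from the paper."""
    n = int(fs * frame_ms / 1000)
    frames = x[:len(x) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    db = 20 * np.log10(rms)
    active = db > db.max() - 30.0   # drop silent frames before averaging
    return float(db[active].mean())
```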
SP-3.8
A Multi-Channel Speech/Silence Detector based on Time Delay Estimation and Fuzzy Classification
Francesco Beritelli,
Salvatore Casale,
Alfredo Cavallaro (Istituto di Informatica e Telecomunicazioni - University of Catania)
Discontinuous transmission based on speech/pause detection is a valid way to improve the spectral efficiency of new-generation wireless communication systems. In this context, robust Voice Activity Detection (VAD) algorithms are required, as traditional solutions exhibit a high misclassification rate in the background noise typical of mobile environments. The Fuzzy Voice Activity Detector (FVAD) recently proposed in [1] shows that methodologies such as fuzzy logic are a valid alternative for the activity decision. In this paper we propose a multichannel approach to activity detection that uses both fuzzy logic and time delay estimation. Objective and subjective tests confirm a significant improvement over traditional methods, above all in terms of a reduction in the activity increase caused by non-stationary noise.
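The abstract does not name the time delay estimator; GCC-PHAT is a common choice and serves here as an illustrative sketch of the delay cue that could feed the fuzzy classifier:

```python
import numpy as np

def gcc_phat_delay(x1: np.ndarray, x2: np.ndarray, fs: int) -> float:
    """Estimate the inter-channel time delay (seconds) with GCC-PHAT,
    one standard estimator; the authors' actual method may differ."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # phase-transform weighting
    cc = np.fft.irfft(cross, n)
    max_lag = n // 2
    # Reorder so index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return (int(np.argmax(np.abs(cc))) - max_lag) / fs
```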
SP-3.9
Noise Suppression Using A Time-Varying, Analysis/Synthesis Gammachirp Filterbank
Toshio Irino (ATR Human Information Processing Research Laboratory)
Spectral subtraction is the most frequently cited noise suppression method for speech signals in steady background noise, because it is essentially non-parametric and simple to implement with the FFT in a wide range of applications. It is also well known, however, that spectral subtraction produces so-called "musical noise" in the synthesized sounds. Since such musical noise, even at low levels, can be distracting to human listeners, spectral subtraction has not been very successful in signal processing applications aimed at them. To suppress noise without producing musical noise, an alternative method has been developed using a time-varying, analysis/synthesis gammachirp filterbank, which was originally proposed as an auditory filterbank. The present method achieves about the same SNR improvement as spectral subtraction when given the same information about the non-speech interval. Moreover, when the original noise is white, the synthesized sounds contain only steady white-like noise at reduced levels. This method is therefore advantageous in various applications for human listeners.
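For reference, the baseline the gammachirp method is compared against can be sketched as conventional power spectral subtraction; the flooring of isolated spectral bins in the `np.maximum(..., 0.0)` step is what gives rise to musical noise. Window and frame sizes below are illustrative:

```python
import numpy as np

def spectral_subtraction(x: np.ndarray, noise: np.ndarray,
                         n_fft: int = 512, hop: int = 256) -> np.ndarray:
    """Baseline power spectral subtraction with overlap-add resynthesis.
    `noise` is a known non-speech interval, matching the abstract's
    assumption that both methods get the same non-speech information."""
    win = np.hanning(n_fft)
    n_frames = (len(noise) - n_fft) // hop + 1
    noise_psd = np.mean([np.abs(np.fft.rfft(noise[i*hop:i*hop+n_fft] * win))**2
                         for i in range(n_frames)], axis=0)
    out = np.zeros(len(x))
    for i in range((len(x) - n_fft) // hop + 1):
        seg = x[i*hop:i*hop+n_fft] * win
        spec = np.fft.rfft(seg)
        # Half-wave rectified power subtraction: the source of musical noise.
        power = np.maximum(np.abs(spec)**2 - noise_psd, 0.0)
        clean = np.sqrt(power) * np.exp(1j * np.angle(spec))
        out[i*hop:i*hop+n_fft] += np.fft.irfft(clean, n_fft) * win
    return out
```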
SP-3.10
Experimental Comparison of Signal Subspace Based Noise Reduction Methods
Peter S. K. Hansen,
Per C. Hansen,
Steffen D. Hansen,
John Aa. Sørensen (Department of Mathematical Modelling, Technical University of Denmark, Building 321, DK-2800 Lyngby, Denmark)
In this paper the signal subspace approach to nonparametric speech enhancement is considered. Several algorithms have been proposed in the literature but only partly analyzed. Here, the different algorithms are compared, with emphasis placed on the limiting factors and practical behavior of the estimators. Experimental results show that the signal subspace approach can lead to a significant enhancement of the signal-to-noise ratio of the output signal.