Abstract: Session SP-3
SP-3.1
Template-driven generation of prosodic information for Chinese concatenative synthesis
Chung-Hsien Wu,
Jau-Hung Chen (Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.)
In this paper, a template-driven generation of prosodic information is proposed for Chinese text-to-speech conversion. A set of monosyllable-based synthesis units is selected from a large continuous speech database. The speech database is employed to establish a word-prosody-based template tree according to four linguistic features: tone combination, word length, part-of-speech (POS) of the word, and word position in the sentence. This template tree stores the prosodic features of a word, including pitch contour, average energy, and syllable duration, for the possible combinations of linguistic features. Two modules, for sentence intonation and template selection, are proposed to generate the target prosodic templates. Experimental results for the TTS conversion system showed that the synthesized prosodic features closely resembled their original counterparts for most syllables in the inside test. Subjective evaluation also confirmed the satisfactory performance of these approaches.
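As a rough illustration of the template lookup described above (the tree layout, key order, and back-off strategy below are assumptions for the sketch, not details from the paper):

```python
from dataclasses import dataclass

@dataclass
class ProsodyTemplate:
    pitch_contour: list[float]  # normalized F0 samples for the word
    avg_energy: float           # average energy of the word
    durations: list[float]      # per-syllable durations (ms)

# Keys are the four linguistic features named in the abstract.
Key = tuple[str, int, str, str]  # (tone_combination, word_length, POS, position)
template_tree: dict[Key, ProsodyTemplate] = {}

def select_template(tone_comb: str, word_len: int,
                    pos: str, position: str) -> ProsodyTemplate | None:
    """Return the stored word-prosody template, backing off to coarser
    keys (a hypothetical strategy) when the exact combination is unseen."""
    for key in [(tone_comb, word_len, pos, position),
                (tone_comb, word_len, pos, 'any'),
                (tone_comb, word_len, 'any', 'any')]:
        if key in template_tree:
            return template_tree[key]
    return None
```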
SP-3.2
Speech Enhancement Using Nonlinear Microphone Array with Complementary Beamforming
Hiroshi Saruwatari (Graduate School of Engineering, Nagoya University),
Shoji Kajita (Center for Information Media Studies, Nagoya University),
Kazuya Takeda (Graduate School of Engineering, Nagoya University),
Fumitada Itakura (Center for Information Media Studies, Nagoya University)
This paper describes an improved spectral subtraction method that uses a complementary beamforming microphone array to enhance noisy speech signals for speech recognition. The complementary beamforming is based on two beamformers designed to have mutually complementary directivity patterns. It is shown that nonlinear subtraction processing with complementary beamforming yields a form of spectral subtraction that requires no speech-pause detection. The design of the optimization algorithm for the directivity pattern is also described. To evaluate its effectiveness, speech enhancement and speech recognition experiments are performed in computer simulations. Compared with an optimized conventional delay-and-sum array, the proposed array improves the signal-to-noise ratio of degraded speech by about 2 dB and achieves about 10% higher word recognition rates under heavy noise conditions.
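The abstract does not give the exact subtraction rule, but the core idea, subtracting the power of a complementary beam from the power of the primary beam so that no speech-pause detection is needed, can be sketched as follows; the weight design, the `beta` factor, and the STFT layout are all assumptions:

```python
import numpy as np

def complementary_ss(frames: np.ndarray, w_primary: np.ndarray,
                     w_compl: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Schematic nonlinear subtraction with two complementary beamformers.

    frames: (n_frames, n_mics, n_bins) multichannel STFT.
    w_primary, w_compl: (n_mics, n_bins) beamformer weights designed to
    have mutually complementary directivity patterns (design not shown).
    """
    primary = np.einsum('fmk,mk->fk', frames, w_primary.conj())
    compl_beam = np.einsum('fmk,mk->fk', frames, w_compl.conj())
    # The complementary beam's power acts as a running noise estimate,
    # which is what removes the need for speech-pause detection.
    power = np.maximum(np.abs(primary)**2 - beta * np.abs(compl_beam)**2, 0.0)
    # Resynthesize with the primary beam's phase, as in spectral subtraction.
    return np.sqrt(power) * np.exp(1j * np.angle(primary))
```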
SP-3.3
A Multivariate Speech Activity Detector Based on the Syllable Rate
David C Smith,
Jeffrey Townsend,
Douglas J Nelson,
Dan Richman (U.S. Department of Defense)
Computationally efficient speech extraction algorithms have significant potential economic benefit, since they automate an extremely tedious manual process. Previously developed algorithms discriminate between speech and one specific other signal type, and often fail when that non-speech signal is replaced by a different signal type. Moreover, several such signal-specific discriminators have been combined, with predictably poor results. When the number of discriminating features is large, compression methods such as Principal Components Analysis have been applied to reduce dimension, even though information may be lost in the process. In this paper, graphical tools are applied to determine a set of features that produces excellent speech vs. non-speech clustering. This cluster structure provides the basis for a general speech vs. non-speech discriminator, which significantly outperforms the TALKATIVE speech extraction algorithm.
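The abstract does not list the features, but the title points to the syllable rate. A minimal sketch of one such feature, the dominant energy-envelope modulation frequency (speech typically peaks near 3-5 Hz), is given below; the frame length and modulation band are assumptions:

```python
import numpy as np

def syllable_rate_feature(x: np.ndarray, fs: int) -> float:
    """Return the dominant modulation frequency (Hz) of the energy envelope.
    Illustrative only; the paper's actual multivariate feature set is not
    described in the abstract."""
    frame = int(0.02 * fs)                   # 20 ms analysis frames
    n = len(x) // frame
    env = np.abs(x[:n * frame]).reshape(n, frame).mean(axis=1)
    env -= env.mean()                        # remove DC before the FFT
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(n, d=frame / fs)
    band = (freqs >= 1.0) & (freqs <= 16.0)  # plausible modulation band
    return float(freqs[band][np.argmax(spec[band])])
```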
SP-3.4
Discriminating Speakers With Vocal Nodules Using Aerodynamic And Acoustic Features
Jeff Kuo (Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974),
Eva B. Holmberg,
Robert E. Hillman (Massachusetts Eye and Ear Infirmary, Boston, MA 02114)
This paper demonstrates that linear discriminant analysis using aerodynamic and acoustic features is effective in discriminating speakers with vocal-fold nodules from normal speakers. Simultaneous aerodynamic and acoustic measurements of vocal function were taken from 14 women with bilateral vocal-fold nodules and 12 women with normal voice production. Features were extracted from the glottal airflow waveform and from peaks in the acoustic spectrum for the vowel /æ/. Results show that subglottal pressure, airflow, and open quotient are increased in the nodules group. Estimated first-formant bandwidths are also increased, but produce minimal change in the first-formant amplitudes, and there is no appreciable decrease in high-frequency energy. Speakers with nodules may be compensating for them by increasing subglottal pressure, resulting in relatively good acoustics but increased airflow. The two best features for discrimination are open quotient and subglottal pressure.
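A rough sketch of the discrimination step, assuming the per-speaker features have already been extracted; the arrays below are placeholders, and scikit-learn's LDA stands in for whatever implementation the authors used:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# X: one row per speaker, e.g. [open_quotient, subglottal_pressure];
# random placeholder values here, since the study's data are not public.
X = np.random.rand(26, 2)
y = np.array([1] * 14 + [0] * 12)   # 1 = nodules group, 0 = normal voice

lda = LinearDiscriminantAnalysis()
scores = cross_val_score(lda, X, y, cv=5)   # simple cross-validated accuracy
print(f"mean CV accuracy: {scores.mean():.2f}")
```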
SP-3.5
Enhancement of Esophageal Speech Using Formant Synthesis
Kenji Matsui,
Noriyo Hara (Central Research Labs. Matsushita Electric Ind. Co., Ltd.)
The feasibility of using a formant analysis-synthesis approach to replace the voicing sources of esophageal speech was explored. The voicing sources were generated from inverse-filtered signals extracted from normal speakers. Several pitch extraction methods were tested, and a simple auto-correlation method was chosen. A special hardware unit was designed to perform the analysis-synthesis process in real time. Results of a subjective test showed that the synthesized speech was significantly improved.
SP-3.6
Development of Rules for Controlling the HLsyn Speech Synthesizer
Helen M Hanson,
Richard S McGowan,
Kenneth N Stevens (Sensimetrics Corp.),
Robert E Beaudoin (Sensimetrics Corporation)
In this paper we describe the development of rules to drive a quasi-articulatory speech synthesizer, HLsyn. HLsyn has 13 parameters, which are mapped to the parameters of a formant synthesizer. Its small number of parameters, combined with the computational simplicity of a formant synthesizer, makes it a good basis for a text-to-speech system. An overview of the rule-driven system, called VHLsyn, is presented. The system takes a phonetic string as input and produces HLsyn parameter tracks as output, which HLsyn then uses to produce synthesized speech. Recent work to improve the synthesis of consonants and suprasegmental effects is described and shown to improve the quality of the output speech. The improvements include refinement of the release characteristics of stop consonants, methods for controlling vocal-fold parameters for voiced and voiceless obstruent consonants, and rules for timing and intonation.
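A highly simplified sketch of the rule-driven idea, mapping a phone string to frame-level parameter tracks; the phone targets, frame rate, and two-parameter subset below are invented for illustration and stand in for VHLsyn's much richer rule set:

```python
import numpy as np

FRAME_MS = 5  # assumed frame rate: one parameter value per 5 ms

# Hypothetical per-phone targets: (duration ms, f0 Hz, glottal area).
# HLsyn's real 13-parameter set and the VHLsyn rules are far richer.
TARGETS = {'AA': (120, 120.0, 4.0),   # vowel: modal voicing
           'S':  (80, 0.0, 20.0)}     # voiceless fricative: open glottis

def phone_string_to_tracks(phones: list[str]) -> dict[str, np.ndarray]:
    """Hold each phone's target for its duration; real rules add boundary
    transitions, stop releases, and intonation on top of this."""
    f0, ag = [], []
    for p in phones:
        dur_ms, f0_t, ag_t = TARGETS[p]
        n = dur_ms // FRAME_MS
        f0 += [f0_t] * n
        ag += [ag_t] * n
    return {'f0': np.array(f0), 'ag': np.array(ag)}
```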
SP-3.7
On the characteristics and effects of loudness during utterance production in continuous speech recognition
Daniel Tapias,
Carlos Garcia (Telefonica Investigacion y Desarrollo, S.A. Unipersonal),
Christophe Cazassus (ENST (Bretagne))
We have verified that, in speech-recognition-based telephone applications, the loudness with which the speech signal is produced degrades word accuracy when it is lower or higher than normal. For this reason, we have carried out research with three goals: (a) to gain a better understanding of the Speech Production Loudness (SPL) phenomenon, (b) to find out which parameters of the speech recognizer are most affected by loudness variations, and (c) to quantify the effects of SPL and whispery speech on Large Vocabulary Continuous Speech Recognition (LVCSR). In this paper we report the results of this study for three loudness levels (low, normal, and high) and for whispery speech. We also report the word-accuracy degradation of a continuous speech recognition system when the speech production loudness differs from normal, as well as the degradation for whispery speech. The study was done using the TRESVOL Spanish database, which was designed to study, evaluate, and compensate for the effects of loudness and whispery speech in LVCSR systems.
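The abstract does not define the loudness measure used to label utterances as low, normal, or high; one plausible stand-in is a frame-RMS level averaged over active frames:

```python
import numpy as np

def utterance_level_db(x: np.ndarray, fs: int, frame_ms: float = 25.0) -> float:
    """Crude per-utterance loudness proxy: mean RMS level (dB re full scale)
    over active frames. Frame length and the 30 dB activity threshold are
    illustrative, not values from the paper."""
    n = int(fs * frame_ms / 1000)
    frames = x[:len(x) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    db = 20 * np.log10(rms)
    active = db > db.max() - 30.0   # drop silent frames before averaging
    return float(db[active].mean())
```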
SP-3.8
A Multi-Channel Speech/Silence Detector based on Time Delay Estimation and Fuzzy Classification
Francesco Beritelli,
Salvatore Casale,
Alfredo Cavallaro (Istituto di Informatica e Telecomunicazioni - University of Catania)
Discontinuous transmission based on speech/pause detection is a valid way to improve the spectral efficiency of new-generation wireless communication systems. In this context, robust Voice Activity Detection (VAD) algorithms are required, as traditional solutions exhibit a high misclassification rate in the background noise typical of mobile environments. The Fuzzy Voice Activity Detector (FVAD) recently proposed in [1] shows that methodologies such as fuzzy logic are a valid alternative for the activity decision. In this paper we propose a multichannel approach to activity detection that uses both fuzzy logic and time delay estimation. Objective and subjective tests confirm a significant improvement over traditional methods, above all in terms of a reduction in the activity increase caused by non-stationary noise.
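The abstract does not name the time delay estimator; GCC-PHAT is a common choice and serves here as an illustrative sketch of the delay cue that could feed the fuzzy classifier:

```python
import numpy as np

def gcc_phat_delay(x1: np.ndarray, x2: np.ndarray, fs: int) -> float:
    """Estimate the inter-channel time delay (seconds) with GCC-PHAT,
    one standard estimator; the authors' actual method may differ."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # phase-transform weighting
    cc = np.fft.irfft(cross, n)
    max_lag = n // 2
    # Reorder so index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return (int(np.argmax(np.abs(cc))) - max_lag) / fs
```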
SP-3.9
Noise Suppression Using A Time-Varying, Analysis/Synthesis Gammachirp Filterbank
Toshio Irino (ATR Human Information Processing Research Laboratory)
Spectral subtraction is the most frequently cited noise suppression method for speech signals in steady background noise, because it is essentially non-parametric and simple to implement with the FFT in a wide range of applications. It is also well known, however, that spectral subtraction produces so-called "musical noise" in the synthesized sounds. Since such musical noise, even at low levels, can be distracting to human listeners, spectral subtraction has not been very successful in signal processing applications aimed at them. To suppress noise without producing musical noise, an alternative method has been developed using a time-varying, analysis/synthesis gammachirp filterbank, which was originally proposed as an auditory filterbank. The present method achieves about the same SNR improvement as spectral subtraction when given the same information about the non-speech interval. Moreover, when the original noise is white, the synthesized sounds contain only steady white-like noise at reduced levels. This method is therefore advantageous in various applications for human listeners.
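For reference, the baseline the gammachirp method is compared against can be sketched as conventional power spectral subtraction; the flooring of isolated spectral bins in the `np.maximum(..., 0.0)` step is what gives rise to musical noise. Window and frame sizes below are illustrative:

```python
import numpy as np

def spectral_subtraction(x: np.ndarray, noise: np.ndarray,
                         n_fft: int = 512, hop: int = 256) -> np.ndarray:
    """Baseline power spectral subtraction with overlap-add resynthesis.
    `noise` is a known non-speech interval, matching the abstract's
    assumption that both methods get the same non-speech information."""
    win = np.hanning(n_fft)
    n_frames = (len(noise) - n_fft) // hop + 1
    noise_psd = np.mean([np.abs(np.fft.rfft(noise[i*hop:i*hop+n_fft] * win))**2
                         for i in range(n_frames)], axis=0)
    out = np.zeros(len(x))
    for i in range((len(x) - n_fft) // hop + 1):
        seg = x[i*hop:i*hop+n_fft] * win
        spec = np.fft.rfft(seg)
        # Half-wave rectified power subtraction: the source of musical noise.
        power = np.maximum(np.abs(spec)**2 - noise_psd, 0.0)
        clean = np.sqrt(power) * np.exp(1j * np.angle(spec))
        out[i*hop:i*hop+n_fft] += np.fft.irfft(clean, n_fft) * win
    return out
```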
SP-3.10
Experimental Comparison of Signal Subspace Based Noise Reduction Methods
Peter S. K. Hansen,
Per C. Hansen,
Steffen D. Hansen,
John Aa. Sørensen (Department of Mathematical Modelling, Technical University of Denmark, Building 321, DK-2800 Lyngby, Denmark)
In this paper the signal subspace approach to nonparametric speech enhancement is considered. Several algorithms have been proposed in the literature but only partly analyzed. Here, the different algorithms are compared, with emphasis placed on the limiting factors and practical behavior of the estimators. Experimental results show that the signal subspace approach can lead to a significant enhancement of the signal-to-noise ratio of the output signal.