Authors:
Chung-Hsien Wu, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
Jau-Hung Chen, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
Paper number 1360
Abstract:
In this paper, a template-driven generation of prosodic information
is proposed for Chinese text-to-speech conversion. A set of monosyllable-based
synthesis units is selected from a large continuous speech database.
The speech database is employed to establish a word-prosody-based template
tree according to the linguistic features: tone combination, word length,
part-of-speech (POS) of the word, and word position in a sentence.
This template tree stores the prosodic features including pitch contour,
average energy, and syllable duration of a word for possible combinations
of linguistic features. Two modules for sentence intonation and template
selection are proposed to generate the target prosodic templates. The
experimental results for the TTS conversion system showed that the synthesized
prosodic features closely resembled their original counterparts for most
syllables in the inside (closed) test. Subjective listening experiments
also confirmed the satisfactory performance of these approaches.
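The template-tree lookup described above can be sketched as a keyed table with a simple back-off; the feature values, template contents, and back-off rule below are illustrative assumptions, not the authors' actual data.

```python
# Hypothetical sketch of a word-prosody template tree: templates are
# indexed by the four linguistic features named in the abstract
# (tone combination, word length, POS, word position in the sentence).
# All feature values and template numbers are invented for illustration.

template_tree = {
    # (tone_combination, word_length, pos, position) -> prosodic template
    ("T2-T4", 2, "N", "initial"): {
        "pitch_contour": [220.0, 215.0, 180.0, 160.0],  # Hz, per frame
        "avg_energy": 62.0,                             # dB
        "durations": [0.21, 0.18],                      # s, per syllable
    },
    ("T2-T4", 2, "N", "final"): {
        "pitch_contour": [210.0, 200.0, 170.0, 140.0],
        "avg_energy": 58.0,
        "durations": [0.23, 0.26],
    },
}

def select_template(tone_comb, length, pos, position):
    """Return the stored template, backing off over word position
    when the exact feature combination is missing."""
    key = (tone_comb, length, pos, position)
    if key in template_tree:
        return template_tree[key]
    # Simple back-off: ignore word position.
    for (tc, ln, p, _), tmpl in template_tree.items():
        if (tc, ln, p) == (tone_comb, length, pos):
            return tmpl
    return None

t = select_template("T2-T4", 2, "N", "medial")  # falls back over position
```

A real system would back off over several features in a fixed order; this sketch only shows the idea of feature-keyed template selection.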
Authors:
Hiroshi Saruwatari,
Shoji Kajita,
Kazuya Takeda,
Fumitada Itakura,
Paper number 1669
Abstract:
This paper describes an improved spectral subtraction method by using
the complementary beamforming microphone array to enhance noisy speech
signals for speech recognition. The complementary beamforming is based
on two types of beamformers designed to obtain complementary directivity
patterns with respect to each other. In this paper, it is shown that
nonlinear subtraction processing with complementary beamforming
can realize a form of spectral subtraction without the need for
speech-pause detection. In addition, the design of the optimization
algorithm for the directivity pattern is also described. To evaluate
the effectiveness, speech enhancement experiments and speech recognition
experiments are performed based on computer simulations. In comparison
with the optimized conventional delay-and-sum array, it is shown that
the proposed array improves the signal-to-noise ratio of degraded speech
by about 2 dB and improves word recognition rates by about 10%
under heavily noisy conditions.
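For context, the conventional delay-and-sum baseline the abstract compares against can be sketched as follows; the two-microphone geometry, source angle, and integer-sample delay approximation are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Illustrative delay-and-sum beamformer (the conventional baseline the
# abstract compares against), not the authors' complementary design.
# Geometry and signals are made up for the example.

fs = 16000
c = 343.0                 # speed of sound, m/s
d = 0.05                  # mic spacing, m
theta = np.deg2rad(30)    # assumed source direction

# Two-microphone signals: mic 1 receives mic 0's signal delayed by tau.
tau = d * np.sin(theta) / c
n = np.arange(1024)
s = np.sin(2 * np.pi * 500 * n / fs)
delay = int(round(tau * fs))   # integer-sample approximation
x0 = s
x1 = np.roll(s, delay)

# Steer the array toward theta: advance mic 1 by the same delay, then
# average; signals from the look direction add coherently.
y = 0.5 * (x0 + np.roll(x1, -delay))
```

Signals arriving from other directions add with a residual phase offset and are attenuated rather than cancelled, which is why delay-and-sum alone gives only modest noise reduction.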
Authors:
David C Smith,
Jeffrey Townsend,
Douglas J Nelson,
Dan Richman,
Paper number 1756
Abstract:
Computationally efficient speech extraction algorithms have significant
potential economic benefit, since they automate an extremely tedious manual
process. Previously developed algorithms discriminate between speech and
one specific other signal type, and often fail when that specific
non-speech signal is replaced by a different signal type. Moreover,
combining several such signal-specific discriminators has produced
predictably negative results. When the number of discriminating
features is large, compression methods such as Principal Components
have been applied to reduce dimension, even though information may
be lost in the process. In this paper, graphical tools are applied
to determine a set of features which produce excellent speech vs. non-speech
clustering. This cluster structure provides the basis for a general
speech vs. non-speech discriminator, which significantly outperforms
the TALKATIVE speech extraction algorithm.
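The abstract does not name the selected features, and the TALKATIVE algorithm is not described here; as a hypothetical stand-in, two classic frame features (zero-crossing rate and log energy) already separate a tonal signal from low-level noise in a toy feature space.

```python
import numpy as np

# Hypothetical two-feature illustration of speech vs. non-speech
# clustering. ZCR and log-energy are stand-ins; the paper's actual
# graphically selected feature set is not given in the abstract.

def frame_features(x, frame_len=256):
    feats = []
    for i in range(0, len(x) - frame_len + 1, frame_len):
        f = x[i:i + frame_len]
        zcr = np.mean(np.abs(np.diff(np.sign(f))) > 0)   # zero-crossing rate
        energy = 10 * np.log10(np.mean(f ** 2) + 1e-12)  # log energy, dB
        feats.append((zcr, energy))
    return np.array(feats)

rng = np.random.default_rng(0)
n = np.arange(4096)
voiced_like = np.sin(2 * np.pi * 150 * n / 8000)   # low ZCR, high energy
noise_like = 0.05 * rng.standard_normal(4096)      # high ZCR, low energy

fv = frame_features(voiced_like)
fn = frame_features(noise_like)
# The two signal types occupy different regions of the (ZCR, energy) plane.
```

In a feature space with such cluster structure, even a simple linear boundary discriminates the two classes; the point of the paper is choosing features that produce this structure for speech against many signal types at once.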
Authors:
Jeff Kuo,
Eva B. Holmberg,
Robert E. Hillman,
Paper number 1789
Abstract:
This paper demonstrates that linear discriminant analysis using aerodynamic
and acoustic features is effective in discriminating speakers with
vocal-fold nodules from normal speakers. Simultaneous aerodynamic and
acoustic measurements of vocal function were taken of 14 women with
bilateral vocal-fold nodules and 12 women with normal voice production.
Features were extracted from the glottal airflow waveform and peaks
in the acoustic spectrum for the vowel /æ/. Results show that
the subglottal pressure, air flow, and open quotient are increased
in the nodules group. Estimated first-formant bandwidths are increased,
but result in minimal change in the first-formant amplitudes. There
is no appreciable decrease in high frequency energy. Speakers with
nodules may be compensating for the nodules by increasing the subglottal
pressure, resulting in relatively good acoustics but increased air
flows. The two best features for discrimination are open quotient and
subglottal pressure.
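A minimal two-class Fisher discriminant on the two best features named above (open quotient and subglottal pressure) can be sketched as follows; the group means, feature scales, and sample draws are fabricated for illustration and do not reproduce the study's measurements.

```python
import numpy as np

# Minimal two-class Fisher linear discriminant on synthetic data,
# sketching the kind of analysis the abstract describes. Only the two
# best features (open quotient, subglottal pressure) are used, per the
# abstract's conclusion; all numeric values are invented.

rng = np.random.default_rng(1)
# columns: [open quotient, subglottal pressure] -- fabricated scales,
# elevated in the nodules group as the abstract reports
nodules = rng.normal([0.75, 9.0], [0.05, 1.0], size=(14, 2))
normals = rng.normal([0.60, 6.0], [0.05, 1.0], size=(12, 2))

def fisher_direction(a, b):
    """Fisher LDA projection w = Sw^{-1} (mu_a - mu_b),
    where Sw is the pooled within-class scatter matrix."""
    sw = np.cov(a.T) * (len(a) - 1) + np.cov(b.T) * (len(b) - 1)
    return np.linalg.solve(sw, a.mean(0) - b.mean(0))

w = fisher_direction(nodules, normals)
# Classify by projecting onto w and thresholding at the midpoint of
# the two projected class means.
threshold = 0.5 * (nodules @ w).mean() + 0.5 * (normals @ w).mean()
pred_nodules = (nodules @ w) > threshold
pred_normals = (normals @ w) > threshold
```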
Authors:
Kenji Matsui,
Noriyo Hara,
Paper number 1831
Abstract:
The feasibility of using the formant analysis-synthesis approach to
replace the voicing sources of esophageal speech was explored. The
voicing sources were generated by using inverse-filtered signals extracted
from normal speakers. Various pitch extraction methods were tested,
and a simple autocorrelation method was chosen. A special hardware
unit was designed to perform the analysis-synthesis process in real
time. Results of a subjective test showed that the
synthesized speech was significantly improved.
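A simple autocorrelation pitch estimator of the kind the abstract says was chosen can be sketched as follows; the frame length, search range, and test tone are illustrative, not the paper's settings.

```python
import numpy as np

# Autocorrelation pitch extraction: the F0 estimate is the sampling
# rate divided by the lag at which the frame's autocorrelation peaks,
# searched over a plausible pitch range.

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Return the F0 (Hz) whose lag maximizes the autocorrelation."""
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)  # lag search bounds
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

fs = 8000
n = np.arange(1024)
tone = np.sin(2 * np.pi * 125 * n / fs)   # 125 Hz test tone
f0 = autocorr_pitch(tone, fs)
```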
Authors:
Helen M Hanson,
Richard S McGowan,
Kenneth N Stevens,
Robert E Beaudoin,
Paper number 2179
Abstract:
In this paper we describe the development of rules to drive a quasi-articulatory
speech synthesizer, HLsyn. HLsyn has 13 parameters, which are mapped
to the parameters of a formant synthesizer. Its small number of parameters
combined with the computational simplicity of a formant synthesizer
make it a good basis for a text-to-speech system. An overview of the
rule-driven system, called VHLsyn, is presented. The system assumes
a phonetic string as input, and produces HLsyn parameter tracks as
output. These parameter tracks are then used by HLsyn to produce synthesized
speech. Recent work to improve the synthesis of consonants and suprasegmental
effects is described, and is shown to improve the quality of the output
speech. The improvements include refinement of release characteristics
of stop consonants, methods for control of vocal-fold parameters for
voiced and voiceless obstruent consonants, and rules for timing and
intonation.
Authors:
Daniel Tapias,
Carlos García,
Christophe Cazassus,
Paper number 2302
Abstract:
We have verified that, in telephone applications based on speech
recognition, the loudness with which the speech signal is produced
degrades word accuracy when it is lower or higher than normal. For this
reason, we have carried out research with three
goals: (a) gain a better understanding of the Speech Production Loudness
(SPL) phenomenon, (b) find out the parameters of the speech recognizer
that are the most affected by loudness variations, and (c) compute
the effects of SPL and whispery speech in Large Vocabulary Continuous
Speech Recognition (LVCSR). In this paper we report the results of
this study for three different loudnesses (low, normal and high) and
whispery speech. We also report the word accuracy degradation of a
continuous speech recognition system when the speech production loudness
is different than normal as well as the degradation for whispery speech.
The study was done using the TRESVOL Spanish database, which was
designed to study, evaluate and compensate for the effects of loudness
and whispery speech in LVCSR systems.
Authors:
Francesco Beritelli,
Salvatore Casale,
Alfredo Cavallaro,
Paper number 2363
Abstract:
Discontinuous transmission based on speech/pause detection represents
a valid solution to improve the spectral efficiency of new-generation
wireless communication systems. In this context, robust Voice Activity
Detection (VAD) algorithms are required, as traditional solutions present
a high misclassification rate in the presence of the background noise
typical of mobile environments. The Fuzzy Voice Activity Detector (FVAD)
recently proposed in [1] shows that methodologies such as fuzzy
logic are a valid alternative for the activity decision.
In this paper we propose a multichannel approach to activity
detection using both fuzzy logic and time delay estimation. Objective
and subjective tests confirm a significant improvement over traditional
methods, above all in terms of a reduction in activity increase under
non-stationary noise.
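One ingredient of the proposed multichannel approach, time delay estimation, can be sketched with a cross-correlation peak search; the signals, channel count, and delay below are synthetic assumptions for illustration.

```python
import numpy as np

# Cross-correlation time-delay estimation between two channels, one
# ingredient of the multichannel VAD described above. Signals and
# delay are synthetic.

rng = np.random.default_rng(2)
s = rng.standard_normal(2048)
true_delay = 7                      # samples; channel 2 lags channel 1
x1 = s
x2 = np.roll(s, true_delay)

# The lag of the cross-correlation peak estimates the inter-channel
# delay: a stable peak suggests a coherent (speech) source, while
# diffuse background noise yields no stable peak.
xc = np.correlate(x2, x1, mode="full")
est_delay = int(np.argmax(xc)) - (len(x1) - 1)
```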
Authors:
Toshio Irino,
Paper number 1837
Abstract:
Spectral subtraction has been cited most often as a noise suppression
method for speech signals in steady background noise, because it is
basically a non-parametric method and simple enough to implement for
various applications using an FFT. It is also well known, however,
that spectral subtraction produces so-called "musical noise" in the
synthesized sounds. Since such musical noise, even at low levels, often
disturbs human listeners, spectral subtraction has not been very
successful in signal processing applications aimed at human listeners. To
suppress noise without producing musical noise, an alternative method
has been developed using a time-varying, analysis/synthesis gammachirp
filterbank; this was initially proposed as an auditory filterbank.
The present method achieves about the same SNR improvement as spectral
subtraction when using the same information on the non-speech interval.
Moreover, the synthetic sounds only contain steady white-like noise
at reduced levels when the original noise is white. This method is,
therefore, advantageous in various applications for human listeners.
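For reference, the textbook magnitude spectral subtraction whose half-wave rectification gives rise to musical noise can be sketched as follows; the frame size, overlap-free framing, and single-frame noise estimate are simplifying assumptions, and this is the baseline, not the gammachirp-filterbank method.

```python
import numpy as np

# Textbook magnitude spectral subtraction: subtract an estimated noise
# magnitude spectrum from each frame and floor negative values at zero.
# The flooring (half-wave rectification) is what leaves isolated
# spectral peaks heard as "musical noise".

def spectral_subtract(noisy, noise_est, frame=256):
    noise_mag = np.abs(np.fft.rfft(noise_est[:frame]))  # crude estimate
    out = np.zeros_like(noisy)
    for i in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[i:i + frame])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # rectify
        out[i:i + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
    return out

fs = 8000
n = np.arange(4096)
rng = np.random.default_rng(3)
clean = np.sin(2 * np.pi * 440 * n / fs)
noise = 0.3 * rng.standard_normal(len(n))
enhanced = spectral_subtract(clean + noise, noise)
```

A practical implementation would use overlap-add with windowing and a smoothed noise estimate; the bare version above is enough to exhibit the rectification step the abstract's alternative method avoids.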
Authors:
Peter S. K. Hansen, Department of Mathematical Modelling, Technical University of Denmark, Building 321, DK-2800 Lyngby, Denmark
Per Christian Hansen, Department of Mathematical Modelling, Technical University of Denmark, Building 321, DK-2800 Lyngby, Denmark
Steffen Duus Hansen, Department of Mathematical Modelling, Technical University of Denmark, Building 321, DK-2800 Lyngby, Denmark
John Aasted Sørensen, Department of Mathematical Modelling, Technical University of Denmark, Building 321, DK-2800 Lyngby, Denmark
Paper number 1863
Abstract:
In this paper the signal subspace approach for nonparametric speech
enhancement is considered. Several algorithms have been proposed in
the literature, but only partly analyzed. Here, the different algorithms
are compared, with emphasis on the limiting factors and the
practical behavior of the estimators. Experimental results show that
the signal subspace approach may lead to a significant enhancement
of the signal to noise ratio of the output signal.
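A minimal signal-subspace sketch, assuming a Hankel embedding and a fixed signal rank (the paper compares several more refined estimators; the rank, matrix size, and test signal here are illustrative):

```python
import numpy as np

# Minimal signal-subspace enhancement: embed the noisy signal in a
# Hankel matrix, truncate its SVD to a presumed signal rank, and
# average the anti-diagonals back into a 1-D signal.

def subspace_enhance(x, m=32, rank=2):
    n = len(x) - m + 1
    H = np.array([x[i:i + m] for i in range(n)])      # n-by-m Hankel
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    Hr = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank]  # rank-r approx
    # Average the anti-diagonals to return to a 1-D signal.
    y = np.zeros(len(x))
    counts = np.zeros(len(x))
    for i in range(n):
        y[i:i + m] += Hr[i]
        counts[i:i + m] += 1
    return y / counts

fs = 8000
t = np.arange(1024) / fs
clean = np.sin(2 * np.pi * 300 * t)   # a sinusoid spans a rank-2 subspace
rng = np.random.default_rng(4)
noisy = clean + 0.2 * rng.standard_normal(len(t))
enhanced = subspace_enhance(noisy)
```

Because the noise is spread over all singular directions while the signal concentrates in a few, the truncation discards most of the noise energy, which is the source of the SNR gains the abstract reports.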