Authors:
Boonpramuk Panuthat,
Tetsuo Funada,
Noboru Kanedera,
Page (NA) Paper number 1618
Abstract:
This paper presents a method for speech analysis/synthesis/conversion
by using sequential processing. The aims of this method are to improve
the quality of synthesized speech and to convert the original speech
into speech with different characteristics. We apply the Kalman
filter to estimate the auto-regressive coefficients of the 'time-varying
AR model with unknown input (ARUI model)', which we have proposed
to improve on the conventional AR model, and we use a band-pass filter
to make 'a guide signal' for extracting the pitch period from the residual
signal. These signals are used to construct the driving source signal
in speech synthesis. We also use the guide signal for speech conversion,
such as in pitch and utterance length. Moreover, we show experimentally
that this method can analyze/synthesize/convert speech without causing
instability by using the smoothed auto-regressive coefficients.
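The coefficient tracking described above can be sketched as a standard Kalman filter whose state is the AR coefficient vector under a random-walk model. This is a minimal illustration, not the paper's ARUI formulation: the unknown-input term is omitted, and the noise variances `q` and `r` are assumed values.

```python
import numpy as np

def kalman_tvar(y, p=2, q=1e-4, r=1e-2):
    """Track time-varying AR(p) coefficients of y with a Kalman filter.

    State: the AR coefficient vector, assumed to follow a random walk.
    Observation: y[t] = phi_t . a_t + e_t, with phi_t the p previous samples.
    q, r are assumed state/observation noise variances (not from the paper).
    """
    n = len(y)
    a = np.zeros(p)                   # current coefficient estimate
    P = np.eye(p)                     # state covariance
    coeffs = np.zeros((n, p))
    for t in range(p, n):
        phi = y[t - 1::-1][:p]        # p most recent past samples
        P = P + q * np.eye(p)         # predict step (random walk)
        s = phi @ P @ phi + r         # innovation variance (scalar)
        k = P @ phi / s               # Kalman gain
        a = a + k * (y[t] - phi @ a)  # update with the new sample
        P = P - np.outer(k, phi @ P)  # covariance update
        coeffs[t] = a
    return coeffs
```

On a stationary AR signal the estimate settles near the true coefficients; the random-walk state model is what lets it also follow slow coefficient drift.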
Authors:
D. Mike Brookes,
Han Pin Loke,
Page (NA) Paper number 1864
Abstract:
The pitch-synchronous analysis that is used in several areas of speech
processing often requires robust detection of the instants of glottal
closure and opening. In this paper we derive expressions for the flow
of acoustic energy in the lossless-tube model of the vocal tract and
show how linear predictive analysis may be used to estimate the waveform
of acoustic input power at the glottis. We demonstrate that this signal
may be used to identify the instants of glottal closure and opening
during voiced speech and contrast it with the LPC residual signal that
previous authors have used for this purpose.
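The baseline that the paper contrasts its acoustic-power estimate with, the LPC residual, can be sketched as follows: fit a linear predictor by the autocorrelation method, inverse-filter, and look for residual peaks, which tend to align with glottal closure instants. The predictor order and the synthetic test signal are illustrative assumptions.

```python
import numpy as np

def lpc_residual(x, order=12):
    """Inverse-filter x with LPC coefficients from the autocorrelation
    method; peaks in the returned residual tend to mark glottal closures."""
    # sample autocorrelation at lags 0..order
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    # Yule-Walker normal equations for the predictor coefficients
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # residual e[n] = x[n] - sum_k a[k] x[n-k]
    e = x.astype(float).copy()
    for k in range(1, order + 1):
        e[order:] -= a[k - 1] * x[order - k:len(x) - k]
    return e
```

Driving a resonant all-pole filter with an impulse train and inverse-filtering recovers the impulse positions as the dominant residual peaks.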
Authors:
Srinivasan Umesh, Indian Institute of Technology (India)
Leon Cohen,
Douglas J Nelson, Dept. of Defense USA (USA)
Page (NA) Paper number 2167
Abstract:
We show that there are many qualitatively different equations, each
with few parameters, that fit the experimentally obtained Mel scale.
We investigate the often-made remark that there are two regions to
the Mel scale, the first region (below ~1000 Hz) being linear and the
upper region being logarithmic. We show that there is no evidence,
based on the experimental data points, that there are two qualitatively
different regions or that the lower region is linear and the upper region
logarithmic. In fact, F_M = f/(af + b), where F_M and f are the mel and
physical frequency respectively, fits better than a line in the linear
region or a logarithm in the ``log'' region.
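The fit claimed in this abstract is linear in reciprocal space, since F_M = f/(af + b) implies 1/F_M = a + b/f, so a and b follow from ordinary least squares. As a stand-in for the experimental data points (not reproduced here), the sketch below fits samples of the common engineering approximation 2595 log10(1 + f/700); the frequency range is an assumption.

```python
import numpy as np

# F_M = f / (a f + b)  <=>  1/F_M = a + b/f : linear regression on (1, 1/f).
f = np.linspace(100.0, 8000.0, 50)            # assumed frequency range
mel = 2595.0 * np.log10(1.0 + f / 700.0)      # stand-in for measured points

A = np.column_stack([np.ones_like(f), 1.0 / f])
a, b = np.linalg.lstsq(A, 1.0 / mel, rcond=None)[0]
fit = f / (a * f + b)                         # two-parameter mel fit
```

The two-parameter hyperbolic form tracks the saturating shape of the mel curve across both the so-called linear and log regions with a single expression.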
Authors:
Wai Kat Liu,
Pascale Fung,
Page (NA) Paper number 2349
Abstract:
The performance of speech recognition systems degrades when speaker
accent is different from that in the training set. Accent-independent
or accent-dependent recognition both require collection of more training
data. In this paper, we propose a faster accent classification approach
using phoneme-class models. We also present our findings on acoustic
features sensitive to a Cantonese accent, and possibly to other Asian
language accents. In addition, we show how we can rapidly transform
a native accent pronunciation dictionary to that for accented speech
by simply using knowledge of the native language of the foreign speaker.
The use of this accent-adapted dictionary reduces recognition error
rate by 13.5%, similar to the results obtained from a longer, data-driven
process.
Authors:
Howard H Yang,
Sarel J Van Vuuren,
Hynek Hermansky,
Page (NA) Paper number 2454
Abstract:
In this paper we use mutual information to study the distribution in
time and frequency of information relevant for phonetic classification.
A large database of hand-labeled fluent speech is used to (a) compute
the mutual information between phoneme labels and a point of logarithmic
energy in the time-frequency plane and (b) compute the joint mutual
information between phoneme labels and two points of logarithmic energy
in the time-frequency plane.
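Computation (a) above, the mutual information between discrete phoneme labels and one quantized log-energy value, can be sketched from an empirical joint histogram; the bin count and the test data are assumptions, and the pairwise computation (b) extends this to a joint histogram over two energy points.

```python
import numpy as np

def mutual_info(labels, values, bins=16):
    """I(label; quantized value) in bits from the empirical joint histogram.

    labels: discrete symbols (e.g. phoneme indices);
    values: continuous scalars (e.g. log energy at one T-F point)."""
    edges = np.histogram_bin_edges(values, bins=bins)[1:-1]
    q = np.digitize(values, edges)            # quantize to 0..bins-1
    L = np.max(labels) + 1
    joint = np.zeros((L, bins))
    np.add.at(joint, (labels, q), 1.0)        # joint counts
    p = joint / joint.sum()
    px = p.sum(1, keepdims=True)              # label marginal
    py = p.sum(0, keepdims=True)              # energy marginal
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))
```

Independent labels and values give an estimate near zero (up to histogram bias), while a value that fully determines the label gives the label entropy.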
Authors:
Keiichi Tokuda, Nagoya Institute of Technology, Nagoya, Japan (Japan)
Takashi Masuko, Tokyo Institute of Technology, Japan (Japan)
Noboru Miyazaki, NTT Basic Research Laboratories, Japan (Japan)
Takao Kobayashi, Tokyo Institute of Technology, Japan (Japan)
Page (NA) Paper number 2479
Abstract:
This paper discusses a hidden Markov model (HMM) based on multi-space
probability distribution (MSD). HMMs are widely used statistical
models for characterizing sequences of speech spectra and have been
applied successfully to speech recognition systems. This suggests
that the HMM should also be useful for modeling pitch patterns of speech.
However, we cannot apply the conventional discrete or continuous HMMs
to pitch pattern modeling since the observation sequence of pitch pattern
is composed of one-dimensional continuous values and a discrete symbol
which represents ``unvoiced''. The MSD-HMM includes the discrete HMM
and the continuous mixture HMM as special cases, and can further model
sequences of observation vectors with variable dimension, including
zero-dimensional observations, i.e., discrete symbols. As a result,
MSD-HMMs can model pitch patterns without heuristic assumptions. We
derive a reestimation
algorithm for the extended HMM and show that it can find a critical
point of the likelihood function.
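The multi-space output distribution for pitch can be sketched with two spaces, as the abstract describes: a one-dimensional Gaussian space for voiced frames and a zero-dimensional space for the ``unvoiced'' symbol, whose empty observation has density 1. The function name and parameters are illustrative, not the paper's notation.

```python
import numpy as np

def msd_likelihood(obs, w_voiced, mu, var, w_unvoiced):
    """Output probability of one MSD state for a pitch observation.

    obs is either a float F0 value (the 1-D Gaussian space) or None for
    the ``unvoiced'' symbol (the 0-D space, contributing only its weight).
    w_voiced + w_unvoiced should sum to 1."""
    if obs is None:
        return w_unvoiced                     # 0-D space: weight alone
    g = np.exp(-0.5 * (obs - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return w_voiced * g                       # 1-D space: weighted Gaussian
```

Because both branches are probabilities under one distribution, voiced and unvoiced frames can share a single state without any heuristic interpolation of F0 through unvoiced regions.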
Authors:
Ashraf Alkhairy,
Page (NA) Paper number 2492
Abstract:
We present a new method for the estimation of the glottal volume velocity
from voiced segments of the radiated acoustic speech waveform. Our
algorithm is based on spectral factorization of the signal and is a
general-purpose procedure. It does not suffer from residual effects
or assume constraining models for the vocal tract and the glottal source,
as is commonly the case with existing methods. The resulting estimate
of the glottal volume velocity is accurate and can be used for modeling
and synthesis purposes.
Authors:
Khaled El-Maleh,
Ara Samouelian,
Peter Kabal,
Page (NA) Paper number 1774
Abstract:
Background environmental noises degrade the performance of speech-processing
systems (e.g. speech coding, speech recognition). By modifying the
processing according to the type of background noise, the performance
can be enhanced. This requires noise classification. In this paper,
four pattern-recognition frameworks have been used to design noise
classification algorithms. Classification is done on a frame-by-frame
basis (e.g. once every 20 ms). Five commonly encountered noises in
mobile telephony (i.e. car, street, babble, factory, and bus) have
been considered in our study. Our experimental results show that the
Line Spectral Frequencies (LSFs) are robust features for distinguishing
the different classes of noises.
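The LSF features named above can be computed from an LPC polynomial by the standard sum/difference-polynomial construction; the sketch below uses this textbook method (the order-2 test values are illustrative, not from the paper).

```python
import numpy as np

def lsf(a):
    """Line Spectral Frequencies (radians, 0..pi) of the LPC polynomial
    A(z) = 1 - sum_k a[k-1] z^-k, via the sum/difference polynomials
    P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z),
    whose roots interlace on the unit circle."""
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    ang = []
    for poly in (P, Q):
        r = np.roots(poly)
        ang.extend(np.angle(r[r.imag >= 0]))  # keep upper half-plane roots
    # drop the trivial roots at angles 0 and pi contributed by P and Q
    return np.sort([t for t in ang if 1e-4 < t < np.pi - 1e-4])
```

For a single resonance, the LSF pair brackets the pole angle, which is what makes these features stable, interpretable summaries of the noise spectrum on each frame.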