Abstract: Session SP-7
SP-7.1
Speech Analysis/Synthesis/Conversion by Using Sequential Processing
Panuthat Boonpramuk,
Tetsuo Funada (Faculty of Engineering, Kanazawa University),
Noboru Kanedera (Ishikawa National College of Technology)
This paper presents a method for speech analysis/synthesis/conversion using
sequential processing. The aims of this method are to improve the quality of synthesized
speech and to convert the original speech into speech with different characteristics.
We apply a Kalman filter to estimate the auto-regressive coefficients of the 'time-varying
AR model with unknown input' (ARUI model), which we have proposed as an improvement on the
conventional AR model, and we use a band-pass filter to construct a 'guide signal' for
extracting the pitch period from the residual signal. These signals are used to build
the driving source signal for speech synthesis. We also use the guide signal for
speech conversion, such as changing the pitch and the utterance length. Moreover, we show
experimentally that this method can analyze/synthesize/convert speech without
instability by using the smoothed auto-regressive coefficients.
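
As an illustration of sequential AR estimation, the following is a minimal Python
sketch of a Kalman filter tracking time-varying AR coefficients under random-walk
dynamics. It does not reproduce the ARUI model itself (the unknown-input term and
the guide-signal construction are omitted); the function name and noise settings
are illustrative assumptions.

    import numpy as np

    def kalman_ar_track(y, p=10, q=1e-4, r=1e-2):
        """Track time-varying AR coefficients of y with a Kalman filter.

        State: the AR coefficient vector, with random-walk dynamics.
        Observation: y[t] = phi_t . a_t + e_t, where phi_t holds the
        previous p samples. q and r are tuning variances.
        """
        n = len(y)
        a = np.zeros(p)                    # state estimate
        P = np.eye(p)                      # state covariance
        Q = q * np.eye(p)
        coeffs = np.zeros((n, p))
        for t in range(p, n):
            phi = y[t - p:t][::-1]         # y[t-1], ..., y[t-p]
            P = P + Q                      # predict (state unchanged)
            s = phi @ P @ phi + r          # innovation variance
            k = P @ phi / s                # Kalman gain
            a = a + k * (y[t] - phi @ a)   # update with prediction residual
            P = P - np.outer(k, phi @ P)
            coeffs[t] = a
        return coeffs

    # toy usage: track the coefficients of a synthetic AR(2) signal
    rng = np.random.default_rng(0)
    x = np.zeros(2000)
    for t in range(2, 2000):
        x[t] = 1.5 * x[t - 1] - 0.7 * x[t - 2] + 0.1 * rng.standard_normal()
    print(kalman_ar_track(x, p=2)[-1])     # approaches [1.5, -0.7]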
SP-7.2
Modelling Energy Flow in the Vocal Tract with Applications to Glottal Closure and Opening Detection
Mike Brookes,
Han Pin Loke (Imperial College)
The pitch-synchronous analysis that is used in several areas of speech processing often requires robust detection of the instants of glottal closure and opening. In this paper we derive expressions for the flow of acoustic energy in the lossless-tube model of the vocal tract and show how linear predictive analysis may be used to estimate the waveform of acoustic input power at the glottis. We demonstrate that this signal may be used to identify the instants of glottal closure and opening during voiced speech and contrast it with the LPC residual signal that previous authors have used for this purpose.
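
For contrast, the LPC-residual baseline that the paper compares against can be
sketched in Python as follows: inverse-filter voiced speech with its LPC model and
treat sharp residual peaks as glottal-closure candidates. This is a generic
illustration, not the authors' energy-flow method; all parameter values are
assumptions.

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter, find_peaks

    def lpc(frame, order=12):
        """Autocorrelation-method LPC coefficients, [1, -a1, ..., -ap]."""
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        return np.concatenate(([1.0], -a))

    def lpc_residual(x, order=12):
        """Inverse-filter x; residual peaks cluster near glottal closures."""
        return lfilter(lpc(x * np.hamming(len(x)), order), [1.0], x)

    # toy usage: impulse-train excitation through a crude resonance
    fs = 8000
    exc = np.zeros(fs // 4)
    exc[::80] = 1.0                            # 100 Hz "glottal" pulses
    x = lfilter([1.0], [1.0, -1.6, 0.9], exc)
    res = lpc_residual(x)
    peaks, _ = find_peaks(np.abs(res), height=0.5 * np.abs(res).max(),
                          distance=40)
    print(peaks[:5])                           # near multiples of 80 samples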
SP-7.3
Fitting the Mel Scale
Srinivasan Umesh (Indian Institute of Technology),
Leon Cohen (City University of New York),
Douglas Nelson (Dept. of Defense USA)
We show that there are many qualitatively different equations,
each with a few parameters, that fit the experimentally obtained Mel scale.
We investigate the often-made claim that there are two regions to the
Mel scale, the lower region ($\lesssim 1000$ Hz) being linear and the upper
region being logarithmic. We show that there is no evidence, based on the
experimental data points, that there are two qualitatively different regions,
or that the lower region is linear and the upper region logarithmic.
In fact, $F_M = f/(af + b)$, where $F_M$ and $f$ are the mel and
physical frequency respectively, fits better than a line in
the ``linear'' region or a logarithm in the ``log'' region.
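
The paper's two-parameter form can be fitted directly with nonlinear least
squares. Since the experimental Mel-scale data points are not reproduced here,
the Python sketch below fits against the common engineering approximation
$2595 \log_{10}(1 + f/700)$ as stand-in data; that substitution is an assumption,
not the authors' data.

    import numpy as np
    from scipy.optimize import curve_fit

    def mel_model(f, a, b):
        """The paper's candidate form: F_M = f / (a*f + b)."""
        return f / (a * f + b)

    # Stand-in data: the common approximation 2595*log10(1 + f/700),
    # NOT the experimental points used in the paper.
    f = np.linspace(50.0, 8000.0, 60)
    mel = 2595.0 * np.log10(1.0 + f / 700.0)

    (a, b), _ = curve_fit(mel_model, f, mel, p0=[3e-4, 0.6])
    fit_rms = np.sqrt(np.mean((mel_model(f, a, b) - mel) ** 2))
    print(a, b, fit_rms)   # fitted parameters and RMS deviation, in mels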
SP-7.4
Fast Accent Identification and Accented Speech Recognition
Pascale Fung,
Wai Kat Liu (Hong Kong University of Science and Technology (HKUST))
The performance of speech recognition systems degrades when the speaker's accent differs from that of the training set. Both accent-independent and accent-dependent
recognition require the collection of more training data.
In this paper, we propose a faster accent-classification approach using phoneme-class models. We also present our findings on acoustic features sensitive
to a Cantonese accent, and possibly to other Asian-language accents.
In addition, we show how to rapidly transform a native-accent pronunciation
dictionary into one for accented speech, simply by using knowledge of the
foreign speaker's native language.
The use of this accent-adapted dictionary reduces the recognition error rate by 13.5\%, similar to the results obtained from a longer, data-driven process.
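
The dictionary-transformation step might look like the following Python sketch:
phone-substitution rules, motivated by the speaker's native language, rewrite a
native-accent lexicon into an accented one. The rules and phone set below are
purely hypothetical examples, not the paper's.

    # Hypothetical sketch: rewrite a native-accent pronunciation
    # dictionary with phone-substitution rules motivated by the
    # speaker's native language. Rules and phone set are illustrative.
    RULES = {'r': 'l',     # e.g. an /r/ -> /l/ confusion
             'th': 'd',
             'v': 'w'}

    def adapt_pron(phones):
        """Map a native-accent phone sequence to an accented variant."""
        return [RULES.get(p, p) for p in phones]

    native_dict = {'very':  ['v', 'eh', 'r', 'iy'],
                   'three': ['th', 'r', 'iy']}
    accented_dict = {w: adapt_pron(p) for w, p in native_dict.items()}
    print(accented_dict)   # very -> w eh l iy, three -> d l iy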
SP-7.5
Relevancy of Time-Frequency Features for Phonetic Classification Measured by Mutual Information
Howard H Yang,
Sarel J Van Vuuren,
Hynek Hermansky (Oregon Graduate Institute of Science and Technology)
In this paper we use mutual information to study the distribution in
time and frequency of information relevant for phonetic
classification. A large database of hand-labeled fluent speech is used
to (a) compute the mutual information between phoneme labels and a
point of logarithmic energy in the time-frequency plane and (b)
compute the joint mutual information between phoneme labels and two
points of logarithmic energy in the time-frequency plane.
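
A histogram-based estimate of the mutual information between discrete phoneme
labels and one continuous time-frequency point can be sketched in Python as
follows; the binning scheme and sizes are illustrative assumptions, not the
paper's exact estimator.

    import numpy as np

    def mutual_information(labels, feature, bins=32):
        """Histogram estimate of I(label; feature) in bits.

        feature is one continuous value per frame (e.g. log energy at
        one time-frequency point), quantized into equal-width bins.
        """
        edges = np.histogram_bin_edges(feature, bins=bins)
        q = np.clip(np.digitize(feature, edges) - 1, 0, bins - 1)
        labs = np.unique(labels)
        joint = np.zeros((len(labs), bins))
        for i, lab in enumerate(labs):
            joint[i] = np.bincount(q[labels == lab], minlength=bins)
        joint /= joint.sum()
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        nz = joint > 0
        return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

    # toy check: a feature that depends on the label carries information
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 5, 20000)
    feature = labels + 0.5 * rng.standard_normal(20000)
    print(mutual_information(labels, feature))   # clearly above 0 bits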
SP-7.6
Hidden Markov Models Based on Multi-Space Probability Distribution for Pitch Pattern Modeling
Keiichi Tokuda (Nagoya Institute of Technology, Nagoya, Japan),
Takashi Masuko (Tokyo Institute of Technology, Japan),
Noboru Miyazaki (NTT Basic Research Laboratories, Japan),
Takao Kobayashi (Tokyo Institute of Technology, Japan)
This paper discusses a hidden Markov model (HMM) based on a
multi-space probability distribution (MSD). HMMs are widely
used statistical models for characterizing sequences of
speech spectra and have been applied successfully in speech
recognition systems. This suggests that the HMM should also be
useful for modeling the pitch patterns of speech.
However, the conventional discrete and continuous HMMs cannot
be applied to pitch pattern modeling, since the observation
sequence of a pitch pattern is composed of one-dimensional
continuous values and a discrete symbol which represents
``unvoiced''. The MSD-HMM includes the discrete HMM and the
continuous mixture HMM as special cases, and can moreover model
sequences of observation vectors of variable dimension,
including zero-dimensional observations, i.e., discrete
symbols. As a result, MSD-HMMs can model pitch patterns
without heuristic assumptions. We derive a reestimation
algorithm for the extended HMM and show that it finds a
critical point of the likelihood function.
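
The key modeling idea can be illustrated by the per-state output probability: a
weighted mixture over spaces of different dimension, with a 1-D Gaussian for
voiced frames and a point mass for the zero-dimensional ``unvoiced'' symbol. The
Python sketch below is a simplified single-Gaussian instance, not the paper's
full MSD-HMM with its reestimation algorithm.

    import numpy as np

    def msd_state_likelihood(obs, w_voiced, mean, var, w_unvoiced):
        """Multi-space output probability for one HMM state.

        obs is a float (voiced: a 1-D log-F0 value) or None (the
        zero-dimensional 'unvoiced' space). The voiced space carries a
        Gaussian density, the unvoiced space a point mass, and the
        space weights satisfy w_voiced + w_unvoiced = 1.
        """
        if obs is None:                    # discrete 'unvoiced' symbol
            return w_unvoiced
        g = np.exp(-0.5 * (obs - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
        return w_voiced * g

    # toy usage over a pitch track with unvoiced gaps
    track = [5.3, 5.25, None, None, 5.4]   # log-F0 values, None = unvoiced
    ll = sum(np.log(msd_state_likelihood(o, 0.8, 5.3, 0.01, 0.2))
             for o in track)
    print(ll)                              # state log-likelihood of the track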
SP-7.7
An Algorithm for Glottal Volume Velocity Estimation
Ashraf Alkhairy (M. I. T.)
We present a new method for the estimation of
the glottal volume velocity from voiced segments of
the radiated acoustic speech waveform. Our algorithm is
based on spectral factorization of the signal and
is a general-purpose procedure. It does not suffer from
residual effects or assume constraining models for
the vocal tract and the glottal source, as is commonly
the case with existing methods. The resulting estimate
of the glottal volume velocity is accurate and can be
used for modeling and synthesis purposes.
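
Spectral factorization in its generic minimum-phase form can be sketched in
Python via the real cepstrum, as below. This only illustrates the factorization
primitive; the paper's glottal-flow estimation procedure built on it is not
reproduced here.

    import numpy as np

    def minimum_phase_factor(x, nfft=1024):
        """Generic minimum-phase spectral factorization via the real cepstrum.

        Returns a minimum-phase signal whose magnitude spectrum matches
        |X(f)|; it illustrates the factorization primitive only.
        """
        logmag = np.log(np.abs(np.fft.fft(x, nfft)) + 1e-12)
        ceps = np.fft.ifft(logmag).real
        w = np.zeros(nfft)                 # fold the cepstrum: keep c[0],
        w[0] = 1.0                         # double the causal part,
        w[1:nfft // 2] = 2.0               # zero the anticausal part
        w[nfft // 2] = 1.0
        return np.fft.ifft(np.exp(np.fft.fft(w * ceps))).real

    # sanity check on a smooth spectrum: the factor reproduces |X(f)|
    x = 0.9 ** np.arange(64)
    y = minimum_phase_factor(x)
    err = np.max(np.abs(np.abs(np.fft.fft(y)) - np.abs(np.fft.fft(x, 1024))))
    print('max magnitude mismatch:', err)  # tiny for smooth spectra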
SP-7.8
Frame-Level Noise Classification in Mobile Environments
Khaled El-Maleh (Electrical and Computer Engineering Dept., McGill University),
Ara Samouelian (School of Elect., Comp. and Telecomm. Eng., University of Wollongong),
Peter Kabal (Electrical and Computer Engineering Dept., McGill University)
Background environmental noise degrades the performance
of speech-processing systems (e.g. speech coding,
speech recognition). By modifying the processing
according to the type of background noise, performance
can be enhanced. This requires noise
classification. In this paper, four pattern-recognition
frameworks are used to design noise-classification
algorithms. Classification is done on
a frame-by-frame basis (e.g. once every 20 ms). Five
noise types commonly encountered in mobile telephony
(car, street, babble, factory, and bus) are
considered in our study. Our experimental results show
that Line Spectral Frequencies (LSFs) are
robust features for distinguishing the different
classes of noise.
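
A minimal frame-level sketch in Python, assuming the autocorrelation LPC method
and a nearest-class-mean classifier (one of many possible pattern-recognition
frameworks; the paper's four are not specified here): compute LSFs per 20 ms
frame and pick the closest class. The noise classes and model order are
illustrative.

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lsf(frame, order=10):
        """Line spectral frequencies of one frame, in radians in (0, pi)."""
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        ar = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        a = np.concatenate(([1.0], -ar))             # LPC polynomial A(z)
        p_poly = np.concatenate((a, [0])) + np.concatenate(([0], a[::-1]))
        q_poly = np.concatenate((a, [0])) - np.concatenate(([0], a[::-1]))
        ang = np.angle(np.concatenate((np.roots(p_poly), np.roots(q_poly))))
        return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])

    def classify(frame, class_means):
        """Nearest-class-mean decision on the frame's LSF vector."""
        x = lsf(frame)
        return min(class_means, key=lambda c: np.sum((class_means[c] - x) ** 2))

    # toy usage: white vs. lowpass-coloured noise, one 20 ms frame at 8 kHz
    rng = np.random.default_rng(0)
    means = {'white': lsf(rng.standard_normal(4000)),
             'car-like': lsf(lfilter([1.0], [1.0, -0.95],
                                     rng.standard_normal(4000)))}
    test = lfilter([1.0], [1.0, -0.95], rng.standard_normal(160))
    print(classify(test, means))                     # 'car-like'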