|
Abstract: Session SP-9 |
|
SP-9.1
|
The Teager Energy Based Feature Parameters for Robust Speech Recognition in Car Noise
Firas Jabloun (INRS-telecommunications (Canada)),
Enis A Cetin (Bilkent University (Turkey))
In this paper, a new set of speech feature parameters based on multirate signal processing and the Teager Energy Operator is developed.
The speech signal is first divided into nonuniform subbands in mel-scale using a multirate filter-bank,
then the Teager energies of the subsignals are estimated.
Finally, the feature vector is constructed by log-compression and inverse DCT computation.
The new feature parameters yield robust speech recognition performance in car engine noise, which is low-pass in nature.
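The abstract does not spell out the operator itself. A minimal sketch, assuming the standard discrete Teager Energy Operator, psi[n] = x[n]^2 - x[n-1]*x[n+1], applied to one subband signal; the function names and the choice of averaging the absolute energy over a frame before log-compression are illustrative assumptions, not the authors' stated procedure.

```python
import math


def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1]*x[n+1].

    Returns one value per interior sample (the two end samples are skipped).
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def log_mean_teager_energy(x, floor=1e-12):
    """Log-compressed mean absolute Teager energy of one subband frame.

    Averaging |psi| and flooring before the log are assumptions made
    for this sketch only.
    """
    psi = teager_energy(x)
    mean_abs = sum(abs(v) for v in psi) / len(psi)
    return math.log(max(mean_abs, floor))


# Sanity check on a pure tone: for x[n] = A*cos(w*n) the operator
# returns the constant A^2 * sin(w)^2 at every interior sample.
A, w = 2.0, 0.3
tone = [A * math.cos(w * n) for n in range(64)]
psi = teager_energy(tone)
expected = A * A * math.sin(w) ** 2
```

Because the output for a tone grows with both amplitude and frequency, the operator weights a subband's energy by how fast the signal oscillates, unlike a plain squared-amplitude energy.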
|
SP-9.2
|
AVOIDING DISTORTIONS DUE TO SPEECH CODING AND TRANSMISSION ERRORS IN GSM ASR TASKS
Ascensión Gallardo-Antolín,
Fernando Díaz-de-María,
Francisco Valverde-Albacete (Departamento de Tecnologías de las Comunicaciones. EPS-Universidad Carlos III de Madrid)
In this paper, we have extended our previous research
on a new approach to ASR in the GSM environment.
Instead of recognizing from the decoded speech signal,
our system works from the digital speech representation
used by the GSM encoder.
We have compared the performance of a conventional
system and the one we propose on a speaker-independent,
isolated-digit ASR task. For the half-rate and full-rate
GSM codecs, our results show that the
proposed approach is much more effective in coping
with the coding distortion and transmission errors.
Furthermore, in clean speech conditions, our approach
does not degrade recognition performance, even
when recognizing from GSM digital speech, in comparison
with a conventional system working on unencoded
speech.
|
SP-9.3
|
Binaural Bark subband Preprocessing of nonstationary Signals for noise robust Speech Feature Extraction
Mike Peters (BMW AG Research and Development 80788 Munich, Germany)
A two-channel approach to noise-robust feature extraction for speech recognition in the car is proposed. The coherence function within the Bark subbands of the MFCC transform is calculated to estimate the spectral similarity of two stochastic processes. It is illustrated how the coherence of speech in binaural signals is used to increase the robustness against incoherent noise. The introduced preprocessing of nonstationary signals from two microphones results in an additive correction term to the Mel-Frequency Cepstral Coefficients.
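The abstract does not say which coherence estimator is used. A minimal sketch, assuming the common magnitude-squared coherence, |S_xy|^2 / (S_xx * S_yy), estimated by summing over frames for a single subband; the function name and input layout are hypothetical, and the step that turns coherence into a cepstral correction term is not reproduced here.

```python
def magnitude_squared_coherence(x_frames, y_frames):
    """Magnitude-squared coherence of two channels in one subband.

    x_frames, y_frames: equal-length lists of complex spectral values,
    one per analysis frame, for the same subband in each microphone.
    Returns a value between 0 (incoherent) and 1 (fully coherent).
    """
    s_xy = sum(x * y.conjugate() for x, y in zip(x_frames, y_frames))
    s_xx = sum(abs(x) ** 2 for x in x_frames)
    s_yy = sum(abs(y) ** 2 for y in y_frames)
    if s_xx == 0 or s_yy == 0:
        return 0.0
    return abs(s_xy) ** 2 / (s_xx * s_yy)


# Channel 2 is a scaled, phase-shifted copy of channel 1: fully coherent.
left = [1 + 2j, -1 + 0j, 3j, 2 - 1j]
right_coherent = [0.5j * v for v in left]
coherent = magnitude_squared_coherence(left, right_coherent)

# The phase relation rotates from frame to frame: the cross terms cancel.
flat = [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]
rotating = [1 + 0j, 1j, -1 + 0j, -1j]
incoherent = magnitude_squared_coherence(flat, rotating)
```

A source with a stable phase relation between the two microphones scores near 1, while diffuse noise with a random inter-channel phase averages toward 0, which is the property the abstract relies on.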
|
SP-9.4
|
Speaker Normalized Spectral Subband Parameters for Noise Robust Speech Recognition
Satoru Tsuge,
Toshiaki Fukada,
Harald Singer (ATR-ITL, JAPAN)
This paper proposes speaker normalized spectral subband centroids (SSCs) as
supplementary features for speech recognition in noisy environments. SSCs are
computed as frequency centroids for each subband from the power spectrum of the
speech signal. Since the conventional SSCs depend on formant frequencies of a
speaker, we introduce a speaker normalization technique into SSC computation to
reduce the speaker variability. Experimental results on spontaneous speech
recognition show that the speaker normalized SSCs are more useful as
supplementary features for improving the recognition performance than the
conventional SSCs.
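To make the centroid computation concrete, here is a minimal sketch of conventional (not speaker normalized) SSCs, assuming rectangular subbands and a power-weighted mean frequency per subband. The function name, the `gamma` exponent, and the fallback for an empty subband are illustrative assumptions; the speaker normalization step is not specified in the abstract and is not attempted.

```python
def subband_centroids(power, freqs, edges, gamma=1.0):
    """Frequency centroid of each subband of a power spectrum.

    power: power spectrum values, one per frequency bin
    freqs: bin center frequencies in Hz, same length as power
    edges: subband boundaries in Hz; subband m covers [edges[m], edges[m+1])
    gamma: exponent applied to the power before weighting (assumed parameter)
    """
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        num = 0.0
        den = 0.0
        for f, p in zip(freqs, power):
            if lo <= f < hi:
                weight = p ** gamma
                num += f * weight
                den += weight
        # With no energy in the subband, fall back to its midpoint.
        centroids.append(num / den if den > 0 else 0.5 * (lo + hi))
    return centroids


freqs = [0.0, 100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0]
power = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
cents = subband_centroids(power, freqs, edges=[0.0, 400.0, 800.0])
```

In the first subband all the energy sits at 200 Hz, so the centroid follows that peak; in the second the spectrum is flat, so the centroid lands at the mean bin frequency. This peak-following behavior is why the centroids depend on a speaker's formant frequencies.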
|
SP-9.5
|
TempoRAl Patterns (TRAPs) In ASR Of Noisy Speech
Hynek Hermansky,
Sangita Sharma (Oregon Graduate Institute of Science and Technology)
In this paper we study a new approach to processing
temporal information for automatic speech recognition
(ASR). Specifically, we study the use of rather long-time
TempoRAl Patterns (TRAPs) of spectral energies in place of
the conventional spectral patterns for ASR. The proposed
Neural TRAPs are found to yield a significant amount of
information complementary to that of the conventional
spectral-feature-based ASR system. A combination of these
two ASR systems is shown to result in improved robustness
to several types of additive and convolutive environmental
degradations.
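As an illustration of what a temporal pattern is, the sketch below cuts a long trajectory of one band's energy around a center frame, instead of taking all bands at a single frame. The function name, the `half_width` parameter, and the edge handling (repeating the first or last frame) are assumptions for this example; the abstract does not give the pattern length or any normalization.

```python
def temporal_pattern(band_energies, center, half_width):
    """Temporal trajectory of one frequency band around a center frame.

    band_energies: per-frame (log) energies for a single band
    center: index of the frame being classified
    half_width: number of context frames taken on each side (assumed)

    Frames requested beyond either end are filled by repeating the
    nearest edge frame, an assumption made for this sketch.
    """
    last = len(band_energies) - 1
    return [
        band_energies[min(max(t, 0), last)]
        for t in range(center - half_width, center + half_width + 1)
    ]


trajectory = [float(t) for t in range(10)]
middle = temporal_pattern(trajectory, center=5, half_width=1)
near_start = temporal_pattern(trajectory, center=1, half_width=2)
```

Each band yields its own vector of length 2 * half_width + 1, which would then be fed to a per-band classifier.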
|
SP-9.6
|
Signal Modeling for Isolated Word Recognition
Montri Karnjanadecha,
Stephen A Zahorian (Old Dominion University)
This paper presents speech signal modeling techniques that are well suited to high-performance, robust isolated word recognition. Speech is encoded by a discrete cosine transform of its spectra, after several preprocessing steps. Temporal information is then also explicitly encoded into the feature set. We present a new technique for incorporating this temporal information as a function of temporal position within each word. We tested features computed with this method using an alphabet recognition task based on the ISOLET database. The HTK toolkit was used to implement the isolated word recognizer with whole-word HMM models. The best result obtained with 50 features on speaker-independent alphabet recognition was 98.0%. Gaussian noise was added to the original speech to simulate a noisy environment. We achieved a recognition accuracy of 95.8% at an SNR of 15 dB. We also tested our recognizer with simulated telephone-quality speech by adding noise and band-limiting the original speech. For this "telephone" speech, our recognizer achieved 89.6% recognition accuracy. The recognizer was also tested in a speaker-dependent mode, resulting in 97.4% accuracy on test data.
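The noisy test condition described above, Gaussian noise added at a chosen SNR, can be sketched as follows. This is a generic recipe, not the authors' code: the function names are hypothetical, and scaling the noise so that the measured SNR matches the target exactly is an assumption.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled to the requested SNR in dB."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Target noise power follows from SNR_dB = 10*log10(P_signal / P_noise).
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * v for s, v in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB, treating (noisy - clean) as the noise component."""
    residual = [n - c for c, n in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(residual))


rng = random.Random(0)  # fixed seed so the example is repeatable
clean = [math.sin(0.1 * n) for n in range(1000)]
noisy = add_noise_at_snr(clean, 15.0, rng)
snr = measured_snr_db(clean, noisy)
```

Here `snr` comes out at 15 dB up to floating-point rounding, since the noise is rescaled against its own measured power rather than its nominal variance.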
|
SP-9.7
|
Transforming HMMs For Speaker-Independent Hands-Free Speech Recognition in the Car
Y. Gong,
John J. Godfrey (Texas Instruments Inc.)
In the absence of HMMs trained with speech collected in
the target environment, one may use HMMs trained with a
large amount of speech collected in another recording
condition (e.g., a quiet office with a high-quality
microphone). However, this may result in poor
performance because of the mismatch between the two
acoustic conditions.
We propose a linear regression-based model adaptation
procedure to reduce such a mismatch. With some
adaptation utterances collected for the target
environment, the procedure transforms the HMMs trained
in a quiet condition to maximize the likelihood of
observing the adaptation utterances. The transformation
must be designed to maintain speaker-independence of
the HMM.
Our speaker-independent test results show that with
this procedure a digit error rate of about 1% can be
achieved for hands-free recognition, using target
environment speech from only 20 speakers.
|
SP-9.8
|
Channel and Noise Adaptation via HMM Mixture Mean Transform and Stochastic Matching
Shuen Kong Wong (Department of EEE/HK Univ. of Sci. and Tech.),
Bertram Shi (Department of EEE/HK Univ. of Sci. and Tech.)
We present a non-linear model transformation for adapting Gaussian
Mixture HMMs using both static and dynamic MFCC observation vectors to
additive noise and constant system tilt. This transformation depends
upon a few compensation coefficients which can be estimated from
channel distorted speech via Maximum-Likelihood stochastic matching.
Experimental results validate the effectiveness of the adaptation. We
also provide an adaptation strategy which can result in improved
performance at reduced computational cost compared with a
straightforward implementation of stochastic matching.
|
|