Authors:
Firas Jabloun, INRS-Telecommunications (Canada)
Ahmet Enis Çetin, Bilkent University (Turkey)
Page (NA) Paper number 1013
Abstract:
In this paper, a new set of speech feature parameters based on multirate
signal processing and the Teager Energy Operator is developed. The
speech signal is first divided into nonuniform subbands in mel-scale
using a multirate filter-bank, then the Teager energies of the subsignals
are estimated. Finally, the feature vector is constructed by log-compression
and inverse DCT computation. The new feature parameters have a robust
speech recognition performance in car engine noise, which is low-pass
in nature.
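The discrete Teager Energy Operator at the core of this feature pipeline has a simple three-sample form, ψ[x(n)] = x(n)² − x(n−1)·x(n+1). A minimal sketch of just that stage (the mel-scale filter bank, log compression, and inverse DCT steps of the full pipeline are omitted):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[x(n)] = x(n)^2 - x(n-1)*x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(w*n + phi) the operator yields exactly A^2 * sin(w)^2,
# so it tracks both the amplitude and the frequency of the oscillation.
n = np.arange(1000)
tone = 0.5 * np.cos(0.2 * n)
psi = teager_energy(tone)
expected = 0.25 * np.sin(0.2) ** 2
```

Applied per subband signal, this energy estimate replaces the conventional squared-magnitude energy before the log and inverse-DCT stages.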
Authors:
Ascensión Gallardo-Antolín,
Fernando Díaz-de-María,
Francisco Valverde-Albacete,
Page (NA) Paper number 1443
Abstract:
In this paper, we have extended our previous research on a new approach
to ASR in the GSM environment. Instead of recognizing from the decoded
speech signal, our system works from the digital speech representation
used by the GSM encoder. We have compared the performance of a conventional
system and the one we propose on a speaker-independent, isolated-digit
ASR task. From our results for the half- and full-rate GSM codecs,
we conclude that the proposed approach is much more effective in coping
with the coding distortion and transmission errors. Furthermore, in
clean speech conditions, our approach does not impoverish the recognition
performance, even recognizing from GSM digital speech, in comparison
with a conventional system working on unencoded speech.
Authors:
Mike Peters, BMW AG Research and Development, 80788 Munich (Germany)
Page (NA) Paper number 1874
Abstract:
A two channel approach to noise robust feature extraction for speech
recognition in the car is proposed. The coherence function within the
Bark subbands of the MFCC transform is calculated to estimate the spectral
similarity of two stochastic processes. It is illustrated how the coherence
of speech in binaural signals is used to increase the robustness against
incoherent noise. The introduced preprocessing of nonstationary signals
in two microphones results in an additive correction term for the Mel-Frequency Cepstral Coefficients.
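The magnitude-squared coherence underlying this weighting can be estimated with standard tools. A sketch on synthetic two-microphone data using scipy's Welch-based estimator (the Bark-band averaging and the cepstral correction term itself are not shown; the signal model here is illustrative):

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(0)
fs = 8000
# A "speech" component reaching both microphones, plus independent
# (incoherent) noise in each channel.
common = rng.standard_normal(4 * fs)
left = common + 0.1 * rng.standard_normal(common.size)
right = common + 0.1 * rng.standard_normal(common.size)

# Welch-based magnitude-squared coherence: near 1 where the channels
# share a source, near 0 for incoherent noise.
f, cxy = coherence(left, right, fs=fs, nperseg=256)

# Two unrelated signals give low coherence across all frequencies.
_, noise_cxy = coherence(rng.standard_normal(4 * fs),
                         rng.standard_normal(4 * fs),
                         fs=fs, nperseg=256)
```

Averaging `cxy` inside each Bark band would give the per-band similarity measure from which a cepstral correction term can be derived.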
Authors:
Satoru Tsuge, ATR-ITL (Japan)
Toshiaki Fukada, ATR-ITL (Japan)
Harald Singer, ATR-ITL (Japan)
Page (NA) Paper number 1686
Abstract:
This paper proposes speaker normalized spectral subband centroids (SSCs)
as supplementary features in noise environment speech recognition.
SSCs are computed as frequency centroids for each subband from the
power spectrum of the speech signal. Since the conventional SSCs depend
on formant frequencies of a speaker, we introduce a speaker normalization
technique into SSC computation to reduce the speaker variability. Experimental
results on spontaneous speech recognition show that the speaker normalized
SSCs are more useful as supplementary features for improving the recognition
performance than the conventional SSCs.
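The base SSC computation described here is a frequency centroid of the power spectrum within each subband. A minimal sketch (the band edges and the power-law exponent `gamma` are illustrative, and the paper's speaker normalization step is not detailed in the abstract, so it is omitted):

```python
import numpy as np

def subband_centroids(power, freqs, band_edges, gamma=1.0):
    """Frequency centroid of the gamma-weighted power spectrum per subband."""
    centroids = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = (freqs >= lo) & (freqs < hi)
        w = power[band] ** gamma
        centroids.append(np.sum(freqs[band] * w) / np.sum(w))
    return np.array(centroids)

freqs = np.linspace(0, 4000, 401)   # 10 Hz grid up to 4 kHz
power = np.ones_like(freqs)         # flat spectrum -> centroids at band midpoints
ssc = subband_centroids(power, freqs, band_edges=[0, 1000, 2000, 4000])
```

Because the centroids track spectral peaks, they follow a speaker's formant frequencies, which is exactly the speaker dependence the proposed normalization aims to remove.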
Authors:
Hynek Hermansky,
Sangita Sharma,
Page (NA) Paper number 2427
Abstract:
In this paper we study a new approach to processing temporal information
for automatic speech recognition (ASR). Specifically, we study the
use of rather long-time TempoRAl Patterns (TRAPs) of spectral energies
in place of the conventional spectral patterns for ASR. The proposed
Neural TRAPs are found to yield a significant amount of complementary
information to that of the conventional spectral feature based ASR
system. A combination of these two ASR systems is shown to result in
improved robustness to several types of additive and convolutive environmental
degradations.
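A TRAP is the long temporal trajectory of a single critical-band energy around the current frame, used in place of a short-time spectral slice. A sketch of the extraction step (the context length and mean normalization are illustrative; the band-specific neural classifiers that consume these vectors are not shown):

```python
import numpy as np

def trap_vectors(log_energies, center, context=50):
    """Extract one TempoRAl Pattern per critical band around a target frame.

    log_energies: (num_frames, num_bands) log critical-band energies.
    Returns (num_bands, 2*context + 1); each row is the roughly 1 s
    trajectory of one band (at a 10 ms frame shift), mean-normalized,
    a common per-TRAP normalization.
    """
    window = log_energies[center - context : center + context + 1]
    traps = window.T
    return traps - traps.mean(axis=1, keepdims=True)

rng = np.random.default_rng(1)
bands = rng.standard_normal((300, 15))   # 300 frames x 15 critical bands
traps = trap_vectors(bands, center=150)  # 15 TRAPs of 101 frames each
```

Each row would feed a separate band classifier, whose outputs are then merged; combining that system with a conventional spectral-feature system gives the reported robustness gains.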
Authors:
Montri Karnjanadecha,
Stephen A Zahorian,
Page (NA) Paper number 2036
Abstract:
This paper presents speech signal modeling techniques which are well
suited to high performance and robust isolated word recognition. Speech
is encoded by a discrete cosine transform of its spectra, after several
preprocessing steps. Temporal information is then also explicitly encoded
into the feature set. We present a new technique for incorporating
this temporal information as a function of temporal position within
each word. We tested features computed with this method using an alphabet
recognition task based on the ISOLET database. The HTK toolkit was
used to implement the isolated word recognizer with whole word HMM
models. The best result obtained based on 50 features and speaker independent
alphabet recognition was 98.0%. Gaussian noise was added to the original
speech to simulate a noisy environment. We achieved a recognition accuracy
of 95.8% at an SNR of 15 dB. We also tested our recognizer with simulated
telephone quality speech by adding noise and band limiting the original
speech. For this "telephone" speech, our recognizer achieved 89.6%
recognition accuracy. The recognizer was also tested in a speaker dependent
mode, resulting in 97.4% accuracy on test data.
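The feature scheme described, a DCT over frequency followed by an explicit temporal encoding over the word, can be sketched with a second DCT over time. The parameter choices below (10 spectral terms times 5 temporal terms, giving 50 features as in the abstract) are an assumption; the paper's exact basis functions and preprocessing are not given here:

```python
import numpy as np
from scipy.fft import dct

def dct_features(log_spectra, n_spec=10, n_time=5):
    """Spectral/temporal DCT encoding (a sketch of this style of feature).

    log_spectra: (num_frames, num_bins) log-magnitude spectra for one word.
    Step 1: DCT over frequency keeps n_spec cepstral-like terms per frame.
    Step 2: DCT over time compresses each term's trajectory to n_time
            coefficients, encoding position within the word.
    """
    spec_terms = dct(log_spectra, axis=1, norm='ortho')[:, :n_spec]
    time_terms = dct(spec_terms, axis=0, norm='ortho')[:n_time, :]
    return time_terms.ravel()   # n_spec * n_time features per word

rng = np.random.default_rng(2)
word = rng.standard_normal((60, 128))   # 60 frames of 128-bin log spectra
feats = dct_features(word)
```

The low-order temporal DCT terms summarize how each spectral coefficient evolves across the word, which is what makes the representation a good fit for whole-word HMMs.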
Authors:
Yifan Gong,
John J. Godfrey,
Page (NA) Paper number 1721
Abstract:
In the absence of HMMs trained with speech collected in the target
environment, one may use HMMs trained with a large amount of speech
collected in another recording condition (e.g., a quiet office with a
high-quality microphone). However, this may result in poor performance
because of the mismatch between the two acoustic conditions. We propose
a linear regression-based model adaptation procedure to reduce such
a mismatch. With some adaptation utterances collected for the target
environment, the procedure transforms the HMMs trained in a quiet condition
to maximize the likelihood of observing the adaptation utterances.
The transformation must be designed to maintain speaker-independence
of the HMM. Our speaker-independent test results show that with this
procedure about 1% digit error rate can be achieved for hands-free
recognition, using target environment speech from only 20 speakers.
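The transform in such linear-regression adaptation maps each clean-condition mean as μ' = Aμ + b, with one shared (A, b) so the adapted models stay speaker-independent. The paper's maximum-likelihood estimation over Gaussian mixtures is more involved; under simplifying assumptions made here for illustration (hard state alignment, identity covariances), the global transform reduces to ordinary least squares:

```python
import numpy as np

def estimate_global_transform(means, adapt_frames, alignment):
    """Least-squares estimate of a global affine mean transform mu' = A mu + b.

    Simplified sketch: with hard alignment and identity covariances, the ML
    estimate is linear regression of the target-environment frames on the
    clean-condition means they are aligned to.
    """
    X = np.hstack([means[alignment], np.ones((len(alignment), 1))])
    W, *_ = np.linalg.lstsq(X, adapt_frames, rcond=None)
    return W[:-1].T, W[-1]                      # A (d x d), b (d,)

rng = np.random.default_rng(3)
means = rng.standard_normal((8, 3))             # 8 clean-condition state means, d = 3
A_true = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
b_true = rng.standard_normal(3)
align = rng.integers(0, 8, size=200)            # hard alignment of 200 adaptation frames
frames = means[align] @ A_true.T + b_true       # noiseless "target environment" data
A, b = estimate_global_transform(means, frames, align)
```

Because every Gaussian mean is moved by the same transform, speaker-specific detail in the original HMMs is preserved while the environment mismatch is compensated.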
Authors:
Shuen Kong Wong,
Bertram Shi,
Page (NA) Paper number 2228
Abstract:
We present a non-linear model transformation for adapting Gaussian
Mixture HMMs using both static and dynamic MFCC observation vectors
to additive noise and constant system tilt. This transformation depends
upon a few compensation coefficients which can be estimated from channel
distorted speech via Maximum-Likelihood stochastic matching. Experimental
results validate the effectiveness of the adaptation. We also provide
an adaptation strategy which can result in improved performance at
reduced computational cost compared with a straightforward implementation
of stochastic matching.