Abstract: Session SP-14
SP-14.1
Time-Varying Noise Compensation Using Multiple Kalman Filters
Nam Soo Kim (School of Electrical Engineering, Seoul National University)
The environmental conditions in which a speech recognition system operates are usually nonstationary. We present an approach to compensating for the effects of time-varying noise using a bank of Kalman filters. The method is based on the interacting multiple model (IMM) technique, which is well known in the area of multiple target tracking. Moreover, we propose a way to obtain fixed-interval smoothed estimates of the environmental parameters. The performance of the proposed approaches is evaluated in continuous digit recognition experiments in which both slowly evolving and rapidly varying noise sources are added to simulate noisy environments.
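As an illustration only (not taken from the paper), the following is a minimal Python sketch of an interacting multiple model estimator built from a bank of scalar Kalman filters tracking a single noise parameter. The two-model random-walk setup, the transition matrix, and all numerical values are assumptions made for the sketch.

```python
import numpy as np

# Illustrative setup: two random-walk models for a scalar noise parameter,
# one with small process variance (slowly evolving noise) and one with
# large process variance (rapidly varying noise).
Q = np.array([1e-4, 1e-1])            # process noise variance per model
R = 1e-2                              # observation noise variance
PI = np.array([[0.95, 0.05],          # model transition probabilities
               [0.05, 0.95]])

def imm_step(z, x, P, mu):
    """One IMM cycle: mixing, per-model Kalman predict/update,
    model-probability update, and combined estimate."""
    # 1) Mixing: mixed initial conditions for each filter.
    c = PI.T @ mu                              # predicted model probabilities
    w = (PI * mu[:, None]) / c[None, :]        # mixing weights w[i, j]
    x0 = w.T @ x                               # mixed means
    P0 = np.array([np.sum(w[:, j] * (P + (x - x0[j]) ** 2)) for j in range(2)])

    # 2) Per-model Kalman predict/update (random-walk state model).
    x_pred, P_pred = x0, P0 + Q
    S = P_pred + R                             # innovation variances
    K = P_pred / S                             # Kalman gains
    x_new = x_pred + K * (z - x_pred)
    P_new = (1.0 - K) * P_pred

    # 3) Model probabilities from the Gaussian innovation likelihoods.
    lik = np.exp(-0.5 * (z - x_pred) ** 2 / S) / np.sqrt(2.0 * np.pi * S)
    mu_new = c * lik
    mu_new /= mu_new.sum()

    # 4) Combined estimate used for compensation.
    return x_new, P_new, mu_new, np.sum(mu_new * x_new)

# Track a noise level that jumps halfway through the observation sequence.
rng = np.random.default_rng(0)
truth = np.concatenate([np.full(50, 0.2), np.full(50, 1.0)])
x, P, mu = np.zeros(2), np.ones(2), np.array([0.5, 0.5])
for z in truth + rng.normal(0.0, np.sqrt(R), truth.size):
    x, P, mu, estimate = imm_step(z, x, P, mu)
```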
SP-14.2
A Segment-based C0 Adaptation Scheme for PMC-based Noisy Mandarin Speech Recognition
Wei-Tyng Hong,
Sin-Horng Chen (Dep. of Communication Eng., National Chiao Tung University)
A segment-based C0 (the zeroth-order cepstral coefficient) adaptation scheme for PMC-based Mandarin speech recognition is proposed in this paper. It incorporates a new C0 model of the speech signal into the PMC method to improve the gain matching between the clean-speech HMMs and the current noise model. The C0 model is constructed in the training phase by jointly modeling the normalized C0 with the other MFCC recognition features to form C0-normalized HMMs. In the testing phase, the scheme pre-segments the input utterance into syllable-like segments, performs C0-denormalization operations to expand the C0-normalized HMMs, and uses them in the PMC method. Compared with the conventional PMC method, the proposed method achieves much better noise compensation owing to the more precise gain matching in the PMC model combination. Experimental results showed that the base-syllable accuracy rate was significantly improved for continuous noisy Mandarin speech recognition.
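The abstract describes C0 normalization during training and segment-based denormalization at test time; the minimal sketch below shows one plausible form of these two operations (subtracting each segment's C0 peak during training, and shifting a model's C0 mean by an estimated segment gain before PMC at test time). The normalization rule and function names are assumptions, not the paper's exact scheme.

```python
import numpy as np

def normalize_c0(c0, segments):
    """Remove per-segment gain by subtracting each segment's peak C0
    (assumed normalization rule; the paper's exact scheme may differ)."""
    c0 = np.asarray(c0, dtype=float).copy()
    for start, end in segments:
        c0[start:end] -= c0[start:end].max()
    return c0

def denormalize_c0_mean(c0_mean, segment_gain):
    """Shift a C0-normalized HMM mean back by an estimated segment gain
    before it is combined with the noise model in PMC."""
    return c0_mean + segment_gain
```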
SP-14.3
Improved Parallel Model Combination Techniques With Split Gaussian Mixtures For Speech Recognition Under Noisy Conditions
Jeih-weih Hung (Dept of Electrical Engineering, National Taiwan University),
Jia-lin Shen (Institute of Information Science, Academia Sinica),
Lin-shan Lee (Dept of Electrical Engineering, National Taiwan University)
The parallel model combination (PMC) technique has been very successful and is frequently used to improve the performance of a speech recognition system in noisy environments. This approach assumes that the log spectrum of the speech signal is Gaussian-distributed, which is not always valid, especially when the number of mixtures in the HMMs is small. In this paper, a simple approach is proposed to improve the PMC method by splitting the mixtures before the domain transformation in PMC is performed, and merging them back to the original number after the PMC processes are completed. Preliminary experimental results show that the increased number of mixtures during the PMC processes provides significant improvements in recognition accuracy over the original PMC method, especially when the SNR is low.
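To make the two ingredients concrete, the sketch below combines a clean-speech and a noise Gaussian in the log-spectral domain using the standard log-normal approximation, and shows a moment-preserving split of one Gaussian into two components together with the matching merge. Diagonal covariances are assumed, the cepstral-to-log-spectral DCT step is omitted, and the split rule and its epsilon value are illustrative choices, not the authors' exact procedure.

```python
import numpy as np

def lognormal_moments(mu, var):
    """Log-domain Gaussian -> linear-domain mean and variance."""
    m = np.exp(mu + 0.5 * var)
    v = m ** 2 * (np.exp(var) - 1.0)
    return m, v

def lognormal_inverse(m, v):
    """Linear-domain mean and variance -> log-domain Gaussian."""
    var = np.log1p(v / m ** 2)
    mu = np.log(m) - 0.5 * var
    return mu, var

def pmc_combine(mu_s, var_s, mu_n, var_n):
    """Combine clean-speech and noise log-spectral Gaussians (additive noise)."""
    ms, vs = lognormal_moments(mu_s, var_s)
    mn, vn = lognormal_moments(mu_n, var_n)
    return lognormal_inverse(ms + mn, vs + vn)

def split_gaussian(w, mu, var, eps=0.6):
    """Split one Gaussian into two half-weight components at mu +/- eps*sigma;
    the reduced variance keeps the overall mean and variance unchanged."""
    d = eps * np.sqrt(var)
    return [(0.5 * w, mu - d, var * (1.0 - eps ** 2)),
            (0.5 * w, mu + d, var * (1.0 - eps ** 2))]

def merge_pair(c1, c2):
    """Moment-match two weighted components back into a single Gaussian."""
    (w1, m1, v1), (w2, m2, v2) = c1, c2
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return w, m, v
```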
SP-14.4
Speech Recognition and Enhancement by a Nonstationary AR HMM with Gain Adaptation under Unknown Noise
Ki Yong Lee (School of Electronic Engineering, Soongsil University, 1-1 Sangdo-5Dong, Dongjak-Ku, Seoul, 156-743 Korea),
Joohun Lee (Dept. of Information and Telecommunication, Dong-Ah College, Ansung, Korea),
Gunther Ruske (Inst. for Human-Machine-Communication, Munich University of Technology, Germany)
In this paper, a gain-adapted speech recognition method for unknown noise is developed in the time domain. The noise is assumed to be colored. A nonstationary autoregressive (NAR) hidden Markov model (HMM) is used to model clean speech. The nonstationary AR coefficients are modeled by polynomial functions formed from a linear combination of M known basis functions. Enhancement using multiple Kalman filters is performed, along with estimation of the speech gain contour and the noise model, when only the noisy signal is available.
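For illustration, here is a minimal sketch of a nonstationary AR model whose coefficients are a linear combination of M known basis functions (a polynomial basis is assumed), driven by white noise with a time-varying gain such as the one the gain-adaptation step would estimate. The orders, the basis, and the coefficient values are arbitrary choices for the sketch, not the paper's.

```python
import numpy as np

P, M, T = 2, 3, 200                          # AR order, basis size, frame length
t = np.linspace(0.0, 1.0, T)
basis = np.stack([t ** m for m in range(M)])        # f_m(t) = t^m, shape (M, T)
coeff = np.array([[ 0.5,  0.3, -0.2],               # c[k, m] defining a_k(t)
                  [-0.3, -0.1,  0.1]])
a = coeff @ basis                                   # time-varying AR coefficients a[k, t]

# Synthesize a nonstationary AR signal driven by white noise with a
# time-varying gain g(t).
rng = np.random.default_rng(0)
g = 1.0 + 0.5 * t
e = rng.standard_normal(T)
x = np.zeros(T)
for n in range(P, T):
    x[n] = a[:, n] @ x[n - P:n][::-1] + g[n] * e[n]
```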
SP-14.5
Database and Online Adaptation for Improved Speech Recognition in Car Environments
Alexander Fischer (Philips Research Laboratories, Aachen, Germany),
Volker Stahl (Philips Research Laboratories Aachen, Germany)
Data collection in the car environment requires much more effort, in terms of cost and time, than in the telephone or office environment. We therefore apply supervised database adaptation from the telephone environment to the car environment to allow quick setup of car-environment recognizers. A further reduction in word error rate is obtained by unsupervised online adaptation during recognition. We investigate the common techniques MLLR and MAP for this purpose. We give results on command word recognition in the car environment for all combinations of database and online adaptation in task-dependent and task-independent scenarios. The results demonstrate that speech recognizers for the car environment can be set up from telephone data and a limited amount of adaptation material from the car environment.
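To make the two adaptation techniques concrete, the sketch below shows a textbook MAP update of a Gaussian mean and a heavily simplified global MLLR-style affine transform estimated by occupancy-weighted least squares with identity covariances. The prior weight tau, the statistics, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def map_mean(mu_prior, gamma_sum, x_sum, tau=10.0):
    """MAP update of a Gaussian mean: interpolate the prior mean and the
    adaptation-data mean, weighted by the occupancy count and tau."""
    return (tau * mu_prior + x_sum) / (tau + gamma_sum)

def mllr_adapt_means(mus, data_means, gammas):
    """Estimate one global affine transform W by occupancy-weighted least
    squares (identity covariances assumed for brevity) and return the
    adapted means W @ [1, mu] for every Gaussian."""
    X = np.hstack([np.ones((len(mus), 1)), mus])    # extended means [1, mu]
    w = np.sqrt(gammas)[:, None]
    Wt, *_ = np.linalg.lstsq(w * X, w * data_means, rcond=None)
    return X @ Wt

# Example with random statistics for 20 Gaussians in a 13-dimensional space.
rng = np.random.default_rng(0)
mus = rng.normal(size=(20, 13))
data_means = mus + 0.3                              # pretend environment shift
gammas = rng.uniform(1.0, 50.0, size=20)
adapted = mllr_adapt_means(mus, data_means, gammas)
```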
SP-14.6
Training of HMM with Filtered Speech Material for Hands-free Recognition
Diego Giuliani,
Marco Matassoni,
Maurizio Omologo,
Piergiorgio Svaizer (ITC-IRST Centro per la Ricerca Scientifica e Tecnologica)
This paper addresses the problem of hands-free speech recognition in a noisy office environment. An array of six omnidirectional microphones and a corresponding time-delay compensation module are used to provide a beamformed signal as input to an HMM-based recognizer. Training of the HMMs is performed either on a clean speech database or on a filtered version of the same database. The filtering consists of convolution with the acoustic impulse response between the speaker and the microphone, to reproduce the reverberation effect; background noise is then added to obtain the desired SNR. The paper shows that the new models trained on these data perform better than the baseline ones. Furthermore, the paper investigates MLLR adaptation of the new models. It is shown that a further performance improvement is obtained, reaching a 98.7% word recognition rate in a connected digit recognition task when the talker is 1.5 m from the array.
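A minimal sketch of the training-data contamination step described above: convolve clean speech with an acoustic impulse response and add background noise scaled to a desired SNR. The function name and the assumption that the impulse-response and noise recordings are already available as arrays of matching sample rate are mine, not the authors'.

```python
import numpy as np
from scipy.signal import fftconvolve

def contaminate(clean, impulse_response, noise, snr_db):
    """Reverberate clean speech and add noise at the requested SNR.
    Assumes `noise` is at least as long as `clean`."""
    reverberant = fftconvolve(clean, impulse_response)[:len(clean)]
    # Scale the noise so that 10*log10(P_speech / P_noise) = snr_db.
    p_speech = np.mean(reverberant ** 2)
    p_noise = np.mean(noise[:len(reverberant)] ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise[:len(reverberant)]
```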
SP-14.7
Incremental Enrollment of Speech Recognizers
Chafic E Mokbel (France Telecom - CNET - DIH/DIPS (Currently at IDIAP)),
Olivier Collin (France Telecom - CNET - DIH/DIPS)
Classical adaptation approaches generally allow a reliably trained model to match a particular condition. In this paper, we define an incremental version of the segmental EM algorithm. This method makes it possible to incrementally enrich a model first trained on a limited amount of data. Memory-resource constraints allow only the statistics of the initial data to be stored. The proposed method uses these statistics by fixing, within the segmental EM algorithm applied to both the initial and the new data, the optimal paths in the model for the initial data. We prove theoretically that this is equivalent to segmental MAP adaptation with a specific choice of priors. In experiments on two speaker-dependent telephone databases, the approach allowed new conditions of use to be integrated incrementally. The performance was slightly lower than that obtained with classical training over all the data. As expected from the MAP interpretation of the algorithm, the characteristics of the initial data largely influence the evolution of the model.
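To illustrate the idea of keeping only the initial data's statistics with its alignment frozen, the sketch below accumulates per-state sufficient statistics and re-estimates a diagonal Gaussian from the stored initial statistics plus newly aligned frames. The state structure, the alignment step itself, and the function names are assumptions made for the sketch, not the authors' code.

```python
import numpy as np

def accumulate(frames):
    """Sufficient statistics (count, sum, sum of squares) of the frames
    assigned to one state by the segmental (Viterbi) alignment."""
    frames = np.asarray(frames, dtype=float)
    return len(frames), frames.sum(axis=0), (frames ** 2).sum(axis=0)

def reestimate(initial_stats, new_frames):
    """Add the statistics of newly aligned frames to the stored initial
    statistics and return the updated mean and diagonal variance."""
    n0, s0, ss0 = initial_stats
    n1, s1, ss1 = accumulate(new_frames)
    n, s, ss = n0 + n1, s0 + s1, ss0 + ss1
    mean = s / n
    var = ss / n - mean ** 2
    return mean, var
```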
SP-14.8
Automatic Speech Recognition: A Communication Perspective
Bishnu S Atal (AT&T Labs, Florham Park, NJ 07932, USA)
Speech recognition is usually regarded as a problem in the field of pattern recognition, where one first estimates the probability density function of each pattern to be recognized and then uses Bayes' theorem to identify the pattern that provides the highest likelihood for the observed speech data. In this paper, we will take a different approach to this problem. In speech recognition, the goal is the communication of information by voice, and we will discuss the basics of speech recognition from a communication perspective. The speech signal at the acoustic level has a bit rate of 64 kb/s, but the underlying sound patterns have an information rate of less than 100 b/s. What is the role of this high bit rate at the acoustic level? We will discuss the principles of decoding patterns that are submerged in an ocean of seemingly irrelevant information.