Authors:
Yukikuni Nishida,
Yoshio Nakadai,
Yoshitake Suzuki,
Toshihide Kurokawa,
Hirokazu Sato,
Tetsuma Sakurai,
Page (NA) Paper number 1957
Abstract:
This paper describes a smart recognition system which performs character
matching by replacing speech parameters with the five Japanese vowels
and a few consonant categories. The proposed algorithm enables speaker-independent
voice recognition. It has an advantage over conventional speaker-independent
word recognition systems in that it reduces the memory required for storing
the reference templates and the instruction set to about 0.5% of that of
the conventional algorithm, and can run even on a low-speed processor.
We implemented this recognition algorithm on a fixed-point, 20-MIPS digital
signal processor board with a 9k x 16-bit on-chip RAM. Recognition experiments
using 20 Japanese city names yielded 90.3% accuracy, which is sufficient
for a voice control system.
Authors:
Makoto Shozakai,
Page (NA) Paper number 1386
Abstract:
A user-friendly speech interface for car applications is much needed
for safety reasons. This paper describes a speech-interface VLSI
designed for car environments, with speech recognition and speech compression/decompression
functions. The chip has a heterogeneous architecture composed of an ADC/DAC,
a DSP, a RISC core, hard-wired logic, and peripheral circuits. The DSP not only
executes acoustic analysis and output-probability calculation of HMMs
for speech recognition, but also performs speech compression/decompression.
The RISC core, in turn, works as the CPU of the whole chip and as a Viterbi
decoder with the aid of hard-wired logic. An algorithm to recognize
a mixed vocabulary of speaker-independent fixed words and speaker-dependent
user-defined words in a seamless way is proposed. It is based on acoustic
event HMMs, which enable template creation from a single sample utterance.
The proposed algorithm embedded in the chip is evaluated. Promising
results of the algorithm for multiple languages are shown.
Authors:
Stephen W Anderson,
Natalie Liberman,
Erica Bernstein,
Stephen Foster,
Erin Cate,
Brenda Levin,
Randy Hudson,
Page (NA) Paper number 2060
Abstract:
We have collected a corpus of 78 hours of speech from 297 elderly speakers,
with an average age of 79. We find that acoustic models built from
elderly speech provide much better recognition than do non-elderly
models (42.1% vs. 54.6% WER). We also find that elderly men have substantially
higher word error rates than elderly women (typically 14% absolute).
We report on other experiments with this corpus, dividing the speakers
by age, by gender, and by regional accent. Using the resulting "elderly
acoustic model", we built a document-retrieval program that can be
operated by voice or typing. After usability tests with 110 speakers,
we tested the final system on 37 elderly speakers. Each retrieved 4
documents from a database of 86,190 Boston Globe articles, 2 by typing
and 2 by speech. We measured how quickly they retrieved each article,
and how much help they required. We find no difference between spoken
and typed queries in either retrieval times or in amount of help required,
regardless of age, gender, or computer experience. However, users perceive
speech to be substantially faster, and overwhelmingly prefer speech
to typing.
Authors:
Michael J Carey,
Eluned S Parris,
Harvey Lloyd-Thomas,
Page (NA) Paper number 1432
Abstract:
Several approaches have previously been taken to the problem of discriminating
between speech and music signals. These have used different features
as the input to the classifier and have been trained and tested on different
material. In this paper we examine the discrimination achieved by several
different features using common training and test sets and the same
classifier. The database assembled for these tests includes speech
from thirteen languages and music from all over the world. In each
case the distributions in the feature space were modelled by a Gaussian
mixture model. Experiments were carried out on four types of feature,
amplitude, cepstra, pitch and zero-crossings. In each case the derivative
of the feature was also used and found to improve performance. The
best performance resulted from using the cepstra and delta cepstra
which gave an equal error rate (EER) of 1.2%. This was closely followed
by normalised amplitude and delta amplitude, which, however, used a much
less complex model. Pitch and delta pitch gave an EER of 4%, which
was better than zero-crossings, which produced an EER of 6%.
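The classification scheme described above can be sketched as follows. This is a hypothetical, simplified illustration, not the paper's implementation: each class is modelled here by a single full-covariance Gaussian (a one-component special case of the paper's Gaussian mixture models), and the "cepstral" feature frames are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian, evaluated per row of x."""
    d = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.sum((diff @ np.linalg.inv(cov)) * diff, axis=1)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

class ClassModel:
    """One full-covariance Gaussian per class (one-component 'mixture')."""
    def fit(self, X):
        self.mean = X.mean(axis=0)
        self.cov = np.cov(X, rowvar=False)
        return self
    def score(self, X):
        # Average log-likelihood of the frames under this class model.
        return gaussian_logpdf(X, self.mean, self.cov).mean()

# Synthetic stand-ins for 13-dimensional cepstral frames of each class.
speech_train = rng.normal(0.0, 1.0, size=(500, 13))
music_train = rng.normal(3.0, 1.5, size=(500, 13))

speech_model = ClassModel().fit(speech_train)
music_model = ClassModel().fit(music_train)

def classify(frames):
    """Pick the class whose model gives the higher average log-likelihood."""
    if speech_model.score(frames) > music_model.score(frames):
        return "speech"
    return "music"

speech_test = rng.normal(0.0, 1.0, size=(60, 13))
music_test = rng.normal(3.0, 1.5, size=(60, 13))
```

In the paper's setup each class density would be a multi-component mixture fitted by EM, and the frame-level features would be real cepstra, amplitudes, pitch, or zero-crossings with their deltas.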
Authors:
Mazin G Rahim,
Page (NA) Paper number 2011
Abstract:
This paper addresses the general problem of connected digit recognition
in the telecommunication environment. In particular, we focus on a
task of recognizing digits when embedded in a natural spoken dialog.
Two different design strategies are investigated: keyword detection
or word spotting, and large-vocabulary continuous speech recognition.
We will characterize the potential benefits and describe the main components
of each design method, including acoustic and language modeling, training
and utterance verification. Experimental results on a subset of a database
that includes customers' responses to the open-ended prompt ``How may
I help you?'' are presented.
Authors:
Dong-Suk Yuk,
James L Flanagan,
Page (NA) Paper number 1872
Abstract:
The performance of well-trained speech recognizers using high quality
full bandwidth speech data is usually degraded when used in real world
environments. In particular, telephone speech recognition is extremely
difficult due to the limited bandwidth of transmission channels. In
this paper, neural network based adaptation methods are applied to
telephone speech recognition and a new unsupervised model adaptation
method is proposed. The advantage of the neural network based approach
is that the retraining of speech recognizers for telephone speech is
avoided. Furthermore, because the multi-layer neural network is able
to compute nonlinear functions, it can accommodate the nonlinear
mapping between full bandwidth speech and telephone speech. The new
unsupervised model adaptation method does not require transcriptions
and can be used with the neural networks. Experimental results on TIMIT/NTIMIT
corpora show that the performance of the proposed methods is comparable
to that of recognizers retrained on telephone speech.
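The key idea above, learning a nonlinear map from one feature space to another with a multi-layer network, can be sketched on toy data. This is a hypothetical illustration with manual backpropagation, not the paper's network or data: a one-hidden-layer network is trained by gradient descent to fit a synthetic nonlinear relation between "telephone" inputs and "full-bandwidth" targets.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: 1-D "telephone" features x and "full-bandwidth" targets y
# related by a nonlinear map the network must learn.
x = rng.uniform(-2, 2, size=(256, 1))
y = np.tanh(2 * x) + 0.3 * x**2

# One-hidden-layer network, trained with plain gradient descent (MSE loss).
W1 = rng.normal(0, 0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, size=(16, 1)); b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)          # hidden layer with tanh nonlinearity
    return h, h @ W2 + b2             # linear output layer

_, pred0 = forward(x)
loss_before = np.mean((pred0 - y) ** 2)

lr = 0.05
for _ in range(2000):
    h, pred = forward(x)
    g = 2 * (pred - y) / len(x)       # dL/dpred for the MSE loss
    gW2 = h.T @ g; gb2 = g.sum(axis=0)
    gh = (g @ W2.T) * (1 - h ** 2)    # backprop through tanh
    gW1 = x.T @ gh; gb1 = gh.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred1 = forward(x)
loss_after = np.mean((pred1 - y) ** 2)
```

In the paper's setting the inputs and outputs would be telephone-band and full-bandwidth acoustic feature vectors, so the recognizer's full-bandwidth models can be reused without retraining.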
Authors:
Ji Ming,
Philip Hanna,
Darryl Stewart,
Marie Ownes,
F. Jack Smith,
Page (NA) Paper number 1259
Abstract:
Most current speech recognition systems are built upon a single type
of model, e.g. an HMM or a certain type of segment-based model, and furthermore
typically employ only one type of acoustic feature, e.g. MFCCs and
their variants. This entails that the system may not be robust should
the modeling assumptions be violated. Recent research efforts have
investigated the use of multi-scale/multi-band acoustic features for
robust speech recognition. This paper describes a multi-model approach
as an alternative and complement to the multi-feature approaches. The
multi-model approach seeks a combination of different types of acoustic
model, thereby integrating the capabilities of each individual model
for capturing discriminative information. An example system built upon
the combination of the standard HMM technique with a segment-based
modeling technique was implemented. Experiments for both isolated-word
and continuous speech recognition have shown improved performance
over each of the individual models considered in isolation.
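One simple way to realize such a combination, sketched here as a hypothetical illustration (the paper does not specify this particular rule), is log-linear interpolation of the per-word scores produced by the two model types:

```python
def combined_score(hmm_ll, segment_ll, alpha=0.5):
    """Log-linear interpolation of HMM and segment-model log-likelihoods."""
    return alpha * hmm_ll + (1 - alpha) * segment_ll

def recognize(word_scores, alpha=0.5):
    """word_scores: {word: (hmm_ll, segment_ll)}. Return the best word."""
    return max(word_scores, key=lambda w: combined_score(*word_scores[w], alpha))

# Hypothetical per-word log-likelihoods from the two models.
scores = {"yes": (-8.0, -9.0), "no": (-11.0, -9.0), "stop": (-20.0, -25.0)}
```

The interpolation weight `alpha` (a name assumed here) would typically be tuned on held-out data; setting it to 1.0 or 0.0 recovers each individual model.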
Authors:
C.S. Ramalingam,
Yifan Gong,
Lorin P Netsch,
Wallace W Anderson,
John J. Godfrey,
Yu-Hung Kao,
Page (NA) Paper number 1780
Abstract:
In this paper we describe a system for name dialing in the car and
present results under three driving conditions using real-life data.
The names are enrolled in the parked car condition (engine off) and
we describe two approaches for endpointing them---energy-based and
recognition-based schemes---which result in word-based and phone-based
models, respectively. We outline a simple algorithm to reject out-of-vocabulary
names. PMC is used for noise compensation. When tested on an internally
collected twenty-speaker database, for a list size of 50 and a hand-held
microphone, the performance averaged over all driving conditions and
speakers was 98%/92% (IV accuracy/OOV rejection); for the hands-free
data, it was 98%/80%.
Authors:
Qing Guo,
Fang Zheng,
Jian Wu,
Wenhu Wu,
Page (NA) Paper number 1172
Abstract:
In this paper we present a novel method to incorporate temporal correlation
into a speech recognition system based on conventional hidden Markov
model (HMM). In our new model the probability of the current observation
depends not only on the current state but also on the previous
state and the previous observation. The joint conditional PD is approximated
by a non-linear estimation method; as a result, we can still represent
it with a Gaussian mixture density, on the principle that any PD can
be approximated by a Gaussian mixture density. The HMM that incorporates
temporal correlation through this non-linear estimation, which we call
the FC HMM, needs no additional parameters and adds only a small amount
of computation. Experimental results show that the top-1 recognition
rate of the FC HMM is raised by 6 percent compared to the conventional
HMM method.
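The dependency structure described above can be written out explicitly. The notation below is assumed for illustration and is not taken from the paper:

```latex
% Conventional HMM: the observation depends only on the current state
b_j(o_t) = p(o_t \mid q_t = j)

% FC HMM: condition also on the previous state and previous observation,
% with the joint conditional PD approximated by a Gaussian mixture
b_{ij}(o_t \mid o_{t-1}) = p(o_t \mid q_t = j,\, q_{t-1} = i,\, o_{t-1})
  \approx \sum_{m=1}^{M} c_{ijm}\,
    \mathcal{N}\!\bigl(o_t;\, \mu_{ijm}(o_{t-1}),\, \Sigma_{ijm}\bigr)
```

Here the component means are allowed to depend on the previous observation through the non-linear estimation, which is how temporal correlation enters without enlarging the parameter set.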
Authors:
Patrick Nguyen,
Philippe Gelin,
Jean-Claude Junqua,
Jen-Tzung Chien, Dept. of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan,
Page (NA) Paper number 2084
Abstract:
In this paper, a new set of techniques exploiting N-best hypotheses
in supervised and unsupervised adaptation are presented. These techniques
combine statistics extracted from the N-best hypotheses with a weight
derived from a likelihood ratio confidence measure. In the case of
supervised adaptation the knowledge of the correct string is used to
perform N-best based corrective adaptation. Experiments run for continuous
letter recognition recorded in a car environment show that weighting
N-best sequences by a likelihood ratio confidence measure provides
only marginal improvement as compared to 1-best unsupervised adaptation
and N-best unsupervised adaptation with equal weighting. However, an
N-best based supervised corrective adaptation method weighting correct
letters positively and incorrect letters negatively resulted in a 13%
decrease of the error rate as compared with supervised adaptation.
The largest improvement was obtained for non-native speakers.
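The weighting scheme described above can be sketched in a simplified form. This is a hypothetical illustration, not the paper's algorithm: each N-best hypothesis gets a likelihood-ratio confidence weight relative to the best hypothesis, and a single Gaussian mean is then updated from the confidence-weighted statistics in a MAP-style interpolation (the names `nbest_weights`, `adapt_mean`, and `tau` are assumptions for this sketch).

```python
import numpy as np

def nbest_weights(log_likelihoods):
    """Likelihood-ratio confidence weights relative to the best hypothesis."""
    ll = np.asarray(log_likelihoods, dtype=float)
    ratios = np.exp(ll - ll.max())    # ratio of each hypothesis to the best
    return ratios / ratios.sum()

def adapt_mean(prior_mean, nbest_frames, log_likelihoods, tau=10.0):
    """MAP-style mean update from confidence-weighted N-best statistics.

    nbest_frames: one (num_frames, dim) array of feature frames per hypothesis.
    tau: prior weight controlling how far the mean moves toward the data.
    """
    w = nbest_weights(log_likelihoods)
    count = sum(wi * len(f) for wi, f in zip(w, nbest_frames))
    acc = sum(wi * f.sum(axis=0) for wi, f in zip(w, nbest_frames))
    return (tau * prior_mean + acc) / (tau + count)

# Toy example: both hypotheses align frames near 1.0, so the adapted mean
# moves from the prior (0) toward 1 but not all the way.
frames = [np.ones((20, 2)), np.full((20, 2), 0.9)]
new_mean = adapt_mean(np.zeros(2), frames, [-100.0, -105.0])
```

In the supervised corrective variant described in the abstract, frames belonging to correctly recognized letters would receive positive weights and frames from incorrect letters negative ones.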