Abstract: Session SP-5
SP-5.1
Voice Recognition Focusing on Vowel Strings on a Fixed-Point 20-MIPS DSP Board
Yukikuni NISHIDA,
Yoshio NAKADAI,
Yoshitake SUZUKI (Nippon Telegraph and Telephone Corporation),
Toshihide KUROKAWA,
Hirokazu SATO (NTT Advanced Technology Corporation),
Tetsuma SAKURAI (Nippon Telegraph and Telephone Corporation)
This paper describes a smart recognition system that performs character matching by replacing speech parameters with the five Japanese vowels and a few consonant categories. The proposed algorithm performs speaker-independent voice recognition. It has an advantage over conventional speaker-independent word recognition systems in that it reduces the memory required for storing the reference templates and the instruction set to about 0.5% of that of a conventional algorithm, and it can run even on a low-speed processor. We implemented this recognition algorithm on a fixed-point, 20-MIPS digital signal processor board with a 9k x 16-bit on-chip RAM. Recognition experiments using 20 Japanese city names yielded 90.3% accuracy, which is sufficient for a voice control system.
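The abstract does not give the matching algorithm itself, but the idea can be illustrated with a small sketch: each utterance is reduced to a string over the five Japanese vowels plus a generic consonant class, and that string is compared to stored reference strings by edit distance. A minimal Python sketch, with invented frame labels and templates (the paper's actual categories and matcher may differ):

# Hypothetical sketch of vowel-string matching for small-vocabulary
# recognition. "C" marks any consonant category; labels and templates
# are invented for illustration.

def collapse(labels):
    """Collapse per-frame labels into a string with no repeated runs."""
    out = []
    for sym in labels:
        if not out or out[-1] != sym:
            out.append(sym)
    return "".join(out)

def edit_distance(a, b):
    """Standard Levenshtein distance between two symbol strings."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[len(b)]

# Reference templates: city names reduced to vowel/consonant strings.
TEMPLATES = {"oosaka": "oCaCa", "nagoya": "CaCoCa"}

def recognize(frame_labels):
    s = collapse(frame_labels)
    return min(TEMPLATES, key=lambda w: edit_distance(s, TEMPLATES[w]))

print(recognize(list("ooCCaaCCaa")))  # -> "oosaka"

Because the templates are short symbol strings rather than full acoustic models, the memory footprint stays tiny, which is consistent with the 0.5% figure claimed above.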
SP-5.2
Speech Interface VLSI for Car Applications
Makoto Shozakai (Asahi Chemical Industry Co., Ltd., LSI Labs.)
A user-friendly speech interface for car applications is much needed for safety reasons. This paper describes a speech interface VLSI designed for car environments, with speech recognition and speech compression/decompression functions. The chip has a heterogeneous architecture composed of ADC/DAC, DSP, RISC, hard-wired logic, and peripheral circuits. The DSP not only executes acoustic analysis and the output-probability calculation of HMMs for speech recognition, but also performs speech compression/decompression. The RISC, in turn, works as the CPU of the whole chip and as a Viterbi decoder with the aid of hard-wired logic. An algorithm is proposed for recognizing, in a seamless way, a mixed vocabulary of speaker-independent fixed words and speaker-dependent user-defined words. It is based on acoustic-event HMMs, which enable template creation from a single sample utterance. The proposed algorithm, embedded in the chip, is evaluated, and promising results for multiple languages are shown.
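As a rough illustration of the mixed-vocabulary idea, the sketch below (a hypothetical Python stand-in, not the chip's firmware) gives the factory fixed-word models and the user-enrolled templates the same scoring interface, so recognition is a single argmax over the merged list. The toy Gaussian scorer stands in for the acoustic-event HMM evaluation; from_one_utterance mimics template creation from a single enrollment sample.

# Hypothetical sketch of seamless mixed-vocabulary recognition.
class GaussianWordModel:
    """Toy word model: one diagonal Gaussian over the utterance's
    average feature vector (a stand-in for a real HMM)."""
    def __init__(self, word, mean, var=1.0):
        self.word, self.mean, self.var = word, mean, var

    @classmethod
    def from_one_utterance(cls, word, features):
        """Template creation from a single enrollment utterance."""
        mean = [sum(col) / len(features) for col in zip(*features)]
        return cls(word, mean)

    def log_score(self, features):
        avg = [sum(col) / len(features) for col in zip(*features)]
        return -sum((a - m) ** 2 / (2 * self.var)
                    for a, m in zip(avg, self.mean))

def recognize(features, fixed_models, user_models):
    """Fixed and user-defined words share one candidate list."""
    return max(fixed_models + user_models,
               key=lambda m: m.log_score(features)).word

fixed = [GaussianWordModel("stop", [0.0, 0.0]),
         GaussianWordModel("play", [1.0, 1.0])]
user = [GaussianWordModel.from_one_utterance("home", [[2.0, 2.1], [1.9, 2.0]])]
print(recognize([[2.0, 2.0]], fixed, user))  # -> "home"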
SP-5.3
Recognition of Elderly Speech and Voice-Driven Document Retrieval
Stephen W Anderson,
Natalie Liberman,
Erica Bernstein,
Stephen Foster,
Erin Cate,
Brenda Levin (Dragon Systems),
Randy Hudson (AverStar)
We have collected a corpus of 78 hours of speech from 297 elderly speakers, with an average age of 79. We find that acoustic models built from elderly speech provide much better recognition than do non-elderly models (42.1% vs. 54.6% WER). We also find that elderly men have substantially higher word error rates than elderly women (typically 14% absolute). We report on other experiments with this corpus, dividing the speakers by age, by gender, and by regional accent.
Using the resulting "elderly acoustic model", we built a document-retrieval program that can be operated by voice or typing. After usability tests with 110 speakers, we tested the final system on 37 elderly speakers. Each retrieved four documents from a database of 86,190 Boston Globe articles, two by typing and two by speech. We measured how quickly they retrieved each article and how much help they required. We find no difference between spoken and typed queries in either retrieval time or the amount of help required, regardless of age, gender, or computer experience. However, users perceive speech to be substantially faster, and overwhelmingly prefer speech to typing.
SP-5.4
A Comparison of Features for Speech, Music Discrimination
Michael J Carey,
Eluned S Parris,
Harvey Lloyd-Thomas (Ensigma Ltd)
Several approaches have previously been taken to the problem of discriminating between speech and music signals. These have used different features as the input to the classifier and have been trained and tested on different material. In this paper we examine the discrimination achieved by several different features using common training and test sets and the same classifier. The database assembled for these tests includes speech from thirteen languages and music from all over the world. In each case the distributions in the feature space were modelled by a Gaussian mixture model. Experiments were carried out on four types of feature: amplitude, cepstra, pitch and zero-crossings. In each case the derivative of the feature was also used and found to improve performance. The best performance came from the cepstra and delta cepstra, which gave an equal error rate (EER) of 1.2%. This was closely followed by normalised amplitude and delta amplitude, which, however, used a much less complex model. Pitch and delta pitch gave an EER of 4%, better than the zero-crossings, which produced an EER of 6%.
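The experimental setup lends itself to a short sketch: one Gaussian mixture per class over features plus their first-order derivatives, with classification by average per-frame log-likelihood ratio. The Python sketch below uses scikit-learn as an assumed stand-in for the authors' GMM trainer; feature extraction is not shown, and the synthetic data is purely for illustration.

# Minimal GMM speech/music discriminator over features plus deltas.
import numpy as np
from sklearn.mixture import GaussianMixture

def with_deltas(feats):
    """Append first-order differences, which the paper found to help."""
    deltas = np.diff(feats, axis=0, prepend=feats[:1])
    return np.hstack([feats, deltas])

def train(speech_feats, music_feats, n_components=8):
    gm_s = GaussianMixture(n_components, covariance_type="diag",
                           random_state=0).fit(with_deltas(speech_feats))
    gm_m = GaussianMixture(n_components, covariance_type="diag",
                           random_state=0).fit(with_deltas(music_feats))
    return gm_s, gm_m

def classify(feats, gm_s, gm_m, threshold=0.0):
    """Average log-likelihood ratio; sweeping the threshold traces the
    operating curve from which an EER can be read off."""
    x = with_deltas(feats)
    llr = gm_s.score_samples(x).mean() - gm_m.score_samples(x).mean()
    return ("speech" if llr > threshold else "music"), llr

rng = np.random.default_rng(0)                 # synthetic stand-in data
gm_s, gm_m = train(rng.normal(0, 1, (500, 12)), rng.normal(3, 1, (500, 12)))
print(classify(rng.normal(0, 1, (100, 12)), gm_s, gm_m)[0])  # -> "speech"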
SP-5.5
Recognizing Connected Digits in a Natural Spoken Dialog
Mazin G Rahim (AT&T Labs - Research)
This paper addresses the general problem of connected digit recognition
in the telecommunication environment. In particular, we focus on a
task of recognizing digits when embedded in a natural spoken dialog.
Two different design strategies are investigated: keyword detection or word spotting,
and large-vocabulary continuous speech recognition.
We will characterize the potential benefits and describe the main components of each
design method, including acoustic and language modeling, training and utterance verification.
Experimental results on a subset of a database that includes customers' responses to
the open-ended prompt ``How may I help you?'' are presented.
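The contrast between the two design strategies can be reduced to a toy sketch: a word spotter keeps a token only where its keyword score beats a filler (background) model by a margin, while an LVCSR-style system transcribes everything and the digits are then read out of the word sequence. All vocabularies and scores below are invented; the paper's systems use HMM acoustic models, language models, and utterance verification, none of which appear here.

# Toy contrast of word spotting vs. LVCSR-then-filter on symbolic tokens.
DIGITS = {"one", "two", "three"}

def word_spotting(tokens, keyword_score, filler_score, margin=0.5):
    """Keep a token as a digit only if its keyword score beats the
    filler (garbage) model by the required margin."""
    return [t for t in tokens
            if t in DIGITS and keyword_score(t) - filler_score(t) > margin]

def lvcsr_then_filter(tokens, vocabulary):
    """Decode everything with a large vocabulary, then extract digits
    from the resulting word sequence."""
    transcript = [t for t in tokens if t in vocabulary]  # toy "decoder"
    return [t for t in transcript if t in DIGITS]

tokens = "please charge it to three one two".split()
ks = lambda t: 2.0   # invented keyword score
fs = lambda t: 1.0   # invented filler score
print(word_spotting(tokens, ks, fs))                     # ['three', 'one', 'two']
print(lvcsr_then_filter(tokens, DIGITS | {"please", "charge", "it", "to"}))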
SP-5.6
Telephone Speech Recognition Using Neural Networks and Hidden Markov Models
DongSuk Yuk (Department of Computer Science, Rutgers University),
James L Flanagan (CAIP center, Rutgers University)
The performance of well-trained speech recognizers using high quality
full bandwidth speech data is usually degraded when used in real world
environments. In particular, telephone speech recognition is
extremely difficult due to the limited bandwidth of transmission
channels. In this paper, neural network based adaptation methods are
applied to telephone speech recognition and a new unsupervised model
adaptation method is proposed. The advantage of the neural network
based approach is that the retraining of speech recognizers for
telephone speech is avoided. Furthermore, because the multi-layer
neural network is able to compute nonlinear functions, it can
accommodate the nonlinear mapping between full-bandwidth speech
and telephone speech. The new unsupervised model adaptation method
does not require transcriptions and can be used with the neural
networks. Experimental results on TIMIT/NTIMIT corpora show that the
performance of the proposed methods is comparable to that of
recognizers retrained on telephone speech.
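A minimal numpy sketch of the feature-mapping idea follows: a small multi-layer network with a tanh hidden layer learns a nonlinear mapping from telephone-bandwidth features to full-bandwidth features, so the original recognizer stays untouched. Network size, training schedule, and the synthetic "stereo" data are all assumptions, and the paper's unsupervised variant (which needs no transcriptions) is not shown.

# Tiny MLP mapping telephone features to full-bandwidth features.
import numpy as np

rng = np.random.default_rng(0)
D, H = 12, 32                       # feature and hidden sizes (assumed)
W1 = rng.normal(0, 0.1, (D, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, D)); b2 = np.zeros(D)

def forward(x):
    h = np.tanh(x @ W1 + b1)        # nonlinearity enables nonlinear mapping
    return h, h @ W2 + b2

# Stand-in paired data: full-bandwidth frames and a distorted,
# band-limited version of the same frames (synthesized for illustration).
full = rng.normal(0, 1, (1000, D))
tel = np.tanh(0.5 * full) + 0.05 * rng.normal(0, 1, full.shape)

lr = 0.05
for step in range(1500):            # plain gradient descent on squared error
    h, pred = forward(tel)
    err = pred - full
    gW2 = h.T @ err / len(tel); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)
    gW1 = tel.T @ dh / len(tel); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

print(np.mean((forward(tel)[1] - full) ** 2))  # reconstruction error shrinks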
SP-5.7
Improving Speech Recognition Performance by Using Multi-Model Approaches
Ji Ming,
Philip Hanna,
Darryl Stewart,
Marie Owens,
F. Jack Smith (The Queen's University of Belfast)
Most current speech recognition systems are built upon a single type of model, e.g. an HMM or a certain type of segment-based model, and typically employ only one type of acoustic feature, e.g. MFCCs and their variants. As a consequence, the system may not be robust should the modeling assumptions be violated. Recent research efforts have investigated the use of multi-scale/multi-band acoustic features for robust speech recognition. This paper describes a multi-model approach as an alternative and complement to the multi-feature approaches. The multi-model approach seeks a combination of different types of acoustic model, thereby integrating the capabilities of each individual model for capturing discriminative information. An example system built upon the combination of the standard HMM technique with a segment-based modeling technique was implemented. Experiments on both isolated-word and continuous speech recognition have shown improved performance over each of the individual models considered in isolation.
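The combination step can be pictured with a short sketch: each hypothesis is rescored by a weighted (log-linear) combination of the scores from two model families, and the decoder takes the argmax. The equal weighting and the toy scores below are assumptions; the abstract does not specify the combination rule.

# Log-linear combination of two acoustic model families' scores.
def combined_score(hyp, hmm_score, seg_score, alpha=0.5):
    """Weighted combination of an HMM score and a segment-model score."""
    return alpha * hmm_score(hyp) + (1 - alpha) * seg_score(hyp)

def recognize(hypotheses, hmm_score, seg_score, alpha=0.5):
    return max(hypotheses,
               key=lambda h: combined_score(h, hmm_score, seg_score, alpha))

# Toy usage with invented scores: the two models rank the candidates
# differently, and the combination picks the jointly best hypothesis.
hmm = {"cat": -10.0, "cart": -9.5, "card": -12.0}.__getitem__
seg = {"cat": -8.0, "cart": -11.0, "card": -9.0}.__getitem__
print(recognize(["cat", "cart", "card"], hmm, seg))  # -> "cat"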
SP-5.8
Speaker-Dependent Name Dialing in a Car Environment with Out-of-Vocabulary Rejection
C. S Ramalingam (Texas Instruments, Inc.),
Yifan Gong,
Lorin P Netsch,
Wallace W Anderson,
John J Godfrey,
Yu-Hung Kao
In this paper we describe a system for name dialing in
the car and present results under three driving conditions
using real-life data. The names are enrolled in the parked
car condition (engine off) and we describe two approaches
for endpointing them---energy-based and recognition-based
schemes---which result in word-based and phone-based models,
respectively. We outline a simple algorithm to reject
out-of-vocabulary names. Parallel model combination (PMC) is used for noise
compensation. When tested on an internally collected
twenty-speaker database, for a list size of 50 and a
hand-held microphone, the performance averaged over all
driving conditions and speakers was 98%/92% (in-vocabulary accuracy / out-of-vocabulary rejection); for the hands-free data, it was 98%/80%.
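Two pieces of the system invite small sketches: energy-based endpointing (first and last frame above an energy threshold) and out-of-vocabulary rejection. The margin test below, in which the best enrolled name must beat the runner-up by a fixed margin, is an assumed reading of the "simple algorithm" mentioned in the abstract, not the authors' actual rule.

# Energy-based endpointing and a margin-based OOV rejection sketch.
def energy_endpoints(frame_energies, threshold):
    """Energy-based endpointing: first and last frame above threshold."""
    above = [i for i, e in enumerate(frame_energies) if e > threshold]
    return (above[0], above[-1]) if above else None

def recognize_with_rejection(scores, margin=2.0):
    """scores: dict mapping enrolled name -> log-likelihood of the
    utterance under that name's model. Returns the best name, or None
    to reject the utterance as out-of-vocabulary."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    return best if s1 - s2 >= margin else None

print(energy_endpoints([0.1, 0.2, 5.0, 6.0, 0.3], threshold=1.0))  # (2, 3)
print(recognize_with_rejection({"alice": -40.0, "bob": -55.0}))    # 'alice'
print(recognize_with_rejection({"alice": -40.0, "bob": -41.0}))    # None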
SP-5.9
A New Method Used in HMM for Modeling Frame Correlation
Qing Guo,
Fang Zheng,
Jian Wu,
Wenhu Wu (Speech Laboratory, Department of Computer Science and Technology, Tsinghua University)
In this paper we present a novel method to incorporate temporal correlation into a speech recognition system based on the conventional hidden Markov model (HMM). In our new model, the probability of the current observation depends not only on the current state but also on the previous state and the previous observation. The joint conditional probability density (PD) is approximated by a non-linear estimation method. As a result, we can still use a Gaussian mixture density to represent the joint conditional PD, on the principle that any PD can be approximated by a Gaussian mixture density. The HMM that incorporates temporal correlation through this non-linear estimation method, which we call the FC HMM, needs no additional parameters and introduces only a small amount of additional computation. Experimental results show that the top-1 recognition rate of the FC HMM is raised by 6 percent compared to the conventional HMM method.
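The conditioning structure can be sketched as follows: the emission probability becomes p(o_t | s_t, s_{t-1}, o_{t-1}) instead of p(o_t | s_t). The Python sketch below realizes one possible parameter-free correction, shifting the evaluation point by how far the previous observation sat from the previous state's mean; the paper's actual non-linear estimator is not given in the abstract, so this is illustration only.

# One possible parameter-free realization of a frame-correlated emission.
import math

def gm_pdf(x, weights, means, variances):
    """Evaluate a 1-D Gaussian mixture density at x."""
    return sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
               for w, m, v in zip(weights, means, variances))

def gm_mean(model):
    weights, means, _ = model
    return sum(w * m for w, m in zip(weights, means))

def fc_emission(o_t, o_prev, state, state_prev, models, rho=0.5):
    """Approximate p(o_t | s_t, s_{t-1}, o_{t-1}) by evaluating the
    current state's mixture at an input corrected by the previous
    observation's deviation from the previous state's mean. With
    rho = 0 this reduces to the conventional HMM emission b_j(o_t)."""
    weights, means, variances = models[state]
    x = o_t - rho * (o_prev - gm_mean(models[state_prev]))
    return gm_pdf(x, weights, means, variances)

models = {
    "s1": ([0.6, 0.4], [0.0, 1.0], [1.0, 1.0]),   # invented mixtures
    "s2": ([1.0], [2.0], [0.5]),
}
print(fc_emission(0.3, 1.8, "s1", "s2", models))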
SP-5.10
N-Best Based Supervised and Unsupervised Adaptation for Native and Non-Native Speakers in Cars
Patrick Nguyen,
Philippe Gelin,
Jean-Claude Junqua (Panasonic Technologies Inc., Speech Technology Laboratory, Santa Barbara, California),
Jen-Tzung Chien (Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan)
In this paper, a new set of techniques exploiting N-best
hypotheses in supervised and unsupervised adaptation are
presented. These techniques combine statistics extracted from
the N-best hypotheses with a weight derived from a likelihood
ratio confidence measure. In the case of supervised adaptation
the knowledge of the correct string is used to perform N-best
based corrective adaptation. Experiments on continuous letter recognition recorded in a car environment show that
weighting N-best sequences by a likelihood ratio confidence
measure provides only marginal improvement as compared to
1-best unsupervised adaptation and N-best unsupervised
adaptation with equal weighting. However, an N-best based
supervised corrective adaptation method weighting correct
letters positively and incorrect letters negatively resulted in a 13%
decrease in the error rate as compared with supervised adaptation. The largest improvement was obtained for non-native speakers.
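The weighting scheme can be summarized in a short sketch: each N-best hypothesis contributes adaptation statistics in proportion to its confidence, and in the supervised corrective variant hypotheses matching the correct string count positively while errors count negatively. The mean-only statistics below are a stand-in for the paper's actual model update, which the abstract does not spell out.

# Confidence-weighted N-best adaptation statistics (illustrative only).
import numpy as np

def nbest_adapt_stats(nbest, correct=None):
    """nbest: list of (hypothesis, confidence, frame_features) tuples.
    Returns a weighted mean feature vector for updating a model."""
    num, den = 0.0, 0.0
    for hyp, conf, feats in nbest:
        w = conf
        if correct is not None:            # supervised corrective mode
            w = conf if hyp == correct else -conf
        num = num + w * np.mean(feats, axis=0)
        den += abs(w)
    return num / den if den else None

rng = np.random.default_rng(1)
nbest = [("A B C", 0.9, rng.normal(0, 1, (20, 3))),
         ("A D C", 0.4, rng.normal(0, 1, (20, 3)))]
print(nbest_adapt_stats(nbest))                   # unsupervised weighting
print(nbest_adapt_stats(nbest, correct="A B C"))  # corrective weighting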