Abstract: Session SP-5
SP-5.1
Voice Recognition Focusing on Vowel Strings on a Fixed-Point 20-MIPS DSP Board
Yukikuni NISHIDA,
Yoshio NAKADAI,
Yoshitake SUZUKI (Nippon Telegraph and Telephone Corporation),
Toshihide KUROKAWA,
Hirokazu SATO (NTT Advanced Technology Corporation),
Tetsuma SAKURAI (Nippon Telegraph and Telephone Corporation)
This paper describes a smart recognition system that performs character matching by replacing speech parameters with the five Japanese vowels and a few consonant categories. The proposed algorithm performs speaker-independent voice recognition. It has an advantage over conventional speaker-independent word recognition systems in that it reduces the memory required for storing the reference templates and the instruction set to about 0.5% of that of a conventional algorithm, and it can run even on a low-speed processor. We implemented this recognition algorithm on a fixed-point, 20-MIPS digital signal processor board with a 9k x 16-bit on-chip RAM. Recognition experiments using 20 Japanese city names yielded 90.3% accuracy, which is sufficient for a voice control system.
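The abstract does not give the matching algorithm itself, but the idea can be illustrated with a small sketch: each utterance is reduced to a string over the five Japanese vowels plus a generic consonant class, and that string is compared to stored reference strings by edit distance. A minimal Python sketch, with invented frame labels and templates (the paper's actual categories and matcher may differ):

# Hypothetical sketch of vowel-string matching for small-vocabulary
# recognition. "C" marks any consonant category; labels and templates
# are invented for illustration.

def collapse(labels):
    """Collapse per-frame labels into a string with no repeated runs."""
    out = []
    for sym in labels:
        if not out or out[-1] != sym:
            out.append(sym)
    return "".join(out)

def edit_distance(a, b):
    """Standard Levenshtein distance between two symbol strings."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[len(b)]

# Reference templates: city names reduced to vowel/consonant strings.
TEMPLATES = {"oosaka": "oCaCa", "nagoya": "CaCoCa"}

def recognize(frame_labels):
    s = collapse(frame_labels)
    return min(TEMPLATES, key=lambda w: edit_distance(s, TEMPLATES[w]))

print(recognize(list("ooCCaaCCaa")))  # -> "oosaka"

Because the templates are short symbol strings rather than full acoustic models, the memory footprint stays tiny, which is consistent with the 0.5% figure claimed above.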
SP-5.2
Speech Interface VLSI for Car Applications
Makoto Shozakai (Asahi Chemical Industry Co., Ltd., LSI Labs.)
A user-friendly speech interface for car applications is much needed for safety reasons. This paper describes a speech interface VLSI designed for car environments, with speech recognition and speech compression/decompression functions. The chip has a heterogeneous architecture composed of ADC/DAC, DSP, RISC, hard-wired logic, and peripheral circuits. The DSP not only executes acoustic analysis and the output-probability calculation of HMMs for speech recognition, but also performs speech compression/decompression. The RISC, in turn, works as the CPU of the whole chip and as a Viterbi decoder with the aid of hard-wired logic. An algorithm is proposed for recognizing, in a seamless way, a mixed vocabulary of speaker-independent fixed words and speaker-dependent user-defined words. It is based on acoustic-event HMMs, which enable template creation from a single sample utterance. The proposed algorithm, embedded in the chip, is evaluated, and promising results for multiple languages are shown.
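As a rough illustration of the mixed-vocabulary idea, the sketch below (a hypothetical Python stand-in, not the chip's firmware) gives the factory fixed-word models and the user-enrolled templates the same scoring interface, so recognition is a single argmax over the merged list. The toy Gaussian scorer stands in for the acoustic-event HMM evaluation; from_one_utterance mimics template creation from a single enrollment sample.

# Hypothetical sketch of seamless mixed-vocabulary recognition.
class GaussianWordModel:
    """Toy word model: one diagonal Gaussian over the utterance's
    average feature vector (a stand-in for a real HMM)."""
    def __init__(self, word, mean, var=1.0):
        self.word, self.mean, self.var = word, mean, var

    @classmethod
    def from_one_utterance(cls, word, features):
        """Template creation from a single enrollment utterance."""
        mean = [sum(col) / len(features) for col in zip(*features)]
        return cls(word, mean)

    def log_score(self, features):
        avg = [sum(col) / len(features) for col in zip(*features)]
        return -sum((a - m) ** 2 / (2 * self.var)
                    for a, m in zip(avg, self.mean))

def recognize(features, fixed_models, user_models):
    """Fixed and user-defined words share one candidate list."""
    return max(fixed_models + user_models,
               key=lambda m: m.log_score(features)).word

fixed = [GaussianWordModel("stop", [0.0, 0.0]),
         GaussianWordModel("play", [1.0, 1.0])]
user = [GaussianWordModel.from_one_utterance("home", [[2.0, 2.1], [1.9, 2.0]])]
print(recognize([[2.0, 2.0]], fixed, user))  # -> "home"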
SP-5.3
Recognition of Elderly Speech and Voice-Driven Document Retrieval
Stephen W Anderson,
Natalie Liberman,
Erica Bernstein,
Stephen Foster,
Erin Cate,
Brenda Levin (Dragon Systems),
Randy Hudson (AverStar)
We have collected a corpus of 78 hours of speech from 297 elderly speakers, with an average age of 79. We find that acoustic models built from elderly speech provide much better recognition than do non-elderly models (42.1% vs. 54.6% WER). We also find that elderly men have substantially higher word error rates than elderly women (typically 14% absolute). We report on other experiments with this corpus, dividing the speakers by age, by gender, and by regional accent.
Using the resulting "elderly acoustic model", we built a document-retrieval program that can be operated by voice or typing. After usability tests with 110 speakers, we tested the final system on 37 elderly speakers. Each retrieved four documents from a database of 86,190 Boston Globe articles, two by typing and two by speech. We measured how quickly they retrieved each article and how much help they required. We find no difference between spoken and typed queries in either retrieval time or the amount of help required, regardless of age, gender, or computer experience. However, users perceive speech to be substantially faster, and overwhelmingly prefer speech to typing.
SP-5.4
A Comparison of Features for Speech, Music Discrimination
Michael J Carey,
Eluned S Parris,
Harvey Lloyd-Thomas (Ensigma Ltd)
Several approaches have previously been taken to the problem of discriminating between speech and music signals. These have used different features as the input to the classifier and have been trained and tested on different material. In this paper we examine the discrimination achieved by several different features using common training and test sets and the same classifier. The database assembled for these tests includes speech from thirteen languages and music from all over the world. In each case the distributions in the feature space were modelled by a Gaussian mixture model. Experiments were carried out on four types of feature: amplitude, cepstra, pitch and zero-crossings. In each case the derivative of the feature was also used and found to improve performance. The best performance came from the cepstra and delta cepstra, which gave an equal error rate (EER) of 1.2%. This was closely followed by normalised amplitude and delta amplitude, which, however, used a much less complex model. Pitch and delta pitch gave an EER of 4%, better than the zero-crossings, which produced an EER of 6%.
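The experimental setup lends itself to a short sketch: one Gaussian mixture per class over features plus their first-order derivatives, with classification by average per-frame log-likelihood ratio. The Python sketch below uses scikit-learn as an assumed stand-in for the authors' GMM trainer; feature extraction is not shown, and the synthetic data is purely for illustration.

# Minimal GMM speech/music discriminator over features plus deltas.
import numpy as np
from sklearn.mixture import GaussianMixture

def with_deltas(feats):
    """Append first-order differences, which the paper found to help."""
    deltas = np.diff(feats, axis=0, prepend=feats[:1])
    return np.hstack([feats, deltas])

def train(speech_feats, music_feats, n_components=8):
    gm_s = GaussianMixture(n_components, covariance_type="diag",
                           random_state=0).fit(with_deltas(speech_feats))
    gm_m = GaussianMixture(n_components, covariance_type="diag",
                           random_state=0).fit(with_deltas(music_feats))
    return gm_s, gm_m

def classify(feats, gm_s, gm_m, threshold=0.0):
    """Average log-likelihood ratio; sweeping the threshold traces the
    operating curve from which an EER can be read off."""
    x = with_deltas(feats)
    llr = gm_s.score_samples(x).mean() - gm_m.score_samples(x).mean()
    return ("speech" if llr > threshold else "music"), llr

rng = np.random.default_rng(0)                 # synthetic stand-in data
gm_s, gm_m = train(rng.normal(0, 1, (500, 12)), rng.normal(3, 1, (500, 12)))
print(classify(rng.normal(0, 1, (100, 12)), gm_s, gm_m)[0])  # -> "speech"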
SP-5.5
Recognizing Connected Digits in a Natural Spoken Dialog
Mazin G Rahim (AT&T Labs - Research)
This paper addresses the general problem of connected digit recognition
in the telecommunication environment. In particular, we focus on a
task of recognizing digits when embedded in a natural spoken dialog.
Two different design strategies are investigated: keyword detection or word spotting,
and large-vocabulary continuous speech recognition.
We will characterize the potential benefits and describe the main components of each
design method, including acoustic and language modeling, training and utterance verification.
Experimental results on a subset of a database that includes customers' responses to
the open-ended prompt ``How may I help you?'' are presented.
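The contrast between the two design strategies can be reduced to a toy sketch: a word spotter keeps a token only where its keyword score beats a filler (background) model by a margin, while an LVCSR-style system transcribes everything and the digits are then read out of the word sequence. All vocabularies and scores below are invented; the paper's systems use HMM acoustic models, language models, and utterance verification, none of which appear here.

# Toy contrast of word spotting vs. LVCSR-then-filter on symbolic tokens.
DIGITS = {"one", "two", "three"}

def word_spotting(tokens, keyword_score, filler_score, margin=0.5):
    """Keep a token as a digit only if its keyword score beats the
    filler (garbage) model by the required margin."""
    return [t for t in tokens
            if t in DIGITS and keyword_score(t) - filler_score(t) > margin]

def lvcsr_then_filter(tokens, vocabulary):
    """Decode everything with a large vocabulary, then extract digits
    from the resulting word sequence."""
    transcript = [t for t in tokens if t in vocabulary]  # toy "decoder"
    return [t for t in transcript if t in DIGITS]

tokens = "please charge it to three one two".split()
ks = lambda t: 2.0   # invented keyword score
fs = lambda t: 1.0   # invented filler score
print(word_spotting(tokens, ks, fs))                     # ['three', 'one', 'two']
print(lvcsr_then_filter(tokens, DIGITS | {"please", "charge", "it", "to"}))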
SP-5.6
Telephone Speech Recognition Using Neural Networks and Hidden Markov Models
DongSuk Yuk (Department of Computer Science, Rutgers University),
James L Flanagan (CAIP center, Rutgers University)
The performance of well-trained speech recognizers using high quality
full bandwidth speech data is usually degraded when used in real world
environments. In particular, telephone speech recognition is
extremely difficult due to the limited bandwidth of transmission
channels. In this paper, neural network based adaptation methods are
applied to telephone speech recognition and a new unsupervised model
adaptation method is proposed. The advantage of the neural network
based approach is that the retraining of speech recognizers for
telephone speech is avoided. Furthermore, because the multi-layer
neural network is able to compute nonlinear functions, it can
accommodate the nonlinear mapping between full-bandwidth speech
and telephone speech. The new unsupervised model adaptation method
does not require transcriptions and can be used with the neural
networks. Experimental results on TIMIT/NTIMIT corpora show that the
performance of the proposed methods is comparable to that of
recognizers retrained on telephone speech.
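A minimal numpy sketch of the feature-mapping idea follows: a small multi-layer network with a tanh hidden layer learns a nonlinear mapping from telephone-bandwidth features to full-bandwidth features, so the original recognizer stays untouched. Network size, training schedule, and the synthetic "stereo" data are all assumptions, and the paper's unsupervised variant (which needs no transcriptions) is not shown.

# Tiny MLP mapping telephone features to full-bandwidth features.
import numpy as np

rng = np.random.default_rng(0)
D, H = 12, 32                       # feature and hidden sizes (assumed)
W1 = rng.normal(0, 0.1, (D, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, D)); b2 = np.zeros(D)

def forward(x):
    h = np.tanh(x @ W1 + b1)        # nonlinearity enables nonlinear mapping
    return h, h @ W2 + b2

# Stand-in paired data: full-bandwidth frames and a distorted,
# band-limited version of the same frames (synthesized for illustration).
full = rng.normal(0, 1, (1000, D))
tel = np.tanh(0.5 * full) + 0.05 * rng.normal(0, 1, full.shape)

lr = 0.05
for step in range(1500):            # plain gradient descent on squared error
    h, pred = forward(tel)
    err = pred - full
    gW2 = h.T @ err / len(tel); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)
    gW1 = tel.T @ dh / len(tel); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

print(np.mean((forward(tel)[1] - full) ** 2))  # reconstruction error shrinks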
SP-5.7
Improving Speech Recognition Performance by Using Multi-Model Approaches
Ji Ming,
Philip Hanna,
Darryl Stewart,
Marie Owens,
F. Jack Smith (The Queen's University of Belfast)
Most current speech recognition systems are built upon a single type of model, e.g. an HMM or a certain type of segment-based model, and typically employ only one type of acoustic feature, e.g. MFCCs and their variants. As a consequence, the system may not be robust should the modeling assumptions be violated. Recent research efforts have investigated the use of multi-scale/multi-band acoustic features for robust speech recognition. This paper describes a multi-model approach as an alternative and complement to the multi-feature approaches. The multi-model approach seeks a combination of different types of acoustic model, thereby integrating the capabilities of each individual model for capturing discriminative information. An example system built upon the combination of the standard HMM technique with a segment-based modeling technique was implemented. Experiments on both isolated-word and continuous speech recognition have shown improved performance over each of the individual models considered in isolation.
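The combination step can be pictured with a short sketch: each hypothesis is rescored by a weighted (log-linear) combination of the scores from two model families, and the decoder takes the argmax. The equal weighting and the toy scores below are assumptions; the abstract does not specify the combination rule.

# Log-linear combination of two acoustic model families' scores.
def combined_score(hyp, hmm_score, seg_score, alpha=0.5):
    """Weighted combination of an HMM score and a segment-model score."""
    return alpha * hmm_score(hyp) + (1 - alpha) * seg_score(hyp)

def recognize(hypotheses, hmm_score, seg_score, alpha=0.5):
    return max(hypotheses,
               key=lambda h: combined_score(h, hmm_score, seg_score, alpha))

# Toy usage with invented scores: the two models rank the candidates
# differently, and the combination picks the jointly best hypothesis.
hmm = {"cat": -10.0, "cart": -9.5, "card": -12.0}.__getitem__
seg = {"cat": -8.0, "cart": -11.0, "card": -9.0}.__getitem__
print(recognize(["cat", "cart", "card"], hmm, seg))  # -> "cat"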
SP-5.8
Speaker-Dependent Name Dialing in a Car Environment with Out-of-Vocabulary Rejection
C. S Ramalingam (Texas Instruments, Inc.),
Yifan Gong,
Lorin P Netsch,
Wallace W Anderson,
John J Godfrey,
Yu-Hung Kao
In this paper we describe a system for name dialing in
the car and present results under three driving conditions
using real-life data. The names are enrolled in the parked
car condition (engine off) and we describe two approaches
for endpointing them---energy-based and recognition-based
schemes---which result in word-based and phone-based models,
respectively. We outline a simple algorithm to reject
out-of-vocabulary names. Parallel model combination (PMC) is used for noise
compensation. When tested on an internally collected
twenty-speaker database, for a list size of 50 and a
hand-held microphone, the performance averaged over all
driving conditions and speakers was 98%/92% (in-vocabulary accuracy / out-of-vocabulary rejection); for the hands-free data, it was 98%/80%.
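Two pieces of the system invite small sketches: energy-based endpointing (first and last frame above an energy threshold) and out-of-vocabulary rejection. The margin test below, in which the best enrolled name must beat the runner-up by a fixed margin, is an assumed reading of the "simple algorithm" mentioned in the abstract, not the authors' actual rule.

# Energy-based endpointing and a margin-based OOV rejection sketch.
def energy_endpoints(frame_energies, threshold):
    """Energy-based endpointing: first and last frame above threshold."""
    above = [i for i, e in enumerate(frame_energies) if e > threshold]
    return (above[0], above[-1]) if above else None

def recognize_with_rejection(scores, margin=2.0):
    """scores: dict mapping enrolled name -> log-likelihood of the
    utterance under that name's model. Returns the best name, or None
    to reject the utterance as out-of-vocabulary."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    return best if s1 - s2 >= margin else None

print(energy_endpoints([0.1, 0.2, 5.0, 6.0, 0.3], threshold=1.0))  # (2, 3)
print(recognize_with_rejection({"alice": -40.0, "bob": -55.0}))    # 'alice'
print(recognize_with_rejection({"alice": -40.0, "bob": -41.0}))    # None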
SP-5.9
A New Method Used in HMM for Modeling Frame Correlation
Qing Guo,
Fang Zheng,
Jian Wu,
Wenhu Wu (Speech Laboratory, Department of Computer Science and Technology, Tsinghua University)
In this paper we present a novel method to incorporate temporal correlation into a speech recognition system based on the conventional hidden Markov model (HMM). In our new model, the probability of the current observation depends not only on the current state but also on the previous state and the previous observation. The joint conditional probability density (PD) is approximated by a non-linear estimation method. As a result, we can still use a Gaussian mixture density to represent the joint conditional PD, on the principle that any PD can be approximated by a Gaussian mixture density. The HMM that incorporates temporal correlation through this non-linear estimation method, which we call the FC HMM, needs no additional parameters and introduces only a small amount of additional computation. Experimental results show that the top-1 recognition rate of the FC HMM is raised by 6 percent compared to the conventional HMM method.
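The conditioning structure can be sketched as follows: the emission probability becomes p(o_t | s_t, s_{t-1}, o_{t-1}) instead of p(o_t | s_t). The Python sketch below realizes one possible parameter-free correction, shifting the evaluation point by how far the previous observation sat from the previous state's mean; the paper's actual non-linear estimator is not given in the abstract, so this is illustration only.

# One possible parameter-free realization of a frame-correlated emission.
import math

def gm_pdf(x, weights, means, variances):
    """Evaluate a 1-D Gaussian mixture density at x."""
    return sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
               for w, m, v in zip(weights, means, variances))

def gm_mean(model):
    weights, means, _ = model
    return sum(w * m for w, m in zip(weights, means))

def fc_emission(o_t, o_prev, state, state_prev, models, rho=0.5):
    """Approximate p(o_t | s_t, s_{t-1}, o_{t-1}) by evaluating the
    current state's mixture at an input corrected by the previous
    observation's deviation from the previous state's mean. With
    rho = 0 this reduces to the conventional HMM emission b_j(o_t)."""
    weights, means, variances = models[state]
    x = o_t - rho * (o_prev - gm_mean(models[state_prev]))
    return gm_pdf(x, weights, means, variances)

models = {
    "s1": ([0.6, 0.4], [0.0, 1.0], [1.0, 1.0]),   # invented mixtures
    "s2": ([1.0], [2.0], [0.5]),
}
print(fc_emission(0.3, 1.8, "s1", "s2", models))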
SP-5.10
N-Best Based Supervised and Unsupervised Adaptation for Native and Non-Native Speakers in Cars
Patrick Nguyen,
Philippe Gelin,
Jean-Claude Junqua (Panasonic Technologies Inc., Speech Technology Laboratory, Santa Barbara, California),
Jen-Tzung Chien (Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan)
In this paper, a new set of techniques exploiting N-best
hypotheses in supervised and unsupervised adaptation are
presented. These techniques combine statistics extracted from
the N-best hypotheses with a weight derived from a likelihood
ratio confidence measure. In the case of supervised adaptation
the knowledge of the correct string is used to perform N-best
based corrective adaptation. Experiments on continuous letter recognition recorded in a car environment show that
weighting N-best sequences by a likelihood ratio confidence
measure provides only marginal improvement as compared to
1-best unsupervised adaptation and N-best unsupervised
adaptation with equal weighting. However, an N-best based
supervised corrective adaptation method weighting correct
letters positively and incorrect letters negatively resulted in a 13%
decrease in the error rate as compared with supervised adaptation. The largest improvement was obtained for non-native speakers.
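The weighting scheme can be summarized in a short sketch: each N-best hypothesis contributes adaptation statistics in proportion to its confidence, and in the supervised corrective variant hypotheses matching the correct string count positively while errors count negatively. The mean-only statistics below are a stand-in for the paper's actual model update, which the abstract does not spell out.

# Confidence-weighted N-best adaptation statistics (illustrative only).
import numpy as np

def nbest_adapt_stats(nbest, correct=None):
    """nbest: list of (hypothesis, confidence, frame_features) tuples.
    Returns a weighted mean feature vector for updating a model."""
    num, den = 0.0, 0.0
    for hyp, conf, feats in nbest:
        w = conf
        if correct is not None:            # supervised corrective mode
            w = conf if hyp == correct else -conf
        num = num + w * np.mean(feats, axis=0)
        den += abs(w)
    return num / den if den else None

rng = np.random.default_rng(1)
nbest = [("A B C", 0.9, rng.normal(0, 1, (20, 3))),
         ("A D C", 0.4, rng.normal(0, 1, (20, 3)))]
print(nbest_adapt_stats(nbest))                   # unsupervised weighting
print(nbest_adapt_stats(nbest, correct="A B C"))  # corrective weighting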