ASR Systems and Applications


Voice Recognition Focusing On Vowel Strings On A Fixed-Point 20-MIPs DSP Board

Authors:

Yukikuni Nishida,
Yoshio Nakadai,
Yoshitake Suzuki,
Toshihide Kurokawa,
Hirokazu Sato,
Tetsuma Sakurai,

Page (NA) Paper number 1957

Abstract:

This paper describes a smart recognition system that performs character matching by replacing speech parameters with the five Japanese vowels and a few consonant categories. The proposed algorithm enables speaker-independent voice recognition. It has an advantage over conventional speaker-independent word recognition systems in that it reduces the memory required for storing the reference templates and the instruction set to about 0.5% of that of the conventional algorithm, and it can run even on a low-speed processor. We implemented this recognition algorithm on a fixed-point, 20-MIPS digital signal processor board with a 9k x 16-bit on-chip RAM. Recognition experiments using 20 Japanese city names achieved 90.3% accuracy, which is sufficient for a voice control system.
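The core idea, collapsing an utterance into a short string of vowel/consonant category labels and matching it against tiny reference strings, can be sketched as follows. The category set, template strings, and the edit-distance matcher are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of vowel-string recognition: per-frame category labels are
# collapsed into a string and matched against reference strings by edit
# distance. 'X' stands in for a consonant category; all names are made up.

def collapse(labels):
    """Collapse runs of repeated frame labels into a category string."""
    out = []
    for c in labels:
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)

def edit_distance(a, b):
    """Standard Levenshtein distance between two category strings."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def recognize(frame_labels, templates):
    """Return the template word whose category string is closest."""
    s = collapse(frame_labels)
    return min(templates, key=lambda w: edit_distance(s, templates[w]))

templates = {"tokyo": "oXo", "osaka": "oaXa"}
print(recognize(list("ooXXXoo"), templates))  # -> tokyo
```

Because only one short string per vocabulary word is stored instead of full acoustic templates, the memory footprint is tiny, which is consistent with the 0.5% figure claimed above.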

IC991957.PDF (From Author) IC991957.PDF (Rasterized)



Speech Interface VLSI for Car Applications

Authors:

Makoto Shozakai,

Page (NA) Paper number 1386

Abstract:

A user-friendly speech interface for car applications is highly desirable for safety reasons. This paper describes a speech interface VLSI designed for car environments, with speech recognition and speech compression/decompression functions. The chip has a heterogeneous architecture composed of ADC/DAC, DSP, RISC, hard-wired logic, and peripheral circuits. The DSP not only performs acoustic analysis and the output-probability calculation of HMMs for speech recognition, but also handles speech compression/decompression. The RISC serves as the CPU of the whole chip and as the Viterbi decoder, with the aid of hard-wired logic. An algorithm is proposed that recognizes, in a seamless way, a mixed vocabulary of speaker-independent fixed words and speaker-dependent user-defined words. It is based on acoustic-event HMMs, which enable template creation from a single sample utterance. The proposed algorithm, embedded in the chip, is evaluated, and promising results for multiple languages are shown.

IC991386.PDF (From Author) IC991386.PDF (Rasterized)



Recognition of Elderly Speech and Voice-Driven Document Retrieval

Authors:

Stephen W Anderson,
Natalie Liberman,
Erica Bernstein,
Stephen Foster,
Erin Cate,
Brenda Levin,
Randy Hudson,

Page (NA) Paper number 2060

Abstract:

We have collected a corpus of 78 hours of speech from 297 elderly speakers with an average age of 79. We find that acoustic models built from elderly speech provide much better recognition than non-elderly models do (42.1% vs. 54.6% WER). We also find that elderly men have substantially higher word error rates than elderly women (typically 14% absolute). We report on other experiments with this corpus, dividing the speakers by age, by gender, and by regional accent. Using the resulting "elderly acoustic model", we built a document-retrieval program that can be operated by voice or typing. After usability tests with 110 speakers, we tested the final system on 37 elderly speakers. Each retrieved 4 documents from a database of 86,190 Boston Globe articles, 2 by typing and 2 by speech. We measured how quickly they retrieved each article and how much help they required. We find no difference between spoken and typed queries in either retrieval time or amount of help required, regardless of age, gender, or computer experience. However, users perceive speech to be substantially faster, and overwhelmingly prefer speech to typing.

IC992060.PDF (From Author) IC992060.PDF (Rasterized)



A Comparison of Features for Speech, Music Discrimination

Authors:

Michael J Carey,
Eluned S Parris,
Harvey Lloyd-Thomas,

Page (NA) Paper number 1432

Abstract:

Several approaches have previously been taken to the problem of discriminating between speech and music signals. These have used different features as input to the classifier and have been trained and tested on different material. In this paper we examine the discrimination achieved by several different features using common training and test sets and the same classifier. The database assembled for these tests includes speech from thirteen languages and music from all over the world. In each case the distributions in the feature space were modelled by a Gaussian mixture model. Experiments were carried out on four types of feature: amplitude, cepstra, pitch and zero-crossings. In each case the derivative of the feature was also used and found to improve performance. The best performance resulted from using the cepstra and delta cepstra, which gave an equal error rate (EER) of 1.2%. This was closely followed by normalised amplitude and delta amplitude, which, however, used a much less complex model. The pitch and delta pitch gave an EER of 4%, better than the zero-crossings, which produced an EER of 6%.
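The classification step can be sketched as follows. For brevity, a single diagonal Gaussian per class stands in for the Gaussian mixture models used in the paper, and the synthetic features stand in for cepstra plus delta-cepstra; only the compare-the-likelihoods structure carries over.

```python
import numpy as np

# Minimal sketch of the classifier idea: model each class's feature
# distribution, then pick the class with the higher log-likelihood over a
# segment. A single diagonal Gaussian per class replaces the paper's GMMs.

def fit_diag_gaussian(X):
    """Estimate a diagonal-covariance Gaussian from frame features."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def loglik(X, mu, var):
    """Sum of per-frame diagonal-Gaussian log-densities."""
    return -0.5 * float(np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var))

rng = np.random.default_rng(0)
speech_train = rng.normal(0.0, 1.0, (500, 4))   # synthetic "speech" frames
music_train = rng.normal(2.0, 1.0, (500, 4))    # synthetic "music" frames

speech_model = fit_diag_gaussian(speech_train)
music_model = fit_diag_gaussian(music_train)

segment = rng.normal(0.0, 1.0, (50, 4))         # a speech-like test segment
score = loglik(segment, *speech_model) - loglik(segment, *music_model)
label = "speech" if score > 0 else "music"
print(label)  # -> speech
```

Sweeping a decision threshold on `score` rather than comparing against zero is what yields the EER operating point reported above.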

IC991432.PDF (From Author) IC991432.PDF (Rasterized)



Recognizing Connected Digits in a Natural Spoken Dialog

Authors:

Mazin G Rahim,

Page (NA) Paper number 2011

Abstract:

This paper addresses the general problem of connected digit recognition in the telecommunication environment. In particular, we focus on the task of recognizing digits embedded in a natural spoken dialog. Two different design strategies are investigated: keyword detection (word spotting), and large-vocabulary continuous speech recognition. We characterize the potential benefits and describe the main components of each design method, including acoustic and language modeling, training, and utterance verification. Experimental results are presented on a subset of a database that includes customers' responses to the open-ended prompt ``How may I help you?''

IC992011.PDF (From Author) IC992011.PDF (Rasterized)



Telephone Speech Recognition Using Neural Networks and Hidden Markov Models

Authors:

Dong-Suk Yuk,
James L Flanagan,

Page (NA) Paper number 1872

Abstract:

The performance of speech recognizers trained on high-quality, full-bandwidth speech data usually degrades when they are used in real-world environments. In particular, telephone speech recognition is extremely difficult due to the limited bandwidth of transmission channels. In this paper, neural-network-based adaptation methods are applied to telephone speech recognition, and a new unsupervised model adaptation method is proposed. The advantage of the neural network approach is that retraining the speech recognizer for telephone speech is avoided. Furthermore, because a multi-layer neural network can compute nonlinear functions, it can accommodate the nonlinear mapping between full-bandwidth speech and telephone speech. The new unsupervised model adaptation method does not require transcriptions and can be used with the neural networks. Experimental results on the TIMIT/NTIMIT corpora show that the performance of the proposed methods is comparable to that of recognizers retrained on telephone speech.
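The feature-mapping idea can be sketched as a small multi-layer network that learns a nonlinear map from band-limited features to full-bandwidth features, so the original recognizer need not be retrained. The data, layer sizes, and the "true" target map below are synthetic stand-ins, not the paper's setup.

```python
import numpy as np

# Sketch: a one-hidden-layer tanh network is fit by full-batch gradient
# descent to map "telephone" features onto "wideband" features.

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 8))             # synthetic "telephone" features
Y = np.tanh(X @ rng.normal(size=(8, 8)))   # synthetic nonlinear target map

W1 = rng.normal(scale=0.1, size=(8, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 8)); b2 = np.zeros(8)

def forward(X):
    H = np.tanh(X @ W1 + b1)               # hidden tanh layer
    return H, H @ W2 + b2                  # linear output layer

_, P = forward(X)
mse0 = float(np.mean((P - Y) ** 2))        # error before training

lr = 0.2
for _ in range(300):
    H, P = forward(X)
    G = 2 * (P - Y) / len(X)               # d(mse)/d(output)
    GH = (G @ W2.T) * (1 - H ** 2)         # back-prop through tanh
    W2 -= lr * H.T @ G;  b2 -= lr * G.sum(0)
    W1 -= lr * X.T @ GH; b1 -= lr * GH.sum(0)

_, P = forward(X)
mse1 = float(np.mean((P - Y) ** 2))        # error after training
print(mse1 < mse0)
```

In deployment, the trained network would sit between the telephone front end and the unchanged full-bandwidth recognizer.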

IC991872.PDF (From Author) IC991872.PDF (Scanned)



Improving Speech Recognition Performance By Using Multi-Model Approaches

Authors:

Ji Ming,
Philip Hanna,
Darryl Stewart,
Marie Ownes,
F. Jack Smith,

Page (NA) Paper number 1259

Abstract:

Most current speech recognition systems are built upon a single type of model, e.g., an HMM or a certain type of segment-based model, and typically employ only one type of acoustic feature, e.g., MFCCs and their variants. As a result, the system may not be robust should the modeling assumptions be violated. Recent research efforts have investigated the use of multi-scale/multi-band acoustic features for robust speech recognition. This paper describes a multi-model approach as an alternative and complement to the multi-feature approaches. The multi-model approach seeks a combination of different types of acoustic model, thereby integrating the capabilities of each individual model for capturing discriminative information. An example system built upon the combination of the standard HMM technique with a segment-based modeling technique was implemented. Experiments on both isolated-word and continuous speech recognition have shown improved performance over each of the individual models considered in isolation.
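A minimal sketch of the combination step, assuming a simple log-linear weighting of the two models' per-word scores; the paper does not specify this exact rule, and all names and numbers are illustrative.

```python
# Sketch of multi-model combination: per-word log scores from two different
# acoustic models (say, an HMM and a segment-based model) are merged
# log-linearly, and the word with the best combined score wins.

def combine(hmm_scores, seg_scores, w=0.5):
    """Log-linear interpolation of two models' per-word log scores."""
    return {word: w * hmm_scores[word] + (1 - w) * seg_scores[word]
            for word in hmm_scores}

# Each model alone prefers a different word; the combination arbitrates.
hmm = {"yes": -10.0, "no": -12.0}
seg = {"yes": -15.0, "no": -9.0}
scores = combine(hmm, seg)
best = max(scores, key=scores.get)
print(best)  # -> no
```

The interpolation weight `w` would in practice be tuned on held-out data.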

IC991259.PDF (From Author) IC991259.PDF (Rasterized)



Speaker-Dependent Name Dialing in a Car Environment with Out-of-Vocabulary Rejection

Authors:

C.S. Ramalingam,
Yifan Gong,
Lorin P Netsch,
Wallace W Anderson,
John J. Godfrey,
Yu-Hung Kao,

Page (NA) Paper number 1780

Abstract:

In this paper we describe a system for name dialing in the car and present results under three driving conditions using real-life data. The names are enrolled in the parked car condition (engine off) and we describe two approaches for endpointing them---energy-based and recognition-based schemes---which result in word-based and phone-based models, respectively. We outline a simple algorithm to reject out-of-vocabulary names. PMC is used for noise compensation. When tested on an internally collected twenty-speaker database, for a list size of 50 and a hand-held microphone, the performance averaged over all driving conditions and speakers was 98%/92% (IV accuracy/OOV rejection); for the hands-free data, it was 98%/80%.

IC991780.PDF (From Author) IC991780.PDF (Rasterized)



A New Method Used in HMM for Modeling Frame Correlation

Authors:

Qing Guo,
Fang Zheng,
Jian Wu,
Wenhu Wu,

Page (NA) Paper number 1172

Abstract:

In this paper we present a novel method to incorporate temporal correlation into a speech recognition system based on the conventional hidden Markov model (HMM). In our new model, the probability of the current observation depends not only on the current state but also on the previous state and the previous observation. The joint conditional probability density (PD) is approximated by a non-linear estimation method. As a result, we can still use a mixture Gaussian density to represent the joint conditional PD, on the principle that any PD can be approximated by a mixture Gaussian density. The HMM that incorporates temporal correlation via this non-linear estimation method, which we call FC-HMM, needs no additional parameters and adds only a small computational cost. Experimental results show that the top-1 recognition rate of FC-HMM is raised by 6 percent compared to the conventional HMM method.
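The structural change can be sketched in a toy discrete forward pass: a conventional HMM uses an emission probability b(o_t | s_t), while the frame-correlated model conditions on the previous state and observation as well, p(o_t | s_t, s_{t-1}, o_{t-1}). The distributions below are illustrative, not taken from the paper.

```python
import numpy as np

# Toy frame-correlated HMM: 2 states, 2 discrete symbols. The emission
# table is indexed by (prev_state, prev_obs, state, obs), so the current
# observation depends on the previous state and observation too.

A = np.array([[0.7, 0.3],        # state transition matrix
              [0.4, 0.6]])
pi = np.array([0.5, 0.5])

emission = np.full((2, 2, 2, 2), 0.5)
emission[0, 0, 0] = [0.9, 0.1]   # correlation: a symbol tends to repeat
emission[0, 0, 1] = [0.6, 0.4]

def fc_forward(obs):
    """Forward pass; the previous observation is known at every step."""
    alpha = pi * 0.5             # flat p(o_0 | s_0) for this sketch
    for t in range(1, len(obs)):
        alpha = np.array([
            sum(alpha[sp] * A[sp, s] * emission[sp, obs[t - 1], s, obs[t]]
                for sp in range(2))
            for s in range(2)])
    return float(alpha.sum())

# A sequence whose frames repeat scores higher under the correlated model.
print(fc_forward([0, 0, 0]) > fc_forward([0, 1, 0]))  # -> True
```

Note that the claim of "no additional parameters" in the paper rests on approximating this joint conditional density rather than storing a full table like the toy one above.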

IC991172.PDF (From Author) IC991172.PDF (Rasterized)



N-Best Based Supervised and Unsupervised Adaptation for Native and Non-Native Speakers in Cars

Authors:

Patrick Nguyen,
Philippe Gelin,
Jean-Claude Junqua,
Jen-Tzung Chien, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan

Page (NA) Paper number 2084

Abstract:

In this paper, a new set of techniques exploiting N-best hypotheses in supervised and unsupervised adaptation is presented. These techniques combine statistics extracted from the N-best hypotheses with a weight derived from a likelihood-ratio confidence measure. In the case of supervised adaptation, knowledge of the correct string is used to perform N-best based corrective adaptation. Experiments on continuous letter recognition recorded in a car environment show that weighting N-best sequences by a likelihood-ratio confidence measure provides only marginal improvement compared to 1-best unsupervised adaptation and N-best unsupervised adaptation with equal weighting. However, an N-best based supervised corrective adaptation method that weights correct letters positively and incorrect letters negatively resulted in a 13% decrease in error rate compared with supervised adaptation. The largest improvement was obtained for non-native speakers.
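The weighting step can be sketched as follows: each of the N best hypotheses contributes adaptation statistics scaled by a weight derived from its likelihood-ratio score. The softmax-style weighting, the MAP-style mean update, and all numbers below are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

# Sketch of confidence-weighted N-best adaptation of a model mean.

def nbest_weights(log_liks, log_lik_background):
    """Turn likelihood-ratio scores of N hypotheses into weights."""
    ratios = np.asarray(log_liks) - log_lik_background
    w = np.exp(ratios - ratios.max())     # softmax over the ratios
    return w / w.sum()

def adapt_mean(old_mean, nbest_means, weights, tau=10.0):
    """MAP-style interpolation of the old mean with weighted N-best stats."""
    stat = np.sum(weights[:, None] * nbest_means, axis=0)
    n = float(np.sum(weights))
    return (tau * old_mean + stat) / (tau + n)

# Three hypotheses; the first is far more confident than the others.
w = nbest_weights([-100.0, -103.0, -110.0], -105.0)
new_mean = adapt_mean(np.zeros(2),
                      np.array([[1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]), w)
```

With this weighting, a low-confidence hypothesis contributes almost nothing, which is the intended guard against adapting on misrecognized strings.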

IC992084.PDF (From Author) IC992084.PDF (Rasterized)
