Chair: K. Paliwal, Griffith University (AUSTRALIA)
John H. L. Hansen, Duke University (USA)
Levent M. Arslan, Duke University (USA)
Speaker accent is an important issue in the formulation of robust speaker-independent recognition systems. Knowledge gained from a reliable accent classification approach could improve overall recognition performance. In this paper, a new algorithm is proposed for foreign accent classification of American English. A series of experimental studies is presented that focuses on establishing how speech production is varied to convey accent. The proposed method uses a source generator framework, recently proposed for analysis and recognition of speech under stress [5]. An accent-sensitive database is established using speakers of American English with foreign language accents. An initial version of the classification algorithm classified speaker accent from among four different accents with an accuracy of 81.5% in the case of unknown text, and 88.9% assuming known text. Finally, it is shown that as the accent-sensitive word count increases, the ability to correctly classify accent also increases, achieving an overall classification rate of 92% among the four accent classes.
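As a rough illustration of likelihood-based accent scoring (a generic stand-in, not the source generator framework of the paper), the sketch below fits one diagonal Gaussian per accent over frame-level features and labels an utterance with the highest-scoring accent; the class structure and feature representation are assumptions for illustration only.

    import numpy as np

    class GaussianAccentClassifier:
        """One diagonal Gaussian per accent over frame-level features; an
        utterance is labelled with the accent whose model gives the highest
        total log-likelihood. Illustrative only."""

        def fit(self, features_by_accent):
            # features_by_accent: {accent: array of shape (frames, dims)}
            self.models = {a: (f.mean(axis=0), f.var(axis=0) + 1e-6)
                           for a, f in features_by_accent.items()}
            return self

        def _loglik(self, feats, mean, var):
            return -0.5 * np.sum(np.log(2 * np.pi * var) + (feats - mean) ** 2 / var)

        def classify(self, feats):
            # feats: array of shape (frames, dims) for one test utterance
            scores = {a: self._loglik(feats, m, v) for a, (m, v) in self.models.items()}
            return max(scores, key=scores.get)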
R. Haeb-Umbach, Philips Research Laboratories-Aachen (GERMANY)
P. Beyerlein, Philips Research Laboratories-Aachen (GERMANY)
E. Thelen, Philips Research Laboratories-Aachen (GERMANY)
We address the problem of automatically finding an acoustic representation (i.e. a transcription) of unknown words as a sequence of subword units, given a few sample utterances of the unknown words, and an inventory of speaker-independent subword units. The problem arises if a user wants to add his own vocabulary to a speaker-independent recognition system simply by speaking the words a few times. Two methods are investigated which are both based on a maximum-likelihood formulation of the problem. The experimental results show that both automatic transcription methods provide a good estimate of the acoustic models of unknown words. The recognition error rates obtained with such models in a speaker-independent recognition task are clearly better than those resulting from separate whole-word models. They are comparable with the performance of transcriptions drawn from a dictionary.
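A minimal sketch of the underlying idea of deriving a subword-unit transcription by maximum likelihood, assuming frame-level unit log-likelihoods are already available from speaker-independent unit models. The free unit loop with a switching penalty and the single sample utterance are simplifications, not the two methods evaluated in the paper.

    import numpy as np

    def transcribe(frame_loglikes, units, switch_penalty=-2.0):
        """Viterbi pass over a free loop of subword units with a penalty for
        switching units; collapsing repeated labels yields a transcription.
        frame_loglikes: array (frames, units) of per-frame unit log-likelihoods."""
        T, U = frame_loglikes.shape
        score = frame_loglikes[0].copy()
        back = np.zeros((T, U), dtype=int)
        for t in range(1, T):
            switch = score.max() + switch_penalty       # best score if the unit changes
            best_prev = int(score.argmax())
            for u in range(U):
                back[t, u] = u if score[u] >= switch else best_prev
            score = np.maximum(score, switch) + frame_loglikes[t]
        # Backtrace the best state sequence, then collapse consecutive repeats.
        path = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        path.reverse()
        transcription = [units[path[0]]]
        for u in path[1:]:
            if units[u] != transcription[-1]:
                transcription.append(units[u])
        return transcription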
Silvana L. do N. Cunha Costa, Universidade Federal da Paraiba (BRAZIL)
Benedito G. Aguiar Neto, Universidade Federal da Paraiba (BRAZIL)
This paper presents an evaluation of an adaptive multichannel system for the enhancement of speech degraded by environmental acoustic noise. The system input is a four-microphone array connected to an automatic phase alignment unit, which gives a gain of 2 to 4 dB. The spectral densities of speech and noise are estimated and used as input to a Wiener filter whose coefficients are computed by a frequency-domain algorithm. The signal estimated at the filter output shows a total gain in signal-to-noise ratio of about 10 dB. Informal subjective listening tests indicate an improvement in signal quality, and intelligibility is judged to be very good.
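A minimal frequency-domain Wiener filter sketch of the kind described above, assuming the speech power spectral density is estimated by simple spectral subtraction from the (already phase-aligned) array output; the estimator and frame handling are illustrative assumptions, not the authors' exact algorithm.

    import numpy as np

    def wiener_gain(speech_psd, noise_psd):
        # Frequency-domain Wiener coefficients H(f) = Sss(f) / (Sss(f) + Snn(f)).
        return speech_psd / (speech_psd + noise_psd + 1e-12)

    def enhance_frame(noisy_frame, noise_psd):
        """Enhance one windowed frame of the phase-aligned array output.
        noise_psd must have len(noisy_frame)//2 + 1 bins (rfft resolution)."""
        spec = np.fft.rfft(noisy_frame * np.hanning(len(noisy_frame)))
        speech_psd = np.maximum(np.abs(spec) ** 2 - noise_psd, 0.0)  # crude speech PSD estimate
        return np.fft.irfft(wiener_gain(speech_psd, noise_psd) * spec, len(noisy_frame))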
Udo Bub, Carnegie Mellon University (USA)
Martin Hunke, Carnegie Mellon University (USA)
Alex Waibel, Carnegie Mellon University (USA)
With speech recognition systems steadily improving in performance, freedom from headsets and push-buttons to activate the recognizer is one of the most important issues for achieving user acceptance. Microphone arrays and beamforming can deliver signals that suppress undesired jamming signals, but they rely on knowledge of where the desired source is in space. This knowledge is usually derived by identifying the loudest signal source. Knowing who is speaking to whom and where should, however, not depend on loudness but on the communication purpose. In this paper, we present acoustic and visual modules that use tracking of the face of a speaker of interest for sound source localization and beamforming for signal extraction. It is shown that in noisy environments a more accurate localization in space can be delivered visually than acoustically. Given a reliable location finder, beamforming substantially improves recognition accuracy.
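A small delay-and-sum beamforming sketch steered at a source position supplied externally (for example by a visual face tracker), illustrating how a reliable location estimate feeds signal extraction; the geometry, sampling rate and function names are assumptions, not the system described in the paper.

    import numpy as np

    def delay_and_sum(mic_signals, mic_positions, source_pos, fs, c=343.0):
        """Delay-and-sum beamformer steered at a known source position, e.g. the
        head location reported by a visual face tracker. Delays are applied as
        frequency-domain phase shifts, so they need not be whole samples."""
        mic_signals = np.asarray(mic_signals, dtype=float)      # (channels, samples)
        n = mic_signals.shape[1]
        freqs = np.fft.rfftfreq(n, d=1.0 / fs)
        dists = np.linalg.norm(np.asarray(mic_positions) - np.asarray(source_pos), axis=1)
        delays = (dists - dists.min()) / c                      # relative propagation delays
        out = np.zeros(n)
        for sig, tau in zip(mic_signals, delays):
            # Advance each channel by its relative delay so the wavefronts align.
            out += np.fft.irfft(np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * tau), n)
        return out / len(mic_signals)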
Jean-Claude Junqua, Speech Technology Laboratory (USA)
Stephane Valente, Speech Technology Laboratory (USA)
Dominique Fohr, CRIN/INRIA (FRANCE)
Jean-François Mari, CRIN/INRIA (FRANCE)
In this paper, we introduce SmarTspelL, a new speaker-independent algorithm to recognize continuously spelled names over the telephone. Our method is based on an N-best, multi-pass recognition strategy that applies costly constraints only when the number of possible candidates is low. This strategy outperforms an HMM recognizer using a grammar containing all the possible names, and it is better suited to real-time operation. For a 3,388-name dictionary, a 95.3% name recognition rate is obtained. A real-time prototype has been implemented on a workstation. We also present comparisons of different feature sets for speech representation, and two speech recognition approaches based on first- and second-order HMMs.
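A hedged sketch of the second pass of such an N-best, multi-pass strategy: cheap constraints (spelled length, first letter) prune the name dictionary, and the costly letter-by-letter alignment is applied only once the candidate set is small. The specific constraints and thresholds are illustrative assumptions, not the SmarTspelL implementation.

    def recognize_name(nbest_letter_hyps, dictionary, shortlist_limit=50):
        """Second pass: prune the dictionary with cheap constraints, then apply
        the costly letter-by-letter alignment to the remaining candidates.
        nbest_letter_hyps: list of (letter_string, score) pairs from pass one."""
        def edit_distance(a, b):
            d = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                prev, d[0] = d[0], i
                for j, cb in enumerate(b, 1):
                    prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
            return d[-1]

        best_hyp, _ = max(nbest_letter_hyps, key=lambda h: h[1])
        candidates = [w for w in dictionary if abs(len(w) - len(best_hyp)) <= 1] or list(dictionary)
        if len(candidates) > shortlist_limit:
            # Still too many: add another cheap constraint before the costly one.
            candidates = [w for w in candidates if w[:1] == best_hyp[:1]] or candidates
        return min(candidates, key=lambda w: min(edit_distance(h, w) for h, _ in nbest_letter_hyps))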
Martin Betz, Universitat Karlsruhe (GERMANY)
Hermann Hild, Universitat Karlsruhe (GERMANY)
In some speech recognition applications, it is reasonable to constrain the search space of a speech recognizer to a large but finite set of sentences. We demonstrate the problem on a spelling task, where the recognition of continuously spelled last names is constrained to 110,000 entries (= 43,000 unique names) of a telephone book. Several techniques to address this problem are compared: recognition without any language model, bigrams, functions to map a hypothesis onto a legal string, n-best lists, and finally a newly developed method which integrates all constraints directly into the search process within reasonable memory and time bounds. The baseline result of 56% string accuracy is improved to 62, 85, 88, and 92%, respectively.
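As an illustration of integrating the finite name list directly into the search, the sketch below builds a letter trie over the legal names and runs a small beam search over per-position letter scores so that only complete legal strings can be hypothesized; it assumes one score vector per spelled letter and is a strong simplification of the search integration developed in the paper.

    def build_trie(names):
        # Nested-dict letter trie; '$' marks the end of a legal name.
        root = {}
        for name in names:
            node = root
            for ch in name:
                node = node.setdefault(ch, {})
            node['$'] = True
        return root

    def constrained_search(letter_scores, trie, beam=8):
        """letter_scores: one {letter: log score} dict per spelled position.
        The beam search expands only trie children, so every surviving
        hypothesis is a prefix of a legal name; completed names are kept."""
        beams = [(0.0, '', trie)]                       # (score, prefix, trie node)
        for scores in letter_scores:
            expanded = []
            for total, prefix, node in beams:
                for ch, child in node.items():
                    if ch == '$':
                        continue
                    expanded.append((total + scores.get(ch, -10.0), prefix + ch, child))
            beams = sorted(expanded, key=lambda b: -b[0])[:beam]
        finished = [(s, p) for s, p, node in beams if '$' in node]
        return max(finished, default=None)              # best (score, name) or None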
D. Giuliani, IRST (ITALY)
M. Matassoni, IRST (ITALY)
M. Omologo, IRST (ITALY)
P. Svaizer, IRST (ITALY)
This paper describes recent advances in the use of HMM-based technology for speaker-independent continuous speech recognition in a noisy environment under a hands-free interaction mode. For this purpose, an array of four omnidirectional microphones is employed as the acquisition system. The processing of phase information in the Cross-power Spectrum provides the capability both of locating the talker position and of reconstructing an enhanced speech spectrum. Here, two enhancement techniques are described that allow recognition improvement in the case of clean input speech as well as under different adverse conditions. Results refer to the use of a new multichannel corpus, collected in a real environment with the microphone array as well as a close-talk microphone.
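A short sketch of time-delay estimation from the phase of the cross-power spectrum (CSP, often called GCC-PHAT), the core operation behind the talker localization described above; the window length, sampling rate and maximum delay are illustrative assumptions.

    import numpy as np

    def csp_delay(x1, x2, fs, max_delay_s=0.001):
        """Estimate the relative time delay between two microphone channels from
        the phase of their cross-power spectrum (CSP / GCC-PHAT)."""
        n = len(x1) + len(x2)
        cross = np.fft.rfft(x1, n) * np.conj(np.fft.rfft(x2, n))
        cross /= np.abs(cross) + 1e-12                   # keep the phase only
        cc = np.fft.irfft(cross, n)
        max_shift = int(fs * max_delay_s)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (np.argmax(np.abs(cc)) - max_shift) / fs  # delay in seconds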
Toru Imai, NHK Science & Technology Research Labs (JAPAN)
Akio Ando, NHK Science & Technology Research Labs (JAPAN)
Eiichi Miyasaka, NHK Science & Technology Research Labs (JAPAN)
This paper presents a new method for the automatic generation of speaker-dependent phonological rules in order to decrease recognition errors caused by pronunciation variability. The proposed method generates phonological rules from the target speaker's continuous speech and the corresponding standard pronunciations, turning a single-pronunciation dictionary into a multiple-pronunciation dictionary. The method makes it possible to generate speaker-dependent and recognizer-dependent phonological rules automatically, and it can be applied to both top-down and bottom-up recognizers, whereas conventional methods are based on hand-derived general phonological rules such as coarticulation knowledge or are applicable only to a bottom-up recognizer. Phrase recognition experiments with concatenated phoneme HMMs showed that the generated rules can decrease recognition errors and serve as speaker adaptation at the phonological level.
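A minimal sketch of how such rules might be derived: the phoneme sequence observed for the speaker is aligned against the standard pronunciation with an edit-distance alignment, and substitutions or deletions that recur often enough are kept as candidate phonological rules. The alignment costs and the count threshold are assumptions, not the authors' procedure.

    from collections import Counter

    def align(ref, hyp):
        """Edit-distance alignment of a standard pronunciation (ref) against the
        phoneme sequence observed for the speaker (hyp); None marks an
        insertion or deletion."""
        n, m = len(ref), len(hyp)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1): d[i][0] = i
        for j in range(1, m + 1): d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                              d[i - 1][j] + 1, d[i][j - 1] + 1)
        pairs, i, j = [], n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
                pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
            elif i > 0 and d[i][j] == d[i - 1][j] + 1:
                pairs.append((ref[i - 1], None)); i -= 1
            else:
                pairs.append((None, hyp[j - 1])); j -= 1
        return pairs[::-1]

    def extract_rules(utterances, min_count=3):
        """utterances: (standard_pronunciation, observed_phonemes) pairs.
        Returns substitution/deletion rules that recur at least min_count times."""
        counts = Counter()
        for ref, hyp in utterances:
            for r, h in align(ref, hyp):
                if r is not None and r != h:
                    counts[(r, h)] += 1
        return [rule for rule, c in counts.items() if c >= min_count]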
David L. Jennings, AFIT/ENG (USA)
Dennis W. Ruck, AFIT/ENG (USA)
This paper presents the results of experimentation with a simple ultrasonic lip motion detector, or "Ultrasonic Mike", in automatic speech recognition. The device is tested in a speaker-dependent isolated-word recognition task with a vocabulary consisting of the spoken digits zero through nine. The "Ultrasonic Mike" serves as the input to an automatic lip reader, which uses template matching and dynamic time warping to determine the best candidate for a given test utterance. The device is first tested as a stand-alone automatic lip reader, achieving accuracy as high as 89%. Next, the automatic lip reader is combined with a conventional automatic speech recognizer; classifier fusion is based on a pseudo probability mass function derived from the dynamic time warping distances. The combined system is tested with various levels of added acoustic noise. In a typical example at 0 dB, the acoustic recognizer's accuracy was 78% and the lip reader's accuracy was 69%, but the combined accuracy was 93%. This experiment demonstrates that this simple ultrasonic lip motion detector, whose output data rate is 12,500 times lower than that of a typical video camera, can improve automatic speech recognition in noisy environments. It also demonstrates an effective classifier fusion algorithm based on dynamic time warping distances.
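A compact sketch of the classification and fusion steps named above: dynamic time warping distances between feature sequences, a pseudo probability mass function derived from those distances, and a weighted combination with the acoustic recognizer's scores. The inverse-distance normalization and the fusion weight are plausible choices, not necessarily the authors' exact formulation.

    import numpy as np

    def dtw_distance(a, b):
        """Dynamic time warping distance between two feature sequences (frames x dims)."""
        D = np.full((len(a) + 1, len(b) + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[len(a), len(b)]

    def pseudo_pmf(distances):
        # Smaller DTW distance -> larger mass; one plausible normalization.
        inv = 1.0 / (np.asarray(distances, dtype=float) + 1e-9)
        return inv / inv.sum()

    def fuse(acoustic_pmf, lip_pmf, weight=0.5):
        # Weighted product of the two classifiers' pseudo-PMFs, renormalized.
        combined = np.asarray(acoustic_pmf) ** weight * np.asarray(lip_pmf) ** (1 - weight)
        return combined / combined.sum()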
Basilis Gidas, Brown University (USA)
Alejandro Murua, University of Chicago (USA)
We propose a new algorithmic method for the classification and clustering of the six English stop consonants /p, t, k, b, d, g/ on the basis of CV (Consonant-Vowel) or VC syllable data. The method exploits two powerful tools: (1) a wavelet representation of the acoustic signal and its induced waveletogram, a time-domain analogue of the spectrogram; (2) nonparametric transformations of the waveletogram and a nonlinear discriminant analysis based on these transformations. The procedure has yielded better rates of correct classification than previous methods. Moreover, it yields interesting two-dimensional clustering plots for stop consonants as well as for vowels. The clustering plots for vowels separate the classes as well as those based on the first and second formants; we know of no other method in the literature that yields comparable clustering plots for consonants.
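A rough stand-in for the waveletogram idea, computing a time-scale energy map by convolving the signal with Ricker (Mexican hat) wavelets at several scales; the wavelet choice, scale range and normalization are assumptions, and the nonparametric transformations and discriminant analysis of the paper are not shown.

    import numpy as np

    def ricker(points, a):
        # Ricker (Mexican hat) wavelet of width a, sampled on `points` samples.
        t = np.arange(points) - (points - 1) / 2.0
        amp = 2 / (np.sqrt(3 * a) * np.pi ** 0.25)
        return amp * (1 - (t / a) ** 2) * np.exp(-(t ** 2) / (2 * a ** 2))

    def waveletogram(signal, scales):
        """Time-scale energy map: one row per wavelet scale, a rough stand-in
        for the waveletogram (time-domain analogue of the spectrogram)."""
        rows = []
        for a in scales:
            w = ricker(min(max(int(10 * a), 4), len(signal)), a)
            rows.append(np.convolve(signal, w, mode='same'))
        return np.abs(np.array(rows))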