Abstract: Session SP-26 |

SP-26.1
Implications of Glottal Source for Speaker and Dialect Identification
Lisa R. Yanguas,
Thomas F Quatieri (MIT Lincoln Laboratory),
Fred Goodman (Autometric)
In this paper we explore the importance of speaker-specific
information carried in the glottal source by interchanging glottal
excitation and vocal tract system information for speaker pairs
through analysis/synthesis techniques. Earlier work has shown the
importance of glottal flow derivative waveform model parameters in
automatic speaker identification [2] and in characterizing voice
style (e.g., creaky or aspirated) [1]. In our work, after matching
phonemes across utterance pairs, we separately interchange three
excitation characteristics: timing, pitch, and glottal flow
derivative, and investigate their relative importance in two case
studies. Through time alignment and pitch and glottal flow
transformations, we can make a speaker of a northern dialect sound
more like his southern counterpart; through the same processes, a
Peruvian speaker is made to sound more Cuban-like. From these
experiments we conclude that significant speaker- and
dialect-specific information, such as noise, breathiness or
aspiration, and vocalization and stridency, is carried in the
glottal signal. Based on these findings, it appears that although
dialect identification has usually been approached in a manner
similar to language identification, it may actually be more closely
related to the speaker identification task. This suggests linking
speaker and dialect identification more closely; we plan, for
example, to explore other speaker identification approaches to
dialect identification in the hope of improving automatic dialect
identification algorithms.
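
To make the excitation/vocal-tract exchange concrete, here is a
minimal Python sketch that swaps source and filter between two
time-aligned segments. It uses plain LPC residuals as a simplified
stand-in for the paper's glottal flow derivative model, and the
segment variables are hypothetical; it illustrates the general
analysis/synthesis idea, not the authors' exact system.

    # Sketch: drive speaker B's vocal-tract filter with speaker A's
    # excitation. LPC residuals approximate the glottal source here,
    # a simplification of the glottal flow derivative model above.
    import librosa
    from scipy.signal import lfilter

    def source_filter_swap(seg_a, seg_b, order=16):
        """Synthesize speech with A's excitation and B's vocal tract."""
        a_coeffs = librosa.lpc(seg_a, order=order)    # A's vocal tract
        b_coeffs = librosa.lpc(seg_b, order=order)    # B's vocal tract
        residual_a = lfilter(a_coeffs, [1.0], seg_a)  # A's excitation
        return lfilter([1.0], b_coeffs, residual_a)   # cross-synthesis

    # seg_a, seg_b: equal-length, phoneme-matched mono float arrays
    # (hypothetical); in practice this is done frame by frame.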

SP-26.2
Automatic Speaker Clustering from Multi-speaker Utterances
Jack McLaughlin,
Douglas Reynolds,
Elliot Singer,
Gerald C O'Leary (MIT Lincoln Laboratory)
Blind clustering of multi-person utterances by speaker
is complicated by the fact that each utterance has at least two
talkers. In the case of a two-person conversation, one can
simply split each conversation into its respective speaker
halves, but this splitting introduces errors that ultimately
hurt clustering. We propose a clustering algorithm
capable of associating each conversation with two clusters
(and therefore two speakers), obviating the need for splitting.
Results are given for two-speaker conversations culled from
the Switchboard corpus, and comparisons are made to
results obtained on single-speaker utterances. We conclude
that although the approach is promising, our technique for
computing inter-conversation similarities prior to clustering
needs improvement.
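
The abstract leaves the similarity measure unspecified; as a hedged
illustration only, the sketch below computes a common symmetric
cross-likelihood similarity between two conversations from
per-conversation GMMs. The function name and feature arrays are
assumptions, not the paper's implementation.

    # Sketch: score each conversation's frames against the other
    # conversation's GMM; a high symmetric cross log-likelihood
    # suggests shared speakers.
    from sklearn.mixture import GaussianMixture

    def conversation_similarity(feats_i, feats_j, n_mix=8):
        """feats_*: (n_frames, n_dims) cepstral feature arrays."""
        gmm_i = GaussianMixture(n_mix, covariance_type="diag").fit(feats_i)
        gmm_j = GaussianMixture(n_mix, covariance_type="diag").fit(feats_j)
        # .score() returns the mean per-frame log-likelihood.
        return gmm_i.score(feats_j) + gmm_j.score(feats_i)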

SP-26.3
Detection of Target Speakers in Audio Databases
Ivan Magrin-Chagnolleau,
Aaron E Rosenberg,
S. Parthasarathy (AT&T Labs-Research)
This paper addresses the problem of speaker detection in audio
databases. Gaussian mixture modeling is used to build target
speaker and background models, and a detection algorithm
based on a likelihood ratio calculation is applied to
estimate target speaker segments. Evaluation procedures
for this task are defined in detail. Results are given
for different subsets of the HUB4 broadcast news database.
For one target speaker, with the data restricted to
high-quality speech segments, the segment miss rate is
approximately 7%; for unrestricted data the segment miss
rate is approximately 27%. In both cases the false alarm
rate is 4 or 5 per hour. For two target speakers with
unrestricted data, the segment miss rate is approximately
63% with about 27 segment false alarms per hour. The
decrease in performance for two target speakers is largely
due to short speech segments in the two-target-speaker test
data, which are undetectable in the current configuration
of the detection algorithm.
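
A minimal sketch of the segment-level likelihood-ratio test
described above, assuming scikit-learn GMMs and pre-computed
feature segments (feature extraction and segmentation are not
shown, and the threshold is a free parameter):

    # Sketch: flag segments whose mean log-likelihood ratio between
    # target and background GMMs clears a decision threshold.
    from sklearn.mixture import GaussianMixture

    def train_models(target_frames, background_frames, n_mix=32):
        target = GaussianMixture(n_mix, covariance_type="diag")
        background = GaussianMixture(n_mix, covariance_type="diag")
        return target.fit(target_frames), background.fit(background_frames)

    def detect_segments(segments, target, background, threshold=0.0):
        """segments: list of (n_frames, n_dims) feature arrays."""
        return [k for k, seg in enumerate(segments)
                if target.score(seg) - background.score(seg) > threshold]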

SP-26.4
Background Model Design for Flexible and Portable Speaker Verification Systems
Olivier Siohan,
Chin H Lee,
Arun C Surendran,
Qi Li (Bell Laboratories - Lucent Technologies)
Most state-of-the-art speaker verification systems need a user model
built from samples of the customer's speech and a speaker-independent
(SI) background model with high acoustic resolution. These systems
rely heavily on the availability of speaker-independent databases
along with a priori knowledge about the acoustic rules of the
utterance, and depend on the consistency of the acoustic conditions
under which the SI models were trained. These constraints may be a
burden in practical and portable devices, such as palm-top computers
or wireless handsets, which place a premium on computation and memory
and where the user is free to choose any password utterance in any
language, under any acoustic condition. In this paper, we present a
novel and reliable approach to background model design when only the
enrollment data is available. Preliminary results are provided to
demonstrate the effectiveness of such systems.
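
The abstract does not spell out the construction, so the following
is a loudly hypothetical illustration of one way a background model
could be derived from enrollment data alone: refit the same
enrollment frames with a much coarser GMM and score with the usual
log-likelihood ratio. This is an assumption for illustration, not
the authors' method.

    # Hypothetical sketch: a low-resolution GMM fit on the same
    # enrollment data serves as a broad background model, so no SI
    # database is needed (an assumed design, not the paper's).
    from sklearn.mixture import GaussianMixture

    def enroll(enroll_frames, n_user=32, n_bg=2):
        user = GaussianMixture(n_user, covariance_type="diag").fit(enroll_frames)
        bg = GaussianMixture(n_bg, covariance_type="diag").fit(enroll_frames)
        return user, bg

    def verification_score(test_frames, user, bg):
        # Higher when the detailed user model fits much better than
        # the coarse background model.
        return user.score(test_frames) - bg.score(test_frames)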

SP-26.5
Corpora for the Evaluation of Speaker Recognition Systems
Joseph P Campbell (Department of Defense),
Douglas A Reynolds (MIT Lincoln Laboratory)
Using standard speech corpora for development and
evaluation has proven to be very valuable in promoting
progress in speech and speaker recognition research.
In this paper, we present an overview of current
publicly available corpora intended for speaker
recognition research and evaluation. We outline the
corpora's salient features with respect to their
suitability for conducting speaker recognition
experiments and evaluations. Links to these corpora,
and to new corpora, will appear on the web at
http://www.apl.jhu.edu/Classes/Notes/Campbell/SpkrRec/.
We hope to increase the awareness and use of these
standard corpora and corresponding evaluation
procedures throughout the speaker recognition
community.

SP-26.6
An Unsupervised Approach to Language Identification
Francois Pellegrino,
Régine André-Obrecht (IRIT - Institut de Recherche en Informatique de Toulouse)
This paper presents an unsupervised approach to Automatic Language Identification (ALI) based on vowel system modeling. Each language's vowel system is modeled by a Gaussian Mixture Model (GMM) trained on automatically detected vowels. Since this detection is unsupervised and language-independent, no labeled data are required. GMMs are initialized using an efficient data-driven variant of the LBG algorithm: the LBG-Rissanen algorithm.
With 5 languages from the OGI MLTS corpus, in a closed-set identification task, we reach 79% correct identification for male speakers using only the vowel segments detected in 45-second utterances.
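
As a rough illustration of the initialization step, here is a sketch
of LBG-style codebook growth with a Rissanen-style minimum
description length (MDL) stopping rule. The exact MDL form and the
k-means refinement are assumptions; the paper's LBG-Rissanen details
may differ.

    # Sketch: grow a codebook by LBG splitting, refine with k-means,
    # and keep the size that minimizes an MDL-style criterion
    # (distortion cost plus a parameter-cost penalty).
    import numpy as np
    from sklearn.cluster import KMeans

    def lbg_rissanen(X, max_doublings=5, eps=1e-3):
        n, d = X.shape
        centers = X.mean(axis=0, keepdims=True)
        best, best_mdl = centers, np.inf
        for _ in range(max_doublings):
            # LBG split: perturb each centroid into two.
            centers = np.vstack([centers * (1 + eps), centers * (1 - eps)])
            km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
            centers = km.cluster_centers_
            k = len(centers)
            mdl = n * np.log(km.inertia_ / n) + 0.5 * k * d * np.log(n)
            if mdl < best_mdl:
                best, best_mdl = centers.copy(), mdl
        return best  # codebook used to initialize the GMM means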

SP-26.7
An Experimental Study of Speaker Verification Sensitivity to Computer Voice-Altered Imposters
Bryan L. Pellom,
John H.L. Hansen (Duke University)
This paper investigates the relative sensitivity of a GMM-based voice verification
algorithm to computer voice-altered imposters. First, a new trainable speech
synthesis algorithm based on trajectory models of the speech Line Spectral Frequency
(LSF) parameters is presented in order to model the spectral characteristics of a
target voice. A GMM-based speaker verifier is then constructed for the 138-speaker
YOHO database and shown to have an initial equal-error rate (EER) of 1.45% for
the case of casual imposter attempts using a single combination-lock phrase test.
Next, imposter voices are automatically altered using the synthesis algorithm to
mimic the customer's voice. After voice transformation, the false acceptance rate
is shown to increase from 1.45% to over 86% if the baseline EER threshold is left
unmodified. Furthermore, at a customer false rejection rate of 25%, the false
acceptance rate for the voice-altered imposter remains as high as 34.6%.
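
To make the threshold experiment concrete, here is a minimal sketch
of estimating the EER operating threshold from genuine and imposter
score lists, then re-measuring the false acceptance rate when a new
(e.g., voice-altered) imposter score set is evaluated against that
unmodified threshold. Variable names are assumptions.

    # Sketch: pick the threshold where false acceptance and false
    # rejection rates are closest (the EER point), then reuse it.
    import numpy as np

    def eer_threshold(genuine, imposter):
        best_t, best_gap = None, np.inf
        for t in np.sort(np.concatenate([genuine, imposter])):
            far = np.mean(imposter >= t)  # false acceptance rate
            frr = np.mean(genuine < t)    # false rejection rate
            if abs(far - frr) < best_gap:
                best_t, best_gap = t, abs(far - frr)
        return best_t

    def far_at(threshold, imposter_scores):
        return np.mean(imposter_scores >= threshold)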

SP-26.8
A New Cohort Normalization Using Local Acoustic Information for Speaker Verification
Toshihiro Isobe,
Jun-ichi Takahashi (Laboratory for Information Technology, NTT DATA CORPORATION)
This paper describes a new cohort normalization method for HMM-based speaker verification. In the proposed method, cohort models are synthesized based on the similarity of local acoustic features between speakers. The similarity can be determined using acoustic information lying in model components such as phonemes, states, and the Gaussian distributions of HMMs. With this method, the synthesized models can provide an effective normalizing score for various observed measurements, because the difference between the individual reference model and the synthesized cohort models is statistically reduced through fine evaluation of acoustic similarity at the model-structure level. In experiments using telephone speech from 100 speakers, the proposed method achieved high verification performance: in a closed test, the Equal Error Rate (EER) was drastically reduced from 1.20% (obtained by conventional speaker-selection-based cohort normalization) to 0.30% (obtained by the proposed method with distribution-based selection). Furthermore, in an open test (reference speakers: 25, impostors: 75), in which speakers other than the reference speakers were used as impostors, the EER was reduced from 1.40% to 0.70%.
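
In general form, cohort normalization offsets the claimed speaker's
score by the cohort's; the sketch below shows that scoring step only.
How the cohort models are synthesized from locally similar HMM
components, which is this paper's contribution, is not reproduced.

    # Sketch: generic cohort score normalization. Each model is
    # assumed to expose .score(feats) -> mean log-likelihood.
    import numpy as np

    def cohort_normalized_score(test_feats, claimed_model, cohort_models):
        claimed = claimed_model.score(test_feats)
        cohort = np.mean([m.score(test_feats) for m in cohort_models])
        return claimed - cohort  # compare against a decision threshold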

SP-26.9
On the Use of Orthogonal GMM in Speaker Recognition
Li Liu,
Jialong He (Dept. of Speech & Hearing Science, Arizona State University)
Gaussian mixture modeling (GMM) techniques are increasingly being used for both speaker identification and verification. Most of these models assume diagonal covariance matrices. Although empirically any distribution can be approximated with a diagonal GMM, a large number of mixture components is usually needed to obtain a good approximation; a consequence of using a large GMM is that its training is time consuming and its response speed is very slow. This paper proposes a modification to the standard diagonal GMM approach. The proposed scheme includes an orthogonal transformation: feature vectors are first transformed to the space spanned by the eigenvectors of the covariance matrix before being applied to the diagonal GMM. Only a small computational load is introduced by this transformation, but results from both speaker identification and verification experiments indicate that the orthogonal transformation considerably improves recognition performance. For a given performance level, the GMM with an orthogonal transform needs only one-fourth the number of Gaussian functions required by the standard GMM.
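
The transformation described above is straightforward to sketch:
project the features onto the eigenvectors of their covariance
matrix (decorrelating them) before fitting the diagonal-covariance
GMM. Parameter values and function names here are illustrative.

    # Sketch: orthogonal (eigenvector) transform followed by a
    # diagonal GMM, per the scheme described in the abstract.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_orthogonal_gmm(train_feats, n_mix=8):
        cov = np.cov(train_feats, rowvar=False)
        _, eigvecs = np.linalg.eigh(cov)   # orthonormal basis
        gmm = GaussianMixture(n_mix, covariance_type="diag")
        gmm.fit(train_feats @ eigvecs)     # decorrelated features
        return gmm, eigvecs

    def log_likelihood(test_feats, gmm, eigvecs):
        return gmm.score(test_feats @ eigvecs)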

SP-26.10
Reusable Binary-Paired Partitioned Neural Networks for Text-Independent Speaker Identification
Stephen A Zahorian (Old Dominion University)
A neural network algorithm for speaker identification with large groups of speakers is described. The technique derives from one in which an N-way speaker identification task is partitioned into N*(N-1)/2 two-way classification tasks, each performed by a small two-way, or pair-wise, neural network. The decisions of these two-way networks are then combined to make the N-way speaker identification decision (Rudasi and Zahorian, 1991 and 1992). Although very accurate, this method has the drawback of requiring a very large number of pair-wise networks. In the new approach, two-way neural network classifiers, each trained only to separate two speakers, are also used to separate other pairs of speakers. This greatly reduces the number of pair-wise classifiers required for an N-way classification decision, especially when the number of speakers is very large. For 100 speakers extracted from the TIMIT database, the number of pair-wise classifiers can be reduced by approximately a factor of 5, with only minor degradation in performance when 3 seconds or more of speech is used for identification. Using all 630 speakers from the TIMIT database, this method obtains over 99.7% accuracy; with the telephone version of the same database, an accuracy of 40.2% can be obtained.
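
The combination step lends itself to a short sketch: each pair-wise
classifier casts one vote, and the speaker with the most votes wins
the N-way decision. The reuse of classifiers across speaker pairs,
which is the paper's contribution, is omitted here; the classifier
interface is an assumption.

    # Sketch: N-way identification by majority vote over the
    # N*(N-1)/2 pair-wise classifiers.
    import numpy as np

    def nway_decision(feats, pair_classifiers, n_speakers):
        """pair_classifiers[(i, j)](feats) -> winning index i or j."""
        votes = np.zeros(n_speakers, dtype=int)
        for (i, j), clf in pair_classifiers.items():
            votes[clf(feats)] += 1
        return int(np.argmax(votes))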