Session: SPEECH-P6
Time: 9:30 - 11:30, Thursday, May 10, 2001
Location: Exhibit Hall Area 6
Title: Speaker Recognition 1
Chair: Thomas Quatieri

9:30, SPEECH-P6.1
COMPARISON OF DIFFERENT OBJECTIVE FUNCTIONS FOR OPTIMAL LINEAR COMBINATION OF CLASSIFIERS FOR SPEAKER IDENTIFICATION
H. ALTINCAY, M. DEMIREKLER
This paper presents a comparison of objective functions for optimally combining different speaker identification systems. The comparison is based on the classification performance of the resultant multiple classifier system (MCS). The objective functions considered are: classification figure of merit (CFM), mean square error (MSE), and cross entropy (CE). In all three methods, the outputs of the individual classifiers are assumed to be the posterior probabilities of each speaker, and a linear combination of the output vectors is considered. CFM seeks to maximize the difference between the output value of the correct speaker and the output values of all other, incorrect speakers. MSE and CE, on the other hand, compare the outputs with ideal vectors in which the output of the correct speaker is set to one and the others to zero. The experimental results are also compared with the averaging method, where the combination is not optimized. Our simulation experiments on four different sets of speakers show that CFM outperforms the other objective functions.
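As a rough illustration of the three objectives, each can be evaluated on a linearly combined posterior vector. The function names and the two-classifier, three-speaker setup below are invented for the sketch, not taken from the paper:

```python
# Sketch: three objective functions over a linear combination of
# per-classifier posterior vectors. All values are illustrative.
import math

def combine(outputs, weights):
    """Linearly combine per-classifier posterior vectors, then renormalize."""
    n = len(outputs[0])
    combined = [sum(w * o[i] for w, o in zip(weights, outputs)) for i in range(n)]
    total = sum(combined)
    return [c / total for c in combined]

def mse(p, correct):
    """Mean square error against the ideal one-hot vector."""
    return sum((pi - (1.0 if i == correct else 0.0)) ** 2
               for i, pi in enumerate(p)) / len(p)

def cross_entropy(p, correct):
    """Cross entropy against the ideal one-hot vector."""
    return -math.log(max(p[correct], 1e-12))

def cfm(p, correct):
    """Classification figure of merit: margin between the correct speaker's
    output and the best incorrect output (to be maximized)."""
    others = [pi for i, pi in enumerate(p) if i != correct]
    return p[correct] - max(others)

outputs = [[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]  # two classifiers, three speakers
p = combine(outputs, [0.7, 0.3])
```

MSE and CE pull the combined vector toward a one-hot target, while CFM only cares about the margin over the best competitor, which is what the paper's results favor.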

9:30, SPEECH-P6.2
FRACTAL DIMENSION APPLIED TO SPEAKER IDENTIFICATION
D. BARONE, A. PETRY
This paper reports the results obtained with a speaker identification system based on the Bhattacharyya distance, which combines LP-derived cepstral coefficients with a nonlinear dynamic feature, namely the fractal dimension. The nonlinear dynamic analysis starts with the phase space reconstruction, after which the fractal dimension of the corresponding attractor trajectory is estimated. This analysis is performed on every speech window, providing a measure of a time-dependent fractal dimension. The corpus used in the tests is composed of 37 different speakers, and the best results are obtained when the fractal dimension is included, suggesting that the information added by this feature was not previously present.

9:30, SPEECH-P6.3
SOURCE AND SYSTEM FEATURES FOR SPEAKER RECOGNITION USING AANN MODELS
B. YEGNANARAYANA, K. REDDY, S. KISHORE
In this paper we study the effectiveness of features extracted from the source and system components of the speech production process for the purpose of speaker recognition. The source and system components are derived using linear prediction (LP) analysis of short segments of speech. The source component is the LP residual derived from the signal, and the system component is a set of weighted linear prediction cepstral coefficients. The features are captured implicitly by a feedforward autoassociative neural network (AANN). Two separate speaker models are derived by training two AANN models on feature vectors corresponding to the source and system components. A speaker recognition system for 20 speakers is built and tested with both models to evaluate the performance of the source and system features. The study demonstrates the complementary nature of the two components.

9:30, SPEECH-P6.4
SPEAKER CHANGE DETECTION AND SPEAKER CLUSTERING USING VQ DISTORTION FOR BROADCAST NEWS SPEECH RECOGNITION
K. MORI, S. NAKAGAWA
This paper addresses the problem of detecting speaker changes and clustering speakers when no information is available regarding the speaker classes or even the total number of classes. We assume that no prior information on the speakers is available (no speaker model, no training phase) and that people do not speak simultaneously. The aim is to apply speaker grouping information to speaker adaptation for speech recognition. We use Vector Quantization (VQ) distortion as the criterion: a speaker model is created from successive utterances as a codebook by a VQ algorithm, and the VQ distortion is calculated between the model and an utterance. Experimental results are given for speaker change detection and speaker clustering. The speaker change detection results are compared with those obtained using the Generalized Likelihood Ratio (GLR) and the Bayesian Information Criterion (BIC), and show the superiority of the proposed method.
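The VQ-distortion criterion can be sketched as follows; the two-dimensional "features", codebook, and threshold below are illustrative stand-ins, not the paper's actual models:

```python
# Sketch: VQ distortion between a speaker codebook and a new utterance.
# A large distortion suggests a speaker change. Values are illustrative.
import math

def vq_distortion(codebook, frames):
    """Average Euclidean distance from each frame to its nearest codeword."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(min(dist(f, c) for c in codebook) for f in frames) / len(frames)

codebook_a = [(0.0, 0.0), (1.0, 1.0)]       # codebook built from speaker A
same_speaker = [(0.1, 0.0), (0.9, 1.1)]     # frames close to the codebook
new_speaker = [(5.0, 5.0), (6.0, 5.5)]      # frames far from the codebook

threshold = 1.0  # illustrative; in practice tuned empirically
change = vq_distortion(codebook_a, new_speaker) > threshold
```

Because the codebook is grown from successive utterances, no pretrained speaker models are needed, which matches the paper's no-prior-information assumption.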

9:30, SPEECH-P6.5
A HYBRID GMM/SVM APPROACH TO SPEAKER IDENTIFICATION
S. FINE, J. NAVRATIL, R. GOPINATH
This paper proposes a classification scheme that incorporates statistical models and support vector machines. A hybrid system which appropriately combines the advantages of both the generative and discriminant model paradigms is described and experimentally evaluated on a text-independent speaker recognition task in matched and mismatched training and test conditions. Our results show that the combination is beneficial in terms of performance and practical in terms of computation. We report a relative reduction of up to 25% in identification error rate compared to the baseline statistical model.
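One common way to realize such a hybrid, sketched here under assumed values rather than the paper's exact construction, is to map an utterance to a fixed-length vector of per-component GMM scores and then apply a linear decision function of the kind an SVM would learn:

```python
# Sketch: GMM score vector fed to a linear decision function.
# The 1-D two-component GMM and the weights are illustrative assumptions.
import math

def gaussian_loglik(x, mean, var):
    """Log-likelihood of a 1-D point under one Gaussian component."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def score_vector(frames, components):
    """Average per-component log-likelihoods: a simple 'GMM score space'
    that a discriminant classifier such as an SVM could operate on."""
    return [sum(gaussian_loglik(f, m, v) for f in frames) / len(frames)
            for m, v in components]

components = [(0.0, 1.0), (3.0, 1.0)]  # hypothetical 2-component GMM
utt = [0.1, -0.2, 0.3]
vec = score_vector(utt, components)

# Linear decision function (as an SVM would learn): w . vec + b > 0
w, b = [1.0, -1.0], 0.0
accept = sum(wi * vi for wi, vi in zip(w, vec)) + b > 0
```

The generative model supplies a fixed-length, likelihood-based representation of a variable-length utterance; the discriminant stage then only has to separate classes in that score space.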

9:30, SPEECH-P6.6
DEVELOPING USABLE SPEECH CRITERIA FOR SPEAKER IDENTIFICATION
J. LOVEKIN, R. YANTORNO, K. KRISHNAMACHARI, D. BENINCASA, S. WENNDT
Recently, a “usable speech” extraction system [1] was proposed to separate co-channel speech into “usable” frames that are minimally corrupted by interfering speech. Studies indicate [2] that a significant amount of co-channel speech can be considered “usable” for speaker identification (SID). Therefore, it is necessary to establish criteria for usable speech frames for SID. Voiced speech, of which usable speech is entirely composed, is shown to be information-rich for SID. In addition, SID accuracy increases as the frame-based Transmitter to Interferer Ratio (TIR) increases when evaluated independently of the amount of available segments. Recent work [3] develops a frame-based Spectral Autocorrelation Ratio (SAR) technique for determining usable frames within co-channel speech. The ability of the SAR method to determine usable frames at various thresholds is examined. This paper investigates the effectiveness of frame-based usable speech extraction techniques for speaker identification.

9:30, SPEECH-P6.7
LEARNING THE DECISION FUNCTION FOR SPEAKER VERIFICATION
S. BENGIO, J. MARIÉTHOZ
This paper explores the possibility of replacing the usual thresholding decision rule on log likelihood ratios used in speaker verification systems with more complex, discriminant decision functions based, for instance, on Linear Regression models or Support Vector Machines. Current speaker verification systems, based on generative models such as HMMs or GMMs, can indeed easily be adapted to use such decision functions. Experiments on both text-dependent and text-independent tasks yielded significant performance improvements.
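The contrast between the two decision rules can be sketched as follows; the weights and threshold below are illustrative, not trained:

```python
# Sketch: classical LLR thresholding vs. a learned linear decision
# function over the same client/world scores. Values are illustrative.

def llr_decision(client_ll, world_ll, theta):
    """Classical rule: accept iff the log likelihood ratio exceeds a threshold."""
    return client_ll - world_ll > theta

def linear_decision(client_ll, world_ll, w1, w2, b):
    """A more general learned decision function over the same two scores
    (a linear-regression-style model; an SVM plays the same role).
    The weights here are hand-picked for illustration, not trained."""
    return w1 * client_ll + w2 * world_ll + b > 0

# The thresholding rule is the special case w1=1, w2=-1, b=-theta,
# which is why a generative system adapts easily to learned decisions.
theta = 0.5
```

A learned decision function can weight the client and world scores unequally, or operate on richer score sets, which is where the reported improvements come from.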

9:30, SPEECH-P6.8
SPEAKER INDEXING IN LARGE AUDIO DATABASES USING ANCHOR MODELS
D. STURIM, D. REYNOLDS, E. SINGER, J. CAMPBELL
This paper introduces the technique of anchor modeling in the applications of speaker detection and speaker indexing. The anchor modeling algorithm is refined by pruning the number of models needed. The system is applied to the speaker detection problem where its performance is shown to fall short of the state-of-the-art Gaussian Mixture Model with Universal Background Model (GMM-UBM) system. However, it is further shown that its computational efficiency lends itself to speaker indexing for searching large audio databases for desired speakers. Here, excessive computation may prohibit the use of the GMM-UBM recognition system. Finally, the paper presents a method for cascading anchor model and GMM-UBM detectors for speaker indexing. This approach benefits from the efficiency of anchor modeling and high accuracy of GMM-UBM recognition.
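The cascade idea can be sketched generically; the scoring functions and toy database below are stand-ins for the fast anchor-model scorer and the accurate GMM-UBM scorer:

```python
# Sketch: two-stage cascade search. The cheap first-stage scorer prunes
# the database so the expensive scorer only sees survivors.
# The segment ids and scores are invented for illustration.

def cascade_search(segments, fast_score, slow_score, fast_threshold):
    """Prune with the cheap scorer, then rank survivors with the
    expensive one, best match first."""
    survivors = [s for s in segments if fast_score(s) > fast_threshold]
    return sorted(survivors, key=slow_score, reverse=True)

# Toy database: segment id -> (cheap anchor-space score, expensive score)
scores = {"seg1": (0.9, 4.2), "seg2": (0.2, 5.0), "seg3": (0.7, 1.1)}
ranked = cascade_search(list(scores),
                        lambda s: scores[s][0],
                        lambda s: scores[s][1],
                        fast_threshold=0.5)
```

The trade-off the paper exploits is visible here: the expensive scorer is only ever invoked on the pruned subset, at the risk of the fast stage discarding a true match.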

9:30, SPEECH-P6.9
SPEAKER IDENTIFICATION USING GAUSSIAN MIXTURE MODELS BASED ON MULTI-SPACE PROBABILITY DISTRIBUTION
C. MIYAJIMA, Y. HATTORI, K. TOKUDA, T. MASUKO, T. KOBAYASHI, T. KITAMURA
This paper presents a new approach to modeling speech spectra and pitch for text-independent speaker identification using Gaussian mixture models based on multi-space probability distribution (MSD-GMM). The MSD-GMM allows us to model continuous pitch values for voiced frames and discrete symbols representing unvoiced frames in a unified framework. Spectral and pitch features are jointly modeled by a two-stream MSD-GMM. We derive maximum likelihood (ML) estimation formulae for the MSD-GMM parameters, and the MSD-GMM speaker models are evaluated for text-independent speaker identification tasks. Experimental results show that the MSD-GMM can efficiently model spectral and pitch features of each speaker and outperforms conventional speaker models.
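A single-component sketch of the multi-space pitch distribution follows, with an assumed unvoiced weight, mean, and variance; the paper mixes many such components and jointly models a spectral stream as well:

```python
# Sketch: multi-space distribution over pitch. Unvoiced frames live in a
# discrete space, voiced F0 values in a continuous Gaussian space.
# The weight, mean, and variance below are illustrative assumptions.
import math

def msd_pitch_loglik(obs, w_unvoiced, mean, var):
    """Log-likelihood of one frame: None marks an unvoiced frame (discrete
    symbol); a float is a voiced F0 value (continuous density)."""
    if obs is None:
        return math.log(w_unvoiced)
    g = -0.5 * (math.log(2 * math.pi * var) + (obs - mean) ** 2 / var)
    return math.log(1.0 - w_unvoiced) + g

frames = [120.0, None, 118.0]  # F0 in Hz; None marks an unvoiced frame
total = sum(msd_pitch_loglik(f, 0.3, 120.0, 25.0) for f in frames)
```

Because voiced and unvoiced observations score under one distribution, the whole pitch stream contributes to the speaker likelihood without any ad hoc handling of unvoiced frames.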

9:30, SPEECH-P6.10
LEARNING STATISTICALLY EFFICIENT FEATURES FOR SPEAKER RECOGNITION
G. JANG, T. LEE, Y. OH
We apply independent component analysis (ICA) to the problem of extracting an optimal basis that provides efficient features for a speaker. The basis functions learned by the algorithm are oriented and localized in both space and frequency, bearing a resemblance to Gabor functions. The speech segments are assumed to be generated by a linear combination of the basis functions; the distribution of a speaker's speech segments is thus modeled by a basis, calculated so that its components are mutually independent on the given training data. To assess the efficiency of the basis functions, we performed speaker classification experiments and compared our results with the conventional Fourier basis. Our results show that the proposed method is more efficient than the conventional Fourier-based features, in that it obtains a higher classification rate.
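The linear generative assumption can be sketched with a trivial orthonormal basis; the real method uses an ICA basis learned from speech segments, and its analysis step uses the learned inverse (filter) matrix rather than plain dot products:

```python
# Sketch: segment = linear combination of basis functions.
# The identity basis and 2-sample "segment" are purely illustrative.

def analyze(segment, basis):
    """Project a segment onto orthonormal basis functions (dot products)."""
    return [sum(b_i * s_i for b_i, s_i in zip(b, segment)) for b in basis]

def synthesize(coeffs, basis):
    """segment ~= sum_k coeffs[k] * basis[k]: the linear generative model."""
    n = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(n)]

basis = [[1.0, 0.0], [0.0, 1.0]]  # trivial orthonormal basis for illustration
seg = [0.5, -0.25]
coeffs = analyze(seg, basis)
recon = synthesize(coeffs, basis)
```

The claimed efficiency gain is that an ICA-adapted basis concentrates a speaker's segments into fewer, statistically independent coefficients than a fixed Fourier basis does.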