2006 Speaker Recognition Evaluation
LIMSI-CNRS primary system description

The LIMSI-CNRS primary speaker recognition system is built upon six sub-systems: 2 MFCC-GMM, 2 MFCC-SVM and 2 MLLR-SVM sub-systems.

1) MFCC-GMM sub-systems
=======================

The LIMSI MFCC-GMM sub-systems, namely the forward and backward sub-systems, are GMM-based systems with MAP-adapted UBMs. In the forward approach, the features extracted from the test speech are matched against GMM models trained on the training speech (i.e. the conventional approach). In the backward approach, the features extracted from the training speech are matched against GMM models trained on the test speech. Feature normalization is performed using feature mapping and feature warping. Score normalization is performed with T-norm.

Data
----
Gender-dependent UBMs are trained using SRE'00 (landline), SRE'01 (cellular) and SRE'03 (cellular) evaluation data. T-norm is performed using 500 speakers from the Fisher corpus.

Features
--------
PLP-like features are extracted from the speech signal every 10ms using a 30ms window, estimated on the 0-3.8kHz bandwidth. The feature vector consists of 15 MEL-PLP cepstrum coefficients, 15 delta coefficients plus the delta energy, and 15 delta-delta coefficients plus the delta-delta energy, for a total of 47 features.

Frame selection
---------------
Frame selection is performed in two ways. For the SRE'06 training and test data, the ASR transcription provided by NIST is used to select the segments containing speech; segments containing speech in the opposite channel are further filtered out. For the SRE'00, SRE'01 and SRE'02 data used in UBM training and the Fisher data used in T-norm, a 2-state HMM speech detector is used instead of the ASR transcription; the "silence" state is modeled by a 512-mixture GMM whereas the "speech" state is modeled by a 2048-mixture GMM. The feature vector is composed of 12 MEL-PLP cepstrum coefficients, 12 delta and 12 delta-delta coefficients plus the delta energy and delta-delta energy (i.e.
38 coefficients). In both cases, the 10% of the remaining frames with the lowest energy are then filtered out.

Normalization
-------------
Channel compensation for cellular, landline carbon and landline electret data is performed for both genders using feature mapping [1]. A root GMM is trained on a subset of the UBM data, and its means are MAP-adapted for each of the channel and gender conditions. After feature mapping, feature warping [2] is performed over a sliding window of about 3 seconds.

UBM
---
Each of the two gender-dependent background models is a 1536-mixture GMM, formed by merging three 512-Gaussian GMMs trained on cellular, landline electret and landline carbon data respectively.

Speaker Modeling
----------------
For each speaker, a speaker-specific Gaussian mixture model (GMM) is trained by MAP adaptation of the Gaussian means of the corresponding gender background model, using 3 iterations of the EM algorithm.

Scoring
-------
Each test segment is scored against each proposed target model. For a given test segment X and a target model M, the decision score S(X,M) is computed as follows:

  S(X,M) = [log f'(X|M) - mu(X)] / sigma(X)

where f'(X|M) is the likelihood of the speech segment for the given model normalized by the segment duration L(X), i.e. f'(X|M) = f(X|M)^(1/L(X)), and mu(X) and sigma(X) are the mean and standard deviation of the scores of test segment X against a set of impostor cohort models, following the T-norm method [3]. For each test segment, the gender-matched speakers from the T-norm set are used, and only the 90% best scores are kept for computing the statistics mu(X) and sigma(X). The gender-dependent UBM is used to select the top 20 Gaussians; the target model and T-norm models are scored using only these top 20 Gaussians.
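The T-norm computation above can be sketched as follows (a minimal numpy illustration; the function and variable names are ours, and the cohort log-likelihoods are assumed precomputed):

```python
import numpy as np

def tnorm_score(target_llk, cohort_llks, duration, keep_frac=0.9):
    """T-norm as in the Scoring section: normalize the duration-scaled
    log-likelihood of a test segment by the mean and standard deviation
    of its scores against a cohort of impostor models.

    target_llk  : log f(X|M), total log-likelihood under the target model
    cohort_llks : log-likelihoods of the same segment under cohort models
    duration    : segment duration L(X) in frames
    keep_frac   : keep only the best 90% of cohort scores
    """
    # duration normalization: log f'(X|M) = log f(X|M) / L(X)
    s = target_llk / duration
    cohort = np.asarray(cohort_llks, dtype=float) / duration
    # keep only the highest 90% of cohort scores for the statistics
    k = max(1, int(round(keep_frac * cohort.size)))
    best = np.sort(cohort)[-k:]
    mu, sigma = best.mean(), best.std()
    return (s - mu) / sigma
```

In the real system the cohort models are the gender-matched T-norm speakers, and both target and cohort likelihoods are computed over the same top-20 Gaussians selected by the UBM.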
2) MFCC-SVM sub-systems
=======================

The LIMSI MFCC-SVM speaker recognition sub-systems are bidirectional train-test SVM-based modeling systems that include polynomial feature extraction, feature reduction and normalization in the feature extraction process. The Data, Features, Frame Selection and Normalization steps are exactly the same as for the MFCC-GMM sub-systems.

Polynomial Feature Extraction
-----------------------------
The PLP-like features are first transformed into high-dimensional polynomial features and then variance normalized. These are computed as a monomial expansion of the PLP features up to order n, variance normalized and finally averaged over the whole sentence to make up a single high-dimensional vector. We used an expansion up to the 3rd order, meaning that the cepstral features are appended with the second- and third-order monomials. For the submitted feature set-up this yields 20824-dimensional vectors.

Kernel Principal Component Analysis
-----------------------------------
Dimensionality reduction is performed by means of Kernel PCA, keeping all of the eigenvectors, that is, 3197 (the number of training samples minus one). The kernel used is a 0-offset cumulative version of the polynomial kernel. The training data for KPCA is the impostor speaker set.

Min-Max Normalization
---------------------
A linear transformation is applied to the features to fit them into the range [-1,1]. The minimum and maximum values are taken from the impostor speaker set.

Impostor Speaker Set
--------------------
An impostor speaker set is collected for our discriminant modeling framework. These speakers were taken from the training data of the past NIST SRE evaluations of 1999, 2000, 2001, 2002 and 2004, making up a total of 3198 speakers.

Speaker Modeling and Scoring
----------------------------
A speaker-specific linear soft-margin SVM model is trained for each of the target speakers, for later test-vs-train scoring.
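The monomial expansion, sentence averaging and min-max normalization described above can be sketched as follows (a minimal numpy illustration; the helper names are ours, and the variance-normalization statistics are simplified to per-sentence estimates rather than background-set estimates):

```python
import itertools
import numpy as np

def monomial_expand(frame, order=3):
    """All monomials x_i, x_i*x_j, x_i*x_j*x_k (i <= j <= k) of the frame
    features up to the given order; the first-order terms are included, so
    the cepstral features are appended with the higher-order ones."""
    feats = []
    for n in range(1, order + 1):
        for idx in itertools.combinations_with_replacement(range(len(frame)), n):
            feats.append(np.prod(frame[list(idx)]))
    return np.array(feats)

def sentence_vector(frames, order=3):
    """Expand every frame, variance normalize, then average over the
    sentence to obtain a single high-dimensional vector per utterance."""
    expanded = np.array([monomial_expand(f, order) for f in frames])
    # per-dimension variance normalization (simplified: per-sentence stats)
    std = expanded.std(axis=0)
    expanded = expanded / np.where(std > 0, std, 1.0)
    return expanded.mean(axis=0)

def minmax_normalize(v, lo, hi):
    """Map features linearly into [-1, 1]; lo and hi are the per-dimension
    minima and maxima estimated on the impostor speaker set."""
    return 2.0 * (v - lo) / (hi - lo) - 1.0
```

For a 2-dimensional frame, the 3rd-order expansion yields 9 monomials; the dimensionality grows combinatorially with the number of input features, which is why KPCA-based reduction follows.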
In a similar way, a linear soft-margin SVM model is trained for each of the test speakers, for later train-vs-test scoring. This results in two separate score files, which are later combined by the score fusion system.

3) MLLR-SVM sub-systems
=======================

The LIMSI MLLR-SVM speaker recognition sub-systems are bidirectional train-test SVM-based modeling systems with normalized MLLR transforms as features. These sub-systems are exactly the same as the MFCC-SVM ones, except for the MLLR feature extraction; the rest of the processing remains the same.

Maximum-Likelihood Linear Regression Features
---------------------------------------------
In an iterative manner, the background speaker cepstral features are used to train a GMM-UBM model. Single-class CMLLR (tied mean and variance) speaker adaptation is used to adapt the background, target and test speaker cepstral features to the UBM model. The resulting transform matrix is stacked column-wise and the offset vector is appended to it, yielding 2256-dimensional feature vectors. [-1,1] min-max normalization is applied, using the SVM impostor speaker set to estimate the normalization parameters. New cepstral files are then computed using the existing MLLR transforms for the background speakers, and a new UBM model is trained. This process is iterated 4 times.

4) Score fusion of sub-systems
==============================

The score from each sub-system is normalized using the score statistics of the SRE'05 trials, and the mean of the 6 normalized scores is computed. The decision threshold on the mean score was optimized so as to obtain the lowest cost function on the SRE'05 data.

References
----------
[1] D. Reynolds, "Channel robust speaker verification via feature mapping", ICASSP 2003.
[2] J. Pelecanos and S. Sridharan, "Feature warping for speaker verification", Proc. Odyssey 2001.
[3] R. Auckenthaler, M. Carey and H. Lloyd-Thomas, "Score normalization for text-independent speaker verification systems", Digital Signal Processing, vol. 10, pp. 42--54, 2000.
[4] S. S. Kajarekar, "Four weightings and a fusion: A cepstral-SVM system for speaker recognition", ICASSP 2003.
[5] B. Schölkopf, A. Smola and K.-R. Müller, "Kernel Principal Component Analysis", Advances in Kernel Methods - Support Vector Learning, 1999.
[6] A. Stolcke, L. Ferrer, S. Kajarekar, E. Shriberg and A. Venkataraman, "MLLR transforms as features in speaker recognition", Interspeech 2005.