2006 Speaker Recognition Evaluation
LIMSI-CNRS primary system description

The LIMSI-CNRS primary speaker recognition system is built upon six sub-systems: 2 MFCC-GMM, 2 MFCC-SVM and 2 MLLR-SVM sub-systems.

1) MFCC-GMM sub-systems
=======================

The LIMSI MFCC-GMM sub-systems, namely the forward and backward sub-systems, are GMM-based systems with MAP-adapted UBMs. In the forward approach, the features extracted from the test speech are matched against GMM models trained on the training speech (i.e. the conventional approach). In the backward approach, the features extracted from the training speech are matched against GMM models trained on the test speech. Feature normalization is performed using feature mapping and feature warping. Score normalization is performed with T-norm.

Data
----
Gender-dependent UBMs are trained using SRE'00 (landline), SRE'01 (cellular) and SRE'03 (cellular) evaluation data. T-norm is performed using 500 speakers from the Fisher corpus.

Features
--------
PLP-like features are extracted from the speech signal every 10ms using a 30ms window, estimated on the 0-3.8kHz bandwidth. The feature vector consists of 15 MEL-PLP cepstrum coefficients, 15 delta coefficients plus the delta energy, and 15 delta-delta coefficients plus the delta-delta energy, for a total of 47 features.

Frame selection
---------------
Frame selection is performed in two ways. For the SRE'06 training and test data, the ASR transcription provided by NIST is used to select the segments containing speech; segments containing speech in the opposite channel are further filtered out. For the SRE'00, SRE'01 and SRE'02 data used in UBM training and the Fisher data used in T-norm, a 2-state HMM speech detector is used instead of the ASR transcription; the "silence" state is modeled by a 512-mixture GMM whereas the "speech" state is modeled by a 2048-mixture GMM. The feature vector is composed of 12 MEL-PLP cepstrum coefficients, 12 delta and 12 delta-delta coefficients plus the delta energy and delta-delta energy (i.e.
38 coefficients). In both cases, the 10% of the remaining frames with the lowest energy are then filtered out.

Normalization
-------------
Channel compensation for cellular, landline carbon and landline electret data is performed for both genders using feature mapping [1]. A root GMM is trained on a subset of the UBM data, and its means are MAP-adapted for each of the channel and gender conditions. After feature mapping, feature warping [2] is performed over a sliding window of about 3 seconds.

UBM
---
Each of the two gender-dependent background models is a 1536-mixture GMM, formed by merging three 512-Gaussian GMMs trained on cellular, landline electret and landline carbon data respectively.

Speaker Modeling
----------------
For each speaker, a speaker-specific Gaussian mixture model (GMM) is trained by MAP adaptation of the Gaussian means of the corresponding gender background model, using 3 iterations of the EM algorithm.

Scoring
-------
Each test segment is scored against each proposed target model. For a given test segment X and a target model M, the decision score S(X,M) is computed as follows:

  S(X,M) = [log f'(X|M) - mu(X)] / sigma(X)

where f'(X|M) is the likelihood of the speech segment for the given model normalized by the segment duration L(X), i.e. f'(X|M) = f(X|M)^(1/L(X)), and mu(X) and sigma(X) are the mean and standard deviation of the scores of test segment X against a set of impostor cohort models, following the T-norm method [3]. For each test segment, the gender-matched speakers from the T-norm set are used, and only the 90% best scores are kept for computing the statistics mu(X) and sigma(X). The gender-dependent UBM is used to select the top 20 Gaussians; the target model and T-norm models are scored using only these top 20 Gaussians.
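The T-norm computation above can be sketched as follows (a minimal numpy illustration; the function and variable names are ours, and the cohort log-likelihoods are assumed precomputed):

```python
import numpy as np

def tnorm_score(target_llk, cohort_llks, duration, keep_frac=0.9):
    """T-norm as in the Scoring section: normalize the duration-scaled
    log-likelihood of a test segment by the mean and standard deviation
    of its scores against a cohort of impostor models.

    target_llk  : log f(X|M), total log-likelihood under the target model
    cohort_llks : log-likelihoods of the same segment under cohort models
    duration    : segment duration L(X) in frames
    keep_frac   : keep only the best 90% of cohort scores
    """
    # duration normalization: log f'(X|M) = log f(X|M) / L(X)
    s = target_llk / duration
    cohort = np.asarray(cohort_llks, dtype=float) / duration
    # keep only the highest 90% of cohort scores for the statistics
    k = max(1, int(round(keep_frac * cohort.size)))
    best = np.sort(cohort)[-k:]
    mu, sigma = best.mean(), best.std()
    return (s - mu) / sigma
```

In the real system the cohort models are the gender-matched T-norm speakers, and both target and cohort likelihoods are computed over the same top-20 Gaussians selected by the UBM.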
2) MFCC-SVM sub-systems
=======================

The LIMSI MFCC-SVM speaker recognition sub-systems are bidirectional train-test SVM-based modeling systems that include polynomial feature extraction, feature reduction and normalization in the feature extraction process. The Data, Features, Frame Selection and Normalization steps are exactly the same as for the MFCC-GMM sub-systems.

Polynomial Feature Extraction
-----------------------------
The PLP-like features are first transformed into high-dimensional polynomial features and then variance normalized. These are computed as a monomial expansion of the PLP features up to order n, variance normalized and finally averaged over the whole sentence to make up a single high-dimensional vector. We used an expansion up to the 3rd order, meaning that the cepstral features are appended with the second- and third-order monomials. For the submitted feature set-up this yields 20824-dimensional vectors.

Kernel Principal Component Analysis
-----------------------------------
Dimensionality reduction is performed by means of Kernel PCA, keeping all of the eigenvectors, that is, 3197 (the number of training samples minus one). The kernel used is a 0-offset cumulative version of the polynomial kernel. The training data for KPCA is the impostor speaker set.

Min-Max Normalization
---------------------
A linear transformation is applied to the features to fit them into the range [-1,1]. The minimum and maximum values are taken from the impostor speaker set.

Impostor Speaker Set
--------------------
An impostor speaker set is collected for our discriminant modeling framework. These speakers were taken from the training data of the past NIST SRE evaluations of 1999, 2000, 2001, 2002 and 2004, making up a total of 3198 speakers.

Speaker Modeling and Scoring
----------------------------
A speaker-specific linear soft-margin SVM model is trained for each of the target speakers, for later test-vs-train scoring.
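The monomial expansion, sentence averaging and min-max normalization described above can be sketched as follows (a minimal numpy illustration; the helper names are ours, and the variance-normalization statistics are simplified to per-sentence estimates rather than background-set estimates):

```python
import itertools
import numpy as np

def monomial_expand(frame, order=3):
    """All monomials x_i, x_i*x_j, x_i*x_j*x_k (i <= j <= k) of the frame
    features up to the given order; the first-order terms are included, so
    the cepstral features are appended with the higher-order ones."""
    feats = []
    for n in range(1, order + 1):
        for idx in itertools.combinations_with_replacement(range(len(frame)), n):
            feats.append(np.prod(frame[list(idx)]))
    return np.array(feats)

def sentence_vector(frames, order=3):
    """Expand every frame, variance normalize, then average over the
    sentence to obtain a single high-dimensional vector per utterance."""
    expanded = np.array([monomial_expand(f, order) for f in frames])
    # per-dimension variance normalization (simplified: per-sentence stats)
    std = expanded.std(axis=0)
    expanded = expanded / np.where(std > 0, std, 1.0)
    return expanded.mean(axis=0)

def minmax_normalize(v, lo, hi):
    """Map features linearly into [-1, 1]; lo and hi are the per-dimension
    minima and maxima estimated on the impostor speaker set."""
    return 2.0 * (v - lo) / (hi - lo) - 1.0
```

For a 2-dimensional frame, the 3rd-order expansion yields 9 monomials; the dimensionality grows combinatorially with the number of input features, which is why KPCA-based reduction follows.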
In a similar way, a linear soft-margin SVM model is trained for each of the test speakers, for later train-vs-test scoring. This results in two separate score files, which are later combined by the score fusion system.

3) MLLR-SVM sub-systems
=======================

The LIMSI MLLR-SVM speaker recognition sub-systems are bidirectional train-test SVM-based modeling systems with normalized MLLR transforms as features. These sub-systems are exactly the same as the MFCC-SVM ones, except for the MLLR feature extraction; the rest of the processing remains the same.

Maximum-Likelihood Linear Regression Features
---------------------------------------------
In an iterative manner, the background speaker cepstral features are used to train a GMM-UBM model. Single-class CMLLR (tied mean and variance) speaker adaptation is used to adapt the background, target and test speaker cepstral features to the UBM model. The resulting transform matrix is stacked column-wise and the offset vector is appended to it, yielding 2256-dimensional feature vectors. [-1,1] min-max normalization is applied, using the SVM impostor speaker set to estimate the normalization parameters. New cepstral files are then computed using the existing MLLR transforms for the background speakers, and a new UBM model is trained. This process is iterated 4 times.

4) Score fusion of sub-systems
==============================

The score from each sub-system is normalized using the score statistics of the SRE'05 trials, and the mean of the 6 normalized scores is computed. The decision threshold on the mean score was optimized so as to obtain the lowest cost function on the SRE'05 data.

References
----------
[1] D. Reynolds, "Channel robust speaker verification via feature mapping", ICASSP 2003.
[2] J. Pelecanos and S. Sridharan, "Feature warping for speaker verification", Proc. Odyssey 2001.
[3] R. Auckenthaler, M. Carey and H. Lloyd-Thomas, "Score normalization for text-independent speaker verification systems", Digital Signal Processing, vol. 10, pp. 42--54, 2000.
[4] S. S. Kajarekar, "Four weightings and a fusion: A cepstral-SVM system for speaker recognition", ICASSP 2003.
[5] B. Schölkopf, A. Smola and K.-R. Müller, "Kernel Principal Component Analysis", Advances in Kernel Methods - Support Vector Learning, 1999.
[6] A. Stolcke, L. Ferrer, S. Kajarekar, E. Shriberg and A. Venkataraman, "MLLR transforms as features in speaker recognition", Interspeech 2005.