LIMSI-CNRS (Orsay, France) and KIT (Karlsruhe, Germany), already collaborating on speaker recognition tasks within the French-funded QUAERO project, joined forces with CMU (Pittsburgh, USA) for a CMU+LIMSI+KIT submission to the NIST SRE 2010 core condition.

1) MFCC-GMM system: three systems (LS1, LS2, LS3) that differ only in the corpora used for training and normalization:

  - Voiced frames are detected by using pitch detection (ESPS get_f0) combined with an energy threshold
  - 15 PLP + 15 Delta PLP + 15 Delta-Delta PLP + 1 Delta Energy + 1 Delta-Delta Energy (47 features)
  - Feature warping
  - Gaussian mixture model
        - 256 Gaussians
        - gender-dependent UBMs were trained using
          + SRE04 telephone data (LS1)
          + SRE04 telephone data + SRE05, SRE06 microphone data (LS2)
          + SRE05, SRE06 microphone data (LS3)
        - MAP-adaptation of gender-dependent UBMs
  - Gender-dependent factor analysis with ALIZE toolkit
        - Rank 40, trained using
          + SRE06 telephone data (LS1): 713 speakers (407f + 306m)
          + SRE05 microphone, SRE06 telephone + microphone data (LS2): 890 speakers (502f + 388m) 
          + SRE05, SRE06 microphone (LS3): 177 speakers (95f + 82m)
        - Symmetrical compensation, performed in both the model domain and the feature domain
  - Normalization
        - Gender-dependent zt-norm using 
          + SRE06 telephone data (LS1): 276 female segments + 218 male segments
          + SRE06 telephone data, SRE05+SRE06 microphone data (LS2): 466 female segments + 382 male segments
          + SRE05+SRE06 microphone data (LS3): 285 female segments + 246 male segments
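
The zt-norm step above can be illustrated with a minimal sketch. It assumes the common convention that zt-norm means z-norm followed by t-norm, with the t-norm cohort scores themselves z-normalized. The function names, argument layout and the use of the population standard deviation are assumptions made for illustration, not details taken from the LS1-LS3 systems.

```python
from statistics import mean, pstdev


def znorm(score, reference_scores):
    """Shift and scale a score by the mean and standard deviation of reference scores.

    Assumes the reference scores are not all identical (non-zero deviation).
    """
    return (score - mean(reference_scores)) / pstdev(reference_scores)


def ztnorm(raw, model_vs_z, cohort_vs_test, cohort_vs_z):
    """Hypothetical zt-norm of one trial score.

    raw            : score of the target model against the test segment
    model_vs_z     : scores of the target model against the z-norm impostor segments
    cohort_vs_test : scores of each t-norm cohort model against the test segment
    cohort_vs_z    : for each cohort model, its scores against the z-norm impostor segments
    """
    # Step 1: z-norm the trial score with the target model's impostor statistics.
    z_raw = znorm(raw, model_vs_z)
    # Step 2: z-norm each cohort score with that cohort model's own impostor statistics.
    z_cohort = [znorm(s, zs) for s, zs in zip(cohort_vs_test, cohort_vs_z)]
    # Step 3: t-norm the z-normalized trial score against the z-normalized cohort scores.
    return znorm(z_raw, z_cohort)
```

In this sketch the gender dependence mentioned above would simply mean calling ztnorm with impostor segments and cohort models of the same gender as the target speaker.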

2) CMU MFCC-GMM system: two systems (CS1, CS2) that differ only in their front-end features and VAD strategies:
 
  - Front-end Features:
       o CS1: 20 MFCC + 20 Delta MFCC
       o CS2: 12 MFCC (removing C0 from conventional 13-dimensional MFCC) + 12 Delta MFCC
  - VAD strategy:
       o for telephone conversations:
          + CS1: simple energy-based VAD: the 30% of frames with the lowest energy are excluded
                 as non-speech.
          + CS2: a Gaussian mixture model with 3 components is fitted to the C0 feature, with the
                 components ordered by increasing mean; frames whose C0 value is below the mean of
                 the middle Gaussian are labeled as non-speech.
       o for interview conversations:
          + both systems use the VAD information provided by NIST
  - Feature warping, implementation as in [2]
  - Gaussian mixture model
       o 1024 Gaussians
       o gender-dependent UBMs were trained using SRE06 training data for both CS1 and CS2
  - Gender-dependent Joint Factor Analysis; theory and implementation follow earlier publications [3-6].
       o 300 Eigenvoice factors and 100 Eigenchannel factors are used for both telephone and microphone conditions.
       o trained on subsets of SRE04, SRE05 and SRE06 training data, using utterances from speakers
         with more than 8 conversations (8conv).
       o SRE05 auxiliary microphone data is used for Eigenchannel training for the interview condition.
  - Normalization
       o Gender-dependent z-norm using SRE05 training data helped on the development data.
         However, we were not able to include it in the final submission due to resource limitations.
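
The CS1 energy-based VAD rule for telephone conversations can be sketched as follows. This is an illustration only: the function name, the per-frame energy input and the handling of ties are assumptions, and the actual CMU implementation may differ.

```python
def energy_vad(frame_energies, drop_percent=30):
    """Return a speech/non-speech mask, one boolean per frame.

    The drop_percent of frames with the lowest energy are marked as
    non-speech (False); all other frames are kept as speech (True).
    """
    n = len(frame_energies)
    # Integer arithmetic avoids floating-point rounding in the frame count.
    n_drop = n * drop_percent // 100
    # Frame indices ordered by increasing energy (stable for equal energies).
    order = sorted(range(n), key=lambda i: frame_energies[i])
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(n)]


# Hypothetical per-frame energies for a 10-frame utterance.
mask = energy_vad([5, 1, 9, 3, 7, 2, 8, 4, 6, 10])
speech_frames = [i for i, keep in enumerate(mask) if keep]
```

Because the threshold is relative to each utterance, this rule always discards the same fraction of frames regardless of how much speech the recording actually contains.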

3) GSV-SVM system:

 - SAD using voiced frames detection (similar to LS1-LS3 systems)
 - 15 PLP + 15 Delta PLP + 15 Delta-Delta PLP + 1 Delta Energy + 1 Delta-Delta Energy (47 features)
 - Feature warping
 - Feature mapping
 - Gaussian mean supervectors as features
       - 256 Gaussians
       - Variance normalization
       - Gender dependent
 - NAP, 50 dimensions projected out
 - Min-max feature normalization
 - Linear Kernel SVM classifier
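
As an illustration of the min-max feature normalization step, the sketch below rescales each supervector dimension using the minimum and maximum observed on a training set. The target range [0, 1], the function names and the treatment of constant dimensions are assumptions; the description above does not specify them.

```python
def fit_min_max(vectors):
    """Compute the per-dimension minimum and maximum over a list of equal-length vectors."""
    dims = range(len(vectors[0]))
    lo = [min(v[d] for v in vectors) for d in dims]
    hi = [max(v[d] for v in vectors) for d in dims]
    return lo, hi


def apply_min_max(vector, lo, hi):
    """Rescale each dimension to [0, 1] with respect to the fitted range.

    Dimensions that were constant in the training set are mapped to 0.0.
    Values outside the fitted range fall outside [0, 1]; they are not clipped.
    """
    return [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(vector, lo, hi)]


# Hypothetical 2-dimensional "supervectors" used to fit the ranges.
lo, hi = fit_min_max([[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
scaled = apply_min_max([1.0, 20.0], lo, hi)
```

Putting all dimensions on a comparable scale in this way keeps any single supervector component from dominating the linear kernel.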

4) Score-level System Fusion for the CLIK Primary System

Linear logistic regression, as described in [1], is used for score-level system fusion. The English short2-short3 core trials of NIST SRE08 are used to train the fusion weights and the decision thresholds. Fusion is performed separately for three common evaluation conditions: interview-interview, interview-telephone and telephone-telephone. For both the SRE'08 and SRE'10 training and test data, the interview speech segments (interview/[3|8]min) and the conversational telephone speech segments recorded over a microphone channel (phonecall/mic) are grouped into a single condition, named interview.

Some sub-systems cannot provide valid scores for certain trials, because speech detection failed on the training or test segment or because a transcription is missing. For this reason, an affine calibration transformation is first applied to the scores of each sub-system so that the transformed scores can be interpreted as log-likelihood ratios. The SRE08 short2-short3 core trials for which a sub-system can provide valid scores are used as training data for that sub-system's transformation. After this transformation, the missing scores are replaced with zero, and the fusion can then be trained as usual.
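
The handling of missing scores in the primary fusion can be sketched as follows. The sketch only applies already-trained parameters: the calibration pairs (scale, offset), the fusion weights and the bias are hypothetical values, and the logistic regression training itself is not shown. A missing score is represented by None, which is an assumption of this example.

```python
def calibrate(score, scale, offset):
    """Affine calibration of one sub-system score to a log-likelihood ratio.

    A missing score (None) is replaced by 0.0, i.e. an LLR that favors
    neither the target nor the non-target hypothesis.
    """
    if score is None:
        return 0.0
    return scale * score + offset


def fuse(scores, calibrations, weights, bias):
    """Linear fusion of calibrated sub-system scores for one trial."""
    llrs = [calibrate(s, a, b) for s, (a, b) in zip(scores, calibrations)]
    return bias + sum(w * llr for w, llr in zip(weights, llrs))


# Hypothetical trial with three sub-systems; the second has no valid score.
fused = fuse(
    scores=[2.0, None, -1.0],
    calibrations=[(0.5, 1.0), (2.0, 0.0), (1.0, 0.5)],
    weights=[1.0, 1.0, 2.0],
    bias=0.25,
)
```

Calibrating before substituting zero matters: on a raw, uncalibrated score scale the value zero has no particular meaning, whereas a log-likelihood ratio of zero contributes no evidence to the fused score.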

5) Score-level System Fusion for the CLIK Alternate System

The same linear logistic regression was used for the alternate system. The difference from the fusion in the primary system is that, for trials without valid scores from some of the sub-systems, the score is set to zero at the final step.

References

[1] N. Brummer et al., "Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006", IEEE Trans. on Audio, Speech, and Language Processing, Vol. 15 (7), p. 2072-2084, 2007.

[2] B. Xiang, J. Navratil, G. Ramaswamy, and R. Gopinath, "Short-time Gaussianization for Robust Speaker Verification", ICASSP, 2002.

[3] P. Kenny, P. Ouellet, N. Dehak, V. Gupta and P. Dumouchel, "A Study of Inter-Speaker Variability in Speaker Verification", IEEE Trans. on Audio, Speech and Language Processing, 2008.

[4] L. Burget, P. Matejka, V. Hubeika and J. Cernocky, "Investigation into variants of Joint Factor Analysis for speaker recognition", Interspeech 2009.

[5] D. Matrouf, N. Scheffer, B. Fauve, J. Bonastre, "A Straightforward and Efficient Implementation of the Factor Analysis Model for Speaker Verification", Interspeech 2007.

[6] R. Vogt, S. Sridharan, "Explicit Modelling of Session Variability for Speaker Verification", Computer Speech and Language, Volume 22, Issue 1, pages 17-38, January, 2008.