Description of BUT systems for SRE-06 - Speaker Recognition evaluation
====================================================================
Pavel Matejka, Lukas Burget, Petr Schwarz, Ondrej Glembek, Martin Karafiat,
Frantisek Grezl and Jan Cernocky
Speech@FIT, Faculty of Information Technology, Brno University of Technology,
Czech Republic

- BUT01 (primary): fusion of 6 systems: GMM, SVM-GMM and CMLLR+MLLR systems,
  all in two variants: with and without t-norm
- BUT02: fusion of 3 systems: GMM, SVM-GMM and CMLLR+MLLR systems, all with
  t-norm
- BUT03: GMM without t-norm

###########################################
#################  GMM  ###################
###########################################

Segmentation:
Speech/silence segmentation is performed by our Hungarian phoneme recognizer
[1,2], where all phoneme classes are linked to the 'speech' class. Segments
labelled 'speech' or 'silence' are generated. Post-processing with two rules
based on the short-time energy of the signal is applied:
(1) If the average energy in a 'speech' segment is 30 dB below the maximum
    energy of the utterance, the segment is labelled as silence.
(2) If the energy in the other channel is higher than the maximum energy in
    the processed channel minus 3 dB, the segment is also labelled as
    silence.

Feature extraction:
12 MFCC coefficients plus C0 are computed; cepstral mean subtraction and
short-time Gaussianization over 300 frames are applied. RASTA filtering of
the features is used. First, second and third order derivatives computed over
5 frames are appended to each feature vector, which results in dimensionality
52. A Gaussian mixture model (GMM) with 2048 Gaussian components is trained
on this data. HLDA (where individual GMM mixture components are considered as
classes) is used to decorrelate the features and reduce the dimensionality
from 52 to 39. A universal background model (UBM) with 2048 Gaussians is
trained on the features projected into the HLDA space.

Feature mapping with 14 models adapted from the UBM for different conditions
is used: 6 models were adapted for 3 channels (cell, cord, stnd) and 2
genders, given the labels from the 2004 test data. The remaining 8 models
were initially adapted for 4 channels (cdma, cord, elec, gsmc) and 2 genders
on labels obtained from TNO's channel recognition output (from the 2005 SRE).
However, these 8 models were then iteratively used to re-cluster the training
data in an unsupervised fashion and re-adapted on the new clustering
(20 iterations led to stability).

Modeling and testing:
GMM models adapted from the UBM by MAP adaptation were used to model the
target speakers (only means were adapted). Relevance factor 19 was used for
the MAP adaptation. For each trial, the target model and the UBM are adapted
to the channel of the test segment using simple channel adaptation, where the
mean super-vector M is adapted as M_a = M + Vx. Here, V is the eigenchannel
space matrix estimated in the same way as for the SVM-GMM system (see the
description below). The weight vector x, which is assumed to be normally
distributed, is obtained by maximizing p(data | M + Vx) p(x) using one
iteration of the EM algorithm. The final score for each trial is the
log-likelihood ratio: log p(data | M_a) - log p(data | UBM_a), where data is
the test segment, M_a is the channel-adapted target speaker model and UBM_a
is the channel-adapted UBM. Note that for the T-normalized version of the GMM
system, each T-norm model is also adapted to the channel of the tested
segment.
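For illustration, the following numpy sketch shows one way to compute the
channel factor x and the adapted super-vector M_a = M + Vx using the standard
single-EM-iteration closed-form MAP point estimate. It only illustrates the
principle described above; the names, statistics layout and implementation
details are assumptions, not the code used in the submission.

import numpy as np

def estimate_channel_factor(V, N, F, inv_covs):
    # One EM iteration for x maximizing p(data | M + Vx) p(x), with p(x) = N(0, I).
    #   V        : (C*D, R) eigenchannel matrix (R channel factors)
    #   N        : (C,)     zero-order (occupation) statistics of the test segment
    #   F        : (C, D)   first-order statistics, centered around the model means
    #   inv_covs : (C, D)   inverse diagonal covariances of the GMM components
    w = (N[:, None] * inv_covs).reshape(-1)           # per-dimension precision weights
    A = np.eye(V.shape[1]) + V.T @ (w[:, None] * V)   # posterior precision of x
    b = V.T @ (inv_covs.reshape(-1) * F.reshape(-1))  # projected first-order statistics
    return np.linalg.solve(A, b)                      # MAP point estimate of x

def adapt_supervector(M, V, x):
    # Shift the mean super-vector by the channel offset: M_a = M + Vx.
    return M + V @ x

The trial score is then the log-likelihood ratio
log p(data | M_a) - log p(data | UBM_a) computed with the adapted means, as
stated above.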
Real time factor for training: 4.0
Real time factor for test: 4.0
Memory requirement for testing is 200 MB when all t-norm models are loaded
into memory.

###########################################
################ SVM-GMM  ################
###########################################

This system is also based on GMMs, but their means are classified by support
vector machines (SVM).

GMM modeling:
The feature extraction and UBM training are done in the same way as above,
but the UBM has only 512 Gaussian components. The UBM means are adapted by
MAP adaptation to each training, test and background segment (relevance
factor 19 was used in the MAP adaptation). Each training, test and background
segment is represented by the means of its Gaussian components: each mean is
normalized by the corresponding standard deviation. All normalized mean
vectors of all GMM mixture components are then concatenated to form one
super-vector.

Nuisance attribute projection (NAP) is used to remove unwanted channel
variability. NAP is based on the eigenchannel space given by the eigenvectors
of the average within-class covariance matrix, where each class is
represented by super-vectors estimated on different segments spoken by the
same speaker (estimated on SRE 2004 data). Rank normalization was also used
(the target feature distribution was trained on impostor speaker data).

SVM training and classification:
The SVM used to classify the mean super-vectors uses a linear kernel. It is
trained on one positive example from the target speaker. The negative
examples are taken from 2002 SRE data (260 speakers) and from Fisher1
(2606 speakers). In testing, each trial is scored by the respective SVM. The
SVM training and scoring were done with the LibSVM library [3].

Real time factor for training: 1.0
Real time factor for test: 1.0
Memory requirement for testing is 10 MB.

###########################################
#############  CMLLR+MLLR  ################
###########################################

In this system, the coefficients of constrained maximum likelihood linear
regression (CMLLR) and maximum likelihood linear regression (MLLR) transforms
estimated in an ASR system are classified by SVMs.

Segmentation:
In this system, we used the segmentation from the ASR transcripts provided by
NIST.

Feature extraction:
The ASR features are PLP with C0 and delta coefficients up to third order,
with cepstral mean and variance normalization and HLDA (dimensionality
reduction from 52 to 39).

The recognition system:
The core of the AMI system submitted to NIST RT 2005 [4] was used in the
CMLLR/MLLR work. However, because of the lack of time, we did not generate
our own ASR transcriptions, but used the ASR output provided by NIST. Since
NIST did not provide a pronunciation dictionary, we used the AMI dictionary
and generated the missing pronunciations automatically. With this, we were
able to generate the triphone alignment and to apply VTLN.

CMLLR/MLLR:
CMLLR and MLLR transforms are trained for each speaker. First, CMLLR is
trained with two classes (speech + silence). On top of it, MLLR with three
classes (2 speech classes obtained by automatic clustering on the ASR
training data + silence) is estimated. Both the CMLLR and MLLR transform
matrices are estimated as block-diagonal in 13-coefficient-wide streams (a
reminiscence of the originally used 13 MFCC coefficients).

SVM preprocessing:
The transform matrices from the CMLLR speech class (3x13x13x1+39
coefficients) and the two MLLR speech classes (3x13x13x2+2x39 coefficients)
are concatenated into one vector with 1638 features (a sketch of this
concatenation is given below).
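As a small worked example of the dimensionality quoted above, the numpy
sketch below concatenates the block-diagonal transforms (three 13x13 blocks
plus a 39-dimensional bias per class) into the 1638-dimensional vector. The
function and variable names are illustrative only and do not correspond to
our actual implementation.

import numpy as np

def flatten_transform(blocks, bias):
    # Flatten one block-diagonal transform: three 13x13 blocks + a 39-dim bias
    # = 3*169 + 39 = 546 coefficients.
    assert len(blocks) == 3 and all(b.shape == (13, 13) for b in blocks)
    assert bias.shape == (39,)
    return np.concatenate([b.ravel() for b in blocks] + [bias])

def make_feature_vector(cmllr_speech, mllr_speech_1, mllr_speech_2):
    # Each argument is a (blocks, bias) pair for one class.
    # CMLLR speech class (546) + two MLLR speech classes (2 * 546) = 1638 features.
    v = np.concatenate([flatten_transform(*cmllr_speech),
                        flatten_transform(*mllr_speech_1),
                        flatten_transform(*mllr_speech_2)])
    assert v.shape == (1638,)
    return v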
NAP is estimated on all usable SRE 2004 data (1-side, 3-side, 8-side and
16-side) and rank normalization is trained on impostor speakers.

SVM:
The same SVM classification as above is used. The impostor data (310
speakers) was taken from NIST 2004, as this set contains the ASR transcripts.

Real time factor for training: 2.5
Real time factor for test: 2.5
Memory requirement for testing is 500 MB.

###########################################
################ Fusion  #################
###########################################

We have used fusion based on linear logistic regression developed by Niko
Brummer [5]. We are grateful to Niko for the Matlab implementation of this
function.

References:
[1] Schwarz P., Matejka P. and Cernocky J.: Hierarchical Structures of Neural
    Networks for Phoneme Recognition, in Proc. ICASSP 2006, Toulouse, France,
    May 2006.
[2] Matejka P., Burget L., Schwarz P. and Cernocky J.: Brno University of
    Technology System for NIST 2005 Language Recognition Evaluation, in Proc.
    Odyssey: The Speaker and Language Recognition Workshop, San Juan, Puerto
    Rico, June 2006.
[3] Chih-Chung Chang and Chih-Jen Lin: LIBSVM: a library for support vector
    machines, 2001. Software available at
    http://www.csie.ntu.edu.tw/~cjlin/libsvm
[4] T. Hain et al.: The 2005 AMI System for the Transcription of Speech in
    Meetings, in Proc. Rich Transcription 2005 Spring Meeting Recognition
    Evaluation Workshop, Edinburgh, July 2005.
[5] Niko Brummer and Johan du Preez: Application-Independent Evaluation of
    Speaker Detection, Computer Speech and Language, 2005 (to be published).
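For completeness, the sketch below only illustrates the principle of linear
logistic regression fusion of the per-system scores described in the Fusion
section; the submitted systems used Niko Brummer's Matlab implementation [5],
and nothing here corresponds to that code. The scikit-learn calls are an
assumed stand-in for training the fusion weights and offset on a labelled
development set of target/non-target trials.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels):
    # dev_scores: (n_trials, n_systems) matrix of per-system scores,
    # dev_labels: 1 for target trials, 0 for non-target trials.
    return LogisticRegression().fit(dev_scores, dev_labels)

def fuse(fuser, eval_scores):
    # Fused score for each trial: the linear log-odds w . s + b.
    return fuser.decision_function(eval_scores)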