Description of BUT systems for SRE-06 - Speaker Recognition evaluation
====================================================================
Pavel Matejka, Lukas Burget, Petr Schwarz, Ondrej Glembek, Martin Karafiat,
Frantisek Grezl and Jan Cernocky
Speech@FIT, Faculty of Information Technology, Brno University of Technology,
Czech Republic

- BUT01 (primary): fusion of 6 systems: GMM, SVM-GMM and CMLLR+MLLR systems,
  all in two variants: with and without t-norm
- BUT02: fusion of 3 systems: GMM, SVM-GMM and CMLLR+MLLR systems, all with
  t-norm
- BUT03: GMM without t-norm

###########################################
#################  GMM  ###################
###########################################

Segmentation:
Speech/silence segmentation is performed by our Hungarian phoneme recognizer
[1,2], where all phoneme classes are linked to the 'speech' class. Segments
labelled 'speech' or 'silence' are generated. Post-processing with two rules
based on the short-time energy of the signal is applied:
(1) If the average energy in a 'speech' segment is 30 dB below the maximum
    energy of the utterance, the segment is labelled as silence.
(2) If the energy in the other channel is higher than the maximum energy in
    the processed channel minus 3 dB, the segment is also labelled as
    silence.

Feature extraction:
12 MFCC coefficients plus C0 are computed; cepstral mean subtraction and
short-time Gaussianization over 300 frames are applied. RASTA filtering of
the features is used. First, second and third order derivatives computed over
5 frames are appended to each feature vector, which results in dimensionality
52. A Gaussian mixture model (GMM) with 2048 Gaussian components is trained
on this data. HLDA (where individual GMM mixture components are considered as
classes) is used to decorrelate the features and reduce the dimensionality
from 52 to 39. A universal background model (UBM) with 2048 Gaussians is
trained on the features projected into the HLDA space.

Feature mapping with 14 models adapted from the UBM for different conditions
is used: 6 models were adapted for 3 channels (cell, cord, stnd) and 2
genders, given the labels from the 2004 test data. The remaining 8 models
were initially adapted for 4 channels (cdma, cord, elec, gsmc) and 2 genders
on labels obtained from TNO's channel recognition output (from the 2005 SRE).
However, these 8 models were then iteratively used to re-cluster the training
data in an unsupervised fashion and re-adapted on the new clustering
(20 iterations led to stability).

Modeling and testing:
GMM models adapted from the UBM by MAP adaptation were used to model the
target speakers (only means were adapted). Relevance factor 19 was used for
the MAP adaptation. For each trial, the target model and the UBM are adapted
to the channel of the test segment using simple channel adaptation, where the
mean super-vector M is adapted as M_a = M + Vx. Here, V is the eigenchannel
space matrix estimated in the same way as for the SVM-GMM system (see the
description below). The weight vector x, which is assumed to be normally
distributed, is obtained by maximizing p(data | M + Vx) p(x) using one
iteration of the EM algorithm. The final score for each trial is the
log-likelihood ratio: log p(data | M_a) - log p(data | UBM_a), where data is
the test segment, M_a is the channel-adapted target speaker model and UBM_a
is the channel-adapted UBM. Note that for the T-normalized version of the GMM
system, each T-norm model is also adapted to the channel of the tested
segment.
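For illustration, the following numpy sketch shows one way to compute the
channel factor x and the adapted super-vector M_a = M + Vx using the standard
single-EM-iteration closed-form MAP point estimate. It only illustrates the
principle described above; the names, statistics layout and implementation
details are assumptions, not the code used in the submission.

import numpy as np

def estimate_channel_factor(V, N, F, inv_covs):
    # One EM iteration for x maximizing p(data | M + Vx) p(x), with p(x) = N(0, I).
    #   V        : (C*D, R) eigenchannel matrix (R channel factors)
    #   N        : (C,)     zero-order (occupation) statistics of the test segment
    #   F        : (C, D)   first-order statistics, centered around the model means
    #   inv_covs : (C, D)   inverse diagonal covariances of the GMM components
    w = (N[:, None] * inv_covs).reshape(-1)           # per-dimension precision weights
    A = np.eye(V.shape[1]) + V.T @ (w[:, None] * V)   # posterior precision of x
    b = V.T @ (inv_covs.reshape(-1) * F.reshape(-1))  # projected first-order statistics
    return np.linalg.solve(A, b)                      # MAP point estimate of x

def adapt_supervector(M, V, x):
    # Shift the mean super-vector by the channel offset: M_a = M + Vx.
    return M + V @ x

The trial score is then the log-likelihood ratio
log p(data | M_a) - log p(data | UBM_a) computed with the adapted means, as
stated above.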
Real time factor for training: 4.0
Real time factor for test: 4.0
Memory requirement for testing is 200 MB when all t-norm models are loaded
into memory.

###########################################
################ SVM-GMM  ################
###########################################

This system is also based on GMMs, but their means are classified by support
vector machines (SVM).

GMM modeling:
The feature extraction and UBM training are done in the same way as above,
but the UBM has only 512 Gaussian components. The UBM means are adapted by
MAP adaptation to each training, test and background segment (relevance
factor 19 was used in the MAP adaptation). Each training, test and background
segment is represented by the means of its Gaussian components: each mean is
normalized by the corresponding standard deviation. All normalized mean
vectors of all GMM mixture components are then concatenated to form one
super-vector.

Nuisance attribute projection (NAP) is used to remove unwanted channel
variability. NAP is based on the eigenchannel space given by the eigenvectors
of the average within-class covariance matrix, where each class is
represented by super-vectors estimated on different segments spoken by the
same speaker (estimated on SRE 2004 data). Rank normalization was also used
(the target feature distribution was trained on impostor speaker data).

SVM training and classification:
The SVM used to classify the mean super-vectors uses a linear kernel. It is
trained on one positive example from the target speaker. The negative
examples are taken from 2002 SRE data (260 speakers) and from Fisher1
(2606 speakers). In testing, each trial is scored by the respective SVM. The
SVM training and scoring were done with the LibSVM library [3].

Real time factor for training: 1.0
Real time factor for test: 1.0
Memory requirement for testing is 10 MB.

###########################################
#############  CMLLR+MLLR  ################
###########################################

In this system, the coefficients of constrained maximum likelihood linear
regression (CMLLR) and maximum likelihood linear regression (MLLR) transforms
estimated in an ASR system are classified by SVMs.

Segmentation:
In this system, we used the segmentation from the ASR transcripts provided by
NIST.

Feature extraction:
The ASR features are PLP with C0 and delta coefficients up to third order,
with cepstral mean and variance normalization and HLDA (dimensionality
reduction from 52 to 39).

The recognition system:
The core of the AMI system submitted to NIST RT 2005 [4] was used in the
CMLLR/MLLR work. However, because of the lack of time, we did not generate
our own ASR transcriptions, but used the ASR output provided by NIST. Since
NIST did not provide a pronunciation dictionary, we used the AMI dictionary
and generated the missing pronunciations automatically. With this, we were
able to generate the triphone alignment and to apply VTLN.

CMLLR/MLLR:
CMLLR and MLLR transforms are trained for each speaker. First, CMLLR is
trained with two classes (speech + silence). On top of it, MLLR with three
classes (2 speech classes obtained by automatic clustering on the ASR
training data + silence) is estimated. Both the CMLLR and MLLR transform
matrices are estimated as block-diagonal in 13-coefficient-wide streams (a
reminiscence of the originally used 13 MFCC coefficients).

SVM preprocessing:
The transform matrices from the CMLLR speech class (3x13x13x1+39
coefficients) and the two MLLR speech classes (3x13x13x2+2x39 coefficients)
are concatenated into one vector with 1638 features (a sketch of this
concatenation is given below).
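As a small worked example of the dimensionality quoted above, the numpy
sketch below concatenates the block-diagonal transforms (three 13x13 blocks
plus a 39-dimensional bias per class) into the 1638-dimensional vector. The
function and variable names are illustrative only and do not correspond to
our actual implementation.

import numpy as np

def flatten_transform(blocks, bias):
    # Flatten one block-diagonal transform: three 13x13 blocks + a 39-dim bias
    # = 3*169 + 39 = 546 coefficients.
    assert len(blocks) == 3 and all(b.shape == (13, 13) for b in blocks)
    assert bias.shape == (39,)
    return np.concatenate([b.ravel() for b in blocks] + [bias])

def make_feature_vector(cmllr_speech, mllr_speech_1, mllr_speech_2):
    # Each argument is a (blocks, bias) pair for one class.
    # CMLLR speech class (546) + two MLLR speech classes (2 * 546) = 1638 features.
    v = np.concatenate([flatten_transform(*cmllr_speech),
                        flatten_transform(*mllr_speech_1),
                        flatten_transform(*mllr_speech_2)])
    assert v.shape == (1638,)
    return v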
NAP is estimated on all usable SRE 2004 data (1-side, 3-side, 8-side and
16-side) and rank normalization is trained on impostor speakers.

SVM:
The same SVM classification as above is used. The impostor data (310
speakers) was taken from NIST 2004, as this set contains the ASR transcripts.

Real time factor for training: 2.5
Real time factor for test: 2.5
Memory requirement for testing is 500 MB.

###########################################
################ Fusion  #################
###########################################

We have used fusion based on linear logistic regression developed by Niko
Brummer [5]. We are grateful to Niko for the Matlab implementation of this
function.

References:
[1] Schwarz P., Matejka P. and Cernocky J.: Hierarchical Structures of Neural
    Networks for Phoneme Recognition, in Proc. ICASSP 2006, Toulouse, France,
    May 2006.
[2] Matejka P., Burget L., Schwarz P. and Cernocky J.: Brno University of
    Technology System for NIST 2005 Language Recognition Evaluation, in Proc.
    Odyssey: The Speaker and Language Recognition Workshop, San Juan, Puerto
    Rico, June 2006.
[3] Chih-Chung Chang and Chih-Jen Lin: LIBSVM: a library for support vector
    machines, 2001. Software available at
    http://www.csie.ntu.edu.tw/~cjlin/libsvm
[4] T. Hain et al.: The 2005 AMI System for the Transcription of Speech in
    Meetings, in Proc. Rich Transcription 2005 Spring Meeting Recognition
    Evaluation Workshop, Edinburgh, July 2005.
[5] Niko Brummer and Johan du Preez: Application-Independent Evaluation of
    Speaker Detection, Computer Speech and Language, 2005 (to be published).
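For completeness, the sketch below only illustrates the principle of linear
logistic regression fusion of the per-system scores described in the Fusion
section; the submitted systems used Niko Brummer's Matlab implementation [5],
and nothing here corresponds to that code. The scikit-learn calls are an
assumed stand-in for training the fusion weights and offset on a labelled
development set of target/non-target trials.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels):
    # dev_scores: (n_trials, n_systems) matrix of per-system scores,
    # dev_labels: 1 for target trials, 0 for non-target trials.
    return LogisticRegression().fit(dev_scores, dev_labels)

def fuse(fuser, eval_scores):
    # Fused score for each trial: the linear log-odds w . s + b.
    return fuser.decision_function(eval_scores)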