All the systems are based on the ALIZE toolkit, except for the acoustic parameterization, which is performed using the SPRO software.

Description of the UPMC_PRC_BaseLine system
--------------------------------

The UPMC_PRC_BaseLine system is a GMM baseline system.

1. Parameterization

1.1 Front end processing:

The signal is characterized by 32 coefficients: 16 linear frequency cepstral coefficients (LFCC) and their first derivative coefficients, obtained as follows. 24 filter bank coefficients are first computed over 20ms Hamming windowed frames at a 10ms frame rate. Bandwidth is limited to the 300-3400Hz range. Filter bank coefficients are then converted to 16th order cepstral coefficients using a Discrete Cosine Transform.

1.2 Frame removal

Here, the energy coefficients are first normalized by mean removal and variance normalization in order to fit a 0-mean and 1-variance distribution. They are then used to train a three-component GMM, which aims at selecting informative frames. The X% most energetic frames are selected through the GMM, with:

    X = w_1 + (merged * alpha * w_2)

where w_1 is the weight of the highest (energy) Gaussian component, w_2 is the weight of the middle component, "merged" is a binary flag (0 or 1), and "alpha" is a weighting parameter. The value of "merged" is decided according to the likelihood loss when merging Gaussian components 1 and 2, and components 2 and 3. If the loss is higher for components 1 and 2, "merged" is set to 0, otherwise to 1. "alpha" was empirically fixed to 0.0, so X reduces to w_1 in this configuration.

Once the speech segments of a signal are selected, a final process is applied in order to refine the speech segmentation:
- overlapped speech segments between the two sides of a conversation are removed
- morphological rules are applied to the speech segments to avoid segments that are too short, adding or removing some speech frames.

1.3 Parameter normalization

Finally, the parameter vectors are normalized to fit a 0-mean and 1-variance distribution.
The mean and variance estimators used for the normalization are computed file by file on all the frames kept after the frame removal processing.

2. Model Training

All the models are based on the GMM scheme.

2.1 World models

The world modeling relies on three steps: initialization, training and warping. The resulting world models are gender-dependent Gaussian Mixture Models with 512 components and diagonal covariance matrices.

Initialization: The initialization step consists of separating the acoustic space into n classes. This is done via a random selection of frames. More precisely, for a better separation of the initial classes, frames are selected from the entire learning signal according to a probability, followed by one iteration of the EM algorithm to estimate the GMM parameters. The system is tuned so that each class is initialized with about 1000 frames.

Training: The estimation of the world model parameters, the second step of the process, is done with the EM algorithm. For this task, instead of using all the learning signals in their temporal order, we use a probability to select frames randomly. For each world model, 25 iterations of EM are computed as follows:
- during the first 21 iterations, only 10% of the overall number of frames is involved in the EM algorithm; this 10% of frames is selected randomly at each new iteration;
- during the last 4 iterations, the entire signal is used in the classical way, in its temporal order.
Throughout the process, a variance flooring is applied so that no variance value is less than 0.5.

Warping: The overall training process described above does not respect the normalization hypothesis induced by the parameterization step, i.e. a global 0-mean and 1-variance distribution. The variance flooring may be a first reason for the mismatch between the obtained distribution and the expected one. To cancel this mismatch, a model warping technique is finally applied to the world model.
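As a rough illustration of the training schedule described above (25 EM iterations, the first 21 on a random 10% of the frames redrawn at each iteration, the last 4 on all frames, with a variance floor of 0.5), the following Python/NumPy sketch may help. It is not the ALIZE implementation: the function names em_step and train_world_model, the per-frame random draw used for the selection, and the initial parameters passed in are all assumptions made for illustration. The initialization and warping steps are not shown.

```python
import numpy as np

def em_step(frames, weights, means, variances, var_floor=0.5):
    """One EM iteration for a diagonal-covariance GMM (illustrative only)."""
    # E-step: log(weight * Gaussian density) for each frame and component.
    diff = frames[:, None, :] - means[None, :, :]
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_prob = log_prob + np.log(weights)
    log_prob -= log_prob.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(log_prob)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and variances.
    counts = resp.sum(axis=0) + 1e-10
    new_weights = counts / counts.sum()
    new_means = (resp.T @ frames) / counts[:, None]
    second_moment = (resp.T @ frames ** 2) / counts[:, None]
    # Variance flooring: no variance is allowed below var_floor.
    new_variances = np.maximum(second_moment - new_means ** 2, var_floor)
    return new_weights, new_means, new_variances

def train_world_model(frames, weights, means, variances,
                      n_iter=25, n_subsampled=21, fraction=0.1, seed=0):
    """Run the 21 + 4 iteration schedule (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    for it in range(n_iter):
        if it < n_subsampled:
            # Assumption: each frame is kept with probability `fraction`,
            # and the selection is redrawn at every iteration.
            batch = frames[rng.random(len(frames)) < fraction]
        else:
            # Last iterations: all frames, in their original order.
            batch = frames
        weights, means, variances = em_step(batch, weights, means, variances)
    return weights, means, variances
```

The E-step above builds a frames x components x dimensions array, which is only practical for small examples; a real 512-component model over hours of speech would need to process frames in chunks.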
2.2 Client model

Client models are derived by Bayesian adaptation (1 iteration of the Maximum A Posteriori method) of the world model. Only the means of each Gaussian are adapted. The amount of adaptation for each mean depends on the amount of data available; this is achieved via a relevance factor of 14.

Because of this speaker modeling process, the normalization hypothesis is no longer respected in the resulting speaker GMM. The mismatch between the obtained distribution and the expected one (0-mean and 1-variance distribution) is mainly caused by the lack of adaptation data. To cancel this mismatch, all the speaker models are normalized, at each iteration, by applying the warping technique. Only the means of the GMM models are warped here.

3. Test, normalization, and decision

The speaker detection test relies on a log-likelihood ratio, computed on the 10 best Gaussian components. The classical TNorm normalization technique is then applied to each test likelihood ratio. Finally, the TNormed log-likelihood scores are compared with a threshold to make the decision. This threshold is gender independent and set at the best DCF point estimated on the SRE'05 development set. The threshold for the trials is fixed to 2.47.

4. Corpora used

The corpora used for world training and for normalization are those covered by the 2006 NIST speaker recognition evaluation license agreement:
* NIST'04 corpus
* NIST'05 corpus

The gender-dependent world models are trained using 6h and 8h of speech for the male and female models respectively. Gender-dependent TNorm populations are extracted from the NIST'04 corpus: 76 speakers for the male population and 113 for the female population.

5. Computational time

All the computations were done on a single 3GHz processor.
Gender-dependent world model:   23h00
Gender-dependent TNorm model:   4h
Client model train (1conv4w):   22h00
Trials (1conv4w):               34h00
Tests TNorm (1conv4w):          53h

************************************************************************************************************************
Description of the UPMC_PRC_EvolFilerBank system
************************************************************************************************************************

The UPMC_PRC_EvolFilerBank system differs from the UPMC_PRC_BaseLine system in its parameterization method.

1. Parameterization

1.1 Front end processing:

The signal is characterized by 32 coefficients: 16 frequency cepstral coefficients (LFCC) and their first derivative coefficients, obtained as follows. 24 filter bank coefficients are first computed over 20ms Hamming windowed frames at a 10ms frame rate. Bandwidth is limited to the 300-3400Hz range. The particularity of this feature extractor is that its filter bank was designed to produce features that are as decorrelated as possible from the features obtained with a linearly scaled filter bank. This filter bank was obtained by a data-driven algorithm. The following table gives the center frequency and the bandwidth of each triangular filter in the bank.

Center frequency (Hz)    Bandwidth (Hz)
       392.2                188.2353
       439.2                 31.3726
       502.0                219.6078
       596.1                156.8628
       752.9                470.5883
       941.2                282.3529
      1051.0                188.2353
      1129.4                156.8627
      1239.2                282.3529
      1364.7                250.9804
      1490.2                282.3529
      1600.0                156.8628
      1709.8                282.3529
      1851.0                282.3529
      1976.5                219.6078
      2117.6                345.0980
      2258.8                219.6079
      2447.1                564.7059
      2635.3                188.2353
      2760.8                345.0980
      2949.0                407.8431
      3074.5                125.4902
      3152.9                188.2353
      3262.7                282.3530

Filter bank coefficients are then converted to 16th order cepstral coefficients using a Discrete Cosine Transform.
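The conversion from 24 filter bank coefficients to 16 cepstral coefficients mentioned above can be sketched in Python/NumPy as follows. This is a minimal sketch, not the SPRO code: the function name filterbank_to_cepstra is hypothetical, and the DCT-II form, the sqrt(2/N) scaling, and the choice to keep coefficients 1 to 16 (skipping the 0th) are assumptions. The input is assumed to be log filter bank energies; the first derivative coefficients are not computed here.

```python
import numpy as np

def filterbank_to_cepstra(log_energies, n_ceps=16):
    """Convert log filter bank energies to cepstral coefficients via a DCT-II.

    log_energies: array of shape (n_filters,) or (n_frames, n_filters).
    Returns an array whose last axis has n_ceps entries.
    """
    log_energies = np.asarray(log_energies, dtype=float)
    n_filters = log_energies.shape[-1]
    # Cepstral index i = 1..n_ceps (assumption: the 0th coefficient is skipped).
    i = np.arange(1, n_ceps + 1)[:, None]
    # Filter index j = 1..n_filters.
    j = np.arange(1, n_filters + 1)[None, :]
    # DCT-II basis, shape (n_ceps, n_filters); the scaling is an assumption.
    basis = np.sqrt(2.0 / n_filters) * np.cos(np.pi * i * (j - 0.5) / n_filters)
    return log_energies @ basis.T
```

For one frame, `filterbank_to_cepstra(np.zeros(24))` returns 16 values; passing an array of shape (n_frames, 24) returns one row of 16 coefficients per frame.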
The rest of this system is strictly identical to the UPMC_PRC_BaseLine system, except for the threshold, which is fixed to 2.59.

************************************************************************************************************************
Description of the UPMC_PRC_PRIMARY system
************************************************************************************************************************

This system is a fusion of the UPMC_PRC_BaseLine and the UPMC_PRC_EvolFilerBank systems. The fusion consists of a weighted sum of the two systems' outputs. The weights are 0.8 for UPMC_PRC_BaseLine and 0.2 for UPMC_PRC_EvolFilerBank. They were determined empirically in order to minimize the DCF cost on the NIST'05 evaluation database. For this system, the threshold is fixed to 2.45.
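The fusion and decision described above amount to a few lines of arithmetic, sketched here in Python. The function name fuse_and_decide is hypothetical, and the decision rule (accept the trial when the fused score is greater than or equal to the threshold) is an assumption, since the text only states that scores are compared with a threshold.

```python
def fuse_and_decide(baseline_score, evol_score,
                    w_baseline=0.8, w_evol=0.2, threshold=2.45):
    """Weighted-sum fusion of two system scores, followed by a decision.

    Returns the fused score and True if the trial is accepted.
    Assumption: a trial is accepted when fused score >= threshold.
    """
    fused = w_baseline * baseline_score + w_evol * evol_score
    return fused, fused >= threshold
```

For example, scores of 3.0 from UPMC_PRC_BaseLine and 1.0 from UPMC_PRC_EvolFilerBank give a fused score of about 2.6, which is above the 2.45 threshold.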