CST System Description

- Front end processing
In this system, the speech signal is parameterized into 16-dimensional MFCCs plus deltas, computed over 20 ms Hamming-windowed frames at a 10 ms frame rate. The bandwidth is limited to the 100-3800 Hz range. Filter-bank coefficients are converted to 16th-order cepstral coefficients using a discrete cosine transform. A pitch-strength-based VAD is then performed on the speech signal. Finally, after the VAD processing, the parameter vectors are normalized to a zero-mean, unit-variance distribution.

- Speech segmentation and clustering
In this system, a speaker segmentation algorithm based on the log-likelihood ratio score (LLRS) against a universal background model (UBM) is used first: over a short period, the score difference of two speech segments matched against the UBM decides whether these two segments belong to the same speaker. A speaker clustering algorithm based on differential model scores is then performed: after segmentation, the score difference between speaker models is computed for each speech segment, and at each iteration the segment with the maximum differential score is selected to update the corresponding speaker model.

- Background model
This system is based on GMM-UBM. Two gender-dependent UBMs are trained, for male and female speakers respectively, each consisting of 1,204 Gaussian components. The UBMs are trained with data selected from the NIST SRE'04 dataset, balanced for transmission channel and microphone type. The training data for each UBM contains around 316M MFCC feature vectors (after silence removal).

- Training
Each target speaker model is adapted from the gender-specific UBM with MAP, adapting the means only.

- Score normalization
T-norm is performed on the verification scores. The cohort speakers for T-norm are selected from NIST SRE'04, including 368 cohort speakers for female trials and 248 cohort speakers for male trials.
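The framing and feature normalization steps of the front end can be sketched as follows. This is a minimal illustration, not the system's actual code: the 8 kHz sampling rate is an assumption inferred from the 100-3800 Hz bandwidth (telephone speech), and the full MFCC pipeline (filter bank, DCT, deltas) is omitted for brevity.

```python
import numpy as np

def frame_signal(x, fs=8000, win_ms=20, hop_ms=10):
    """Slice a waveform into overlapping Hamming-windowed frames
    (20 ms window, 10 ms hop, as in the front end described above).
    fs=8000 is an assumed telephone-speech sampling rate."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - win) // hop)
    w = np.hamming(win)
    return np.stack([x[i * hop : i * hop + win] * w for i in range(n_frames)])

def cmvn(feats):
    """Normalize each feature dimension to zero mean and unit variance,
    as applied after VAD in the description. feats: (T, D)."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
```

In practice the normalization would be applied only to the frames retained by the VAD.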
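The LLRS-based segmentation decision can be sketched as below. This is a hypothetical reading of the description, assuming the UBM is a diagonal-covariance GMM and segments are compared by the difference of their average per-frame UBM log-likelihoods; the system's exact scoring and windowing details are not given.

```python
import numpy as np

def segment_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of a segment (T, D) under a
    diagonal-covariance GMM with M components."""
    diff = frames[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=2))    # (T, M)
    m = log_comp.max(axis=1, keepdims=True)
    ll = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))       # log-sum-exp
    return ll.mean()

def same_speaker(seg_a, seg_b, ubm, threshold):
    """LLRS idea from the description: declare two segments the same
    speaker when the difference of their UBM scores is small.
    The threshold value is a tuning parameter, not specified here."""
    weights, means, variances = ubm
    return abs(segment_loglik(seg_a, weights, means, variances)
               - segment_loglik(seg_b, weights, means, variances)) < threshold
```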
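The means-only MAP adaptation used in training follows the standard GMM-UBM recipe; a sketch is below. The relevance factor value is an assumed typical default, not taken from the description, and diagonal covariances are assumed.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Means-only MAP adaptation of a diagonal-covariance UBM toward a
    target speaker's frames (T, D). relevance=16 is an assumed value."""
    # Posterior responsibility of each mixture component for each frame.
    diff = frames[:, None, :] - means[None, :, :]
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=2))
    log_comp -= log_comp.max(axis=1, keepdims=True)
    post = np.exp(log_comp)
    post /= post.sum(axis=1, keepdims=True)                        # (T, M)
    # Zeroth- and first-order statistics, then interpolate with the UBM means.
    n = post.sum(axis=0)                                           # (M,)
    ex = post.T @ frames / np.maximum(n, 1e-10)[:, None]           # (M, D)
    alpha = (n / (n + relevance))[:, None]
    return alpha * ex + (1 - alpha) * means
```

Components with little data stay close to the UBM means, which is why only well-observed components move toward the speaker.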
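The T-norm step in the score-normalization section standardizes each trial score by the statistics of the same test segment scored against a cohort of impostor models; a minimal sketch:

```python
import numpy as np

def tnorm(raw_score, cohort_scores):
    """T-norm: standardize a raw verification score by the mean and
    standard deviation of the test segment's scores against the cohort
    models (368 female / 248 male cohort speakers in this system)."""
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores)
    return (raw_score - mu) / (sigma + 1e-8)
```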
- Decision
The log-likelihood scores are compared with a threshold to make the decision. This threshold is set at the best DCF operating point estimated on the SRE'05 corpus. (The confidence scores can be interpreted as likelihood ratios.)

- Absolute processing time
Training models in 1conv4w: 7434.8 seconds.
Processing test segments (1conv4w-1conv4w): 23846.9 seconds.
Processing test segments (1conv4w-1conv2w): 73846.7 seconds.
Training models in 2conv2w: 129441.6 seconds.
Processing test segments (3conv2w-1conv2w): 73851.2 seconds.

- Computer configuration
The evaluations were run on a PC with a single Intel P4 3.0 GHz CPU and 1 GB of memory.
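The DCF-based threshold tuning in the decision step can be sketched as below, using the standard NIST detection cost parameters from that evaluation period (C_miss=10, C_FA=1, P_target=0.01); the sweep over development scores is an illustrative assumption about how the operating point was found.

```python
import numpy as np

def dcf(miss_rate, fa_rate, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """NIST detection cost function with the standard SRE parameters."""
    return c_miss * p_target * miss_rate + c_fa * (1 - p_target) * fa_rate

def best_threshold(target_scores, impostor_scores):
    """Sweep candidate thresholds over development scores and return the
    one minimizing the DCF, mirroring tuning on a development corpus."""
    candidates = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_cost, best_t = None, None
    for t in candidates:
        miss = np.mean(target_scores < t)   # targets rejected
        fa = np.mean(impostor_scores >= t)  # impostors accepted
        cost = dcf(miss, fa)
        if best_cost is None or cost < best_cost:
            best_cost, best_t = cost, t
    return best_t
```

At test time, scores at or above the tuned threshold are accepted and the rest rejected.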