1:00, SPEECH-L10.1
LANGUAGE DEPENDENCY IN TEXT-INDEPENDENT SPEAKER VERIFICATION
R. AUCKENTHALER, M. CAREY, J. MASON
Speech technology deployed in appliances available around the world cannot be restricted to a single language. However, most of today’s text-independent verification systems based on Gaussian mixture models (GMMs) use an adaptive approach to train the speaker model, which assumes that the world model covers the same language as that spoken by the target speaker.
In this paper we investigate language mismatches between the target speaker and the world model in a GMM speaker verification system. Experiments performed with different world model languages showed major degradations, in particular for Mandarin and Vietnamese when the target speakers spoke American English. Experiments with world models trained on data pooled from different languages revealed only minor performance degradations.
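The GMM world-model scoring the paper builds on can be sketched as a log-likelihood ratio between the adapted speaker model and the world model; a language mismatch degrades the world-model term of this ratio. A minimal sketch with diagonal-covariance GMMs (the model sizes and parameters here are illustrative, not the paper's):

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.
    frames: (T, D); weights: (M,); means, variances: (M, D)."""
    diff = frames[:, None, :] - means[None, :, :]                     # (T, M, D)
    exponent = -0.5 * np.sum(diff**2 / variances, axis=2)             # (T, M)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)   # (M,)
    log_comp = np.log(weights) + log_norm + exponent                  # (T, M)
    # log-sum-exp over mixture components, averaged over frames
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))))

def verification_score(frames, speaker_gmm, world_gmm):
    """GMM world-model log-likelihood ratio: positive favours the target."""
    return gmm_loglik(frames, *speaker_gmm) - gmm_loglik(frames, *world_gmm)
```

When the world model is trained on a mismatched language, its likelihoods are systematically poorer, which shifts this ratio and moves the operating point of the verifier.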
1:20, SPEECH-L10.2
LANGUAGE-INDEPENDENT, SHORT-ENROLLMENT VOICE VERIFICATION OVER A FAR-FIELD MICROPHONE
J. BELLEGARDA, D. NAIK, M. NEERACHER, K. SILVERMAN
A new approach is presented for the dual verification of speaker
identity and verbal content in a text-dependent voice authentication
system. The application considered is desktop voice login over a
far-field microphone. Each speaker is allowed to select his or her own
keyphrase, and enrollment is limited to four instances of the
keyphrase, each 1 to 2 seconds of speech. The approach decouples the
analysis of speaker and verbal content information, so as to use two
light-weight components for verification: a spectral matching
component based on a global representation of the entire utterance,
and a temporal alignment component based on more conventional
frame-level information. The resulting integration is
language-independent, and experiments with deliberate imposture show
an equal error rate figure of approximately 4%. The approach has been
commercially deployed in the "VoicePrint Password" feature of
MacOS 9.
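The decoupled architecture described above, a global spectral-matching score fused with a frame-level temporal-alignment score, can be sketched as follows. The utterance-level mean vector and classical dynamic time warping used here are simple stand-ins; the paper's actual representations are not specified in the abstract:

```python
import numpy as np

def global_spectral_distance(test, template):
    """Spectral matching on a global representation: compare utterance-level
    mean feature vectors (a stand-in for the paper's representation)."""
    return float(np.linalg.norm(test.mean(axis=0) - template.mean(axis=0)))

def dtw_distance(test, template):
    """Frame-level temporal alignment via classical dynamic time warping,
    normalized by path length bound."""
    T, U = len(test), len(template)
    cost = np.full((T + 1, U + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            d = np.linalg.norm(test[i - 1] - template[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[T, U] / (T + U))

def accept(test, template, w=0.5, threshold=1.0):
    """Fuse the two decoupled scores; accept when the fused score is low."""
    score = (w * global_spectral_distance(test, template)
             + (1 - w) * dtw_distance(test, template))
    return score < threshold
```

Because both components operate on generic acoustic features rather than recognised words, the fusion is language-independent, which matches the claim in the abstract.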
1:40, SPEECH-L10.3
PASSWORD-DEPENDENT SPEAKER VERIFICATION USING QUANTIZED ACOUSTIC TRAJECTORIES
L. GAGNON, P. STUBLEY, G. MAILHOT
Speaker verification requires either two steps (identity claim and verification) or the use of speech recognition to determine the password phrase. A single-step method using speech recognition is text- and language-dependent. We describe a novel single-step method based on Gaussian mixture models and quantized acoustic trajectories that uses no linguistic knowledge and is therefore text- and language-independent. Although a two-step process can be more accurate, our approach is significantly more accurate than speaker identification and more convenient than a two-step process.
2:00, SPEECH-L10.4
A COMBINATION BETWEEN VQ AND COVARIANCE MATRICES FOR SPEAKER RECOGNITION
M. FAUNDEZ-ZANUY
This paper presents a new algorithm for speaker recognition based on the combination of the classical Vector Quantization (VQ) and Covariance Matrix (CM) methods. The combined VQ-CM method improves on the identification rates of either method alone, at comparable computational cost. It offers a straightforward procedure for obtaining a model similar to a GMM with full covariance matrices. Experimental results also show that it is more robust against noise than VQ or CM alone.
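The VQ-CM combination can be sketched as a weighted fusion of a VQ distortion score and a covariance-matrix dissimilarity. The Frobenius-norm covariance distance and the fusion weight below are illustrative assumptions; the paper's exact CM measure and combination rule may differ:

```python
import numpy as np

def vq_distortion(frames, codebook):
    """Mean squared distance from each frame to its nearest codeword."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1) ** 2))

def cov_distance(frames, model_cov):
    """A simple covariance-matrix dissimilarity: Frobenius norm of the
    difference between test and model covariances (an assumed measure)."""
    test_cov = np.cov(frames, rowvar=False)
    return float(np.linalg.norm(test_cov - model_cov, ord="fro"))

def combined_score(frames, codebook, model_cov, alpha=0.5):
    """Weighted VQ-CM fusion; lower means a better speaker match."""
    return (alpha * vq_distortion(frames, codebook)
            + (1 - alpha) * cov_distance(frames, model_cov))
```

The centroids capture first-order structure and the covariance captures second-order structure, which is why their combination resembles a GMM with full covariance matrices.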
2:20, SPEECH-L10.5
TEXT-DEPENDENT SPEAKER VERIFICATION UNDER NOISY CONDITIONS USING PARALLEL MODEL COMBINATION
L. WONG, M. RUSSELL
In real speaker verification applications, additive or convolutive noise creates a mismatch between training and recognition environments, degrading performance. Parallel Model Combination (PMC) has been used successfully to improve the noise robustness of Hidden Markov Model (HMM) based speech recognisers. This paper presents the results of applying PMC to compensate for additive noise in HMM-based text-dependent speaker verification. Speech and noise data were obtained from the YOHO and NOISEX-92 databases respectively. Speaker recognition Equal Error Rates (EER) are presented for noise-contaminated speech at different signal-to-noise ratios (SNRs) and different noise sources. For example, the average EER for speech in operations-room noise at 6 dB SNR dropped from approximately 20% uncompensated to less than 5% using PMC. Finally, it is shown that speaker recognition performance is relatively insensitive to the exact value of the parameter that determines the relative amplitudes of the speech and noise components of the PMC model.
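The core of PMC for additive noise is combining clean-speech and noise model parameters in the linear spectral domain and returning to the log domain; the widely used log-add approximation for static log-spectral means is sketched below, with the gain g playing the role of the relative-amplitude parameter mentioned in the abstract (the paper may use a fuller compensation, e.g. of variances and dynamic parameters):

```python
import numpy as np

def pmc_log_add(speech_mean, noise_mean, g=1.0):
    """Log-add approximation of Parallel Model Combination for static
    log-spectral means: add clean-speech and scaled noise energies in the
    linear domain, then return to the log domain. g sets the relative
    amplitude of the noise component."""
    return np.log(np.exp(speech_mean) + g * np.exp(noise_mean))
```

When speech energy dominates, moderate changes in g barely move the compensated mean, consistent with the abstract's observation that performance is relatively insensitive to this parameter's exact value.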
2:40, SPEECH-L10.6
VERY LARGE POPULATION TEXT-INDEPENDENT SPEAKER IDENTIFICATION USING TRANSFORMATION ENHANCED MULTI-GRAINED MODELS
U. CHAUDHARI, J. NAVRATIL, G. RAMASWAMY, S. MAES
The paper presents results on speaker identification with a population
size of over 10,000 speakers. Speaker modeling is accomplished via our
Transformation Enhanced Multi-Grained Models. We pursue two goals: the
first is to study the performance of a number of different systems
within the multi-grained modeling framework; the second is to analyze
performance as a function of population size. We show that the most
complex models within the framework perform best, and demonstrate
that, to a good approximation, the identification error rate of the
described system scales linearly with the logarithm of the population
size. Further, based on our analysis of system performance, we develop
a candidate rejection technique that indicates low confidence in the
chosen identity.
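The reported scaling, error rate roughly linear in the log of the population size, can be exploited to extrapolate expected error at larger populations. A sketch with illustrative numbers (not the paper's figures):

```python
import numpy as np

def extrapolate_error(pop_sizes, error_rates, target_pop):
    """Fit the trend error ~ a + b*log(N) by least squares and extrapolate
    to a new population size N. Inputs are illustrative measurements."""
    a, b = np.polynomial.polynomial.polyfit(np.log(pop_sizes), error_rates, 1)
    return float(a + b * np.log(target_pop))
```

For example, if measured error rates were 2%, 4%, and 6% at populations of 100, 1,000, and 10,000 (made-up values), the fit would predict 8% at 100,000, each tenfold population increase costing a constant increment in error.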