Chair: S. Parthasarathy, AT&T Bell Laboratories (USA)
Shoji Hayakawa, Nagoya University (JAPAN)
Fumitada Itakura, Nagoya University (JAPAN)
In our previous studies, we have shown the effectiveness of using information in the higher frequency band for speaker recognition. However, the energy spectrum of speech in the higher frequency band is weak, except for some fricative sounds. It is therefore important to investigate the speaker-specific information in that region under noisy conditions. In this study, we examine the influence of additive noise on the performance of speaker recognition using the higher frequency band. Experimental results show that high performance is obtained in the wideband case under many typical noisy conditions. The higher frequency band is also shown to be more robust to noise than the lower band; as a result, it gives good performance even when its SNR is worse than that of the lower band.
C.R. Jankowski Jr., MIT Lincoln Laboratory (USA)
T.F. Quatieri, MIT Lincoln Laboratory (USA)
D.A. Reynolds, MIT Lincoln Laboratory (USA)
The performance of speaker identification (SID) systems can be quite good with clean speech, but is much lower with degraded speech. It is therefore useful to search for new features for SID, particularly features that are robust over a degraded channel. This paper investigates features based on amplitude and frequency modulations of speech formants, high-resolution measurement of fundamental frequency, and the location of "secondary pulses," measured using a high-resolution energy operator. When these features are added to traditional features in an existing SID system with a 168-speaker telephone speech database, SID performance improves by as much as 4% for male speakers and 8.2% for female speakers.
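The "high-resolution energy operator" mentioned above belongs to the family of discrete energy operators of which the Teager energy operator is the best-known member; the following sketch shows that operator only as background, not the paper's exact implementation:

```python
def teager_energy(x):
    """Discrete Teager energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1).

    For a pure sinusoid A*sin(w*n) the output is the constant A^2*sin(w)^2,
    so it jointly tracks amplitude and frequency of a narrowband component,
    which is why operators of this kind suit AM/FM formant analysis.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]
```

For a sampled sinusoid such as [0, 1, 0, -1, 0, 1, 0, -1] (A = 1, w = pi/2), every output sample equals A^2*sin(w)^2 = 1.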
D.A. Reynolds, MIT Lincoln Laboratory (USA)
M.A. Zissman, MIT Lincoln Laboratory (USA)
T.F. Quatieri, MIT Lincoln Laboratory (USA)
G.C. O'Leary, MIT Lincoln Laboratory (USA)
B.A. Carlson, MIT Lincoln Laboratory (USA)
The two largest factors affecting automatic speaker identification performance are the size of the population and the degradations introduced by noisy communication channels (e.g., telephone transmission). To examine these two factors experimentally, this paper presents text-independent speaker identification results for speaker populations of up to 630 speakers, for both clean wideband speech and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification, and experiments are conducted on the TIMIT and NTIMIT databases. These are believed to be the first speaker identification experiments on the complete 630-speaker TIMIT and NTIMIT databases, and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% are achieved on the TIMIT and NTIMIT databases, respectively. This paper also presents experiments that attempt to quantify the performance loss associated with various telephone degradations by systematically degrading the TIMIT speech in a manner consistent with measured NTIMIT degradations and measuring the performance loss at each step. The standard degradations of filtering and additive noise are found not to account for all of the performance gap between the TIMIT and NTIMIT data. Measurements of nonlinear microphone distortions that may explain the additional performance loss are also described.
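A Gaussian mixture speaker model scores a test utterance by the total log-likelihood of its feature frames under each enrolled speaker's model, and picks the best-scoring speaker. The sketch below is a deliberate simplification: each speaker gets a single diagonal-covariance Gaussian (a 1-mixture model), whereas the system described above uses many mixture components trained with EM.

```python
import math

def fit_gaussian(frames):
    """Fit a diagonal-covariance Gaussian (a 1-mixture GMM) to feature frames."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(d)]
    # Floor the variance so the log-likelihood stays finite.
    var = [max(sum((f[i] - mean[i]) ** 2 for f in frames) / n, 1e-6)
           for i in range(d)]
    return mean, var

def log_likelihood(frames, model):
    """Total log-likelihood of all frames under a diagonal Gaussian model."""
    mean, var = model
    ll = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return ll

def identify(test_frames, models):
    """Return the speaker whose model assigns the highest log-likelihood."""
    return max(models, key=lambda spk: log_likelihood(test_frames, models[spk]))
```

In practice the frames would be cepstral vectors; here any lists of floats serve to illustrate the decision rule.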
Michael Schmidt, BBN Systems and Technologies (USA)
Herbert Gish, BBN Systems and Technologies (USA)
Angela Mielke, BBN Systems and Technologies (USA)
Two novel channel-robust methods for text-independent speaker identification are described. The first technique models speakers' voices stochastically via cepstral correlations rather than covariances, in an effort to compensate for additive noise. The second technique, which we term dynamic covariances, models speakers by covariances of deviations of cepstra from time-varying means rather than from constant means. Dynamic covariances may normalize for time-varying channel effects, utterance lengths, and text. Experimental results are obtained on the SPIDRE subset of the Switchboard corpus. Error rates as low as 2.2% are obtained with the new models.
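The correlation matrix normalizes each covariance entry by the standard deviations of the two dimensions involved, which, among other robustness properties, makes it invariant to per-dimension gain. A minimal sketch of that normalization step (a simplification of the stochastic models described above):

```python
def covariance(frames):
    """Sample covariance matrix of a list of cepstral vectors."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(d)]
    return [[sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in frames) / n
             for j in range(d)] for i in range(d)]

def correlation(frames):
    """Correlation matrix: r_ij = c_ij / sqrt(c_ii * c_jj)."""
    c = covariance(frames)
    d = len(c)
    return [[c[i][j] / (c[i][i] * c[j][j]) ** 0.5 for j in range(d)]
            for i in range(d)]
```

The diagonal of the correlation matrix is identically 1, and rescaling any cepstral dimension leaves the off-diagonal entries unchanged.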
William Y. Huang, ITT Aerospace Communications
Bhaskar D. Rao, University of California (USA)
The performance of text-dependent, short-utterance speaker verification systems degrades significantly with channel and background artifacts. We investigate maximum likelihood and adaptive techniques to compensate for a stationary channel and noise. Maximum likelihood channel and noise compensation was introduced by Cox and Bridle in 1989 and has been shown to be effective in many other speech applications. For adaptive estimation, a Bussgang-like algorithm is developed that is more suitable for real-time implementation. These techniques are evaluated on a speaker verification system that uses the nearest neighbor metric. Our results show that for telephone speech with channel differences, channel compensation can provide substantial performance improvement. For uncooperative speakers, background compensation yielded a 35% improvement.
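A stationary linear channel shows up as a fixed additive bias in the cepstral domain, and under a single-Gaussian model the maximum-likelihood estimate of that bias reduces to the utterance mean. Subtracting it (cepstral mean subtraction) is thus a heavily simplified stand-in for the Cox-Bridle style compensation investigated above, not the paper's algorithm:

```python
def remove_cepstral_mean(frames):
    """Subtract the per-dimension mean cepstrum over the utterance,
    cancelling any constant (stationary-channel) bias."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(d)]
    return [[f[i] - mean[i] for i in range(d)] for f in frames]
```

After mean removal, an utterance and a channel-biased copy of it yield identical feature sequences, which is the property the compensation exploits.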
Joseph P. Campbell Jr., U.S. Department of Defense (USA)
A standard database for testing voice verification systems, called YOHO, is now available from the Linguistic Data Consortium (LDC). The purpose of this database is to enable research, spark competition, and provide a means for comparative performance assessment of various voice verification systems. A test plan is presented for the suggested use of the LDC's YOHO CD-ROM for testing voice verification systems. This plan is based upon ITT's voice verification test methodology as described by Higgins et al., but differs slightly in order to match the LDC's CD-ROM version of YOHO and to accommodate different systems. Test results of several algorithms using YOHO are also presented.
Chi-Shi Liu, Ministry of Transportation and Communications (TAIWAN)
Hsiao-Chaun Wang, National Tsing Hua University (TAIWAN)
Frank K. Soong, AT&T Bell Laboratories (USA)
Chao-Shih Huang, Ministry of Transportation and Communications (TAIWAN)
In this paper, a segmental probabilistic model based on an orthogonal polynomial representation of speech signals is proposed. Unlike the conventional frame-based probabilistic model, this segment-based model concatenates consecutive frames with similar acoustic characteristics into an acoustic segment and represents the segment by an orthogonal polynomial function. An algorithm that iteratively performs recognition and segmentation is proposed for estimating the parameters of the segment model. The segment model is applied to text-independent speaker verification. On a 20-speaker database, experimental results show that segment models outperform the conventional frame-based probabilistic model: the equal error rate is reduced by 3.6% when the models are represented by 64-mixture density functions.
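Representing a segment's per-dimension trajectory by a low-order orthogonal polynomial can be sketched as follows. The basis here is built by Gram-Schmidt over the segment's own sample points; this is illustrative and not necessarily the paper's exact parameterization:

```python
def orthogonal_basis(n, order):
    """Polynomials of degree 0..order, orthogonalized (Gram-Schmidt)
    over the n sample points of one segment."""
    ts = [i / (n - 1) for i in range(n)]
    basis = []
    for k in range(order + 1):
        v = [t ** k for t in ts]
        for b in basis:
            proj = sum(x * y for x, y in zip(v, b)) / sum(y * y for y in b)
            v = [x - proj * y for x, y in zip(v, b)]
        basis.append(v)
    return basis

def fit_segment(y, order=2):
    """Project a 1-D trajectory onto the orthogonal basis; because the basis
    is orthogonal, each coefficient is an independent projection."""
    basis = orthogonal_basis(len(y), order)
    return [sum(yi * pi for yi, pi in zip(y, p)) / sum(pi * pi for pi in p)
            for p in basis]

def reconstruct(coeffs, n):
    """Rebuild the smoothed trajectory from its polynomial coefficients."""
    basis = orthogonal_basis(n, len(coeffs) - 1)
    return [sum(c * p[t] for c, p in zip(coeffs, basis)) for t in range(n)]
```

A few coefficients per cepstral dimension then summarize the whole segment, replacing the frame-by-frame representation.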
Kevin R. Farrell, Dictaphone Corporation (USA)
A new system is presented for text-dependent speaker verification. The system uses data fusion concepts to combine the results of distortion-based and discriminant-based classifiers, so both intraspeaker and interspeaker information are utilized in the final decision. The distortion-based and discriminant-based classifiers are based on dynamic time warping (DTW) and the neural tree network (NTN), respectively. The system is evaluated with several hundred two-word utterances collected over a telephone channel. The combined classifier yields an equal error rate of two percent for this task, which is better than the individual performance of either classifier.
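A sketch of the distortion side and of one simple linear fusion rule. The NTN classifier itself and the paper's actual fusion weights are not reproduced; `ntn_score`, `d_max`, and `w` below are illustrative placeholders:

```python
def dtw(a, b):
    """Classic dynamic time warping distance with |x - y| local cost."""
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[len(a)][len(b)]

def fuse(dtw_dist, ntn_score, d_max=10.0, w=0.5):
    """Map DTW distortion to a similarity in [0, 1] and combine it linearly
    with an NTN-style discriminant score; higher output means 'accept'."""
    similarity = 1.0 - min(dtw_dist / d_max, 1.0)
    return w * similarity + (1.0 - w) * ntn_score
```

Because DTW allows non-linear time alignment, a sequence matches a time-stretched copy of itself with zero distortion, which is what makes it a useful intraspeaker (template) distance for fixed-text verification.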
M. Mehdi Homayounpour, CNRS/URA (FRANCE)
Gerard Chollet, IDIAP (SWITZERLAND)
The non-supervised Self-Organizing Map of Kohonen (SOM), the supervised Learning Vector Quantization algorithm (LVQ3), and two Second-Order Statistical Measures (SOSM) were adapted, evaluated, and compared for speaker verification on 57 speakers of a POLYPHONE-like database. SOM and LVQ3 were trained with codebooks of 32 and 256 codes, and two statistical measures were implemented: one without weighting (SOSM1) and one with weighting (SOSM2). The Equal Error Rate (EER) and Best Match Decision Rule (BMDR) were employed and evaluated as decision criteria. LVQ3 demonstrates a performance advantage over SOM because it allows long-term fine-tuning of the target codebook using speech data from both the client and other speakers, whereas SOM uses data from the client only. SOSM performs better than SOM and LVQ3 for long test utterances, while for short test utterances LVQ3 is the best of the methods studied here.
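Whether the codebook is trained by SOM or by LVQ3, verification typically scores an utterance by its average quantization distortion against the claimed speaker's codebook and accepts if the distortion falls below a threshold. A minimal sketch of that scoring step only (training of the map itself is omitted, and the threshold is an illustrative parameter):

```python
def vq_distortion(frames, codebook):
    """Mean squared distance from each frame to its nearest codeword."""
    total = 0.0
    for f in frames:
        total += min(sum((x - c) ** 2 for x, c in zip(f, cw))
                     for cw in codebook)
    return total / len(frames)

def verify(frames, codebook, threshold):
    """Accept the claimed identity if average distortion is below threshold."""
    return vq_distortion(frames, codebook) < threshold
```

Longer test utterances average the distortion over more frames, which is one reason utterance length matters in the comparison above.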
Han-Sheng Liou, Rutgers University (USA)
Richard J. Mammone, Rutgers University (USA)
In this paper, a new algorithm for text-dependent speaker verification is presented. The algorithm uses a set of concatenated Neural Tree Networks (NTNs) trained on subword units. The conventional NTN has been found to perform well in text-independent tasks. In the new approach, two types of subword unit are investigated: phone-like units (PLUs) and HMM state-based units (HSUs). Training proceeds in several steps. First, the predetermined password in the training data is segmented into subword units using a Hidden Markov Model (HMM) based segmentation method. Second, an NTN is trained for each subword unit. The new structure integrates the discriminatory ability of the NTN with the temporal modeling of the HMM. The algorithm was evaluated on a TI isolated-word database and the YOHO database, and performance improved over that obtained with a conventional HMM.