Abstract: Session SP-26 |

SP-26.1
Implications of Glottal Source for Speaker and Dialect Identification
Lisa R. Yanguas,
Thomas F Quatieri (MIT Lincoln Laboratory),
Fred Goodman (Autometric)
In this paper we explore the importance of speaker-specific
information carried in the glottal source by interchanging glottal
excitation and vocal tract system information for speaker pairs
through analysis/synthesis techniques. Earlier work has shown the
importance of glottal flow derivative waveform model parameters in
automatic speaker identification [2] and in characterizing voice
style (e.g., creaky or aspirated) [1]. In our work, after matching
phonemes across utterance pairs, we separately interchange three
excitation characteristics: timing, pitch, and glottal flow
derivative, and investigate their relative importance in two case
studies. Through time alignment and pitch and glottal flow
transformations, we can make a speaker of a northern dialect sound
more like his southern counterpart; through the same processes, a
Peruvian speaker is made to sound more Cuban-like. From these
experiments we conclude that significant speaker- and
dialect-specific information, such as noise, breathiness or
aspiration, and vocalization and stridency, is carried in the
glottal signal. Based on these findings, it appears that although
dialect identification has usually been approached in a manner
similar to language identification, it may actually be more closely
related to the speaker identification task. This suggests linking
speaker and dialect identification more closely; we plan, for
example, to explore other speaker identification approaches to
dialect identification in the hope of improving automatic dialect
identification algorithms.
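
To make the excitation/vocal-tract exchange concrete, here is a
minimal Python sketch that swaps source and filter between two
time-aligned segments. It uses plain LPC residuals as a simplified
stand-in for the paper's glottal flow derivative model, and the
segment variables are hypothetical; it illustrates the general
analysis/synthesis idea, not the authors' exact system.

    # Sketch: drive speaker B's vocal-tract filter with speaker A's
    # excitation. LPC residuals approximate the glottal source here,
    # a simplification of the glottal flow derivative model above.
    import librosa
    from scipy.signal import lfilter

    def source_filter_swap(seg_a, seg_b, order=16):
        """Synthesize speech with A's excitation and B's vocal tract."""
        a_coeffs = librosa.lpc(seg_a, order=order)    # A's vocal tract
        b_coeffs = librosa.lpc(seg_b, order=order)    # B's vocal tract
        residual_a = lfilter(a_coeffs, [1.0], seg_a)  # A's excitation
        return lfilter([1.0], b_coeffs, residual_a)   # cross-synthesis

    # seg_a, seg_b: equal-length, phoneme-matched mono float arrays
    # (hypothetical); in practice this is done frame by frame.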

SP-26.2
Automatic Speaker Clustering from Multi-speaker Utterances
Jack McLaughlin,
Douglas Reynolds,
Elliot Singer,
Gerald C O'Leary (MIT Lincoln Laboratory)
Blind clustering of multi-person utterances by speaker
is complicated by the fact that each utterance has at least two
talkers. In the case of a two-person conversation, one can
simply split each conversation into its respective speaker
halves, but this splitting introduces errors that ultimately
hurt clustering. We propose a clustering algorithm
capable of associating each conversation with two clusters
(and therefore two speakers), obviating the need for splitting.
Results are given for two-speaker conversations culled from
the Switchboard corpus, and comparisons are made to
results obtained on single-speaker utterances. We conclude
that although the approach is promising, our technique for
computing inter-conversation similarities prior to clustering
needs improvement.
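
The abstract leaves the similarity measure unspecified; as a hedged
illustration only, the sketch below computes a common symmetric
cross-likelihood similarity between two conversations from
per-conversation GMMs. The function name and feature arrays are
assumptions, not the paper's implementation.

    # Sketch: score each conversation's frames against the other
    # conversation's GMM; a high symmetric cross log-likelihood
    # suggests shared speakers.
    from sklearn.mixture import GaussianMixture

    def conversation_similarity(feats_i, feats_j, n_mix=8):
        """feats_*: (n_frames, n_dims) cepstral feature arrays."""
        gmm_i = GaussianMixture(n_mix, covariance_type="diag").fit(feats_i)
        gmm_j = GaussianMixture(n_mix, covariance_type="diag").fit(feats_j)
        # .score() returns the mean per-frame log-likelihood.
        return gmm_i.score(feats_j) + gmm_j.score(feats_i)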

SP-26.3
Detection of Target Speakers in Audio Databases
Ivan Magrin-Chagnolleau,
Aaron E Rosenberg,
S. Parthasarathy (AT&T Labs-Research)
This paper addresses the problem of speaker detection in audio
databases. Gaussian mixture modeling is used to build target
speaker and background models, and a detection algorithm
based on a likelihood ratio calculation is applied to
estimate target speaker segments. Evaluation procedures
for this task are defined in detail. Results are given
for different subsets of the HUB4 broadcast news database.
For one target speaker, with the data restricted to
high-quality speech segments, the segment miss rate is
approximately 7%; for unrestricted data the segment miss
rate is approximately 27%. In both cases the false alarm
rate is 4 or 5 per hour. For two target speakers with
unrestricted data, the segment miss rate is approximately
63% with about 27 segment false alarms per hour. The
decrease in performance for two target speakers is largely
due to short speech segments in the two-target-speaker test
data, which are undetectable in the current configuration
of the detection algorithm.
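
A minimal sketch of the segment-level likelihood-ratio test
described above, assuming scikit-learn GMMs and pre-computed
feature segments (feature extraction and segmentation are not
shown, and the threshold is a free parameter):

    # Sketch: flag segments whose mean log-likelihood ratio between
    # target and background GMMs clears a decision threshold.
    from sklearn.mixture import GaussianMixture

    def train_models(target_frames, background_frames, n_mix=32):
        target = GaussianMixture(n_mix, covariance_type="diag")
        background = GaussianMixture(n_mix, covariance_type="diag")
        return target.fit(target_frames), background.fit(background_frames)

    def detect_segments(segments, target, background, threshold=0.0):
        """segments: list of (n_frames, n_dims) feature arrays."""
        return [k for k, seg in enumerate(segments)
                if target.score(seg) - background.score(seg) > threshold]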

SP-26.4
Background Model Design for Flexible and Portable Speaker Verification Systems
Olivier Siohan,
Chin H Lee,
Arun C Surendran,
Qi Li (Bell Laboratories - Lucent Technologies)
Most state-of-the-art speaker verification systems need a user model
built from samples of the customer's speech and a speaker-independent
(SI) background model with high acoustic resolution. These systems
rely heavily on the availability of speaker-independent databases
along with a priori knowledge about the acoustic rules of the
utterance, and depend on the consistency of the acoustic conditions
under which the SI models were trained. These constraints may be a
burden in practical and portable devices, such as palm-top computers
or wireless handsets, which place a premium on computation and memory
and where the user is free to choose any password utterance in any
language, under any acoustic condition. In this paper, we present a
novel and reliable approach to background model design when only the
enrollment data is available. Preliminary results are provided to
demonstrate the effectiveness of such systems.
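
The abstract does not spell out the construction, so the following
is a loudly hypothetical illustration of one way a background model
could be derived from enrollment data alone: refit the same
enrollment frames with a much coarser GMM and score with the usual
log-likelihood ratio. This is an assumption for illustration, not
the authors' method.

    # Hypothetical sketch: a low-resolution GMM fit on the same
    # enrollment data serves as a broad background model, so no SI
    # database is needed (an assumed design, not the paper's).
    from sklearn.mixture import GaussianMixture

    def enroll(enroll_frames, n_user=32, n_bg=2):
        user = GaussianMixture(n_user, covariance_type="diag").fit(enroll_frames)
        bg = GaussianMixture(n_bg, covariance_type="diag").fit(enroll_frames)
        return user, bg

    def verification_score(test_frames, user, bg):
        # Higher when the detailed user model fits much better than
        # the coarse background model.
        return user.score(test_frames) - bg.score(test_frames)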

SP-26.5
Corpora for the Evaluation of Speaker Recognition Systems
Joseph P Campbell (Department of Defense),
Douglas A Reynolds (MIT Lincoln Laboratory)
Using standard speech corpora for development and
evaluation has proven to be very valuable in promoting
progress in speech and speaker recognition research.
In this paper, we present an overview of current
publicly available corpora intended for speaker
recognition research and evaluation. We outline the
corpora's salient features with respect to their
suitability for conducting speaker recognition
experiments and evaluations. Links to these corpora,
and to new corpora, will appear on the web at
http://www.apl.jhu.edu/Classes/Notes/Campbell/SpkrRec/.
We hope to increase the awareness and use of these
standard corpora and corresponding evaluation
procedures throughout the speaker recognition
community.

SP-26.6
An Unsupervised Approach to Language Identification
Francois Pellegrino,
Régine André-Obrecht (IRIT - Institut de Recherche en Informatique de Toulouse)
This paper presents an unsupervised approach to Automatic Language Identification (ALI) based on vowel system modeling. Each language's vowel system is modeled by a Gaussian Mixture Model (GMM) trained on automatically detected vowels. Since this detection is unsupervised and language-independent, no labeled data are required. GMMs are initialized using an efficient data-driven variant of the LBG algorithm: the LBG-Rissanen algorithm.
With 5 languages from the OGI MLTS corpus, in a closed-set identification task, we reach 79% correct identification for male speakers using only the vowel segments detected in 45-second utterances.
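
As a rough illustration of the initialization step, here is a sketch
of LBG-style codebook growth with a Rissanen-style minimum
description length (MDL) stopping rule. The exact MDL form and the
k-means refinement are assumptions; the paper's LBG-Rissanen details
may differ.

    # Sketch: grow a codebook by LBG splitting, refine with k-means,
    # and keep the size that minimizes an MDL-style criterion
    # (distortion cost plus a parameter-cost penalty).
    import numpy as np
    from sklearn.cluster import KMeans

    def lbg_rissanen(X, max_doublings=5, eps=1e-3):
        n, d = X.shape
        centers = X.mean(axis=0, keepdims=True)
        best, best_mdl = centers, np.inf
        for _ in range(max_doublings):
            # LBG split: perturb each centroid into two.
            centers = np.vstack([centers * (1 + eps), centers * (1 - eps)])
            km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
            centers = km.cluster_centers_
            k = len(centers)
            mdl = n * np.log(km.inertia_ / n) + 0.5 * k * d * np.log(n)
            if mdl < best_mdl:
                best, best_mdl = centers.copy(), mdl
        return best  # codebook used to initialize the GMM means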

SP-26.7
An Experimental Study of Speaker Verification Sensitivity to Computer Voice-Altered Imposters
Bryan L. Pellom,
John H.L. Hansen (Duke University)
This paper investigates the relative sensitivity of a GMM-based voice verification
algorithm to computer voice-altered imposters. First, a new trainable speech
synthesis algorithm based on trajectory models of the speech Line Spectral Frequency
(LSF) parameters is presented in order to model the spectral characteristics of a
target voice. A GMM-based speaker verifier is then constructed for the 138-speaker
YOHO database and shown to have an initial equal-error rate (EER) of 1.45% for
the case of casual imposter attempts using a single combination-lock phrase test.
Next, imposter voices are automatically altered using the synthesis algorithm to
mimic the customer's voice. After voice transformation, the false acceptance rate
is shown to increase from 1.45% to over 86% if the baseline EER threshold is left
unmodified. Furthermore, at a customer false rejection rate of 25%, the false
acceptance rate for the voice-altered imposter remains as high as 34.6%.
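
To make the threshold experiment concrete, here is a minimal sketch
of estimating the EER operating threshold from genuine and imposter
score lists, then re-measuring the false acceptance rate when a new
(e.g., voice-altered) imposter score set is evaluated against that
unmodified threshold. Variable names are assumptions.

    # Sketch: pick the threshold where false acceptance and false
    # rejection rates are closest (the EER point), then reuse it.
    import numpy as np

    def eer_threshold(genuine, imposter):
        best_t, best_gap = None, np.inf
        for t in np.sort(np.concatenate([genuine, imposter])):
            far = np.mean(imposter >= t)  # false acceptance rate
            frr = np.mean(genuine < t)    # false rejection rate
            if abs(far - frr) < best_gap:
                best_t, best_gap = t, abs(far - frr)
        return best_t

    def far_at(threshold, imposter_scores):
        return np.mean(imposter_scores >= threshold)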

SP-26.8
A New Cohort Normalization Using Local Acoustic Information for Speaker Verification
Toshihiro Isobe,
Jun-ichi Takahashi (Laboratory for Information Technology, NTT DATA CORPORATION)
This paper describes a new cohort normalization method for HMM-based speaker verification. In the proposed method, cohort models are synthesized based on the similarity of local acoustic features between speakers. The similarity can be determined using acoustic information lying in model components such as phonemes, states, and the Gaussian distributions of HMMs. With this method, the synthesized models can provide an effective normalizing score for various observed measurements, because the difference between the individual reference model and the synthesized cohort models is statistically reduced through fine evaluation of acoustic similarity at the model-structure level. In experiments using telephone speech from 100 speakers, the proposed method achieved high verification performance: in a closed test, the Equal Error Rate (EER) was drastically reduced from 1.20% (obtained by conventional speaker-selection-based cohort normalization) to 0.30% (obtained by the proposed method with distribution-based selection). Furthermore, in an open test (reference speakers: 25, impostors: 75), in which speakers other than the reference speakers were used as impostors, the EER was reduced from 1.40% to 0.70%.
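
In general form, cohort normalization offsets the claimed speaker's
score by the cohort's; the sketch below shows that scoring step only.
How the cohort models are synthesized from locally similar HMM
components, which is this paper's contribution, is not reproduced.

    # Sketch: generic cohort score normalization. Each model is
    # assumed to expose .score(feats) -> mean log-likelihood.
    import numpy as np

    def cohort_normalized_score(test_feats, claimed_model, cohort_models):
        claimed = claimed_model.score(test_feats)
        cohort = np.mean([m.score(test_feats) for m in cohort_models])
        return claimed - cohort  # compare against a decision threshold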

SP-26.9
On the Use of Orthogonal GMM in Speaker Recognition
Li Liu,
Jialong He (Dept. of Speech & Hearing Science, Arizona State University)
Gaussian mixture modeling (GMM) techniques are increasingly being used for both speaker identification and verification. Most of these models assume diagonal covariance matrices. Although empirically any distribution can be approximated with a diagonal GMM, a large number of mixture components is usually needed to obtain a good approximation; a consequence of using a large GMM is that its training is time consuming and its response speed is very slow. This paper proposes a modification to the standard diagonal GMM approach. The proposed scheme includes an orthogonal transformation: feature vectors are first transformed to the space spanned by the eigenvectors of the covariance matrix before being applied to the diagonal GMM. Only a small computational load is introduced by this transformation, but results from both speaker identification and verification experiments indicate that the orthogonal transformation considerably improves recognition performance. For a given performance level, the GMM with an orthogonal transform needs only one-fourth the number of Gaussian functions required by the standard GMM.
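
The transformation described above is straightforward to sketch:
project the features onto the eigenvectors of their covariance
matrix (decorrelating them) before fitting the diagonal-covariance
GMM. Parameter values and function names here are illustrative.

    # Sketch: orthogonal (eigenvector) transform followed by a
    # diagonal GMM, per the scheme described in the abstract.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_orthogonal_gmm(train_feats, n_mix=8):
        cov = np.cov(train_feats, rowvar=False)
        _, eigvecs = np.linalg.eigh(cov)   # orthonormal basis
        gmm = GaussianMixture(n_mix, covariance_type="diag")
        gmm.fit(train_feats @ eigvecs)     # decorrelated features
        return gmm, eigvecs

    def log_likelihood(test_feats, gmm, eigvecs):
        return gmm.score(test_feats @ eigvecs)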

SP-26.10
Reusable Binary-Paired Partitioned Neural Networks for Text-Independent Speaker Identification
Stephen A Zahorian (Old Dominion University)
A neural network algorithm for speaker identification with large groups of speakers is described. The technique derives from one in which an N-way speaker identification task is partitioned into N*(N-1)/2 two-way classification tasks, each performed by a small two-way, or pair-wise, neural network. The decisions of these two-way networks are then combined to make the N-way speaker identification decision (Rudasi and Zahorian, 1991 and 1992). Although very accurate, this method has the drawback of requiring a very large number of pair-wise networks. In the new approach, two-way neural network classifiers, each trained only to separate two speakers, are also used to separate other pairs of speakers. This greatly reduces the number of pair-wise classifiers required for an N-way classification decision, especially when the number of speakers is very large. For 100 speakers extracted from the TIMIT database, the number of pair-wise classifiers can be reduced by approximately a factor of 5, with only minor degradation in performance when 3 seconds or more of speech is used for identification. Using all 630 speakers from the TIMIT database, this method obtains over 99.7% accuracy; with the telephone version of the same database, an accuracy of 40.2% can be obtained.
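
The combination step lends itself to a short sketch: each pair-wise
classifier casts one vote, and the speaker with the most votes wins
the N-way decision. The reuse of classifiers across speaker pairs,
which is the paper's contribution, is omitted here; the classifier
interface is an assumption.

    # Sketch: N-way identification by majority vote over the
    # N*(N-1)/2 pair-wise classifiers.
    import numpy as np

    def nway_decision(feats, pair_classifiers, n_speakers):
        """pair_classifiers[(i, j)](feats) -> winning index i or j."""
        votes = np.zeros(n_speakers, dtype=int)
        for (i, j), clf in pair_classifiers.items():
            votes[clf(feats)] += 1
        return int(np.argmax(votes))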