Authors:
Lisa R. Yanguas,
Thomas F Quatieri,
Fred Goodman,
Paper number 1573
Abstract:
In this paper we explore the importance of speaker-specific information
carried in the glottal source by interchanging glottal excitation and
vocal tract system information for speaker pairs through analysis/synthesis
techniques. Earlier work has shown the importance of glottal flow derivative
waveform model parameters in automatic speaker identification [2] and
also in voice style (e.g., creaky or aspirated) [1]. In our work, after
matching phonemes across utterance pairs, we separately interchange
three excitation characteristics (timing, pitch, and glottal flow derivative)
and investigate their relative importance in two case studies. Through
time alignment and pitch and glottal flow transformations, we can make
a speaker of a northern dialect sound more like his southern counterpart,
and, through the same processes, make a Peruvian speaker sound
more Cuban-like. From these experiments we conclude that significant
speaker- and dialect-specific information, such as noise, breathiness
or aspiration, and vocalization and stridency, is carried in the glottal
signal. Based on these findings, it appears that although dialect identification
has usually been approached in a manner similar to language identification,
it may actually be more closely related to the speaker identification
task. The work suggests further study in linking speaker and dialect
identification more closely. We plan, for example, to explore other
speaker identification approaches to dialect identification in the
hope of improved performance in automatic dialect identification algorithms.
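A minimal sketch of the source/filter interchange that such analysis/synthesis experiments rest on, in Python: LPC inverse filtering recovers an excitation residual for each speaker, and re-synthesis drives one speaker's vocal-tract filter with the other's residual. Frame size, LPC order, and file names are illustrative assumptions; the authors' system additionally performs phoneme matching, time alignment, and pitch and glottal flow derivative transformations.

import numpy as np
import librosa
from scipy.signal import lfilter

def swap_excitation(x_a, x_b, order=16, frame=1024):
    # Drive speaker A's vocal-tract (LPC) filter with speaker B's
    # excitation residual, frame by frame. Overlap-add and filter-state
    # carry-over are omitted to keep the sketch short.
    n = min(len(x_a), len(x_b))
    y = np.zeros(n)
    for s in range(0, n - frame + 1, frame):
        fa, fb = x_a[s:s + frame], x_b[s:s + frame]
        a_coef = librosa.lpc(fa, order=order)   # A's vocal tract
        b_coef = librosa.lpc(fb, order=order)   # B's vocal tract
        resid_b = lfilter(b_coef, [1.0], fb)    # B's excitation residual
        y[s:s + frame] = lfilter([1.0], a_coef, resid_b)
    return y

# Hypothetical usage, assuming pre-aligned, equal-length recordings:
# x_a, _ = librosa.load("north.wav", sr=16000)
# x_b, _ = librosa.load("south.wav", sr=16000)
# hybrid = swap_excitation(np.asarray(x_a), np.asarray(x_b))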
Authors:
Jack McLaughlin,
Douglas A Reynolds,
Elliot Singer,
Gerald C O'Leary,
Paper number 1969
Abstract:
Blind clustering of multi-person utterances by speaker is complicated
by the fact that each utterance has at least two talkers. In the case
of a two-person conversation, one can simply split each conversation
into its respective speaker halves, but this introduces error which
ultimately hurts clustering. We propose a clustering algorithm which
is capable of associating each conversation with two clusters (and
therefore two speakers), obviating the need for splitting. Results are
given for two-speaker conversations culled from the Switchboard corpus,
and comparisons are made to results obtained on single-speaker utterances.
We conclude that although the approach is promising, our technique
for computing inter-conversation similarities prior to clustering needs
improvement.
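The paper's central device, letting one utterance belong to two clusters at once, can be illustrated with a k-means-style sketch over per-conversation feature vectors. This is an illustration of double assignment only, not the authors' algorithm, and the feature representation X is left abstract.

import numpy as np

def two_assignment_kmeans(X, k, iters=20, seed=0):
    # X: one feature vector per conversation (rows). Each conversation
    # is assigned to its TWO nearest centroids, reflecting its two talkers,
    # so no error-prone speaker splitting is needed beforehand.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        top2 = np.argsort(d, axis=1)[:, :2]          # two labels per row
        for j in range(k):
            members = X[(top2 == j).any(axis=1)]
            if len(members):
                centers[j] = members.mean(axis=0)
    return top2  # two cluster (speaker) hypotheses per conversation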
Authors:
Ivan Magrin-Chagnolleau,
Aaron E Rosenberg,
S. Parthasarathy,
Paper number 1988
Abstract:
The problem of speaker detection in audio databases is addressed in
this paper. Gaussian mixture modeling is used to build target speaker
and background models. A detection algorithm based on a likelihood
ratio calculation is applied to estimate target speaker segments. Evaluation
procedures are defined in detail for this task. Results are given for
different subsets of the HUB4 broadcast news database. For one target
speaker, with the data restricted to high quality speech segments,
the segment miss rate is approximately 7%. For unrestricted data the
segment miss rate is approximately 27%. In both cases the false alarm
rate is 4 or 5 per hour. For two target speakers with unrestricted
data, the segment miss rate is approximately 63% with about 27 segment
false alarms per hour. The decrease in performance for two target speakers
is largely associated with short speech segments in the two-target-speaker
test data, which are undetectable in the current configuration
of the detection algorithm.
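A minimal sketch of the likelihood-ratio detection stage described above, assuming GMMs trained with scikit-learn on rows of cepstral features; component counts, the smoothing window, and the threshold are illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

# target = GaussianMixture(32, covariance_type="diag").fit(target_frames)
# background = GaussianMixture(64, covariance_type="diag").fit(bg_frames)

def detect_segments(frames, target, background, thresh=0.0, win=50):
    # Frame-level log-likelihood ratio, smoothed, then thresholded into
    # contiguous (start, end) frame segments.
    llr = target.score_samples(frames) - background.score_samples(frames)
    smooth = np.convolve(llr, np.ones(win) / win, mode="same")
    active = smooth > thresh
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments

Short target-speaker segments, the dominant error source reported above, are exactly those that a smoothed ratio of this kind tends to miss.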
Authors:
Olivier Siohan,
Chin Hui Lee,
Arun C Surendran,
Qi Li,
Paper number 2068
Abstract:
Most state-of-the-art speaker verification systems need a user model
built from samples of the customer's speech, and a speaker-independent
(SI) background model with high acoustic resolution. These systems
rely heavily on the availability of speaker-independent databases along
with a priori knowledge about acoustic rules of the utterance, and
depend on the consistency of acoustic conditions under which the SI
models were trained. These constraints may be a burden in practical
and portable devices, such as palm-top computers or wireless handsets,
which place a premium on computation and memory, and where the user
is free to choose any password utterance in any language, under any
acoustic condition. In this paper, we present a novel and reliable
approach to background model design when only the enrollment data is
available. Preliminary results are provided to demonstrate the effectiveness
of such systems.
Authors:
Joseph P Campbell Jr,
Douglas A Reynolds,
Paper number 2247
Abstract:
Using standard speech corpora for development and evaluation has proven
to be very valuable in promoting progress in speech and speaker recognition
research. In this paper, we present an overview of current publicly
available corpora intended for speaker recognition research and evaluation.
We outline the corpora's salient features with respect to their suitability
for conducting speaker recognition experiments and evaluations. Links
to these corpora, and to new corpora, will appear on the web at http://www.apl.jhu.edu/Classes/Notes/Campbell/SpkrRec/.
We hope to increase the awareness and use of these standard corpora
and corresponding evaluation procedures throughout the speaker recognition
community.
Authors:
Francois Pellegrino,
Régine André-Obrecht,
Paper number 2324
Abstract:
This paper presents an unsupervised approach to Automatic Language
Identification (ALI) based on vowel system modeling. Each language's
vowel system is modeled by a Gaussian Mixture Model (GMM) trained with
automatically detected vowels. Since this detection is unsupervised
and language independent, no labeled data are required. GMMs are initialized
using an efficient data-driven variant of the LBG algorithm: the LBG-Rissanen
algorithm. With 5 languages from the OGI MLTS corpus, in a closed-set
identification task, we reach 79% correct identification using
only the vowel segments detected in 45-second utterances from
the male speakers.
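A sketch of LBG-style binary splitting with a Rissanen-style (MDL) criterion for choosing the codebook size, which can then initialize a per-language GMM. The exact criterion in the paper may differ; the MDL form below is a generic one and is an assumption.

import numpy as np
from sklearn.cluster import KMeans

def lbg_rissanen(X, max_k=64, eps=1e-3):
    # LBG: refine the codebook at each size, then split every centroid.
    # Keep the codebook whose size minimizes the MDL score.
    n, d = X.shape
    centers = X.mean(axis=0, keepdims=True)
    best_mdl, best_centers = np.inf, centers
    while len(centers) <= max_k:
        km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
        centers = km.cluster_centers_
        mdl = (0.5 * n * d * np.log(km.inertia_ / (n * d))
               + 0.5 * len(centers) * d * np.log(n))   # fit + complexity
        if mdl < best_mdl:
            best_mdl, best_centers = mdl, centers.copy()
        centers = np.vstack([centers * (1 + eps), centers * (1 - eps)])
    return best_centers

# e.g. cb = lbg_rissanen(vowel_frames)
#      GaussianMixture(n_components=len(cb), means_init=cb).fit(vowel_frames)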
Authors:
Bryan L. Pellom, Duke University (USA)
John H.L. Hansen, Duke University (USA)
Paper number 2382
Abstract:
This paper investigates the relative sensitivity of a GMM-based voice
verification algorithm to computer voice-altered imposters. First,
a new trainable speech synthesis algorithm based on trajectory models
of the speech Line Spectral Frequency (LSF) parameters is presented
in order to model the spectral characteristics of a target voice. A
GMM-based speaker verifier is then constructed for the 138 speaker
YOHO database and shown to have an initial equal-error rate (EER) of
1.45% for the case of casual imposter attempts using a single combination-lock
phrase test. Next, imposter voices are automatically altered using
the synthesis algorithm to mimic the customer's voice. After voice
transformation, the false acceptance rate is shown to increase from
1.45% to over 86% if the baseline EER threshold is left unmodified.
Furthermore, at a customer false rejection rate of 25%, the false acceptance
rate for the voice-altered imposter remains as high as 34.6%.
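The quoted operating points (the 1.45% EER and the false acceptance rate at a fixed false rejection rate) follow directly from the verifier's score distributions. A sketch, assuming arrays of genuine and impostor scores where higher means more target-like:

import numpy as np

def equal_error_rate(genuine, impostor):
    # Sweep every observed score as a threshold (accept if score >= t) and
    # return the point where false-accept and false-reject rates cross.
    thr = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([np.mean(impostor >= t) for t in thr])
    frr = np.array([np.mean(genuine < t) for t in thr])
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i]), thr[i]

def far_at_frr(genuine, impostor, target_frr=0.25):
    # E.g. the 34.6% false acceptance quoted at 25% false rejection.
    thr = np.quantile(genuine, target_frr)
    return np.mean(impostor >= thr)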
Authors:
Toshihiro Isobe,
Jun-ichi Takahashi,
Paper number 1893
Abstract:
This paper describes a new cohort normalization method for HMM-based
speaker verification. In the proposed method, cohort models are synthesized
based on the similarity of local acoustic features between speakers.
The similarity can be determined using acoustic information contained in
model components such as phonemes, states, and the Gaussian distributions
of HMMs. With this method, the synthesized models can provide an effective
normalizing score for various observed measurements, because the difference
between the individual reference model and the synthesized cohort models
is statistically reduced through fine evaluation of acoustic similarity
at the model-structure level. In experiments using telephone speech
of 100 speakers, it was found that high verification performance can
be achieved by the proposed method: the Equal Error Rate (EER) was
drastically reduced from 1.20% (obtained with conventional speaker-selection-based
cohort normalization) to 0.30% (obtained with the proposed method
using distribution-based selection) in the closed test. Furthermore, the EER
was also reduced from 1.40% to 0.70% in the open test (25 reference speakers,
75 impostors), in which speakers other than the reference speakers
were used as impostors.
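Cohort normalization itself reduces to subtracting a cohort score from the claimant's score. A sketch with GMM stand-ins for the paper's HMMs; the paper's actual contribution, synthesizing the cohort models from acoustically similar states and distributions, is not reproduced here.

import numpy as np

def cohort_normalized_score(frames, claimant, cohorts):
    # claimant, cohorts: models exposing score(frames) -> average
    # log-likelihood per frame (e.g. sklearn GaussianMixture). Subtracting
    # the mean cohort score normalizes away non-speaker variability.
    return claimant.score(frames) - np.mean([c.score(frames) for c in cohorts])

# Accept the identity claim if the normalized score exceeds a threshold
# tuned, for instance, to the equal-error operating point.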
Authors:
Li Liu,
Jialong He,
Paper number 1022
Abstract:
Gaussian mixture modeling (GMM) techniques are increasingly being
used for both speaker identification and verification. Most of these
models assume diagonal covariance matrices. Although empirically any
distribution can be approximated with a diagonal GMM, a large number
of mixture components are usually needed to obtain a good approximation.
A consequence of using a large GMM is that its training is time-consuming
and its response speed is very slow. This paper proposes a modification
to the standard diagonal GMM approach. The proposed scheme includes
an orthogonal transformation: feature vectors are first transformed
to the space spanned by the eigenvectors of the covariance matrix before
being applied to the diagonal GMM. Only a small computational load is introduced
by this transformation, but results from both speaker identification
and verification experiments indicate that the orthogonal transformation
considerably improves recognition performance. For a specific performance
level, the GMM with orthogonal transform needs only one-fourth the
number of Gaussian functions required by the standard GMM.
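The proposed front end amounts to a decorrelating rotation applied before a diagonal GMM. A minimal sketch with scikit-learn; the component count is illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_orthogonal_gmm(X, n_components=16):
    # Rotate features into the eigenbasis of their global covariance so
    # the diagonal-covariance GMM fits decorrelated data. No dimensions
    # are discarded; the same rotation must be applied at test time.
    _, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag").fit(X @ vecs)
    return gmm, vecs

# Scoring a test utterance: gmm.score(X_test @ vecs)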
Authors:
Stephen A Zahorian,
Paper number 2049
Abstract:
A neural network algorithm for speaker identification with large groups
of speakers is described. The technique is derived from an earlier approach
in which an N-way speaker identification task is partitioned into N*(N-1)/2
two-way classification tasks. Each two-way classification task is performed
using a small two-way, or pair-wise, neural network.
The decisions of these two-way networks are then combined to make the
N-way speaker identification decision (Rudasi and Zahorian, 1991 and
1992). Although very accurate, this method has the drawback of requiring
a very large number of pair-wise networks. In the new approach, two-way
neural network classifiers, each of which is trained only to separate
two speakers, are also used to separate other pairs of speakers. This
method is able to greatly reduce the number of pair-wise classifiers
required for making an N-way classification decision, especially when
the number of speakers is very large. For 100 speakers extracted from
the TIMIT database, the number of pair-wise classifiers can be reduced
by approximately a factor of 5, with only minor degradation in performance
when 3 seconds or more of speech is used for identification. Using
all 630 speakers from the TIMIT database, this method can be used to
obtain over 99.7% accuracy. With the telephone version of the same
database, an accuracy of 40.2% can be obtained.
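Combining the two-way decisions is a vote count over all N*(N-1)/2 speaker pairs. A sketch in which the pair-classifier interface is hypothetical; the paper's key saving, reusing one network for several speaker pairs, is omitted.

from itertools import combinations

def identify(x, pair_clf, speakers):
    # pair_clf[(i, j)] returns i or j for test sample x (hypothetical
    # interface). The speaker with the most pair-wise wins is the
    # N-way identification decision.
    votes = {s: 0 for s in speakers}
    for i, j in combinations(speakers, 2):
        votes[pair_clf[(i, j)](x)] += 1
    return max(votes, key=votes.get)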