Topics in Speaker and Language Recognition


Implications of Glottal Source for Speaker and Dialect Identification

Authors:

Lisa R. Yanguas,
Thomas F. Quatieri,
Fred Goodman

Page (NA) Paper number 1573

Abstract:

In this paper we explore the importance of speaker-specific information carried in the glottal source by interchanging glottal excitation and vocal tract system information for speaker pairs through analysis/synthesis techniques. Earlier work has shown the importance of glottal flow derivative waveform model parameters in automatic speaker identification [2] and also in voice style (i.e., creaky or aspirated) [1]. In our work, after matching phonemes across utterance pairs, we separately interchange three excitation characteristics: timing, pitch, and glottal flow derivative, and investigate their relative importance in two case studies. Through time alignment and pitch and glottal flow transformations, we can make a speaker of a northern dialect sound more like his southern counterpart; through the same processes, a Peruvian speaker is made to sound more Cuban-like. From these experiments we conclude that significant speaker- and dialect-specific information, such as noise, breathiness or aspiration, and vocalization and stridency, is carried in the glottal signal. Based on these findings, it appears that although dialect identification has usually been approached in a manner similar to language identification, it may actually be more closely related to the speaker identification task. The work suggests further study linking speaker and dialect identification more closely. We plan, for example, to explore other speaker identification approaches to dialect identification in the hope of improving automatic dialect identification algorithms.
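
The core operation described above, separating excitation from vocal-tract filtering and recombining the two across speakers, can be illustrated with a rough all-pole (LPC) analysis/synthesis sketch. This is not the authors' glottal-flow-derivative model; it is a minimal cross-synthesis illustration that assumes librosa and scipy are available and that the two utterances are already time-aligned and of equal length.

# Rough cross-synthesis sketch: drive speaker A's vocal-tract filter with
# speaker B's excitation (LPC residual). Whole-utterance LPC is used for
# brevity; real systems work frame by frame.
import numpy as np
import librosa
from scipy.signal import lfilter

def lpc_split(y, order=16):
    """Return (lpc_coeffs, residual): all-pole model and inverse-filtered excitation."""
    a = librosa.lpc(y, order=order)          # a[0] == 1
    residual = lfilter(a, [1.0], y)          # inverse filter -> excitation estimate
    return a, residual

def cross_synthesize(y_tract, y_excite, order=16):
    """Excite y_tract's vocal-tract (all-pole) filter with y_excite's residual."""
    a_tract, _ = lpc_split(y_tract, order)
    _, resid = lpc_split(y_excite, order)
    return lfilter([1.0], a_tract, resid)    # resynthesize with swapped excitation

# Example usage (hypothetical file names):
# y_a, sr = librosa.load("northern_speaker.wav", sr=16000)
# y_b, _  = librosa.load("southern_speaker.wav", sr=16000)
# hybrid = cross_synthesize(y_a, y_b[:len(y_a)])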

IC991573.PDF (From Author) IC991573.PDF (Rasterized)



Automatic Speaker Clustering from Multi-speaker Utterances

Authors:

Jack McLaughlin,
Douglas A. Reynolds,
Elliot Singer,
Gerald C. O'Leary

Page (NA) Paper number 1969

Abstract:

Blind clustering of multi-person utterances by speaker is complicated by the fact that each utterance has at least two talkers. In the case of a two-person conversation, one can simply split each conversation into its respective speaker halves, but this introduces errors which ultimately hurt clustering. We propose a clustering algorithm which is capable of associating each conversation with two clusters (and therefore two speakers), obviating the need for splitting. Results are given for two-speaker conversations culled from the Switchboard corpus, and comparisons are made to results obtained on single-speaker utterances. We conclude that although the approach is promising, our technique for computing inter-conversation similarities prior to clustering needs improvement.
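
The conventional baseline mentioned above, splitting each two-person conversation into its per-channel halves and then clustering the halves by speaker, can be sketched as follows. The similarity matrix is a placeholder for any speaker-similarity measure; this is not the paper's joint two-cluster algorithm.

# Sketch of the splitting baseline: agglomeratively cluster conversation
# halves given a precomputed speaker-similarity matrix (larger = more alike).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_halves(similarity, n_speakers):
    """similarity: symmetric (n_halves x n_halves) matrix."""
    dist = np.max(similarity) - similarity            # convert to a dissimilarity
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")       # average-link agglomeration
    return fcluster(tree, t=n_speakers, criterion="maxclust")

# Toy usage: 4 conversation halves, 2 underlying speakers.
sim = np.array([[1.0, 0.2, 0.9, 0.1],
                [0.2, 1.0, 0.1, 0.8],
                [0.9, 0.1, 1.0, 0.2],
                [0.1, 0.8, 0.2, 1.0]])
print(cluster_halves(sim, n_speakers=2))              # e.g. [1 2 1 2]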

IC991969.PDF (From Author) IC991969.PDF (Rasterized)



Detection of Target Speakers in Audio Databases

Authors:

Ivan Magrin-Chagnolleau,
Aaron E. Rosenberg,
S. Parthasarathy

Page (NA) Paper number 1988

Abstract:

The problem of speaker detection in audio databases is addressed in this paper. Gaussian mixture modeling is used to build target speaker and background models. A detection algorithm based on a likelihood ratio calculation is applied to estimate target speaker segments. Evaluation procedures are defined in detail for this task. Results are given for different subsets of the HUB4 broadcast news database. For one target speaker, with the data restricted to high-quality speech segments, the segment miss rate is approximately 7%. For unrestricted data the segment miss rate is approximately 27%. In both cases the false alarm rate is 4 or 5 per hour. For two target speakers with unrestricted data, the segment miss rate is approximately 63%, with about 27 segment false alarms per hour. The decrease in performance for two target speakers is largely associated with short speech segments in the two-target-speaker test data, which are undetectable in the current configuration of the detection algorithm.
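
The detection statistic described above is a log-likelihood ratio between a target-speaker GMM and a background GMM, accumulated over a candidate segment and compared to a threshold. A minimal sketch follows, using scikit-learn's GaussianMixture as a stand-in (the paper does not specify an implementation); feature extraction and segmentation are assumed to be done elsewhere.

# Minimal likelihood-ratio detector sketch: score each segment with
# log p(X | target GMM) - log p(X | background GMM), averaged per frame,
# and flag segments whose score exceeds a threshold.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=32):
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(features)

def detect_segments(segments, target_gmm, background_gmm, threshold=0.0):
    """segments: list of (n_frames x dim) arrays. Returns indices of detected segments."""
    hits = []
    for i, seg in enumerate(segments):
        llr = np.mean(target_gmm.score_samples(seg) -
                      background_gmm.score_samples(seg))   # per-frame LLR, averaged
        if llr > threshold:
            hits.append(i)
    return hits

# Usage sketch (hypothetical feature arrays):
# target_gmm = train_gmm(target_training_feats)
# background_gmm = train_gmm(background_training_feats)
# detected = detect_segments(test_segments, target_gmm, background_gmm, threshold=0.5)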

IC991988.PDF (From Author) IC991988.PDF (Rasterized)



Background Model Design for Flexible and Portable Speaker Verification Systems

Authors:

Olivier Siohan,
Chin-Hui Lee,
Arun C. Surendran,
Qi Li

Page (NA) Paper number 2068

Abstract:

Most state-of-the-art speaker verification systems need a user model built from samples of the customer's speech and a speaker-independent (SI) background model with high acoustic resolution. These systems rely heavily on the availability of speaker-independent databases along with a priori knowledge about acoustic rules of the utterance, and depend on the consistency of acoustic conditions under which the SI models were trained. These constraints may be a burden in practical and portable devices such as palm-top computers or wireless handsets, which place a premium on computation and memory, and where the user is free to choose any password utterance in any language, under any acoustic condition. In this paper, we present a novel and reliable approach to background model design when only the enrollment data is available. Preliminary results are provided to demonstrate the effectiveness of such systems.

IC992068.PDF (From Author) IC992068.PDF (Rasterized)



Corpora For The Evaluation Of Speaker Recognition Systems

Authors:

Joseph P. Campbell Jr.,
Douglas A. Reynolds

Page (NA) Paper number 2247

Abstract:

Using standard speech corpora for development and evaluation has proven to be very valuable in promoting progress in speech and speaker recognition research. In this paper, we present an overview of current publicly available corpora intended for speaker recognition research and evaluation. We outline the corpora's salient features with respect to their suitability for conducting speaker recognition experiments and evaluations. Links to these corpora, and to new corpora, will appear on the web at http://www.apl.jhu.edu/Classes/Notes/Campbell/SpkrRec/. We hope to increase the awareness and use of these standard corpora and corresponding evaluation procedures throughout the speaker recognition community.

IC992247.PDF (From Author) IC992247.PDF (Rasterized)



An Unsupervised Approach to Language Identification

Authors:

François Pellegrino,
Régine André-Obrecht

Page (NA) Paper number 2324

Abstract:

This paper presents an unsupervised approach to Automatic Language Identification (ALI) based on vowel system modeling. Each language's vowel system is modeled by a Gaussian Mixture Model (GMM) trained with automatically detected vowels. Since this detection is unsupervised and language independent, no labeled data are required. The GMMs are initialized using an efficient data-driven variant of the LBG algorithm: the LBG-Rissanen algorithm. With 5 languages from the OGI MLTS corpus and in a closed-set identification task, we reach 79% correct identification using only the vowel segments detected in 45-second utterances from the male speakers.
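
The identification step implied above, one GMM per language trained on automatically detected vowel frames and a maximum-likelihood decision over a closed set, can be sketched as follows. Scikit-learn GMMs with default initialization stand in for the LBG-Rissanen-initialized models used in the paper.

# Closed-set language ID sketch: fit one GMM per language on vowel-segment
# features, then pick the language whose model gives the highest total
# log-likelihood for an unknown utterance's vowel frames.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_language_models(train_data, n_components=16):
    """train_data: dict {language: (n_frames x dim) vowel features}."""
    return {lang: GaussianMixture(n_components=n_components,
                                  covariance_type="diag").fit(feats)
            for lang, feats in train_data.items()}

def identify_language(models, utterance_vowel_frames):
    scores = {lang: gmm.score_samples(utterance_vowel_frames).sum()
              for lang, gmm in models.items()}
    return max(scores, key=scores.get)

# Usage sketch (hypothetical features):
# models = train_language_models({"en": en_vowels, "fr": fr_vowels, "es": es_vowels})
# print(identify_language(models, unknown_vowel_frames))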

IC992324.PDF (From Author) IC992324.PDF (Rasterized)



An Experimental Study Of Speaker Verification Sensitivity To Computer Voice-Altered Imposters

Authors:

Bryan L. Pellom, Duke University (U.S.A.)
John H.L. Hansen, Duke University (U.S.A.)

Page (NA) Paper number 2382

Abstract:

This paper investigates the relative sensitivity of a GMM-based voice verification algorithm to computer voice-altered imposters. First, a new trainable speech synthesis algorithm based on trajectory models of the speech Line Spectral Frequency (LSF) parameters is presented in order to model the spectral characteristics of a target voice. A GMM-based speaker verifier is then constructed for the 138-speaker YOHO database and shown to have an initial equal-error rate (EER) of 1.45% for the case of casual imposter attempts using a single combination-lock phrase test. Next, imposter voices are automatically altered using the synthesis algorithm to mimic the customer's voice. After voice transformation, the false acceptance rate is shown to increase from 1.45% to over 86% if the baseline EER threshold is left unmodified. Furthermore, at a customer false rejection rate of 25%, the false acceptance rate for the voice-altered imposters remains as high as 34.6%.
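
The operating points quoted above (the EER, and the false acceptance rate at a fixed false rejection rate) are obtained by sweeping a decision threshold over sets of genuine and impostor verification scores. A small, generic sketch of that computation is given below; it is not the paper's evaluation code.

# Generic sketch of the two operating points quoted above: the equal-error
# rate, and the error trade-off at any chosen threshold.
import numpy as np

def error_rates(genuine, impostor, threshold):
    frr = np.mean(np.asarray(genuine) < threshold)    # customers rejected
    far = np.mean(np.asarray(impostor) >= threshold)  # impostors accepted
    return far, frr

def equal_error_rate(genuine, impostor):
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far, frr = zip(*(error_rates(genuine, impostor, t) for t in thresholds))
    far, frr = np.array(far), np.array(frr)
    i = np.argmin(np.abs(far - frr))                  # threshold where FAR ~= FRR
    return (far[i] + frr[i]) / 2.0, thresholds[i]

# Toy usage with made-up verification scores:
genuine = np.random.normal(2.0, 1.0, 1000)    # true-speaker trials
impostor = np.random.normal(0.0, 1.0, 1000)   # impostor trials
eer, thr = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.3f} at threshold {thr:.2f}")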

IC992382.PDF (From Author) IC992382.PDF (Rasterized)



A New Cohort Normalization Using Local Acoustic Information For Speaker Verification

Authors:

Toshihiro Isobe,
Jun-ichi Takahashi

Page (NA) Paper number 1893

Abstract:

This paper describes a new cohort normalization method for HMM-based speaker verification. In the proposed method, cohort models are synthesized based on the similarity of local acoustic features between speakers. The similarity can be determined using acoustic information lying in model components such as phonemes, states, and the Gaussian distributions of HMMs. With this method, the synthesized models can provide an effective normalizing score for various observed measurements because the difference between the individual reference model and the synthesized cohort models is statistically reduced through fine evaluation of acoustic similarity at the model-structure level. In experiments using telephone speech from 100 speakers, it was found that high verification performance can be achieved by the proposed method: the Equal Error Rate (EER) was drastically reduced from 1.20% (obtained by the conventional speaker-selection-based cohort normalization) to 0.30% (obtained by the proposed method with distribution-based selection) in a closed test. Furthermore, the EER was also reduced from 1.40% to 0.70% in an open test (reference speakers: 25, impostors: 75), in which speakers other than the reference speakers were used as impostors.
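
Cohort normalization in general subtracts a competing-model score from the claimed-speaker score before thresholding. A generic sketch of that scoring step follows; how the cohort models are chosen or synthesized (the paper's contribution) is not reproduced here.

# Generic cohort score normalization sketch: normalize the claimant's
# log-likelihood by the average log-likelihood of a cohort of competing
# models, then compare the normalized score to a threshold.
import numpy as np

def cohort_normalized_score(claimant_loglik, cohort_logliks):
    """claimant_loglik: scalar; cohort_logliks: iterable of scalars."""
    return claimant_loglik - np.mean(cohort_logliks)

def verify(claimant_loglik, cohort_logliks, threshold):
    return cohort_normalized_score(claimant_loglik, cohort_logliks) > threshold

# Toy usage: claimant scores a bit above three cohort models.
print(verify(-41.2, [-44.0, -43.1, -45.6], threshold=1.0))   # True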

IC991893.PDF (From Author) IC991893.PDF (Rasterized)



On The Use Of Orthogonal GMM In Speaker Recognition

Authors:

Li Liu,
Jialong He

Page (NA) Paper number 1022

Abstract:

Gaussian mixture modeling (GMM) techniques are increasingly being used for both speaker identification and verification. Most of these models assume diagonal covariance matrices. Although empirically any distribution can be approximated with a diagonal GMM, a large number of mixture components are usually needed to obtain a good approximation. A consequence of using a large GMM is that its training is time consuming and its response speed is very slow. This paper proposes a modification to the standard diagonal GMM approach. The proposed scheme includes an orthogonal transformation: feature vectors are first transformed to the space spanned by the eigenvectors of the covariance matrix before being applied to the diagonal GMM. Only a small computational load is introduced by this transformation, but results from both speaker identification and verification experiments indicate that the orthogonal transformation considerably improves recognition performance. For a given performance level, the GMM with the orthogonal transform needs only one-fourth the number of Gaussian functions required by the standard GMM.
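
The orthogonal transformation described above is a rotation of the feature space onto the eigenvectors of the training-data covariance matrix (a PCA rotation with no dimensionality reduction), applied before fitting the diagonal-covariance GMM. A minimal sketch, using scikit-learn as an assumed stand-in for the paper's GMM training:

# Minimal sketch of the orthogonal transform: rotate features onto the
# eigenvectors of the training covariance matrix (decorrelating the
# dimensions, keeping all of them), then fit a diagonal GMM in the rotated
# space. Test features must be rotated with the same matrix.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_orthogonal_gmm(train_feats, n_components=16):
    cov = np.cov(train_feats, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)           # columns are orthonormal eigenvectors
    rotated = train_feats @ eigvecs
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag").fit(rotated)
    return gmm, eigvecs

def score_utterance(gmm, eigvecs, feats):
    return gmm.score_samples(feats @ eigvecs).sum()

# Usage sketch (hypothetical per-speaker training features):
# gmm_a, rot_a = fit_orthogonal_gmm(speaker_a_feats)
# loglik = score_utterance(gmm_a, rot_a, test_feats)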

IC991022.PDF (From Author) IC991022.PDF (Rasterized)



Reusable Binary-Paired Partitioned Neural Networks for Text-Independent Speaker Identification

Authors:

Stephen A. Zahorian

Page (NA) Paper number 2049

Abstract:

A neural network algorithm for speaker identification with large groups of speakers is described. The technique is derived from one in which an N-way speaker identification task is partitioned into N*(N-1)/2 two-way classification tasks. Each two-way classification task is performed using a small neural network which is a two-way, or pair-wise, network. The decisions of these two-way networks are then combined to make the N-way speaker identification decision (Rudasi and Zahorian, 1991 and 1992). Although very accurate, this method has the drawback of requiring a very large number of pair-wise networks. In the new approach, two-way neural network classifiers, each of which is trained only to separate two speakers, are also used to separate other pairs of speakers. This method greatly reduces the number of pair-wise classifiers required for making an N-way classification decision, especially when the number of speakers is very large. For 100 speakers extracted from the TIMIT database, the number of pair-wise classifiers can be reduced by approximately a factor of 5, with only minor degradation in performance when 3 seconds or more of speech is used for identification. Using all 630 speakers from the TIMIT database, this method achieves over 99.7% accuracy. With the telephone version of the same database, an accuracy of 40.2% is obtained.
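
The underlying pair-wise partitioning framework, one small two-way classifier per speaker pair with the N-way decision made by counting pair-wise wins, can be sketched as follows. Logistic regressions stand in for the paper's small neural networks, and the classifier-reuse scheme that is the paper's contribution is not shown.

# Sketch of pair-wise partitioned N-way speaker identification: train one
# two-way classifier per speaker pair, then identify an unknown utterance by
# voting over the pair-wise decisions.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_pairwise(features_by_speaker):
    """features_by_speaker: dict {speaker_id: (n_frames x dim) array}."""
    models = {}
    for a, b in combinations(sorted(features_by_speaker), 2):
        X = np.vstack([features_by_speaker[a], features_by_speaker[b]])
        y = np.array([0] * len(features_by_speaker[a]) + [1] * len(features_by_speaker[b]))
        models[(a, b)] = LogisticRegression(max_iter=1000).fit(X, y)
    return models

def identify(models, test_frames, speakers):
    wins = {s: 0 for s in speakers}
    for (a, b), clf in models.items():
        # Average the pair-wise decision over all test frames, then cast one vote.
        mean_label = clf.predict(test_frames).mean()
        wins[b if mean_label > 0.5 else a] += 1
    return max(wins, key=wins.get)

# Usage sketch (hypothetical cepstral features per speaker):
# models = train_pairwise({"spk01": f1, "spk02": f2, "spk03": f3})
# print(identify(models, unknown_frames, ["spk01", "spk02", "spk03"]))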

IC992049.PDF (From Author) IC992049.PDF (Rasterized)
