Speech Production and Synthesis

1: Speech Processing
CELP Coding
Large Vocabulary Recognition
Speech Analysis and Enhancement
Acoustic Modeling I
ASR Systems and Applications
Topics in Speech Coding
Speech Analysis
Low Bit Rate Speech Coding I
Robust Speech Recognition in Noisy Environments
Speaker Recognition
Acoustic Modeling II
Speech Production and Synthesis
Feature Extraction
Robust Speech Recognition and Adaptation
Low Bit Rate Speech Coding II
Speech Understanding
Language Modeling I
2: Speech Processing, Audio and Electroacoustics, and Neural Networks
Acoustic Modeling III
Lexical Issues/Search
Speech Understanding and Systems
Speech Analysis and Quantization
Utterance Verification/Acoustic Modeling
Language Modeling II
Adaptation/Normalization
Speech Enhancement
Topics in Speaker and Language Recognition
Echo Cancellation and Noise Control
Coding
Auditory Modeling, Hearing Aids and Applications of Signal Processing to Audio and Acoustics
Spatial Audio
Music Applications
Application - Pattern Recognition & Speech Processing
Theory & Neural Architecture
Signal Separation
Application - Image & Nonlinear Signal Processing
3: Signal Processing Theory & Methods I
Filter Design and Structures
Detection
Wavelets
Adaptive Filtering: Applications and Implementation
Nonlinear Signals and Systems
Time/Frequency and Time/Scale Analysis
Signal Modeling and Representation
Filterbank and Wavelet Applications
Source and Signal Separation
Filterbanks
Emerging Applications and Fast Algorithms
Frequency and Phase Estimation
Spectral Analysis and Higher Order Statistics
Signal Reconstruction
Adaptive Filter Analysis
Transforms and Statistical Estimation
Markov and Bayesian Estimation and Classification
4: Signal Processing Theory & Methods II, Design and Implementation of Signal Processing Systems, Special Sessions, and Industry Technology Tracks
System Identification, Equalization, and Noise Suppression
Parameter Estimation
Adaptive Filters: Algorithms and Performance
DSP Development Tools
VLSI Building Blocks
DSP Architectures
DSP System Design
Education
Recent Advances in Sampling Theory and Applications
Steganography: Information Embedding, Digital Watermarking, and Data Hiding
Speech Under Stress
Physics-Based Signal Processing
DSP Chips, Architectures and Implementations
DSP Tools and Rapid Prototyping
Communication Technologies
Image and Video Technologies
Automotive Applications / Industrial Signal Processing
Speech and Audio Technologies
Defense and Security Applications
Biomedical Applications
Voice and Media Processing
Adaptive Interference Cancellation
5: Communications, Sensor Array and Multichannel
Source Coding and Compression
Compression and Modulation
Channel Estimation and Equalization
Blind Multiuser Communications
Signal Processing for Communications I
CDMA and Space-Time Processing
Time-Varying Channels and Self-Recovering Receivers
Signal Processing for Communications II
Blind CDMA and Multi-Channel Equalization
Multicarrier Communications
Detection, Classification, Localization, and Tracking
Radar and Sonar Signal Processing
Array Processing: Direction Finding
Array Processing Applications I
Blind Identification, Separation, and Equalization
Antenna Arrays for Communications
Array Processing Applications II
6: Multimedia Signal Processing, Image and Multidimensional Signal Processing, Digital Signal Processing Education
Multimedia Analysis and Retrieval
Audio and Video Processing for Multimedia Applications
Advanced Techniques in Multimedia
Video Compression and Processing
Image Coding
Transform Techniques
Restoration and Estimation
Image Analysis
Object Identification and Tracking
Motion Estimation
Medical Imaging
Image and Multidimensional Signal Processing Applications I
Segmentation
Image and Multidimensional Signal Processing Applications II
Facial Recognition and Analysis
Digital Signal Processing Education


On The Limits Of Speech Recognition In Noise

Authors:

Stephen Douglas Peters,
Peter Stubley,
Jean-Marc Valin

Page (NA) Paper number 1026

Abstract:

In this article, we consider the performance of speech recognition in noise and focus on its sensitivity to the acoustic feature set. In particular, we examine the perceived information reduction imposed on a speech signal by a feature extraction method commonly used for automatic speech recognition. We observe that human recognition rates on noisy digit strings drop considerably as the speech signal undergoes the typical loss of phase and loss of frequency resolution. Steps are taken to ensure that human subjects are constrained in ways similar to those of an automatic recognizer. The high correlation between the performance of the human listeners and that of our connected digit recognizer leads us to some interesting conclusions, including that typical cepstral processing is insufficient to preserve speech information in noise.
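The information loss the abstract attributes to typical cepstral processing can be illustrated with a toy front-end (a hedged sketch, not the authors' actual feature extractor): taking the log magnitude spectrum and then a DCT discards phase entirely, so two frames that differ only in phase yield identical features.

```python
import numpy as np

def cepstral_features(frame, n_coeffs=13):
    """Toy cepstral front-end: log magnitude spectrum followed by a DCT.

    Phase is discarded at the np.abs() step, mimicking the loss the
    abstract attributes to typical cepstral processing. Illustrative
    sketch only, not the paper's feature extractor.
    """
    spectrum = np.abs(np.fft.rfft(frame))                 # phase discarded here
    log_spec = np.log(spectrum + 1e-6)                    # floor avoids log(0)
    n = len(log_spec)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)  # type-II DCT basis
    return basis @ log_spec

# Two frames with identical magnitude spectra but different phase
t = np.arange(256)
a = np.cos(2 * np.pi * 16 * t / 256)
b = np.cos(2 * np.pi * 16 * t / 256 + 1.0)  # phase-shifted copy
same = np.allclose(cepstral_features(a), cepstral_features(b), atol=1e-3)
```

The two waveforms are clearly different in the time domain, yet `same` is true: the front-end cannot distinguish them.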

IC991026.PDF (From Author) IC991026.PDF (Rasterized)

Recognition Of Spectrally Degraded Speech In Noise With Nonlinear Amplitude Mapping

Authors:

Qian-Jie Fu,
Robert V. Shannon

Page (NA) Paper number 1191

Abstract:

The present study measured phoneme recognition as a function of signal-to-noise level under conditions of spectral smearing and nonlinear amplitude mapping. Speech sounds were divided into 16 analysis bands. The envelope was extracted from each band by half-wave rectification and low-pass filtering, and was then distorted by a power-law transformation whose exponent varied from strongly compressive (p=0.3) to strongly expansive (p=3.0). This distorted envelope was used to modulate a noise carrier that was spectrally limited by the same analysis filters. Results showed that phoneme recognition scores in quiet were reduced only slightly by either expanded or compressed amplitude mapping. As the level of background noise was increased, performance deteriorated more rapidly for both compressed and linear mapping than for the expanded mapping. These results indicate that, although an expansive amplitude mapping may slightly reduce performance in quiet, it may be beneficial in noisy listening conditions.
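The per-channel processing described in the abstract can be sketched for a single channel (the paper's actual system uses 16 analysis bands and its own filter designs; the one-pole smoother and its cutoff below are assumptions for illustration):

```python
import numpy as np

def powerlaw_channel(band_signal, noise, fs=16000, cutoff=160.0, p=0.3):
    """One noise-vocoder channel with power-law envelope mapping.

    Envelope: half-wave rectification plus a one-pole low-pass (a simple
    stand-in for the paper's low-pass filter; the cutoff is an assumed
    value), then env**p with p<1 compressive and p>1 expansive; the
    distorted envelope modulates a noise carrier.
    """
    rectified = np.maximum(band_signal, 0.0)   # half-wave rectification
    alpha = np.exp(-2 * np.pi * cutoff / fs)   # one-pole smoother coefficient
    env = np.empty_like(rectified)
    state = 0.0
    for i, x in enumerate(rectified):
        state = alpha * state + (1 - alpha) * x
        env[i] = state
    return (env ** p) * noise                  # power-law, then modulate

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
band = np.sin(2 * np.pi * 500 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
noise = rng.standard_normal(len(band))
compressed = powerlaw_channel(band, noise, p=0.3)
expanded = powerlaw_channel(band, noise, p=3.0)
```

In the full system the noise carrier would also be limited by the same analysis filter as the band; this sketch omits that step to stay short.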

IC991191.PDF (From Author) IC991191.PDF (Rasterized)

Phrase Splicing and Variable Substitution Using the IBM Trainable Speech Synthesis System

Authors:

Robert E Donovan,
Martin Franz,
Jeffrey S Sorensen,
Salim Roukos

Page (NA) Paper number 1308

Abstract:

This paper describes a phrase splicing and variable substitution system which offers an intermediate form of automated speech production, lying between the extremes of recorded utterance playback and full Text-to-Speech synthesis. The system incorporates a trainable speech synthesiser and an application-specific set of pre-recorded phrases. The text to be synthesised is converted to a phone sequence using phone sequences present in the pre-recorded phrases wherever possible, and a pronunciation dictionary elsewhere. The synthesis inventory of the synthesiser is augmented with the synthesis information associated with the pre-recorded phrases used to construct the phone sequence. The synthesiser then performs a dynamic programming search over the augmented inventory to select a segment sequence to produce the output speech. The system enables the seamless splicing of pre-recorded phrases both with other phrases and with synthetic speech. It enables very high quality speech to be produced automatically within a limited domain.
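The segment-selection step is a dynamic programming search over the augmented inventory. A minimal sketch under toy costs (the synthesiser's real target and join costs are not given in the abstract, so the callables below are assumptions) shows how zero-cost joins between consecutive pieces of one pre-recorded phrase make the search prefer seamless splices:

```python
import numpy as np

def select_segments(candidates, join_cost, target_cost):
    """Dynamic-programming segment selection over an augmented inventory.

    candidates[i] lists the inventory segments available for target unit
    i; join_cost and target_cost are toy callables standing in for the
    synthesiser's real costs (assumptions for illustration).
    """
    best = [np.array([target_cost(c) for c in candidates[0]], float)]
    back = []
    for i in range(1, len(candidates)):
        prev, row, ptr = best[-1], [], []
        for c in candidates[i]:
            total = prev + np.array([join_cost(p, c) for p in candidates[i - 1]])
            j = int(np.argmin(total))
            ptr.append(j)
            row.append(total[j] + target_cost(c))
        best.append(np.array(row))
        back.append(ptr)
    j = int(np.argmin(best[-1]))       # cheapest final candidate
    path = [j]
    for ptr in reversed(back):         # trace back the optimal path
        j = ptr[j]
        path.append(j)
    return list(reversed(path))

# Segments tagged (phrase_id, position); zero join cost rewards splicing
# consecutive pieces of the same pre-recorded phrase.
candidates = [[("A", 0), ("B", 0)], [("A", 1), ("B", 5)], [("A", 2), ("B", 6)]]
join = lambda p, c: 0.0 if (p[0] == c[0] and c[1] == p[1] + 1) else 1.0
path = select_segments(candidates, join, lambda c: 0.0)
```

Here the search selects the three consecutive segments of phrase "A", since splicing them incurs no join cost.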

IC991308.PDF (From Author) IC991308.PDF (Rasterized)

Assessment And Correction Of Voice Quality Variabilities In Large Speech Databases For Concatenative Speech Synthesis

Authors:

Yannis G Stylianou

Page (NA) Paper number 1335

Abstract:

In an effort to increase the naturalness of concatenative speech synthesis, large speech databases may be recorded. While it is desirable to have varied prosodic and spectral characteristics in the database, it is not desirable to have variable voice quality. In this paper we present an automatic method for voice quality assessment and, whenever necessary, correction of large speech databases for concatenative speech synthesis. The proposed method is based on the use of a Gaussian Mixture Model to model the acoustic space of the speaker of the database, and on autoregressive filters for compensation. An objective method to measure the effectiveness of the database correction, based on a likelihood function for the speaker's GMM, is presented as well. Both objective and subjective results show that the proposed method succeeds in detecting voice quality problems and successfully corrects them. Results show a 14.2% improvement of the log-likelihood function after compensation.
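The assessment step rests on scoring frames against a GMM of the speaker's acoustic space: material with a low average log-likelihood is flagged for correction. A minimal numpy sketch with toy parameters (the real system trains the GMM on the database itself; the values below are assumed) illustrates the idea:

```python
import numpy as np

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames is (T, D); weights (K,); means and variances (K, D). In the
    paper the GMM models the speaker's acoustic space; the parameters
    used below are toy values assumed for illustration.
    """
    D = frames.shape[1]
    diff = frames[:, None, :] - means[None, :, :]          # (T, K, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_comp = log_norm[None, :] - 0.5 * np.sum(diff ** 2 / variances[None], axis=2)
    return float(np.mean(np.logaddexp.reduce(np.log(weights)[None, :] + log_comp, axis=1)))

rng = np.random.default_rng(1)
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
reference = rng.normal(means[0], 1.0, size=(200, 2))  # matches the model
outlier = rng.normal([8.0, 8.0], 1.0, size=(200, 2))  # off-model material
needs_correction = (gmm_avg_loglik(outlier, weights, means, variances)
                    < gmm_avg_loglik(reference, weights, means, variances))
```

Frames far from the modelled acoustic space score much lower, which is exactly the signal the paper's objective measure exploits.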

IC991335.PDF (From Author) IC991335.PDF (Rasterized)

Shape Invariant Time-Scale Modification of Speech Using a Harmonic Model

Authors:

Darragh O'Brien,
Alex Monaghan

Page (NA) Paper number 1527

Abstract:

A new and simple approach to shape invariant time-scale modification of speech is presented. The method, based upon a harmonic coding of each speech frame, operates entirely within the original sinusoidal model and makes no use of the "pitch-pulse onset times" used by conventional algorithms. Instead, phase coherence, and thus shape invariance, are ensured by exploiting the harmonic relation between the sine waves to bring them into phase at each adjusted frame boundary. Results suggest this approach is an excellent candidate for use within a concatenative text-to-speech synthesiser, where scaling factors typically lie within a range well handled by this algorithm.
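The phase-coherence idea can be sketched numerically: the phase of harmonic k advances k times as fast as the fundamental, so propagating only the fundamental's phase and setting phi_k = k * phi_1 keeps every sine wave in phase at the stretched frame boundary (a toy illustration of the principle, not the authors' full algorithm):

```python
import numpy as np

def boundary_phases(f0, frame_len, stretch, n_harm):
    """Harmonic phases at a time-scaled frame boundary.

    Harmonic k advances k times as fast as the fundamental, so deriving
    each phi_k from k * phi_1 keeps the harmonics mutually in phase at
    the stretched boundary, with no pitch-pulse onset times required.
    Toy sketch of the principle only.
    """
    phi1 = 2 * np.pi * f0 * frame_len * stretch  # fundamental's phase advance
    k = np.arange(1, n_harm + 1)
    return np.mod(k * phi1, 2 * np.pi)

phases = boundary_phases(f0=100.0, frame_len=0.02, stretch=1.37, n_harm=5)
```

The harmonic relation phi_k = k * phi_1 (mod 2*pi) holds by construction at the adjusted boundary, which is what guarantees shape invariance.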

IC991527.PDF (From Author) IC991527.PDF (Rasterized)

Using a Sigmoid Transformation for Improved Modeling of Phoneme Duration

Authors:

Kim E.A Silverman,
Jerome R Bellegarda

Page (NA) Paper number 1753

Abstract:

Over the past few years, the "sums-of-products" approach has emerged as one of the most promising avenues to model contextual influences on phoneme duration. The associated regression is generally applied after log-transforming the durations. This paper presents empirical and theoretical evidence which suggests that this transformation is not optimal. A promising alternative solution is proposed, based on a sigmoid function. Preliminary experimental results obtained on over 50,000 phonemes in varied prosodic contexts show that this transformation reduces the unexplained deviations in the data by more than 30%. Alternatively, for a given level of performance, it halves the number of parameters required by the model.
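A sigmoid-based duration transform of the kind proposed can be sketched as a logit/sigmoid pair: durations are squeezed into (0, 1) between assumed lower and upper bounds, the logit maps them to an unbounded regression target, and the sigmoid maps predictions back (the bracket values d_min and d_max below are assumptions, not taken from the paper):

```python
import numpy as np

def sigmoid_transform(d, d_min=20.0, d_max=300.0):
    """Map durations (ms) to an unbounded regression target via a logit,
    as an alternative to the usual log transform. d_min/d_max bracket
    plausible phoneme durations and are assumed values, not the paper's.
    """
    u = (d - d_min) / (d_max - d_min)  # squeeze into (0, 1)
    return np.log(u / (1 - u))         # logit = inverse of the sigmoid

def inverse_sigmoid_transform(y, d_min=20.0, d_max=300.0):
    u = 1.0 / (1.0 + np.exp(-y))       # sigmoid maps back into (0, 1)
    return d_min + u * (d_max - d_min)

durations = np.array([50.0, 120.0, 250.0])
targets = sigmoid_transform(durations)          # regress on these instead of log(d)
recovered = inverse_sigmoid_transform(targets)  # round-trips to the originals
```

Unlike the log transform, this mapping saturates at both ends, so the regression is not forced to spend parameters on very short or very long outliers.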

IC991753.PDF (From Author) IC991753.PDF (Rasterized)

Nonlinear Dynamic Modeling Of The Voiced Excitation For Improved Speech Synthesis

Authors:

Karthik Narasimhan,
Jose C. Principe,
Donald G. Childers

Page (NA) Paper number 2386

Abstract:

This paper describes the implementation of a waveform-based global dynamic model with the goal of capturing vocal fold variability. The residue extracted from speech by inverse filtering is pre-processed to remove phoneme dependence and is used as the input time series to the dynamic model. After training, the dynamic model is seeded with a point from the trajectory of the time series and iterated to produce the synthetic excitation waveform. The output of the dynamic model is compared with the input time series; these comparisons confirmed that the model had captured the variability in the residue. The output of the dynamic model is then used to synthesize speech with a pitch-synchronous speech synthesizer, and the resulting speech is observed to be close to natural speech.
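The seed-and-iterate procedure can be illustrated with a simple nearest-neighbour dynamic model over a delay embedding (a stand-in for the paper's trained global model, whose form the abstract does not specify): seed the state with a point from the training trajectory, predict the next sample, feed it back, and repeat.

```python
import numpy as np

def iterate_model(series, seed_idx, n_steps, dim=3):
    """Seed-and-iterate a nearest-neighbour dynamic model of a series.

    A delay embedding of the training series acts as the 'trained'
    global model (a simple stand-in for the paper's learned dynamic
    model): each new sample is the historical successor of the nearest
    embedded state, and is then fed back into the state.
    """
    emb = np.array([series[i:i + dim] for i in range(len(series) - dim)])
    succ = series[dim:]                      # successor of each embedded state
    state = list(series[seed_idx:seed_idx + dim])
    out = []
    for _ in range(n_steps):
        d = np.sum((emb - np.array(state[-dim:])) ** 2, axis=1)
        out.append(succ[int(np.argmin(d))])  # nearest state's successor
        state.append(out[-1])
    return np.array(out)

t = np.arange(400)
residue = np.sin(2 * np.pi * t / 25)         # toy periodic 'excitation residue'
synthetic = iterate_model(residue, seed_idx=10, n_steps=100)
```

Seeded from sample 10 with a 3-sample state, the iterated model reproduces the continuation of the toy residue; a real excitation residue would of course be far less regular.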

IC992386.PDF (From Author) IC992386.PDF (Rasterized)

Results On Perceptual Invariants To Transformations On Speech

Authors:

Arnaud Robert, CIRC Group, EPFL, Switzerland

Page (NA) Paper number 2463

Abstract:

This paper presents results of a study on perceptual invariants to transformations of the speech signal. A set of psychoacoustic tests was conducted to identify these invariants of the human hearing system (HS). The starting point is the decomposition of speech by an AM-FM analysis, rather than the use of more standard analysis methods. The main result of this work is the finding that our HS is robust to (that is, our perception is not altered by) instantaneous frequency (IF) changes within a certain range, even though these changes result in substantial waveform modifications. This prompted further study of how standard analysis methods cope with such perceptually invariant changes; results show that, in fact, they are not robust to them. Finally, some applications of IF changes are proposed.
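The AM-FM starting point amounts to extracting an envelope and an instantaneous frequency from the analytic signal. A minimal sketch using an FFT-based Hilbert transform (not the authors' exact analysis method) is:

```python
import numpy as np

def am_fm(x, fs):
    """Envelope (AM) and instantaneous frequency (FM) of a real signal,
    via an FFT-based Hilbert transform. A minimal sketch of an AM-FM
    decomposition, not the paper's exact analysis.
    """
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)                 # build the analytic-signal filter
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    analytic = np.fft.ifft(X * h)   # negative frequencies zeroed
    env = np.abs(analytic)          # AM envelope
    phase = np.unwrap(np.angle(analytic))
    inst_f = np.diff(phase) * fs / (2 * np.pi)  # IF in Hz
    return env, inst_f

fs = 8000
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 440 * t)     # pure 440 Hz tone
env, inst_f = am_fm(x, fs)
```

For the pure tone the envelope is flat and the IF sits at 440 Hz; the paper's experiments perturb the IF track within a range and ask whether listeners notice.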

IC992463.PDF (From Author) IC992463.PDF (Rasterized)
