Speech Production and Synthesis

1: Speech Processing
CELP Coding
Large Vocabulary Recognition
Speech Analysis and Enhancement
Acoustic Modeling I
ASR Systems and Applications
Topics in Speech Coding
Speech Analysis
Low Bit Rate Speech Coding I
Robust Speech Recognition in Noisy Environments
Speaker Recognition
Acoustic Modeling II
Speech Production and Synthesis
Feature Extraction
Robust Speech Recognition and Adaptation
Low Bit Rate Speech Coding II
Speech Understanding
Language Modeling I
2: Speech Processing, Audio and Electroacoustics, and Neural Networks
Acoustic Modeling III
Lexical Issues/Search
Speech Understanding and Systems
Speech Analysis and Quantization
Utterance Verification/Acoustic Modeling
Language Modeling II
Adaptation/Normalization
Speech Enhancement
Topics in Speaker and Language Recognition
Echo Cancellation and Noise Control
Coding
Auditory Modeling, Hearing Aids and Applications of Signal Processing to Audio and Acoustics
Spatial Audio
Music Applications
Application - Pattern Recognition & Speech Processing
Theory & Neural Architecture
Signal Separation
Application - Image & Nonlinear Signal Processing
3: Signal Processing Theory & Methods I
Filter Design and Structures
Detection
Wavelets
Adaptive Filtering: Applications and Implementation
Nonlinear Signals and Systems
Time/Frequency and Time/Scale Analysis
Signal Modeling and Representation
Filterbank and Wavelet Applications
Source and Signal Separation
Filterbanks
Emerging Applications and Fast Algorithms
Frequency and Phase Estimation
Spectral Analysis and Higher Order Statistics
Signal Reconstruction
Adaptive Filter Analysis
Transforms and Statistical Estimation
Markov and Bayesian Estimation and Classification
4: Signal Processing Theory & Methods II, Design and Implementation of Signal Processing Systems, Special Sessions, and Industry Technology Tracks
System Identification, Equalization, and Noise Suppression
Parameter Estimation
Adaptive Filters: Algorithms and Performance
DSP Development Tools
VLSI Building Blocks
DSP Architectures
DSP System Design
Education
Recent Advances in Sampling Theory and Applications
Steganography: Information Embedding, Digital Watermarking, and Data Hiding
Speech Under Stress
Physics-Based Signal Processing
DSP Chips, Architectures and Implementations
DSP Tools and Rapid Prototyping
Communication Technologies
Image and Video Technologies
Automotive Applications / Industrial Signal Processing
Speech and Audio Technologies
Defense and Security Applications
Biomedical Applications
Voice and Media Processing
Adaptive Interference Cancellation
5: Communications, Sensor Array and Multichannel
Source Coding and Compression
Compression and Modulation
Channel Estimation and Equalization
Blind Multiuser Communications
Signal Processing for Communications I
CDMA and Space-Time Processing
Time-Varying Channels and Self-Recovering Receivers
Signal Processing for Communications II
Blind CDMA and Multi-Channel Equalization
Multicarrier Communications
Detection, Classification, Localization, and Tracking
Radar and Sonar Signal Processing
Array Processing: Direction Finding
Array Processing Applications I
Blind Identification, Separation, and Equalization
Antenna Arrays for Communications
Array Processing Applications II
6: Multimedia Signal Processing, Image and Multidimensional Signal Processing, Digital Signal Processing Education
Multimedia Analysis and Retrieval
Audio and Video Processing for Multimedia Applications
Advanced Techniques in Multimedia
Video Compression and Processing
Image Coding
Transform Techniques
Restoration and Estimation
Image Analysis
Object Identification and Tracking
Motion Estimation
Medical Imaging
Image and Multidimensional Signal Processing Applications I
Segmentation
Image and Multidimensional Signal Processing Applications II
Facial Recognition and Analysis
Digital Signal Processing Education


On The Limits Of Speech Recognition In Noise

Authors:

Stephen Douglas Peters,
Peter Stubley,
Jean-Marc Valin

Page (NA) Paper number 1026

Abstract:

In this article, we consider the performance of speech recognition in noise and focus on its sensitivity to the acoustic feature set. In particular, we examine the perceived information reduction imposed on a speech signal by a feature extraction method commonly used for automatic speech recognition. We observe that human recognition rates on noisy digit strings drop considerably as the speech signal undergoes the typical loss of phase and loss of frequency resolution. Steps are taken to ensure that human subjects are constrained in ways similar to those of an automatic recognizer. The high correlation between the performance of the human listeners and that of our connected digit recognizer leads us to some interesting conclusions, including that typical cepstral processing is insufficient to preserve speech information in noise.
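The information loss the abstract attributes to typical cepstral processing can be illustrated with a toy front-end (a hedged sketch, not the authors' actual feature extractor): taking the log magnitude spectrum and then a DCT discards phase entirely, so two frames that differ only in phase yield identical features.

```python
import numpy as np

def cepstral_features(frame, n_coeffs=13):
    """Toy cepstral front-end: log magnitude spectrum followed by a DCT.

    Phase is discarded at the np.abs() step, mimicking the loss the
    abstract attributes to typical cepstral processing. Illustrative
    sketch only, not the paper's feature extractor.
    """
    spectrum = np.abs(np.fft.rfft(frame))                 # phase discarded here
    log_spec = np.log(spectrum + 1e-6)                    # floor avoids log(0)
    n = len(log_spec)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)  # type-II DCT basis
    return basis @ log_spec

# Two frames with identical magnitude spectra but different phase
t = np.arange(256)
a = np.cos(2 * np.pi * 16 * t / 256)
b = np.cos(2 * np.pi * 16 * t / 256 + 1.0)  # phase-shifted copy
same = np.allclose(cepstral_features(a), cepstral_features(b), atol=1e-3)
```

The two waveforms are clearly different in the time domain, yet `same` is true: the front-end cannot distinguish them.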

IC991026.PDF (From Author) IC991026.PDF (Rasterized)

Recognition Of Spectrally Degraded Speech In Noise With Nonlinear Amplitude Mapping

Authors:

Qian-Jie Fu,
Robert V. Shannon

Page (NA) Paper number 1191

Abstract:

The present study measured phoneme recognition as a function of signal-to-noise level under conditions of spectral smearing and nonlinear amplitude mapping. Speech sounds were divided into 16 analysis bands. The envelope was extracted from each band by half-wave rectification and low-pass filtering, and was then distorted by a power-law transformation whose exponent varied from strongly compressive (p=0.3) to strongly expansive (p=3.0). This distorted envelope was used to modulate a noise carrier that was spectrally limited by the same analysis filters. Results showed that phoneme recognition scores in quiet were reduced only slightly by either expanded or compressed amplitude mapping. As the level of background noise was increased, performance deteriorated more rapidly for both compressed and linear mapping than for the expanded mapping. These results indicate that, although an expansive amplitude mapping may slightly reduce performance in quiet, it may be beneficial in noisy listening conditions.
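The per-channel processing described in the abstract can be sketched for a single channel (the paper's actual system uses 16 analysis bands and its own filter designs; the one-pole smoother and its cutoff below are assumptions for illustration):

```python
import numpy as np

def powerlaw_channel(band_signal, noise, fs=16000, cutoff=160.0, p=0.3):
    """One noise-vocoder channel with power-law envelope mapping.

    Envelope: half-wave rectification plus a one-pole low-pass (a simple
    stand-in for the paper's low-pass filter; the cutoff is an assumed
    value), then env**p with p<1 compressive and p>1 expansive; the
    distorted envelope modulates a noise carrier.
    """
    rectified = np.maximum(band_signal, 0.0)   # half-wave rectification
    alpha = np.exp(-2 * np.pi * cutoff / fs)   # one-pole smoother coefficient
    env = np.empty_like(rectified)
    state = 0.0
    for i, x in enumerate(rectified):
        state = alpha * state + (1 - alpha) * x
        env[i] = state
    return (env ** p) * noise                  # power-law, then modulate

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
band = np.sin(2 * np.pi * 500 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
noise = rng.standard_normal(len(band))
compressed = powerlaw_channel(band, noise, p=0.3)
expanded = powerlaw_channel(band, noise, p=3.0)
```

In the full system the noise carrier would also be limited by the same analysis filter as the band; this sketch omits that step to stay short.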

IC991191.PDF (From Author) IC991191.PDF (Rasterized)

Phrase Splicing and Variable Substitution Using the IBM Trainable Speech Synthesis System

Authors:

Robert E Donovan,
Martin Franz,
Jeffrey S Sorensen,
Salim Roukos

Page (NA) Paper number 1308

Abstract:

This paper describes a phrase splicing and variable substitution system which offers an intermediate form of automated speech production, lying between the extremes of recorded utterance playback and full Text-to-Speech synthesis. The system incorporates a trainable speech synthesiser and an application-specific set of pre-recorded phrases. The text to be synthesised is converted to a phone sequence using phone sequences present in the pre-recorded phrases wherever possible, and a pronunciation dictionary elsewhere. The synthesis inventory of the synthesiser is augmented with the synthesis information associated with the pre-recorded phrases used to construct the phone sequence. The synthesiser then performs a dynamic programming search over the augmented inventory to select a segment sequence to produce the output speech. The system enables the seamless splicing of pre-recorded phrases both with other phrases and with synthetic speech. It enables very high quality speech to be produced automatically within a limited domain.
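The segment-selection step is a dynamic programming search over the augmented inventory. A minimal sketch under toy costs (the synthesiser's real target and join costs are not given in the abstract, so the callables below are assumptions) shows how zero-cost joins between consecutive pieces of one pre-recorded phrase make the search prefer seamless splices:

```python
import numpy as np

def select_segments(candidates, join_cost, target_cost):
    """Dynamic-programming segment selection over an augmented inventory.

    candidates[i] lists the inventory segments available for target unit
    i; join_cost and target_cost are toy callables standing in for the
    synthesiser's real costs (assumptions for illustration).
    """
    best = [np.array([target_cost(c) for c in candidates[0]], float)]
    back = []
    for i in range(1, len(candidates)):
        prev, row, ptr = best[-1], [], []
        for c in candidates[i]:
            total = prev + np.array([join_cost(p, c) for p in candidates[i - 1]])
            j = int(np.argmin(total))
            ptr.append(j)
            row.append(total[j] + target_cost(c))
        best.append(np.array(row))
        back.append(ptr)
    j = int(np.argmin(best[-1]))       # cheapest final candidate
    path = [j]
    for ptr in reversed(back):         # trace back the optimal path
        j = ptr[j]
        path.append(j)
    return list(reversed(path))

# Segments tagged (phrase_id, position); zero join cost rewards splicing
# consecutive pieces of the same pre-recorded phrase.
candidates = [[("A", 0), ("B", 0)], [("A", 1), ("B", 5)], [("A", 2), ("B", 6)]]
join = lambda p, c: 0.0 if (p[0] == c[0] and c[1] == p[1] + 1) else 1.0
path = select_segments(candidates, join, lambda c: 0.0)
```

Here the search selects the three consecutive segments of phrase "A", since splicing them incurs no join cost.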

IC991308.PDF (From Author) IC991308.PDF (Rasterized)

Assessment And Correction Of Voice Quality Variabilities In Large Speech Databases For Concatenative Speech Synthesis

Authors:

Yannis G Stylianou

Page (NA) Paper number 1335

Abstract:

In an effort to increase the naturalness of concatenative speech synthesis, large speech databases may be recorded. While it is desirable to have varied prosodic and spectral characteristics in the database, it is not desirable to have variable voice quality. In this paper we present an automatic method for voice quality assessment and, whenever necessary, correction of large speech databases for concatenative speech synthesis. The proposed method is based on the use of a Gaussian Mixture Model to model the acoustic space of the speaker of the database, and on autoregressive filters for compensation. An objective method to measure the effectiveness of the database correction, based on a likelihood function for the speaker's GMM, is presented as well. Both objective and subjective results show that the proposed method succeeds in detecting voice quality problems and successfully corrects them. Results show a 14.2% improvement of the log-likelihood function after compensation.
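The assessment step rests on scoring frames against a GMM of the speaker's acoustic space: material with a low average log-likelihood is flagged for correction. A minimal numpy sketch with toy parameters (the real system trains the GMM on the database itself; the values below are assumed) illustrates the idea:

```python
import numpy as np

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames is (T, D); weights (K,); means and variances (K, D). In the
    paper the GMM models the speaker's acoustic space; the parameters
    used below are toy values assumed for illustration.
    """
    D = frames.shape[1]
    diff = frames[:, None, :] - means[None, :, :]          # (T, K, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_comp = log_norm[None, :] - 0.5 * np.sum(diff ** 2 / variances[None], axis=2)
    return float(np.mean(np.logaddexp.reduce(np.log(weights)[None, :] + log_comp, axis=1)))

rng = np.random.default_rng(1)
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
reference = rng.normal(means[0], 1.0, size=(200, 2))  # matches the model
outlier = rng.normal([8.0, 8.0], 1.0, size=(200, 2))  # off-model material
needs_correction = (gmm_avg_loglik(outlier, weights, means, variances)
                    < gmm_avg_loglik(reference, weights, means, variances))
```

Frames far from the modelled acoustic space score much lower, which is exactly the signal the paper's objective measure exploits.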

IC991335.PDF (From Author) IC991335.PDF (Rasterized)

Shape Invariant Time-Scale Modification of Speech Using a Harmonic Model

Authors:

Darragh O'Brien,
Alex Monaghan

Page (NA) Paper number 1527

Abstract:

A new and simple approach to shape invariant time-scale modification of speech is presented. The method, based upon a harmonic coding of each speech frame, operates entirely within the original sinusoidal model and makes no use of the "pitch-pulse onset times" used by conventional algorithms. Instead, phase coherence, and thus shape invariance, are ensured by exploiting the harmonic relation between the sine waves to bring them into phase at each adjusted frame boundary. Results suggest this approach is an excellent candidate for use within a concatenative text-to-speech synthesiser, where scaling factors typically lie within a range well handled by this algorithm.
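The phase-coherence idea can be sketched numerically: the phase of harmonic k advances k times as fast as the fundamental, so propagating only the fundamental's phase and setting phi_k = k * phi_1 keeps every sine wave in phase at the stretched frame boundary (a toy illustration of the principle, not the authors' full algorithm):

```python
import numpy as np

def boundary_phases(f0, frame_len, stretch, n_harm):
    """Harmonic phases at a time-scaled frame boundary.

    Harmonic k advances k times as fast as the fundamental, so deriving
    each phi_k from k * phi_1 keeps the harmonics mutually in phase at
    the stretched boundary, with no pitch-pulse onset times required.
    Toy sketch of the principle only.
    """
    phi1 = 2 * np.pi * f0 * frame_len * stretch  # fundamental's phase advance
    k = np.arange(1, n_harm + 1)
    return np.mod(k * phi1, 2 * np.pi)

phases = boundary_phases(f0=100.0, frame_len=0.02, stretch=1.37, n_harm=5)
```

The harmonic relation phi_k = k * phi_1 (mod 2*pi) holds by construction at the adjusted boundary, which is what guarantees shape invariance.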

IC991527.PDF (From Author) IC991527.PDF (Rasterized)

Using a Sigmoid Transformation for Improved Modeling of Phoneme Duration

Authors:

Kim E.A Silverman,
Jerome R Bellegarda

Page (NA) Paper number 1753

Abstract:

Over the past few years, the "sums-of-products" approach has emerged as one of the most promising avenues to model contextual influences on phoneme duration. The associated regression is generally applied after log-transforming the durations. This paper presents empirical and theoretical evidence which suggests that this transformation is not optimal. A promising alternative solution is proposed, based on a sigmoid function. Preliminary experimental results obtained on over 50,000 phonemes in varied prosodic contexts show that this transformation reduces the unexplained deviations in the data by more than 30%. Alternatively, for a given level of performance, it halves the number of parameters required by the model.
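A sigmoid-based duration transform of the kind proposed can be sketched as a logit/sigmoid pair: durations are squeezed into (0, 1) between assumed lower and upper bounds, the logit maps them to an unbounded regression target, and the sigmoid maps predictions back (the bracket values d_min and d_max below are assumptions, not taken from the paper):

```python
import numpy as np

def sigmoid_transform(d, d_min=20.0, d_max=300.0):
    """Map durations (ms) to an unbounded regression target via a logit,
    as an alternative to the usual log transform. d_min/d_max bracket
    plausible phoneme durations and are assumed values, not the paper's.
    """
    u = (d - d_min) / (d_max - d_min)  # squeeze into (0, 1)
    return np.log(u / (1 - u))         # logit = inverse of the sigmoid

def inverse_sigmoid_transform(y, d_min=20.0, d_max=300.0):
    u = 1.0 / (1.0 + np.exp(-y))       # sigmoid maps back into (0, 1)
    return d_min + u * (d_max - d_min)

durations = np.array([50.0, 120.0, 250.0])
targets = sigmoid_transform(durations)          # regress on these instead of log(d)
recovered = inverse_sigmoid_transform(targets)  # round-trips to the originals
```

Unlike the log transform, this mapping saturates at both ends, so the regression is not forced to spend parameters on very short or very long outliers.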

IC991753.PDF (From Author) IC991753.PDF (Rasterized)

Nonlinear Dynamic Modeling Of The Voiced Excitation For Improved Speech Synthesis

Authors:

Karthik Narasimhan,
Jose C. Principe,
Donald G. Childers

Page (NA) Paper number 2386

Abstract:

This paper describes the implementation of a waveform-based global dynamic model with the goal of capturing vocal fold variability. The residue extracted from speech by inverse filtering is pre-processed to remove phoneme dependence and is used as the input time series to the dynamic model. After training, the dynamic model is seeded with a point from the trajectory of the time series and iterated to produce the synthetic excitation waveform. The output of the dynamic model is compared with the input time series; these comparisons confirmed that the model had captured the variability in the residue. The output of the dynamic model is then used to synthesize speech with a pitch-synchronous speech synthesizer, and the resulting speech is observed to be close to natural speech.
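The seed-and-iterate procedure can be illustrated with a simple nearest-neighbour dynamic model over a delay embedding (a stand-in for the paper's trained global model, whose form the abstract does not specify): seed the state with a point from the training trajectory, predict the next sample, feed it back, and repeat.

```python
import numpy as np

def iterate_model(series, seed_idx, n_steps, dim=3):
    """Seed-and-iterate a nearest-neighbour dynamic model of a series.

    A delay embedding of the training series acts as the 'trained'
    global model (a simple stand-in for the paper's learned dynamic
    model): each new sample is the historical successor of the nearest
    embedded state, and is then fed back into the state.
    """
    emb = np.array([series[i:i + dim] for i in range(len(series) - dim)])
    succ = series[dim:]                      # successor of each embedded state
    state = list(series[seed_idx:seed_idx + dim])
    out = []
    for _ in range(n_steps):
        d = np.sum((emb - np.array(state[-dim:])) ** 2, axis=1)
        out.append(succ[int(np.argmin(d))])  # nearest state's successor
        state.append(out[-1])
    return np.array(out)

t = np.arange(400)
residue = np.sin(2 * np.pi * t / 25)         # toy periodic 'excitation residue'
synthetic = iterate_model(residue, seed_idx=10, n_steps=100)
```

Seeded from sample 10 with a 3-sample state, the iterated model reproduces the continuation of the toy residue; a real excitation residue would of course be far less regular.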

IC992386.PDF (From Author) IC992386.PDF (Rasterized)

Results On Perceptual Invariants To Transformations On Speech

Authors:

Arnaud Robert, CIRC Group, EPFL, Switzerland

Page (NA) Paper number 2463

Abstract:

This paper presents results of a study on perceptual invariants to transformations of the speech signal. A set of psychoacoustic tests was conducted to identify these invariants of the human hearing system (HS). The starting point is the decomposition of speech by an AM-FM analysis, rather than the use of more standard analysis methods. The main result of this work is the finding that our HS is robust to (that is, our perception is not altered by) instantaneous frequency (IF) changes within a certain range, even though these changes result in substantial waveform modifications. This prompted further study of how standard analysis methods cope with such perceptually invariant changes; results show that, in fact, they are not robust to them. Finally, some applications of IF changes are proposed.
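The AM-FM starting point amounts to extracting an envelope and an instantaneous frequency from the analytic signal. A minimal sketch using an FFT-based Hilbert transform (not the authors' exact analysis method) is:

```python
import numpy as np

def am_fm(x, fs):
    """Envelope (AM) and instantaneous frequency (FM) of a real signal,
    via an FFT-based Hilbert transform. A minimal sketch of an AM-FM
    decomposition, not the paper's exact analysis.
    """
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)                 # build the analytic-signal filter
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    analytic = np.fft.ifft(X * h)   # negative frequencies zeroed
    env = np.abs(analytic)          # AM envelope
    phase = np.unwrap(np.angle(analytic))
    inst_f = np.diff(phase) * fs / (2 * np.pi)  # IF in Hz
    return env, inst_f

fs = 8000
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 440 * t)     # pure 440 Hz tone
env, inst_f = am_fm(x, fs)
```

For the pure tone the envelope is flat and the IF sits at 440 Hz; the paper's experiments perturb the IF track within a range and ask whether listeners notice.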

IC992463.PDF (From Author) IC992463.PDF (Rasterized)
