Abstract: Session SP-12

SP-12.1  

On the limits of speech recognition in noise
Stephen D Peters, Peter Stubley (Nortel Technology), Jean-Marc Valin (Sherbrooke University)

In this article, we consider the performance of speech recognition in noise and focus on its sensitivity to the acoustic feature set. In particular, we examine the perceived information reduction imposed on a speech signal by a feature extraction method commonly used for automatic speech recognition. We observe that human recognition rates on noisy digit strings drop considerably as the speech signal undergoes the typical loss of phase and loss of frequency resolution. Steps are taken to ensure that human subjects are constrained in ways similar to those of an automatic recognizer. The high correlation between the performance of the human listeners and that of our connected digit recognizer leads us to some interesting conclusions, including that typical cepstral processing is insufficient to support speech information in noise.
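
The two degradations at issue can be made concrete with a small sketch. Below is a minimal NumPy illustration (a sketch, not the authors' actual front end) of typical cepstral processing: the spectral phase is discarded, and only a handful of low-order cepstral coefficients are kept, which smooths away fine frequency detail.

```python
import numpy as np

def cepstral_smooth(frame, n_keep=13):
    """Discard phase and keep only the first n_keep cepstral
    coefficients (and their mirror), i.e. a coarse spectral envelope."""
    spec = np.fft.fft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spec) + 1e-12)        # phase is thrown away here
    ceps = np.fft.ifft(log_mag).real              # real cepstrum
    ceps[n_keep:len(ceps) - n_keep] = 0.0         # lifter out fine spectral detail
    return np.exp(np.fft.fft(ceps).real)          # smoothed magnitude spectrum
```

Resynthesizing audio from such a smoothed, phase-free spectrum (e.g. with random phase) yields stimuli degraded in the same two ways the listening tests impose on the human subjects.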


SP-12.2  

Recognition of spectrally degraded speech in noise with nonlinear amplitude mapping
Qian-Jie Fu, Robert V. Shannon (House Ear Institute)

The present study measured phoneme recognition as a function of signal-to-noise ratio under conditions of spectral smearing and nonlinear amplitude mapping. Speech sounds were divided into 16 analysis bands. The envelope was extracted from each band by half-wave rectification and low-pass filtering and was then distorted by a power-law transformation whose exponent varied from strongly compressive (p=0.3) to strongly expansive (p=3.0). This distorted envelope was used to modulate a noise that was spectrally limited by the same analysis filters. Results showed that phoneme recognition scores in quiet were reduced only slightly by either expansive or compressive amplitude mapping. As the level of background noise was increased, performance deteriorated more rapidly for the compressive and linear mappings than for the expansive mapping. These results indicate that, although an expansive amplitude mapping may slightly reduce performance in quiet, it may be beneficial in noisy listening conditions.
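
As a rough sketch of one channel of the processing described (assuming NumPy/SciPy; the filter orders and the 160 Hz envelope cutoff are guesses, since the abstract does not give them):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def vocoder_channel(speech, fs, band, p=0.3, env_cutoff=160.0):
    """One of the 16 analysis bands: band-limit the speech, extract the
    envelope by half-wave rectification and low-pass filtering, apply
    the power-law mapping env**p (p < 1 compressive, p > 1 expansive),
    and use the result to modulate noise limited to the same band."""
    b, a = butter(4, band, btype="bandpass", fs=fs)
    band_sig = filtfilt(b, a, speech)

    lb, la = butter(2, env_cutoff, fs=fs)                  # envelope smoother
    env = np.maximum(filtfilt(lb, la, np.maximum(band_sig, 0.0)), 0.0)

    noise = np.random.default_rng(0).standard_normal(len(speech))
    return (env ** p) * filtfilt(b, a, noise)
```

Summing the outputs of 16 such channels spanning the speech band gives the processed stimulus.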


SP-12.3  

Phrase Splicing and Variable Substitution Using the IBM Trainable Speech Synthesis System
Robert E Donovan, Martin Franz, Jeffrey S Sorensen, Salim Roukos (IBM TJ Watson Research Center)

This paper describes a phrase splicing and variable substitution system which offers an intermediate form of automated speech production, lying between the extremes of recorded utterance playback and full Text-to-Speech synthesis. The system incorporates a trainable speech synthesiser and an application-specific set of pre-recorded phrases. The text to be synthesised is converted to a phone sequence using phone sequences present in the pre-recorded phrases wherever possible, and a pronunciation dictionary elsewhere. The synthesis inventory of the synthesiser is augmented with the synthesis information associated with the pre-recorded phrases used to construct the phone sequence. The synthesiser then performs a dynamic programming search over the augmented inventory to select a segment sequence to produce the output speech. The system enables the seamless splicing of pre-recorded phrases, both with other phrases and with synthetic speech, and enables very high quality speech to be produced automatically within a limited domain.
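
The dynamic programming search can be sketched generically as a Viterbi-style minimisation over candidate segments; the target and join cost functions below are placeholders, not the costs the IBM system actually uses.

```python
import numpy as np

def select_segments(candidates, target_cost, join_cost):
    """Pick one segment per phone position, minimising the summed
    target and concatenation costs over the whole sequence.
    candidates[i] is the list of inventory segments for position i."""
    # best[i][j]: cheapest path cost ending in candidate j at position i
    best = [[target_cost(s) for s in candidates[0]]]
    back = []
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for seg in candidates[i]:
            costs = [best[i - 1][j] + join_cost(prev, seg)
                     for j, prev in enumerate(candidates[i - 1])]
            k = int(np.argmin(costs))
            row.append(costs[k] + target_cost(seg))
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    # Trace back the optimal segment sequence.
    j = int(np.argmin(best[-1]))
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

Splicing falls out naturally if segments that were adjacent in a pre-recorded phrase are given a near-zero join cost, so the search prefers to keep whole phrases intact.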


SP-12.4  

Assessment and correction of voice quality variabilities in large speech databases for concatenative speech synthesis
Yannis G Stylianou (AT&T Labs-Research)

In an effort to increase the naturalness of concatenative speech synthesis, large speech databases may be recorded. While it is desirable to have varied prosodic and spectral characteristics in the database, it is not desirable to have variable voice quality. In this paper we present an automatic method for voice quality assessment and, whenever necessary, correction of large speech databases for concatenative speech synthesis. The proposed method is based on the use of a Gaussian Mixture Model (GMM) to model the acoustic space of the speaker of the database, and on autoregressive filters for compensation. An objective method to measure the effectiveness of the database correction, based on a likelihood function for the speaker's GMM, is presented as well. Both objective and subjective results show that the proposed method succeeds in detecting voice quality problems and corrects them successfully. Results show a 14.2% improvement of the log-likelihood function after compensation.
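
A plausible reading of the assessment step, sketched with scikit-learn (the feature choice, model size, and outlier threshold are assumptions; the abstract does not specify them):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def flag_voice_quality_outliers(utterance_feats, n_components=32, n_std=2.0):
    """Fit a GMM to acoustic features pooled over the whole database,
    score each utterance by its mean log-likelihood under the model,
    and flag utterances scoring well below the rest as voice-quality
    suspects. utterance_feats: list of (frames x dims) arrays."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag").fit(np.vstack(utterance_feats))
    scores = np.array([gmm.score(f) for f in utterance_feats])
    cutoff = scores.mean() - n_std * scores.std()
    return np.where(scores < cutoff)[0], scores
```

The reported 14.2% log-likelihood improvement corresponds to these per-utterance scores rising after the compensation filtering.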


SP-12.5  

Shape Invariant Time-Scale Modification of Speech Using a Harmonic Model
Darragh O'Brien, Alex Monaghan (Dublin City University)

A new and simple approach to shape invariant time-scale modification of speech is presented. The method, based upon a harmonic coding of each speech frame, operates entirely within the original sinusoidal model and makes no use of the "pitch-pulse onset times" required by conventional algorithms. Instead, phase coherence, and thus shape invariance, are ensured by exploiting the harmonic relation existing between the sine waves to bring them into phase at each adjusted frame boundary. Results suggest that this approach is an excellent candidate for use within a concatenative text-to-speech synthesiser, where scaling factors typically lie within a range well handled by this algorithm.
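
A bare-bones sketch of the phase-coherence idea (assuming 10 ms frames and NumPy; no amplitude interpolation or overlap, unlike a production synthesiser):

```python
import numpy as np

def harmonic_tsm(frames, fs, alpha):
    """Time-scale a harmonically coded signal by factor alpha.
    Each frame is (f0_hz, amplitudes) for harmonics 1..K. Because every
    harmonic's phase is k times the fundamental phase, keeping the
    fundamental phase continuous across the stretched frame boundaries
    keeps all the sine waves coherent -- no pitch-pulse onset times."""
    hop = int(0.010 * fs * alpha)           # stretched frame spacing (10 ms frames assumed)
    out = np.zeros(hop * len(frames))
    phi0 = 0.0                              # running fundamental phase
    for l, (f0, amps) in enumerate(frames):
        t = np.arange(hop) / fs
        seg = sum(a * np.cos(2 * np.pi * k * f0 * t + k * phi0)
                  for k, a in enumerate(amps, start=1))
        out[l * hop:(l + 1) * hop] = seg
        phi0 += 2 * np.pi * f0 * hop / fs   # advance phase to the next boundary
    return out
```

Setting alpha > 1 stretches and alpha < 1 compresses, with the harmonic phase relation preserved at every frame boundary by construction.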


SP-12.6  

Using a Sigmoid Transformation for Improved Modeling of Phoneme Duration
Kim E.A. Silverman, Jerome R Bellegarda (Apple Computer)

Over the past few years, the "sums-of-products" approach has emerged as one of the most promising avenues to model contextual influences on phoneme duration. The associated regression is generally applied after log-transforming the durations. This paper presents empirical and theoretical evidence which suggests that this transformation is not optimal. A promising alternative solution is proposed, based on a sigmoid function. Preliminary experimental results obtained on over 50,000 phonemes in varied prosodic contexts show that this transformation reduces the unexplained deviations in the data by more than 30%. Alternatively, for a given level of performance, it halves the number of parameters required by the model.
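
The abstract does not give the exact sigmoid, so the sketch below simply contrasts a linear fit on log-durations with one on logit-transformed durations squeezed into assumed bounds, comparing residual variance back in the duration domain; the data here are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 7))])       # fake context features
d = np.clip(120 + 30 * X[:, 1] + 15 * rng.normal(size=1000), 15, 395)  # fake durations (ms)

d_min, d_max = 10.0, 400.0                        # assumed phoneme-duration bounds (ms)

def duration_residuals(transform, inverse):
    """Linear fit (standing in for the sums-of-products model) on
    transformed durations; residuals mapped back to milliseconds."""
    w, *_ = np.linalg.lstsq(X, transform(d), rcond=None)
    return d - inverse(X @ w)

squeeze = lambda x: (x - d_min) / (d_max - d_min)               # into (0, 1)
logit = lambda x: np.log(squeeze(x) / (1.0 - squeeze(x)))
expit = lambda y: d_min + (d_max - d_min) / (1.0 + np.exp(-y))

log_var = np.var(duration_residuals(np.log, np.exp))
sig_var = np.var(duration_residuals(logit, expit))
print(log_var, sig_var)    # unexplained deviation under each transform
```

Unlike the log, the logit saturates at both ends, which is one way to read the paper's claim that a sigmoid better matches how contextual factors stretch durations between floor and ceiling values.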


SP-12.7  

Nonlinear dynamic modeling of the voiced excitation for improved speech synthesis
Karthik Narasimhan, Jose C. Principe (Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida), Donald G. Childers (Mind Machine Center, Department of Electrical and Computer Engineering, University of Florida)

This paper describes the implementation of a waveform-based global dynamic model with the goal of capturing vocal fold variability. The residue extracted from speech by inverse filtering is pre-processed to remove phoneme dependence and is used as the input time series to the dynamic model. After training, the dynamic model is seeded with a point from the trajectory of the time series and iterated to produce the synthetic excitation waveform. The output of the dynamic model is compared with the input time series; these comparisons confirm that the model has captured the variability in the residue. The output of the dynamic model is then used to synthesize speech with a pitch-synchronous speech synthesizer, and the result is observed to be close to natural speech.
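
One generic way to realise "seed and iterate" with an off-the-shelf nonlinear regressor standing in for the paper's dynamic model (the embedding dimension and network size below are arbitrary choices):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_dynamic_model(residue, dim=6):
    """Fit a nonlinear predictor x[t] = f(x[t-dim], ..., x[t-1]) to the
    glottal residue time series (a generic global dynamic model)."""
    windows = np.lib.stride_tricks.sliding_window_view(residue, dim + 1)
    model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000)
    model.fit(windows[:, :dim], windows[:, dim])
    return model

def free_run(model, seed, n_samples):
    """Seed with a point on the trajectory, then iterate the model on
    its own output to generate the synthetic excitation waveform."""
    state = list(seed)
    out = []
    for _ in range(n_samples):
        nxt = float(model.predict(np.asarray(state)[None, :])[0])
        out.append(nxt)
        state = state[1:] + [nxt]
    return np.array(out)
```

The free-run output then drives a pitch-synchronous synthesizer in place of a stylized excitation, which is where the captured variability pays off.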


SP-12.8  

Results on perceptual invariants to transformations on speech
Arnaud Robert (CIRC Group, EPFL, Switzerland)

This paper presents results of a study on perceptual invariants to transformations of the speech signal. A set of psychoacoustic tests was conducted to identify these invariants for the human hearing system (HS). The starting point is the decomposition of speech by an AM-FM analysis, rather than the use of more standard analysis methods. The main result of this work is the finding that our HS is robust to (that is, our perception is not altered by) instantaneous frequency (IF) changes within a certain range, even though these changes result in substantial waveform modifications. This motivated a further study of how standard analysis methods cope with perceptually invariant changes; the results show that, in fact, they are not robust to such changes. Finally, some applications of IF changes are proposed.
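
A common way to obtain AM and IF tracks is via the analytic signal; the sketch below (Hilbert-transform based, which may differ from the paper's own AM-FM analysis) decomposes a signal and rebuilds it from a perturbed IF track:

```python
import numpy as np
from scipy.signal import hilbert

def am_fm_decompose(x, fs):
    """Analytic-signal AM-FM decomposition: envelope (AM) and
    instantaneous frequency in Hz (FM, one sample shorter than x)."""
    z = hilbert(x)
    am = np.abs(z)
    fm = np.diff(np.unwrap(np.angle(z))) * fs / (2 * np.pi)
    return am, fm

def resynthesize(am, fm, fs):
    """Rebuild a signal from (possibly perturbed) AM and IF tracks,
    e.g. resynthesize(am, 1.02 * fm, fs) for a small IF change."""
    phase = 2 * np.pi * np.cumsum(fm) / fs
    return am[1:] * np.cos(phase)
```

Even a small IF perturbation changes the waveform substantially, which is exactly the mismatch between waveform-level distance and perceptual distance that the psychoacoustic tests probe.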

