Abstract: Session SP-12

SP-12.1
On the limits of speech recognition in noise
Stephen D Peters,
Peter Stubley (Nortel Technology),
Jean-Marc Valin (Sherbrooke University)
In this article, we consider the performance of speech recognition in
noise and focus on its sensitivity to the acoustic feature set. In
particular, we examine the reduction in perceived information imposed
on a speech signal by a feature extraction method commonly used
for automatic speech recognition. We observe that human
recognition rates on noisy digit strings drop considerably as the
speech signal undergoes the typical loss of phase and of
frequency resolution. Steps are taken to ensure that human
subjects are constrained in ways similar to those of an automatic
recognizer. The high correlation between the performance of the
human listeners and that of our connected digit recognizer leads us
to some interesting conclusions, including that typical cepstral
processing is insufficient to preserve speech information in noise.
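A minimal sketch of the kind of cepstral front end the abstract has in mind may make the two information losses concrete: phase is discarded when the magnitude spectrum is taken, and frequency resolution is lost when the spectrum is pooled by a mel filterbank. All parameter values and function names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_mel, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mel + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mel, n_fft // 2 + 1))
    for i in range(n_mel):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = np.linspace(0.0, 1.0, c - l, endpoint=False)
        fb[i, c:r] = np.linspace(1.0, 0.0, r - c, endpoint=False)
    return fb

def cepstral_features(signal, sr=8000, frame_len=256, hop=128,
                      n_mel=24, n_cep=13):
    """MFCC-like features; both losses discussed above happen here."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hanning(frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))   # phase discarded here
    fb = mel_filterbank(n_mel, frame_len, sr)
    mel_energy = mag @ fb.T                     # frequency resolution lost here
    return dct(np.log(mel_energy + 1e-10), type=2,
               axis=1, norm="ortho")[:, :n_cep]
```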

SP-12.2

Recognition of spectrally degraded speech in noise with nonlinear amplitude mapping
Qian-Jie Fu,
Robert V. Shannon (House Ear Institute)
The present study measured phoneme recognition as a function of signal-to-noise ratio under conditions of spectral smearing and nonlinear amplitude mapping. Speech sounds were divided into 16 analysis bands. The envelope was extracted from each band by half-wave rectification and low-pass filtering and was then distorted by a power-law transformation whose exponents varied from strongly compressive (p=0.3) to strongly expansive (p=3.0) values. This distorted envelope was used to modulate noise that was spectrally limited by the same analysis filters. Results showed that phoneme recognition scores in quiet were reduced only slightly with either expansive or compressive amplitude mapping. As the level of background noise was increased, performance deteriorated more rapidly for both the compressive and the linear mappings than for the expansive mapping. These results indicate that, although an expansive amplitude mapping may slightly reduce performance in quiet, it may be beneficial in noisy listening conditions.
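As an illustration of the processing chain described above, the following sketch distorts one band's envelope with a power-law mapping and uses it to modulate band-limited noise. The filter order and envelope cutoff are assumptions for illustration, not the study's exact settings.

```python
import numpy as np
from scipy.signal import butter, lfilter

def distorted_band(band_signal, band_noise, sr, exponent, env_cutoff=160.0):
    """Half-wave rectify, low-pass to get the envelope, apply a power-law
    mapping (exponent < 1 is compressive, > 1 expansive), then modulate
    noise already limited to the same analysis band."""
    rectified = np.maximum(band_signal, 0.0)     # half-wave rectification
    b, a = butter(2, env_cutoff / (sr / 2.0))    # envelope low-pass filter
    envelope = np.maximum(lfilter(b, a, rectified), 0.0)
    peak = envelope.max() + 1e-12
    distorted = peak * (envelope / peak) ** exponent   # power-law distortion
    return distorted * band_noise

# exponent=0.3 gives the strongly compressive case, exponent=3.0 the
# strongly expansive case; summing all 16 bands yields the test stimulus.
```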

SP-12.3

Phrase Splicing and Variable Substitution Using the IBM Trainable Speech Synthesis System
Robert E Donovan,
Martin Franz,
Jeffrey S Sorensen,
Salim Roukos (IBM TJ Watson Research Center)
This paper describes a phrase splicing and variable substitution
system which offers an intermediate form of automated speech
production lying in-between the extremes of recorded utterance
playback and full Text-to-Speech synthesis. The system incorporates a
trainable speech synthesiser and an application-specific set of
pre-recorded phrases. The text to be synthesised is converted to a
phone sequence using phone sequences present in the pre-recorded
phrases wherever possible, and a pronunciation dictionary elsewhere.
The synthesis inventory of the synthesiser is augmented with the
synthesis information associated with the pre-recorded phrases used to
construct the phone sequence. The synthesiser then performs a dynamic
programming search over the augmented inventory to select a segment
sequence to produce the output speech. The system enables the
seamless splicing of pre-recorded phrases both with other phrases and
with synthetic speech. It enables very high quality speech to be
produced automatically within a limited domain.
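The dynamic programming search the abstract mentions can be pictured with a generic unit-selection sketch like the one below, which minimizes a per-segment target cost plus a join cost between consecutive segments. The cost functions and inventory structure are placeholders; the paper's actual costs come from its trainable synthesiser.

```python
def select_segments(phones, inventory, target_cost, join_cost):
    """inventory maps each phone to a list of candidate segments; returns
    the cheapest segment sequence under target + join costs."""
    best = []  # best[i][j] = (cumulative cost, backpointer) for candidate j
    for i, ph in enumerate(phones):
        row = []
        for seg in inventory[ph]:
            tc = target_cost(ph, seg)
            if i == 0:
                row.append((tc, None))
            else:
                prev = inventory[phones[i - 1]]
                k = min(range(len(prev)),
                        key=lambda k: best[i - 1][k][0] + join_cost(prev[k], seg))
                row.append((tc + best[i - 1][k][0] + join_cost(prev[k], seg), k))
        best.append(row)
    # Trace back the cheapest path through the lattice
    j = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = [j]
    for i in range(len(phones) - 1, 0, -1):
        j = best[i][j][1]
        path.insert(0, j)
    return [inventory[ph][j] for ph, j in zip(phones, path)]
```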

SP-12.4

Assessment and correction of voice quality variabilities in large speech databases for concatenative speech synthesis
Yannis G Stylianou (AT&T Labs-Research)
In an effort to increase the naturalness of concatenative
speech synthesis, large speech databases may be recorded.
While it is desirable to have varied prosodic
and spectral characteristics in the database, it is not
desirable to have variable voice quality.
In this paper we present an automatic method for voice quality
assessment and correction, whenever necessary, of large speech
databases for concatenative speech synthesis. The proposed method
is based on the use of a Gaussian Mixture Model to model
the acoustic space of the speaker of the database and on
autoregressive filters for compensation. An objective method
to measure the effectiveness of the database correction, based on
a likelihood function for the speaker's GMM, is presented as well.
Both objective and subjective results show that the proposed method
succeeds in detecting and correcting voice quality problems.
Results show a 14.2% improvement of the log-likelihood function
after compensation.
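The detection step can be sketched as follows: fit a GMM to spectral features pooled over the database, then flag utterances whose average frame log-likelihood under the speaker's GMM falls well below the database mean. The feature choice, model size, and threshold are assumptions, and the autoregressive compensation step is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def flag_voice_quality_outliers(utterance_feats, n_components=32, n_std=2.0):
    """utterance_feats: list of (n_frames, n_dims) feature arrays, one per
    utterance. Returns indices of suspect utterances and all scores."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag").fit(np.vstack(utterance_feats))
    # Average frame log-likelihood of each utterance under the speaker GMM
    scores = np.array([gmm.score_samples(f).mean() for f in utterance_feats])
    threshold = scores.mean() - n_std * scores.std()
    return np.where(scores < threshold)[0], scores
```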

SP-12.5

Shape Invariant Time-Scale Modification of Speech Using a Harmonic Model
Darragh O'Brien,
Alex Monaghan (Dublin City University)
A new and simple approach to shape invariant time-scale
modification of speech is presented. The method, based
upon a harmonic coding of each speech frame, operates
entirely within the original sinusoidal model and
makes no use of "pitch-pulse onset times" used by
conventional algorithms. Instead, phase coherence, and
thus shape invariance, are ensured by exploiting the
harmonic relation between the sine waves to bring
them into phase at each adjusted frame boundary.
Results suggest this approach to be an excellent
candidate for use within a concatenative text-to-speech
synthesiser, where scaling factors typically lie within
a range well handled by this algorithm.
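The phase-coherence idea can be made concrete with a small sketch: because component k is the k-th harmonic of the fundamental, giving it initial phase k * phi0 at each frame boundary keeps every component in phase there, regardless of the time scaling applied. Frame handling and parameter names are illustrative; overlap-add and interpolation details are omitted.

```python
import numpy as np

def synthesize_frame(amps, f0, sr, frame_len, phi0):
    """Render one frame from harmonic amplitudes; harmonic k gets initial
    phase k * phi0, so all harmonics align at the frame boundary (t = 0)."""
    t = np.arange(frame_len) / sr
    frame = np.zeros(frame_len)
    for k, a in enumerate(amps, start=1):
        frame += a * np.cos(2.0 * np.pi * k * f0 * t + k * phi0)
    # Advance the fundamental's phase to the next (time-scaled) boundary;
    # the harmonic relation then keeps all components coherent there too.
    phi0_next = phi0 + 2.0 * np.pi * f0 * frame_len / sr
    return frame, phi0_next
```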

SP-12.6

Using a Sigmoid Transformation for Improved Modeling of Phoneme Duration
Kim E.A. Silverman,
Jerome R Bellegarda (Apple Computer)
Over the past few years, the "sums-of-products" approach has emerged
as one of the most promising avenues to model contextual influences on
phoneme duration. The associated regression is generally applied after
log-transforming the durations. This paper presents empirical and
theoretical evidence which suggests that this transformation is not
optimal. A promising alternative solution is proposed, based on a
sigmoid function. Preliminary experimental results obtained on over
50,000 phonemes in varied prosodic contexts show that this
transformation reduces the unexplained deviations in the data by more
than 30%. Alternatively, for a given level of performance, it halves
the number of parameters required by the model.
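One plausible form of the proposed transformation, shown purely for illustration (the paper's exact parameterization may differ): durations are mapped through a logistic curve before the sums-of-products regression, rather than through a log.

```python
import numpy as np

def sigmoid_transform(d, d0=80.0, s=25.0):
    """Map a raw duration d (ms) into (0, 1) via a logistic curve centred
    at d0 with slope 1/s; d0 and s are illustrative values."""
    return 1.0 / (1.0 + np.exp(-(d - d0) / s))

def inverse_sigmoid_transform(y, d0=80.0, s=25.0):
    """Invert the transform to recover a duration in ms."""
    return d0 - s * np.log(1.0 / y - 1.0)
```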

SP-12.7

Nonlinear dynamic modeling of the voiced excitation for improved speech synthesis
Karthik Narasimhan,
Jose C. Principe (Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida),
Donald G. Childers (Mind Machine Center, Department of Electrical and Computer Engineering, University of Florida)
This paper describes the implementation of a waveform-based global dynamic model with the goal of capturing vocal fold variability. The residue extracted from speech by inverse filtering is pre-processed to remove phoneme dependence and is used as the input time series to the dynamic model. After training, the dynamic model is seeded with a point from the trajectory of the time series and iterated to produce the synthetic excitation waveform. The output of the dynamic model is compared with the input time series; these comparisons confirmed that the model had captured the variability in the residue. The output of the dynamic model is used to synthesize speech with a pitch-synchronous speech synthesizer, and the result is observed to be close to natural speech.
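The seeding-and-iteration step reads like free-running prediction, sketched below: a trained one-step predictor is seeded with a short history from the residue trajectory and its outputs are fed back as inputs. The predictor itself (the paper's global nonlinear dynamic model) is left abstract here.

```python
import numpy as np

def free_run(predict, seed_window, n_samples):
    """predict maps a length-d history vector to the next sample; the model
    is seeded from the residue trajectory, then iterated on its own output."""
    d = len(seed_window)
    history = list(seed_window)
    out = []
    for _ in range(n_samples):
        x = predict(np.array(history[-d:]))   # one-step prediction
        out.append(x)
        history.append(x)                     # feed prediction back in
    return np.array(out)
```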

SP-12.8

Results on perceptual invariants to transformations on speech
Arnaud Robert (CIRC Group, EPFL, Switzerland)
This paper presents results of a study on perceptual invariants to
transformations on the speech signal. A set of psychoacoustic
tests was conducted to identify these invariants for the
human hearing system (HS). The starting point is the decomposition
of speech by an AM-FM analysis, rather than the use of more
standard analysis methods. The main result of this work is the
finding that our HS is robust to instantaneous frequency (IF)
changes within a certain range (that is, our perception is not
altered by them), even though such changes result in substantial
waveform modifications. This motivated further study of how
standard analysis methods would cope with perceptually invariant
changes; results show that, in fact, they are not robust to such
changes. Finally, some applications of IF changes are proposed.
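A common route to the AM and FM (instantaneous frequency) signals such a study starts from is the analytic signal obtained with the Hilbert transform; the sketch below takes that textbook route, which may differ from the paper's own AM-FM analysis.

```python
import numpy as np
from scipy.signal import hilbert

def am_fm(signal, sr):
    """Return the amplitude envelope (AM) and instantaneous frequency in Hz
    (FM) of a signal via its analytic representation."""
    analytic = hilbert(signal)
    envelope = np.abs(analytic)                       # AM component
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * sr / (2.0 * np.pi)   # FM component
    return envelope, inst_freq
```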