Authors:
Stephen Douglas Peters,
Peter Stubley,
Jean-Marc Valin,
Paper number 1026
Abstract:
In this article, we consider the performance of speech recognition
in noise and focus on its sensitivity to the acoustic feature set.
In particular, we examine the reduction in perceived information imposed
on a speech signal by a feature extraction method commonly used
for automatic speech recognition. We observe that the human recognition
rates on noisy digit strings drop considerably as the speech signal
undergoes the typical loss of phase and loss of frequency resolution.
Steps are taken to ensure that human subjects are constrained in ways
similar to those of an automatic recognizer. The high correlation between
the performance of the human listeners and that of our connected digit
recognizer leads us to some interesting conclusions, including that
typical cepstral processing is insufficient to preserve speech information
in noise.
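The information loss referred to above can be made concrete with a standard cepstral front end: taking the spectral magnitude discards phase, and pooling into a small number of bands reduces frequency resolution. The sketch below is a minimal MFCC-like pipeline, assuming a generic front end rather than the authors' exact recognizer; the crude equal-width filterbank merely stands in for a mel filterbank.

    import numpy as np

    def cepstral_features(frame, n_filters=20, n_ceps=12):
        """Minimal MFCC-like analysis of one windowed speech frame."""
        spectrum = np.fft.rfft(frame)
        magnitude = np.abs(spectrum)          # phase information is discarded here

        # Crude filterbank: average magnitude over n_filters equal-width bands,
        # standing in for a mel filterbank; frequency resolution is reduced here.
        bands = np.array_split(magnitude, n_filters)
        energies = np.array([band.mean() for band in bands])

        log_energies = np.log(energies + 1e-10)
        # DCT-II to decorrelate, keeping only the first n_ceps coefficients.
        n = np.arange(n_filters)
        dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
        return dct @ log_energies
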
Authors:
Qian-Jie Fu,
Robert V. Shannon,
Paper number 1191
Abstract:
The present study measured phoneme recognition as a function of signal-to-noise
ratio under conditions of spectral smearing and nonlinear amplitude
mapping. Speech sounds were divided into 16 analysis bands. The envelope
was extracted from each band by half-wave rectification and low-pass
filtering, and was then distorted by a power-law transformation whose
exponent varied from strongly compressive (p=0.3) to strongly
expansive (p=3.0). The distorted envelope was used to modulate
a noise carrier that was spectrally limited by the same analysis filters.
Results showed that phoneme recognition scores in quiet were reduced
only slightly with either expanded or compressed amplitude mapping.
As the level of background noise was increased, performance deteriorated
more rapidly for both compressive and linear mappings than for the expansive
mapping. These results indicate that, although an expansive amplitude
mapping may slightly reduce performance in quiet, it may be beneficial
in noisy listening conditions.
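As a rough illustration of the processing chain described above, the sketch below implements one analysis band of a noise-excited vocoder with a power-law envelope mapping. The Butterworth envelope smoother and the 160 Hz envelope cutoff are assumptions not specified in the abstract; the study's exact filters may differ.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def vocode_band(band_signal, sample_rate, p=0.3, env_cutoff=160.0):
        """Process one analysis band: extract, distort, and re-impose its envelope."""
        # Half-wave rectification followed by low-pass filtering gives the envelope.
        rectified = np.maximum(band_signal, 0.0)
        b, a = butter(2, env_cutoff / (sample_rate / 2), btype="low")
        envelope = np.clip(filtfilt(b, a, rectified), 0.0, None)

        # Power-law distortion: p < 1 compresses the envelope, p > 1 expands it.
        distorted = envelope ** p

        # Modulate a noise carrier; in the study the carrier was then band-limited
        # by the same analysis filter before the 16 bands were summed.
        noise = np.random.randn(len(band_signal))
        return distorted * noise
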
Authors:
Robert E Donovan,
Martin Franz,
Jeffrey S Sorensen,
Salim Roukos,
Paper number 1308
Abstract:
This paper describes a phrase splicing and variable substitution system
which offers an intermediate form of automated speech production lying
between the extremes of recorded-utterance playback and full Text-to-Speech
synthesis. The system incorporates a trainable speech synthesiser and
an application-specific set of pre-recorded phrases. The text to be
synthesised is converted to a phone sequence using phone sequences
present in the pre-recorded phrases wherever possible, and a pronunciation
dictionary elsewhere. The synthesis inventory of the synthesiser is
augmented with the synthesis information associated with the pre-recorded
phrases used to construct the phone sequence. The synthesiser then
performs a dynamic programming search over the augmented inventory
to select a segment sequence to produce the output speech. The system
enables the seamless splicing of pre-recorded phrases both with other
phrases and with synthetic speech. It enables very high quality speech
to be produced automatically within a limited domain.
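The abstract does not spell out how phone sequences from the pre-recorded phrases are combined with dictionary pronunciations; a greedy longest-match over the input words is one plausible reading, sketched below with hypothetical phrase_table and lexicon mappings.

    def text_to_phones(words, phrase_table, lexicon):
        """Greedy left-to-right match: prefer the longest pre-recorded phrase
        whose word sequence starts at the current position; otherwise fall
        back to dictionary pronunciations, word by word."""
        phones, i = [], 0
        while i < len(words):
            match = None
            for j in range(len(words), i, -1):       # longest match first
                candidate = tuple(words[i:j])
                if candidate in phrase_table:
                    match = candidate
                    break
            if match:
                phones.extend(phrase_table[match])   # phones from a recording
                i += len(match)
            else:
                phones.extend(lexicon[words[i]])     # pronunciation dictionary
                i += 1
        return phones
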
Authors:
Yannis G Stylianou,
Paper number 1335
Abstract:
In an effort to increase the naturalness of concatenative speech synthesis,
large speech databases may be recorded. While it is desirable to have
varied prosodic and spectral characteristics in the database, it is
not desirable to have variable voice quality. In this paper we present
an automatic method for assessing and, where necessary, correcting the
voice quality of large speech databases for concatenative speech synthesis.
The proposed method is based on the use of a Gaussian Mixture Model
to model the acoustic space of the speaker of the database and on autoregressive
filters for compensation. An objective method for measuring the effectiveness
of the database correction, based on a likelihood function for the speaker's
GMM, is also presented. Both objective and subjective results show
that the proposed method succeeds in detecting voice quality problems
and in correcting them. Results show a 14.2% improvement in
the log-likelihood after compensation.
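As a sketch of the detection half of the method, one can fit a GMM to per-frame acoustic features and flag frames of unusually low likelihood. The snippet below uses scikit-learn; the component count, diagonal covariances, and percentile threshold are illustrative assumptions, and the paper's autoregressive-filter compensation step is not shown.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def flag_suspect_frames(features, n_components=32, percentile=5.0):
        """Fit a GMM to the speaker's acoustic space (features: n_frames x n_dims)
        and flag frames whose likelihood falls in the lowest tail, which may
        indicate a change in voice quality."""
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(features)
        log_likelihood = gmm.score_samples(features)   # per-frame log-likelihood
        threshold = np.percentile(log_likelihood, percentile)
        return log_likelihood < threshold              # boolean mask of suspect frames
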
Authors:
Darragh O'Brien,
Alex Monaghan,
Paper number 1527
Abstract:
A new and simple approach to shape invariant time-scale modification
of speech is presented. The method, based upon a harmonic coding of
each speech frame, operates entirely within the original sinusoidal
model and makes no use of the "pitch-pulse onset times" employed by conventional
algorithms. Instead, phase coherence, and thus shape invariance, are
ensured by exploiting the harmonic relation existing between the sine
waves to cause them to be in phase at each adjusted frame boundary.
Results suggest that this approach is an excellent candidate for use
within a concatenative text-to-speech synthesiser, where scaling
factors typically lie within a range well handled by this algorithm.
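A minimal way to see the phase-coherence idea: if the phase of the k-th harmonic is always k times the fundamental's phase, all sine waves return to a common alignment at every frame boundary, however those boundaries are shifted by time-scaling. The sketch below synthesizes frames under that constraint; it is a simplification that omits windowing and parameter interpolation, not the authors' full algorithm.

    import numpy as np

    def synthesize_frames(f0_per_frame, amps_per_frame, frame_len, sample_rate):
        """Harmonic synthesis in which every harmonic's phase is an integer
        multiple of the fundamental's, so all sine waves are in phase at
        each frame boundary regardless of time-scaling."""
        out = []
        phi0 = 0.0                                  # running fundamental phase
        t = np.arange(frame_len) / sample_rate
        for f0, amps in zip(f0_per_frame, amps_per_frame):
            frame = np.zeros(frame_len)
            for k, a in enumerate(amps, start=1):   # k-th harmonic at k * f0
                frame += a * np.cos(2 * np.pi * k * f0 * t + k * phi0)
            out.append(frame)
            # Advance the fundamental phase across the (possibly time-scaled) frame.
            phi0 += 2 * np.pi * f0 * frame_len / sample_rate
        return np.concatenate(out)
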
Authors:
Kim E.A. Silverman,
Jerome R Bellegarda,
Paper number 1753
Abstract:
Over the past few years, the "sums-of-products" approach has emerged
as one of the most promising avenues to model contextual influences
on phoneme duration. The associated regression is generally applied
after log-transforming the durations. This paper presents empirical
and theoretical evidence which suggests that this transformation is
not optimal. A promising alternative solution is proposed, based on
a sigmoid function. Preliminary experimental results obtained on over
50,000 phonemes in varied prosodic contexts show that this transformation
reduces the unexplained deviations in the data by more than 30%. Alternatively,
for a given level of performance, it halves the number of parameters
required by the model.
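The abstract does not give the exact sigmoid parameterization. One plausible bounded form maps durations into (0, 1) between assumed limits and fits the regression in the logit domain, as sketched below; the d_min and d_max bounds are illustrative.

    import numpy as np

    def fit_durations(X, durations, transform="sigmoid", d_min=0.02, d_max=0.4):
        """Fit a linear (sums-of-products-style) model to transformed durations.
        X is the design matrix of contextual factors; durations are in seconds;
        d_min/d_max are assumed duration bounds."""
        if transform == "log":
            y = np.log(durations)
        else:
            # Map durations into (0, 1) and apply a logit, the inverse of a
            # sigmoid, so the model's predictions saturate at the bounds.
            u = (durations - d_min) / (d_max - d_min)
            u = np.clip(u, 1e-6, 1 - 1e-6)
            y = np.log(u / (1 - u))
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coeffs
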
Authors:
Karthik Narasimhan,
Jose C. Principe,
Donald G. Childers,
Paper number 2386
Abstract:
This paper describes the implementation of a waveform-based global
dynamic model with the goal of capturing vocal-fold variability. The
residue extracted from speech by inverse filtering is pre-processed
to remove phoneme-dependence and is used as the input time series to
the dynamic model. After training, the dynamic model is seeded with
a point from the trajectory of the time series, and iterated to produce
the synthetic excitation waveform. The output of the dynamic model
is compared with the input time series; these comparisons confirm
that the dynamic model has captured the variability in the residue.
The output of the dynamic model is then used to synthesize speech with
a pitch-synchronous speech synthesizer, and the result is observed
to be close to natural speech.
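The abstract leaves the form of the dynamic model unspecified. As a toy stand-in, the sketch below predicts the next residue sample from a delay-embedded state by nearest-neighbour lookup, then iterates the model on its own output from a seed point on the trajectory, mirroring the seed-and-iterate procedure described above.

    import numpy as np

    def iterate_dynamic_model(series, embed_dim=8, n_generate=2000):
        """Nearest-neighbour stand-in for a trained global dynamic model:
        learn (state, next-sample) pairs from the residue time series, then
        iterate from a seed state to generate a synthetic excitation."""
        states = np.lib.stride_tricks.sliding_window_view(series[:-1], embed_dim)
        targets = series[embed_dim:]

        state = series[:embed_dim].copy()    # seed: a point on the trajectory
        output = []
        for _ in range(n_generate):
            nearest = np.argmin(np.sum((states - state) ** 2, axis=1))
            nxt = targets[nearest]           # predicted next residue sample
            output.append(nxt)
            state = np.roll(state, -1)       # slide the embedded state forward
            state[-1] = nxt
        return np.array(output)
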
Authors:
Arnaud Robert, CIRC Group, EPFL (Switzerland),
Paper number 2463
Abstract:
This paper presents results of a study on perceptual invariants to
transformations of the speech signal. A set of psychoacoustic tests
was conducted to identify these invariants of the human hearing
system (HS). The starting point is the decomposition of speech by an
AM-FM analysis, rather than the use of more standard analysis methods.
The main result of this work is the finding that the HS is robust to
(that is, our perception is not altered by) instantaneous frequency
(IF) changes within a certain range, even though such changes produce
substantial waveform modifications. This motivated a further study
of how standard analysis methods cope with perceptually
invariant changes; the results show that, in fact, they are not robust
to such changes. Finally, some applications of IF changes are proposed.
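One common way to realize such an AM-FM decomposition, which may differ from the paper's exact analysis, is via the analytic signal. The sketch below extracts instantaneous amplitude and frequency with a Hilbert transform and resynthesizes the band with an optional IF perturbation of the kind the study tests.

    import numpy as np
    from scipy.signal import hilbert

    def am_fm_decompose(band_signal, sample_rate):
        """Decompose a band-limited signal into instantaneous amplitude (AM)
        and instantaneous frequency (FM) via the analytic signal."""
        analytic = hilbert(band_signal)
        amplitude = np.abs(analytic)                             # AM component
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.diff(phase) * sample_rate / (2 * np.pi)   # FM, in Hz
        return amplitude, inst_freq

    def resynthesize(amplitude, inst_freq, sample_rate, jitter=0.0):
        """Rebuild the signal, optionally perturbing the IF; small perturbations
        change the waveform substantially yet, per the study, may leave
        perception unaltered."""
        perturbed = inst_freq * (1.0 + jitter * np.random.randn(len(inst_freq)))
        phase = np.concatenate(
            [[0.0], np.cumsum(perturbed) * 2 * np.pi / sample_rate])
        return amplitude * np.cos(phase)
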