3:30, SPEECH-P9.1
SIGNAL MODELING WITH NON-UNIFORM TOPOLOGY LATTICE FILTERS
S. KRSTULOVIC, F. BIMBOT
The article presents a new class of constrained and specialized
Auto-Regressive (AR) processes. They are derived from lattice filters in which
some reflection coefficients are forced to zero at a priori fixed locations.
Optimizing the filter topology makes it possible to build parametric spectral
models with more poles than parameters needed to describe their locations.
These NUT (Non-Uniform
Topology) models are assessed by evaluating the reduction of modeling error with respect to conventional AR models.
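As a minimal sketch of the idea (not the authors' implementation), the following Python fragment converts lattice reflection coefficients to an AR polynomial and zeroes all but a few coefficients at hypothetical a priori positions, yielding an order-8 model described by only 3 parameters:

    import numpy as np

    def reflection_to_ar(k):
        # Step-up (Levinson) recursion: reflection coefficients -> AR polynomial
        # a = [1, a1, ..., ap], so that A(z) = sum_i a[i] z^{-i}.
        a = np.array([1.0])
        for ki in k:
            ext = np.concatenate([a, [0.0]])
            a = ext + ki * ext[::-1]
        return a

    # Hypothetical NUT topology: an order-8 lattice with only 3 non-zero
    # reflection coefficients, at positions fixed a priori.
    k = np.zeros(8)
    k[[1, 4, 7]] = [0.5, -0.3, 0.2]
    a = reflection_to_ar(k)  # degree-8 polynomial (8 poles), 3 free parameters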
3:30, SPEECH-P9.2
EXPERIMENTS AND MODELING OF PERCEIVED SPEECH QUALITY OF LONG SAMPLES
X. TAN, S. WÄNSTEDT, G. HEIKKILÄ
Speech quality in cellular networks may vary significantly over time. Assessing the perceived speech quality aggregated over time, such as over an entire conversation, is desirable to ensure customer satisfaction. Calculating average quality with common objective methods, which normally determine quality for short speech samples, has drawbacks. Subjective listening tests with long speech segments show that the perceived quality differs from the average quality calculated from a series of objective measurements. The overall perceived quality is affected by the brain's "ability" to forget; hence, the last 30 to 40 s of speech form the basis for the subjective quality.
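A minimal sketch of how such a recency effect could enter an objective aggregate, assuming per-segment objective scores are available; the exponential-forgetting model and the 35 s constant are illustrative assumptions, not the paper's model:

    import numpy as np

    def perceived_quality(scores, t_end, tau=35.0):
        # Recency-weighted aggregate of per-segment objective quality scores.
        # scores: objective score per short segment; t_end: segment end times (s).
        # tau: forgetting constant (s, assumed); the paper reports that roughly
        # the last 30-40 s of speech dominate the subjective judgment.
        scores, t_end = np.asarray(scores), np.asarray(t_end)
        w = np.exp(-(t_end[-1] - t_end) / tau)  # exponential forgetting (assumed)
        return float(np.sum(w * scores) / np.sum(w))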
3:30, SPEECH-P9.3
SUPPRESSION OF PHASINESS FOR TIME-SCALE MODIFICATIONS OF SPEECH SIGNALS BASED ON A SHAPE INVARIANCE PROPERTY
J. DI MARTINO, Y. LAPRIE
Time-scale modifications of speech signals, based on frequency-domain techniques, are hampered by two important artifacts: "phasiness" and "transient smearing". They correspond to the destruction of the shape of the original signal, i.e. the de-synchronization between the phases of
frequency components. This paper describes an algorithm that preserves the shape invariance of speech signals in the context of a phase vocoder. Phases are corrected at the onset of each voiced region. Modified signals, even for large expansion factors, are of high quality and free from transient smearing or phasiness. A demonstration, with downloadable audio files, is available at http://www.loria.fr/~jdm/PhaseVocoder/index.html.
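For orientation, here is a sketch of standard phase-vocoder phase propagation with a re-lock of the synthesis phases to the analysis phases at voiced-region onsets; it illustrates the idea of onset phase correction rather than the paper's exact algorithm, and the onset_frames input is assumed to come from a separate voicing detector:

    import numpy as np

    def pv_phases(X, hop_a, hop_s, onset_frames, nfft):
        # X: complex STFT, shape (frames, bins); hop_a/hop_s: analysis/synthesis hops.
        omega = 2 * np.pi * np.arange(X.shape[1]) / nfft  # bin freqs (rad/sample)
        phi_a = np.angle(X)
        phi_s = np.empty_like(phi_a)
        phi_s[0] = phi_a[0]
        for m in range(1, X.shape[0]):
            if m in onset_frames:
                # Re-synchronize: reset synthesis phases at each voiced onset.
                phi_s[m] = phi_a[m]
            else:
                dphi = phi_a[m] - phi_a[m - 1] - omega * hop_a    # heterodyned increment
                dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))  # principal value
                phi_s[m] = phi_s[m - 1] + (omega + dphi / hop_a) * hop_s
        return phi_s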
3:30, SPEECH-P9.4
ROBUST SINGING DETECTION IN SPEECH/MUSIC DISCRIMINATOR DESIGN
W. CHOU, L. GU
In this paper, an approach for robust singing signal detection in speech/music discrimination is proposed and applied to audio indexing. Conventional approaches often perform poorly on singing segments, mainly because speech and singing signals are acoustically very close and traditional features used in speech recognition do not provide a reliable cue for discriminating between them. A new set of features derived from the harmonic coefficient and its 4 Hz modulation values is developed in this paper; these new features provide additional and reliable cues to separate speech from singing. In addition, a rule-based post-filtering scheme is described which leads to further improvements in speech/music discrimination. Source-independent audio indexing experiments on the PBS Skills database indicate that the proposed approach greatly reduces the classification error rate on singing segments in the audio stream.
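A rough sketch of such features, assuming the harmonic coefficient is taken as the peak of a frame's normalized autocorrelation over plausible pitch lags; the 16 kHz rate, lag range, frame rate and 3-5 Hz band are illustrative assumptions:

    import numpy as np

    def harmonic_coefficient(frame):
        # Peak of the normalized autocorrelation over pitch lags
        # (~50-500 Hz assuming a 16 kHz sampling rate).
        x = frame - frame.mean()
        r = np.correlate(x, x, mode='full')[len(x) - 1:]
        r = r / (r[0] + 1e-12)
        return r[32:320].max()

    def modulation_4hz(hc_track, frame_rate=100.0):
        # Fraction of the harmonic-coefficient track's energy near 4 Hz.
        H = np.abs(np.fft.rfft(hc_track - np.mean(hc_track))) ** 2
        f = np.fft.rfftfreq(len(hc_track), d=1.0 / frame_rate)
        band = (f >= 3.0) & (f <= 5.0)  # band around 4 Hz (assumed width)
        return H[band].sum() / (H.sum() + 1e-12)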
3:30, SPEECH-P9.5
ACOUSTIC-PHONETIC CHARACTERISTICS OF HYPERARTICULATED SPEECH FOR DIFFERENT SPEAKING STYLES
S. KOESTER
This study aims to describe differences between hyperarticulated and normal speech.
Hyperarticulated, or clear, speech is produced when addressing hearing-impaired listeners.
It also appears quite often in spoken language systems as the user's reaction to previous
recognition errors. In this paper we present a comparison of the acoustic-phonetic
characteristics of normal and hyperarticulated speech for three different types of
utterances: single words, single sentences and spontaneous speech.
Duration, fundamental frequency, formants and formant bandwidths change significantly.
Significant differences between the three speaking styles are observable, especially for
spontaneous speech vs. words and sentences. We report on an auditory test investigating
the perceived changes in the two speech types.
3:30, SPEECH-P9.6
A ROBUST TECHNIQUE FOR HARMONIC ANALYSIS OF SPEECH
N. ABU-SHIKHAH, M. DERICHE
A technique named Least Squares Harmonic (LSH) is proposed for speech
decomposition. The problem of harmonic estimation for speech is
formulated as a solution to two sets of linear equations derived from
minimising the Mean Squared Error between original and estimated
signals. The algorithm assumes that a good initial estimate of the
pitch period is available. The performance of the algorithm is
comparable to that of the Total Least Squares Prony (TLSP) method at
high signal-to-noise ratios; at very low SNR, however, the proposed
algorithm leads to a much more accurate harmonic representation. The
approach has great potential for coding and recognition applications.
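A minimal sketch of the least-squares step, assuming the given pitch estimate f0: stack cosine and sine regressors at the harmonics of f0 and solve the resulting linear system (the paper's exact formulation as two sets of equations may differ in detail):

    import numpy as np

    def lsh_fit(x, f0, fs, n_harm):
        # Least-squares fit of harmonic sinusoids to frame x given pitch f0 (Hz).
        n = np.arange(len(x))
        h = np.arange(1, n_harm + 1)
        phase = 2 * np.pi * f0 / fs * np.outer(n, h)
        A = np.hstack([np.cos(phase), np.sin(phase)])  # design matrix
        c, *_ = np.linalg.lstsq(A, x, rcond=None)      # minimizes ||x - A c||^2
        return c, A @ c  # cos/sin amplitudes and the harmonic reconstruction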
3:30, SPEECH-P9.7
MIXTURE GAUSSIAN ENVELOPE CHIRP MODEL FOR SPEECH AND AUDIO
B. MONDAL, T. SREENIVAS
We develop a parametric sinusoidal analysis/synthesis model
which can be applied to both speech and audio signals. These signals are characterised by large amplitude variations and small frequency variations within a short analysis frame. The model comprises a Gaussian mixture representation for the envelope and a sum
of linear chirps for the frequency components. A closed-form solution is derived for the frequency-domain parameters of a chirp with Gaussian-mixture envelope, based on the spectral moments. An iterative algorithm is developed to select and estimate prominent chirps based on the psycho-acoustic masking threshold. The model can adaptively select the number of time-domain and frequency-domain parameters to suit a particular type of signal. Experimental evaluation of the technique has shown that about 2 to 4 parameters/ms are sufficient for near-transparent quality reconstruction of a variety of wide-band music and speech signals.
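As a sketch of a single model component under these assumptions (parameter names are illustrative), a frame is represented as a sum of Gaussian-envelope linear chirps:

    import numpy as np

    def gaussian_chirp(n_samples, amp, mu, sigma, f0, alpha, phi, fs):
        # One component: Gaussian envelope (amp, mu, sigma) times a linear chirp
        # starting at f0 Hz with chirp rate alpha Hz/s and initial phase phi.
        t = np.arange(n_samples) / fs
        env = amp * np.exp(-0.5 * ((t - mu) / sigma) ** 2)
        return env * np.cos(2 * np.pi * (f0 * t + 0.5 * alpha * t ** 2) + phi)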
3:30, SPEECH-P9.8
GLOTTAL FLOW DERIVATIVE MODELING WITH THE WAVELET SMOOTHED EXCITATION
A. LOBO
This paper discusses a method for estimating glottal flow derivative model parameters using the wavelet-smoothed excitation. The excitation is first estimated using the Weighted Recursive Least Squares with Variable Forgetting Factor algorithm. The raw excitation is then smoothed by applying a Discrete Wavelet Transform (DWT) using biorthogonal quadrature filters, thresholding the DWT amplitude coefficients, and applying an inverse DWT. The pitch period and the instant of glottal closure (IGC) are estimated from the wavelet-smoothed excitation. A six-parameter glottal flow derivative model consisting of three amplitude and three timing parameters is aligned with the IGC and optimized by minimum-square-error fitting to the speech waveform. The optimization is done by the method of simulated annealing. The model is then used to re-estimate the vocal-tract filter parameters in an ARX procedure, followed by further stages of voice-source/vocal-tract estimation. Results of the analysis of speech utterances from the BK_TIMIT database will be presented.
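A minimal sketch of the smoothing stage using the PyWavelets package, assuming soft thresholding with a robust noise estimate (the paper's exact filters and threshold rule may differ):

    import numpy as np
    import pywt

    def smooth_excitation(e, wavelet='bior4.4', level=4, k=3.0):
        # DWT with a biorthogonal wavelet, threshold the detail coefficients,
        # then invert; the k*sigma rule and decomposition level are assumptions.
        coeffs = pywt.wavedec(e, wavelet, level=level)
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745  # robust noise estimate
        coeffs = [coeffs[0]] + [pywt.threshold(c, k * sigma, mode='soft')
                                for c in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)[:len(e)]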
3:30, SPEECH-P9.9
AVOIDING OVER-ESTIMATION IN BANDWIDTH EXTENSION OF TELEPHONY SPEECH
M. NILSSON, W. KLEIJN
In this paper we present a new way of treating the
problem of extending a narrow-band signal to a wide-band signal.
In many cases of bandwidth extension, the high-band energy is over-estimated, leading to undesirable audible artifacts. To overcome this problem we introduce an asymmetric cost function in the estimation of the high band that penalizes over-estimates of the high-band energy more than under-estimates.
We show that the resulting attenuation of the estimated high-band energy depends on the breadth
of the a posteriori distribution of the energy given the information extracted from the narrow band.
Thus, the uncertainty about how to extend the signal in the high band influences the level of extension.
Results from listening tests show that the proposed algorithm produces fewer artifacts.
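A toy illustration of the principle, with a hypothetical 4:1 penalty ratio: minimizing the expected asymmetric cost over posterior samples of the high-band energy pulls the estimate below the posterior mean, and the pull grows with the spread of the posterior, matching the behaviour described above:

    import numpy as np

    def asymmetric_cost(e_hat, e, c_over=4.0, c_under=1.0):
        # Penalize over-estimates (e_hat > e) more than under-estimates.
        d = e_hat - e
        return np.where(d > 0, c_over * d ** 2, c_under * d ** 2)

    def estimate(posterior_samples, grid):
        # Pick the grid value minimizing the expected asymmetric cost.
        costs = [asymmetric_cost(g, posterior_samples).mean() for g in grid]
        return grid[int(np.argmin(costs))]

    rng = np.random.default_rng(0)
    samples = rng.normal(0.0, 1.0, 10000)  # broad posterior -> larger downward shift
    grid = np.linspace(-2.0, 2.0, 401)
    print(estimate(samples, grid))          # lands below the posterior mean (0.0)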