1:00, SPEECH-P2.1
COMPUTING MEL-FREQUENCY CEPSTRAL COEFFICIENTS ON THE POWER SPECTRUM
S. MOLAU, M. PITZ, R. SCHLÜTER, H. NEY
In this paper we present a method to derive Mel-frequency cepstral coefficients directly from the power spectrum of a speech signal. We show that omitting the filterbank in signal analysis does not affect the word error rate. The presented approach simplifies the speech recognizer's front end by merging subsequent signal analysis steps into a single one. It avoids possible interpolation and discretization problems and results in a compact implementation.
We show that frequency warping schemes like vocal tract normalization (VTN) can be integrated easily into our concept without additional computational effort. Recognition test results obtained with the RWTH large vocabulary speech recognition system are presented for two different corpora: the German VerbMobil II dev99 corpus and the English North American Business News 94 20k development corpus.
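As a rough illustration of the underlying idea (a sketch under our own assumptions, not the authors' exact derivation), the Python snippet below folds the mel warping directly into a cosine transform of the log power spectrum, so that no explicit triangular filterbank is needed; the function names, the trapezoidal integration, and the synthetic frame are illustrative choices.

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel scale; a VTN-style warping could be composed here at no extra cost."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def cepstra_from_power_spectrum(power_spec, sample_rate, n_ceps=12):
    """Cepstral coefficients computed directly from the power spectrum:
    the mel warping is folded into the cosine transform instead of
    going through an explicit triangular filterbank."""
    n_bins = len(power_spec)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    warped = hz_to_mel(freqs) / hz_to_mel(sample_rate / 2.0)  # warped axis in [0, 1]
    log_spec = np.log(power_spec + 1e-12)
    return np.array([
        np.trapz(log_spec * np.cos(np.pi * k * warped), warped)
        for k in range(1, n_ceps + 1)
    ])

# Example: one 25 ms frame of a synthetic signal
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = np.sin(2 * np.pi * 200 * t) + 0.3 * np.random.randn(len(t))
power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
print(cepstra_from_power_spectrum(power, sr))
```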
1:00, SPEECH-P2.2
PLP COEFFICIENTS CAN BE QUANTIZED AT 400 BPS
W. GUNAWAN, M. HASEGAWA-JOHNSON
Previous work in wireless speech recognition has focused on two methods: quantizing recognition features (e.g. MFCC) or performing recognition using speech coding parameters (e.g. LPC). All of this previous research assumes that the communication channel has only enough capacity to transmit either speech coding parameters or speech recognition parameters. By contrast, we propose that the speech recognition parameters can be quantized at a rate sufficiently low to allow transmission of both speech coding and speech recognition parameters over a standard cellular channel. In particular, this paper shows that perceptual linear prediction (PLP) coefficients can be transmitted at 400 bps with an insignificant loss of digit recognition accuracy.
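To make the bit-rate claim concrete, a quick back-of-the-envelope check shows how tight the budget is; the 10 ms frame shift assumed below is our own assumption, not stated in the abstract.

```python
# Assumed analysis rate of 100 frames/s (10 ms shift); not stated in the abstract.
bit_rate_bps = 400
frame_rate_hz = 100
bits_per_frame = bit_rate_bps / frame_rate_hz
print(bits_per_frame)  # 4.0 bits per PLP vector, implying aggressive vector
                       # quantization or a reduced feature frame rate
```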
1:00, SPEECH-P2.3
ROBUST CLASSIFICATION OF STOP CONSONANTS USING AUDITORY-BASED SPEECH PROCESSING
A. ALI, J. VAN DER SPIEGEL, P. MUELLER
In this work, a feature-based system for the automatic classification of stop consonants in speaker-independent continuous speech is reported. The system uses a new auditory-based speech processing front-end based on the biologically rooted property of average localized synchrony detection (ALSD). It incorporates new algorithms for the extraction and manipulation of acoustic-phonetic features that proved, statistically, to be rich in information content. The experiments are performed on stop consonants extracted from the TIMIT database with additive white Gaussian noise at various signal-to-noise ratios. The obtained classification accuracy compares favorably with previous work. The results also show a consistent improvement of 3% in place detection over the Generalized Synchrony Detector (GSD) system under identical circumstances on clean and noisy speech. This illustrates the superior ability of the ALSD to suppress spurious peaks and produce a consistent and robust formant (peak) representation.
1:00, SPEECH-P2.4
ROBUST FEATURE EXTRACTION USING SUBBAND SPECTRAL CENTROID HISTOGRAMS
B. GAJIC, K. PALIWAL
In this paper we propose a new framework for utilizing frequency information from the short-term power spectrum of speech. Feature extraction is based on cepstral coefficients derived from histograms of subband spectral centroids (SSC). Two new feature extraction algorithms are proposed: one based on frequency information alone, and one that efficiently combines the frequency and amplitude information from the speech power spectrum. An experimental study on an automatic speech recognition task shows that the proposed methods outperform conventional speech front-ends in the presence of additive white noise, while performing comparably under noise-free conditions.
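A minimal sketch of how per-frame subband spectral centroids can be computed and pooled into a frequency histogram whose DCT yields cepstrum-like features; this is one plausible reading of the approach, not necessarily the authors' exact procedure (the rectangular subbands, utterance-level histogram pooling, and parameter values are our assumptions).

```python
import numpy as np
from scipy.fftpack import dct

def subband_spectral_centroids(power_spec, sample_rate, n_subbands=16):
    """Centroid frequency of each (rectangular) subband of one frame's power spectrum."""
    n_bins = len(power_spec)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    edges = np.linspace(0, n_bins, n_subbands + 1, dtype=int)
    centroids = np.empty(n_subbands)
    for m in range(n_subbands):
        lo, hi = edges[m], edges[m + 1]
        p = power_spec[lo:hi]
        centroids[m] = np.sum(freqs[lo:hi] * p) / (np.sum(p) + 1e-12)
    return centroids

def ssc_histogram_features(frames_power, sample_rate,
                           n_subbands=16, n_hist_bins=32, n_ceps=12):
    """Pool per-frame centroids into a histogram over frequency and take
    a DCT of the log histogram as cepstrum-like features."""
    ssc = np.vstack([subband_spectral_centroids(p, sample_rate, n_subbands)
                     for p in frames_power])
    hist, _ = np.histogram(ssc.ravel(), bins=n_hist_bins,
                           range=(0.0, sample_rate / 2.0))
    return dct(np.log(hist + 1.0), norm='ortho')[:n_ceps]
```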
1:00, SPEECH-P2.5
EXTRACTION OF PITCH INFORMATION IN NOISY SPEECH USING WAVELET TRANSFORM WITH ALIASING COMPENSATION
S. CHEN, J. WANG
Although many wavelet-based pitch detection methods have been proposed in the literature, there remains a need for new wavelet-based methods offering more accurate and more robust pitch determination. In this paper, an improved wavelet-based method is developed for the extraction of pitch information from noisy speech. At each decomposition level of the wavelet transform, an aliasing compensation algorithm is applied to the approximation and detail signals, eliminating the aliasing distortion introduced by the downsampling and upsampling operations of the wavelet transform. In addition, the paper utilizes the spatial correlation function known from signal denoising to improve pitch detection performance in noisy environments. Experimental results show that the new method achieves a considerable performance improvement over conventional methods and other wavelet-based methods.
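For orientation, a baseline wavelet pitch estimator (without the paper's aliasing compensation or spatial correlation steps) might look like the sketch below; PyWavelets (pywt) and all parameter choices are our assumptions.

```python
import numpy as np
import pywt  # PyWavelets, used here only for the baseline decomposition

def wavelet_pitch_estimate(frame, sample_rate, wavelet='db4', level=3,
                           f_min=60.0, f_max=400.0):
    """Baseline estimate: take the low-band approximation after `level`
    DWT stages and pick the strongest autocorrelation lag. The aliasing
    compensation proposed in the paper is NOT reproduced here."""
    approx = np.asarray(frame, dtype=float)
    for _ in range(level):
        approx, _ = pywt.dwt(approx, wavelet)   # keep approximation, drop detail
    approx -= approx.mean()
    ac = np.correlate(approx, approx, mode='full')[len(approx) - 1:]
    fs_low = sample_rate / 2 ** level           # effective rate after downsampling
    lag_min = max(int(fs_low / f_max), 1)
    lag_max = min(int(fs_low / f_min), len(ac) - 1)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs_low / lag
```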
1:00, SPEECH-P2.6
A NOVEL SYLLABLE DURATION MODELING APPROACH FOR MANDARIN SPEECH
W. LAI, S. CHEN
In this paper, a novel syllable duration modeling approach for Mandarin speech is proposed. It explicitly takes several main affecting factors as multiplicative companding parameters and estimates all model parameters by an EM algorithm. Experimental results showed that the variance of the observed syllable duration was greatly reduced from 183.4 frames² (1 frame = 5 ms) to 18.5 frames² by eliminating the effects of these affecting factors. In addition, the estimated companding values of these factors agreed well with our prior linguistic knowledge. A preliminary study applying the proposed model to syllable duration prediction for text-to-speech (TTS) synthesis is also presented. Experimental results showed that it outperformed the conventional regression-based prediction method. Lastly, an extension of the approach that incorporates initial and final duration modeling is presented. This leads to a better understanding of the relation between the companding factors of the initial and final duration models and those of the syllable duration model.
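As a toy illustration of the multiplicative model structure (duration ≈ base × product of per-factor companding multipliers), the sketch below fits the multipliers by least squares in the log domain; the paper's actual EM estimation and its specific affecting factors are not reproduced, and all names here are hypothetical.

```python
import numpy as np

def fit_multiplicative_duration_model(durations, factor_levels):
    """durations: array of observed syllable durations (frames).
    factor_levels: list of integer arrays, one per affecting factor,
    giving each syllable's level of that factor (e.g. tone, position).
    Returns the base duration and one multiplier per factor level."""
    logs = np.log(np.asarray(durations, dtype=float))
    n = len(logs)
    cols = [np.ones((n, 1))]
    for ids in factor_levels:
        onehot = np.zeros((n, ids.max() + 1))
        onehot[np.arange(n), ids] = 1.0
        cols.append(onehot)
    X = np.hstack(cols)
    coef, *_ = np.linalg.lstsq(X, logs, rcond=None)  # log-domain linear fit
    return np.exp(coef[0]), np.exp(coef[1:])         # base duration, multipliers
```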
1:00, SPEECH-P2.7
ALL-POLE MODELLING OF MIXED EXCITATION SIGNALS
P. KABAL, B. KLEIJN
Conventional Linear Prediction (LP) techniques can fail to adequately model speech spectra when the model order is too low and/or when the input is periodic (voiced speech). In this paper, we view the LP modelling problem as a correlation matching problem. We introduce a correlation matching criterion which models the signal as a filtered mixture of a noise-like excitation and a periodic excitation. As such it is an extension of the Discrete All-Pole (DAP) modelling approach. The new technique provides a means to generate LP spectra that evolve more smoothly from frame to frame even when the excitation signal has a periodic component with changing period.
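For reference, the conventional autocorrelation-method LP baseline that the paper extends can be written as below (a sketch; the paper's mixed-excitation correlation-matching criterion itself is not reproduced).

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_coefficients(frame, order=10):
    """Conventional autocorrelation-method LP: match the first `order`
    autocorrelation lags with an all-pole model (Yule-Walker equations)."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])    # predictor coefficients
    err = r[0] - np.dot(a, r[1:order + 1])           # residual energy
    return np.concatenate(([1.0], -a)), err          # A(z) = 1 - sum_k a_k z^-k
```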
1:00, SPEECH-P2.8
THE STATISTICAL STRUCTURES OF MALE AND FEMALE SPEECH SIGNALS
T. LEE, G. JANG
The goal of this paper is to learn or adapt statistical features of gender-specific speech signals. The adaptation is performed by finding basis functions that encode the speech signal such that the resulting coefficients are statistically independent and the information redundancy is minimized. We use a flexible independent component analysis (ICA) algorithm to adapt the basis functions as well as the source coefficients for male and female speakers, respectively. The learned features show significant differences in frequency and time span. Our results suggest that the male speech features can be described by Gabor-like wavelet filters, whereas the female speech signal has a much longer time span. We present a detailed time-frequency analysis strongly suggesting that these features can be used to qualify and quantify gender-specific speech signal differences.
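A minimal sketch of learning such basis functions from raw speech frames; scikit-learn's FastICA is used here as a stand-in for the flexible ICA algorithm of the paper, and the frame length and component count are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA  # stand-in for the paper's flexible ICA

def learn_speech_basis(signal, frame_len=128, hop=64, n_components=64):
    """Cut the signal into short frames and learn ICA basis functions;
    the mixing matrix columns play the role of basis functions and the
    transformed coefficients are (approximately) statistically independent."""
    frames = np.array([signal[i:i + frame_len]
                       for i in range(0, len(signal) - frame_len, hop)], dtype=float)
    frames -= frames.mean(axis=1, keepdims=True)
    ica = FastICA(n_components=n_components, max_iter=500, random_state=0)
    coeffs = ica.fit_transform(frames)   # independent source coefficients per frame
    basis = ica.mixing_                  # shape (frame_len, n_components)
    return basis, coeffs
```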
1:00, SPEECH-P2.9
POLE ZERO ESTIMATION FROM SPEECH SIGNALS BY AN ITERATIVE PROCEDURE
K. SCHNELL, A. LACROIX
An iterative procedure for estimating the poles and zeros of a rational transfer function from speech signals is discussed, which takes advantage of the individual solutions of AR and MA processes. Besides speech, test signals are also analysed, for which optimal results are obtained. In contrast to Prony's and related methods, the algorithm does not presuppose a pair of input and output signals. The proposed procedure is specialised for the analysis of periodic signals, though it can be applied to non-periodic signals, too. The algorithm combines two known partial solutions in an iterative way: the all-pole model is estimated by the Burg method, and the zeros are estimated by using the inverted signal in the spectral domain. It is shown that the power spectrum of the analysed speech periods is modelled better by poles and zeros, especially with respect to the gaps in the spectrum.
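The Burg step mentioned in the abstract is standard; a compact sketch is given below (the zero estimation via the spectrally inverted signal and the iteration between the two partial solutions are not reproduced).

```python
import numpy as np

def burg_allpole(x, order):
    """Burg's method for the all-pole (AR) part of the model.
    Returns A(z) coefficients [1, a1, ..., ap] and the prediction error power."""
    x = np.asarray(x, dtype=float)
    ef = x.copy()                  # forward prediction errors
    eb = x.copy()                  # backward prediction errors
    a = np.array([1.0])
    err = np.dot(x, x) / len(x)
    for _ in range(order):
        num = -2.0 * np.dot(ef[1:], eb[:-1])
        den = np.dot(ef[1:], ef[1:]) + np.dot(eb[:-1], eb[:-1])
        k = num / den                                   # reflection coefficient
        ef, eb = ef[1:] + k * eb[:-1], eb[:-1] + k * ef[1:]
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
        err *= 1.0 - k * k
    return a, err
```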
1:00, SPEECH-P2.10
ON THE EFFECT OF STRESS ON CERTAIN MODULATION PARAMETERS OF SPEECH
K. GOPALAN
This paper reports results on the correlation between the demodulated amplitude and frequency variations of the AM-FM speech model and the heart rate of a fighter aircraft flight controller. It has been found that the peak frequencies in the spectrum of the amplitude envelope of the model follow F0 regardless of the center frequency of analysis. This tracking of F0 gives a qualitative estimate of stress level (as measured by the heart rate) relative to a neutral state of stress, with no a priori knowledge of F0 or formants. Additionally, the mean values of F0 estimates at low and high heart rates increased significantly relative to those at the neutral state. Formant tracking showed an increase in F3 at both high and low heart rates, while F4 generally varied directly with heart rate.
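A hedged sketch of the band-pass demodulation step described above, using Hilbert-transform demodulation (the original work may rely on an energy-separation algorithm instead); the filter design and bandwidth are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def am_fm_demodulate(signal, sample_rate, center_hz, bandwidth_hz=400.0):
    """Band-pass the signal around one analysis center frequency and
    split it into an amplitude envelope (AM) and an instantaneous
    frequency track (FM)."""
    nyq = sample_rate / 2.0
    lo = max(center_hz - bandwidth_hz / 2.0, 1.0) / nyq
    hi = min(center_hz + bandwidth_hz / 2.0, nyq - 1.0) / nyq
    b, a = butter(4, [lo, hi], btype='band')
    band = filtfilt(b, a, signal)
    analytic = hilbert(band)
    envelope = np.abs(analytic)                                # AM component
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * sample_rate / (2.0 * np.pi)   # FM component (Hz)
    return envelope, inst_freq
```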