Chair: Paul Mermelstein, INRS-Telecom (FRANCE)
Toshihiko Abe, Tokyo Institute of Technology (JAPAN)
Takao Kobayashi, Tokyo Institute of Technology (JAPAN)
Satoshi Imai, Tokyo Institute of Technology (JAPAN)
This paper proposes a technique for estimating the harmonic frequencies of a speech signal based on its instantaneous frequency (IF). The main problem is how to decompose the speech signal into its harmonic components. For this purpose, we use a set of band-pass filters, each of whose center frequencies changes with time in such a way that it tracks the instantaneous frequency of its output. As a result, the outputs of the band-pass filters become the harmonic components, and the instantaneous frequencies of the harmonics are accurately estimated. To evaluate the effectiveness of the approach, we apply it to pitch extraction, which is accomplished simply by selecting the correct fundamental frequency out of the harmonic frequencies. The most significant feature of the pitch extractor is that the extracted pitch contour is smooth and requires no post-processing such as nonlinear filtering or smoothing.
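A minimal sketch of the adaptive tracking idea (not the authors' implementation): a Gaussian band-pass filter is repeatedly re-centered at the mean instantaneous frequency of its own output, computed from the analytic signal. For brevity the center frequency is held constant within each pass rather than varying with time, and all names and parameters are assumptions.

    import numpy as np
    from scipy.signal import hilbert

    def track_harmonic(x, fs, f0, bw=50.0, n_iter=5):
        """Re-center a Gaussian band-pass filter at the mean
        instantaneous frequency (IF) of its own output."""
        n = len(x)
        freqs = np.fft.rfftfreq(n, d=1.0 / fs)
        X = np.fft.rfft(x)
        fc = f0
        for _ in range(n_iter):
            # Gaussian band-pass centered at the current estimate fc.
            H = np.exp(-0.5 * ((freqs - fc) / bw) ** 2)
            y = np.fft.irfft(X * H, n)
            # IF from the unwrapped phase of the analytic signal.
            phase = np.unwrap(np.angle(hilbert(y)))
            inst_f = np.diff(phase) * fs / (2 * np.pi)
            fc = float(np.mean(inst_f))  # re-center on the output's IF
        return y, inst_f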
C. d'Alessandro, LIMSI-CNRS (FRANCE)
B. Yegnanarayana, Indian Institute of Technology (INDIA)
V. Darsinos, University of Patras (GREECE)
This paper presents a new method for decomposition of the speech signal into a deterministic and a stochastic component. The method is based on iterative signal reconstruction and involves: (1) separation of speech into approximate excitation and filter components using Linear Predictive (LP) analysis; (2) identification of the frequency regions of the noise and deterministic components of the excitation using the cepstrum; (3) reconstruction of the two excitation components of the residual using an iterative algorithm; and (4) combination of the reconstructed frames of data using an overlap-add procedure to obtain the deterministic and stochastic components of the excitation. These components are then passed through the time-varying all-pole filter to obtain the corresponding components of the speech signal. The algorithm is able to decompose varying mixtures of stochastic and deterministic signals, such as the noise bursts produced at glottal closure and the deterministic glottal pulses. This new algorithm is a powerful tool for analyzing relevant features of the source component of speech signals.
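A schematic single-frame sketch of the pipeline, in which the paper's iterative reconstruction is stood in for by a single cepstral liftering pass; the LP order and quefrency cutoff are illustrative, not taken from the paper.

    import numpy as np
    from scipy.signal import lfilter
    from scipy.linalg import solve_toeplitz

    def lp_residual(frame, order=12):
        # Autocorrelation method: solve the normal equations for A(z).
        r = np.correlate(frame, frame, 'full')[len(frame) - 1:]
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        A = np.concatenate(([1.0], -a))
        return lfilter(A, [1.0], frame)          # excitation estimate

    def split_residual(frame, order=12, q_cut=32):
        # Even frame length assumed for the FFT round trips below.
        e = lp_residual(frame, order)
        E = np.fft.rfft(e)
        # Low-quefrency liftering of log|E| approximates the spectral
        # regions dominated by the deterministic (harmonic) component.
        ceps = np.fft.irfft(np.log(np.abs(E) + 1e-12))
        ceps[q_cut:-q_cut] = 0.0
        det_mag = np.exp(np.fft.rfft(ceps).real)
        det = np.fft.irfft(det_mag * np.exp(1j * np.angle(E)), len(e))
        return det, e - det                      # deterministic, stochastic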
Shan Lu, Purdue University (USA)
Peter C. Doerschuk, Purdue University (USA)
We describe a new approach, based on statistical nonlinear filtering ideas, to decomposing signals modeled as a sum of jointly amplitude- and frequency-modulated cosines with slowly varying center frequencies observed in noise. We demonstrate the ideas on a formant tracking problem for the sentence "Where were you while we were away."
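A minimal extended-Kalman-filter sketch of the statistical nonlinear filtering idea for a single modulated cosine in noise; the paper's model covers a sum of such components, and the state layout and noise levels below are assumptions.

    import numpy as np

    def ekf_track(y, fs, f_init, q=(1e-4, 1.0, 0.0), r=0.01):
        """Track amplitude a, frequency f (Hz), and phase p of
        y[k] = a*cos(p) + noise with an extended Kalman filter."""
        dt = 1.0 / fs
        x = np.array([1.0, f_init, 0.0])       # state: [a, f, p]
        P = np.eye(3)
        Q = np.diag(q)
        est = np.zeros((len(y), 3))
        for k, yk in enumerate(y):
            # Predict: a and f follow random walks; p integrates f.
            x = np.array([x[0], x[1], x[2] + 2 * np.pi * x[1] * dt])
            F = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.0, 2 * np.pi * dt, 1.0]])
            P = F @ P @ F.T + Q
            # Update with the nonlinear observation h(x) = a*cos(p).
            h = x[0] * np.cos(x[2])
            H = np.array([np.cos(x[2]), 0.0, -x[0] * np.sin(x[2])])
            K = (P @ H) / (H @ P @ H + r)
            x = x + K * (yk - h)
            P = P - np.outer(K, H @ P)
            est[k] = x
        return est    # per-sample amplitude, frequency, phase estimates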
J.S. Erkelens, Delft University of Technology (THE NETHERLANDS)
P.M.T. Broersen, Delft University of Technology (THE NETHERLANDS)
In the literature, the quantization properties of several representations of the LPC model have been studied. Until recently, the best results were generally obtained with the LSP frequencies. In scalar quantization schemes, the Immittance Spectrum Pairs (ISPs) perform even slightly better. The good quantization performance of LSP and ISP can be attributed to their theoretical statistical properties: they are uncorrelated when estimated from stationary autoregressive processes, in contrast to the other representations. For small variations in the coefficients of any representation, the Spectral Distortion can be expressed as a weighted squared distortion measure. The optimal weighting matrix is the inverse of the covariance matrix of the coefficients. For LSP and ISP this matrix is diagonal, and hence the best weighting factors are the inverses of the theoretical variances. The difference between LSP and ISP is due to their distributions in speech.
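A worked statement of the weighting argument above, in notation of my choosing rather than the paper's:

    % For a small perturbation \Delta\theta of the coefficient vector,
    % the Spectral Distortion behaves as a quadratic form
    \[
      \mathrm{SD}^2 \approx \Delta\theta^{\top} W \,\Delta\theta ,
      \qquad W \propto C^{-1},
      \qquad C = E\!\left[\Delta\theta\,\Delta\theta^{\top}\right].
    \]
    % For LSP and ISP, C is (theoretically) diagonal, so the form separates:
    \[
      \mathrm{SD}^2 \approx \sum_i \frac{(\Delta\theta_i)^2}{\sigma_i^2},
    \]
    % i.e., the best scalar weights are the inverse variances 1/\sigma_i^2.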
Helen M. Hanson, Harvard University (USA)
We address the measurement of glottal characteristics of female speakers and how these characteristics contribute to voice quality or individuality. We have developed acoustic measurements of the voicing source that are made directly on the speech waveform or spectrum. These measures are based on theoretical models of speech production. Based on these measurements it is possible to make some inferences about the glottal configuration during phonation and the nature of the glottal pulse. Previous work has relied mainly on physiological methods such as inverse filtering of vocal tract airflow or observations of vocal fold vibration via endoscopy or fiberscopy. Our measures are non-invasive and can be easily extracted automatically. By comparing these acoustic measures to physiological and perceptual data, we show that they are valid.
B. Yegnanarayana, Indian Institute of Technology (INDIA)
R.L.H.M. Smits, Institute for Perception Research (THE NETHERLANDS)
In this paper we propose a method for determining the instants of significant excitation in speech signals. Here, significant excitation refers primarily to the instants of glottal closure in voiced speech. The method computes the average slope of the unwrapped phase spectrum as a function of time. The instants where the phase slope function makes a positive zero-crossing correspond to the major excitations in the signal. For an analysis window size in the range of one to two pitch periods, these instants coincide with the instants of glottal closure in each pitch period. The method is robust, as it depends only on the average phase slope value and, further, only on the positive zero-crossing instants of the average phase slope function.
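A minimal sketch of the phase-slope computation; the FFT-based group-delay identity, window length, and one-sample hop are choices of this sketch, not details taken from the paper.

    import numpy as np

    def phase_slope_gci(x, fs, win_ms=15.0):
        n = int(fs * win_ms / 1000)
        w = np.hanning(n)
        t = np.arange(n)
        slope = []
        for start in range(0, len(x) - n):
            frame = w * x[start:start + n]
            X = np.fft.rfft(frame)
            Y = np.fft.rfft(t * frame)          # FT of n*x[n]
            # Group delay tau(w) = Re{Y/X}; its average equals the
            # negative average slope of the unwrapped phase spectrum.
            tau = np.real(Y / (X + 1e-12))
            slope.append(np.mean(tau) - n / 2)  # window-center referenced
        slope = np.array(slope)
        # Positive-going zero crossings mark the significant excitations.
        return np.where((slope[:-1] < 0) & (slope[1:] >= 0))[0], slope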
Vladimir Goncharoff, University of Illinois at Chicago (USA)
Maureen Kaine-Krolak, University of Illinois at Chicago (USA)
We present a new method for interpolating between LPC spectra via pole-shifting. This approach solves the problem of real-to-complex and complex-to-real pole transitions by converting to a domain where each pole has a complex conjugate. Desired pole shifts are calculated in the new domain after applying a perceptually-based pole pairing algorithm. Intermediate spectra corresponding to these pole transitions are then optimally approximated using the original number of poles. The resulting interpolated spectral sequence is characterized by approximately linear changes in formant frequencies and bandwidths, and is free of the artifacts that may occur with other LPC spectral parameter interpolation methods.
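An illustrative sketch of pole interpolation in log-radius/angle form; the paper's perceptually-based pairing and optimal re-approximation steps are stubbed here by naive angle-sorted pairing.

    import numpy as np

    def interp_lpc_poles(a1, a2, alpha):
        """Interpolate between LPC polynomials a1, a2 (leading 1),
        alpha in [0, 1], by moving paired poles."""
        p1, p2 = np.roots(a1), np.roots(a2)
        # Stand-in pairing: sort both pole sets by angle.
        p1 = p1[np.argsort(np.angle(p1))]
        p2 = p2[np.argsort(np.angle(p2))]
        # Linear trajectories in log-radius and angle give roughly
        # linear formant frequency and bandwidth changes.
        r = np.exp((1 - alpha) * np.log(np.abs(p1))
                   + alpha * np.log(np.abs(p2)))
        th = (1 - alpha) * np.angle(p1) + alpha * np.angle(p2)
        return np.real_if_close(np.poly(r * np.exp(1j * th)), tol=1e6)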
Alexandros Potamianos, Georgia Institute of Technology (USA)
Petros Maragos, Georgia Institute of Technology (USA)
In this paper, the AM-FM modulation model and a multiband analysis/demodulation scheme are applied to speech formant frequency and bandwidth tracking. Filtering is performed by a bank of Gabor band-pass filters. Each band is demodulated into amplitude envelope and instantaneous frequency signals using the energy separation algorithm. Short-time formant frequency and bandwidth estimates are obtained from the instantaneous amplitude and frequency signals, and their merits are presented. The estimates are used to determine the formant locations and bandwidths. Performance and computational issues (frequency-domain implementation) are discussed. Overall, the multiband demodulation approach to formant tracking is easy to implement, provides accurate formant frequency and realistic bandwidth estimates, and performs well in the presence of nasalization.
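The demodulation stage rests on the Teager-Kaiser energy operator; below is a sketch of the standard DESA-1 energy separation algorithm, with per-band Gabor filtering and edge handling omitted.

    import numpy as np

    def teager(x):
        # Discrete energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1].
        return x[1:-1] ** 2 - x[:-2] * x[2:]

    def desa1(x, fs):
        """DESA-1: instantaneous frequency (Hz) and amplitude envelope."""
        psi_x = teager(x)
        psi_y = teager(np.diff(x))             # energy of the difference signal
        avg = 0.5 * (psi_y[:-1] + psi_y[1:])   # average around each sample
        cos_w = 1.0 - avg / (2.0 * psi_x[1:-1] + 1e-12)
        w = np.arccos(np.clip(cos_w, -1.0, 1.0))            # rad/sample
        amp = np.sqrt(np.abs(psi_x[1:-1])
                      / (np.abs(1.0 - cos_w ** 2) + 1e-12))
        return w * fs / (2 * np.pi), amp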
Fernando Diaz-de-Maria, Universidad de Cantabria (SPAIN)
Anibal R. Figueiras-Vidal, Universidad Politecnica de Madrid (SPAIN)
Radial Basis Function (RBF) networks are an attractive option for nonlinear prediction of speech because they provide a regularized solution and can therefore guarantee the stability of the corresponding synthesis scheme; consequently, they are well suited to Code-Excited Nonlinear Prediction (CENP) coders. In this paper this approach is presented, and simulation examples show its advantage in prediction performance. The main issues in arriving at practical implementations of CENP coders are then addressed.
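A minimal sketch of an RBF one-step predictor with a ridge-regularized linear output layer (regularization of this kind is what underlies the stability claim); the center count, kernel width, and ridge factor below are illustrative choices.

    import numpy as np

    def fit_rbf_predictor(x, order=10, n_centers=40, width=0.5, ridge=1e-3):
        # Build (past samples -> next sample) training pairs.
        X = np.array([x[i:i + order] for i in range(len(x) - order)])
        y = x[order:]
        centers = X[np.random.choice(len(X), n_centers, replace=False)]
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        Phi = np.exp(-d2 / (2 * width ** 2))
        # Regularized least squares for the output weights.
        w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(n_centers),
                            Phi.T @ y)
        return centers, w

    def rbf_predict(frame, centers, w, width=0.5):
        d2 = ((frame[None, :] - centers) ** 2).sum(-1)
        return np.exp(-d2 / (2 * width ** 2)) @ w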
Maria Rangoussi, National Technical University of Athens (GREECE)
Anastasios Delopoulos, National Technical University of Athens (GREECE)
Recognition of the unvoiced stop sounds /k/, /p/ and /t/ in a speech signal is an interesting problem, due to the irregular, aperiodic, nonstationary nature of the corresponding signals. Spotting them is much easier, however, thanks to the characteristic silence interval they include. Classification of these three phonemes is therefore proposed in the present paper, based on patterns extracted from their time-frequency representation. This is possible because the different articulation points of /k/, /p/ and /t/ are reflected in distinct patterns of evolution of their spectral content over time. These patterns can be obtained by suitable time-frequency analysis and then used for classification. The Wigner distribution of the unvoiced stop signals, appropriately smoothed and subsampled, is proposed here as the basic classification pattern. Finally, for the classification step, the Learning Vector Quantization (LVQ) classifier of Kohonen is employed on a set of unvoiced stop signals extracted from the TIMIT speech database, with encouraging results under context- and speaker-independent testing conditions.
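An illustrative computation of a smoothed, subsampled pseudo-Wigner pattern of the kind described above; the analytic signal, lag window, and subsampling factors are assumptions of this sketch.

    import numpy as np
    from scipy.signal import hilbert

    def pseudo_wigner(x, n_lag=64, t_step=8, f_bins=64):
        z = hilbert(x)                    # analytic signal reduces aliasing
        lags = np.arange(-n_lag // 2, n_lag // 2)
        w = np.hanning(len(lags))         # lag window = spectral smoothing
        tfr = []
        for t in range(n_lag // 2, len(z) - n_lag // 2, t_step):
            # Instantaneous autocorrelation on an integer lag grid.
            r = z[t + lags] * np.conj(z[t - lags]) * w
            tfr.append(np.abs(np.fft.fft(r, f_bins))[:f_bins // 2])
        return np.array(tfr)              # (time, frequency) pattern

    # Patterns like these, further subsampled, feed the LVQ classifier.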