SPEECH ANALYSIS

Chair: Paul Mermelstein, INRS-Telecom (CANADA)


Harmonics Tracking and Pitch Extraction Based on Instantaneous Frequency

Authors:

Toshihiko Abe, Tokyo Institute of Technology (JAPAN)
Takao Kobayashi, Tokyo Institute of Technology (JAPAN)
Satoshi Imai, Tokyo Institute of Technology (JAPAN)

Volume 1, Page 756

Abstract:

This paper proposes a technique for estimating the harmonic frequencies of a speech signal based on its instantaneous frequency (IF). The main problem is how to decompose the speech signal into its harmonic components. For this purpose, we use a set of band-pass filters whose center frequencies change with time so that each filter tracks the instantaneous frequency of its own output. As a result, the outputs of the band-pass filters become the harmonic components, and the instantaneous frequencies of the harmonics are accurately estimated. To evaluate the effectiveness of the approach, we apply it to pitch extraction, which is accomplished simply by selecting the correct fundamental frequency from among the harmonic frequencies. The most significant feature of the pitch extractor is that the extracted pitch contour is smooth and requires no post-processing such as nonlinear filtering or other smoothing.
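As a rough illustration of the quantity being tracked, the instantaneous frequency of a single band-limited component can be taken as the rate of change of the phase of its analytic signal. The sketch below is not the authors' adaptive filter bank. It assumes NumPy, and the helper names `analytic_signal` and `instantaneous_frequency` are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal obtained by zeroing the negative-frequency FFT bins."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep DC as it is
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist bin as it is
        gain[1:n // 2] = 2.0           # double the positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)

def instantaneous_frequency(x, fs):
    """IF in Hz, from the phase increment between successive samples."""
    z = analytic_signal(np.asarray(x, dtype=float))
    phase = np.unwrap(np.angle(z))
    return np.diff(phase) * fs / (2.0 * np.pi)

# Toy input (an assumption for illustration): one 200 Hz "harmonic"
# sampled at 8 kHz, with a whole number of cycles in the buffer.
fs = 8000.0
t = np.arange(800) / fs
harmonic = np.cos(2.0 * np.pi * 200.0 * t)
f_inst = instantaneous_frequency(harmonic, fs)
```

For this single stationary component the estimate is constant at 200 Hz. In the setting of the abstract, such an estimate would be fed back to steer the center frequency of the band-pass filter that isolates the component.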

Acrobat PDF file of whole paper:

ic950756.pdf



Decomposition of Speech Signals into Deterministic and Stochastic Components

Authors:

C. d'Alessandro, LIMSI-CNRS (FRANCE)
B. Yegnanarayana, Indian Institute of Technology (INDIA)
V. Darsinos, University of Patras (GREECE)

Volume 1, Page 760

Abstract:

This paper presents a new method for decomposing the speech signal into a deterministic and a stochastic component, based on iterative signal reconstruction. The method involves: (1) separation of speech into approximate excitation and filter components using Linear Predictive (LP) analysis; (2) identification of the frequency regions of the noise and deterministic components of the excitation using the cepstrum; (3) reconstruction of the two components of the excitation (the LP residual) using an iterative algorithm; and (4) combination of the reconstructed frames of data using an overlap-add procedure to obtain the deterministic and stochastic components of the excitation. These two components are then passed through the time-varying all-pole filter to obtain the corresponding components of the speech signal. The algorithm is able to decompose varying mixtures of stochastic and deterministic signals, such as the noise bursts produced at glottal closure and the deterministic glottal pulses. This new algorithm is a powerful tool for analyzing relevant features of the source component of speech signals.
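Step (1) above, the split into an all-pole filter and an approximate excitation, can be sketched with the textbook autocorrelation method of LP analysis followed by inverse filtering. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses NumPy, processes a single unwindowed frame, and the function names and the synthetic AR(2) test signal are hypothetical.

```python
import numpy as np

def lpc_autocorrelation(x, order):
    """Predictor coefficients a_k with x[n] approximated by sum_k a_k * x[n-k]."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation lags 0..order
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def lp_residual(x, a):
    """Approximate excitation: x filtered by A(z) = 1 - sum_k a_k z^-k."""
    x = np.asarray(x, dtype=float)
    inverse_filter = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(x, inverse_filter)[:len(x)]

# Toy signal (assumed for illustration, not speech): white noise driving
# the all-pole filter x[n] = 1.5 x[n-1] - 0.8 x[n-2] + e[n].
rng = np.random.default_rng(0)
excitation = rng.standard_normal(4000)
signal = np.zeros_like(excitation)
for n in range(len(signal)):
    signal[n] = excitation[n]
    if n >= 1:
        signal[n] += 1.5 * signal[n - 1]
    if n >= 2:
        signal[n] -= 0.8 * signal[n - 2]

coeffs = lpc_autocorrelation(signal, order=2)   # close to [1.5, -0.8]
residual = lp_residual(signal, coeffs)          # close to the driving noise
```

The residual is the signal on which steps (2) to (4) would operate. Passing each reconstructed excitation component back through the all-pole filter 1/A(z) gives the corresponding speech component.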

Acrobat PDF file of whole paper:

ic950760.pdf



Modeling and Processing Speech with Sums of AM-FM Formant Models

Authors:

Shan Lu, Purdue University (USA)
Peter C. Doerschuk, Purdue University (USA)

Volume 1, Page 764

Abstract:

We describe a new approach, based on statistical nonlinear filtering ideas, to decomposing signals that are modeled as a sum of jointly amplitude- and frequency-modulated cosines with slowly-varying center frequencies observed in noise. We demonstrate the ideas on a formant tracking problem for the sentence "Where were you while we were away."

Acrobat PDF file of whole paper:

ic950764.pdf



On the Statistical Properties of Line Spectrum Pairs

Authors:

J.S. Erkelens, Delft University of Technology (THE NETHERLANDS)
P.M.T. Broersen, Delft University of Technology (THE NETHERLANDS)

Volume 1, Page 768

Abstract:

In the literature, the quantization properties of several representations of the LPC model have been studied. Until recently, the best results have generally been obtained with the Line Spectrum Pair (LSP) frequencies. In scalar quantization schemes, the Immittance Spectrum Pairs (ISPs) perform slightly better still. The good quantization performance of LSP and ISP can be attributed to their theoretical statistical properties: in contrast to the other representations, they are uncorrelated when estimated from stationary autoregressive processes. For small variations in the coefficients of any representation, the Spectral Distortion can be expressed as a weighted squared distortion measure. The optimal weighting matrix is the inverse of the covariance matrix of the coefficients. For LSP and ISP this matrix is diagonal, and hence the best weighting factors are the inverses of the theoretical variances. The difference between LSP and ISP is due to their distributions in speech.
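The weighted squared distortion measure described above can be written out in a few lines. The sketch below assumes NumPy, omits any constant that relates the measure to Spectral Distortion in dB, and uses made-up coefficient and variance values. The function names are hypothetical.

```python
import numpy as np

def weighted_distortion(coeffs, quantized, covariance):
    """Quadratic form d^T W d with W the inverse covariance matrix."""
    diff = np.asarray(coeffs, dtype=float) - np.asarray(quantized, dtype=float)
    weights = np.linalg.inv(np.asarray(covariance, dtype=float))
    return float(diff @ weights @ diff)

def diagonal_weighted_distortion(coeffs, quantized, variances):
    """Same measure when the coefficients are uncorrelated (diagonal covariance)."""
    diff = np.asarray(coeffs, dtype=float) - np.asarray(quantized, dtype=float)
    return float(np.sum(diff ** 2 / np.asarray(variances, dtype=float)))

# Made-up numbers for illustration only.
lsp = np.array([0.31, 0.52])
lsp_quantized = np.array([0.30, 0.54])
variances = np.array([1e-4, 4e-4])

d_full = weighted_distortion(lsp, lsp_quantized, np.diag(variances))
d_diag = diagonal_weighted_distortion(lsp, lsp_quantized, variances)
```

With a diagonal covariance the two functions agree, and each coefficient error is simply scaled by the inverse of that coefficient's variance. That is why uncorrelated parameters such as LSP and ISP lend themselves to simple per-coefficient weighting.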

Acrobat PDF file of whole paper:

ic950768.pdf



Individual Variations in Glottal Characteristics of Female Speakers

Authors:

Helen M. Hanson, Harvard University (USA)

Volume 1, Page 772

Abstract:

We address the measurement of glottal characteristics of female speakers and how these characteristics contribute to voice quality or individuality. We have developed acoustic measurements of the voicing source that are made directly on the speech waveform or spectrum. These measures are based on theoretical models of speech production. Based on these measurements it is possible to make some inferences about the glottal configuration during phonation and the nature of the glottal pulse. Previous work has relied mainly on physiological methods such as inverse filtering of vocal tract airflow or observations of vocal fold vibration via endoscopy or fiberscopy. Our measures are non-invasive and can be easily extracted automatically. By comparing these acoustic measures to physiological and perceptual data, we show that they are valid.

Acrobat PDF file of whole paper:

ic950772.pdf



A Robust Method for Determining Instants of Major Excitations in Voiced Speech

Authors:

B. Yegnanarayana, Indian Institute of Technology (INDIA)
R.L.H.M. Smits, Institute for Perception Research (THE NETHERLANDS)

Volume 1, Page 776

Abstract:

In this paper we propose a method for determining the instants of significant excitation in speech signals. Here significant excitation refers primarily to the instants of glottal closure in voiced speech. The method computes the average slope of the unwrapped phase spectrum as a function of time. The instants where the phase slope function makes a positive zero-crossing correspond to the major excitations in the signal. For an analysis window size in the range of one to two pitch periods, these instants coincide with the instants of glottal closure in each pitch period. The method is robust because it depends only on the average phase slope value and, further, only on the positive zero-crossing instants of the average phase slope function.
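The behaviour of the average phase slope around an excitation instant can be illustrated on an idealized frame containing a single impulse. The sketch below assumes NumPy, measures time from the frame center, and estimates the average slope by a least-squares line fit to the unwrapped phase. The fit and the function name are illustrative choices, not necessarily the authors' procedure.

```python
import numpy as np

def average_phase_slope(frame):
    """Least-squares slope of the unwrapped phase spectrum of one frame,
    with the time origin placed at the frame center."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    centered = np.roll(frame, -(n // 2))        # center sample moves to index 0
    spectrum = np.fft.rfft(centered)
    omega = 2.0 * np.pi * np.arange(len(spectrum)) / n
    phase = np.unwrap(np.angle(spectrum))
    slope, _intercept = np.polyfit(omega, phase, 1)
    return slope

def impulse_frame(length, position):
    """Idealized excitation: a unit impulse at the given index."""
    frame = np.zeros(length)
    frame[position] = 1.0
    return frame

# The impulse sits 5 samples after, exactly at, and 5 samples before the
# frame center, as happens when the analysis window slides forward past it.
slopes = [average_phase_slope(impulse_frame(64, 32 + offset))
          for offset in (5, 0, -5)]
```

The slope goes from negative through zero to positive as the window center passes the impulse, so a positive zero-crossing of the slope, evaluated at every sample position, marks the excitation instant. Real speech frames contain more than an ideal impulse, which is why the abstract relies on the average slope rather than on the phase at any single frequency.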

Acrobat PDF file of whole paper:

ic950776.pdf



Interpolation of LPC Spectra via Pole Shifting

Authors:

Vladimir Goncharoff, University of Illinois at Chicago (USA)
Maureen Kaine-Krolak, University of Illinois at Chicago (USA)

Volume 1, Page 780

Abstract:

We present a new method for interpolating between LPC spectra via pole-shifting. This approach solves the problem of real-to-complex and complex-to-real pole transitions by converting to a domain where each pole has a complex conjugate. Desired pole shifts are calculated in the new domain after applying a perceptually-based pole pairing algorithm. Intermediate spectra corresponding to these pole transitions are then optimally approximated using the original number of poles. The resulting interpolated spectral sequence is characterized by approximately linear changes in formant frequencies and bandwidths, and is free of the artifacts that may occur with other LPC spectral parameter interpolation methods.
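For orientation, a deliberately naive version of pole-based interpolation is sketched below. It assumes NumPy, assumes that every pole of both filters belongs to a complex-conjugate pair, and pairs poles simply in order of increasing frequency. It therefore does not implement the domain conversion or the perceptually based pairing that the abstract describes, and it cannot handle the real-to-complex transitions that the paper addresses. All function names are hypothetical.

```python
import numpy as np

def lpc_from_poles(radii, angles):
    """Denominator polynomial A(z) built from upper-half-plane poles and their conjugates."""
    upper = np.asarray(radii, dtype=float) * np.exp(1j * np.asarray(angles, dtype=float))
    return np.real(np.poly(np.concatenate((upper, upper.conj()))))

def upper_poles(a):
    """Poles with positive imaginary part, sorted by angle (frequency)."""
    poles = np.roots(a)
    upper = poles[poles.imag > 0]
    return upper[np.argsort(np.angle(upper))]

def interpolate_lpc(a_start, a_end, t):
    """Shift each pole linearly in radius and angle, for t in [0, 1]."""
    p0, p1 = upper_poles(a_start), upper_poles(a_end)
    if len(p0) != len(p1) or 2 * len(p0) != len(a_start) - 1:
        raise ValueError("this sketch only handles complex-conjugate pole pairs")
    radii = (1.0 - t) * np.abs(p0) + t * np.abs(p1)
    angles = (1.0 - t) * np.angle(p0) + t * np.angle(p1)
    return lpc_from_poles(radii, angles)

# Two fourth-order all-pole models with made-up pole positions (radians).
a_start = lpc_from_poles([0.95, 0.95], [0.3, 1.2])
a_end = lpc_from_poles([0.90, 0.90], [0.5, 1.6])
a_mid = interpolate_lpc(a_start, a_end, 0.5)
mid_poles = upper_poles(a_mid)   # radii 0.925, angles 0.4 and 1.4
```

Because the pole angles move linearly, the resonance frequencies of the intermediate model also move linearly between the two endpoints, which is the kind of behaviour the abstract reports for formant frequencies.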

Acrobat PDF file of whole paper:

ic950780.pdf



Speech Formant Frequency and Bandwidth Tracking Using Multiband Energy Demodulation

Authors:

Alexandros Potamianos, Georgia Institute of Technology (USA)
Petros Maragos, Georgia Institute of Technology (USA)

Volume 1, Page 784

Abstract:

In this paper, the AM-FM modulation model and a multiband analysis/demodulation scheme are applied to speech formant frequency and bandwidth tracking. Filtering is performed by a bank of Gabor bandpass filters. Each band is demodulated to amplitude envelope and instantaneous frequency signals using the energy separation algorithm. Short-time formant frequency and bandwidth estimates are obtained from the instantaneous amplitude and frequency signals and their merits are presented. The estimates are used to determine the formant locations and bandwidths. Performance and computational issues (frequency domain implementation) are discussed. Overall, the multiband demodulation approach to formant tracking is easy to implement, provides accurate formant frequency and realistic bandwidth estimates, and performs well in the presence of nasalization.
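The demodulation step for one band can be sketched with the Teager-Kaiser energy operator and a discrete energy separation formula. The abstract does not say which discrete variant is used. The sketch assumes the DESA-2 form, which is valid for frequencies below a quarter of the sampling rate, and it assumes NumPy. The Gabor filter bank is omitted and the function names are hypothetical.

```python
import numpy as np

def teager_energy(x):
    """Teager-Kaiser operator: Psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, fs, eps=1e-12):
    """Amplitude envelope and instantaneous frequency (Hz) of one band-pass signal."""
    x = np.asarray(x, dtype=float)
    y = x[2:] - x[:-2]                        # symmetric difference x(n+1) - x(n-1)
    psi_x = teager_energy(x)[1:-1]            # trimmed to align with psi_y
    psi_y = teager_energy(y)
    psi_x = np.maximum(psi_x, eps)            # guard against division by zero
    psi_y = np.maximum(psi_y, eps)
    cos_2omega = np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0)
    omega = 0.5 * np.arccos(cos_2omega)       # radians per sample
    amplitude = 2.0 * psi_x / np.sqrt(psi_y)
    return amplitude, omega * fs / (2.0 * np.pi)

# Toy band signal (an assumption for illustration): a 500 Hz cosine of
# amplitude 0.7 sampled at 8 kHz.
fs = 8000.0
n = np.arange(400)
band = 0.7 * np.cos(2.0 * np.pi * 500.0 * n / fs + 0.4)
envelope, frequency = desa2(band, fs)
```

For this unmodulated cosine the estimates are constant at 0.7 and 500 Hz. Applied to the output of each band-pass filter, the same two signals would provide the raw material for the short-time formant frequency and bandwidth estimates mentioned in the abstract.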

Acrobat PDF file of whole paper:

ic950784.pdf



Nonlinear Prediction for Speech Coding Using Radial Basis Functions

Authors:

Fernando Diaz-de-Maria, Universidad de Cantabria (SPAIN)
Anibal R. Figueiras-Vidal, Universidad Politecnica de Madrid (SPAIN)

Volume 1, Page 788

Abstract:

Radial Basis Function (RBF) networks are an interesting option for nonlinear prediction of speech because they provide a regularized solution and can therefore guarantee the stability of the corresponding synthesis scheme; consequently, they are suitable for use in Code Excited Nonlinear Prediction (CENP) coders. This paper presents the approach, and simulation examples show its advantage in prediction performance. The main points to be resolved in arriving at practical implementations of CENP coders are then addressed.
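A minimal sketch of a regularized RBF predictor follows. It assumes NumPy, Gaussian basis functions with centers taken from the training vectors, and ridge (Tikhonov) regularization of the output weights. These are illustrative choices, not details taken from the paper. The toy input is a logistic-map series rather than speech, and all function names and parameter values are hypothetical.

```python
import numpy as np

def embed(x, order):
    """Rows of past samples [x[n-order], ..., x[n-1]] and the targets x[n]."""
    x = np.asarray(x, dtype=float)
    past = np.array([x[i:i + order] for i in range(len(x) - order)])
    return past, x[order:]

def rbf_features(past, centers, width):
    """Gaussian basis function outputs for every (input vector, center) pair."""
    sq_dist = ((past[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def fit_rbf_predictor(past, target, centers, width, ridge):
    """Output weights from ridge-regularized least squares."""
    phi = rbf_features(past, centers, width)
    gram = phi.T @ phi + ridge * np.eye(len(centers))
    return np.linalg.solve(gram, phi.T @ target)

# Toy nonlinear series (assumed for illustration): the logistic map.
series = np.empty(500)
series[0] = 0.3
for n in range(1, len(series)):
    series[n] = 3.9 * series[n - 1] * (1.0 - series[n - 1])

past, target = embed(series, order=2)
train, held_out = slice(0, 400), slice(400, None)
centers = past[train][::10]                  # every tenth training vector
weights = fit_rbf_predictor(past[train], target[train], centers,
                            width=0.2, ridge=1e-6)
prediction = rbf_features(past[held_out], centers, 0.2) @ weights
mse = np.mean((prediction - target[held_out]) ** 2)
```

The held-out prediction error is a small fraction of the signal variance, which a linear predictor cannot achieve on this series. The ridge term is one simple way to keep the weights bounded; how regularization is tied to the stability of the synthesis filter in a CENP coder is the subject of the paper itself.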

Acrobat PDF file of whole paper:

ic950788.pdf



Recognition of Unvoiced Stops from Their Time-Frequency Representation

Authors:

Maria Rangoussi, National Technical University of Athens (GREECE)
Anastasios Delopoulos, National Technical University of Athens (GREECE)

Volume 1, Page 792

Abstract:

Recognition of the unvoiced stop sounds /k/, /p/ and /t/ in a speech signal is an interesting problem, due to the irregular, aperiodic, nonstationary nature of the corresponding signals. Spotting them is much easier, however, thanks to the characteristic silence interval they include. This paper therefore proposes classification of these three phonemes based on patterns extracted from their time-frequency representation. This is possible because the different articulation points of /k/, /p/ and /t/ are reflected in distinct patterns of evolution of their spectral content with time. These patterns can be obtained by suitable time-frequency analysis and then used for classification. The Wigner distribution of the unvoiced stop signals, appropriately smoothed and subsampled, is proposed here as the basic classification pattern. For the classification step, Kohonen's Learning Vector Quantization (LVQ) classifier is employed on a set of unvoiced stop signals extracted from the TIMIT speech database, with encouraging results under context- and speaker-independent testing conditions.

Acrobat PDF file of whole paper:

ic950792.pdf