Abstract: Session SP-13

SP-13.1
Investigations on Inter-Speaker Variability in the Feature Space
Reinhold Haeb-Umbach (Philips Research Laboratories)
We apply Fisher variate analysis to measure the effectiveness of speaker
normalization techniques. A trace criterion, which measures the ratio of the
variations due to different phonemes compared to variations due to different
speakers, serves as a first assessment of a feature set without the
need for recognition experiments. Using this measure and recognition
experiments, we demonstrate that cepstral mean normalization has a speaker
normalization effect in addition to its well-known channel normalization
effect. Similarly, vocal tract normalization (VTN) is shown to remove
inter-speaker variability. For VTN we show that normalization on a per-sentence
basis performs better than normalization on a per-speaker basis. Recognition
results are given on the Wall Street Journal and Hub-4 databases.
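
As an illustration of the two ingredients named above, the sketch below implements per-utterance cepstral mean normalization and one plausible reading of the trace criterion (between-phoneme scatter measured against between-speaker scatter). The label arrays, feature dimensions and the use of a pseudo-inverse are assumptions for illustration, not the authors' exact formulation.

    import numpy as np

    def cepstral_mean_normalization(cepstra):
        """Subtract the per-utterance mean from every cepstral dimension."""
        return cepstra - cepstra.mean(axis=0, keepdims=True)

    def trace_criterion(features, phoneme_labels, speaker_labels):
        """Ratio of phoneme-induced scatter to speaker-induced scatter.

        A larger value suggests the features separate phonemes well while
        staying comparatively insensitive to the speaker identity.
        """
        features = np.asarray(features, dtype=float)
        mu = features.mean(axis=0)
        dim = features.shape[1]

        def between_class_scatter(labels):
            labels = np.asarray(labels)
            S = np.zeros((dim, dim))
            for lab in np.unique(labels):
                x = features[labels == lab]
                d = (x.mean(axis=0) - mu)[:, None]
                S += len(x) * (d @ d.T)
            return S / len(features)

        S_phoneme = between_class_scatter(phoneme_labels)
        S_speaker = between_class_scatter(speaker_labels)
        # Pseudo-inverse guards against a singular speaker-scatter matrix.
        return float(np.trace(np.linalg.pinv(S_speaker) @ S_phoneme))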

SP-13.2
LSP Weighting Functions Based on Spectral Sensitivity and Mel-Frequency Warping for Speech Recognition in Digital Communication
Seung Ho Choi (Dept. of Electrical Eng., Korea Advanced Institute of Science and Technology, 373-1 Kusong-Dong, Yusong-Ku, Taejon 305-701, Korea),
Hong Kook Kim (AT&T Labs Research, Rm. E148, 180 Park Avenue, Florham Park NJ 07932, USA),
Hwang Soo Lee (Central Research Laboratory, SK Telecom, 58-4 Hwaam-Dong, Yusong-Gu, Taejon 305-348, Korea)
In digital communication networks, a speech recognition system
extracts feature parameters after reconstructing speech signals.
In this paper, we consider a useful approach to incorporating speech coding
parameters into a speech recognizer.
Most speech coders employ line spectrum
pairs (LSPs) to represent spectral parameters.
We introduce weighted distance measures to
improve the recognition performance of an LSP-based speech recognizer.
Experiments on speaker-independent connected-digit recognition showed
that weighted distance measures provide better
recognition accuracy than unweighted distance measures do.
Compared with a conventional method
employing mel-frequency cepstral coefficients,
the proposed method achieved higher recognition accuracy.
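
The abstract does not spell out the weighting functions themselves, so the sketch below only illustrates the general shape of a weighted LSP distance: a spectral-sensitivity-style term (closely spaced LSPs mark spectral peaks) combined with a mel-frequency warping term. The functions hz_to_mel and lsp_weights, and all constants, are illustrative assumptions rather than the paper's definitions.

    import numpy as np

    def hz_to_mel(f_hz):
        """Standard mel scale, used only for the warping term."""
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def lsp_weights(lsp_hz, fs=8000.0):
        """Illustrative weights: closely spaced LSPs (spectral peaks) get larger
        weights, and a d(mel)/d(Hz) factor emphasises low frequencies."""
        lsp_hz = np.asarray(lsp_hz, dtype=float)
        padded = np.concatenate(([0.0], lsp_hz, [fs / 2.0]))
        gaps = np.minimum(np.diff(padded)[:-1], np.diff(padded)[1:])
        sensitivity = 1.0 / np.maximum(gaps, 1e-3)
        mel_slope = np.gradient(hz_to_mel(lsp_hz), lsp_hz)
        w = sensitivity * mel_slope
        return w / w.sum()

    def weighted_lsp_distance(lsp_a, lsp_b, weights):
        """Weighted Euclidean distance between two LSP vectors (in Hz)."""
        d = np.asarray(lsp_a, dtype=float) - np.asarray(lsp_b, dtype=float)
        return float(np.sqrt(np.sum(weights * d * d)))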

SP-13.3
Two-Dimensional Multi-Resolution Analysis of Speech Signals and Its Application to Speech Recognition
Chun-ping Chan,
Yiu-wing Wong,
Tan Lee,
Pak-chung Ching (Department of Electronic Engineering, The Chinese University of Hong Kong)
This paper describes a novel approach that uses multi-resolution analysis (MRA)
for automatic speech recognition. Two-dimensional MRA is applied to the
short-time log spectrum of the speech signal to extract the slowly varying spectral
envelope that contains the most important articulatory and phonetic information.
After passing through a standard cepstral analysis process, the MRA features
are used for speech recognition in the same way as conventional short-time
features such as MFCCs and PLPs. Preliminary experiments on both clean connected
speech and noisy telephone conversation speech show that the use of MRA cepstra
results in a significant reduction in insertion error when compared with MFCCs.
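
As a rough illustration of the idea, the sketch below applies a few levels of two-dimensional Haar-style averaging to a (frames x frequency-bins) log spectrum, keeps only the low-resolution approximation as a smooth envelope, and then runs a DCT along frequency as a stand-in for the standard cepstral analysis. The choice of wavelet, the number of levels and the cepstral order are assumptions, not the authors' configuration.

    import numpy as np
    from scipy.fftpack import dct

    def haar_mra_2d(log_spec, levels=2):
        """Keep only the 2-D approximation band: average 2x2 blocks repeatedly."""
        a = np.asarray(log_spec, dtype=float)
        for _ in range(levels):
            a = a[: a.shape[0] // 2 * 2, : a.shape[1] // 2 * 2]
            a = 0.25 * (a[0::2, 0::2] + a[1::2, 0::2] + a[0::2, 1::2] + a[1::2, 1::2])
        return a

    def mra_cepstra(log_spec, levels=2, n_cep=12):
        """Smooth spectral envelope via 2-D MRA, then a DCT along frequency."""
        envelope = haar_mra_2d(log_spec, levels)
        return dct(envelope, type=2, axis=1, norm='ortho')[:, :n_cep]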

SP-13.4
Hierarchical Subband Linear Predictive Cepstral (HSLPC) Features for HMM-Based Speech Recognition
Rathinavelu Chengalvarayan (Lucent Technologies Inc.)
In this paper, a new approach to linear prediction (LP)
analysis is explored, in which the predictor coefficients are computed
from mel-warped, subband-based autocorrelation functions
obtained from the power spectrum. For spectral representation,
a set of multi-resolution cepstral features is proposed.
The general idea is to divide the full frequency band
into several subbands, perform the IDFT on the mel power
spectrum for each subband, and follow with Durbin's algorithm
and the standard conversion from LP to cepstral coefficients.
This approach can be extended to several levels of
different resolutions. Multi-resolution feature vectors,
formed by concatenating the subband cepstral features
into an extended feature vector, are shown to yield better
performance than conventional mel-warped LPCCs
over the full voice bandwidth for a connected-digit
recognition task.
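
The processing chain described above (subband mel power spectrum, IDFT, Durbin's recursion, LP-to-cepstrum conversion, concatenation) can be sketched as follows. The number of subbands, the LP order and the cepstral order are placeholders, and the mel filterbank computation itself is assumed to happen upstream.

    import numpy as np

    def levinson_durbin(r, order):
        """Durbin's recursion: autocorrelation r[0..p] -> LP polynomial a, error."""
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for m in range(1, order + 1):
            acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
            k = -acc / err
            a[1:m + 1] = a[1:m + 1] + k * a[m - 1::-1][:m]
            err *= (1.0 - k * k)
        return a, err

    def lpc_to_cepstrum(a, n_cep):
        """Cepstrum of the all-pole model 1/A(z), with A(z) = 1 + sum_k a[k] z^-k."""
        p = len(a) - 1
        c = np.zeros(n_cep + 1)
        for n in range(1, n_cep + 1):
            acc = a[n] if n <= p else 0.0
            for k in range(1, n):
                if n - k <= p:
                    acc += (k / n) * c[k] * a[n - k]
            c[n] = -acc
        return c[1:]

    def hslpc_features(mel_power_spectrum, n_subbands=2, lp_order=8, n_cep=8):
        """Split the mel power spectrum into subbands, turn each into an
        autocorrelation sequence via an inverse FFT, fit an LP model, convert
        to cepstra and concatenate the subband cepstra into one vector."""
        bands = np.array_split(np.asarray(mel_power_spectrum, dtype=float), n_subbands)
        feats = []
        for band in bands:
            r = np.fft.irfft(band)            # power spectrum -> autocorrelation
            a, _ = levinson_durbin(r, lp_order)
            feats.append(lpc_to_cepstrum(a, n_cep))
        return np.concatenate(feats)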

SP-13.5
Towards a Robust/Fast Continuous Speech Recognition System Using a Voiced-Unvoiced Decision
Douglas O'Shaughnessy,
Hesham Tolba (INRS-Telecommunications)
In this paper, we show that the concept of Voiced-Unvoiced (V-U) classification of speech sounds can be
incorporated not only in speech analysis and speech enhancement, but also in the recognition
process itself. The incorporation of such a classification in a continuous speech recognition
(CSR) system not only improves its performance in low-SNR environments, but also reduces the time and
memory needed to carry out recognition. The proposed V-U classification of speech
sounds has two principal functions: (1) it allows the voiced and unvoiced parts of speech to be enhanced
separately; (2) it limits the Viterbi search space, so that recognition can be carried
out in real time without degrading the performance of the system. We show experimentally that such a system
outperforms the baseline HTK recognizer when a V-U decision is included in both the front end and the far end of the
HTK-based recognizer.
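
The abstract does not describe the V-U detector itself, so the sketch below uses a common short-time energy / zero-crossing-rate heuristic to label frames as voiced, unvoiced or silence; such labels could then route frames to separate enhancement paths or restrict the Viterbi search. All thresholds and window sizes are illustrative assumptions.

    import numpy as np

    def voiced_unvoiced_decision(frame, energy_thresh=1e-3, zcr_thresh=0.25):
        """Label one frame as 'voiced', 'unvoiced' or 'silence' from short-time
        energy and zero-crossing rate (illustrative thresholds only)."""
        frame = np.asarray(frame, dtype=float)
        energy = float(np.mean(frame ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)
        if energy < energy_thresh:
            return 'silence'
        return 'unvoiced' if zcr > zcr_thresh else 'voiced'

    def label_frames(signal, frame_len=400, hop=160, **kwargs):
        """Slide a window over the signal and return one V-U label per frame."""
        return [voiced_unvoiced_decision(signal[s:s + frame_len], **kwargs)
                for s in range(0, len(signal) - frame_len + 1, hop)]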

SP-13.6
A C/V Segmentation Algorithm for Mandarin Speech Signal Based on Wavelet Transforms
Jhing-Fa Wang (Department of Electrical Engineering & Department of Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.),
Shi-Huang Chen (Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.)
This paper proposes a new consonant/vowel (C/V) segmentation algorithm for Mandarin speech signals.
Since the Mandarin syllable structure is a combination of a consonant (possibly null) followed by a vowel,
C/V segmentation is an important part of a Mandarin speech recognition system. Based on the wavelet
transform, the proposed method can directly search for the C/V segmentation point by using a product
function and an energy profile. The product function is generated from appropriate wavelet and scaling
coefficients of the input speech signal and can be used to indicate the C/V segmentation point. With this
product function and the additional verification of the energy profile, the C/V segmentation point can be
located accurately with low computational complexity. Experiments are provided that demonstrate the
superior performance of the proposed algorithm; an overall accuracy rate of 97.2% is achieved. The
algorithm is suitable for Mandarin speech recognition tasks.
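
A very rough sketch of the product-function idea is given below: one level of Haar analysis supplies wavelet (detail) and scaling (approximation) coefficients, whose point-wise product is averaged per frame and cross-checked against an energy profile to pick a single C/V boundary. The choice of wavelet, the frame length and the final decision rule are assumptions, not the authors' algorithm.

    import numpy as np

    def haar_analysis(x):
        """One-level Haar analysis: scaling (low-pass) and wavelet (high-pass) coefficients."""
        x = np.asarray(x, dtype=float)
        x = x[: len(x) // 2 * 2]
        approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
        detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
        return approx, detail

    def cv_segmentation_point(signal, frame_len=160):
        """Locate one C/V boundary from a product function plus an energy profile."""
        approx, detail = haar_analysis(signal)
        product = np.abs(approx * detail)
        n_frames = len(product) // frame_len
        prod_profile = product[: n_frames * frame_len].reshape(n_frames, frame_len).mean(axis=1)
        energy = (approx[: n_frames * frame_len] ** 2).reshape(n_frames, frame_len).mean(axis=1)
        # Take the frame where the combined profile rises most sharply as the
        # vowel onset; this is only one plausible decision rule.
        rise = np.diff(prod_profile * energy)
        boundary_frame = int(np.argmax(rise)) + 1
        return boundary_frame * frame_len * 2   # boundary in original-signal samples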

SP-13.7
Feature Extraction for Speech Recognition Based on Orthogonal Acoustic Feature Planes and LDA
Tsuneo Nitta (Toyohashi University of Technology)
This paper describes an attempt to extract multiple topological structures,
hidden in time-spectrum (TS) patterns, by using multiple mapping functions,
and to incorporate the functions into the feature extractor of a speech recognition system.
In previous work, the author proposed a novel feature extraction method
based on MAFP/KLT (MAFP: multiple acoustic feature planes), in which 3x3 derivative
operators were used as mapping functions, and showed that the method achieved a significant
improvement in preliminary experiments.
In this paper, the mapping functions are first extracted directly in the form of a 3x3
orthogonal basis from a speech database. Next, the functions are evaluated together with
3x3 simplified operators modeled on the orthogonal basis.
Finally, after comparing the experimental results, the author proposes an effective
feature extraction method based on MAFP/LDA, in which a Sobel operator is used for the mapping
functions.
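
At a coarse level, the MAFP/LDA pipeline described here can be pictured as in the sketch below: 3x3 Sobel operators are convolved with each time-spectrum pattern to form additional acoustic feature planes, the planes are stacked into one vector, and LDA learns the projection. scipy and scikit-learn stand in for whatever implementation the author used, and the particular set of planes shown is an assumption.

    import numpy as np
    from scipy.signal import convolve2d
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # 3x3 Sobel operators along the time and frequency axes of the TS pattern.
    SOBEL_T = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
    SOBEL_F = SOBEL_T.T

    def acoustic_feature_planes(ts_pattern):
        """One TS pattern -> stacked feature planes: the pattern itself plus its
        Sobel derivatives along time and frequency, flattened into one vector."""
        planes = [ts_pattern,
                  convolve2d(ts_pattern, SOBEL_T, mode='same'),
                  convolve2d(ts_pattern, SOBEL_F, mode='same')]
        return np.concatenate([p.ravel() for p in planes])

    def fit_mafp_lda(ts_patterns, labels, n_components=None):
        """Fit an LDA projection on MAFP vectors built from labelled TS patterns."""
        X = np.stack([acoustic_feature_planes(np.asarray(p, dtype=float)) for p in ts_patterns])
        return LinearDiscriminantAnalysis(n_components=n_components).fit(X, labels)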

SP-13.8
Distinctive Feature Detection Using Support Vector Machines
Partha Niyogi,
Chris Burges,
Padma Ramesh (Bell Labs, Lucent Technologies, USA.)
An important aspect of distinctive feature based approaches to
automatic speech recognition is the formulation of a framework for
robust detection of these features. We discuss the application of
support vector machines (SVMs), which arise when the structural risk
minimization principle is applied to such feature detection problems.
In particular, we describe the problem of detecting stop consonants in
continuous speech and discuss an SVM framework for detecting these
sounds. In this paper we use both linear and nonlinear SVMs for stop
detection and present experimental results showing that they perform
better than a cepstral-feature-based hidden Markov model (HMM)
system on the same task.
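
As a sketch of the experimental contrast described above, the snippet below trains a linear and a nonlinear (RBF-kernel) SVM as frame-level stop detectors with scikit-learn; the acoustic features, the labelling scheme and the hyperparameters are placeholders rather than the paper's setup.

    import numpy as np
    from sklearn.svm import SVC

    def train_stop_detectors(frame_features, is_stop):
        """Train a linear and a nonlinear (RBF) SVM as frame-level stop detectors.

        frame_features: (n_frames, n_dims) acoustic feature vectors
        is_stop:        (n_frames,) binary labels, 1 inside stop consonants
        """
        linear_svm = SVC(kernel='linear', C=1.0).fit(frame_features, is_stop)
        rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale').fit(frame_features, is_stop)
        return linear_svm, rbf_svm

    def detect_stops(svm, frame_features):
        """Per-frame stop decisions; contiguous runs of 1s mark stop candidates."""
        return svm.predict(frame_features)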