9:30, SPEECH-P11.1
AN EFFICIENT AND SCALABLE 2D DCT-BASED FEATURE CODING SCHEME FOR REMOTE SPEECH RECOGNITION
A. ALWAN, Q. ZHU
A 2D DCT-based approach to compressing acoustic features for remote speech recognition applications is presented. The coding scheme computes a 2D DCT on blocks of feature vectors, followed by uniform scalar quantization, run-length coding, and Huffman coding. Digit recognition experiments were conducted in which training used unquantized cepstral features from clean speech, and testing used the same features after 2D DCT and entropy coding and decoding, at various levels of acoustic noise. At low bit rates, the coding scheme yields recognition performance comparable to that obtained with unquantized features. 2D DCT coding of MFCCs, together with a method for variable frame rate analysis [Zhu and Alwan, 2000] and peak isolation [Strope and Alwan, 1997], maintains the noise robustness of these algorithms at low SNRs even at 624 bps. The low-complexity scheme is scalable, resulting in graceful degradation in performance with decreasing bit rate.
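The transform-and-quantize core of such a scheme can be sketched as follows. This is a minimal illustration only: the quantizer step size is a hypothetical parameter, and the paper's bit allocation, run-length, and Huffman stages are not reproduced.

```python
import numpy as np
from scipy.fft import dctn, idctn

def code_feature_block(block, step=0.5):
    """Quantize a block of cepstral feature vectors via a 2D DCT.

    block: (n_frames, n_ceps) array of features (e.g. MFCCs).
    step: hypothetical uniform scalar-quantizer step size.
    """
    coeffs = dctn(block, norm='ortho')       # 2D DCT across time and quefrency
    q = np.round(coeffs / step).astype(int)  # uniform scalar quantization
    # run-length and Huffman coding of `q` would follow in the full scheme
    return q

def decode_feature_block(q, step=0.5):
    """Dequantize and invert the 2D DCT."""
    return idctn(q * step, norm='ortho')

# usage: round-trip a block of 8 frames of 13 cepstral coefficients
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 13))
x_hat = decode_feature_block(code_feature_block(x))
```

Decreasing `step` trades bit rate for fidelity, which is the scalability knob such a scheme exposes.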
9:30, SPEECH-P11.2
EVALUATION OF MEL-LPC CEPSTRUM IN LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
H. MATSUMOTO, M. MOROTO
This paper presents a simple and efficient time-domain technique to estimate an all-pole model on the mel-frequency scale (Mel-LPC), and compares the recognition performance of the Mel-LPC cepstrum with those of both the standard LPC mel-cepstrum and the MFCC on the Japanese dictation system (Julius) with a 20,000-word vocabulary. First, the optimal value of the frequency warping factor is examined in terms of monosyllable accuracy. Using the optimal warping factors, the Mel-LPC cepstrum attains word accuracies of 93.0% for male speakers and 93.1% for female speakers, which are 2.1% and 1.7% higher than those of the LPC mel-cepstrum, respectively. Furthermore, this performance is slightly superior to that of the MFCC.
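Mel-scale all-pole modelling of this kind rests on a first-order all-pass frequency warping; the mapping can be sketched as below. The warping factor 0.35 is illustrative only, not the paper's tuned optimum.

```python
import numpy as np

def warp_frequency(omega, alpha=0.35):
    """First-order all-pass frequency warping (bilinear warping).

    omega: angular frequency in [0, pi]; alpha: warping factor,
    |alpha| < 1 (0.35 is an illustrative value, not the paper's optimum).
    Positive alpha stretches low frequencies, approximating a mel scale.
    """
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

# usage: the warped axis lies above the identity for alpha > 0
w = np.linspace(0.0, np.pi, 9)
w_tilde = warp_frequency(w)
```

The endpoints 0 and pi are fixed points of the map, so the warped axis still spans the full band.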
9:30, SPEECH-P11.3
INTEGRATION OF FIXED AND MULTIPLE RESOLUTION ANALYSIS IN A SPEECH RECOGNITION SYSTEM
R. GEMELLO, D. ALBESANO, L. MOISA, R. DE MORI
This paper compares the performance of an operational Automatic Speech Recognition system when MFCCs, J-RASTA Perceptual Linear Prediction coefficients (J-RASTA PLP), and energies from a Multi-Resolution Analysis (MRA) tree of filters are used as input features to a hybrid system consisting of a Neural Network (NN) that provides observation probabilities for a network of Hidden Markov Models (HMMs). The paper also compares the performance of the system when various combinations of these features are used, showing a 16% WER reduction with respect to J-RASTA PLP coefficients alone when those coefficients are combined with the energies computed at the output of the leaves of an MRA filter tree. Such a combination is practically feasible thanks to the NN architecture used in the system. Recognition is performed without any language model on a very large test set including many speakers uttering proper names from different locations of the Italian public telephone network.
9:30, SPEECH-P11.4
PERCEPTUAL HARMONIC CEPSTRAL COEFFICIENTS FOR SPEECH RECOGNITION IN NOISY ENVIRONMENT
L. GU, K. ROSE
Perceptual harmonic cepstral coefficients (PHCC) are proposed as features for speech recognition in noisy environments. A weighting function, which depends on the prominence of the harmonic structure, is applied to the power spectrum to ensure accurate representation of the voiced speech spectral envelope. The harmonics-weighted power spectrum undergoes mel-scaled band-pass filtering, and the log-energy of the filters' output is discrete-cosine transformed to produce cepstral coefficients. Lower spectral clipping is applied to the power spectrum, followed by within-filter root-power amplitude compression to reduce amplitude variation without compromising the gain-invariance properties. Experiments show significant recognition gains of PHCC over MFCC, with 23% and 36% error-rate reductions on the Mandarin digit database in white-noise and babble-noise environments, respectively.
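The back end of this pipeline (band-pass filtering, amplitude compression, and a DCT) can be sketched as follows. The harmonics-based weighting and spectral clipping are paper-specific and omitted; the filterbank contents and the root exponent 2/3 are assumptions for illustration.

```python
import numpy as np
from scipy.fft import dct

def cepstra_from_filterbank(power_spec, mel_fb, root=2.0/3.0):
    """Filterbank energies -> compressed log-energies -> DCT cepstra.

    power_spec: (n_fft_bins,) power spectrum (in PHCC this would already
    be harmonics-weighted and clipped; those steps are not shown here).
    mel_fb: (n_filters, n_fft_bins) mel filterbank matrix.
    root: illustrative root-power amplitude-compression exponent.
    """
    energies = mel_fb @ (power_spec ** root)            # compressed band energies
    ceps = dct(np.log(energies + 1e-10), norm='ortho')  # cepstral coefficients
    return ceps
```

The small additive constant before the log is a common numerical guard, not part of the paper's formulation.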
9:30, SPEECH-P11.5
PERIPHERAL FEATURES FOR HMM-BASED SPEECH RECOGNITION
T. FUKUDA, M. TAKIGAWA, T. NITTA
This paper describes an attempt to extract peripheral features of a point c(ti, qj) on a time-quefrency (TQ) pattern by observing the n×n neighborhood of the point, and then to incorporate these peripheral features into the MFCC-based feature extractor of a speech recognition system as a replacement for dynamic features. In the design of the feature extractor, the orthogonal bases extracted directly from speech data by applying the KLT to 7×3 blocks on a TQ pattern are first adopted as the peripheral features; then, the top two principal bases are selected and simplified in the form of a delta-t operator and a delta-q operator. The proposed feature set of MFCC and peripheral features shows significant improvements over the standard feature set of MFCC and dynamic features in experiments with an HMM-based ASR system. The reason for the increased performance is discussed in terms of minimal-pair tests.
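As a rough illustration of local observation on a time-quefrency pattern, the two simplified operators can be approximated by central differences along each axis. This is a sketch only: the paper's operators are derived from KLT bases of local blocks, not plain differences.

```python
import numpy as np

def peripheral_deltas(tq):
    """Central differences along time (delta-t) and quefrency (delta-q)
    over a time-quefrency pattern tq of shape (n_frames, n_ceps).

    Illustrative stand-in for the paper's KLT-derived operators;
    boundary rows/columns are left at zero for simplicity.
    """
    dt = np.zeros_like(tq)
    dq = np.zeros_like(tq)
    dt[1:-1, :] = (tq[2:, :] - tq[:-2, :]) / 2.0   # slope along the time axis
    dq[:, 1:-1] = (tq[:, 2:] - tq[:, :-2]) / 2.0   # slope along the quefrency axis
    return dt, dq
```

Each output point summarizes the pattern's local behavior around c(ti, qj), which is the role the peripheral features play in place of conventional dynamic features.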
9:30, SPEECH-P11.6
USING PHASE SPECTRUM INFORMATION FOR IMPROVED SPEECH RECOGNITION PERFORMANCE
R. SCHLUETER, H. NEY
In this work, new acoustic features for continuous speech recognition based on the short-term Fourier phase spectrum are introduced for mono (telephone) recordings. The new phase-based features were combined with standard Mel-Frequency Cepstral Coefficients (MFCC), and results were produced with and without additional linear discriminant analysis (LDA) to choose the most relevant features. Experiments were performed on the SieTill corpus of telephone-line-recorded German digit strings. Using LDA to combine purely phase-based features with MFCCs, we obtained improvements in word error rate of up to 25% relative to using MFCCs alone with the same overall number of parameters in the system.
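The raw quantity underlying such features, the short-term Fourier phase spectrum of a windowed frame, can be computed as below. The paper's specific phase post-processing and the LDA-based combination with MFCCs are not reproduced; the Hamming window and FFT size are assumptions.

```python
import numpy as np

def short_term_phase(frame, n_fft=512):
    """Short-term Fourier phase spectrum of one speech frame.

    frame: 1-D array of samples; a Hamming window and n_fft=512 are
    illustrative choices. Returns phase values in (-pi, pi] for the
    n_fft // 2 + 1 non-negative frequency bins.
    """
    windowed = frame * np.hamming(len(frame))
    spec = np.fft.rfft(windowed, n_fft)
    return np.angle(spec)

# usage: a zero-phase cosine at bin 32 yields ~zero phase at that bin
n = 512
frame = np.cos(2 * np.pi * 32 * np.arange(n) / n)
phase = short_term_phase(frame, n_fft=n)
```
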
9:30, SPEECH-P11.7
A STUDY OF TWO DIMENSIONAL LINEAR DISCRIMINANTS FOR ASR
S. KAJAREKAR, B. YEGNANARAYANA, H. HERMANSKY
In this paper we study the information in the joint time-frequency domain, using 1515-dimensional blocks of the spectrogram (15 spectral energies over a temporal span of 1 s) as features. In this feature space, we first derive 20 joint linear discriminants (JLDs) using linear discriminant analysis (LDA). Using principal component analysis (PCA), we conclude that information in this block of the spectrogram can be analyzed independently across the time and frequency domains. Under this assumption, we propose a sequential design of two-dimensional discriminants (CLDs), i.e., spectral discriminants followed by temporal discriminants. We show that these CLDs are similar to the first few JLDs, and that the discriminant features derived from the CLDs outperform those obtained from the JLDs on a continuous-digit recognition task.
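The cascade structure amounts to a separable projection of a spectrogram block, first along frequency and then along time. In the sketch below, the bases are placeholders for LDA-derived discriminants, which would require labeled training data to estimate.

```python
import numpy as np

def cascade_project(block, spec_basis, temp_basis):
    """Sequential two-dimensional discriminant projection.

    block: (n_frames, n_bands) spectrogram block;
    spec_basis: (n_bands, k_spec) spectral discriminants;
    temp_basis: (n_frames, k_temp) temporal discriminants.
    Both bases here are placeholders for LDA-trained directions.
    """
    spectral_out = block @ spec_basis   # project each frame onto spectral discriminants
    return temp_basis.T @ spectral_out  # then project each column across time

# usage: a 101-frame x 15-band block reduced to a 4 x 5 feature matrix
rng = np.random.default_rng(2)
features = cascade_project(rng.standard_normal((101, 15)),
                           rng.standard_normal((15, 5)),
                           rng.standard_normal((101, 4)))
```

Separability is what makes the cascade cheap: two small one-dimensional bases replace one large joint basis over the full 1515-dimensional block.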
9:30, SPEECH-P11.8
FORMANT WEIGHTED CEPSTRAL FEATURE FOR LSP-BASED SPEECH RECOGNITION
H. HUR, H. KIM
In this paper, we propose a formant-weighted cepstral feature for an LSP-based speech recognition system. The proposed weighting scheme is based on the well-known property of LSPs that the speech spectrum has a peak where adjacent LSFs come close together. By applying this scheme to the pseudo-cepstrum (PCEP) conversion process, we obtain a formant-weighted, or peak-enhanced, cepstral feature. Results of speech recognition experiments using QCELP coder output show that the proposed feature set outperforms conventional features such as LSP or PCEP. Moreover, its performance also exceeds that of the unquantized LPC cepstrum.
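The unweighted baseline, the pseudo-cepstrum, can be sketched from the LSFs as below. The formula c_n ≈ (1/n) Σ_i cos(n·ω_i) is a commonly cited approximation whose exact scaling should be treated as an assumption here, and the paper's formant-based weighting of near-coincident LSF pairs is not reproduced.

```python
import numpy as np

def pseudo_cepstrum(lsf, n_ceps=12):
    """Pseudo-cepstrum (PCEP) from line spectral frequencies.

    lsf: LSFs in radians, in (0, pi). Implements the approximation
    c_n ~ (1/n) * sum_i cos(n * lsf_i); the scaling is an assumption,
    and the proposed formant weighting is not included.
    """
    n = np.arange(1, n_ceps + 1)[:, None]           # cepstral index n = 1..n_ceps
    return np.cos(n * lsf[None, :]).sum(axis=1) / n[:, 0]

# usage: a 4th-order LSF set mapped to 5 pseudo-cepstral coefficients
c = pseudo_cepstrum(np.array([0.3, 0.8, 1.4, 2.1]), n_ceps=5)
```
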
9:30, SPEECH-P11.9
ON THE USE OF MATRIX DERIVATIVES IN INTEGRATED DESIGN OF DYNAMIC FEATURE PARAMETERS FOR SPEECH RECOGNITION
R. CHENGALVARAYAN
In this work, an integrated approach to vector dynamic feature extraction is described in the design of a hidden Markov model based speech recognizer. The new model contains state-dependent, vector-valued weighting functions responsible for transforming static speech features into dynamic ones. In this paper, minimum classification error (MCE) training is extended from the earlier formulation of the VVD-IHMM, which applied a maximum-likelihood-based training algorithm.
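For reference, the conventional fixed-weight version of this transformation is the standard delta computation over a short frame window, sketched below. It stands in for the paper's state-dependent, trainable weighting functions; the ±2-frame window and the 1/Στ² normalization are the usual textbook choices, not the paper's.

```python
import numpy as np

def dynamic_features(static, weights=(-2, -1, 0, 1, 2)):
    """Dynamic (delta) features as a fixed linear weighting of static
    feature vectors over a +/-2 frame window.

    static: (n_frames, n_dims) static features. Edge frames are
    handled by replication. The fixed weights here are a conventional
    stand-in for the paper's state-dependent weighting functions.
    """
    taus = np.array(weights, dtype=float)
    norm = np.sum(taus ** 2)                          # standard 1 / sum(tau^2) scaling
    padded = np.pad(static, ((2, 2), (0, 0)), mode='edge')
    delta = np.zeros_like(static, dtype=float)
    for k, tau in enumerate(taus):
        delta += tau * padded[k:k + len(static), :]   # weighted sum over the window
    return delta / norm
```

On a linearly increasing feature trajectory this recovers the slope, which is the quantity dynamic features are meant to capture.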
9:30, SPEECH-P11.10
SUBBAND FEATURE EXTRACTION USING LAPPED ORTHOGONAL TRANSFORM FOR SPEECH RECOGNITION
Z. TUFEKCI, J. GOWDY
It is well known that dividing speech into frequency subbands can improve the performance of a speech recognizer, especially for speech corrupted by noise. Subband (SUB) features are typically extracted by dividing the frequency band into subbands using non-overlapping rectangular windows and then processing each subband's spectrum separately. However, multiplying a signal by a rectangular window creates discontinuities, which produce large-amplitude frequency coefficients at high frequencies that degrade the performance of the speech recognizer. In this paper we propose Lapped Subband (LAP) features, which are calculated by applying the Discrete Orthogonal Lapped Transform (DOLT) to the mel-scaled log-filterbank energies of a speech frame. The performance of the LAP features was evaluated on a phoneme recognition task and compared with that of SUB features and Mel-Frequency Cepstral Coefficient (MFCC) features. Experimental results show that the proposed LAP features outperform SUB features and MFCC features under white-noise, band-limited white-noise, and no-noise conditions.
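The conventional SUB baseline described above can be sketched as follows: the mel-scaled log-filterbank energies are split into non-overlapping groups (implicitly, rectangular windows) and each group is transformed separately. The lapped transform itself, with its overlapping basis functions, is not reproduced here.

```python
import numpy as np
from scipy.fft import dct

def subband_features(log_fbank_energies, n_subbands=2):
    """Conventional subband (SUB) features.

    log_fbank_energies: 1-D array of mel-scaled log-filterbank energies
    for one frame. Splits the channels into n_subbands non-overlapping
    groups and applies a DCT to each group independently; the group
    boundaries are the discontinuities the proposed LAP features avoid.
    """
    groups = np.array_split(np.asarray(log_fbank_energies), n_subbands)
    return [dct(g, norm='ortho') for g in groups]

# usage: 20 channels -> two 10-coefficient subband feature vectors
feats = subband_features(np.ones(20), n_subbands=2)
```
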