Chair: Biing Hwang Juang, AT&T Bell Laboratories (USA)
Jan P. Verhasselt, University of Ghent (BELGIUM)
Jean-Pierre Martens, University of Ghent (BELGIUM)
In this paper, an artificial neural network (ANN) architecture for modeling the transitions between consecutive phones is presented. These `phone transition' models are particularly suited for taking into account the coarticulation phenomena in continuous speech. In order to obtain robust, well-generalizing probability estimates, the evidence of the variable-frame-rate transition models and that of context-independent segment-based phone models are combined by means of an additional ANN, called the Transition Controlled Neural Network (TCNN). The concept of the transition approach was introduced at ICSLP'94; in this paper a new and more sophisticated implementation is proposed and evaluated on a phone recognition task. The new TCNN approach significantly outperforms the old one.
Hans-Peter Hutter, TIK/ETH Zurich (SWITZERLAND)
This paper compares a newly proposed hybrid connectionist-SCHMM approach with other hybrid approaches. In the new approach, a multilayer perceptron (MLP) replaces the conventional codebooks of semi-continuous HMMs (SCHMMs). To this end, the MLP is trained on so-called basic elements (phones and phone parts) in such a way that the outputs of the network estimate the a posteriori probabilities of these elements, given a context of input vectors. These a posteriori estimates are converted into scaled likelihoods, which are then used as observation probabilities in the framework of classical SCHMMs. The remaining parameters of the SCHMMs are trained with the well-known Baum-Welch algorithm using the likelihoods estimated by the MLP. This approach compared favorably with other recently proposed hybrid systems and classical approaches on an isolated German digit recognition task over telephone lines: it exhibited the highest recognition rate of all systems, followed by an approach using LVQ3 optimization of the codebook.
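The posterior-to-likelihood conversion described above is, in the usual hybrid formulation, a direct application of Bayes' rule: dividing each network output by the corresponding class prior yields the likelihood up to the constant factor p(x). A minimal sketch of this standard step (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, eps=1e-10):
    """Convert MLP a posteriori estimates P(q|x) into scaled likelihoods
    p(x|q)/p(x) by dividing out the class priors P(q) (Bayes' rule)."""
    return posteriors / np.maximum(priors, eps)

# Toy example: three basic elements, one frame of MLP outputs.
post = np.array([0.7, 0.2, 0.1])    # network outputs (sum to 1)
prior = np.array([0.5, 0.3, 0.2])   # relative class frequencies in the training data
lik = scaled_likelihoods(post, prior)
```

Because p(x) is the same for all states at a given frame, these scaled values can be used directly as observation probabilities in Viterbi or Baum-Welch recursions.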
T. L. Burrows, Cambridge University (UK)
M. Niranjan, Cambridge University (UK)
In this paper, the speech production system is modelled using the true glottal excitation as the source and a recurrent neural network to represent the vocal tract. The hidden nodes have multiple delays of one and two samples, making the network equivalent to a parallel formant synthesiser in the linear regions of the hidden node sigmoids. An ARX model identification is carried out to initialise the neural network parameters. These parameters are re-estimated in an analysis-by-synthesis framework to minimise the synthesis (output) error. Unlike other analysis-by-synthesis speech production models such as CELP, the source and filter in this approach are decoupled, enabling manipulation of the source time-scale to achieve high quality pitch changes.
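In the linear region of its sigmoid, a hidden node with feedback delays of one and two samples reduces to a second-order IIR resonator, and a parallel bank of such resonators is exactly a parallel formant synthesiser. A minimal sketch of that correspondence, with feedback coefficients derived from hypothetical formant frequencies and bandwidths rather than the paper's learned weights:

```python
import numpy as np

def resonator(x, f, bw, fs=8000.0):
    """Second-order IIR resonator: in its linear regime a hidden node with
    one- and two-sample feedback delays computes
        y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = np.exp(-np.pi * bw / fs)               # pole radius from bandwidth
    a1 = 2 * r * np.cos(2 * np.pi * f / fs)    # feedback weight, delay 1
    a2 = -r * r                                # feedback weight, delay 2
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

# Parallel formant synthesis: sum several resonators driven by the same source.
source = np.zeros(64)
source[0] = 1.0                                # impulse standing in for a glottal pulse
speech = sum(resonator(source, f, bw) for f, bw in [(500, 60), (1500, 90), (2500, 120)])
```

Because the source is kept separate from this filter bank, its time-scale can be manipulated (e.g. for pitch modification) without re-estimating the filter.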
Tan Lee, Chinese University of Hong Kong (HONG KONG)
P.C. Ching, Chinese University of Hong Kong (HONG KONG)
L.W. Chan, Chinese University of Hong Kong (HONG KONG)
This paper describes a new method of utilizing recurrent neural networks (RNNs) for speech modeling and speech recognition. For each particular speech unit, a fully connected recurrent neural network is built such that the static and dynamic speech characteristics are represented simultaneously by a specific temporal pattern of neuron activation states. By using the temporal RNN output, an input utterance can be represented as a number of stationary speech segments, which may be related to the basic phonetic components of the speech unit. An efficient self-supervised training algorithm has been developed for the RNN speech model. The segmentation for input utterances and the statistical modeling for individual phonetic segments are performed interactively in this training process. Some experimental results are used to demonstrate how the proposed RNN speech model can be used effectively for automatic recognition of isolated speech utterances.
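One simple way to obtain the stationary segments mentioned above, consistent with the idea that each segment corresponds to a specific pattern of neuron activation states, is to start a new segment whenever the dominant output neuron changes. This is an illustrative simplification, not the paper's exact segmentation criterion:

```python
import numpy as np

def segment_by_state(activations):
    """Split a sequence of RNN output vectors (frames x neurons) into
    quasi-stationary segments: a new segment starts whenever the dominant
    (argmax) neuron changes. Returns (start, end) frame index pairs."""
    states = np.argmax(activations, axis=1)
    starts = [0] + [t for t in range(1, len(states)) if states[t] != states[t - 1]]
    return list(zip(starts, starts[1:] + [len(states)]))

# Toy activations: neuron 0 dominates frames 0-1, neuron 1 frames 2-3.
acts = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
segs = segment_by_state(acts)  # → [(0, 2), (2, 4)]
```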
Manish Sharma, Rutgers University (USA)
Richard Mammone, Rutgers University (USA)
A new neural tree network (NTN)-based speech recognition system is presented. The NTN is a hierarchical classifier that combines the properties of decision trees and feed-forward neural networks. In this sub-word unit-based system, the NTNs model the sub-word speech segments, while the Viterbi algorithm is used for temporal alignment. A durational probability is associated with each sub-word NTN. An iterative algorithm is proposed for training the sub-word NTNs, in which the sub-word NTN models as well as the sub-word segment boundaries within a vocabulary word are re-estimated. Thus, the proposed system is a homogeneous neural-network-based, sub-word unit-based speech recognition system. Furthermore, embedded within this word model paradigm, multiple NTNs are trained for each sub-word segment and their output decisions are combined, or fused, to yield improved performance. The proposed discriminatory-training-based system did not compare favourably with a hidden Markov model-based system. The paradigm presented in this paper can be argued to represent a class of discriminatory-training-based, homogeneous (versus hybrid), sub-word unit-based speech recognition systems; hence, the results reported here can be generalized to other similar systems.
Ulrich Bodenhausen, University of Karlsruhe (GERMANY)
Hermann Hild, University of Karlsruhe (GERMANY)
The successful application of speech recognition systems to new domains greatly depends on the tuning of the architecture to the new task, especially if the amount of training data is small. For example, the application of Multi-Layer Perceptrons (MLPs) to speech recognition requires the optimization of the number of hidden units, the size of the input windows over time and the number of states that model an acoustic event. Previously, we have proposed the Automatic Structure Optimization (ASO) algorithm, which optimizes all of the above architectural parameters automatically. In this paper we (1) present results for the successful application of the ASO algorithm to connected spoken letter recognition, (2) show the suitability of the algorithm for various sizes of the system and (3) analyze the computational efficiency of the automatic optimization process for four different tasks.
K. Kasper, Institut für Angewandte Physik (GERMANY)
H. Reininger, Institut für Angewandte Physik (GERMANY)
D. Wolf, Institut für Angewandte Physik (GERMANY)
H. Wüst, Institut für Angewandte Physik (GERMANY)
For a variety of telephone applications it is sufficient to realize a speech recognition system (SRS) with a vocabulary consisting of a few command words, digits, and connected digits. However, in developing an SRS for a telephone environment it has to be considered that the speech is bandpass-limited and that high recognition performance has to be guaranteed under speaker-independent and even adverse conditions. Fully recurrent neural networks (FRNN) provide a new approach for realizing a robust SRS with a single network: FRNN are able to perform the process of feature scoring discriminatively and independently of the length of the feature sequence. Here we report on investigations into realizing a monolithic FRNN-based SRS for telephone speech. Besides isolated word recognition, the capability of the FRNN-SRS to deal with connected digit recognition is presented. Furthermore, it is shown how FRNN can be immunized against several types of additive noise.
W. Reichl, Munich University of Technology (GERMANY)
G. Ruske, Munich University of Technology (GERMANY)
A hybrid system for continuous speech recognition, consisting of a neural network with Radial Basis Functions and Hidden Markov Models, is described in this paper together with discriminant training techniques. Initially, the neural net is trained to approximate the a posteriori probabilities of single HMM states. These probabilities are used by the Viterbi algorithm to calculate the total scores of the individual hybrid phoneme models. The final training of the hybrid system is based on the `Minimum Classification Error' objective function, which approximates the misclassification rate of the hybrid classifier, and on the `Generalized Probabilistic Descent' algorithm. The hybrid system was used in continuous speech recognition experiments with phoneme units and achieves a phoneme recognition rate of about 63.8% in a speaker-independent task.
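The `Minimum Classification Error' objective mentioned above is commonly defined via a misclassification measure that compares the correct-class score with a soft maximum over the competing classes, passed through a sigmoid so that it becomes a smooth, differentiable error count suitable for Generalized Probabilistic Descent. A sketch of this standard formulation (parameter values are illustrative, not the authors' exact choices):

```python
import numpy as np

def mce_loss(scores, correct, eta=2.0, gamma=1.0):
    """MCE loss for one token: misclassification measure
    d = -g_correct + softmax(competitors), mapped through a sigmoid.
    d > 0 roughly means 'misclassified'; the sigmoid smooths the 0/1 count."""
    others = np.delete(scores, correct)
    soft_max = np.log(np.mean(np.exp(eta * others))) / eta  # smooth max over competitors
    d = -scores[correct] + soft_max
    return 1.0 / (1.0 + np.exp(-gamma * d))                 # smooth loss in (0, 1)

scores = np.array([2.0, 0.5, -1.0])  # per-class log scores from the hybrid models
```

Because the loss is differentiable in the scores, its gradient can be propagated back through both the Viterbi-aligned HMM scores and the RBF network weights.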
Earl Levine, Stanford University (USA)
A method is proposed to improve any temporal pattern recognition system by time warping each pattern before presentation to the recognition system. The time warping function for a pattern is generated by repeated local application of a neural network to sections of the pattern. The output of this neural network is the slope of the warping function, and the internal weight parameters are trained by a gradient descent learning rule which attempts to minimize the recognition system's error. Experimental results show that this method can improve recognition of vowel phonemes.
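The warping mechanism above can be sketched as follows: the per-section slopes produced by the network are accumulated into a warping function, along which the pattern is resampled. In this illustration the slope network is replaced by a placeholder `slope_fn` (a hypothetical stand-in; the real slopes come from a trained network):

```python
import numpy as np

def warp_pattern(pattern, slope_fn, window=8):
    """Time-warp a 1-D pattern: `slope_fn` maps each local section to the
    slope of the warping function; the cumulative slopes give the warped
    time axis, and the pattern is resampled along it by linear
    interpolation. Slopes are assumed positive (monotonic warp)."""
    n = len(pattern)
    slopes = np.array([slope_fn(pattern[max(0, i - window):i + 1]) for i in range(n)])
    t_warp = np.concatenate(([0.0], np.cumsum(slopes)[:-1]))  # warping function
    t_warp *= (n - 1) / t_warp[-1]                            # normalise total duration
    return np.interp(t_warp, np.arange(n), pattern)

# Sanity check: a constant slope of 1 gives the identity warp.
x = np.sin(np.linspace(0, np.pi, 32))
y = warp_pattern(x, lambda seg: 1.0)
```

Training then amounts to back-propagating the recognizer's error through the interpolation and the cumulative sum into the slope network's weights.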
Q.J. Zhang, Carleton University (CANADA)
Fang Wang, Carleton University (CANADA)
M.S. Nakhla, Carleton University (CANADA)
An important yet challenging task for neural-network-based speech recognizers is the effective processing of temporal information in speech signals. A high-order fully recurrent neural network is developed to handle the sequential nature of speech signals and to accommodate both temporal and spectral variations. The proposed neural network has four layers, namely an input layer, a self-organizing map, a fully recurrent hidden layer and an output layer. The important characteristic of the hidden and output neurons is their high-order processing. A two-stage unsupervised/supervised training method is developed, in which the solution from unsupervised training provides a good starting point for supervised training. The proposed neural network and training method are applied to isolated word recognition using the TI20 data.
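The `high-order' processing of the hidden and output neurons presumably refers to units driven by product terms rather than plain weighted sums; in the common second-order formulation, each unit combines products of the previous hidden state and the current input. A sketch of one such update step (an assumption about the general technique, not the paper's exact equations):

```python
import numpy as np

def high_order_recurrent_step(h, x, W):
    """One step of a second-order recurrent layer: each hidden unit i is
    driven by products of the previous state h[j] and the current input x[k],
        h'[i] = sigmoid(sum_jk W[i, j, k] * h[j] * x[k]),
    rather than by a plain weighted sum."""
    pre = np.einsum('ijk,j,k->i', W, h, x)   # second-order terms h[j]*x[k]
    return 1.0 / (1.0 + np.exp(-pre))        # sigmoid activation

# Toy example: unit 0 fires only when h[1] and x[0] are active together.
h = np.array([0.0, 1.0])
x = np.array([1.0, 0.0, 0.0])
W = np.zeros((2, 2, 3))
W[0, 1, 0] = 2.0
h_next = high_order_recurrent_step(h, x, W)
```

The product terms let a single layer gate state transitions on the current input, which is one reason second-order recurrent networks model sequential structure compactly.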