Chair: Al Alewa, Microsoft Corporation (USA)
P. Laface, Politecnico di Torino
C. Vair, Politecnico di Torino
L. Fissore, CSELT (ITALY)
The paper presents the Fast Segmental Viterbi Algorithm, a new search strategy particularly effective for very large vocabulary word recognition. It performs a tree-based, time-synchronous, left-to-right beam search that develops time-dependent acoustic and phonetic hypotheses. At any given time, it activates a subword unit associated with an arc of the lexical tree only if that time is likely to be the boundary between the current and the next unit. This new technique, tested with a vocabulary of 188,892 directory entries, achieves the same results as the Viterbi algorithm with a 35% speedup. Results are also presented for a 718-word, speaker-independent continuous speech recognition task.
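The time-synchronous beam search underlying this family of decoders can be illustrated with a minimal sketch. This is a generic beam-pruned Viterbi step, not the authors' segmental algorithm; the data layout and beam width are assumptions for illustration:

```python
import math

def beam_viterbi(obs_logprobs, trans_logprobs, beam=10.0):
    """Time-synchronous Viterbi with beam pruning (illustrative sketch,
    not the paper's segmental algorithm). Returns the best final state.

    obs_logprobs: list over frames of dicts {state: log P(obs | state)}
    trans_logprobs: dict {(prev_state, state): log transition score}
    beam: states scoring more than `beam` below the frame best are pruned.
    """
    # Initialize with the first frame's states (uniform prior assumed).
    active = dict(obs_logprobs[0])  # state -> best partial-path log score
    for frame in obs_logprobs[1:]:
        nxt = {}
        for (p, s), lt in trans_logprobs.items():
            if p in active and s in frame:
                score = active[p] + lt + frame[s]
                if score > nxt.get(s, -math.inf):
                    nxt[s] = score
        if not nxt:
            return None
        best = max(nxt.values())
        # Beam pruning: keep only hypotheses near the current frame best.
        active = {s: v for s, v in nxt.items() if v >= best - beam}
    return max(active, key=active.get)
```

The pruning step is where the speedup comes from: only states within `beam` of the frame's best score are extended into the next frame.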
Z. Li, Universite du Quebec (CANADA)
P. Kenny, Universite du Quebec (CANADA)
D. O'Shaughnessy, Universite du Quebec (CANADA)
We present new developments in our approach to the search problem in very large vocabulary speech recognition. A transcription graph, which encodes all high-scoring phonetic transcriptions that satisfy lexical constraints, is built so as to completely separate the search of the lexicon from the construction of the word graph. This enables us to limit the computational burden of the search in a way that is essentially independent of the size of the language model.
Renato De Mori, McGill University School of Computer Science (CANADA)
Charles Snow, McGill University School of Computer Science (CANADA)
Michael Galler, McGill University School of Computer Science (CANADA)
New techniques are introduced to manage speaker variability in large-vocabulary, speaker-independent automatic speech recognition systems. Stochastic inference networks are used to model speaker intentions and automatically derive multiple phonemic transcriptions of words. The statistical inferences are generalized and managed by a Justification Maintenance System module, integrated with the speech module, to dynamically adapt the system's performance to the variability of multi-speaker use. Additionally, lexical access to the database of phonemic transcriptions is mediated in a multi-pass search with a new rejection method based on beam-search-derived lattices of pseudo-syllabic segments.
P.S. Gopalakrishnan, IBM
L.R. Bahl, IBM
R.L. Mercer, Renaissance Technologies (USA)
In this paper we describe a tree search strategy, called the Envelope Search, which is a time-asynchronous search scheme that combines aspects of the A* heuristic search algorithm with those of the time-synchronous Viterbi search algorithm. This search technique is used in the large-vocabulary continuous speech recognition system developed at the IBM Research Center.
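The A* component of such a scheme can be sketched as a generic best-first search over scored hypotheses. This is an illustrative sketch of best-first search only, not IBM's Envelope Search; the `expand`, `heuristic`, and goal-test callables are hypothetical:

```python
import heapq

def best_first_search(start, expand, heuristic, is_goal):
    """Generic best-first (A*-style) tree search over hypotheses
    (illustrative sketch; the Envelope Search additionally coordinates
    this with a time-synchronous Viterbi pass).

    expand(node) yields (child, cost) pairs; heuristic(node) estimates
    the remaining cost to a goal.
    """
    # Frontier entries: (estimated total cost f, cost so far g, node).
    frontier = [(heuristic(start), 0.0, start)]
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if is_goal(node):
            return node, g
        for child, cost in expand(node):
            heapq.heappush(frontier,
                           (g + cost + heuristic(child), g + cost, child))
    return None, float('inf')
```

With an admissible heuristic, the first goal popped from the priority queue is optimal, which is the property A*-style decoders exploit.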
F. Richardson, Boston University
M. Ostendorf, Boston University
J.R. Rohlicek, Bolt Beranek & Newman Inc. (USA)
The design of search algorithms is an important issue in large vocabulary speech recognition, especially as more complex models are developed for improving recognition accuracy. Recently, multi-pass search strategies have been used as a means of applying simple models early on to prune the search space for subsequent passes using more expensive knowledge sources. The pruned search space is typically represented by an N-best sentence list or a word lattice. Here, we investigate three alternatives for lattice search: N-best rescoring, a lattice dynamic programming search algorithm and a lattice local search algorithm. Both the lattice dynamic programming and lattice local search algorithms are shown to achieve comparable performance to the N-best search algorithm while running as much as 10 times faster on a 20k word lexicon; the local search algorithm has the additional advantage of accommodating sentence-level knowledge sources.
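As a minimal illustration of the lattice dynamic programming idea, assuming a hypothetical lattice format of (source, destination, word, log score) arcs over topologically numbered nodes (not the paper's actual lattice representation):

```python
from collections import defaultdict

def lattice_best_path(arcs, start, end):
    """Dynamic-programming best-path search over a word lattice
    (illustrative sketch). arcs: list of (src, dst, word, logscore);
    nodes are assumed topologically ordered by integer id."""
    best = defaultdict(lambda: float('-inf'))
    back = {}
    best[start] = 0.0
    # Processing arcs in increasing source order is a valid DP sweep
    # because node ids are topologically ordered.
    for src, dst, word, score in sorted(arcs):
        if best[src] + score > best[dst]:
            best[dst] = best[src] + score
            back[dst] = (src, word)
    # Trace back the best word sequence from the final node.
    words, node = [], end
    while node != start:
        node, word = back[node]
        words.append(word)
    return list(reversed(words)), best[end]
```

Rescoring a lattice this way visits each arc once, which is the source of the speed advantage over rescoring an N-best list hypothesis by hypothesis.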
T. Moudenc, France Telecom (FRANCE)
D. Jouvet, France Telecom (FRANCE)
J. Monne, France Telecom (FRANCE)
This paper describes a new method for the post-processing of N-best solutions based on stochastic modelling of the number of speech signal stationarity changes which occur within the phonetic segments of each solution. The objective of this post-processing is to validate the presence of stationarity zones in the speech signal. This particular validation cannot be exploited using a centisecond approach. The signal stationarity changes are detected using an "a priori" segmentation algorithm. Two phonetic models are calculated for each phonetic segment. One corresponds to correct solutions and the other one corresponds to incorrect solutions. These two models are used simultaneously in order to compute a post-processing score for each solution. In the initial set of experiments, which was conducted on telephone databases, the use of this method resulted in a 9% error rate reduction on the "Number" database, and a 15% error rate reduction on the "Digit" database.
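The two-model scoring idea can be sketched as a likelihood-ratio re-ranker over the N-best list. The data structures here are assumptions for illustration; in the paper, the per-segment distributions over stationarity-change counts are trained stochastic models:

```python
import math

def rerank_nbest(nbest, p_correct, p_incorrect, floor=1e-6):
    """Re-rank N-best hypotheses with a likelihood-ratio post-processing
    score (illustrative sketch). Each hypothesis is a list of
    (phone, n_changes) pairs; p_correct / p_incorrect map a phone to a
    distribution over the number of stationarity changes in its segment."""
    def score(hyp):
        s = 0.0
        for phone, n in hyp:
            pc = p_correct.get(phone, {}).get(n, floor)
            pi = p_incorrect.get(phone, {}).get(n, floor)
            # Segments that look like correct realizations raise the score;
            # segments that look like recognition errors lower it.
            s += math.log(pc) - math.log(pi)
        return s
    return max(nbest, key=score)
```

Using both a "correct" and an "incorrect" model, rather than the correct model alone, makes the score discriminative between competing hypotheses.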
Tohru Shimizu, ATR/ITL
Seikou Monzen, Yamagata University
Harald Singer, ATR/ITL (JAPAN)
Shoichi Matsunaga, ATR/ITL (JAPAN)
This paper proposes a time-synchronous continuous speech recognizer driven by a context-free grammar that integrates generalized LR parser based phoneme context prediction and context-dependent HMMs. In this method, a phoneme hypotheses trie is introduced for the phoneme history representation of possible LR states, and an LR state network is introduced for LR path merging. Both techniques reduce the amount of computation. The experimental results show that this new method is more efficient than the conventional LR parser driven phoneme-synchronous continuous speech recognizer.
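The phoneme hypotheses trie amounts to prefix sharing among phoneme histories, so LR states with a common history share one path. A minimal sketch of the insertion step, with the nested-dict layout assumed for illustration:

```python
def trie_insert(trie, phonemes):
    """Insert a phoneme history into a hypothesis trie so that
    hypotheses sharing a prefix share nodes (illustrative sketch of
    the merging idea, not the paper's full decoder)."""
    node = trie
    for ph in phonemes:
        # setdefault reuses an existing child node when the prefix
        # has already been inserted, creating the sharing.
        node = node.setdefault(ph, {})
    return node
```

After inserting the histories ['k', 'a'] and ['k', 'o'] into an empty trie, the shared prefix 'k' is stored once with two children.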
Giuliano Antoniol, I.R.S.T. (ITALY)
Fabio Brugnara, I.R.S.T. (ITALY)
Mauro Cettolo, I.R.S.T. (ITALY)
Marcello Federico, I.R.S.T. (ITALY)
This paper presents an efficient way of representing a bigram language model for a beam-search based, continuous speech, large vocabulary HMM recognizer. The tree-based topology considered takes advantage of a factorization of the bigram probability derived from the bigram interpolation scheme, and of a tree organization of all the words that can follow a given one. Moreover, an optimization algorithm is used to considerably reduce the space requirements of the language model. Experimental results are provided for two 10,000-word dictation tasks: radiological reporting (perplexity 27) and newspaper dictation (perplexity 120). In the former domain 93% word accuracy is achieved with real-time response and 23 Mb process space. In the newspaper dictation domain, 88.1% word accuracy is achieved with 1.41 real-time response and 38 Mb process space. All recognition tests were performed on an HP-735 workstation.
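The factorization rests on writing the interpolated bigram as P(w|v) = λ·f(w|v) + (1−λ)·P(w): the second term does not depend on the predecessor v, so it can be attached once to a shared tree of successor words. A minimal sketch of the interpolation itself (parameter names and a single global λ are assumptions):

```python
def interpolated_bigram(bigram_freq, unigram, lam=0.9):
    """Interpolated bigram of the form the tree factorization exploits:
    P(w|v) = lam*f(w|v) + (1-lam)*P(w). The backoff term (1-lam)*P(w)
    is independent of v, so in a tree-structured LM it can be shared
    across all predecessors (illustrative sketch)."""
    def prob(w, v):
        return lam * bigram_freq.get((v, w), 0.0) + (1 - lam) * unigram.get(w, 0.0)
    return prob
```

Because only the observed bigram frequencies f(w|v) are predecessor-specific, the successor trees stay sparse, which is what makes the compact representation possible.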
Sarvar Patel, Bellcore (USA)
In continuous speech recognition with statistical language models (e.g. bigrams), a significant amount of time is spent every frame evaluating interword transitions. In fact, if N is the size of the vocabulary, O(N^2) operations are required per frame. Likewise, when evaluating a fully connected HMM with N states, the Viterbi algorithm requires O(N^2) operations per frame. This paper presents the first algorithm to break the O(N^2) complexity requirement of the Viterbi algorithm, whether evaluating interword transitions or a fully connected HMM. The algorithm presented has an average complexity of O(N√N). Previous speed-ups of the evaluation of interword transitions used heuristics, like pruning, or relied on the unavailability of many of the bigram values. This paper, however, relies on no heuristics but fundamentally improves the basic evaluation of the time-synchronous Viterbi algorithm.
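For reference, the baseline being improved is the standard O(N^2)-per-frame Viterbi update over a fully connected model. This is a sketch of that baseline, not the paper's O(N√N) algorithm:

```python
def viterbi_frame(prev, trans, obs):
    """One frame of the standard time-synchronous Viterbi update for a
    fully connected N-state model: each of the N states takes a max over
    all N predecessors, i.e. O(N^2) work per frame (baseline sketch,
    not the paper's sub-quadratic method).

    prev:  list of N best log scores at time t-1
    trans: N x N nested lists of log transition scores, trans[p][s]
    obs:   list of N log observation scores at time t
    """
    n = len(prev)
    return [max(prev[p] + trans[p][s] for p in range(n)) + obs[s]
            for s in range(n)]
```

The inner `max` over all N predecessors for each of the N successors is exactly the N^2 term the paper attacks.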
Steve Renals, University of Sheffield
Mike Hochberg, University of Cambridge (UK)
A novel, efficient search strategy for large vocabulary continuous speech recognition (LVCSR) is presented. The search algorithm, based on stack decoding, uses posterior phone probability estimates to substantially increase its efficiency with minimal effect on accuracy. The search space is dramatically reduced by phone deactivation pruning, in which phones with a small local posterior probability are deactivated. This approach is particularly well suited to hybrid connectionist/hidden Markov model systems because posterior phone probabilities are computed directly by the acoustic model. Results from a hybrid system are presented using the North American Business News LVCSR database for a 20,000-word task with a backed-off trigram language model. For this task, our single-pass decoder ran at around 15x real time on an HP735 workstation, an order-of-magnitude speed increase at a cost of less than 2% relative search error. Real-time decoding on this task was achieved with around 7% relative search error.
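Phone deactivation pruning reduces to thresholding the locally computed posteriors before hypotheses are extended; a minimal sketch, with the threshold value and data layout assumed for illustration:

```python
import math

def deactivate_phones(posteriors, threshold=1e-4):
    """Phone deactivation pruning (illustrative sketch): given the
    per-frame posterior phone probabilities produced directly by a
    connectionist acoustic model, keep only phones whose local
    posterior exceeds the threshold; the rest are not extended by
    the stack decoder. Returns surviving phones with log posteriors."""
    return {ph: math.log(p) for ph, p in posteriors.items()
            if p >= threshold}
```

Because the connectionist acoustic model outputs posteriors directly, this test costs almost nothing per frame, yet it removes most phone hypotheses before any decoding work is done on them.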