1:00, SPEECH-L1.1
AUTOMATIC TRANSCRIPTION OF VOICEMAIL AT AT&T
M. BACCHIANI
This paper reports on the automatic transcription accuracy of
voicemail messages. It shows that vocal tract length normalization and
adaptation using linear transformations, proven to improve accuracy on
the Switchboard task, provide similar accuracy improvements on this
task. Direct application of the normalization techniques is
complicated by the fragmentation of the data. However, unsupervised
clustering was found to be effective in ensuring robust estimation of
normalization parameters. Variance adaptation resulted in larger
accuracy improvements than adaptation of only mean parameters,
probably due to a large variability in channel conditions. The use of
semi-tied covariances provided additional gains over using speaker
and channel normalization. Combining the various compensation
techniques reduced the system word error rate from
34.9% for the baseline system to 28.7%.
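As a rough illustration of the normalization strategy described above (unsupervised clustering of the data followed by per-cluster estimation of warp factors), the following Python sketch groups utterances and picks, for each cluster, the vocal tract length warp that maximizes a pooled likelihood. The clustering criterion, the warp grid, and the log_likelihood scoring function are assumptions for illustration, not the authors' implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def estimate_cluster_warps(features, log_likelihood, n_clusters=8,
                               warp_grid=np.arange(0.88, 1.13, 0.02)):
        """features: list of (frames x dims) cepstral arrays, one per utterance.
        log_likelihood(utt, warp): assumed scorer of warped data under the
        current acoustic model. Returns cluster labels and per-cluster warps."""
        # cluster utterances on their mean cepstra (a stand-in for the
        # unsupervised clustering used to pool fragmented voicemail data)
        means = np.stack([f.mean(axis=0) for f in features])
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(means)
        warps = {}
        for c in range(n_clusters):
            utts = [f for f, l in zip(features, labels) if l == c]
            # grid search: keep the warp factor with the highest pooled likelihood
            warps[c] = max(warp_grid,
                           key=lambda a: sum(log_likelihood(u, a) for u in utts))
        return labels, warps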
1:20, SPEECH-L1.2
ERROR CORRECTIVE MECHANISMS FOR SPEECH RECOGNITION
L. MANGU, M. PADMANABHAN
In the standard MAP approach to speech recognition, the goal is to find the word sequence with the highest posterior probability. Recently, a number of alternate approaches have been proposed for directly optimizing the word error rate, the commonly used evaluation criterion. One of them, the consensus decoding approach, converts a word lattice into a confusion network, which specifies the sequence of word-level confusions, and outputs the word with the highest posterior probability from each confusion set. This paper presents a method for discriminating between the correct and alternate hypotheses in a confusion set using additional knowledge sources extracted from the confusion networks. We use transformation-based learning for inducing a set of rules to guide a better decision between the top two candidates with the highest posterior probabilities in each confusion set. In experiments on the Switchboard corpus, we show significant improvements over the consensus decoding approach.
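A minimal Python sketch of the consensus-decoding baseline this paper builds on, plus a hook where learned rules could override the choice between the top two candidates in a confusion set. The data format, the rule interface, and the '<eps>' null-word label are assumptions, not the authors' code.

    def consensus_decode(confusion_network, rules=()):
        """confusion_network: list of confusion sets, each a list of
        (word, posterior) pairs. rules: callables that may prefer the
        runner-up given the local context (stand-in for the learned
        transformation rules)."""
        output = []
        for i, cset in enumerate(confusion_network):
            ranked = sorted(cset, key=lambda wp: wp[1], reverse=True)
            best = ranked[0]
            second = ranked[1] if len(ranked) > 1 else None
            # a rule may swap the two highest-posterior candidates
            for rule in rules:
                if second is not None and rule(best, second, confusion_network, i):
                    best = second
                    break
            if best[0] != '<eps>':   # skip null-word (deletion) hypotheses
                output.append(best[0])
        return output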
1:40, SPEECH-L1.3
EXPLICIT WORD ERROR MINIMIZATION USING WORD HYPOTHESIS POSTERIOR PROBABILITIES
F. WESSEL, R. SCHLÜTER, H. NEY
In this paper, we introduce a new concept, the time frame error rate. We show that this error rate is closely correlated with the word error rate and use it to overcome the mismatch between Bayes' decision rule, which aims at minimizing the expected sentence error rate, and the word error rate, which is used to assess the performance of speech recognition systems. Based on the time frame errors, we derive a new decision rule and show that the word error rate can be reduced consistently with it on various recognition tasks. All stochastic models are left completely unchanged. We present experimental results on five corpora: the Dutch Arise corpus, the German Verbmobil '98 corpus, the English North American Business '94 20k and 64k development corpora, and the English Broadcast News '96 corpus. The relative reduction of the word error rate ranges from 2.3% to 5.1%.
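As a rough illustration of the kind of quantity such a decision rule operates on (our notation, not necessarily the authors' exact formulation): the posterior of word w at time frame t can be accumulated from all word lattice hypotheses [w; s, e] whose time span covers t,

    p_t(w \mid x_1^T) = \sum_{[w;s,e]:\; s \le t \le e} p([w;s,e] \mid x_1^T),

and the recognizer then outputs, per region, the word maximizing this frame-level posterior rather than selecting the single sentence hypothesis with the highest overall posterior.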
2:00, SPEECH-L1.4
FROM BROADCAST NEWS TO SPONTANEOUS DIALOGUE TRANSCRIPTION: PORTABILITY ISSUES
N. BERTOLDI, F. BRUGNARA, M. CETTOLO, M. FEDERICO, D. GIULIANI
This paper reports on experiments of porting the ITC-irst Italian broadcast news recognition system to two spontaneous dialogue domains: appointment scheduling and tourist information.
The trade-off between performance and the required amount of task-specific data was investigated. Porting was carried out by applying supervised adaptation methods to acoustic and language models. By using manual transcripts corresponding to two hours of speech and one hour of annotated speech, the adapted systems achieved word error rates of 27.0% and 29.3%. As a reference, two domain-specific baseline systems, developed on much more training data, achieved word error rates of 22.6% and 22.0%, respectively.
2:20, SPEECH-L1.5
GENERATION AND EXPANSION OF WORD GRAPHS USING LONG SPAN CONTEXT INFORMATION
C. NEUKIRCHEN, D. KLAKOW, X. AUBERT
A new algorithm for the generation of word graphs in a cross-word
decoder that uses long span m-gram language models is
presented. The generation of word hypotheses within the
graph relies on the word m-tuple-based boundary optimization.
The graphs contain the full word-history information,
since the graph structure reflects all LM constraints
used during the search. This results in better word
boundaries and enhanced graph pruning capabilities.
Furthermore, the memory cost of expanding the m-gram constrained word
graphs to apply very long span LMs (e.g. ten-grams that are
constructed by log-linear LM combination) is considerably reduced.
Experiments for lattice generation and rescoring
have been carried out on the 5K-word WSJ task and the
64K-word NAB task.
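The log-linear LM combination mentioned above can be sketched in generic notation (the component models and interpolation weights are task specific and not given in the abstract) as

    P(w \mid h) = \frac{1}{Z_\lambda(h)} \prod_i P_i(w \mid h)^{\lambda_i},

where Z_\lambda(h) normalizes over the vocabulary; combining several shorter-span component models this way yields the effective very long span (e.g. ten-gram) constraints that are applied when the word graphs are expanded and rescored.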
2:40, SPEECH-L1.6
IMPROVED DISCRIMINATIVE TRAINING TECHNIQUES FOR LARGE VOCABULARY SPEECH RECOGNITION
D. POVEY, P. WOODLAND
This paper investigates the use of discriminative training techniques
for large vocabulary speech recognition with training datasets of up to
265 hours. Techniques for improving lattice-based Maximum Mutual
Information Estimation (MMIE) training are described and compared to
Frame Discrimination (FD). An objective function which is an
interpolation of MMIE and standard Maximum Likelihood Estimation (MLE)
is also discussed. Experimental results on both the Switchboard and
North American Business News tasks show that MMIE training can yield
significant performance improvements over standard MLE even for the
most complex speech recognition problems with very large training
sets.
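For reference, the lattice-based MMIE objective maximizes the posterior of the correct transcriptions W_r given the acoustic observations O_r,

    F_{MMIE}(\theta) = \sum_r \log \frac{p_\theta(O_r \mid W_r)\, P(W_r)}{\sum_W p_\theta(O_r \mid W)\, P(W)},

and an interpolation of MMIE and MLE of the kind discussed in the paper can be written generically as F = \alpha F_{MMIE} + (1 - \alpha) F_{MLE}. The exact weighting and any acoustic likelihood scaling used by the authors are not specified in the abstract.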