Session: SPEECH-P5
Time: 9:30 - 11:30, Wednesday, May 9, 2001
Location: Exhibit Hall Area 4
Title: Spoken Language Systems
Chair: Drew Halberstadt

9:30, SPEECH-P5.1
TOPIC FOREST: A PLAN-BASED DIALOG MANAGEMENT STRUCTURE
X. WU, F. ZHENG, M. XU
There are many task-oriented dialog systems, but few can cope with issues such as multiple topics, topic changes, information sharing among different topics, and differences in the importance of information items. To address these issues efficiently, a plan-based dialog management structure named Topic Forest is proposed, which makes mixed-initiative dialog control easier. A Topic Forest based reasoning engine with strategies for both remembering and forgetting is also described. The reasoning strategy is designed to be domain-independent, which makes the dialog management model easy to port to other domains.

9:30, SPEECH-P5.2
A DYNAMIC SEMANTIC MODEL FOR RE-SCORING RECOGNITION HYPOTHESES
C. WAI, R. PIERACCINI, H. MENG
This paper describes the use of Belief Networks (BNs) for dynamic semantic modeling within the United Airlines' FLight InFOrmation service (FLIFO). Callers can speak naturally to obtain status information about all United Airlines flights (including arrival and departure times). In this work we aim to enable the application to utilize dynamic call information to improve speech recognition performance. Dynamic call information includes the location of the caller, the time and date of the call, and the caller's dialog history. Dynamic semantic models can incorporate such additional information about the call in re-scoring the N-best recognition hypotheses. Our experiments showed that this improved the recognition accuracy of flight number utterances from 84.95% to 86.80%.
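
The general re-scoring idea described above can be sketched as follows. This is a hypothetical illustration, not the paper's Belief Network model: each N-best hypothesis carries an acoustic score, and a context-dependent semantic log-prior (derived from dynamic call information such as the caller's airport) is added before re-ranking. All scores, priors, and the weight are invented.

```python
# Hypothetical sketch of N-best re-scoring with a context-dependent
# semantic prior; the paper derives its prior from Belief Networks,
# while the values and the linear combination here are illustrative.

def rescore_nbest(nbest, semantic_prior, weight=0.5):
    """Combine each hypothesis's acoustic log-score with a semantic
    log-prior derived from dynamic call information, then re-rank."""
    rescored = [
        (hyp, acoustic + weight * semantic_prior.get(hyp, -10.0))
        for hyp, acoustic in nbest
    ]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

# Example: "flight 815" is more plausible given the caller's context,
# so it overtakes the acoustically better-scoring "flight 850".
nbest = [("flight 850", -4.0), ("flight 815", -4.2)]
prior = {"flight 815": 0.0, "flight 850": -2.0}
best = rescore_nbest(nbest, prior)[0][0]  # "flight 815"
```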

9:30, SPEECH-P5.3
A MATCHING ALGORITHM BETWEEN ARBITRARY SECTIONS OF TWO SPEECH DATA SETS FOR SPEECH RETRIEVAL
Y. ITOH
This paper proposes a new matching algorithm, called Shift Continuous DP (Shift CDP), for retrieving speech information from a speech database by a spoken query with continuous input. Shift CDP extracts similar sections between two speech data sets, which serve as the reference pattern (regarded as the speech database) and the input speech, respectively. Shift CDP applies CDP to constant-length unit reference patterns and provides a fast match between arbitrary sections of the reference pattern and the input speech. The algorithm allows endless input and real-time responses to the spoken query. Experiments on conversational speech showed that Shift CDP successfully detects similar sections between arbitrary sections of the reference speech and arbitrary sections of the input speech. The method can also be applied to other kinds of time-sequence data, such as video.
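
A much-simplified sketch of the underlying matching task: slide a query over a reference sequence and find the best-matching section. Shift CDP does this incrementally with continuous DP and allows temporal warping; the fixed-alignment sliding window below is only meant to convey the idea, and the feature values are invented.

```python
import numpy as np

def best_matching_section(reference, query):
    """Naive sliding-window section matching: slide the query over the
    reference and return the start index of the lowest-distance window.
    (Shift CDP achieves this incrementally, with warping, in real time.)"""
    L = len(query)
    dists = [np.sum((reference[i:i + L] - query) ** 2)
             for i in range(len(reference) - L + 1)]
    return int(np.argmin(dists))

# Toy 1-D "feature" sequences; the similar section starts at index 2.
ref = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0])
qry = np.array([1.0, 2.0, 3.0])
start = best_matching_section(ref, qry)  # 2
```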

9:30, SPEECH-P5.4
ADVANCES IN AUTOMATIC MEETING RECORD CREATION AND ACCESS
F. METZE, A. WAIBEL, M. BETT, K. RIES, T. SCHAAF, T. SCHULTZ, H. SOLTAU, H. YU, K. ZECHNER
Oral communication is transient, but many important decisions, social contracts, and fact findings are first carried out orally, then documented in written form and later retrieved. At Carnegie Mellon University's Interactive Systems Laboratories we have been experimenting with the documentation of meetings. This paper summarizes part of the progress we have made in this test bed, specifically on automatic transcription using LVCSR, information access using non-keyword-based methods, summarization, and user interfaces. The system can automatically construct a searchable and browsable audiovisual database of meetings and provide access to these records.

9:30, SPEECH-P5.5
EXPERIMENTS ON SPEECH TRACKING IN AUDIO DOCUMENTS USING GAUSSIAN MIXTURE MODELING
M. SECK, I. MAGRIN-CHAGNOLLEAU, F. BIMBOT
This paper deals with the tracking of speech segments in audio documents. We use a cepstral-based acoustic analysis and Gaussian mixture models to represent the training data. Three ways of scoring an audio document based on a frame-level likelihood calculation are proposed and compared. Our experiments are conducted on a database of television programs including news reports, advertisements, and documentaries. The best equal error rate obtained is approximately 12%.
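
One natural frame-level scoring scheme of the kind the paper compares is the average per-frame log-likelihood under a diagonal-covariance GMM. The sketch below is illustrative only (the paper's exact scoring variants and parameters are not reproduced; all values here are made up).

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of cepstral frames under a
    diagonal-covariance Gaussian mixture model (illustrative sketch)."""
    scores = []
    for x in frames:
        # Per-component log-density: log weight + diagonal Gaussian term.
        comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
        scores.append(np.logaddexp.reduce(comp))  # log-sum-exp over mixture
    return float(np.mean(scores))

# Degenerate one-component, one-dimensional example: a standard normal
# evaluated at its mean gives log N(0; 0, 1) = -0.5 * log(2*pi).
frames = np.array([[0.0]])
score = gmm_loglik(frames,
                   weights=np.array([1.0]),
                   means=np.array([[0.0]]),
                   variances=np.array([[1.0]]))
```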

9:30, SPEECH-P5.6
MULTI-SCALE AUDIO INDEXING FOR TRANSLINGUAL SPOKEN DOCUMENT RETRIEVAL
H. WANG, H. MENG, P. SCHONE, B. CHEN, W. LO
MEI (Mandarin-English Information) is an English-Chinese crosslingual spoken document retrieval (CL-SDR) system developed during the Johns Hopkins University Summer Workshop 2000. We integrate speech recognition, machine translation, and information retrieval technologies to perform CL-SDR. MEI advocates a multi-scale paradigm, in which both Chinese words and subwords (characters and syllables) are used in retrieval. Subword units can complement word units in handling the problems of Chinese word tokenization ambiguity, Chinese homophone ambiguity, and out-of-vocabulary words in audio indexing. This paper focuses on multi-scale audio indexing in MEI. Experiments are based on the Topic Detection and Tracking corpora (TDT-2 and TDT-3), where we indexed Voice of America Mandarin news broadcasts by speech recognition at both the word and subword scales. We discuss the development of the MEI syllable recognizer and the representation of spoken documents using overlapping subword n-grams and lattice structures. Results show that augmenting words with subwords is beneficial to CL-SDR performance.
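
The overlapping subword n-gram representation mentioned above can be illustrated in a few lines. This is a generic sketch, not MEI's indexing code; the syllable sequence is an invented example.

```python
def overlapping_ngrams(units, n=2):
    """Index a document by overlapping subword n-grams: every window of
    n consecutive units (here, Mandarin syllables) becomes an index term."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

# Illustrative syllable sequence; overlapping bigrams share a syllable,
# which helps match despite tokenization and homophone ambiguity.
doc = ["bei", "jing", "da", "xue"]
bigrams = overlapping_ngrams(doc)
# [("bei", "jing"), ("jing", "da"), ("da", "xue")]
```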

9:30, SPEECH-P5.7
SIMPLIFYING DESIGN SPECIFICATION FOR AUTOMATIC TRAINING OF ROBUST NATURAL LANGUAGE CALL ROUTER
H. KUO, C. LEE
We study techniques that allow us to relax constraints imposed by expert knowledge in task specifications for natural language call router design. We intend to fully automate training of the routing matrix while maintaining the same level of performance (over 90% accuracy) as an optimized system. Two specific issues are investigated: reducing matrix size by removing word pairs and triplets from key term definitions, leaving only single-word terms; and increasing matrix size by removing the need to define stop words and perform stop-word filtering. Since such simplification causes performance degradation, discriminative training of the routing matrix parameters becomes essential. Our experiments show that the performance degradation caused by relaxing design constraints can be entirely compensated for by minimum classification error (MCE) training, even with the above two simplifications. We believe the procedure is applicable to algorithms addressing a broad range of speech understanding, topic identification, and information retrieval problems.
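
At its core, routing with a routing matrix means scoring a caller's term vector against each destination's column. The sketch below shows this with a cosine-style score; the vocabulary, matrix values, and destinations are invented, and the paper's discriminative (MCE) training of the matrix is not reproduced here.

```python
import numpy as np

def route(query_terms, vocab, routing_matrix):
    """Route a request to the destination whose routing-matrix column
    best matches the query's term-count vector (cosine-style score).
    Vocabulary and matrix are illustrative placeholders."""
    v = np.array([query_terms.count(t) for t in vocab], dtype=float)
    sims = (routing_matrix.T @ v) / (
        np.linalg.norm(routing_matrix, axis=0) * np.linalg.norm(v) + 1e-9)
    return int(np.argmax(sims))

# Single-word key terms only, as in the simplified design; columns are
# destination 0 ("account balance") and destination 1 ("loans").
vocab = ["balance", "loan", "card"]
R = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.2]])
dest = route(["loan", "please"], vocab, R)  # 1
```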

9:30, SPEECH-P5.8
SPEECH-TO-SPEECH TRANSLATION BASED ON FINITE-STATE TRANSDUCERS
F. CASACUBERTA, D. LLORENS, C. MARTÍNEZ, S. MOLAU, F. NEVADO, H. NEY, M. PASTOR, D. PICÓ, A. SANCHIS, E. VIDAL, J. VILAR
Nowadays, the most successful speech recognition systems are based on stochastic finite-state networks (hidden Markov models and n-grams). Speech translation can be accomplished in much the same way as speech recognition. Stochastic finite-state transducers, a particular class of stochastic finite-state networks, have proved well suited to translation modeling. In this work a speech-to-speech translation system, the EuTrans system, is presented. The acoustic, language, and translation models are finite-state networks that are automatically learnt from training samples. The system was assessed in a series of translation experiments from Spanish to English and from Italian to English in an application involving the telephone interaction of a customer with a receptionist at a hotel front desk.
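
A toy example of translation with a finite-state transducer may help: transitions are labeled with an input word and an output word sequence, and output can be delayed to handle word-order differences. This minimal deterministic sketch is not the EuTrans system (which uses stochastic transducers learnt from data); the tiny Spanish-English lexicon is illustrative.

```python
# Minimal deterministic finite-state transducer sketch: transitions map
# (state, input word) -> (next state, output words). Delaying output on
# "habitacion" lets the adjective "doble" be emitted before the noun.

def fst_translate(words, transitions, finals, start=0):
    """Follow transitions for each input word, accumulating output;
    return the translation if the input is accepted, else None."""
    state, output = start, []
    for w in words:
        key = (state, w)
        if key not in transitions:
            return None  # input not accepted by the transducer
        state, out = transitions[key]
        output.extend(out)
    return " ".join(output) if state in finals else None

T = {
    (0, "una"): (1, ["a"]),
    (1, "habitacion"): (2, []),            # delay the noun
    (2, "doble"): (3, ["double", "room"]),  # emit adjective + noun
}
english = fst_translate(["una", "habitacion", "doble"], T, finals={3})
# "a double room"
```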