Session: SPEC-L1
Time: 3:30 - 5:30, Tuesday, May 8, 2001
Location: Room 150
Title: Conversational Systems for Pervasive Computing
Chair: Ponani S. Gopalakrishnan

3:30, SPEC-L1.1
THE IBM PERSONAL SPEECH ASSISTANT
L. COMERFORD, D. FRANK, P. GOPALAKRISHNAN, R. GOPINATH, J. SEDIVY
In this paper, we describe the technology behind, and our experience with, an experimental personal information manager that interacts with the user primarily, but not exclusively, through speech recognition and synthesis. This device, which controls a client PDA, is known as the Personal Speech Assistant (PSA). The PSA contains complete speech recognition, speech synthesis and dialog management systems. Packaged in a hand-sized enclosure whose size and physical design mate with the popular Palm III personal digital assistant, the PSA includes its own battery, microphone, speaker, audio input and output amplifiers, processor and memory. The PSA supports speaker-independent English speech recognition using a 500-word vocabulary, and English speech synthesis on an arbitrary vocabulary. We survey the technical issues we encountered in building the hardware and software for this device, and the solutions we implemented, including audio system design, power and space budgets, speech recognition in adverse acoustic environments with constrained processing resources, dialog management, appealing applications, and overall system architecture.

3:50, SPEC-L1.2
SPEAKER- AND LANGUAGE-INDEPENDENT SPEECH RECOGNITION IN MOBILE COMMUNICATION SYSTEMS
O. VIIKKI, I. KISS, J. TIAN
In this paper, we investigate the technical challenges faced when moving from speaker-dependent to speaker-independent speech recognition technology in mobile communication devices. Owing to globalization and the international nature of the markets and future applications, speaker independence also implies the development and use of language-independent ASR to avoid logistic difficulties. We propose here an architecture for embedded multilingual speech recognition systems. Multilingual acoustic modeling, automatic language identification, and on-line pronunciation modeling are the key features that enable the creation of truly language- and speaker-independent ASR applications with dynamic vocabularies and sparse implementation resources. Our experimental results confirm the viability of the proposed architecture. While the use of multilingual acoustic models degrades the recognition rates only marginally, a recognition accuracy decrease of approximately 4% is observed due to sub-optimal on-line text-to-phoneme mapping and automatic language identification. This performance loss can nevertheless be compensated for by applying acoustic model adaptation techniques.
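
A minimal sketch of the kind of front end this abstract describes, assuming nothing about the authors' actual implementation: an automatic language-identification step selects an on-line text-to-phoneme mapping, and the resulting pronunciations populate a dynamic vocabulary that a decoder with multilingual acoustic models could consume. The rule tables, function names, and toy language detector below are illustrative placeholders, not the paper's models.

# Hypothetical illustration of language-independent vocabulary handling.
TOY_T2P_RULES = {
    # grossly simplified letter-to-phoneme tables, one per language
    "en": {"a": "AE", "e": "EH", "i": "IH", "o": "OW", "u": "UH"},
    "fi": {"a": "a", "e": "e", "i": "i", "o": "o", "u": "u"},
}

def identify_language(word: str) -> str:
    """Placeholder language ID; real systems use statistical models over letters or phones."""
    return "fi" if any(c in "äöy" for c in word) else "en"

def text_to_phonemes(word: str, lang: str) -> list:
    """On-line pronunciation modeling: map letters to (multilingual) phone symbols."""
    rules = TOY_T2P_RULES[lang]
    return [rules.get(c, c.upper()) for c in word.lower()]

def build_dynamic_vocabulary(words):
    """Generate pronunciations for a user-defined vocabulary at run time."""
    vocab = {}
    for w in words:
        lang = identify_language(w)
        vocab[w] = text_to_phonemes(w, lang)
    return vocab

if __name__ == "__main__":
    # e.g. a user-edited phonebook with mixed-language entries
    print(build_dynamic_vocabulary(["radio", "äiti", "phone"]))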

4:10, SPEC-L1.3
MIPAD: A MULTIMODAL INTERACTION PROTOTYPE
X. HUANG, A. ACERO, C. CHELBA, L. DENG, J. DROPPO, D. DUCHENE, J. GOODMAN, H. HON, D. JACOBY, L. JIANG, R. LOYND, M. MAHAJAN, P. MAU, S. MEREDITH, S. MUGHAL, S. NETO, M. PLUMPE, K. STEURY, G. VENOLIA, K. WANG, Y. WANG
Dr. Who is a Microsoft research project aimed at creating a speech-centric multimodal interaction framework, which serves as the foundation for the .NET natural user interface. MiPad is the application prototype that demonstrates compelling user advantages for wireless Personal Digital Assistant (PDA) devices. MiPad fully integrates continuous speech recognition (CSR) and spoken language understanding (SLU) to enable users to accomplish many common tasks using a multimodal interface and wireless technologies. It tries to solve the problem of pecking with tiny styluses or typing on minuscule keyboards in today's PDAs. Unlike a cellular phone, MiPad avoids speech-only interaction. It incorporates a built-in microphone that activates whenever a field is selected. As a user taps the screen or uses a built-in roller to navigate, the tapping action narrows the number of possible instructions for spoken language understanding. MiPad currently runs on a Windows CE Pocket PC, with speech recognition performed on a Windows 2000 machine. The Dr. Who CSR engine uses a unified CFG and n-gram language model. The Dr. Who SLU engine is based on a robust chart parser and a plan-based dialog manager. This paper discusses MiPad's design, implementation work in progress, and a preliminary user study in comparison to the existing pen-based PDA interface.
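
To make the "tap narrows the interpretation space" idea concrete, here is a small sketch under my own assumptions, not Microsoft's implementation: selecting a form field restricts the grammar applied to the next utterance, so the understanding component only has to resolve field-relevant words. The field names and grammars are invented for the example.

# Hypothetical tap-and-talk constraint: the tapped field selects a grammar.
FIELD_GRAMMARS = {
    # each UI field maps to the word set (or grammar fragment) it accepts
    "to": {"alex", "ciprian", "li", "kuansan"},
    "subject": None,  # free-form field: fall back to the general n-gram model
    "date": {"today", "tomorrow", "monday", "tuesday"},
}

def interpret(field: str, recognized_words: list) -> dict:
    """Resolve a recognition result against the grammar of the tapped field."""
    grammar = FIELD_GRAMMARS.get(field)
    if grammar is None:
        # unconstrained field: accept the dictation result as-is
        return {field: " ".join(recognized_words)}
    hits = [w for w in recognized_words if w in grammar]
    return {field: hits if hits else None}  # None -> prompt the user to repeat

if __name__ == "__main__":
    print(interpret("to", ["alex", "and", "li"]))      # constrained field
    print(interpret("subject", ["budget", "review"]))  # free-form field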

4:30, SPEC-L1.4
UBIQUITOUS SPEECH PROCESSING
S. FURUI, K. IWANO, C. HORI, T. SHINOZAKI, Y. SAITO, S. TAMURA
In the ubiquitous (pervasive) computing era, it is expected that everybody will access information services anytime, anywhere, and these services are expected to augment various human intelligent activities. Speech recognition technology can play an important role in this era by providing: (a) conversational systems for accessing information services and (b) systems for transcribing, understanding and summarizing ubiquitous speech documents such as meetings, lectures, presentations and voicemails. In the former systems, robust conversation using wireless handheld/hands-free devices in real mobile computing environments will be crucial, as will multimodal speech recognition technology. For the latter systems, the ability to understand and summarize speech documents is one of the key requirements. This paper presents technological perspectives and introduces several research activities being conducted from these standpoints in our research group.

4:50, SPEC-L1.5
ON THE IMPLEMENTATION OF ASR ALGORITHMS FOR HAND-HELD WIRELESS MOBILE DEVICES
R. ROSE, S. PARTHASARATHY, B. GAJIC, A. ROSENBERG, S. NARAYANAN
This paper is concerned with the implementation of automatic speech recognition (ASR) based services on wireless mobile devices. Techniques are investigated for improving the performance of ASR systems in the context of the devices themselves, the environments in which they are used, and the networks to which they are connected. A set of ASR tasks and ASR system architectures applicable to a wide range of simple mobile devices is presented. A prototype ASR-based service is defined and the implementation of the service on a wireless mobile device is described. A database of speech utterances was collected from a population of fifty users interacting with this prototype service in multiple environments. An experimental study was performed in which model compensation procedures for improving acoustic robustness and lattice rescoring procedures for reducing task perplexity were evaluated on this speech corpus.
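
The rescoring idea mentioned above can be illustrated with a toy sketch; this is my own reconstruction (using an N-best list as a stand-in for a lattice), not the authors' code or scores: hypotheses produced with a broad first-pass model are re-ranked with a more constrained, task-specific language model, which reduces the effective perplexity of the task.

# Hypothetical N-best rescoring with a task-specific language model.
# First-pass output: (hypothesis, acoustic log-likelihood) -- toy values.
NBEST = [
    ("call john at home", -120.0),
    ("call joan at home", -121.5),
    ("hall john that hum", -119.8),
]

# Toy task LM: unigram log-probabilities of in-domain words.
TASK_LM = {"call": -1.0, "john": -2.0, "joan": -3.5, "at": -1.2, "home": -1.5}
OOV_LOGPROB = -8.0  # penalty for words the task LM does not cover

def lm_score(hyp: str) -> float:
    """Sum of task-LM log-probabilities over the hypothesis words."""
    return sum(TASK_LM.get(w, OOV_LOGPROB) for w in hyp.split())

def rescore(nbest, lm_weight: float = 10.0):
    """Combine acoustic and task-LM scores and return hypotheses best-first."""
    return sorted(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]), reverse=True)

if __name__ == "__main__":
    for hyp, _ in rescore(NBEST):
        print(hyp)  # the in-domain hypothesis wins after rescoring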

5:10, SPEC-L1.6
ACOUSTIC SYNTHESIS OF TRAINING DATA FOR SPEECH RECOGNITION IN LIVING ROOM ENVIRONMENTS
V. STAHL, A. FISCHER, R. BIPPUS
Despite continuous progress in robust automatic speech recognition in recent years, acoustic mismatch between training and test conditions is still a major problem. Consequently, large speech data collections must be conducted in many environments. An alternative approach is to generate training data synthetically by filtering clean speech with impulse responses and/or adding noise signals from the target domain. We compare the performance of a speech recognizer trained on recorded speech in the target domain with a system trained on suitably transformed clean speech. In order to obtain comparable results, our experiments are based on two-channel recordings with a close-talk and a distant microphone, which produce the clean signal and the target-domain signal, respectively. By filtering and adding noise we obtain error rates which are only 10% higher for natural number recognition and 30% higher for a command recognition task compared to training with target-domain data.
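
A minimal sketch of the data-synthesis step this abstract describes, written under my own assumptions rather than taken from the paper: clean close-talk speech is convolved with a room impulse response and mixed with a noise signal from the target domain at a chosen SNR, approximating the distant-microphone channel. Signal lengths, the synthetic impulse response, and the SNR value below are placeholders.

# Hypothetical synthesis of "living room" training data from clean speech.
import numpy as np

def synthesize_training_utterance(clean, room_ir, noise, snr_db=10.0):
    """Filter clean speech with an impulse response and add noise at snr_db."""
    reverberant = np.convolve(clean, room_ir)[: len(clean)]
    noise = np.resize(noise, len(reverberant))  # loop/trim noise to length
    speech_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return reverberant + gain * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)    # stand-in for 1 s of close-talk speech
    room_ir = np.exp(-np.arange(2000) / 300.0) * rng.standard_normal(2000)
    noise = rng.standard_normal(16000)    # stand-in for living-room noise
    noisy = synthesize_training_utterance(clean, room_ir, noise, snr_db=10.0)
    print(noisy.shape)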