Teruhisa Misu and Tatsuya Kawahara (Kyoto Univ., Japan)
We propose an interactive framework for information navigation based on a document knowledge base. In conventional audio guidance systems, such as those deployed in museums, the information flow is one-way and the content is fixed. To make the guidance interactive, we prepare two modes, a user-initiative retrieval/QA mode (pull mode) and a system-initiative recommendation mode (push mode), and switch between them according to the user's state. In the user-initiative retrieval/QA mode, the user can ask questions about specific facts in the documents in addition to making general queries. In the system-initiative recommendation mode, the system actively provides information that the user is likely to be interested in. We have implemented a navigation system covering Kyoto city information.
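As a rough illustration of the pull/push switching idea, the following minimal Python sketch alternates between a user-initiative retrieval step and a system-initiative recommendation step based on a crude model of the user's state; the state features, scoring, and document snippets are purely hypothetical and are not the implemented system.

    import random

    def classify_user_state(user_utterance, silence_sec):
        """Return 'engaged' if the user said something, 'idle' after a long pause."""
        if user_utterance.strip():
            return "engaged"
        return "idle" if silence_sec > 5.0 else "waiting"

    def pull_mode(query, documents):
        """User-initiative retrieval/QA: return the document that best matches the query."""
        scored = [(sum(w in doc.lower() for w in query.lower().split()), doc)
                  for doc in documents]
        return max(scored)[1]

    def push_mode(documents, already_presented):
        """System-initiative recommendation: proactively offer content not yet presented."""
        candidates = [d for d in documents if d not in already_presented]
        return random.choice(candidates) if candidates else None

    documents = ["Kinkaku-ji is a Zen temple covered in gold leaf.",
                 "Kiyomizu-dera offers a wide view over Kyoto."]
    presented = set()
    state = classify_user_state("", silence_sec=7.0)
    if state == "idle":
        print("push:", push_mode(documents, presented))
    else:
        print("pull:", pull_mode("gold temple", documents))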
Toru Imai, Akio Kobayashi, Shoei Sato, Shinichi Homma, Takahiro Oku, and Tohru Takagi (NHK, Japan)
NHK has been operating Japanese large-vocabulary continuous speech recognition systems since 2000 for closed-captioning some of its news, sports, and other live TV programs to help hearing-impaired and elderly viewers. The first implementation was for news programs, where anchorpersons' read speech in a studio was recognized directly and any recognition errors were promptly corrected by operators using touch panels and keyboards. The second is a so-called "re-speak" method, in which another speaker listens to the original speech of the program and rephrases the commentary so that it can be recognized; this is used for captioning live programs such as music shows, baseball games, the Grand Sumo Tournaments, the Olympic Games, and World Cup football games. The captioning systems have been in daily operation, and we have received a large number of positive responses about them from hearing-impaired viewers. In the demonstration session, we will recognize a sports commentary for a Major League Baseball (MLB) game directly from the broadcasting studio using a laptop PC, and synchronously show the actual broadcast with its error-corrected closed captions on the video. Notable technical features of the demonstration are our very-low-latency decoder, which is suitable for real-time captioning, an acoustic model adapted to the commentator, a language model adapted to the MLB domain, and closed captions produced after immediate manual error correction.
Hoon Chung, Jeon Gue Park, Yun Keun Lee, and Ikjoo Chung (ETRI, Korea)
We have developed a speech recognition system that recognizes hundreds of thousands of item names on a resource-limited embedded device without serious degradation of N-best accuracy. To implement such an efficient system, we use subspace distribution clustering hidden Markov model (SDCHMM) based acoustic models for memory efficiency and propose a multistage fast search scheme. The proposed algorithm is composed of a two-stage HMM-based coarse match and a detailed match. The two-stage coarse match rapidly selects a small set of candidates that is assumed to contain a correct hypothesis with high probability, and the detailed match re-ranks the candidates by acoustic rescoring. In principle, the algorithm shares its architecture with human speech recognition (HSR) and multi-layered frameworks, in that recognition is completed through a three-stage decoding procedure: acoustic-feature-to-phoneme conversion, phoneme-to-word conversion, and word-level rescoring. The contribution of our work is a statistical framework for the first two steps that is optimized to maximize search speed. The proposed system is implemented on an in-car navigation system with a 32-bit fixed-point processor operating at 620 MHz. Experimental results show that the proposed method runs at a maximum of 1.74 times real time on the embedded device while consuming 7.5 MB of working memory for a 220K-entry Korean Point-of-Interest (POI) recognition domain.
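A minimal sketch of the multistage idea follows; the scoring functions below are placeholders standing in for the SDCHMM-based coarse match and the acoustic rescoring, and the tiny lexicon is hypothetical.

    def coarse_score(phone_string, item):
        """Cheap proxy score: shared phone symbols between hypothesis and item name."""
        return len(set(phone_string) & set(item["phones"]))

    def detailed_score(phone_string, item):
        """Expensive proxy score: order-sensitive match (stand-in for acoustic rescoring)."""
        return sum(1 for a, b in zip(phone_string, item["phones"]) if a == b)

    def multistage_search(phone_string, lexicon, n_candidates=100, n_best=5):
        # Stage 1: coarse match over the full item-name lexicon.
        candidates = sorted(lexicon, key=lambda it: coarse_score(phone_string, it),
                            reverse=True)[:n_candidates]
        # Stage 2: detailed rescoring restricted to the surviving candidates.
        rescored = sorted(candidates, key=lambda it: detailed_score(phone_string, it),
                          reverse=True)
        return rescored[:n_best]

    lexicon = [{"name": "Gyeongbokgung", "phones": "kyongbokkung"},
               {"name": "Namsan Tower", "phones": "namsantawo"}]
    print([it["name"] for it in multistage_search("namsantawa", lexicon)])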
Abhishek Chandel, Abhinav Parate, Maymon Madathingal, Himanshu Pant, Nitendra Rajput, Shajith Ikbal, Om Deshmukh, and Ashish Verma (IBM, India)
At IBM's India Research Lab we have developed an interactive web-enabled tool, called Sensei, to evaluate various parameters of spoken English skills, including the articulation of individual phones, the lexical stress pattern of syllables in a word, and spoken grammar. A paper on Sensei has been accepted at this conference [1]. The demonstration will provide a live experience of the three main modules of Sensei: the user interaction and interface module, the speech processing module, and the content and configuration management module. The user interaction and interface module delivers audio data from the server to the user's web browser, transfers the audio recordings from the user to the server, and guides the user through the various stages of the tool. The speech processing module uses the speech recognition engine to recognize the spoken utterance and to obtain phonetic alignments and confidence scores. The phonetic alignments, together with a phone-to-syllable mapping, are used to compute syllable-level prosodic features, which classify the syllables of the spoken word as correct or incorrect. For articulation evaluation, the speech processing module combines the confidence scores obtained during the phonetic alignment into an articulation score for the spoken utterance. The module also computes a combined score for the overall assessment of the user. The content and configuration management module controls the nature and difficulty level of the tool by altering the database used for evaluating the various parameters, changing the time allotted for each evaluation type, changing the number of attempts allowed to record the user's input, and so on. The demonstration will also give the audience an opportunity to use the tool to evaluate their spoken English skills and to receive scores on the individual parameters as well as a combined score. We will also showcase our current efforts to develop a learning component for Sensei that points out the mistakes committed by the user and provides feedback to improve the user's spoken English skills.
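To illustrate the kind of computation involved, the following minimal sketch combines per-phone confidence scores into an utterance-level articulation score and labels syllables from a per-phone prosodic feature; the features, thresholds, and example values are hypothetical and are not Sensei's actual models.

    def articulation_score(phone_confidences):
        """Utterance-level articulation score: mean of per-phone confidence scores."""
        return sum(phone_confidences) / len(phone_confidences)

    def syllable_stress_labels(phone_energy, phone_to_syllable, threshold=0.6):
        """Label each syllable 'correct'/'incorrect' from a per-phone prosodic feature."""
        syllables = {}
        for energy, syl in zip(phone_energy, phone_to_syllable):
            syllables.setdefault(syl, []).append(energy)
        return {syl: ("correct" if max(vals) >= threshold else "incorrect")
                for syl, vals in syllables.items()}

    # Example: the word "record" (noun), phones grouped into two syllables.
    confidences = [0.92, 0.85, 0.78, 0.81, 0.66]
    energies    = [0.80, 0.75, 0.40, 0.35, 0.30]
    phone_to_syllable = ["RE", "RE", "CORD", "CORD", "CORD"]

    print("articulation:", round(articulation_score(confidences), 2))
    print("stress:", syllable_stress_labels(energies, phone_to_syllable))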
[1] A. Chandel et al., "Sensei: Spoken Language Assessment for Call Center Agents," [paper number 1098].
Ian McGraw and Stephanie Seneff (MIT, USA)
Acquiring a second language as an adult is a monumental task regardless of circumstance. Compounding the difficulty is the fact that many language learners do not have the opportunity to speak in meaningful contexts outside of the classroom. In recent years, MIT's Spoken Language Systems (SLS) group has been developing dialogue systems that address this need by giving language learners a non-threatening environment in which to practice their speech in a one-on-one setting. One drawback of such dialogue systems, however, is that they must cover a narrow domain so that the natural language processing (NLP) and automatic speech recognition (ASR) components have enough constraints to perform robustly. The second language acquisition community, on the other hand, consistently makes the case for learner-centered classrooms, which give the student a significant amount of freedom in choosing the course content. This calls for one of two solutions: 1) develop many different narrow-domain dialogue systems from which students can choose, or 2) give the user the ability to personalize the content loaded into a single system. At ASRU 2007, we would like to demonstrate a speech-enabled card game that takes a step in the direction of the second solution. Using the same technology that underlies popular online applications such as Gmail, we have constructed a web site where a student of Mandarin Chinese can easily create and save a deck of image-based flash cards from within an ordinary Internet browser. Subsequently, the student can load these flash cards into "Word-War," a simple card-game environment built directly into the site. A speech-based system is automatically configured that allows the user to manipulate the cards entirely through spoken commands uttered in Mandarin. "Word-War" makes two additional contributions to the community interested in ASR for second language learners. First, our framework supports the ability to recognize, understand, and react to partial utterances in real time. Using this feature, "Word-War" provides immediate visual feedback, allowing users to check that their utterances are understood while they are still speaking, side-stepping the issues associated with verbal barge-in. Second, an engaging multi-player mode connects two students via the web in a head-to-head, vocabulary-building "battle of words." These three features, personalization, real-time visual feedback, and multi-party interaction, are combined into a single web-based application that we believe presents a particularly compelling use of speech technology in education [1].
[1] Short video demonstration available at http://people.csail.mit.edu/imcgraw/cardsdemo
Donghyeon Lee, Jonghoon Lee, Gary Geunbae Lee (POSTECH, Korea)
In this demonstration, we present POSSLT (POSTECH Spoken Language Translation), a Korean-English statistical spoken language translation (SLT) system based on pseudo-morpheme units and confusion networks (CNs). As in most SLT systems, automatic speech recognition (ASR) and machine translation (MT) are coupled in a cascading manner. We couple ASR and MT through a confusion network, which gives better translation quality and faster decoding than an N-best approach. For Korean ASR and statistical MT, the choice of processing unit affects performance, and the pseudo-morpheme is the best choice for Korean-English SLT. The models used in the SLT system are trained on a travel-domain conversational corpus.
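A minimal sketch of a confusion-network interface between ASR and MT follows; the pseudo-morpheme strings, posteriors, and toy translation table are purely illustrative, and real word reordering is left to the statistical MT decoder.

    # Each CN slot holds alternative pseudo-morphemes with posterior probabilities,
    # so the MT stage can consider alternatives instead of a single 1-best string.
    confusion_network = [
        [("na-neun", 0.7), ("nan", 0.3)],          # "I"
        [("hakgyo-e", 0.6), ("hakgyo-ae", 0.4)],   # "to school"
        [("ga-n-da", 0.9), ("kan-da", 0.1)],       # "go"
    ]

    translate = {"na-neun": "I", "nan": "I",
                 "hakgyo-e": "to school", "hakgyo-ae": "to school",
                 "ga-n-da": "go", "kan-da": "go"}

    def best_path(cn):
        """1-best pseudo-morpheme sequence from the CN."""
        return [max(slot, key=lambda wp: wp[1])[0] for slot in cn]

    def translate_cn(cn):
        """Translate slot by slot, preferring alternatives the MT table covers."""
        out = []
        for slot in cn:
            known = [(w, p) for w, p in slot if w in translate]
            word = max(known or slot, key=lambda wp: wp[1])[0]
            out.append(translate.get(word, word))
        return " ".join(out)

    print(best_path(confusion_network))
    print(translate_cn(confusion_network))  # word order is left to the SMT decoder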
Masahiro Araki (Kyoto Inst. Tech., Japan)
This demo shows Vrails, our system for very rapid prototyping of voice-enabled Web applications. It is based on Grails, a Web application framework in the Rails family. Rails frameworks follow the MVC (Model-View-Controller) model for interactive system development, which clearly separates the application logic (model) from the user interface (view), mediated by a controller. In contrast to ordinary state-based prototyping tools for spoken dialogue systems, such as the CSLU toolkit, Vrails starts from a definition of the data structure and then generates all of the remaining components automatically. The controller and model parts are generated following the 'Convention over Configuration' strategy, which means that basic operations such as creating, reading, updating, and deleting data are already prepared, and the burdensome definition of bindings between scripting-language objects and database records can be omitted. Our contributions to this Rails framework are (1) automatically adding a voice interaction part to the view files (as XHTML+Voice), (2) generating grammar definitions from the data definition, and (3) adding more system-directed interaction patterns so that the framework can be applied to mobile devices.
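A minimal sketch of contribution (2), deriving a simple speech grammar from a data definition, is shown below; the domain class, field types, and ABNF-style output format are illustrative assumptions, not Vrails' actual conventions.

    domain_class = {
        "name": "Book",
        "fields": {"title": "string", "year": "integer",
                   "genre": ["novel", "essay", "poetry"]},
    }

    def field_rule(name, ftype):
        if isinstance(ftype, list):                    # enumerated field -> alternatives
            return f"${name} = {' | '.join(ftype)};"
        if ftype == "integer":
            return f"${name} = $DIGITS;"               # delegate to a built-in digits rule
        return f"${name} = $GARBAGE;"                  # free text -> dictation/garbage rule

    def generate_grammar(cls):
        rules = [field_rule(n, t) for n, t in cls["fields"].items()]
        slots = " ".join(f"${n}" for n in cls["fields"])
        rules.append(f"$create_{cls['name'].lower()} = create {cls['name'].lower()} {slots};")
        return "\n".join(rules)

    print(generate_grammar(domain_class))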
Yasuhiro Minami, Minako Sawaki, Ryuichiro Higashinaka, Kohji Dohsaka, Takeshi Yamada, Tatsushi Matsubayashi, Hideki Isozaki, and Eisaku Maeda (NTT, Japan)
Our new research project, called "ambient intelligence," concentrates on the creation of new lifestyles through research on communication science and intelligence integration. It is premised on the creation of virtual communication partners, such as fairies and goblins, that can serve constantly at our side. We call these virtual communication partners mushrooms. To show the essence of ambient intelligence, we demonstrate a multimodal system: a quizmaster mushroom. Its purpose is to transmit knowledge from the system to users while they play a quiz game with it. The system can conduct a "who is this" quiz on people selected from the Internet, and it works in real time using speech, dialogue, and vision technologies [1].
[1] Y. Minami et al., "The World of Mushrooms: Human-Computer Interaction Prototype Systems for Ambient Intelligence," in Proc. ICMI 2007, Nagoya, 2007 (to appear).
Tohru Shimizu, Yutaka Ashikari, Eiichiro Sumita, Satoshi Nakamura (ATR, Japan)
In this demo, we introduce recent progress on the NICT-ATR speech-to-speech translation system. Corpus-based approaches to recognition, translation, and synthesis enable coverage of a wide variety of topics and portability to other languages. In this system, the basic component modules for Japanese, English, and Chinese are implemented in the terminal, and the system also has an interface for accessing other speech-to-speech translation resources (e.g., component modules for other language pairs) located on the Internet. The system is organized around a module manager that has access to speech recognition (ASR), machine translation (MT), speech synthesis (SS), and user interface (UI) modules. The module manager controls the flow of information, comprising speech data, recognized or translated text, and system messages, between the component modules. This architecture and its event-based processing make it easy to extend the configuration to additional source and target languages, tasks, and domains. To connect internal and external speech-to-speech translation resources (e.g., ASR, MT, and SS servers for other languages and language pairs) over the Internet, we have defined a first draft of a Speech Translation Markup Language (STML) and implemented web services that use it. The speech-to-speech translation system is designed for use on mobile terminals; the size of the PC is W150 mm x D32 mm x H95 mm. A unidirectional microphone is used so that the system can be operated in noisy environments. Speech-to-speech translation can be performed for any combination of Japanese, English, and Chinese. Because the entire speech-to-speech translation function is implemented in a single terminal, the system provides a real-time, location-free speech-to-speech translation service.
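A minimal sketch of an event-driven module manager that routes data between ASR, MT, and SS components is given below; the component functions are stubs and the message fields are hypothetical, not the actual NICT-ATR modules or STML messages.

    import queue

    def asr(audio):       return {"type": "recognized", "text": "konnichiwa"}
    def mt(text, tgt):    return {"type": "translated", "text": "hello", "lang": tgt}
    def ss(text, lang):   return {"type": "synthesized", "audio": f"<{lang} waveform for '{text}'>"}

    def module_manager(audio, target_lang="en"):
        """Route events between components until synthesized output is produced."""
        events = queue.Queue()
        events.put({"type": "audio_in", "audio": audio})
        while not events.empty():
            ev = events.get()
            if ev["type"] == "audio_in":
                events.put(asr(ev["audio"]))
            elif ev["type"] == "recognized":
                events.put(mt(ev["text"], target_lang))
            elif ev["type"] == "translated":
                events.put(ss(ev["text"], ev["lang"]))
            elif ev["type"] == "synthesized":
                return ev["audio"]

    print(module_manager(b"raw-japanese-speech"))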
Ye-Yi Wang (Microsoft, USA)
Speech applications need grammars for language modeling and spoken language understanding. In industrial applications, context-free grammars are often used. The W3C has recommended the Speech Recognition Grammar Specification (SRGS) as a grammar standard, and it is supported by many speech recognizers. However, creating a customized grammar in SRGS is still a challenging task for many speech application developers: they have to become familiar with the grammar specification, anticipate the expressions that users may use to refer to a concept, and script the semantic interpretation tags that map users' utterances to the corresponding canonical semantic representations. Because of these difficulties, many developers choose to use generic library grammars instead of creating customized ones, which leads to high perplexity and hence high recognition error rates. SGStudioWA (Semantic Grammar Studio) is a web application that helps speech application developers rapidly create SRGS semantic grammars customized for their applications. It takes as input high-level specifications, such as regular expressions for alphanumeric concepts or cardinal/ordinal numbers in a range, and automatically generates grammars with appropriate semantic interpretations. In addition, it can build recognition grammars from user-provided examples, and these example-based grammars are robust enough to cover unseen expressions.
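A minimal sketch of generating an SRGS-XML rule with semantic interpretation tags from a high-level specification (here, a small cardinal-number range) follows; it only illustrates the kind of output such a tool produces and is not SGStudioWA's code.

    WORDS = ["one", "two", "three", "four", "five", "six",
             "seven", "eight", "nine", "ten", "eleven", "twelve"]

    def cardinal_rule(rule_id, lo, hi):
        """Build an SRGS <rule> whose <tag> elements return the numeric value."""
        items = "\n".join(
            f'      <item>{WORDS[n - 1]}<tag>out = {n};</tag></item>'
            for n in range(lo, hi + 1))
        return (f'<rule id="{rule_id}" scope="public">\n'
                f'    <one-of>\n{items}\n    </one-of>\n</rule>')

    print(cardinal_rule("month_number", 1, 12))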
Tim Paek, Yun-Cheng Ju, Christopher Meek (Microsoft, USA)
Automated Directory Assistance (ADA) allows users to request the telephone number or address of residential and business listings using speech recognition [1]. Because the caller usually does not know the exact name of the listing, ADA systems require transcriptions of alternative phrasings for directory listings as training data for building language models. Unfortunately, such data can be very costly and time-consuming to acquire. Since the introduction of The ESP Game [2], researchers have sought to use games to tackle machine learning problems such as image classification and paraphrasing journalistic sentences for machine translation.
In this demo, we introduce two computer games, one text-based (People Watcher [3]) and the other telephony-based (Marketeur), that elicit transcribed alternative user phrasings for directory listings while entertaining players. Both games are framed as marketing games that test people's ability to "spot social trends" by having them identify who they think would be likely customers of various businesses. In the demo, we describe how the two games work, their user interface design, the technical challenges, and the potential applications of the data collected, and we summarize how the data collected from these games has so far helped to improve ADA performance.
[1] E. Levin and A. M. Mane (2005), "Voice User Interface Design for Automated Directory Assistance," in Proc. Interspeech, pp. 2509-2512.
[2] L. von Ahn and L. Dabbish (2004), "Labeling Images with a Computer Game," in Proc. CHI, pp. 319-326.
[3] T. Paek, Y.-C. Ju, and C. Meek (2007), "People Watcher: A Game for Eliciting Human-Transcribed Data for Automated Directory Assistance," in Proc. Interspeech.