Abstract: Novel techniques in speech recognition are often hampered by the long road that must be followed to turn them into fully functional systems capable of competing with the state-of-the-art. In this work, we explore the use of Segmental Conditional Random Fields (SCRFs) as an integrating technology which can augment the best conventional systems with information from novel scientific approaches. We begin by describing the methodology and its relationship to other methods such as augmented statistical models and structured SVMs. We then illustrate the approach with work done at Microsoft and Johns Hopkins University, in which we find that the SCRF framework is able to appropriately weight different information sources, as varied as phoneme detections and template matching scores, to produce significant gains on Broadcast News, Wall Street Journal, and Voice Search tasks. The talk concludes with a discussion of the research challenges associated with SCRFs.
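As a sketch of the general segmental CRF formulation this line of work builds on (the notation here is illustrative, assumed by this summary rather than taken from the talk): the model scores a label sequence l by summing over latent segmentations q of the observation stream o, with segment-level feature functions f that can absorb heterogeneous detector outputs,

\[
p(\mathbf{l}\mid\mathbf{o}) \;=\; \frac{\displaystyle\sum_{\mathbf{q}} \exp\Big(\sum_{e\in\mathbf{q}} \boldsymbol{\lambda}^{\top}\mathbf{f}(l_{e-1}, l_{e}, o(e))\Big)}{\displaystyle\sum_{\mathbf{l}'}\sum_{\mathbf{q}'} \exp\Big(\sum_{e\in\mathbf{q}'} \boldsymbol{\lambda}^{\top}\mathbf{f}(l'_{e-1}, l'_{e}, o(e))\Big)}
\]

where o(e) is the span of observations assigned to segment e. Diverse information sources (phoneme detections, template-matching scores, baseline system scores) enter simply as additional dimensions of f, and training learns the weights λ that combine them.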
Abstract: The robustness of speech recognition systems to acoustic variability is a key factor in their success or failure. This variability can arise from multiple sources, often simultaneously, including environmental noise, reverberation, speaker, and bandwidth. In this talk, we will discuss techniques that can be used to mitigate such variability and reduce the mismatch between the speech observed at runtime and the recognizer's acoustic models. We will compare and contrast front-end methods that enhance the signal or features with model-domain methods that adapt the HMM parameters. While most algorithms target a particular source of variability, we will also introduce methods that jointly compensate for multiple sources of mismatch. Although robustness algorithms are often evaluated using a recognizer trained on clean speech, most large-scale commercial systems are built from data collected in the field from real users. We will show how the described robustness techniques can be incorporated into training to reduce the unwanted variability in such data and create more accurate systems. Finally, we will look at the role of robustness algorithms in commercial applications such as in-car infotainment systems and voice search on smartphones, and discuss open challenges that have yet to be addressed.
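As a minimal illustration of the front-end family of methods mentioned above (a sketch of my own, not an algorithm from the talk): per-utterance cepstral mean and variance normalization is one of the simplest ways to reduce mismatch between observed features and the acoustic model, removing stationary channel effects and some speaker variability.

import numpy as np

def cmvn(features, eps=1e-8):
    # features: (num_frames, num_coeffs) array of cepstral features, e.g. MFCCs.
    # Normalize each coefficient to zero mean and unit variance over the
    # utterance; eps guards against division by zero for constant dimensions.
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Example: normalize a random stand-in for a 13-dimensional MFCC sequence.
utterance = np.random.randn(300, 13) * 2.0 + 5.0
normalized = cmvn(utterance)

Model-domain methods, by contrast, would leave the features alone and transform the HMM parameters instead; the talk compares the two families directly.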
Abstract: Since 2004, the IWSLT evaluation campaigns have offered challenging research tasks and an open experimental infrastructure to the scientific community working on spoken language translation. This year's evaluation focused on the translation of TED talks, a collection of public speeches covering a large variety of topics. This task presents several interesting research issues for speech recognition and speech translation technology, such as open-domain ASR and MT, clean transcription of spontaneous speech, talk-style and topic adaptation, and speech translation evaluation, to mention just a few. In my talk I will survey the main outcomes of this exercise, which collected results from 15 research teams from around the world. Finally, I will discuss future developments around spoken language translation that will be investigated under the IWSLT umbrella and also within a large EU-funded project to be launched in early 2012.
Abstract: Speech synthesis is often regarded as a messy problem. This talk will discuss how we can formulate the problem of speech synthesis in a statistical machine learning framework. The basic problem of speech synthesis can be stated as follows:
We have a speech database, i.e., a set of speech waveforms and corresponding texts. Given a text to be synthesized, what is the speech waveform corresponding to the text?
The whole text-to-speech generation process can be decomposed into feasible subproblems, which can also be combined into a single statistical model for training. One of the subproblems is statistical parametric speech synthesis, which is called "HMM-based speech synthesis" when we use hidden Markov models (HMMs) as the statistical models. The talk will also discuss future challenges and directions in speech synthesis research.
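A common way to write down one such decomposition (a sketch of the standard statistical parametric formulation; the notation is assumed here, not quoted from the talk): given training speech parameters 𝒪 extracted from the waveforms, their texts 𝒲, and a text w to be synthesized, training and generation split into

\[
\hat{\lambda} \;=\; \arg\max_{\lambda}\, p(\mathcal{O}\mid \mathcal{W}, \lambda)
\quad\text{(training)},
\qquad
\hat{\mathbf{o}} \;=\; \arg\max_{\mathbf{o}}\, p(\mathbf{o}\mid w, \hat{\lambda})
\quad\text{(generation)},
\]

where λ denotes the parameters of the statistical model (HMMs in HMM-based synthesis) and the generated speech parameters ô are converted back to a waveform by a vocoder.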
Abstract: Multimedia content over the Internet is very attractive, and the spoken part of such content very often carries the core information. It is therefore possible to index, retrieve, or browse multimedia content primarily based on the spoken part. If the spoken content could be transcribed into text with very high accuracy, the problem would naturally reduce to text information retrieval. But the inevitably high recognition error rates for spontaneous speech, including out-of-vocabulary (OOV) words, under a wide variety of acoustic conditions and linguistic contexts make this impossible. One primary approach, among many others, is to consider lattices with multiple hypotheses in order to include more correct recognition results. This talk will briefly review the approaches and directions along this line: not only search over lattices but also approaches beyond it, such as relevance feedback, learning approaches, key term extraction, semantic retrieval, and semantic structuring of spoken content.
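To make the lattice idea concrete, here is a minimal sketch (my own illustration, not taken from the talk) of computing expected term counts from recognition lattice arcs, so that a term appearing only in lower-ranked hypotheses still contributes to the index, weighted by its posterior probability:

from collections import defaultdict

def expected_term_counts(arcs):
    # arcs: iterable of (word, posterior) pairs, where each posterior is the
    # arc posterior probability (e.g. from forward-backward over the lattice).
    # The expected count of a word is the sum of the posteriors of all arcs
    # carrying it, which is what lattice-based indexing typically stores.
    counts = defaultdict(float)
    for word, posterior in arcs:
        counts[word] += posterior
    return dict(counts)

# Toy lattice: "voice" lies on the 1-best path; "invoice" only on an alternative.
arcs = [("voice", 0.7), ("search", 0.9), ("invoice", 0.3), ("search", 0.05)]
print(expected_term_counts(arcs))  # {'voice': 0.7, 'search': 0.95, 'invoice': 0.3}

A 1-best index would miss "invoice" entirely; the lattice-based index retains it with an appropriately low weight, which is the essential advantage of searching over lattices.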
Abstract: The expression and experience of human behavior are complex and multimodal, and are characterized by individual and contextual heterogeneity and variability. Speech and spoken language communication cues offer important means of measuring and modeling human behavior. In fact, observational research and practice across a variety of domains, from commerce to healthcare, rely on speech- and language-based informatics. Consider, for example, the domain of autism, where crucial diagnostic information comes from audiovisual data of verbal and nonverbal behavior. Similar reliance on observed interactions is common across therapeutic settings in mental health. Advances in behavioral signal processing not only enable new possibilities for gathering data in a variety of settings, from laboratories and clinics to free-living conditions, but also promise computational techniques and models to advance evidence-driven theory and practice.
This talk will describe some ongoing efforts in Behavioral Signal Processing, that is, technology and algorithms for quantitatively and objectively understanding typical, atypical, and distressed human behavior, with a specific focus on communicative, affective, and social behavior. Using examples drawn from different domains, the talk will illustrate Behavioral Informatics applications of these processing techniques that contribute to quantifying higher-level, often subjectively described, human behavior in a domain-sensitive fashion. In particular, we will draw on examples from work in health domains to illustrate the challenges and opportunities for behavioral speech and spoken-signal processing. [Work supported by NIH, DARPA, ONR, and NSF.]