Rich Caruana - Microsoft Research
Abstract:
Deep neural networks are the state of the art on problems such as speech recognition and computer vision. Using a method called model compression, we show that shallow nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models, while using the same number of parameters as the original deep models. On the TIMIT speech recognition and CIFAR-10 image recognition tasks, shallow nets can be trained to perform similarly to complex, well-engineered, deeper convolutional architectures. The same model compression trick can also be used to compress impractically large deep models and ensembles of large deep models down to small- or medium-sized deep models that run more efficiently on mobile devices or servers.
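The mimic-training idea behind model compression can be illustrated with a small sketch (ours, not the exact recipe from the talk): a shallow "student" network is trained with an L2 loss to reproduce the logits of a larger "teacher" model on transfer data. Here the teacher is only a random placeholder and all dimensions are made up for illustration.

# Illustrative sketch of model compression / mimic learning (not the exact
# setup from the talk): a shallow "student" net regresses the logits
# produced by a larger "teacher" model on unlabeled transfer data.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 100-dim inputs, 10 output classes, 5000 transfer examples.
D_IN, D_HID, D_OUT, N = 100, 256, 10, 5000

X = rng.standard_normal((N, D_IN))
# Stand-in for the teacher; in practice these logits come from the deep model.
teacher_logits = X @ rng.standard_normal((D_IN, D_OUT))

# One-hidden-layer student trained on the teacher's logits with an L2 loss
# (matching logits rather than post-softmax probabilities).
W1 = rng.standard_normal((D_IN, D_HID)) * 0.01
b1 = np.zeros(D_HID)
W2 = rng.standard_normal((D_HID, D_OUT)) * 0.01
b2 = np.zeros(D_OUT)

lr = 1e-3
for epoch in range(20):
    h = np.maximum(0.0, X @ W1 + b1)          # ReLU hidden layer
    pred = h @ W2 + b2                        # student logits
    grad = 2.0 * (pred - teacher_logits) / N  # d(MSE)/d(pred)
    # Backpropagation through the two layers.
    dW2 = h.T @ grad
    db2 = grad.sum(axis=0)
    dh = (grad @ W2.T) * (h > 0)
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g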
Biography
Rich Caruana is a Senior Researcher at Microsoft Research. Before joining Microsoft, Rich was on the faculty in the Computer Science Department at Cornell University, at UCLA’s Medical School, and at CMU’s Center for Automated Learning and Discovery (CALD). Rich’s Ph.D. is from Carnegie Mellon University, where he worked with Tom Mitchell and Herb Simon. His thesis on Multi-Task Learning helped generate interest in a new subfield of machine learning called Transfer Learning. Rich received an NSF CAREER Award in 2004 (for Meta Clustering), best paper awards in 2005 (with Alex Niculescu-Mizil), 2007 (with Daria Sorokina), and 2014 (with Todd Kulesza, Saleema Amershi, Danyel Fisher, and Denis Charles), co-chaired KDD in 2007 (with Xindong Wu), and serves as area chair for NIPS, ICML, and KDD. His current research focus is on learning for medical decision making, deep learning, adaptive clustering, and computational ecology.
Hermann Ney - RWTH Aachen University
Abstract:
Today's most successful systems for ASR are based on the Bayes decision rule, which under suitable conditions guarantees optimum performance (or, equivalently, minimum error rate). However, when the design and implementation of today's ASR systems are revisited from this point of view, many fundamental and mathematical issues are still open and have not been addressed adequately.
Examples are: 1) The true distributions required by the Bayes decision rule differ from the model distributions learned from data. What is the effect of this mismatch on the use of the Bayes decision rule? What is the effect on the training method? 2) The performance measure used in ASR is the word error rate (edit distance), whereas the decision rule typically implemented is the rule for minimum sentence error. What is the effect of this discrepancy between decision rule and performance measure? 3) ASR systems use the language model as the key concept for providing context information. What is the effect of the language model perplexity on the word error rate?
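For reference, the sentence-level Bayes decision rule referred to above can be written as follows (the notation is ours, a sketch rather than the speaker's own formulation):

\hat{w}_1^N \;=\; \operatorname*{argmax}_{w_1^N}\;\Big\{\, \Pr(w_1^N)\,\Pr(x_1^T \mid w_1^N) \,\Big\}

whereas a decision rule matched to the word error rate would instead minimize the expected edit (Levenshtein) distance under the posterior:

\hat{w}_1^N \;=\; \operatorname*{argmin}_{w_1^N}\; \sum_{v_1^M} \Pr(v_1^M \mid x_1^T)\, \mathrm{Lev}\big(w_1^N, v_1^M\big)

The gap between these two rules is exactly the discrepancy raised in point 2) above.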
Biography
Hermann Ney is a full professor of Computer Science at RWTH Aachen University in Aachen, Germany. His research interests include statistical classification and machine learning with applications to speech recognition, machine translation, and text image recognition. Among other topics, he and his team have worked mainly on decoding, language modelling, and acoustic modelling for speech recognition, and on alignment models, phrase-based modelling, and decoding for machine translation. In particular, his contributions to statistical machine translation initiated a revolutionary development in the field around 2002. His work has resulted in more than 700 conference and journal papers. According to Google Scholar, his h-index is 81 with a total of 33,000 citations. He is a fellow of both the IEEE and the International Speech Communication Association. In 2005, he received the Technical Achievement Award of the IEEE Signal Processing Society. In 2013, he received the Award of Honour of the International Association for Machine Translation.
Jason Eisner - Johns Hopkins University
Abstract:
Natural language processing must sometimes consider the internal structure of words, e.g., in order to understand or generate an unfamiliar word. Unfamiliar words are systematically related to familiar ones due to linguistic processes such as morphology, phonology, abbreviation, copying error, and historical change.
We will show how to build joint probability models over many strings. These models are capable of predicting unobserved strings, or predicting the relationships among observed strings. However, computing the predictions of these models can be computationally hard. We outline approximate algorithms based on Markov chain Monte Carlo, expectation propagation, and dual decomposition. We give results on some NLP tasks.
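As a rough sketch of what a joint probability model over several strings can look like (our illustration, not necessarily the exact formulation used in the talk), one can define a graphical model whose variables are strings and whose pairwise factors score related string pairs, for example with weighted finite-state transducers:

p(s_1, \ldots, s_K) \;\propto\; \prod_{(i,j)\in E} \phi_{ij}(s_i, s_j)

Because each variable ranges over the infinite set of strings, exact inference is generally intractable, which is why approximate algorithms such as those listed above are needed.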
Biography
Jason Eisner is Professor of Computer Science at Johns Hopkins University. His goal is to develop the probabilistic modeling, inference, and learning techniques needed for a unified model of all kinds of linguistic structure. He has worked on computational approaches to phonology, morphology, syntax, and semantics, as well as applied problems such as information extraction and machine translation. He is also the lead designer of Dyna, a declarative programming language that aims to provide an infrastructure for AI research. He has received two school-wide awards for excellence in teaching.
Jerry Chen - NVIDIA
Abstract:
The use of GPUs to accelerate general-purpose computing is widely accepted in the HPC community. GPUs now power many of the top supercomputers around the world, including the fastest supercomputers in the US and Europe. Perhaps the most exciting byproduct of this innovation is the availability of affordable commodity GPU computing resources for the data science community.
Industries like media, medicine, consumer electronics, transportation, retail, and security are being transformed by deep learning applications that learn how to represent the world, thereby making better predictions to help improve our lives. Speech and natural language processing are among the hottest areas of this approach.
Biography
Jerry Chen is responsible for developing the partner ecosystem for data science and machine learning at NVIDIA. His past roles include management positions in professional graphics, scientific HPC, and structural mechanics. He holds a Bachelor's degree and a Master of Engineering degree from Cornell University and an MBA from the University of California at Berkeley.
Oriol Vinyals - Research Scientist at Google Brain
Abstract:
Over the past year, RNNs have received a lot of attention as powerful models that are able to decode sequences from signals. In this talk, I'll review some recent successes on machine translation, image understanding, and beyond. The key component of these methods is a recurrent neural network architecture that is trained end-to-end to optimize the probability of the output sequence given those signals.
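In such sequence-to-sequence models, the probability of the output sequence factorizes autoregressively; a standard way to write this (our notation, not taken from the talk) is

p(y_1, \ldots, y_{T'} \mid x) \;=\; \prod_{t=1}^{T'} p\big(y_t \mid y_1, \ldots, y_{t-1}, x\big)

where x is the input signal (for example a source sentence or an image encoding) and each conditional is computed by the recurrent decoder. Training maximizes the log-probability of reference output sequences, and decoding typically approximates the argmax with beam search.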
Biography
Oriol Vinyals is a Research Scientist at Google. He works in deep learning with the Google Brain team. Oriol holds a Ph.D. in EECS from the University of California, Berkeley, and a Master's degree from the University of California, San Diego. He is a recipient of the 2011 Microsoft Research PhD Fellowship. He was an early adopter of the new deep learning wave at Berkeley, and in his thesis he focused on non-convex optimization and recurrent neural networks. At Google Brain he continues working on his areas of interest, which include artificial intelligence, with particular emphasis on machine learning, language, and vision.
Heiga Zen - Google
Abstract:
Statistical parametric speech synthesis (SPSS) combines an acoustic model and a vocoder to render speech given a text. Typically, decision tree-clustered context-dependent hidden Markov models (HMMs) are employed as the acoustic model, which represents the relationship between linguistic and acoustic features. There have been attempts to replace the HMMs with alternative acoustic models that provide trajectory and context modeling. Recently, artificial neural network-based acoustic models, such as deep neural networks, mixture density networks, and recurrent neural networks (RNNs), showed significant improvements over the HMM-based approach. This talk reviews the progress of acoustic modeling in SPSS from the HMM to the RNN.
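Abstractly, the neural acoustic models mentioned above replace the decision-tree-clustered HMM with a learned regression from linguistic features l_t to acoustic features o_t. A minimal recurrent version (our sketch, ignoring details such as mixture density outputs, bidirectionality, and duration modeling) is

h_t = \tanh\!\big(W_{lh}\, l_t + W_{hh}\, h_{t-1} + b_h\big), \qquad \hat{o}_t = W_{ho}\, h_t + b_o

with the parameters trained to minimize a criterion such as \sum_t \lVert o_t - \hat{o}_t \rVert^2 over the training utterances; the predicted acoustic features are then passed to the vocoder.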
Biography
Heiga Zen received his PhD from the Nagoya Institute of Technology, Nagoya, Japan, in 2006. Before joining Google in 2011, he was an Intern/Co-Op researcher at the IBM T.J. Watson Research Center, Yorktown Heights, NY (2004--2005), and a Research Engineer at Toshiba Research Europe Ltd. Cambridge Research Laboratory, Cambridge, UK (2008--2011). His research interests include statistical speech synthesis and recognition. He was one of the original authors and the first maintainer of the HMM-based speech synthesis system, HTS (http://hts.sp.nitech.ac.jp).
Tomohiro Nakatani - NTT Corporation
Abstract:
When we capture speech with distant microphones, various interfering sounds, such as ambient noise, reverberation, and extraneous speakers' voices, are inevitably included in the captured signals and degrade the speech features. This makes ASR with distant microphones, or distant speech recognition (DSR), very difficult. As a promising solution to this problem, this talk will discuss multi-microphone frontend approaches, including denoising, dereverberation, and source separation. Using several challenging DSR scenarios, including the REVERB Challenge, the CHiME-3 Challenge, and a meeting recognition scenario, it will be shown that these frontend approaches can work well together with state-of-the-art ASR backends based on DNN acoustic models and greatly improve the performance of DSR.
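One of the simplest multi-microphone frontend operations is delay-and-sum beamforming; the toy sketch below (a generic textbook illustration, not the specific methods presented in the talk) time-aligns the channels toward the speaker and averages them, which attenuates noise arriving from other directions. The signals, delays, and sampling rate are all made up.

# Illustrative delay-and-sum beamforming over distant-microphone channels.
import numpy as np

def delay_and_sum(signals, delays_samples):
    """signals: list of 1-D arrays, one per microphone, equal length.
    delays_samples: integer per-microphone delays (in samples) that
    time-align the target speaker across channels."""
    out = np.zeros(len(signals[0]))
    for sig, d in zip(signals, delays_samples):
        out += np.roll(sig, -d)   # advance each channel by its delay
    return out / len(signals)

# Toy usage with a hypothetical 3-microphone recording at 16 kHz.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000.0)
true_delays = [0, 3, 7]   # propagation delay per microphone, in samples
mics = [np.roll(clean, d) + 0.5 * rng.standard_normal(16000) for d in true_delays]
# Averaging 3 aligned channels cuts the independent noise power to roughly a third.
enhanced = delay_and_sum(mics, true_delays)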
Biography
Tomohiro Nakatani is a Senior Research Scientist (Supervisor) at NTT Communication Science Laboratories, NTT Corporation, Japan. He received a Ph.D. degree in informatics from Kyoto University, Kyoto, Japan, in 2002. Since joining NTT in 1991, he has been investigating statistical signal processing technologies for analyzing speech signals captured in real acoustical environments and has developed various algorithms for dereverberation, denoising, blind source separation, and robust ASR. He presented a tutorial on reverberant speech processing at IEEE ICASSP-2012 and served as a co-chair of the REVERB Challenge Workshop held in 2014. He is an associate member of the IEEE SPS Audio and Acoustic Signal Processing TC.
Dan Bohus - Microsoft Research
Abstract:
Most research to date on spoken language interaction has focused on supporting dialog with single users in limited domains and contexts. Significant progress in this space has enabled wide-scale deployments of telephony-based systems, multimodal voice search applications and voice-enabled personal assistants.
At the same time, numerous important challenges remain largely unaddressed in the realm of physically situated spoken language interaction (e.g., in-car systems, robots in public spaces, ambient assistants). In this talk, I will outline a core set of communicative competencies required for supporting dialog in physically situated settings, such as models of multiparty engagement, turn-taking, and interaction planning, and I will present samples of work from a broader research agenda in this area. In the process, I will highlight the need for multimodal, incremental, and joint reasoning in situated dialog.
Biography
Dan Bohus is a Senior Researcher in the Adaptive Systems and Interaction Group at Microsoft Research. His research agenda focuses on physically situated, open-world spoken language interaction. Before joining Microsoft Research, Dan received his Ph.D. degree (2007) in Computer Science from Carnegie Mellon University.
Kai Yu - Shanghai Jiao Tong University
Abstract:
Dialogue state tracking (DST) is the process of estimating the distribution over dialogue states at each dialogue turn given the interaction history. It is a key component of statistical dialogue management. Although data-driven approaches have attracted the most interest in DST and have defined the state-of-the-art performance, their performance is highly dependent on the availability of training data, and their generalization ability is often weak. There have also been attempts to use rule-based methods for DST because of their simplicity, efficiency, and portability. However, rule-based methods are usually not competitive with carefully designed data-driven trackers, and they cannot improve as training data become available. It is therefore of interest to study whether the advantages of both frameworks can be achieved at the same time.
In this talk, various rule-based and data-driven approaches for DST are reviewed. Then, two forms of hybrid approaches are introduced to bridge the two frameworks. One approach reformulates rule-based Bayesian updates as a polynomial function of a set of probabilities satisfying certain linear constraints, referred to as a constrained Markov Bayesian polynomial (CMBP). Prior knowledge, i.e. rules, can be encoded in these constraints, while the CMBP coefficients are optimized on training data. The other approach develops a new form of computation network whose structure is suited to representing polynomial functions and includes recurrent links to memorize the input history. This computation network is referred to as a Recurrent Polynomial Network (RPN); its special structure allows it to be initialized from a CMBP. Both approaches can incorporate rules and are data-driven. This leads to performance competitive with the state-of-the-art data-driven approaches, while keeping the simplicity and good generalization properties of rule-based approaches.
Thanks to the Dialog State Tracking Challenges (DSTC) organized in recent years, a broad range of machine learning approaches has been investigated to advance DST research. One interesting scenario in the DSTCs is to build a robust tracker for a new domain with very limited in-domain data and relatively large out-of-domain data. This is of practical interest and was the primary task of DSTC-3. An introduction to the DSTCs will be given, and the evaluation of the aforementioned DST approaches on DSTC-3 will be reported.
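To make the rule-based starting point concrete, the sketch below shows a simplified, fixed-coefficient belief update of the kind that CMBP generalizes (our illustration, not the exact formulation from the talk). CMBP would replace the fixed coefficients of such a polynomial with coefficients learned from data, subject to linear constraints that keep the result a valid probability; the renormalization step here is only a simplification.

# Illustrative rule-based dialogue state tracking update.
def rule_based_update(belief, p_inform):
    """belief: dict mapping slot value -> current probability of being the goal.
    p_inform: dict mapping slot value -> SLU confidence that the user
    informed that value in the current turn (hypothetical inputs)."""
    new_belief = {}
    for v in belief:
        # A fixed polynomial in belief[v] and p_inform[v]; CMBP learns the
        # coefficients of such polynomials instead of hand-fixing them.
        new_belief[v] = belief[v] + (1.0 - belief[v]) * p_inform.get(v, 0.0)
    # Keep the tracked distribution normalized (a simplification).
    z = sum(new_belief.values()) or 1.0
    return {v: p / z for v, p in new_belief.items()}

belief = {"chinese": 0.2, "italian": 0.1, "none": 0.7}
turn_slu = {"italian": 0.6}            # hypothetical SLU output for this turn
belief = rule_based_update(belief, turn_slu)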
Biography
Kai Yu is a research professor at Shanghai Jiao Tong University (SJTU), China. He received the B.Eng. degree in automation and the M.Sc. degree from Tsinghua University, China, in 1999 and 2002, respectively. He then joined the Machine Intelligence Lab of the Engineering Department at Cambridge University, U.K., where he obtained the Ph.D. degree in 2006. He worked as a research associate and later senior research associate at Cambridge University before joining SJTU. His main research interests lie in the area of speech-based human-machine interaction, including speech recognition, synthesis, language understanding, and dialogue systems. He served as the area chair for speech processing for InterSpeech 2009 and EUSIPCO 2009 and 2014, and as the area chair for spoken dialogue systems for InterSpeech 2014. He has received three best paper awards from InterSpeech and IEEE Spoken Language Technology, as well as the ISCA Computer Speech and Language Best Paper Award (2008-2012). He was one of the key members who designed and implemented the Cambridge end-to-end statistical spoken dialogue system, which defined the state of the art in the 2010 Spoken Dialogue Challenge. He was selected for the 1000 Overseas Talent Plan (Young Talent) by the Chinese government and the Excellent Young Scientists Project of NSFC China. He is a senior member of the IEEE, a member of ISCA, and a member of the Technical Committee of the Speech, Language, Music and Auditory Perception Branch of the Acoustical Society of China.
Steve Renals - University of Edinburgh
Abstract:
Over the past four years, the Natural Speech Technology project (www.natural-speech-technology.org) has aimed to address core research challenges in speech recognition and synthesis, driven by a number of weaknesses in current speech technology, including fragile operation across domains; a reliance on manually transcribed training data; models which only weakly factor the underlying sources of variability; and systems which react crudely (or not at all) to the context or environment. In this talk I'll review some of the work we have done over the past four years to address these challenges. In particular, the talk will focus on the adaptation and learning of representations for speech recognition and speech synthesis, and on the development of recognition and synthesis systems with wide domain coverage. I'll also illustrate how we have translated this core research into exemplar applications in two areas: media and health.
This research was supported by EPSRC Programme Grant no. EP/I031022/1 (Natural Speech Technology) and was performed jointly by researchers at Edinburgh, Cambridge, and Sheffield.
Biography
Steve Renals is Professor of Speech Technology at Edinburgh. He received a PhD in Speech Recognition and Neural Networks from Edinburgh in 1990, and before moving back to Edinburgh in 2003 he held positions at ICSI Berkeley, Cambridge, and Sheffield. His research interests cover speech technology, spoken language processing, and multimodal interaction. He is currently director of the EPSRC Natural Speech Technology programme and coordinator of the EC ROCKIT support action. He is a fellow of the IEEE, senior area editor of IEEE/ACM Transactions on Audio, Speech, and Language Processing, and a member of the advisory board for the ACM International Conference on Multimodal Interaction.