The Expressive Power of Word Embeddings
Abstract: We seek to better understand the information encoded in word embeddings. We propose several tasks that help to distinguish the characteristics of different publicly released embeddings. Our evaluation shows that embeddings are able to capture surprisingly nuanced semantics even in the absence of sentence structure. Moreover, benchmarking the embeddings shows great variance in the quality and characteristics of the semantics they capture. Finally, we show the impact of varying the number of dimensions and the resolution of each dimension on the useful features effectively captured by the embedding space. Our contributions highlight the importance of embeddings for NLP tasks and the effect of their quality on the final results.
ICML version (pdf)
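As a rough illustration of the kind of probing described above, the sketch below (not from the paper; the vocabulary and vectors are made-up placeholders) checks how a nearest-neighbour query under cosine similarity changes as the number of embedding dimensions is truncated.

```python
# Minimal sketch, assuming hypothetical toy embeddings rather than any of the
# publicly released ones evaluated in the paper.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "apple", "banana", "car"]
emb = rng.normal(size=(len(vocab), 64))                  # pretend 64-d embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def nearest(word, dims):
    """Nearest neighbour of `word` using only the first `dims` dimensions."""
    sub = emb[:, :dims]
    sub = sub / np.linalg.norm(sub, axis=1, keepdims=True)
    q = sub[vocab.index(word)]
    sims = sub @ q
    sims[vocab.index(word)] = -np.inf                    # exclude the query itself
    return vocab[int(np.argmax(sims))]

for d in (4, 16, 64):
    print(d, nearest("king", d))
```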
Deep Learning for Topical Words and Thematic Sentences
Abstract: This paper presents a hierarchical theme and topic model for deep representation of sentences and words from heterogeneous documents. We extract the latent themes from sentences and simultaneously identify the latent topics from words in different sentences. We flexibly conduct structural learning via Bayesian nonparametrics, where the numbers of themes and topics are unknown. A tree stick-breaking process is proposed to determine the theme proportions for sentence representation, and a hierarchical Dirichlet process is adopted to sample the topical words of a text corpus under the same theme. In the experiments, the proposed method is shown to be effective for finding topical words and thematic sentences in the DUC 2007 corpus.
ICML version (pdf)
Text Segmentation with Character-level Text Embeddings
G. Chrupala
Abstract: Learning word representations has recently seen much success in computational linguistics. However, assuming sequences of word tokens as input to linguistic analysis is often unjustified. For many languages word segmentation is a non-trivial task, and naturally occurring text is sometimes a mixture of natural language strings and other character data. We propose to learn text representations directly from raw character sequences by training a Simple Recurrent Network to predict the next character in text. The network uses its hidden layer to evolve abstract representations of the character sequences it sees. To demonstrate the usefulness of the learned text embeddings, we use them as features in a supervised character-level text segmentation and labeling task: recognizing spans of text containing programming language code. Using the embeddings as features, we are able to substantially improve over a baseline that uses only surface character n-grams.
ICML version (pdf)
Deep Learning Based on Manhattan Update Rule
Y. Hifny
Abstract: Acoustic models based on Deep Neural Networks (DNNs) lead to significant improvements in recognition accuracy. In these methods, Hidden Markov Model (HMM) state scores are computed using flexible discriminant DNNs. Training DNNs is computationally expensive, and efficient training of DNNs is an active area of research. Similar to HMMs, Deep Conditional Random Fields (DCRFs) use DNNs to compute state scores. In this paper, we present a method to estimate DCRFs using the Manhattan (MH) update rule. The Manhattan update rule does not involve the gradient magnitude. The method is general and can be used to estimate any model for which the gradient can be computed.
ICML version (pdf)
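For intuition, the following sketch shows one plausible reading of a sign-based (Manhattan) update, in which each parameter moves by a fixed step in the direction of its gradient's sign. This is an assumption made for illustration, not the authors' implementation.

```python
# Minimal sketch (an assumed interpretation, not the paper's code): update each
# parameter by a fixed step in the direction of the gradient's sign, so the
# gradient magnitude plays no role.
import numpy as np

def manhattan_update(params, grads, step=0.01):
    """Apply a sign-based (Manhattan) update to a list of parameter arrays."""
    return [p - step * np.sign(g) for p, g in zip(params, grads)]

# Toy usage on a single weight matrix with a made-up gradient.
w = np.zeros((2, 3))
g = np.array([[0.5, -2.0, 0.0], [1e-4, -1e-4, 3.0]])
w = manhattan_update([w], [g])[0]
print(w)   # every entry with a nonzero gradient moved by exactly 0.01
```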
Acoustic Modeling Based on Deep Conditional Random Fields
Y. Hifny
Abstract: Acoustic modeling based on Hidden Markov Models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. In continuous density HMMs, the state scores are computed using Gaussian mixture models. Alternatively, Deep Neural Networks (DNNs) can be used to compute the HMM state scores, which leads to significant improvements in recognition accuracy. Conditional Random Fields (CRFs) are undirected graphical models that maintain the Markov properties of HMMs and are formulated using the maximum entropy (MaxEnt) principle. It is also possible to use DNNs to compute the state scores in CRFs. Using CRFs on top of DNNs leads to an acoustic model known as Deep Conditional Random Fields (DCRFs). In this paper, we present a phone recognition task based on DCRFs. Preliminary results on the TIMIT task show that DCRFs can lead to good results.
ICML version (pdf)
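The sketch below illustrates the kind of scoring the DCRF abstract describes: per-frame state scores from a neural network combined with transition scores in a linear-chain decode. It is a toy illustration under my own assumptions, not the authors' system.

```python
# Minimal sketch: Viterbi decoding over DNN-produced state scores plus a
# transition matrix, the scoring structure a linear-chain CRF on top of a
# DNN would use. Inputs here are random placeholders.
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, S) per-frame state scores; transitions: (S, S) transition scores."""
    T, S = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)      # best previous state for each current state
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(6, 4)), rng.normal(size=(4, 4))))  # toy 6-frame, 4-state decode
```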
Vocal Tract Length Perturbation (VTLP) Improves Speech Recognition
N. Jaitly and G. Hinton
Abstract: Augmenting datasets by transforming inputs in a way that does not change the label is a crucial ingredient of state-of-the-art methods for object recognition using neural networks. However, this approach has (to our knowledge) not been exploited successfully in speech recognition (with or without neural networks). In this paper we lay the foundation for this approach and show one way of augmenting speech datasets by transforming spectrograms with a random linear warping along the frequency dimension. In practice this can be achieved using the warping techniques employed for vocal tract length normalization (VTLN), with the difference that a warp factor is generated randomly each time during training, rather than fitting a single warp factor to each training and test speaker (or utterance). At test time, a prediction is made by averaging the predictions over multiple warp factors. When this technique is applied to TIMIT using Deep Neural Networks (DNNs) of different depths, the Phone Error Rate (PER) improves by an average of 0.65% on the test set. For a Convolutional Neural Network (CNN) with a convolutional layer at the bottom, a gain of 1.0% is observed. These improvements are achieved without increasing the number of training epochs, and they suggest that data transformations should be an important component of training neural networks for speech, especially for data-limited projects.
ICML version (pdf)
Rectifier Nonlinearities Improve Neural Network Acoustic Models
Abstract: Deep neural network acoustic models produce substantial gains in large-vocabulary continuous speech recognition systems. Emerging work with rectified linear (ReL) hidden units demonstrates additional gains in final system performance relative to the more commonly used sigmoidal nonlinearities. In this work, we explore the use of deep rectifier networks as acoustic models for the 300-hour Switchboard conversational speech recognition task. Using simple training procedures without pretraining, networks with rectifier nonlinearities produce 2% absolute reductions in word error rates over their sigmoidal counterparts. We analyze hidden-layer representations to quantify differences in how ReL units encode inputs compared to sigmoidal units. Finally, we evaluate a variant of the ReL unit with a gradient more amenable to optimization, in an attempt to further improve deep rectifier networks.
ICML version (pdf)
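As a toy illustration of the contrast the rectifier abstract discusses, the sketch below (synthetic data, not the paper's acoustic model) compares how sparsely rectified linear and sigmoid units encode the same pre-activations in a single hidden layer.

```python
# Minimal sketch, assuming a made-up single hidden layer over random "frames";
# it only demonstrates the qualitative ReL-vs-sigmoid encoding difference.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 40))            # 100 toy frames, 40 features
W = rng.normal(size=(40, 256)) * 0.1      # one hidden layer of 256 units
pre = x @ W

relu_h = np.maximum(0.0, pre)             # ReL: negative inputs become exact zeros
sigm_h = 1.0 / (1.0 + np.exp(-pre))       # sigmoid: everything squashed into (0, 1)

print("ReL units exactly zero:  %.1f%%" % (100 * (relu_h == 0).mean()))
print("sigmoid units below 0.1: %.1f%%" % (100 * (sigm_h < 0.1).mean()))
```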
Effect of Non-linear Deep Architecture in Sequence Labeling
Abstract: If we compare the widely used Conditional Random Fields (CRFs) with the newly proposed “deep architecture” sequence models (Collobert et al., 2011), two things change: the architecture goes from linear to non-linear, and the feature representation goes from discrete to distributional. It is unclear, however, what utility non-linearity offers in conventional feature-based models. In this study, we show the close connection between CRFs and “sequence model” neural nets, and present an empirical investigation comparing their performance on two sequence labeling tasks: Named Entity Recognition and Syntactic Chunking. Our results suggest that non-linear models are highly effective in low-dimensional distributional spaces. Somewhat surprisingly, we find that a non-linear architecture offers no benefits in a high-dimensional discrete feature space.
ICML version (pdf)
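To make the linear-versus-non-linear contrast concrete, the sketch below (synthetic data and a hypothetical setup, not the paper's experiments) compares a linear classifier and a small MLP on dense, low-dimensional features versus sparse, high-dimensional discrete features.

```python
# Minimal sketch under assumed synthetic data; scikit-learn is assumed available.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 2, size=n)

dense = rng.normal(size=(n, 50)) + labels[:, None] * 0.5    # low-dimensional, distributional
sparse = (rng.random((n, 5000)) < 0.002).astype(float)      # high-dimensional, discrete
sparse[np.arange(n), labels] = 1.0                          # a few label-indicative features

for name, X in [("dense", dense), ("sparse", sparse)]:
    linear = LogisticRegression(max_iter=1000).fit(X[:1500], labels[:1500])
    mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X[:1500], labels[:1500])
    print(name,
          "linear:", round(linear.score(X[1500:], labels[1500:]), 3),
          "mlp:", round(mlp.score(X[1500:], labels[1500:]), 3))
```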