Course:CPSC522/Character Level Language Models using LSTM

Character Level Language Models using LSTM

This page primarily follows ^[1] and ^[2].

Principal Author: Kevin Dsouza
Collaborators:

Abstract

This page covers character-level language models implemented using Long Short-Term Memory networks (LSTMs). The first half introduces a character-aware neural language model, and the second half builds on this idea with a hierarchical structure for building character-level models.

Builds on

This page builds on Recurrent Neural Networks and Natural Language Processing.

Related Pages

Character level language models depart from Word level language models and come under the general category of Language models.

Paper 1: Character-Aware Neural Language Models

Introduction

Figure 1: The overall working of the model. (Source: ^[1])

Traditional methods in language modeling involve making an n-th order Markov assumption and estimating n-gram probabilities via counting. Count-based models are simple to train, but due to data sparsity, the probabilities of rare n-grams can be poorly estimated. Neural Language Models (NLMs) address the issue of n-gram data sparsity by utilizing word embeddings ^[3]. Word embeddings derived from NLMs exhibit the property that semantically close words are close in the induced vector space. Even though NLMs outperform count-based n-gram language models ^[4], they are oblivious to subword information (e.g. morphemes). Embeddings of rare words can thus be poorly estimated, leading to high perplexity (a measure of how well a probability distribution predicts a sample, where lower is better), which is especially problematic in morphologically rich languages.

In this work, the authors propose a language model that leverages subword information through a character-level convolutional neural network (CNN), whose output is used as an input to a recurrent neural network language model (RNN-LM). Unlike previous works that utilize subword information via morphemes ^[5], this model does not require morphological tagging as a pre-processing step.

Long-Short-Term-Memory

Long short-term memory (LSTM) ^[6] addresses the problem of learning long-range dependencies in recurrent neural networks by adding a memory cell vector $c_{t}\in R^{n}$ at each time step. One step of an LSTM takes as input $x_{t}$ (input vector at time $t$), $h_{t-1}$ (hidden state vector at time $t-1$), and $c_{t-1}$ (memory cell vector at time $t-1$), and produces $h_{t}$ (hidden state vector at time $t$) and $c_{t}$ (memory cell vector at time $t$) via the following intermediate calculations:

i_{t}=\sigma (W^{i}x_{t}+U^{i}h_{t-1}+b^{i})

f_{t}=\sigma (W^{f}x_{t}+U^{f}h_{t-1}+b^{f})

o_{t}=\sigma (W^{o}x_{t}+U^{o}h_{t-1}+b^{o})

g_{t}=tanh(W^{g}x_{t}+U^{g}h_{t-1}+b^{g})

c_{t}=f_{t}\odot c_{t-1}+i_{t}\odot g_{t}

h_{t}=o_{t}\odot tanh(c_{t})

Here $\sigma ()$ and $tanh()$ are the element-wise sigmoid and hyperbolic tangent functions, $\odot$ is the element-wise multiplication operator, and $i_{t}$, $f_{t}$, $o_{t}$ are referred to as the input, forget, and output gates. At $t=1$, $h_{0}$ and $c_{0}$ are initialized to zero vectors. The parameters of the LSTM are $W^{j},U^{j},b^{j}$ for $j\in \{i,f,o,g\}$.
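The six equations above can be traced in a few lines of code. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names `affine` and `zeros`, and the layout of `params` as a dictionary mapping each of `i`, `f`, `o`, `g` to a `(W, U, b)` triple, are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Compute W x + U h + b, with matrices stored as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params[name] = (W, U, b) for name in 'i', 'f', 'o', 'g'."""
    pre = {name: affine(W, x_t, U, h_prev, b) for name, (W, U, b) in params.items()}
    i_t = [sigmoid(v) for v in pre["i"]]    # input gate
    f_t = [sigmoid(v) for v in pre["f"]]    # forget gate
    o_t = [sigmoid(v) for v in pre["o"]]    # output gate
    g_t = [math.tanh(v) for v in pre["g"]]  # candidate cell update
    # Additive cell update: c_t = f_t * c_{t-1} + i_t * g_t (element-wise)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Toy check with all-zero parameters (hypothetical sizes: n = 2, input dim = 3).
# Every gate is then sigmoid(0) = 0.5 and g_t = tanh(0) = 0.
n, d_in = 2, 3
params = {name: (zeros(n, d_in), zeros(n, n), [0.0] * n) for name in "ifog"}
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

With all-zero parameters the forget gate halves the previous cell state and nothing new is written, so `c_t` is `[0.5, -1.0]` and `h_t` is half of `tanh(c_t)`.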

Memory cells in the LSTM are additive with respect to time, which alleviates the vanishing gradient problem. This is the most important distinction of the LSTM, since it allows gradients to flow through the memory cell largely uninterrupted. Exploding gradients are still an issue, though in practice simple gradient clipping works well. LSTMs have outperformed vanilla RNNs on many tasks, including language modeling ^[7].
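One common form of gradient clipping rescales the gradient whenever its L2 norm exceeds a threshold. The sketch below assumes that variant; the function name `clip_by_norm` and the threshold used in the example are illustrative and are not taken from the paper.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; leave it unchanged otherwise."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


# A gradient of norm 5 clipped to norm 1 keeps its direction.
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
```

Because every component is scaled by the same factor, clipping changes the step size but not the update direction.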

Recurrent Neural Network Language model (RNNLM)

Figure 3: Performance of the model versus other neural language models on the English Penn Treebank test set. (Source: ^[1])

Let $V$ be the fixed-size vocabulary of words. A language model specifies a distribution over $w_{t+1}$ (whose support is $V$) given the historical sequence $w_{1:t}=[w_{1},...,w_{t}]$. A recurrent neural network language model (RNN-LM) applies an affine transformation to the hidden layer followed by a softmax:

P(w_{t+1}=j|w_{1:t})={\frac {exp(h_{t}p^{j}+q^{j})}{\sum _{k\in V}exp(h_{t}p^{k}+q^{k})}}

where $p^{j}$ is the $j$th column of $P\in R^{m\times |V|}$ (the output embedding) and $q^{j}$ is a bias term. If $w_{1:T}=[w_{1},...,w_{T}]$ is the sequence of words in the training corpus, training involves minimizing the negative log-likelihood (NLL) of the sequence, which is done by truncated backpropagation through time.

NLL=-\sum _{t=1}^{T}\log P(w_{t}|w_{1:t-1})
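The softmax output layer and the NLL objective can be written out directly. This is a minimal sketch in plain Python under stated assumptions: `P` is stored as a list of $m$ rows with $|V|$ entries each, words are integer indices, and the function names are hypothetical. The maximum score is subtracted before exponentiating, a standard numerical-stability step that leaves the distribution unchanged.

```python
import math


def next_word_distribution(h_t, P, q):
    """Softmax over scores h_t . p^j + q^j, where p^j is column j of P (m x |V|)."""
    m, vocab = len(h_t), len(q)
    scores = [sum(h_t[r] * P[r][j] for r in range(m)) + q[j] for j in range(vocab)]
    top = max(scores)  # shift scores for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(step_distributions, targets):
    """NLL = -sum_t log P(w_t | w_{1:t-1}), one distribution per target word."""
    return -sum(math.log(dist[w]) for dist, w in zip(step_distributions, targets))


# Toy check: with zero output parameters the model is uniform over |V| = 4 words.
P = [[0.0] * 4 for _ in range(2)]
q = [0.0] * 4
dist = next_word_distribution([0.3, -0.7], P, q)
nll = negative_log_likelihood([dist, dist, dist], targets=[0, 2, 3])
```

For a uniform model each word has probability 1/4, so the NLL of three words is $3\log 4$.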

Character-level Convolutional Neural Network

Let $C$ be the vocabulary of characters, $d$ be the dimensionality of character embeddings, and $Q\in R^{d\times |C|}$ be the matrix of character embeddings. Suppose that word $k\in V$ is made up of a sequence of characters $[c_{1},...,c_{l}]$ , where $l$ is the length of word $k$ . Then the character-level representation of $k$ is given by the matrix $C^{k}\in R^{d\times l}$ , where the $j$ th column corresponds to the character embedding for $c_{j}$ .

Figure 4: First two rows are from Botha (2014) while the last six are from the current paper being explained. KN-4 is a Kneser-Ney 4-gram language model, and MLBL is the best performing morphological log-bilinear model from Botha (2014). Small/Large refers to model size and Word/Morph/Char are models with words/morphemes/characters as inputs respectively. (Source: ^[1])

A convolution between $C^{k}$ and a filter (or kernel) $H\in R^{d\times w}$ of width $w$ is applied, after which a bias is added followed by a nonlinearity to obtain a feature map $f^{k}\in R^{l-w+1}$. The $i$th element of $f^{k}$ is:

f^{k}[i]=tanh(\langle C^{k}[*,i:i+w-1],H\rangle +b)

where $C^{k}[*,i:i+w-1]$ denotes columns $i$ through $i+w-1$ of $C^{k}$ and $\langle A,B\rangle =Tr(AB^{T})$ is the Frobenius inner product. Finally, take the max-over-time:

y^{k}=max_{i}f^{k}[i]

as the feature corresponding to the filter $H$ (when applied to word $k$). The idea, the authors say, is to capture the most important feature for a given filter. "A filter is essentially picking out a character n-gram, where the size of the n-gram corresponds to the filter width." The framework therefore uses multiple filters of varying widths to obtain the feature vector for $k$. If a total of $h$ filters $H_{1},...,H_{h}$ are used, then $y^{k}=[y_{1}^{k},...,y_{h}^{k}]$ is the input representation of $k$.
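The convolution and max-over-time steps can be sketched as follows. This plain-Python version is illustrative only: the function names are hypothetical, the toy character embeddings and filters are made up, and the sketch assumes the word is at least as long as the widest filter (it does no padding).

```python
import math


def feature_map(C_k, H, b):
    """Slide a d x w filter H over the d x l matrix C_k; returns l - w + 1 values."""
    d, l, w = len(C_k), len(C_k[0]), len(H[0])
    out = []
    for i in range(l - w + 1):
        # Frobenius inner product between the window C_k[:, i:i+w] and H
        inner = sum(C_k[r][i + c] * H[r][c] for r in range(d) for c in range(w))
        out.append(math.tanh(inner + b))
    return out


def word_representation(C_k, filters):
    """y^k: one max-over-time feature per (H, b) pair in filters."""
    return [max(feature_map(C_k, H, b)) for H, b in filters]


# Toy word of l = 4 characters with d = 2 dimensional character embeddings.
C_k = [[1.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 1.0]]
H1 = [[1.0, 0.0],
      [0.0, 1.0]]                 # width-2 filter
H2 = [[0.0] * 3, [0.0] * 3]       # width-3 filter that only passes its bias
fmap = feature_map(C_k, H1, 0.0)
y_k = word_representation(C_k, [(H1, 0.0), (H2, 0.5)])
```

The width-2 filter responds most strongly at positions 0 and 2, where the window matches its pattern, and max-over-time keeps only that strongest response. The length of `y_k` depends on the number of filters, not on the word length, which is what gives every word a fixed-size representation.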

Highway Network

Figure 5: Nearest neighbor words (based on cosine similarity) of word representations from the large word-level and character-level (before and after highway layers) models trained on the PTB. (Source: ^[1])

One layer of a highway network, proposed in ^[8], computes the following function:

z=t\odot g(W_{H}y+b_{H})+(1-t)\odot y

where $g$ is a nonlinearity, $t=\sigma (W_{T}y+b_{T})$ is called the transform gate, and $(1-t)$ is called the carry gate. Similar to the memory cells in LSTM networks, highway layers allow for training of deep networks by carrying some dimensions of the input directly to the output.
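A single highway layer can be sketched in a few lines. The plain-Python version below assumes a ReLU for the nonlinearity $g$ as an example choice; the helper names `matvec` and `relu` and the toy parameter values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def matvec(W, y, b):
    """Compute W y + b with W stored as a list of rows."""
    return [sum(w * yi for w, yi in zip(row, y)) + b_k for row, b_k in zip(W, b)]


def highway_layer(y, W_H, b_H, W_T, b_T, g=relu):
    """z = t * g(W_H y + b_H) + (1 - t) * y, all products element-wise."""
    t = [sigmoid(v) for v in matvec(W_T, y, b_T)]  # transform gate
    transformed = [g(v) for v in matvec(W_H, y, b_H)]
    return [ti * hi + (1.0 - ti) * yi for ti, hi, yi in zip(t, transformed, y)]


# Toy check: identity W_H and zero gate parameters give t = 0.5 everywhere,
# so the output is the average of relu(y) and y.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
z = highway_layer([1.0, -2.0], identity, [0.0, 0.0], zero, [0.0, 0.0])
```

Note that the carry term requires the input and output of the layer to have the same dimensionality.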

Figure 6: Plot of character n-gram representations via PCA for English. Colors correspond to prefixes (red), suffixes (blue), hyphenated (orange), and all others (grey). (Source: ^[1])

The overall working of the model can be observed in Figure 1. Essentially, the character-level CNN applies convolutions with multiple filters to the character embeddings and max-pools each resulting feature map over time to obtain a fixed-dimensional representation. This is then fed to the highway layer, which helps in encoding semantic features that are not dependent on edit distance alone. The output of the highway layer is then fed into an LSTM that predicts the next word.

Evaluation

Perplexity (PPL) is used to evaluate the performance of the models. The perplexity of a model over a sequence $[w_{1},...,w_{T}]$ is given by:

PPL=exp({\frac {NLL}{T}})

where $NLL$ is calculated over the test set.
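As a quick sanity check on the formula, a model that assigns the same probability $1/N$ to every test word has perplexity exactly $N$. A minimal sketch, where the function name and the list-of-probabilities input format are assumptions for illustration:

```python
import math


def perplexity(token_probs):
    """PPL = exp(NLL / T), given the model's probability for each of the T test words."""
    nll = -sum(math.log(p) for p in token_probs)
    return math.exp(nll / len(token_probs))


uniform_ppl = perplexity([0.25] * 10)         # uniform over 4 choices
confident_ppl = perplexity([0.9, 0.8, 0.95])  # a more confident model
```

Perplexity can thus be read as the effective number of equally likely choices the model faces at each step, so a more confident and correct model scores closer to 1.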

Hyperparameters are tuned on the Penn Treebank (PTB), and the resulting model is then applied to various morphologically rich languages: Czech, German, French, Spanish, Russian, and Arabic.

The Penn Treebank is a large annotated corpus consisting of over 4.5 million words of American English ^[9]. The authors train two versions of the model to assess the trade-off between performance and size. As another baseline, two comparable LSTM models that use word embeddings (LSTM-Word-Small, LSTM-Word-Large) are also trained. The large model presented in the paper is on par with the existing state of the art (Zaremba et al. 2014) despite having approximately 60% fewer parameters, and the small model significantly outperforms other NLMs of similar size, as can be observed in Figure 3.

English is relatively simple from a morphological standpoint, so the next set of results, which the authors present as the main contribution of the paper, focuses on languages with richer morphology. Results are compared against the morphological log-bilinear (MLBL) model from ^[5], which takes subword information into account through morpheme embeddings. On DATA-S (the smaller of the paper's datasets), Figure 4 shows that the character-level models outperform their word-level counterparts despite being smaller. The character models also outperform their morphological counterparts (both MLBL and LSTM architectures).

Discussion

Learned Word Representations

In Figure 5, before the highway layers, the nearest neighbors of you are your, young, four, and youth, all of which are close to you in terms of edit distance. The highway layers, however, seem to enable encoding of semantic features that are not derivable from edit distance alone. After the highway layers, the nearest neighbor of you is we, which is orthographically distinct from you. The model also makes some clear mistakes (e.g. his and hhs), which is a drawback of this approach. The authors hypothesize that highway networks are especially well suited to work with CNNs, adaptively combining local features detected by the individual filters.
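The nearest-neighbor lookup behind Figure 5 ranks words by cosine similarity between their representations. The sketch below shows only the mechanics: the two-dimensional vectors are made-up toy values, not representations from the paper's model, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=3):
    """Return the k words whose vectors are most similar to the query word's vector."""
    q = embeddings[query]
    scored = [
        (cosine_similarity(q, vec), word)
        for word, vec in embeddings.items()
        if word != query
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Toy vectors chosen by hand for illustration only.
toy_embeddings = {
    "you": [1.0, 0.0],
    "we": [0.9, 0.1],
    "young": [0.0, 1.0],
}
neighbors = nearest_neighbors("you", toy_embeddings, k=2)
```

In this toy space "we" points in nearly the same direction as "you", so it ranks first even though the two strings share no characters.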

Learned Character N-gram Representations

Each filter of the CharCNN is essentially learning to detect particular character n-grams. The initial expectation would be that each filter would learn to activate on different morphemes and then build up semantic representations of words from the identified morphemes. However, upon reviewing the character n-grams picked up by the filters, the authors found that they did not (in general) correspond to valid morphemes. The learned representations of all character n-grams are plotted via principal components analysis. Each character n-gram is fed into the CharCNN and the CharCNN’s output is used as the fixed dimensional representation for the corresponding character n-gram. From Figure 6, the model learns to differentiate between prefixes (red), suffixes (blue), and others (grey). They also find that the representations are particularly sensitive to character n-grams containing hyphens (orange).

Conclusion

The work introduces a neural language model that utilizes only character-level inputs. Predictions are still made at the word-level. Despite having fewer parameters, the model outperforms baseline models that utilize word/morpheme embeddings in the input layer. The work questions the necessity of word embeddings as inputs for neural language modeling.
Analysis of word representations obtained from the character composition part of the model further indicates that the model is able to encode, from characters only, rich semantic and orthographic features.
The model requires additional convolution operations over characters and is thus slower than a comparable word-level model, which can perform a simple lookup at the input layer. The authors note, however, that the overhead is manageable with optimized GPU implementations.