# Course:CPSC522/Character Level Language Models using LSTM

## Character Level Language Models using LSTM

Principal Author: Kevin Dsouza
Collaborators:

## Abstract

This page covers character-level language models implemented using Long Short-Term Memory networks (LSTMs). The first half introduces a character-aware neural model, and the second half builds on this idea with a hierarchical structure for character-level models.

### Related Pages

Character level language models depart from Word level language models and come under the general category of Language models.

## Paper 1: Character-Aware Neural Language Models

### Introduction

Figure 1: The overall working of the model. (Source: [1])

Traditional methods in language modeling involve making an n-th order Markov assumption and estimating n-gram probabilities via counting. Count-based models are simple to train, but due to data sparsity, the probabilities of rare n-grams can be poorly estimated. Neural Language Models (NLMs) address n-gram data sparsity by utilizing word embeddings [3]. The word embeddings derived from NLMs have the property that semantically close words are close in the induced vector space. Even though NLMs outperform count-based n-gram language models [4], they are oblivious to subword information (e.g. morphemes). Embeddings of rare words can thus be poorly estimated, leading to high perplexities, which is especially problematic in morphologically rich languages. (Perplexity measures how well a probability distribution predicts a sample; lower is better.)

In this work, the authors propose a language model that leverages subword information through a character-level convolutional neural network (CNN), whose output is used as an input to a recurrent neural network language model (RNN-LM). Unlike previous works that utilize subword information via morphemes [5], this model does not require morphological tagging as a pre-processing step.

### Long-Short-Term-Memory

Long short-term memory (LSTM) [6] addresses the problem of learning long-range dependencies in recurrent neural networks by adding a memory cell vector ${\displaystyle c_{t}\in R^{n}}$ at each time step. One step of an LSTM takes as input ${\displaystyle x_{t}}$ (input vector at time ${\displaystyle t}$), ${\displaystyle h_{t-1}}$ (hidden state vector at time ${\displaystyle t-1}$), and ${\displaystyle c_{t-1}}$ (memory cell vector at time ${\displaystyle t-1}$), and produces ${\displaystyle h_{t}}$ (hidden state vector at time ${\displaystyle t}$) and ${\displaystyle c_{t}}$ (memory cell vector at time ${\displaystyle t}$) via the following intermediate calculations:

${\displaystyle i_{t}=\sigma (W^{i}x_{t}+U^{i}h_{t-1}+b^{i})}$
${\displaystyle f_{t}=\sigma (W^{f}x_{t}+U^{f}h_{t-1}+b^{f})}$
${\displaystyle o_{t}=\sigma (W^{o}x_{t}+U^{o}h_{t-1}+b^{o})}$
${\displaystyle g_{t}=tanh(W^{g}x_{t}+U^{g}h_{t-1}+b^{g})}$
${\displaystyle c_{t}=f_{t}\odot c_{t-1}+i_{t}\odot g_{t}}$
${\displaystyle h_{t}=o_{t}\odot tanh(c_{t})}$

Here ${\displaystyle \sigma ()}$ and ${\displaystyle tanh()}$ are the element-wise sigmoid and hyperbolic tangent functions, ${\displaystyle \odot }$ is the element-wise multiplication operator, and ${\displaystyle i_{t}}$, ${\displaystyle f_{t}}$, ${\displaystyle o_{t}}$ are referred to as the input, forget, and output gates. At ${\displaystyle t=1}$, ${\displaystyle h_{0}}$ and ${\displaystyle c_{0}}$ are initialized to zero vectors. The parameters of the LSTM are ${\displaystyle W^{j},U^{j},b^{j}}$ for ${\displaystyle j\in \{i,f,o,g\}}$.

Memory cells in the LSTM are additive with respect to time, which alleviates the vanishing gradient problem. This is the most important distinction of the LSTM, because it allows for an uninterrupted gradient flow through the memory cell. Exploding gradients are still an issue, though in practice simple gradient clipping works well. LSTMs have outperformed vanilla RNNs on many tasks, including language modeling [7].
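The gate equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in [1]: the helper names `lstm_step` and `init_params`, the parameter layout (a dictionary from gate name to a `(W, U, b)` tuple), and the small random initialization are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps each of "i", "f", "o", "g" to (W, U, b)."""
    # Pre-activation W^j x_t + U^j h_{t-1} + b^j for every gate j.
    pre = {j: W @ x_t + U @ h_prev + b for j, (W, U, b) in params.items()}
    i_t = sigmoid(pre["i"])          # input gate
    f_t = sigmoid(pre["f"])          # forget gate
    o_t = sigmoid(pre["o"])          # output gate
    g_t = np.tanh(pre["g"])          # candidate cell update
    c_t = f_t * c_prev + i_t * g_t   # additive memory cell update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(input_dim, hidden_dim, seed=0):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    rng = np.random.default_rng(seed)
    return {
        j: (
            0.1 * rng.standard_normal((hidden_dim, input_dim)),
            0.1 * rng.standard_normal((hidden_dim, hidden_dim)),
            np.zeros(hidden_dim),
        )
        for j in "ifog"
    }

# Run three steps over one-hot inputs, starting from zero state.
params = init_params(input_dim=3, hidden_dim=4)
h, c = np.zeros(4), np.zeros(4)
for x in np.eye(3):
    h, c = lstm_step(x, h, c, params)
```

Note that the cell state is only ever scaled by the forget gate and added to, which is the additive path the paragraph above refers to.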

### Recurrent Neural Network Language model (RNNLM)

Let ${\displaystyle V}$ be the fixed size vocabulary of words. A language model specifies a distribution over ${\displaystyle w_{t+1}}$ (whose support is ${\displaystyle V}$) given the historical sequence ${\displaystyle w_{1:t}=[w_{1},...,w_{t}]}$. A recurrent neural network language model (RNN-LM) applies an affine transformation to the hidden layer followed by a softmax:

Figure 3: Performance of the model versus other neural language models on the English Penn Treebank test set. (Source: [1])

${\displaystyle P(w_{t+1}=j|w_{1:t})={\frac {exp(h_{t}p^{j}+q^{j})}{\sum _{k\in V}exp(h_{t}p^{k}+q^{k})}}}$

where ${\displaystyle p^{j}}$ is the ${\displaystyle j}$th column of ${\displaystyle P\in R^{m\times |V|}}$ (output embedding), and ${\displaystyle q^{j}}$ is a bias term. If ${\displaystyle w_{1:T}=[w_{1},...,w_{T}]}$ is the sequence of words in the training corpus, training involves minimizing the negative log-likelihood (NLL) of the sequence, which is done by truncated backpropagation through time:

${\displaystyle NLL=-\sum _{t=1}^{T}\log P(w_{t}|w_{1:t-1})}$
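As a rough illustration of the softmax output layer and the NLL objective, the sketch below computes both with NumPy. The function names and the toy sizes are assumptions for the example, and the hidden states are supplied directly rather than produced by an LSTM.

```python
import numpy as np

def next_word_log_probs(h_t, P, q):
    """Log-softmax over the vocabulary for the scores h_t p^j + q^j."""
    scores = h_t @ P + q                 # one score per word j in V
    scores = scores - scores.max()       # shift for numerical stability
    return scores - np.log(np.exp(scores).sum())

def sequence_nll(hidden_states, targets, P, q):
    """Sum of -log P(w_t | w_{1:t-1}) over the sequence.

    hidden_states[t] is assumed to summarise the history before targets[t].
    """
    return -sum(
        next_word_log_probs(h, P, q)[w]
        for h, w in zip(hidden_states, targets)
    )

# Toy setting: m = 2 hidden units, |V| = 3 words, all-zero output layer.
P = np.zeros((2, 3))
q = np.zeros(3)
hidden_states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
targets = [0, 2]
nll = sequence_nll(hidden_states, targets, P, q)
```

With an all-zero output layer the predicted distribution is uniform, so `nll` equals the sequence length times `log(3)`, the loss of a model that has learned nothing.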

### Character-level Convolutional Neural Network

Let ${\displaystyle C}$ be the vocabulary of characters, ${\displaystyle d}$ be the dimensionality of character embeddings, and ${\displaystyle Q\in R^{d\times |C|}}$ be the matrix of character embeddings. Suppose that word ${\displaystyle k\in V}$ is made up of a sequence of characters ${\displaystyle [c_{1},...,c_{l}]}$, where ${\displaystyle l}$ is the length of word ${\displaystyle k}$. Then the character-level representation of ${\displaystyle k}$ is given by the matrix ${\displaystyle C^{k}\in R^{d\times l}}$, where the ${\displaystyle j}$th column corresponds to the character embedding for ${\displaystyle c_{j}}$.

Figure 4: First two rows are from Botha (2014) while the last six are from the current paper being explained. KN-4 is a Kneser-Ney 4-gram language model, and MLBL is the best performing morphological log-bilinear model from Botha (2014). Small/Large refers to model size and Word/Morph/Char are models with words/morphemes/characters as inputs respectively. (Source: [1])

A convolution between ${\displaystyle C^{k}}$ and a filter (or kernel) ${\displaystyle H\in R^{d\times w}}$ of width ${\displaystyle w}$ is applied, after which a bias is added, followed by a nonlinearity, to obtain a feature map ${\displaystyle f^{k}\in R^{l-w+1}}$. The ${\displaystyle i}$th element of ${\displaystyle f^{k}}$ is:

${\displaystyle f^{k}[i]=tanh(\langle C^{k}[*,i:i+w-1],H\rangle +b)}$

where ${\displaystyle C^{k}[*,i:i+w-1]}$ denotes columns ${\displaystyle i}$ through ${\displaystyle (i+w-1)}$ of ${\displaystyle C^{k}}$ and ${\displaystyle \langle A,B\rangle =Tr(AB^{T})}$ is the Frobenius inner product. Finally, take the max-over-time:

${\displaystyle y^{k}=max_{i}f^{k}[i]}$

as the feature corresponding to the filter ${\displaystyle H}$ (when applied to word ${\displaystyle k}$). The idea, the authors say, is to capture the most important feature for a given filter: "A filter is essentially picking out a character n-gram, where the size of the n-gram corresponds to the filter width". The framework therefore uses multiple filters of varying widths to obtain the feature vector for ${\displaystyle k}$. If a total of ${\displaystyle h}$ filters ${\displaystyle H_{1},...,H_{h}}$ are used, then ${\displaystyle y^{k}=[y_{1}^{k},...,y_{h}^{k}]}$ is the input representation of ${\displaystyle k}$.
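The convolution and max-over-time steps can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions: the names `char_cnn_feature` and `word_representation` are hypothetical, the character embeddings and filters are random rather than learned, and no padding is applied, so the word must be at least as long as the widest filter.

```python
import numpy as np

def char_cnn_feature(C_k, H, b):
    """max_i tanh(<C_k[:, i:i+w], H> + b) for one filter H of width w."""
    l = C_k.shape[1]
    w = H.shape[1]
    # Frobenius inner product of each width-w window with the filter.
    f_k = np.array([
        np.tanh(np.sum(C_k[:, i:i + w] * H) + b)
        for i in range(l - w + 1)
    ])
    return f_k.max()  # max-over-time

def word_representation(word, char_index, Q, filters):
    """Stack one max-pooled feature per filter into the vector y^k."""
    C_k = Q[:, [char_index[ch] for ch in word]]   # d x l character matrix
    return np.array([char_cnn_feature(C_k, H, b) for H, b in filters])

rng = np.random.default_rng(0)
chars = "abcdefghijklmnopqrstuvwxyz"
char_index = {ch: j for j, ch in enumerate(chars)}
d = 4                                             # character embedding size
Q = rng.standard_normal((d, len(chars)))          # d x |C| embedding matrix
filters = [(rng.standard_normal((d, w)), 0.0) for w in (1, 2, 3)]
y_k = word_representation("absurdity", char_index, Q, filters)
```

The output has one entry per filter regardless of word length, which is what makes it usable as a fixed-dimensional input to the next layer.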

### Highway Network

Figure 5: Nearest neighbor words (based on cosine similarity) of word representations from the large word-level and character-level (before and after highway layers) models trained on the PTB. (Source: [1])

A highway network layer, proposed in [8], computes the following function:

${\displaystyle z=t\odot g(W_{H}y+b_{H})+(1-t)\odot y}$

where ${\displaystyle g}$ is a nonlinearity, ${\displaystyle t=\sigma (W_{T}y+b_{T})}$ is called the transform gate, and ${\displaystyle (1-t)}$ is called the carry gate. Similar to the memory cells in LSTM networks, highway layers allow for training of deep networks by carrying some dimensions of the input directly to the output.
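A single highway layer can be sketched directly from the formula. In the NumPy example below the function name, the choice of ReLU for the nonlinearity ${\displaystyle g}$, and the toy weights are assumptions; the formula requires input and output to have the same dimension, so the weight matrices are square.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def highway_layer(y, W_H, b_H, W_T, b_T, g=relu):
    """z = t * g(W_H y + b_H) + (1 - t) * y, with t the transform gate."""
    t = sigmoid(W_T @ y + b_T)          # transform gate in (0, 1)
    return t * g(W_H @ y + b_H) + (1.0 - t) * y

rng = np.random.default_rng(0)
y = np.array([0.5, -1.0, 2.0])
W_H = rng.standard_normal((3, 3))
b_H = np.zeros(3)
W_T = np.zeros((3, 3))

# A strongly negative gate bias closes the transform gate: z is close to y.
z_carry = highway_layer(y, W_H, b_H, W_T, np.full(3, -50.0))
# A strongly positive gate bias opens it: z is close to g(W_H y + b_H).
z_transform = highway_layer(y, W_H, b_H, W_T, np.full(3, 50.0))
```

The two extreme bias settings show the carry and transform behaviours in isolation; a trained layer mixes them per dimension.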

Figure 6: Plot of character n-gram representations via PCA for English. Colors correspond to prefixes (red), suffixes (blue), hyphenated (orange), and all others (grey). (Source: [1])

The overall working of the model can be observed in Figure 1. The character-level CNN applies convolutions with multiple filters to the character embeddings and max-pools each filter's output over time to get a fixed-dimensional representation. This is then fed to the highway layer, which helps in encoding semantic features that do not depend on edit distance alone. The output of the highway layer is then fed into an LSTM that predicts the next word.

### Evaluation

Perplexity (PPL) is used to evaluate the performance of the models. The perplexity of a model over a sequence ${\displaystyle [w_{1},...,w_{T}]}$ is given by:

${\displaystyle PPL=exp({\frac {NLL}{T}})}$

where ${\displaystyle NLL}$ is calculated over the test set.
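As a small worked example of the perplexity formula, the sketch below computes PPL from the probabilities a model assigned to each test token. The function name and the input format (a plain list of per-token probabilities) are assumptions for illustration.

```python
import math

def perplexity(token_probs):
    """exp(NLL / T), where token_probs[t] is the model's P(w_t | w_{1:t-1})."""
    nll = -sum(math.log(p) for p in token_probs)
    return math.exp(nll / len(token_probs))

# A model that always spreads its mass uniformly over 4 words.
uniform_ppl = perplexity([0.25] * 10)
```

Here `uniform_ppl` is 4 (up to floating-point error): perplexity can be read as the effective number of equally likely choices the model is hesitating between at each step.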

The hyperparameters are tuned on the Penn Treebank (PTB), and the model is then applied to various morphologically rich languages: Czech, German, French, Spanish, Russian, and Arabic.

Penn Treebank is a large annotated corpus consisting of over 4.5 million words of American English [9]. Two versions of the model are trained by the authors to assess the trade-off between performance and size. As another baseline, two comparable LSTM models that use word embeddings (LSTM-Word-Small, LSTM-Word-Large) are also trained. The large model presented in the paper is on par with the existing state-of-the-art (Zaremba et al. 2014), despite having approximately 60% fewer parameters. The small model significantly outperforms other NLMs of similar size, as can be observed in Figure 3.

English is relatively simple from a morphological standpoint, so the next set of results, which the authors claim as the main contribution of the paper, focuses on languages with richer morphology. Results are compared against the morphological log-bilinear (MLBL) model from [5], which takes into account subword information through morpheme embeddings. On DATA-S, the smaller of the two multilingual data settings in [1], it is clear from Figure 4 that the character-level models outperform their word-level counterparts despite being smaller. The character models also outperform their morphological counterparts (both MLBL and LSTM architectures).

### Discussion

#### Learned Word Representations

In Figure 5, before the highway layers, the nearest neighbors of "you" are "your", "young", "four", and "youth", which are close to "you" in terms of edit distance. The highway layers, however, seem to enable encoding of semantic features that are not derivable from edit distance alone. After the highway layers, the nearest neighbor of "you" is "we", which is orthographically distinct from "you". The model also makes some clear mistakes (e.g. "his" and "hhs"), which is a drawback of this approach. The authors hypothesize that highway networks are especially well-suited to work with CNNs, adaptively combining local features detected by the individual filters.

#### Learned Character N-gram Representations

Each filter of the CharCNN is essentially learning to detect particular character n-grams. The initial expectation would be that each filter would learn to activate on different morphemes and then build up semantic representations of words from the identified morphemes. However, upon reviewing the character n-grams picked up by the filters, the authors found that they did not (in general) correspond to valid morphemes. The learned representations of all character n-grams are plotted via principal components analysis. Each character n-gram is fed into the CharCNN and the CharCNN’s output is used as the fixed dimensional representation for the corresponding character n-gram. From Figure 6, the model learns to differentiate between prefixes (red), suffixes (blue), and others (grey). They also find that the representations are particularly sensitive to character n-grams containing hyphens (orange).

### Conclusion

1. The work introduces a neural language model that utilizes only character-level inputs. Predictions are still made at the word-level. Despite having fewer parameters, the model outperforms baseline models that utilize word/morpheme embeddings in the input layer. The work questions the necessity of word embeddings as inputs for neural language modeling.
2. Analysis of word representations obtained from the character composition part of the model further indicates that the model is able to encode, from characters only, rich semantic and orthographic features.
3. The model requires additional convolution operations over characters and is thus slower than a comparable word-level model, which can perform a simple lookup at the input layer. The extra cost is manageable with optimized GPU implementations.

## Paper 2: Character-Level Language Modelling With Hierarchical Recurrent Neural Networks

Figure 7: Training an RNN-based CLM. (source:[2])

### Introduction

The previous approach used character-level inputs and word-level outputs. It gives state-of-the-art performance, but it still outputs a probability distribution over words and so does not completely handle out-of-vocabulary instances. Word-level embeddings also need to be stored in memory to compute the cross-entropy loss for this model. The motivation behind this work is to use characters as both the inputs and the outputs in order to handle rich morphology better. The difficulty is that Character Level Models (CLMs) have to consider a longer sequence of history tokens than Word Level Models (WLMs) to predict the next token, because their tokens are smaller units.

### Character-aware word-level language modeling

As shown in Figure 7, the RNN is trained to predict the next character ${\displaystyle x_{t+1}}$ by minimizing the cross-entropy loss of the softmax output, which represents the probability distribution of the next character. One of the most successful approaches to handling character-level inputs is to encode the arbitrary character sequence into a word embedding and feed this vector to a word-level RNN LM. The previously discussed work [1] uses a CNN to generate word embeddings and achieves state-of-the-art results on the English Penn Treebank corpus. Some works use bidirectional LSTMs [10] instead of CNNs to generate these embeddings. However, in all of these approaches, the LMs still generate output probabilities at the word level.

The approach in this work is different from the above ones in many ways.

1. First, the base model is a character-level RNN LM rather than a WLM, and it is extended to consider long-term contexts. The output probabilities are therefore generated at character-level clocks. The authors claim this property is extremely useful for character-level beam search in end-to-end speech recognition.
2. The authors propose hierarchical RNN-based LMs that combine the advantageous characteristics of both character-level and word-level LMs. The proposed network consists of a low-level and a high-level RNN. The low-level RNN takes character-level input, produces character-level output, and provides a short-term embedding to the high-level RNN, which operates as the word-level RNN.
3. This hierarchical LM can be extended for processing a longer period of information, such as sentences, topics, or other contexts.

### LSTMs With External Clock and Reset Signals

Figure 8: Hierarchical RNN (HRNN). (source:[2])

The LSTM equations introduced earlier can be generalized and extended to support clock and reset signals. They fit a generic RNN form by setting ${\displaystyle s_{t}=[c_{t},h_{t}]}$ and ${\displaystyle y_{t}=h_{t}}$, where ${\displaystyle c_{t}}$ is the memory cell. Any such generalized RNN can be converted to one that incorporates an external clock signal, written ${\displaystyle c_{t}}$ from here on (it now denotes the clock rather than the memory cell), as:

${\displaystyle s_{t}=(1-c_{t})s_{t-1}+c_{t}f(x_{t},s_{t-1})}$
${\displaystyle y_{t}=g(s_{t})}$

where ${\displaystyle c_{t}}$ is 0 or 1. The RNN updates its state and output only when ${\displaystyle c_{t}}$ = 1. Otherwise, when ${\displaystyle c_{t}}$ = 0, the state and output values remain the same as those of the previous step. The reset of RNNs is performed by setting ${\displaystyle s_{t-1}}$ to 0. Specifically the above equation becomes:

${\displaystyle s_{t}=(1-c_{t})(1-r_{t})s_{t-1}+c_{t}f(x_{t},(1-r_{t})s_{t-1})}$

where the reset signal ${\displaystyle r_{t}}$ = 0 or 1. When ${\displaystyle r_{t}}$ = 1, the RNN forgets the previous contexts. If the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable.
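The clock and reset mechanics can be sketched independently of any particular RNN. In the example below the wrapper name `clocked_step` and the toy transition `f(x, s) = s + x` (a running sum over a scalar state) are assumptions chosen so the effect of each signal is easy to trace by hand; they are not taken from [2].

```python
def clocked_step(f, x_t, s_prev, c_t, r_t):
    """s_t = (1 - c_t)(1 - r_t) s_{t-1} + c_t f(x_t, (1 - r_t) s_{t-1})."""
    s_kept = (1 - r_t) * s_prev          # reset wipes the previous state
    return (1 - c_t) * s_kept + c_t * f(x_t, s_kept)

def running_sum(x, s):
    """Toy transition function standing in for an RNN update."""
    return s + x

inputs = [1, 2, 3, 10]
clocks = [1, 0, 1, 1]    # the state is frozen at the second step
resets = [0, 0, 0, 1]    # the history is forgotten at the last step

state, states = 0, []
for x, c, r in zip(inputs, clocks, resets):
    state = clocked_step(running_sum, x, state, c, r)
    states.append(state)
# states == [1, 1, 4, 10]
```

The second input is ignored because the clock is 0, and the final state is 10 rather than 14 because the reset discards the accumulated sum before the update.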

### Hierarchical RNN

Figure 9: Two-level hierarchical LSTM (HLSTM) structures for CLMs. (source:[2])

The hierarchical RNN (HRNN) architectures proposed in this paper have several RNN modules with different clock rates as depicted in Figure 8.

For character-level language modeling, a two-level (${\displaystyle L=2}$) HRNN is used in the paper by letting ${\displaystyle l=1}$ be a character-level module and ${\displaystyle l=2}$ be a word-level module. The word-level module is clocked at the word boundary input. The input and softmax output layers are connected to the character-level module, and the current word boundary token information is given to the word-level module.
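To make the word-level clocking concrete, the sketch below derives a clock signal for the word-level module from a character sequence. It assumes, purely for illustration, that a whitespace character marks the word boundary; the function name `word_clock` is hypothetical, and the exact boundary tokens used in [2] are not reproduced here.

```python
def word_clock(chars, boundary=" "):
    """Return 1 at positions holding the (assumed) word boundary character."""
    return [1 if ch == boundary else 0 for ch in chars]

text = "a cat sat"
clock = word_clock(text)
# clock == [0, 1, 0, 0, 0, 1, 0, 0, 0]
```

The character-level module runs at every position, while a module driven by `clock` would only update at the two boundary positions.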

Two types of two-level HRNN CLM architectures are proposed. As shown in Figure 9, both models have two LSTM layers per submodule. In the HLSTM-A architecture, both LSTM layers in the character-level module receive one-hot encoded character input. Hence, the second layer of the character-level module is conditioned by the context vector. In contrast, in HLSTM-B, the second LSTM layer of the character-level module does not receive the character inputs but a word embedding from the first LSTM layer. Experiments conducted by the authors show that HLSTM-B is more efficient for CLM applications. The model is trained to generate a context vector that contains useful information about the probability distribution of the next word.

### Evaluation

Figure 10: Perplexities of CLMs on the WSJ corpus. (source:[2])

The models are compared with other WLMs in literature in terms of word-level perplexity (PPL). The word-level PPL of the models is directly converted from bits-per-character (BPC), which is the standard performance measure for CLMs, as follows:

${\displaystyle PPL=2^{BPC{\frac {N_{c}}{N_{w}}}}}$

where ${\displaystyle N_{c}}$ and ${\displaystyle N_{w}}$ are the number of characters and words in a test set, respectively.
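The conversion is a one-line computation. The sketch below uses made-up counts (they are not results from [2]) and a hypothetical function name to show how the average number of characters per word enters the exponent.

```python
def word_perplexity_from_bpc(bpc, num_chars, num_words):
    """PPL = 2 ** (BPC * N_c / N_w)."""
    return 2.0 ** (bpc * num_chars / num_words)

# Illustrative numbers only: 1 bit per character, 6 characters per word.
ppl = word_perplexity_from_bpc(bpc=1.0, num_chars=600, num_words=100)
# ppl == 64.0
```

Because the ratio of characters to words sits in the exponent, a small improvement in BPC translates into a much larger relative improvement in word-level perplexity.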

#### Wall Street Journal (WSJ) Corpus

Figure 11: Perplexities of WLMs on the WSJ corpus in the literature. (source:[2])

The Wall Street Journal (WSJ) corpus is designed for training and benchmarking automatic speech recognition systems. Figure 10 shows the perplexities of traditional mono-clock deep LSTM and HLSTM-based CLMs. The size ${\displaystyle N\times M}$ means that the network consists of ${\displaystyle N}$ LSTM layers, where each layer contains ${\displaystyle M}$ memory cells. The HLSTM models show better perplexities even when the number of LSTM cells or parameters is much smaller than that of the deep LSTM networks. The authors note: "It is important to reset the character-level modules at the word-level clocks for helping the character-level modules to better concentrate on the short-term information". As observed in Figure 10 and pointed out by the authors, removing the reset functionality of the character-level module of the HLSTM-B model results in degraded performance.

The perplexities of WLMs in the literature are presented in Figure 11. The Kneser-Ney (KN) smoothed 5-gram model (KN-5) is a strong non-neural WLM baseline, and all HLSTM models in Figure 10 show better perplexities than KN-5 does.

#### End-to-end automatic speech recognition (ASR)

Figure 12: End-to-end ASR results on the WSJ evaluation set. (source:[2])

The proposed CLMs are applied to the end-to-end automatic speech recognition (ASR). The CLMs are trained with WSJ training data. Unlike WLMs, the proposed CLMs have a very small number of parameters, so they can be employed for real-time character-level beam search.

The results are summarized in Figure 12. The authors observe a strong correlation between the perplexity of the LM and the word error rate (WER). As shown in the figure, a better WER can be achieved by replacing the traditional deep LSTM (4x1024) CLM with the proposed HLSTM-B (4x512) CLM, while reducing the number of LM parameters to 30% of the original.

### Conclusion

1. In this paper, hierarchical RNN (HRNN) based CLMs are proposed. The HRNN consists of several submodules with different clock rates. Therefore, it is capable of learning long-term dependencies as well as short-term details.
2. As shown in the WSJ speech recognition example, the proposed model can be employed for real-time speech recognition with fewer than 10 million parameters.
3. CLMs can also handle out-of-vocabulary (OOV) words by nature, which is a great advantage for end-to-end speech recognition and many NLP tasks.

Although the character-level language models explored here do a good job of handling rich morphology, diversity in representation is missing. Frequent n-grams in the training set can cause the model to overfit and produce rigid results. Variational frameworks for language modeling have recently been investigated to address this issue. The models discussed on this page also do not take the global sentence context into account and operate only on local information. Variational frameworks coupled with smart attention mechanisms could therefore result in a model that produces diverse representations with global sentence context. In the final page, we will explore such a framework that can generate diverse sentences from global sentence representations. A hierarchical form of the Variational Autoencoder will be studied in an attempt to analyze the effect of hierarchy in the posterior and the prior on representation.

All the pictures on this page are borrowed from [1] and [2].

## Annotated Bibliography

1. Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016, February), "Character-Aware Neural Language Models", In AAAI, pp. 2741-2749.
2. Hwang, K., & Sung, W. (2017, March), "Character-level language modeling with hierarchical recurrent neural networks.", In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 5720-5724). IEEE.
3. Mikolov, T.; Karafiat, M.; Burget, L.; Cernocky, J.; and Khudanpur, S. 2010, "Recurrent Neural Network Based Language Model.", In Proceedings of INTERSPEECH.
4. Mikolov, T.; Deoras, A.; Kombrink, S.; Burget, L.; and Cernocky, J. 2011, "Empirical Evaluation and Combination of Advanced Language Modeling Techniques.", In Proceedings of INTERSPEECH.
5. Botha, J., and Blunsom, P. 2014, "Compositional Morphology for Word Representations and Language Modelling.", In Proceedings of ICML.
6. Hochreiter, S., and Schmidhuber, J. 1997, "Long Short-Term Memory.", Neural Computation 9:1735–1780.
7. Sundermeyer, M.; Schluter, R.; and Ney, H. 2012, "LSTM Neural Networks for Language Modeling."
8. Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015, "Training Very Deep Networks."
9. Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini., 1993, "Building a large annotated corpus of English: The Penn Treebank.", Computational linguistics 19.2 (1993): 313-330.
10. Yasumasa Miyamoto and Kyunghyun Cho, 2016, "Gated word-character recurrent language model.", Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1992–1997.