This page primarily follows Kim et al. (2016) and Hwang and Sung (2017).
Principal Author: Kevin Dsouza
This page covers character-level language models implemented using long short-term memory (LSTM) networks. The first half introduces a character-aware neural model, and the second half builds on this idea with a hierarchical structure for building character-level models.
This page builds on Recurrent Neural Networks and Natural Language Processing.
Character-level language models depart from word-level language models and come under the general category of language models.
Traditional methods in language modeling involve making an n-th order Markov assumption and estimating n-gram probabilities via counting. Count-based models are simple to train, but due to data sparsity the probabilities of rare n-grams can be poorly estimated. Neural language models (NLMs) address the issue of n-gram data sparsity by utilizing word embeddings. These word embeddings derived from NLMs exhibit the property that semantically close words are close in the induced vector space. Even though NLMs outperform count-based n-gram language models, they are oblivious to subword information (e.g. morphemes). Embeddings of rare words can thus be poorly estimated, leading to high perplexities (perplexity measures how well a probability distribution predicts a sample), which is especially problematic in morphologically rich languages.
In this work, the authors propose a language model that leverages subword information through a character-level convolutional neural network (CNN), whose output is used as input to a recurrent neural network language model (RNN-LM). Unlike previous works that utilize subword information via morphemes, this model does not require morphological tagging as a pre-processing step.
Long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) addresses the problem of learning long-range dependencies in recurrent neural networks by adding a memory cell vector at each time step. One step of an LSTM takes as input $\mathbf{x}_t$ (input vector at time $t$), $\mathbf{h}_{t-1}$ (hidden state vector at time $t-1$), $\mathbf{c}_{t-1}$ (memory cell vector at time $t-1$) and produces $\mathbf{h}_t$ (hidden state vector at time $t$), $\mathbf{c}_t$ (memory cell vector at time $t$) via the following intermediate calculations:

$$\mathbf{i}_t = \sigma(\mathbf{W}^i \mathbf{x}_t + \mathbf{U}^i \mathbf{h}_{t-1} + \mathbf{b}^i)$$
$$\mathbf{f}_t = \sigma(\mathbf{W}^f \mathbf{x}_t + \mathbf{U}^f \mathbf{h}_{t-1} + \mathbf{b}^f)$$
$$\mathbf{o}_t = \sigma(\mathbf{W}^o \mathbf{x}_t + \mathbf{U}^o \mathbf{h}_{t-1} + \mathbf{b}^o)$$
$$\mathbf{g}_t = \tanh(\mathbf{W}^g \mathbf{x}_t + \mathbf{U}^g \mathbf{h}_{t-1} + \mathbf{b}^g)$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

Here $\sigma(\cdot)$ and $\tanh(\cdot)$ are the element-wise sigmoid and hyperbolic tangent functions, $\odot$ is the element-wise multiplication operator, and $\mathbf{i}_t$, $\mathbf{f}_t$, $\mathbf{o}_t$ are referred to as the input, forget, and output gates. At $t = 1$, $\mathbf{h}_0$ and $\mathbf{c}_0$ are initialized to zero vectors. Parameters of the LSTM are $\mathbf{W}^j$, $\mathbf{U}^j$, $\mathbf{b}^j$ for $j \in \{i, f, o, g\}$.
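The update equations above can be sketched in a few lines of NumPy. This is a minimal illustration with toy dimensions, not the authors' implementation; the parameter dictionary layout is made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: gates i, f, o, candidate g, then cell and hidden updates."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # output gate
    g = np.tanh(p["Wg"] @ x_t + p["Ug"] @ h_prev + p["bg"])   # candidate cell
    c_t = f * c_prev + i * g        # additive cell update (key to gradient flow)
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
p = {}
for gate in "ifog":
    p["W" + gate] = 0.1 * rng.standard_normal((n_hid, n_in))
    p["U" + gate] = 0.1 * rng.standard_normal((n_hid, n_hid))
    p["b" + gate] = np.zeros(n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)   # h_0 and c_0 start as zero vectors
h, c = lstm_step(rng.standard_normal(n_in), h, c, p)
print(h.shape, c.shape)  # (3,) (3,)
```

Because $\mathbf{c}_t$ is updated additively rather than through a repeated matrix multiplication, gradients can flow through the cell across many steps.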
Memory cells in the LSTM are additive with respect to time, alleviating the vanishing gradient problem. This is the most important distinction of the LSTM: it allows an uninterrupted gradient flow through the memory cell. Exploding gradients remain an issue, though in practice simple gradient clipping works well. LSTMs have outperformed vanilla RNNs on many tasks, including language modeling.
Let $\mathcal{V}$ be the fixed-size vocabulary of words. A language model specifies a distribution over $w_{t+1}$ (whose support is $\mathcal{V}$) given the historical sequence $w_{1:t} = [w_1, \ldots, w_t]$. A recurrent neural network language model (RNN-LM) applies an affine transformation to the hidden layer followed by a softmax:

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp(\mathbf{h}_t \cdot \mathbf{p}^j + q^j)}{\sum_{j' \in \mathcal{V}} \exp(\mathbf{h}_t \cdot \mathbf{p}^{j'} + q^{j'})}$$
where $\mathbf{p}^j$ is the $j$-th column of the output embedding matrix $\mathbf{P}$ and $q^j$ is a bias term. If $w_{1:T} = [w_1, \ldots, w_T]$ is the sequence of words in the training corpus, training involves minimizing the negative log-likelihood (NLL) of the sequence,

$$\mathrm{NLL} = -\sum_{t=1}^{T} \log \Pr(w_t \mid w_{1:t-1}),$$

which is done by truncated backpropagation through time.
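As a concrete illustration, the following pure-Python sketch computes the softmax distribution over the vocabulary from a hidden state; the toy hidden state, output-embedding columns, and biases are invented for the example.

```python
import math

def rnn_lm_distribution(h, P_cols, q):
    """Affine transformation + softmax over the vocabulary.
    h: hidden state; P_cols: list of output-embedding columns; q: per-word biases."""
    scores = [sum(hi * pi for hi, pi in zip(h, p_j)) + q_j
              for p_j, q_j in zip(P_cols, q)]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = rnn_lm_distribution([0.1, -0.2],
                            [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
                            [0.0, 0.0, 0.0])
print(round(sum(probs), 6))  # 1.0 — a valid distribution over the 3-word vocabulary
```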
Let $\mathcal{C}$ be the vocabulary of characters, $d$ be the dimensionality of character embeddings, and $\mathbf{Q} \in \mathbb{R}^{d \times |\mathcal{C}|}$ be the matrix of character embeddings. Suppose that word $k \in \mathcal{V}$ is made up of a sequence of characters $[c_1, \ldots, c_l]$, where $l$ is the length of word $k$. Then the character-level representation of $k$ is given by the matrix $\mathbf{C}^k \in \mathbb{R}^{d \times l}$, where the $j$-th column corresponds to the character embedding for $c_j$.
A convolution between $\mathbf{C}^k$ and a filter (or kernel) $\mathbf{H} \in \mathbb{R}^{d \times w}$ of width $w$ is applied, after which a bias is added followed by a nonlinearity to obtain a feature map $\mathbf{f}^k \in \mathbb{R}^{l - w + 1}$. The $i$-th element of $\mathbf{f}^k$ is:

$$\mathbf{f}^k[i] = \tanh\left(\langle \mathbf{C}^k[\ast, i:i+w-1], \mathbf{H} \rangle + b\right)$$
where $\mathbf{C}^k[\ast, i:i+w-1]$ is the $i$-th to $(i+w-1)$-th columns of $\mathbf{C}^k$ and $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{Tr}(\mathbf{A}\mathbf{B}^T)$ is the Frobenius inner product. Finally, take the max-over-time:

$$y^k = \max_i \mathbf{f}^k[i]$$
as the feature corresponding to the filter $\mathbf{H}$ (when applied to word $k$). The idea, the authors say, is to capture the most important feature for a given filter: "A filter is essentially picking out a character n-gram, where the size of the n-gram corresponds to the filter width." Thus the framework uses multiple filters of varying widths to obtain the feature vector for $k$. So if a total of $h$ filters $\mathbf{H}_1, \ldots, \mathbf{H}_h$ are used, then $\mathbf{y}^k = [y_1^k, \ldots, y_h^k]$ is the input representation of $k$.
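A sketch of the narrow convolution and max-over-time pooling in NumPy, assuming made-up filter widths (2, 3, 4) and random embeddings; the real model uses many more filters.

```python
import numpy as np

def char_cnn_word_repr(C, filters):
    """C: d x l character-embedding matrix of one word.
    filters: list of (H, b) pairs, H of shape d x w.
    Returns the vector y^k with one max-over-time feature per filter."""
    d, l = C.shape
    feats = []
    for H, b in filters:
        w = H.shape[1]
        # f^k[i] = tanh(<C[:, i:i+w], H> + b): Frobenius inner product at each offset
        f = [np.tanh(np.sum(C[:, i:i + w] * H) + b) for i in range(l - w + 1)]
        feats.append(max(f))   # max-over-time keeps the strongest n-gram response
    return np.array(feats)

rng = np.random.default_rng(1)
d, l = 5, 7                                    # embedding dim, word length
C = rng.standard_normal((d, l))
filters = [(0.1 * rng.standard_normal((d, w)), 0.0) for w in (2, 3, 4)]
y = char_cnn_word_repr(C, filters)
print(y.shape)  # (3,) — one feature per filter
```

Each filter of width $w$ responds most strongly to one character $w$-gram in the word, which is what the quoted intuition describes.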
Highway networks, recently proposed by Srivastava et al. (2015), compute the following function:

$$\mathbf{z} = \mathbf{t} \odot g(\mathbf{W}_H \mathbf{y} + \mathbf{b}_H) + (\mathbf{1} - \mathbf{t}) \odot \mathbf{y}$$
where $g$ is a nonlinearity, $\mathbf{t} = \sigma(\mathbf{W}_T \mathbf{y} + \mathbf{b}_T)$ is called the transform gate, and $(\mathbf{1} - \mathbf{t})$ is called the carry gate. Similar to the memory cells in LSTM networks, highway layers allow for training of deep networks by carrying some dimensions of the input directly to the output.
The overall working of the model can be observed in Figure 1. Essentially, the character-level CNN applies convolutions over the character embeddings with multiple filters and max-pools over these to get a fixed-dimensional representation. This is then fed to the highway layer, which helps encode semantic features that are not dependent on edit distance alone. The output of the highway layer is then fed into an LSTM that predicts the next word.
Perplexity (PPL) is used to evaluate the performance of the models. The perplexity of a model over a sequence $[w_1, \ldots, w_T]$ is given by:

$$\mathrm{PPL} = \exp\left(\frac{\mathrm{NLL}}{T}\right)$$
where the NLL is calculated over the test set.
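A quick sanity check of the formula in pure Python: a model that assigns uniform probability $1/|\mathcal{V}|$ to every token has perplexity exactly $|\mathcal{V}|$.

```python
import math

def perplexity(log_probs):
    """log_probs: log-probability the model assigned to each test token.
    PPL = exp(NLL / T), where NLL = -sum of the log-probabilities."""
    nll = -sum(log_probs)
    return math.exp(nll / len(log_probs))

# uniform model over a 10-word vocabulary -> PPL of 10
ppl = perplexity([math.log(1 / 10)] * 50)
print(round(ppl, 6))  # 10.0
```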
The hyperparameters are tuned on the Penn Treebank (PTB), and the model is then applied to various morphologically rich languages: Czech, German, French, Spanish, Russian, and Arabic.
The Penn Treebank is a large annotated corpus consisting of over 4.5 million words of American English. Two versions of the model are trained by the authors to assess the trade-off between performance and size. As another baseline, two comparable LSTM models that use word embeddings (LSTM-Word-Small, LSTM-Word-Large) are also trained. The large model presented in the paper is on par with the existing state of the art (Zaremba et al. 2014), despite having approximately 60% fewer parameters. The small model significantly outperforms other NLMs of similar size, as can be observed in Figure 3.

English is relatively simple from a morphological standpoint, and thus the next set of results, which the authors claim as the main contributions of this paper, are focused on languages with richer morphology. Results are compared against the morphological log-bilinear (MLBL) model of Botha and Blunsom (2014), which takes subword information into account through morpheme embeddings. On DATA-S it is clear from Figure 4 that the character-level models outperform their word-level counterparts despite being smaller. The character models also outperform their morphological counterparts (both MLBL and LSTM architectures).
Observing Figure 5: before the highway layers, the nearest neighbors of you are your, young, four, and youth, which are close to you in terms of edit distance. The highway layers, however, seem to enable encoding of semantic features that are not derivable from edit distance alone. After the highway layers, the nearest neighbor of you is we, which is orthographically distinct from you. The model also makes some clear mistakes (e.g. his and hhs), which is a drawback of this approach. The authors hypothesize that highway networks are especially well-suited to work with CNNs, adaptively combining local features detected by the individual filters.
Each filter of the CharCNN essentially learns to detect particular character n-grams. The initial expectation would be that each filter learns to activate on different morphemes and then builds up semantic representations of words from the identified morphemes. However, upon reviewing the character n-grams picked up by the filters, the authors found that they did not (in general) correspond to valid morphemes. The learned representations of all character n-grams are plotted via principal components analysis: each character n-gram is fed into the CharCNN, and the CharCNN's output is used as the fixed-dimensional representation for the corresponding n-gram. From Figure 6, the model learns to differentiate between prefixes (red), suffixes (blue), and others (grey). The authors also find that the representations are particularly sensitive to character n-grams containing hyphens (orange).
The previous approach considered character-level inputs and word-level outputs. This gives state-of-the-art performance but still outputs a probability distribution over words and does not completely handle out-of-vocabulary instances. Also, word-level embeddings need to be stored in memory to compute the cross-entropy loss for this model. The motivation behind this work is to consider characters as both the inputs and the outputs in order to handle rich morphology in a better way. The difficulty is that character-level models (CLMs) have to consider a longer sequence of history tokens to predict the next token than word-level models (WLMs), due to the smaller unit of tokens.
As shown in Figure 7, the RNN is trained to predict the next character by minimizing the cross-entropy loss of the softmax output, which represents the probability distribution of the next character. One of the most successful approaches to handling character-level inputs is to encode the arbitrary character sequence into a word embedding and feed this vector to a word-level RNN-LM. The previously discussed work uses a CNN to generate word embeddings and achieves state-of-the-art results on the English Penn Treebank corpus. Some works use bidirectional LSTMs instead of CNNs to generate these embeddings. However, in all of these approaches, the LMs still generate the output probabilities at the word level.
The approach in this work differs from the above ones in that characters serve as both the inputs and the outputs, and the model is organized as a hierarchy of modules operating at different clock rates.
The LSTM equations previously introduced can be generalized and extended to support clock and reset signals. The equations can be generalized by writing the state update as $\mathbf{s}_t = f(\mathbf{x}_t, \mathbf{s}_{t-1})$ and the output as $\mathbf{y}_t = g(\mathbf{s}_t)$; the LSTM fits this form by setting $\mathbf{s}_t = \{\mathbf{c}_t, \mathbf{h}_t\}$ and $\mathbf{y}_t = \mathbf{h}_t$. Any such generalized RNN can be converted to one that incorporates an external clock signal $z_t$ as:

$$\mathbf{s}_t = z_t \, f(\mathbf{x}_t, \mathbf{s}_{t-1}) + (1 - z_t) \, \mathbf{s}_{t-1}$$
$$\mathbf{y}_t = z_t \, g(\mathbf{s}_t) + (1 - z_t) \, \mathbf{y}_{t-1}$$
where $z_t$ is 0 or 1. The RNN updates its state and output only when $z_t = 1$. Otherwise, when $z_t = 0$, the state and output values remain the same as those of the previous step. The reset of the RNN is performed by setting $\mathbf{s}_{t-1}$ to 0. Specifically, the above equations become:

$$\mathbf{s}_t = z_t \, f(\mathbf{x}_t, (1 - r_t) \, \mathbf{s}_{t-1}) + (1 - z_t) \, \mathbf{s}_{t-1}$$
$$\mathbf{y}_t = z_t \, g(\mathbf{s}_t) + (1 - z_t) \, \mathbf{y}_{t-1}$$
where the reset signal $r_t$ is 0 or 1. When $r_t = 1$, the RNN forgets the previous contexts. If the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable.
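The clock/reset logic can be sketched as a thin wrapper around any state-update function (pure Python; the accumulator used as the update function is a toy stand-in invented for the example):

```python
def clocked_step(f, x_t, s_prev, z_t, r_t):
    """Generalized RNN update with clock z_t and reset r_t (each 0 or 1)."""
    if z_t == 0:
        return s_prev                                   # clock off: hold the state
    s_in = [0.0] * len(s_prev) if r_t == 1 else s_prev  # reset: forget prior context
    return f(x_t, s_in)

acc = lambda x, s: [s[0] + x]                  # toy state-update: running sum

s = [0.0]
s = clocked_step(acc, 3.0, s, z_t=1, r_t=0)    # update: s = [3.0]
s = clocked_step(acc, 5.0, s, z_t=0, r_t=0)    # clock off: s stays [3.0]
s = clocked_step(acc, 5.0, s, z_t=1, r_t=0)    # update: s = [8.0]
s = clocked_step(acc, 2.0, s, z_t=1, r_t=1)    # reset then update: s = [2.0]
print(s)  # [2.0]
```

In the HRNN, a word-level module is stepped with this kind of clock only at word boundaries, while the character-level module runs at every character.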
The hierarchical RNN (HRNN) architectures proposed in this paper have several RNN modules with different clock rates as depicted in Figure 8.
For character-level language modeling, a two-level ($L = 2$) HRNN is used in the paper by letting the first-level module be a character-level module and the second-level module be a word-level module. The word-level module is clocked at the word boundary input, i.e. when the input character is a word delimiter such as a space. The input and softmax output layers are connected to the character-level module, and the current word boundary information is given to the word-level module.
Two types of two-level HRNN CLM architectures are proposed. As shown in Figure 9, both models have two LSTM layers per submodule. In the HLSTM-A architecture, both LSTM layers in the character-level module receive the one-hot encoded character input, and the second layer of the character-level module is conditioned on the context vector. In contrast, in HLSTM-B the second LSTM layer of the character-level module does not receive the character inputs but rather a word embedding from the first LSTM layer. The experimental results conducted by the authors show that HLSTM-B is more efficient for CLM applications. The model is trained to generate a context vector that contains useful information about the probability distribution of the next word.
The models are compared with other WLMs in the literature in terms of word-level perplexity (PPL). The word-level PPL of the models is directly converted from bits-per-character (BPC), the standard performance measure for CLMs, as follows:

$$\mathrm{PPL} = 2^{\,\mathrm{BPC} \cdot N_c / N_w}$$
where $N_c$ and $N_w$ are the number of characters and words in the test set, respectively.
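The conversion is a one-liner; as a sanity check, if a test set had exactly as many characters as words, a BPC of 1 would correspond to a word-level PPL of 2 (the counts below are invented for illustration):

```python
def bpc_to_word_ppl(bpc, n_chars, n_words):
    """Word-level perplexity from bits-per-character: PPL = 2 ** (BPC * N_c / N_w)."""
    return 2.0 ** (bpc * n_chars / n_words)

print(bpc_to_word_ppl(1.0, 100, 100))  # 2.0
# e.g. 1.5 BPC on a test set averaging 5.6 characters per word (spaces included)
print(round(bpc_to_word_ppl(1.5, 560_000, 100_000), 1))
```

Because English words average several characters each, small differences in BPC translate into large differences in word-level PPL.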
The Wall Street Journal (WSJ) corpus is designed for training and benchmarking automatic speech recognition systems. Figure 10 shows the perplexities of traditional mono-clock deep LSTM CLMs and HLSTM-based CLMs. The size $L \times N$ means that the network consists of $L$ LSTM layers, where each layer contains $N$ memory cells. The HLSTM models show better perplexity even when the number of LSTM cells or parameters is much smaller than that of the deep LSTM networks. "It is important to reset the character-level modules at the word-level clocks for helping the character-level modules to better concentrate on the short-term information." As observed in Figure 10 and pointed out by the authors, removing the reset functionality of the character-level module of the HLSTM-B model results in degraded performance.
The perplexities of WLMs in the literature are presented in Figure 11. The Kneser-Ney (KN) smoothed 5-gram model (KN-5) is a strong non-neural WLM baseline and all HLSTM models in Figure 11 show better perplexities than KN-5 does.
The proposed CLMs are applied to end-to-end automatic speech recognition (ASR). The CLMs are trained on the WSJ training data. Unlike WLMs, the proposed CLMs have a very small number of parameters, so they can be employed for real-time character-level beam search.
The results are summarized in Figure 12. The authors observe a strong correlation between LM perplexity and word error rate (WER). As shown in the table, a better WER can be achieved by replacing the traditional deep LSTM (4x1024) CLM with the proposed HLSTM-B (4x512) CLM, while reducing the number of LM parameters to 30%.
Although the character-level language models explored here do a good job of handling rich morphology, they lack diversity in representation: frequent n-grams in the training set lead the model to overfit and produce rigid outputs. Recently, variational frameworks for language modeling have been investigated to address this issue. Also, the models discussed on this page do not take the global sentence context into account and operate only on local information. Thus, variational frameworks coupled with attention mechanisms can result in a model that produces diverse representations with global sentence context. On the final page, we will explore such a framework that can generate diverse sentences from global sentence representations. A hierarchical form of the variational autoencoder will be studied in an attempt to analyze the effect of hierarchy in the posterior and the prior on representation.
All the pictures on this page are borrowed from Kim et al. (2016) and Hwang and Sung (2017).