Structured Prediction part three - Training a linear-chain CRF
In this final part of the series on structured prediction with linear-chain CRFs, we will use our implementation from part two
to train a model on real data.
To learn such a model, we need a dataset with examples consisting of input sentences annotated with POS tags.
We will choose the Universal Dependencies dataset (Silveira et al., 2014).
Then the things we still need to implement are:
A Vocabulary to convert from strings to numerical values for computational models.
A TaggingDataset to convert all our data to Tensors that can be processed by PyTorch.
A train() loop to train our CRF and feature-extractor end-to-end on data.
A test() loop to test a trained model on new data.
Additionally, we will slightly change the Encoder and Tagger from part two to incorporate
a character-based model.
Imports
Let’s install and import the libraries we need (TorchNLP isn’t part of the default runtime in Google Colab,
which I used for the implementation):
!pip install pytorch-nlp
Google Colab offers sessions with a GPU, which is definitely nice (and maybe even necessary) for training on the Universal Dependencies dataset. If you’re coding everything yourself, select a GPU runtime in the runtime settings and make sure that CUDA is in fact available.
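A minimal check; the device variable defined here is an assumption that the sketches further below reuse to move tensors to the GPU:

```python
import torch

# Use the GPU if Colab gave us one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on {device}")
```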
The Vocabulary & Dataset
First, we’ll implement the vocabulary, which is a class that reads sentences as lists of strings, and
converts them to indices. Something pragmatic we often do for sequence prediction with neural methods is use an <UNK>-token.
In our training set, if a word occurs very infrequently, we probably cannot learn meaningful embeddings for it
and we can replace the occurrences of that word by <UNK>. This means that we will learn a kind of average embedding
for all infrequent words, and we can use this token again at test-time. At test-time there will inevitably be words
that don’t occur in the training set, and since we don’t have trained embeddings for those, they will map onto the <UNK>-token.
One other approach to deal with unknown words in the test data is using a character model. It’s unlikely that we
won’t encounter a certain character, so when a word is unknown, a character model can represent it more meaningfully.
One could imagine that at test time the model encounters a word it hasn’t seen during training that ends in “-ing”; the word model will represent it with the unknown token <UNK>, but the character model might recognize that it is likely a verb. Additionally, if there are
misspelled words in the test data, the word vocabulary won’t know them, but the character model might recognize them.
Adding a character model makes our method more robust to noise in the data.
Note that adding a character model doesn’t
change anything about the way we described our implementation in part two of the series. We simply add a learned representation
to the representations we already had for the words, but this time character-based. The only change lies in
batching, which I’ll discuss below.
The code for both the Vocabulary and the TaggingDataset below is very straightforward, so if you’re familiar with these kinds of classes, just skip them and go to the part below where we look at the Universal Dependencies dataset.
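Here is a minimal sketch of what such a Vocabulary class could look like; the special tokens and the min_freq cutoff are assumptions, and the exact class in the Colab may differ slightly:

```python
from collections import Counter

PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"


class Vocabulary:
    """Maps tokens to indices, replacing infrequent tokens by <UNK>."""

    def __init__(self, sequences, min_freq=1):
        counts = Counter(token for sequence in sequences for token in sequence)
        kept = sorted(t for t, c in counts.items() if c >= min_freq)
        self.index_to_token = [PAD_TOKEN, UNK_TOKEN] + kept
        self.token_to_index = {t: i for i, t in enumerate(self.index_to_token)}
        self.pad_idx = self.token_to_index[PAD_TOKEN]
        self.unk_idx = self.token_to_index[UNK_TOKEN]

    def __len__(self):
        return len(self.index_to_token)

    def encode(self, sequence):
        # Tokens that were filtered out (or never seen) map onto <UNK>.
        return [self.token_to_index.get(t, self.unk_idx) for t in sequence]

    def decode(self, indices):
        return [self.index_to_token[i] for i in indices]
```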
We will use the above Vocabulary-class three times in the following, once for the input data consisting of words,
once for the input data processed character-by-character,
and once for the target data consisting of POS tags.
Before we implement the dataset class, let’s have a look at what
a batch looks like if we add characters. Everything becomes a bit more complicated in the code, because we want
a batched implementation of the forward pass. Recall that in part two of this series, we had a vector of features for
each word produced by the bidirectional LSTM. We now also want to add a vector that represents that same word broken up in characters. We denoted by
\(\mathbf{\bar{H}} \in \mathbb{R}^{m \times 2d_h}\) the hidden vectors for each input word in the sentence of length \(m\) as produced by the biLSTM.
We want to obtain some character-based features that have the exact same size, so we can add them.
Let’s denote the number of characters of the longest word in this sentence by \(k\) and define the character input by
\(\mathbf{x}_c \in \mathbb{R}^{m \times k \times |C|}\) (changing the definition of the input sequence of words from \(\mathbf{x}\) (in part two) to \(\mathbf{x}_w\)).
If we want to add these to the word features in a batched fashion
instead of a loop, we need to get a
vector that has the same size as the word features that the biLSTM returned. This means we need to have character sequences for the padding tokens
in our batch
as well. This feels a bit silly, because we will just be adding zeros to zeros, but it’s simply done so we can implement everything in
a batched fashion instead of with a loop. It also means we need to use the same hidden size for the bidirectional LSTM we’ll use for the characters. A batch that contains characters will hold tuples of examples of the form
\[ (\mathbf{x}_w, \mathbf{x}_c, \mathbf{y}) \in \mathbb{R}^{m \times |I|} \times \mathbb{R}^{m \times k \times |C|} \times \mathbb{R}^{m \times |S|}, \]
where \(\mathbf{x}_w\) is the input sequence in words, with \(|I|\) the size of the input vocabulary, \(\mathbf{x}_c\) is the input sequence broken up in characters for each word, and \(\mathbf{y}\) the tag sequence with \(|S|\) the number of tags in our dataset.
To be able to add the character-based representations to the word representations, we pad each sequence in the batch to the maximum sentence length \(m\) that occurs in the batch (both the sequence of words and the sequence of character sequences), and we pad each word to the maximum number of characters \(k\) appearing in the entire batch. Below is a new graphical depiction of a batch with batch size \(B = 2\).
The next class to implement is the TaggingDataset, which holds our data:
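A sketch of what this could look like; the collate method does the padding described above, and the pad_idx convention is an assumption:

```python
import torch
from torch.utils.data import Dataset


class TaggingDataset(Dataset):
    """Holds (tokens, tags) pairs and turns them into padded index tensors.

    Assumes word_vocab, char_vocab and tag_vocab are Vocabulary objects
    as sketched above.
    """

    def __init__(self, examples, word_vocab, char_vocab, tag_vocab):
        self.examples = examples          # list of (tokens, tags) tuples
        self.word_vocab = word_vocab
        self.char_vocab = char_vocab
        self.tag_vocab = tag_vocab

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        tokens, tags = self.examples[idx]
        word_ids = self.word_vocab.encode(tokens)
        char_ids = [self.char_vocab.encode(list(token)) for token in tokens]
        tag_ids = self.tag_vocab.encode(tags)
        return word_ids, char_ids, tag_ids

    @staticmethod
    def collate(batch, pad_idx=0):
        """Pads a list of examples to the longest sentence / longest word."""
        max_len = max(len(words) for words, _, _ in batch)
        max_chars = max(len(chars) for _, char_seqs, _ in batch for chars in char_seqs)
        words_t = torch.full((len(batch), max_len), pad_idx, dtype=torch.long)
        chars_t = torch.full((len(batch), max_len, max_chars), pad_idx, dtype=torch.long)
        tags_t = torch.full((len(batch), max_len), pad_idx, dtype=torch.long)
        for i, (words, char_seqs, tags) in enumerate(batch):
            words_t[i, :len(words)] = torch.tensor(words)
            tags_t[i, :len(tags)] = torch.tensor(tags)
            for j, chars in enumerate(char_seqs):
                chars_t[i, j, :len(chars)] = torch.tensor(chars)
        return words_t, chars_t, tags_t
```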
Let’s also adjust the encoder and tagger from part two to process the character sequences:
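A sketch of an adjusted encoder; the embedding sizes and the exact module layout are assumptions, but it follows the idea above of running a character biLSTM per word and adding its final states to the word features:

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Word-level biLSTM encoder augmented with a character biLSTM.

    The character features are built per word and simply added to the
    word-level hidden states, so both biLSTMs use the same hidden size d_h.
    """

    def __init__(self, n_words, n_chars, word_dim, char_dim, hidden_dim, pad_idx=0):
        super().__init__()
        self.word_embedding = nn.Embedding(n_words, word_dim, padding_idx=pad_idx)
        self.char_embedding = nn.Embedding(n_chars, char_dim, padding_idx=pad_idx)
        self.word_lstm = nn.LSTM(word_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.char_lstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, words, chars):
        # words: (B, m), chars: (B, m, k)
        B, m, k = chars.shape
        word_h, _ = self.word_lstm(self.word_embedding(words))        # (B, m, 2*d_h)
        char_emb = self.char_embedding(chars.view(B * m, k))          # (B*m, k, char_dim)
        _, (char_h, _) = self.char_lstm(char_emb)                     # (2, B*m, d_h)
        # Concatenate the final forward and backward states per word.
        char_h = torch.cat([char_h[0], char_h[1]], dim=-1).view(B, m, -1)  # (B, m, 2*d_h)
        return word_h + char_h
```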
And the tagger:
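A sketch of the tagger; the loss and predict methods shown here are the plain bi-LSTM variant (cross-entropy and argmax over emission scores), which get replaced by the CRF’s negative log-likelihood and Viterbi decoding when the ChainCRF from part two is plugged in:

```python
import torch.nn as nn


class Tagger(nn.Module):
    """Turns encoder features into per-tag emission scores."""

    def __init__(self, encoder, hidden_dim, n_tags, pad_idx=0):
        super().__init__()
        self.encoder = encoder
        self.projection = nn.Linear(2 * hidden_dim, n_tags)
        self.pad_idx = pad_idx

    def forward(self, words, chars):
        return self.projection(self.encoder(words, chars))   # (B, m, |S|) emissions

    def loss(self, words, chars, tags):
        # Plain bi-LSTM variant: token-level cross-entropy, ignoring padding.
        emissions = self(words, chars)
        return nn.functional.cross_entropy(
            emissions.view(-1, emissions.size(-1)), tags.view(-1),
            ignore_index=self.pad_idx)

    def predict(self, words, chars):
        # Most likely tag per position (no output dependencies modeled).
        return self(words, chars).argmax(dim=-1)
```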
The most important class, the ChainCRF, hasn’t changed from part two!
Now we’ll grab the UD dataset from the awesome TorchNLP library, which makes this incredibly easy:
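Something like the following should do it (the exact keyword arguments of ud_pos_dataset may differ slightly between versions):

```python
from torchnlp.datasets import ud_pos_dataset

# Download the English UD corpus; with both flags set this returns a
# (train, test) tuple, where every example is a dict with 'tokens',
# 'ud_tags' and 'ptb_tags'.
ud_dataset = ud_pos_dataset(train=True, test=True)
train_data, test_data = ud_dataset
```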
Let’s pass the data we want to our TaggingDataset and look at some stats. Now ud_dataset holds the training and test
set of Universal Dependencies, which are both lists of data points, where each data point is a dict with the tokens (AKA the input sequence),
the ud_tags (the UPOS tags), and the ptb_tags (the Penn Treebank tags). For our class we need to put these in a list
of tuples with the input sequence and target tags. We’ll choose the ud_tags as targets, since there are fewer classes for those than for the Penn Treebank tags, and the task is a bit easier.
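A sketch of this preprocessing step; the min_freq cutoff for the word vocabulary is an assumption:

```python
# Keep only what our TaggingDataset needs: (input tokens, UD tags) tuples.
train_examples = [(ex["tokens"], ex["ud_tags"]) for ex in train_data]
test_examples = [(ex["tokens"], ex["ud_tags"]) for ex in test_data]

# Build the three vocabularies on the training data only.
word_vocab = Vocabulary([tokens for tokens, _ in train_examples], min_freq=2)
char_vocab = Vocabulary([list(token) for tokens, _ in train_examples for token in tokens])
tag_vocab = Vocabulary([tags for _, tags in train_examples])

train_dataset = TaggingDataset(train_examples, word_vocab, char_vocab, tag_vocab)
test_dataset = TaggingDataset(test_examples, word_vocab, char_vocab, tag_vocab)

print(len(train_examples), len(test_examples),
      len(word_vocab), len(char_vocab), len(tag_vocab))
```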
As we can see above, we have 12543 training examples, 2077 testing examples, an input vocabulary of size 19674,
a character vocabulary of size 110, and 19 possible POS tags.
The most common input tokens, together with their POS tags, are just generally common English words.
In the printed test example “What if Google <UNK> Into <UNK>?” we encounter two unknown words, which are the words
“Morphed” and “GoogleOS”, as we can see from the character representations. If we didn’t have those, the model wouldn’t have
any information about these words and could only guess a tag based on the other words in the input sequence, the predicted tag before
it if the CRF is used, and the average <UNK> embedding.
Training & Testing
We can now train our bi-LSTM-CRF on the Universal Dependencies training data! We’ll use the Adam optimizer with parameters taken from Ma & Hovy (2016).
Below, each epoch loops over the entire dataset and puts each batch in the data through our model,
calculating the loss and taking a gradient step for the batch.
We’re going to choose a batch size of 100 as opposed to the 10 used by Ma & Hovy, because we want to train a bit faster and don’t care too much about performance right now. In any case, we will never achieve the same performance as Ma & Hovy, because they use many more techniques that increase performance, most importantly probably a character-based CNN and pre-trained word embeddings.
If we were to optimize for performance we would do many more things than discussed here, some of which we’ll briefly discuss
below in the section Disclaimers.
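A sketch of the training loop. It assumes the model exposes a loss(words, chars, tags) method (cross-entropy for the plain bi-LSTM, the CRF negative log-likelihood once the ChainCRF is enabled) and reuses the device and TaggingDataset.collate defined above; the learning rate is a placeholder rather than the exact value from the Colab:

```python
import torch
from torch.utils.data import DataLoader


def train(model, dataset, epochs=15, batch_size=100, lr=1e-3):
    """Trains the model for a number of epochs over the full dataset."""
    loader = DataLoader(dataset, batch_size=batch_size, collate_fn=TaggingDataset.collate)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # lr is a placeholder
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for words, chars, tags in loader:
            words, chars, tags = words.to(device), chars.to(device), tags.to(device)
            optimizer.zero_grad()
            loss = model.loss(words, chars, tags)   # assumed interface, see lead-in
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}: mean batch loss {total_loss / len(loader):.4f}")
```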
We will also implement the testing loop, so that below we can immediately test our trained model on unseen data.
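And a sketch of the testing loop, assuming the model exposes a predict(words, chars) method (argmax over emissions for the plain bi-LSTM, Viterbi decoding with the CRF). It computes the per-example accuracy over non-padding positions and averages it over the test set:

```python
import torch
from torch.utils.data import DataLoader


def test(model, dataset, batch_size=100, pad_idx=0):
    """Returns the mean per-example tag accuracy on the given dataset."""
    loader = DataLoader(dataset, batch_size=batch_size, collate_fn=TaggingDataset.collate)
    accuracies = []
    model.eval()
    with torch.no_grad():
        for words, chars, tags in loader:
            words, chars, tags = words.to(device), chars.to(device), tags.to(device)
            predictions = model.predict(words, chars)   # assumed interface, see lead-in
            for pred, gold in zip(predictions, tags):
                mask = gold != pad_idx                  # ignore padding positions
                correct = (pred[mask] == gold[mask]).float()
                accuracies.append(correct.mean().item())
    return sum(accuracies) / len(accuracies)

# Usage later on, e.g.: test(lstm_model, test_dataset)
```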
Alright, now let’s initialize our model and train it for 15 epochs. First, let’s train a simple bi-LSTM, without
using the CRF.
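For concreteness, initializing and training the plain bi-LSTM tagger could look roughly like this; the embedding and hidden sizes are assumptions, not necessarily the values used in the Colab:

```python
# Plain bi-LSTM tagger, no CRF; the sizes below are placeholder choices.
encoder = Encoder(n_words=len(word_vocab), n_chars=len(char_vocab),
                  word_dim=100, char_dim=30, hidden_dim=100)
lstm_model = Tagger(encoder, hidden_dim=100, n_tags=len(tag_vocab)).to(device)
train(lstm_model, train_dataset, epochs=15, batch_size=100)
```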
Training for 15 epochs on a GPU in Google Colab takes about 10 minutes. The loss steadily goes down each epoch (although not by a lot). We can test our model on the test data of the UD dataset and have a look at the mean accuracy per example. The accuracy for one example is calculated (above in the test loop) as
\[ \text{accuracy} = \frac{1}{m} \sum_{t=1}^{m} \mathbb{1}(\hat{y}_t, y_t), \]
where \(\mathbb{1}(\hat{y}_t, y_t)\) denotes the indicator function that equals 1 if the predicted tag \(\hat{y}_t\) equals the ground-truth target tag \(y_t\) and 0 otherwise. The mean accuracy is then the mean of this metric over the entire test set.
So without the CRF, we already get a mean accuracy of 89%. That’s pretty good. Let’s see what we can do with the CRF enabled.
The training time is now about 20 minutes, but the loss goes down much more over the epochs here. Do note, however, that we cannot compare the values of the loss between this model and the one without the CRF; they use completely different loss functions. Let’s look at
the accuracy we get now.
That’s a pretty significant improvement! It seems like this test data benefits from modeling dependencies in the output sequence, which makes sense for the POS tagging task, as motivated before: it helps with ambiguous words like “book”, for example.
Let’s look at some predictions from both models. Below is a quick function to print a few different examples.
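A sketch of such a helper, reusing the sketches above; the models dict and the crf_model name are hypothetical:

```python
def print_example(index, dataset, models, word_vocab, tag_vocab):
    """Prints the tokens, the gold tags, and each model's predictions for one test example."""
    words, chars, tags = TaggingDataset.collate([dataset[index]])
    length = int((tags[0] != 0).sum())   # number of non-padding positions
    print("tokens:   ", " ".join(word_vocab.decode(words[0, :length].tolist())))
    print("gold tags:", " ".join(tag_vocab.decode(tags[0, :length].tolist())))
    for name, model in models.items():
        pred = model.predict(words.to(device), chars.to(device))[0, :length]
        print(f"{name}:", " ".join(tag_vocab.decode(pred.tolist())))

# For example:
# print_example(1, test_dataset,
#               {"bi-LSTM": lstm_model, "bi-LSTM-CRF": crf_model},
#               word_vocab, tag_vocab)
```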
Let’s print examples at index 1 and 2.
This example shows the benefits of the character model. Even though the sentence contains test words that are unknown because
they didn’t occur during training, the character model managed to predict the right tag for one of the unknown words.
This example shows the power of the CRF. For the last words, “of course”, the bi-LSTM model fails to see the connection between “of” and “course”, namely that the tag “ADV” tends to follow “ADV” in this case, whereas the CRF handles
this situation correctly. There are many more examples like this where the CRF properly disambiguates words. Check it out
for yourself in the Google Colab!
Conclusion
We’re done! We derived, implemented, and trained a linear-chain CRF, showing that it gets significantly higher
test accuracy for a real-world dataset than a simple biLSTM model.
Disclaimers
There are many things that are actually good practice in deep learning, or things that might
improve performance, that we didn’t do here. For example, every epoch we looped over the data in the same order, without shuffling, so it isn’t really SGD. We didn’t tune the hyperparameters at all and chose some of them rather arbitrarily. We also didn’t use dropout, learning rate decay, pre-trained embeddings, and so on.
Sources
Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning (2014). A Gold Standard Dependency Corpus for English. In Proceedings of LREC 2014.
Xuezhe Ma and Eduard Hovy (2016). End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of ACL 2016.