Basic Task Tutorial for Solving Text Classification Issues
3. Training the classifier
As mentioned earlier, the most effective way to train a classifier in NLP is to perform supervised fine-tuning of a pretrained encoder for the downstream task using domain-specific data.
The tokenized text is fed to a pre-trained encoder that returns an array of token embeddings. We get a separate embedding for each token.
So, how do we train the classifier? We can do this in two ways.
The first is to treat the model encoder as a feature extractor. The second is to fine-tune the entire model using an attached classifier head.
In this case, we extract token embeddings from the given text with the use of the pretrained model. We treat the extracted token embeddings as the text representations and train a classifier directly on them.
We don’t change the weights of the encoder. This is a great approach when we don’t have a GPU available, as it’s much less computationally intensive.
First, we need to load the encoder that will be used as the feature extractor. Just as before, we use the Auto class, which, based on the model identifier or local path, is able to load the appropriate model.
An additional parameter of the AutoModel class over the AutoTokenizer class is output_hidden_states, which ensures that the model returns token embeddings. We also want to process our data on the GPU, so we transfer the model to the VRAM by to(“cuda”).
Since we want the feature extraction process to be as optimized as possible, we’ll process the texts in batches. We’ll tokenize the text first, then pass it to the model to embed the tokens.
We intend to classify the text, so we’ll only retrieve the embedding associated with the first [CLS] token. This is the embedding that can represent the whole sentence.
The entire process of retrieving sentence embeddings from the text given the model and tokenizer is implemented in the helper function:
It’s also important to note that we used torch.no_grad() to disable the calculation of the gradient while calling the model in order to reduce memory consumption.
Also, since the model’s output is of the (batch_size, sequence_length, hidden_size) shape, and we want to get the hidden_output/token_embedding that’s associated with the first token in the sequence, we use the code snippet:
We call cpu().numpy() in order to move the embedding from the GPU to the CPU and cast it to the NumPy array format, as it’ll be easier to handle later on.
Now, we’re going to use our helper function to extract the sentence embeddings:
As we stated before, our embeddings are in the shape of (batch_size, 1, hidden_size). Because of that, we need to flatten them:
We can now define our classifier and fit it with our embeddings. We will use the SVM from the scikit-learn library to define our simple model
Clearly, the model is performing quite well, though it certainly can be better.
Previously, we treated our BERT language model as a feature extractor that created a sentence representation for later use in another machine learning model.
Now, we will attach a dense layer to the BERT language model as a classifier, and during training we’ll update the weight of not only the added layer, but the entire model.
We should create a dataset that can be accepted by the Hugging Face trainer class. In order to do so, we will first define the Dataset class inheriting from the Hugging Face Dataset class:
Then we have to convert the texts into Dataset objects for both the train and test corpus:
Now, we’ll load the encoder with the attached header on top. The Hugging Face library provides us with the special Auto class for that: AutoModelForSequenceClassification.
Then we define a helper function that returns the current performance of our model at the time of evaluation in terms of predefined scores:
Now, we need to define the basic parameters of the deep learning process and import a special Trainer class that performs the training, since it has previously implemented all the training loops.
We’re left with initializing a Trainer class with a predefined dataset, model, metrics, and training arguments. Once we’re done, that’s pretty much it.
After we start the training, let’s evaluate our dataset:
We can see that our model has significantly outperformed the previous classifier. This is because we tuned the whole model and adjusted not only the classification layer, but also the encoder itself, so that the embeddings produced can better discriminate topics.