Hugging Face Text Classification Tutorial Using PyTorch
In this example, we use an LSTM model built with PyTorch to perform sentiment analysis on given movie reviews. We explain how to import libraries, import the dataset, the filtering and splitting of the dataset, tokenization, and the training and evaluation of our model.
What Is PyTorch?
PyTorch is an open-source deep learning library based on the well-known Torch library. It is a Python-based library most commonly used for natural language processing and computer vision. In this tutorial, we will be using PyTorch to train our model for text classification.
What Is Hugging Face?
Hugging Face is an open-source platform best known for its natural language processing (NLP) datasets and models. It hosts a large collection of valuable, high-quality data sets covering a wide range of tasks and functionality. When searching for an NLP dataset, Hugging Face is a great go-to source.
Text Classification and Natural Language Processing
Text classification is categorizing data (usually in textual format) into different categories or groups. With data being the new currency of the world, it's no shock that companies are spending fortunes processing and utilizing this precious currency.
Text classification itself breaks down into smaller subfields and use cases, most of which fall under natural language processing. Natural Language Processing, or NLP, is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. In particular, it explores how to program computers to process and analyze large amounts of natural language data.
Defining the Goal of Our Text Classification Model
In the tutorial portion of this article, we will be using PyTorch and Hugging Face to run a text classification model. For our text classification purpose, we will be using natural language processing in order to identify the sentiment of a given sentence.
Sentiment analysis categorizes a given sentence as either emotionally positive or negative. For example, a sentence with positive sentiment would be “He worked so hard and achieved great things.” On the opposite side of the spectrum, a sentence with negative sentiment would be “His performance was not good enough.”
Our Text Classification Data Set
For this tutorial, we’ll be using the IMDB Dataset of 50K Movie Reviews. As the name indicates, this data set contains 50,000 movie reviews written by actual users. Each review is labeled as either positive or negative depending on whether the viewer liked or disliked the given movie.
How To Build a Text Classification Model Using Hugging Face
Step 1: Import the Necessary Libraries
As the first step in any machine or deep learning project, we import all the necessary libraries at the very beginning of our code.
The purpose of most of these libraries will become clearer during the tutorial, but to begin, here are some of the main libraries imported below:
First, we have the well-known NumPy library, which lets us create and work with arrays and matrices and provides a range of linear algebra functions.
From the sklearn library we import the train_test_split function. This function splits our data into a training and a testing data set. As a programmer, you can choose the ratio of the split, among other parameters, but more on that later.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch
import torch.nn as nn
import torch.nn.functional as F
from nltk.corpus import stopwords
from collections import Counter
import string
import re
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
Step 2: Check if CUDA is Available
CUDA is a parallel computing platform and an application programming interface that allows the software to use certain types of graphics processing units for general-purpose processing. This approach is called general-purpose computing on GPUs.
In this step, we check whether a GPU is available to run our code on. If no GPU is available, the code will run on a normal CPU.
is_cuda = torch.cuda.is_available()
if is_cuda:
device = torch.device("cuda")
print("GPU is available")
else:
device = torch.device("cpu")
print("GPU not available, CPU used")
Step 3: Importing and Reading the Data Set
If you're running your model on Kaggle, first add the IMDB dataset to your notebook so that it is available under the /kaggle/input directory. We then pass the file path of our dataset to the read_csv function (imported from the pandas library), which lets us extract and read the data into a DataFrame.
base_csv = '/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv'
df = pd.read_csv(base_csv)
df.head()
Note that the head function displays the first five rows of our dataset.
Step 4: Filtering and Cleaning the Data Set
No data set is perfect. Depending on the model, some extra cleaning is usually necessary for it to perform optimally. In this case, we keep only the review and sentiment columns, dropping anything unrelated that is unnecessary for our model to work and would only slow it down.
X,y = df['review'].values,df['sentiment'].values
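The IMDB CSV only contains the review and sentiment columns, so there is little to drop here. If you want a quick, optional sanity check of the class balance and of missing values (not part of the original notebook), it might look like this:
# optional sanity checks on the raw DataFrame
print(df.shape)                        # expected: (50000, 2)
print(df['sentiment'].value_counts())  # the dataset is balanced: 25,000 positive and 25,000 negative
print(df.isnull().sum())               # confirm there are no missing values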
Step 5: Splitting the Data Set
As stated briefly, a machine learning model requires two different data sets. The training data set is used to train and teach our model. After that, we have the testing data set, which, as the name implies, is used to test the accuracy of our newly trained model.
Some parameters that can be passed to this function include:
test_size: passing a value such as 0.2 instructs the train_test_split function to split our data set into 80% training and 20% testing sets. If it is omitted, as in our call below, the default split is 75%/25%.
shuffle: as the name states, if this parameter is True (the default), the data points are shuffled before splitting.
stratify: by passing stratify=y, as we do here, the ratio of positive to negative reviews is preserved in both splits.
x_train,x_test,y_train,y_test = train_test_split(X,y,stratify=y)
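To confirm the split, you can optionally print the shapes of the resulting arrays; with the default test_size of 0.25, this should show 37,500 training reviews and 12,500 testing reviews.
# optional: inspect the sizes of the two splits
print(f'shape of train data is {x_train.shape}')
print(f'shape of test data is {x_test.shape}')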
Step 6: Defining the Tokenization Function
A significant part of natural language processing, tokenization is the process of dividing raw text data into smaller chunks. This is done by splitting sentences into words, more commonly known as tokens.
The main concept behind tokenization is that by analyzing the different words present in a given text, we can interpret the meaning of such a text. We can also run statistical tools and methods to find hidden insights and patterns in the data.
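To make the idea concrete, here is a tiny, self-contained sketch (separate from the tutorial code) that tokenizes one sentence by splitting on whitespace and counts the resulting tokens with collections.Counter, the same tool the tokenize function below relies on:
from collections import Counter

sentence = "the movie was great and the acting was great"
tokens = sentence.lower().split()      # naive whitespace tokenization
print(tokens)                          # ['the', 'movie', 'was', 'great', ...]
print(Counter(tokens).most_common(3))  # [('the', 2), ('was', 2), ('great', 2)]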
To start with, we’ll define the preprocess_string helper and the tokenize function, which takes our datasets as input and performs this tokenization process.
def preprocess_string(s):
    # Remove all non-word characters (everything except numbers and letters)
    s = re.sub(r"[^\w\s]", '', s)
    # Replace all runs of whitespace with no space
    s = re.sub(r"\s+", '', s)
    # Replace digits with no space
    s = re.sub(r"\d", '', s)
    return s
def tokenize(x_train, y_train, x_val, y_val):
    word_list = []
    # stopwords require a one-time nltk.download('stopwords') if not already present
    stop_words = set(stopwords.words('english'))
    for sent in x_train:
        for word in sent.lower().split():
            word = preprocess_string(word)
            if word not in stop_words and word != '':
                word_list.append(word)

    corpus = Counter(word_list)
    # sorting on the basis of most common words and keeping the top 1000
    corpus_ = sorted(corpus, key=corpus.get, reverse=True)[:1000]
    # creating a dict that maps each word to an integer index (0 is reserved for padding)
    onehot_dict = {w: i + 1 for i, w in enumerate(corpus_)}

    # tokenize
    final_list_train, final_list_test = [], []
    for sent in x_train:
        final_list_train.append([onehot_dict[preprocess_string(word)] for word in sent.lower().split()
                                 if preprocess_string(word) in onehot_dict.keys()])
    for sent in x_val:
        final_list_test.append([onehot_dict[preprocess_string(word)] for word in sent.lower().split()
                                if preprocess_string(word) in onehot_dict.keys()])

    encoded_train = [1 if label == 'positive' else 0 for label in y_train]
    encoded_test = [1 if label == 'positive' else 0 for label in y_val]
    # the tokenized reviews have different lengths, so keep them as object arrays
    return (np.array(final_list_train, dtype=object), np.array(encoded_train),
            np.array(final_list_test, dtype=object), np.array(encoded_test), onehot_dict)
Step 7: Applying the Tokenization to the Data Sets
The tokenize() function takes both our training and testing sets as input and returns them as sequences of integer tokens, along with the encoded labels and the vocabulary dictionary.
x_train,y_train,x_test,y_test,vocab = tokenize( x_train,y_train,x_test,y_test)
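Before choosing a padding length in the next step, it helps to look at how long the tokenized reviews actually are. A short, optional check (assuming x_train now holds the token sequences returned above) might be:
# optional: distribution of tokenized review lengths, to motivate the padding length of 500
rev_len = [len(review) for review in x_train]
print(pd.Series(rev_len).describe())  # most reviews fall well below 500 tokens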
Step 8: Padding
In typical sentence classification, sentences are padded with 0's to get sentences of equal length and to allow subsequent classification. Meaning that the following padding_()
function will pad each sentence with extra 0’s until they have the same length.
def padding_(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len), dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            # left-pad short reviews with zeros and truncate long ones to seq_len
            features[ii, -len(review):] = np.array(review)[:seq_len]
    return features
We perform padding on the X values in the training and testing data sets. Since most reviews are shorter than 500 tokens, we pad (or truncate) every review to a fixed length of 500.
x_train_pad = padding_(x_train,500)
x_test_pad = padding_(x_test,500)
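As a quick, optional check, the padded arrays now have a fixed width of 500, with zeros on the left for shorter reviews:
# optional: verify the padded shape and look at the start of one padded review
print(x_train_pad.shape)    # (number_of_training_reviews, 500)
print(x_train_pad[0][:10])  # leading zeros when the review is shorter than 500 tokens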
Step 9: Batching and Loading as Tensors
Batch size is the number of samples processed before the model is updated. In this case, we have selected a batch size of 50.
The size of a batch must be more than or equal to one and less than or equal to the number of samples in the training dataset. Here we create the Tensor datasets and define the required batch size.
train_data = TensorDataset(torch.from_numpy(x_train_pad), torch.from_numpy(y_train))
valid_data = TensorDataset(torch.from_numpy(x_test_pad), torch.from_numpy(y_test))
batch_size = 50
# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # dataiter.next() no longer works in recent PyTorch versions
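You can inspect this sample batch to confirm that its dimensions match the chosen batch size and sequence length (an optional check):
# optional: confirm the batch dimensions
print('Sample input size: ', sample_x.size())  # expected: torch.Size([50, 500])
print('Sample label size: ', sample_y.size())  # expected: torch.Size([50])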
Step 10: Defining the Model
As with any deep learning model, we define our architecture as a class. In this case, we will use an LSTM-based RNN model. This class will be instantiated later (in Step 11) in order to run our final model.
class SentimentRNN(nn.Module):
    def __init__(self, no_layers, vocab_size, hidden_dim, embedding_dim, drop_prob=0.5):
        super(SentimentRNN, self).__init__()

        self.output_dim = output_dim  # taken from the global defined in Step 11
        self.hidden_dim = hidden_dim
        self.no_layers = no_layers
        self.vocab_size = vocab_size

        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=self.hidden_dim,
                            num_layers=no_layers, batch_first=True)

        # dropout layer
        self.dropout = nn.Dropout(0.3)

        # linear and sigmoid layers
        self.fc = nn.Linear(self.hidden_dim, output_dim)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        batch_size = x.size(0)
        # embeddings and lstm_out
        embeds = self.embedding(x)  # shape: B x S x Feature since batch_first=True
        # embeds.shape is [batch_size, seq_len, embedding_dim], e.g. [50, 500, 64]
        lstm_out, hidden = self.lstm(embeds, hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

        # dropout and fully connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)

        # sigmoid function
        sig_out = self.sig(out)

        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]  # keep only the prediction for the last time step

        # return last sigmoid output and hidden state
        return sig_out, hidden

    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for the hidden state and cell state of the LSTM
        h0 = torch.zeros((self.no_layers, batch_size, self.hidden_dim)).to(device)
        c0 = torch.zeros((self.no_layers, batch_size, self.hidden_dim)).to(device)
        hidden = (h0, c0)
        return hidden
Step 11: Defining Our Model's Parameters
Finally, we define the hyperparameters that will be passed to the SentimentRNN class we defined in the previous step.
no_layers = 2
vocab_size = len(vocab) + 1 #extra 1 for padding
embedding_dim = 64
output_dim = 1
hidden_dim = 256
We then pass these parameters to the model and move it to the GPU (if one is available):
model = SentimentRNN(no_layers,vocab_size,hidden_dim,embedding_dim,drop_prob=0.5)
#moving to gpu
model.to(device)
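Printing the model is an easy, optional way to double-check the architecture that was just built:
# optional: inspect the layers of the instantiated model
print(model)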
Step 12: Training Our Text Classification Model
Here we create the loss function and the optimizer, along with the accuracy helper, using a learning rate of 0.001.
lr = 0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
def acc(pred, label):
    pred = torch.round(pred.squeeze())
    return torch.sum(pred == label.squeeze()).item()
In this part of the code, we write the training loop. The number of epochs is the number of complete passes through the training dataset; in this case, we set it to 5.
clip = 5
epochs = 5
valid_loss_min = np.Inf
# train for some number of epochs
epoch_tr_loss, epoch_vl_loss = [], []
epoch_tr_acc, epoch_vl_acc = [], []

for epoch in range(epochs):
    train_losses = []
    train_acc = 0.0
    model.train()
    # initialize hidden state
    h = model.init_hidden(batch_size)
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        model.zero_grad()
        output, h = model(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        train_losses.append(loss.item())
        # calculating accuracy
        accuracy = acc(output, labels)
        train_acc += accuracy
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

    val_h = model.init_hidden(batch_size)
    val_losses = []
    val_acc = 0.0
    model.eval()
    for inputs, labels in valid_loader:
        val_h = tuple([each.data for each in val_h])
        inputs, labels = inputs.to(device), labels.to(device)
        output, val_h = model(inputs, val_h)
        val_loss = criterion(output.squeeze(), labels.float())
        val_losses.append(val_loss.item())
        accuracy = acc(output, labels)
        val_acc += accuracy

    epoch_train_loss = np.mean(train_losses)
    epoch_val_loss = np.mean(val_losses)
    epoch_train_acc = train_acc / len(train_loader.dataset)
    epoch_val_acc = val_acc / len(valid_loader.dataset)
    epoch_tr_loss.append(epoch_train_loss)
    epoch_vl_loss.append(epoch_val_loss)
    epoch_tr_acc.append(epoch_train_acc)
    epoch_vl_acc.append(epoch_val_acc)
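The loop above collects per-epoch statistics but never reports them, and valid_loss_min is never updated. If you also want to print the progress and checkpoint the best model, a minimal sketch of the extra lines, to be placed at the end of each epoch iteration inside the loop above, could look like the following (the file name state_dict.pt is just a placeholder):
# to be added at the end of each iteration of the `for epoch in range(epochs)` loop
print(f'Epoch {epoch + 1}')
print(f'train_loss : {epoch_train_loss} val_loss : {epoch_val_loss}')
print(f'train_accuracy : {epoch_train_acc * 100} val_accuracy : {epoch_val_acc * 100}')
if epoch_val_loss <= valid_loss_min:
    # save the weights whenever the validation loss improves
    torch.save(model.state_dict(), 'state_dict.pt')
    valid_loss_min = epoch_val_loss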
Step 13: Evaluating the Final Results of Our Text Classification Model
We define the predict_text() function, which we will use to test our model on a given review.
def predict_text(text):
    word_seq = np.array([vocab[preprocess_string(word)] for word in text.split()
                         if preprocess_string(word) in vocab.keys()])
    word_seq = np.expand_dims(word_seq, axis=0)
    pad = torch.from_numpy(padding_(word_seq, 500))
    inputs = pad.to(device)
    batch_size = 1
    h = model.init_hidden(batch_size)
    h = tuple([each.data for each in h])
    output, h = model(inputs, h)
    return output.item()
We run the predict_text() function on a given review and compare its prediction with the actual sentiment.
index = 32
print(df['review'][index])
print('='*70)
print(f'Actual sentiment is : {df["sentiment"][index]}')
print('='*70)
pro = predict_text(df['review'][index])
status = "positive" if pro > 0.5 else "negative"
pro = (1 - pro) if status == "negative" else pro
print(f'Predicted sentiment is {status} with a probability of {pro}')
My first exposure to the Templarios & not a good one. I was excited to find this title among the offerings from Anchor Bay Video, which has brought us other cult classics such as "Spider Baby." The print quality is excellent, but this alone can't hide the fact that the film is deadly dull. There's a thrilling opening sequence in which the villagers exact terrible revenge on the Templars (and set the whole thing in motion), but everything else in the movie is slow, ponderous and, ultimately, unfulfilling. Adding insult to injury: the movie was dubbed, not subtitled, as promised on the video jacket.
======================================================================
Actual sentiment is : negative
======================================================================
Predicted sentiment is negative with a probability of 0.9017044752836227
To find the original code of the example used in this tutorial, check out the Sentiment analysis using LSTM - PyTorch code on Kaggle.
Using Hugging Face Datasets for Text Classification
Now we have explained a bit about PyTorch, which is one of the most well-known machine learning libraries out there. We’ve also touched on what Hugging Face is, what text classification is, and what natural language processing (also known as NLP) is.
Then we moved on to a practical machine-learning example written in Python. In this example, we used an LSTM model built with PyTorch in order to perform sentiment analysis on given movie reviews. We explained how to import the necessary libraries, how to import the required dataset, the filtering and splitting of the dataset, tokenization, and the training and evaluation of our model.
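As a final note on the Hugging Face side: if you would rather pull the IMDB reviews directly from the Hugging Face Hub instead of a Kaggle CSV, a minimal sketch using the datasets library (which the code above does not use) could look like this:
# optional: load the IMDB reviews from the Hugging Face Hub instead of a CSV
# requires: pip install datasets
from datasets import load_dataset

imdb = load_dataset("imdb")            # a DatasetDict with 'train' and 'test' splits
print(imdb["train"][0]["text"][:100])  # the review text
print(imdb["train"][0]["label"])       # 0 = negative, 1 = positive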
If you are a machine learning expert or new to the field, then adding a subfield such as NLP to your list of skills will definitely pay off!
Published at DZone with permission of Kevin Vu. See the original article here.