Text Classification With BERT
Text classification is a machine learning subfield that teaches computers how to classify text into different categories. In this tutorial, you will use BERT to develop your own text classification model.
What Is Text Classification?
Text classification is a machine learning subfield that teaches computers how to classify text into different categories. It's commonly used as a supervised learning technique, which means that the algorithm is trained on a set of texts that have already been labeled with their respective categories. Once it's been trained on this data, the algorithm can use what it's learned to make predictions about new, unlabeled texts.
The algorithm looks for patterns in the text to determine which category it belongs to. It's like when learning to recognize a certain type of flower — we start to notice certain features that distinguish it from other types of flowers. With text classification, the algorithm is doing the same thing but with words and phrases instead.
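To make the idea concrete, here is a minimal sketch of supervised text classification, assuming scikit-learn is installed. The tiny labeled dataset and the choice of a bag-of-words model with logistic regression are purely illustrative; BERT, which we use later in this tutorial, is a far more powerful classifier.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# A toy labeled dataset: 1 = positive, 0 = negative
train_texts = ["I loved this film", "What a fantastic story",
               "Terrible acting", "I want my money back"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # turn each text into word-count features

clf = LogisticRegression()
clf.fit(X_train, train_labels)  # learn which words signal which category

# Classify a new, unlabeled text
print(clf.predict(vectorizer.transform(["a fantastic film"])))  # expected: [1] (positive)

The classifier never sees the new sentence during training; it relies on the word patterns it learned from the labeled examples, which is exactly the behavior described above.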
Text classification is a versatile tool that is widely used in many real-world applications that you may have come across. For instance, an email that ended up in your spam folder is text classification at work. The model can differentiate between spam and non-spam emails by studying specific words or phrases that identify a given email as spam, such as "Congratulations, you have won" or "Today is your lucky day."
Text classification is also valuable for analyzing the sentiment of social media posts, particularly when it comes to detecting harmful content such as hate speech. By using a machine learning model, posts can be classified and monitored for offensive language and hate speech.
But text classification isn't just for serious applications — it can also be used for fun things like categorizing news articles and videos by topic. This way, a user can easily find articles and videos that interest them without having to sift through irrelevant content. Text classification truly is a powerful tool with a variety of practical applications.
What Is BERT for Deep Learning?
BERT, short for Bidirectional Encoder Representations from Transformers, is a powerful natural language processing (NLP) model developed by Google that uses a deep neural network architecture based on the state-of-the-art transformer model.
As we said earlier, the BERT model architecture is based on a deep neural network called a transformer, which is different from traditional NLP models that process text one word at a time. Instead, transformers can process the entire text input all at once, which helps them to capture the relationships between words and phrases more effectively.
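As a small illustration, the sketch below (assuming the transformers library is installed) encodes a whole sentence in a single forward pass, producing one context-aware vector per token; the example sentence is arbitrary.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised interest rates", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token (including [CLS] and [SEP]), produced in a single pass
print(outputs.last_hidden_state.shape)  # e.g., torch.Size([1, 7, 768])

Because every token is encoded together, the vector for a word like "bank" reflects the words around it rather than a fixed, context-free meaning.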
How Does the BERT Model Work for Text Classification?
BERT uses a multi-layer bidirectional transformer encoder to represent the input text in a high-dimensional space. That means it can take into account the entire context of each word in the sentence, which helps it to better understand the meaning of the text.
One of the most interesting things about BERT is that it's a pre-trained model. This means that BERT has already been trained on massive amounts of text data, such as books, articles, and websites, before it's fine-tuned for specific downstream NLP tasks, including text classification.
By pre-training on a large corpus of text data, BERT can develop a deep understanding of the underlying structure and meaning of language, making it a highly effective tool for NLP tasks. Once pre-trained, BERT can be fine-tuned for specific tasks, which allows it to adapt to the specific nuances of the task and improve its accuracy.
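To give a flavor of what pre-training looks like, the sketch below uses BERT's masked-language-model head to fill in a masked word. This is illustrative only and is not part of the classification pipeline we build next.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The movie was absolutely [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = mlm_model(**inputs).logits

# Find the masked position and take the highest-scoring vocabulary word for it
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # a plausible fill-in, e.g., "brilliant" or "terrible"

Predicting masked words over billions of sentences is how BERT acquires its general understanding of language, which fine-tuning then adapts to a narrower task such as sentiment classification.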
It's important to note that there are two different BERT variations: BERT base and BERT large. For the remainder of this article, we'll be utilizing the BERT base model, which is a more compact version of BERT that still maintains a strong grasp of context and linguistic nuances. By employing BERT base, we can take advantage of its abilities while lowering computational demands, making it more suitable for a variety of text classification tasks and achieving remarkable results.
Tutorial on Text Classification Using BERT
So why do we fine-tune BERT on the IMDB movie review dataset? Well, we want to tailor the already powerful BERT model for sentiment analysis tasks. BERT is excellent at understanding language structure and context, but it doesn't naturally have sentiment analysis skills. By fine-tuning BERT for text classification with a labeled dataset, such as IMDB movie reviews, we give it the ability to accurately predict sentiments in the sentences it encounters.
If you would like to run the model below yourself, you can find the dataset at the following link: IMDB Dataset of 50K Movie Reviews.
Step 1: Import the Necessary Libraries
This snippet of code is all about importing the essential tools we need for our project. We're using PyTorch for the deep learning functionality, the transformers library for BERT, and essential methods from the scikit-learn library in order to handle data and check how well our model does.
import os
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
# Note: AdamW is deprecated in recent versions of transformers (and removed in the newest ones);
# torch.optim.AdamW is a drop-in replacement if this import fails.
from transformers import BertTokenizer, BertModel, AdamW, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
Step 2: Import the IMDB Data Set and Preprocess It
The code below defines a function, load_imdb_data, that reads a CSV file containing IMDB movie reviews and their corresponding sentiments. It returns a list of review texts and a list of labels, where 1 represents a positive sentiment and 0 represents a negative sentiment.
def load_imdb_data(data_file):
    df = pd.read_csv(data_file)
    texts = df['review'].tolist()
    labels = [1 if sentiment == "positive" else 0 for sentiment in df['sentiment'].tolist()]
    return texts, labels
We then store the path to our dataset and pass it as input to the load_imdb_data() function.
data_file = "/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv"
texts, labels = load_imdb_data(data_file)
Step 3: Create a Custom Dataset Class for Text Classification
This is a custom dataset class that helps organize movie reviews and their sentiments for our BERT model. It takes care of tokenizing the text, handling the sequence length, and providing a neat package with input IDs, attention masks, and labels for our model to learn from.
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, return_tensors='pt', max_length=self.max_length, padding='max_length', truncation=True)
        return {'input_ids': encoding['input_ids'].flatten(), 'attention_mask': encoding['attention_mask'].flatten(), 'label': torch.tensor(label)}
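To get a feel for what the dataset produces, here is a quick sanity check. It's only a sketch: it assumes the texts and labels loaded in Step 2 and instantiates the bert-base-uncased tokenizer directly, whereas the tutorial configures the tokenizer properly in a later step.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sample_dataset = TextClassificationDataset(texts[:4], labels[:4], tokenizer, max_length=128)

item = sample_dataset[0]
print(item['input_ids'].shape)       # torch.Size([128]) - padded/truncated token IDs
print(item['attention_mask'].shape)  # torch.Size([128]) - 1 for real tokens, 0 for padding
print(item['label'])                 # tensor(1) for positive or tensor(0) for negative

Because every item comes out as fixed-length tensors, the DataLoader can stack items into batches without any extra collation logic.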
Step 4: Build Our Custom BERT Classifier
In this step, we aim to create our own custom BERT classifier. The classifier is built on top of the famous BERT model, which is great at understanding text. We will then add a dropout layer to keep things in check and a linear layer to help us classify text.
Our BERTClassifier takes input IDs and attention masks, runs them through BERT and the extra layers we added, and returns the raw class scores (logits) as its output.
class BERTClassifier(nn.Module):
    def __init__(self, bert_model_name, num_classes):
        super(BERTClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        x = self.dropout(pooled_output)
        logits = self.fc(x)
        return logits
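Before training, it can be reassuring to check the output shape on a dummy batch. This is just a sketch with made-up token IDs; the vocabulary size of 30522 corresponds to bert-base-uncased.

import torch

sanity_model = BERTClassifier('bert-base-uncased', num_classes=2)
dummy_ids = torch.randint(0, 30522, (2, 128))      # two fake sequences of 128 token IDs
dummy_mask = torch.ones(2, 128, dtype=torch.long)  # pretend every position is a real token

with torch.no_grad():
    logits = sanity_model(input_ids=dummy_ids, attention_mask=dummy_mask)
print(logits.shape)  # torch.Size([2, 2]) - one row per example, one score per class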
Step 5: Define the train() Function
The train() function takes the model, data loader, optimizer, scheduler, and device as its arguments. The function puts the model into training mode and then runs through each batch of data from the data loader. For each batch, it clears the optimizer's gradients, gets the input IDs, attention masks, and labels, and feeds them to the model. It then computes the cross-entropy loss, backpropagates it, and lets the optimizer and scheduler update the model's weights and learning rate.
def train(model, data_loader, optimizer, scheduler, device):
    model.train()
    for batch in data_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
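The function above doesn't report anything while it trains. If you want to monitor progress, a small variant like the following (a sketch, not part of the original tutorial) returns the average training loss for the epoch so you can print it alongside the validation accuracy.

def train_with_loss(model, data_loader, optimizer, scheduler, device):
    model.train()
    total_loss = 0.0
    for batch in data_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()
    return total_loss / len(data_loader)  # average loss over all batches in the epoch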
Step 6: Build Our Evaluation Method
The evaluate() function puts the model into evaluation mode and loops over the validation data without computing gradients. For each batch, it gets the input IDs, attention masks, and labels and feeds them to the model. The model then gives its best predictions, which are compared to the actual labels. Finally, the function calculates the accuracy score and a classification report to let us know how well the model did in understanding movie reviews' sentiments.
def evaluate(model, data_loader, device):
    model.eval()
    predictions = []
    actual_labels = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            predictions.extend(preds.cpu().tolist())
            actual_labels.extend(labels.cpu().tolist())
    return accuracy_score(actual_labels, predictions), classification_report(actual_labels, predictions)
Step 7: Build Our Prediction Method
The predict_sentiment() function lets us classify a single piece of text. It tokenizes the input review, feeds the resulting input IDs and attention mask to the model, and takes the class with the highest score.
It then returns "positive" or "negative" depending on that prediction, so we can quickly try the trained model on individual movie reviews.
def predict_sentiment(text, model, tokenizer, device, max_length=128):
    model.eval()
    encoding = tokenizer(text, return_tensors='pt', max_length=max_length, padding='max_length', truncation=True)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, preds = torch.max(outputs, dim=1)
    return "positive" if preds.item() == 1 else "negative"
Step 8: Define Our Model’s Parameters
Here, we are going to set up essential parameters for fine-tuning the BERTClassifier, including the BERT model name, number of classes, maximum input sequence length, batch size, number of training epochs, and learning rate, to help the model effectively understand movie reviews and their sentiments.
# Set up parameters
bert_model_name = 'bert-base-uncased'
num_classes = 2
max_length = 128
batch_size = 16
num_epochs = 4
learning_rate = 2e-5
Step 9: Loading and Splitting the Data
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)
Step 10: Initialize Tokenizer, Data Set, and Data Loader
tokenizer = BertTokenizer.from_pretrained(bert_model_name)
train_dataset = TextClassificationDataset(train_texts, train_labels, tokenizer, max_length)
val_dataset = TextClassificationDataset(val_texts, val_labels, tokenizer, max_length)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)
Step 11: Set Up the Device and Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERTClassifier(bert_model_name, num_classes).to(device)
Step 12: Set Up Optimizer and Learning Rate Scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
Step 13: Training the Model
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train(model, train_dataloader, optimizer, scheduler, device)
    accuracy, report = evaluate(model, val_dataloader, device)
    print(f"Validation Accuracy: {accuracy:.4f}")
    print(report)
Saving the final model:
torch.save(model.state_dict(), "bert_classifier.pth")
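If you want to reuse the trained weights later without retraining, a short sketch like the following (assuming the same BERTClassifier definition and the parameters defined above) reloads the checkpoint for inference:

loaded_model = BERTClassifier(bert_model_name, num_classes).to(device)
loaded_model.load_state_dict(torch.load("bert_classifier.pth", map_location=device))
loaded_model.eval()  # switch to inference mode before calling predict_sentiment()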
Step 14: Evaluating Our Model’s Performance
# Test sentiment prediction
test_text = "The movie was great and I really enjoyed the performances of the actors."
sentiment = predict_sentiment(test_text, model, tokenizer, device)
print(test_text)
print(f"Predicted sentiment: {sentiment}")
Output: The movie was great and I really enjoyed the performances of the actors.
Predicted sentiment: positive.
# Test sentiment prediction
test_text = "The movie was so bad and I would not recommend it to anyone."
sentiment = predict_sentiment(test_text, model, tokenizer, device)
print(test_text)
print(f"Predicted sentiment: {sentiment}")
Output: The movie was so bad and I would not recommend it to anyone.
Predicted sentiment: negative.
# Test sentiment prediction
test_text = "Worst movie of the year."
sentiment = predict_sentiment(test_text, model, tokenizer, device)
print(test_text)
print(f"Predicted sentiment: {sentiment}")
Output: Worst movie of the year.
Predicted sentiment: negative.
Final Thoughts on Text Classification Using BERT
To sum it up, BERT has seriously changed the game when it comes to text classification. It's made things like sentiment analysis and topic sorting a whole lot better and faster. By taking those pre-trained models and customizing them for our own projects, we're getting amazing results that help us out in the real world.
As we keep experimenting with BERT and similar models, there's no doubt we'll see even more cool stuff happening in the world of AI and language understanding.