Categorizing Content Without Labels Using Zero-Shot Classification
Learn how zero-shot classification makes it easy to categorize content without needing labeled data by using pre-trained models for efficient results.
Usually, when we want to classify content, we rely on labeled data and train machine learning models with that labeled data to make predictions on new or unseen data. For example, we might label each image in the image dataset as "dog" or "cat" or categorize an article as "tutorial" or "review." These labels help the model learn and make predictions on new data.
But here is the problem: getting labeled data is not always easy. Sometimes, it can be really expensive or time-consuming, and on top of that, new labels might pop up as time goes on. That is where zero-shot classification comes into the picture. With zero-shot models, we can classify content without needing to train on every single labeled class beforehand. These models can generalize to new categories based on natural language by using pre-trained language models that have been trained on huge amounts of text.
Zero-Shot Classification With Hugging Face
In this article, I will use Hugging Face's Transformers library to perform zero-shot classification with a pre-trained BART model. Let's take a quick summary of a DZone article and categorize it into one of the following categories: "Tutorial," "Opinion," "Review," "Analysis," or "Survey."
Environment Setup
- Ensure Python 3.10 or higher is installed.
- Install the necessary packages mentioned below.
pip install transformers torch
Now, let's use the following short summary from my previous article to perform zero-shot classification and identify the most likely category from the list above:
Summary:
"Learn how Integrated Gradients help identify which input features contribute most to the model's predictions to ensure transparency."
from transformers import pipeline
# Initializing zero-shot classification pipeline using BART pre-trained model
zero_shot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# tl;dr from this article - https://dzone.com/articles/integrated-gradients-ai-explainability
article_summary = """
Learn how Integrated Gradients helps identify which input features contribute most to the model's predictions to ensure transparency.
"""
# sample categories from DZone -
sample_categories = ["Tutorial", "Opinion", "Review", "Analysis", "Survey"]
# Now, classify article into one of the sample categories.
category_scores = zero_shot_classifier(article_summary, sample_categories)
# pick the category with highest score and print
category = category_scores['labels'][0]
print(f"The article is most likely a '{category}'")
The model classified the article as most likely a Tutorial. We can also inspect the score for each category instead of picking only the one with the highest score.
# Print score for each category
for i in range(len(category_scores['labels'])):
    print(f"{category_scores['labels'][i]}: {category_scores['scores'][i]:.2f}")
Here is the output:
Tutorial: 0.53
Review: 0.20
Survey: 0.12
Analysis: 0.10
Opinion: 0.06
These scores are helpful if you want to use zero-shot classification to identify the most appropriate tags for your content.
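When an article could reasonably carry several tags at once, the pipeline also supports multi-label classification via the `multi_label=True` argument: each candidate label is then scored independently instead of competing for a single probability mass. Here is a minimal sketch reusing the same classifier and summary as above (the 0.5 threshold is an arbitrary choice for illustration):

```python
from transformers import pipeline

# Same zero-shot pipeline used earlier in the article
zero_shot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

article_summary = (
    "Learn how Integrated Gradients helps identify which input features "
    "contribute most to the model's predictions to ensure transparency."
)
sample_categories = ["Tutorial", "Opinion", "Review", "Analysis", "Survey"]

# multi_label=True scores each label independently, so the scores no longer
# sum to 1 and several tags can clear a threshold at the same time
result = zero_shot_classifier(article_summary, sample_categories, multi_label=True)

# Keep every tag whose independent score clears the chosen threshold
tags = [label for label, score in zip(result["labels"], result["scores"]) if score > 0.5]
print(tags)
```

This is the natural fit for tagging use cases, where a piece of content is rarely exactly one thing.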
Conclusion
In this article, we explored how zero-shot classification can be used to categorize content without the need to train on labeled data. As you can see from the code, it is very easy to implement and requires just a few lines.
While easy and flexible, these models might not work well in specialized categories where the model does not understand the specific terminology. For example, classifying a medical report into one of the categories like "Cardiology," "Oncology," or "Neurology" requires a deep understanding of medical terms that were not part of the model's pre-training. In those cases, you might still need to fine-tune the model with specific datasets for better results.
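Before reaching for fine-tuning, one lighter option worth trying is the pipeline's `hypothesis_template` parameter, which lets you phrase the entailment premise in domain language (the default template is "This example is {}."). A sketch using the medical-report scenario above; the report text and template wording here are illustrative, and a model fine-tuned on clinical text will usually still do better:

```python
from transformers import pipeline

zero_shot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Hypothetical report snippet for illustration only
report = "Patient presents with arrhythmia and elevated troponin levels."
departments = ["Cardiology", "Oncology", "Neurology"]

# hypothesis_template controls the sentence the NLI model tests each label
# against, which can nudge the model toward the intended reading of the labels
result = zero_shot_classifier(
    report,
    departments,
    hypothesis_template="This medical report belongs to the {} department.",
)
print(result["labels"][0])
```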
Additionally, zero-shot models may have trouble with ambiguous language or context-dependent tasks, such as detecting sarcasm or cultural references.