LLMs for Bad Content Detection: Pros and Cons
This post evaluates two different methods for identifying harmful content on the Internet: training supervised classifiers and using large language models.
Harmful content detection is the task of identifying content that can hurt Internet users. Examples of harmful content include hateful/offensive content, spam, harassment, sexual content, phishing/scams, and solicitation.
Harmful content on content platforms can have a huge negative impact, including:
- Emotional distress, humiliation, and even physical harm to the users
- Damage to the reputation of the platforms that host it
- Reduction in active users and difficulty in attracting advertisers
It is therefore crucial to identify harmful content so that it can be removed. User-generated content (UGC) platforms are especially exposed because they allow users to upload a wide range of content. UGC platforms include social media, messaging services, forums, gaming platforms, and marketplaces; detecting and mitigating harmful content on these platforms is particularly important.
To minimize the number of users exposed to such content, platforms rely on automated detection and takedown of harmful content. Automated detection is challenging because harmful content takes many forms (text, videos, images, links, etc.), and it can be difficult to distinguish what is harmful from what is not. On top of this, false positives (automated systems incorrectly flagging benign content as harmful) have their own negative effects, including harm to users, damage to the platform's reputation, and potential legal challenges. Platforms therefore use artificial intelligence (AI) to automatically detect harmful content, but they must carefully balance detection against the avoidance of false positives.
Supervised Classifiers
The most popular approach used for the automated detection of harmful content today is training classifiers (supervised machine learning models) to detect harmful content using a labeled dataset. A labeled dataset for a particular harm type consists of a number of both harmful and benign examples. The training process consists of feature extraction from the content followed by training of supervised classifiers using the extracted features and labels in the dataset.
With the emergence of pre-trained foundational models, however, the amount of labeled data required has been significantly reduced. For text classification, the foundational-model approach takes a pre-trained model such as BERT or RoBERTa, uses it to generate embeddings of the text, and then trains a traditional supervised classifier on those embeddings and the labels. This approach requires a much smaller labeled dataset. Embeddings are fixed-length vector representations of text that capture its meaning, so the supervised model learns to classify whether the meaning of the text is harmful or not.
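As a minimal sketch of this embeddings-as-features approach (assuming the sentence-transformers and scikit-learn packages, the all-MiniLM-L6-v2 encoder, and a tiny illustrative dataset; a real classifier would be trained on a much larger labeled corpus):
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical labeled dataset (1 = harmful, 0 = benign).
texts = [
    "I hate you and everyone like you",
    "You people should all disappear",
    "What a lovely day outside",
    "Thanks for sharing this recipe",
]
labels = [1, 1, 0, 0]

# Encode each text into a fixed-length embedding vector.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts)

# Train a traditional supervised classifier on the embeddings.
clf = LogisticRegression().fit(embeddings, labels)

# Classify new content by embedding it and applying the classifier.
print(clf.predict(encoder.encode(["Have a great weekend"])))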
Free, open-source foundational models such as BERT, RoBERTa, and DistilBERT can be used as described above, or fine-tuned directly for classification.
Images can be additionally processed through optical character recognition (OCR), and audio/video can be processed through automated speech recognition (ASR) to extract text that can be subjected to harmful content detection.
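One possible sketch of this extraction step, assuming the pytesseract (OCR) and openai-whisper (ASR) packages and hypothetical file paths:
from PIL import Image
import pytesseract
import whisper

# OCR: extract any text embedded in an uploaded image.
image_text = pytesseract.image_to_string(Image.open("user_upload.png"))  # hypothetical path

# ASR: transcribe speech from an uploaded audio or video track.
asr_model = whisper.load_model("base")
audio_text = asr_model.transcribe("user_upload.mp3")["text"]  # hypothetical path

# Both extracted strings can then be passed to the text classifier.
print(image_text, audio_text)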
Here is some sample code to train a hate speech classifier. It trains the model and writes checkpoints to a local directory called "hate".
from datasets import load_dataset, load_metric
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import numpy as np

# Load any labeled dataset of choice for training.
hate_dataset = load_dataset("SetFit/toxic_conversations")

# Tokenize the text so it can be fed to the model.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = hate_dataset["train"].map(preprocess_function, batched=True)
tokenized_test = hate_dataset["test"].map(preprocess_function, batched=True)

# Dynamically pad each batch to the length of its longest example.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Binary classification head on top of the pre-trained encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(
        predictions=predictions, references=labels)["accuracy"]
    return {"accuracy": accuracy}

training_args = TrainingArguments(
    output_dir="hate",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()
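After training, the model can be reloaded for inference. Below is a brief sketch, assuming the final model is explicitly saved to the "hate" directory (otherwise the Trainer only writes per-epoch checkpoints under that directory):
# Save the final model (and tokenizer) to the output directory.
trainer.save_model("hate")

from transformers import pipeline

# Reload the saved classifier and score a piece of user-generated text.
hate_classifier = pipeline(
    "text-classification", model="hate", tokenizer="distilbert-base-uncased")
print(hate_classifier("example user-generated text"))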
Disadvantages of Supervised Classifiers
While using foundational models that have been trained on large amounts of text significantly reduces the number of labeled training examples needed to train a classifier, there are some disadvantages to this technique:
- Supervised learning still requires labeled data, which may have to be created manually. This can be time-consuming and expensive to collect.
- Supervised learning models can be sensitive to noise in the data. This means that even a small amount of incorrect or irrelevant data can significantly degrade the performance of the model.
- Supervised learning models can be biased if the training data is biased. This means that the model may learn to make predictions that are not accurate or fair.
N-Shot Classification Using Large Language Models
N-shot classification is a machine learning technique in which a model classifies examples from classes it was never explicitly trained on, given only a description of each class and, optionally, a small number (N) of examples per class. The class descriptions and examples let the model infer the features that distinguish the different classes.
To prompt an LLM to detect bad content, one can use a variety of techniques. One common technique is to use a natural language question, such as "Is this text hate speech?" The LLM can then be used to answer this question by predicting the class of the text. Another technique is to use a prompt that provides more information about the text, such as "This text contains the word 'hate' and the phrase 'kill all Immigrants.' Is it hate speech?" The LLM can then use this information to make a more informed decision about the class of the text. In addition to the question, a few examples can be provided as part of the prompt to help the LLM improve its performance.
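Those examples can be supplied as prior conversation turns in a chat-style prompt. Here is a sketch of what such a few-shot message list might look like; the example texts and wording are illustrative, and real prompts would use examples drawn from the platform's own policy and labeled data:
# Illustrative few-shot (N-shot) prompt for a chat-style LLM; the example
# texts and the exact wording are assumptions, not a prescribed format.
few_shot_messages = [
    {"role": "system",
     "content": "You are an expert content moderator. Answer Yes or No."},
    {"role": "user", "content": "Is this hate speech? ```I love this community```"},
    {"role": "assistant", "content": "No"},
    {"role": "user", "content": "Is this hate speech? ```kill all Immigrants```"},
    {"role": "assistant", "content": "Yes"},
    {"role": "user", "content": "Is this hate speech? ```<text to classify>```"},
]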
The advantages of using LLMs for zero-shot classification of harmful content include:
- LLMs can be trained on large datasets of text and code, which makes them more robust to variations in the way that harmful content is written.
- They can be used to classify harmful content from previously unseen classes and subclasses without receiving any specific training for those classes. This makes them well-suited for emerging forms of harmful content.
- They can be used to detect harmful content in a variety of languages. This makes them a valuable tool for global content moderation.
- Most importantly, a big dataset is not needed for training a supervised classifier, which can reduce operational costs and time to launch.
Here is some sample ChatGPT API code to detect hate speech. It uses zero-shot classification, but an N-shot version would be similar, with examples added to the prompt. Note how much less code is required than for the supervised approach.
import openai

openai.api_key = "YOUR_API_KEY"  # insert your OpenAI API key here

def detect_hate(input_text):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # Or gpt-4
        temperature=0,  # For a deterministic response with the most likely answer
        messages=[
            {"role": "system",
             "content": "You are an expert content moderator."},
            {"role": "user",
             "content": "Is this hate speech? ```%s```" % input_text}])
    return response["choices"][0]["message"]["content"]
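A quick usage sketch; the exact wording of the reply varies by model and run, so production code would typically constrain the output (for example, asking for "Yes" or "No" only):
# Classify one of the example texts from the prompt discussion above.
verdict = detect_hate("kill all Immigrants")
print(verdict)  # e.g., an answer indicating this is hate speech; wording varies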
Disadvantages of Using LLMs for Zero-Shot/N-Shot Classification
- They can be computationally expensive to train and deploy. Training a new large language model from scratch is highly discouraged; instead, use proprietary models such as GPT-4, PaLM 2, or Claude 2, or open-source models such as Llama 2 and Falcon. Even with these models, inference can be computationally expensive.
- They can be susceptible to bias, which can lead to misclassification of harmful content.
- It is hard to scale detection horizontally, as proprietary models enforce their own rate limits (see the retry sketch after this list).
- Using an external model also requires sharing potentially sensitive, private user-generated data with a third party.
- LLM inference adds latency, and calls to an external service add further latency that grows with the size of the prompt.
- While a training dataset is not needed, it is still important to evaluate prompts for performance; small changes to a prompt can lead to large changes in classification quality.
- Model-specific prompt engineering that does not transfer across models may be required, which involves some initial learning investment.
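For example, rate limits might be handled with a simple retry-and-backoff wrapper around the detect_hate function above (a sketch assuming the pre-1.0 openai client, whose openai.error.RateLimitError exception is caught here; the backoff parameters are illustrative):
import time
import openai

def detect_hate_with_retry(input_text, max_retries=5):
    # Retry with exponential backoff when the provider throttles requests.
    for attempt in range(max_retries):
        try:
            return detect_hate(input_text)
        except openai.error.RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    raise RuntimeError("Still rate limited after %d attempts" % max_retries)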
Conclusion
Harmful content detection is a challenging but important task. With the right approach, it is possible to build systems that effectively detect harmful content and protect users from harm. Large language models enable N-shot classification, letting a team quickly launch classifiers for a wide range of harm types across languages without a large training dataset. Supervised detection with smaller models, given good training data, lets the team do the same in house, at lower latency and cost, and at scale.