Top 10 Hugging Face Datasets
Explore 10 Hugging Face datasets to train your newest NLP model.
The most important task in any machine learning project is finding or building a dataset that suits your algorithm. Without the right foundation, your machine learning model may not perform as intended.
While well-known sites such as Kaggle let you download and use thousands of suitable datasets, a few other dataset providers are growing in popularity. In this article, we will cover one of them: Hugging Face.
Hugging Face is an open-source dataset provider used mainly for its natural language processing (NLP) datasets. What is an NLP dataset? What are some of its uses?
NLP is a branch of artificial intelligence responsible for computer and human interaction using natural languages. It focuses on processing large amounts of human-understandable language (usually in text format) to extract hidden patterns and insights.
NLP has many benefits and real-life applications such as: categorizing items (text), detecting hate speech, and filtering out spam e-mails and messages.
Below we’ll take a deeper dive into NLP datasets provided by Hugging Face, what data they contain, how it is organized, and what they can be utilized for.
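All of the datasets below can be pulled with the Hugging Face `datasets` library. As a minimal, hedged sketch of the general pattern (using the IMDB dataset ID as a stand-in):

```python
# Minimal sketch: load any Hugging Face dataset by its ID.
# Install the library first: pip install datasets
from datasets import load_dataset

dataset = load_dataset("imdb")     # swap "imdb" for any dataset ID covered below
print(dataset)                     # shows the available splits and their columns
print(dataset["train"][0])         # inspect a single example
```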
Top 10 Hugging Face Datasets List
1. IMDB Dataset
The IMDB dataset provides users with over 50,000 highly polar movie reviews that are labeled as either ‘positive’ or ‘negative’ depending on the written comment.
The data is divided into two equal parts, one for training and one for testing, with additional unlabeled data in case the user requires it. This dataset can be used to train a model that detects positive and negative movie feedback in text. Moreover, it can help identify the features a movie was particularly enjoyed or disliked for.
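As a hedged sketch of how this could look in practice, the snippet below trains a simple bag-of-words sentiment classifier on a slice of the IMDB reviews. The `text` and `label` column names come from the public `imdb` dataset card; the subset sizes and the scikit-learn model are illustrative choices, not part of the dataset.

```python
# Sketch: TF-IDF + logistic regression on a subset of IMDB reviews.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = load_dataset("imdb", split="train").shuffle(seed=42).select(range(5_000))
test = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1_000))

vectorizer = TfidfVectorizer(max_features=20_000)
X_train = vectorizer.fit_transform(train["text"])
X_test = vectorizer.transform(test["text"])

clf = LogisticRegression(max_iter=1_000).fit(X_train, train["label"])
print("accuracy:", clf.score(X_test, test["label"]))
```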
2. Amazon Polarity Dataset
This dataset contains over 35 million product reviews from Amazon. Each data point includes the customer’s review and the rating for the given product. Each data point is classified as either a positive review or negative review, depending on whether the customer liked or disliked the product.
This type of labeled dataset is useful in NLP and machine learning. By using the Amazon Polarity dataset, companies can boost their advertising and marketing capabilities. In marketing, for example, NLP techniques let marketers see which products a customer liked and which features convinced the customer to buy.
Similar datasets include the Yelp review full dataset, which contains a massive number of reviews labeled by their star rating (from 1 to 5). As with the Amazon dataset mentioned earlier, using a dataset like this in NLP can benefit the marketing efforts of a restaurant or service company.
Furthermore, the Amazon Polarity or Yelp review datasets can be used in recommendation systems to classify products or businesses into different categories. Categorization helps an app or website filter customer preferences and keep its catalog organized.
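Since the dataset holds tens of millions of reviews, streaming a handful of examples is a cheap way to inspect it before committing to a full download. This is a hedged sketch; the `title`, `content`, and `label` column names are assumptions based on the public `amazon_polarity` dataset card.

```python
# Sketch: stream a few Amazon Polarity reviews without downloading everything.
from datasets import load_dataset

reviews = load_dataset("amazon_polarity", split="train", streaming=True)
for example in reviews.take(3):
    # label 1 is assumed to mean a positive review, 0 a negative one
    print(example["label"], "-", example["title"])
```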
3. Emotion Dataset
The Emotion dataset classifies English Twitter messages into one of six categories:
- Sadness
- Joy
- Love
- Anger
- Fear
- Surprise
This type of dataset can be used to train and test an NLP model that captures a user's emotion from a passage of text they have written. Other uses include detecting and flagging abusive messages (hate speech) by leveraging the anger and sadness categories.
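As a quick, hedged sketch, the snippet below loads the dataset and maps each integer label back to one of the six emotion names listed above. The `emotion` dataset ID and the `text`/`label` columns are assumptions based on the public dataset card (newer copies may live under `dair-ai/emotion`).

```python
# Sketch: read one tweet and its emotion label by name.
from datasets import load_dataset

emotion = load_dataset("emotion", split="train")
label_names = emotion.features["label"].names   # e.g. ['sadness', 'joy', 'love', ...]

example = emotion[0]
print(example["text"], "->", label_names[example["label"]])
```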
A similar resource is the Tweet Eval dataset, which classifies users' tweets by the emoji they best match, including laughter, love, happiness, and more. Like the Emotion dataset, Tweet Eval can be used for NLP tasks that focus on emotions represented as emojis.
4. Common Voice Dataset
This dataset contains a mix of recorded and textual data points. The Common Voice dataset holds over 9,000 hours of recorded speech together with the corresponding written transcripts. Additional data points such as the age, gender, and accent of the speaker are also available to help boost a model's voice detection performance.
This dataset can be utilized to create and improve the accuracy of a voice detection model capable of understanding over 60 languages from all over the world. Programs that utilize voice detection models are becoming more ingrained in mainstream technology such as Google Home, Alexa, and Siri, all of which need to understand multiple users’ voice input.
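Because of its sheer size, streaming is the practical way to peek at Common Voice without downloading thousands of hours of audio. This is a hedged sketch: the `common_voice` dataset ID, the `en` configuration, and the `sentence`/`audio` columns are assumptions, and newer releases live under `mozilla-foundation/common_voice_*` repositories that may require accepting terms and authenticating.

```python
# Sketch: stream a single Common Voice example and look at its transcript.
from datasets import load_dataset

voice = load_dataset("common_voice", "en", split="train", streaming=True)
sample = next(iter(voice))

print(sample["sentence"])                  # the written transcript
print(sample["audio"]["sampling_rate"])    # audio metadata for the recording
```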
5. Silicone Dataset
This dataset classifies utterances as commissive, directive, informative, or question. The SILICONE dataset covers a variety of domains, including phone conversations, television dialogue, and more. All of the data points are written in English.
This dataset can be used for training and evaluating natural language understanding systems designed specifically for spoken language.
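SILICONE is split into several sub-corpora, so you pick a configuration when loading it. This is a hedged sketch: the `dyda_da` configuration name (DailyDialog dialogue acts) is an assumption; listing the configurations first avoids guessing wrong.

```python
# Sketch: list SILICONE's configurations, then load one of them.
from datasets import get_dataset_config_names, load_dataset

print(get_dataset_config_names("silicone"))        # every available sub-corpus
dyda = load_dataset("silicone", "dyda_da", split="train")
print(dyda[0])                                     # one labeled utterance
```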
6. Yahoo Answers Topics Dataset
Containing a large number of questions and their respective answers, the Yahoo Answers Topics dataset classifies each data point (question and answer) into a given category. Categories include sports, business & finance, society & culture, science & mathematics, family & relationships, computers & the internet, and more.
This dataset can be utilized to train a model to categorize certain questions and answers into one of these categories.
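A hedged sketch of inspecting the dataset and its topic names follows; the `yahoo_answers_topics` ID and the `topic`/`question_title` columns are assumptions based on the public dataset card.

```python
# Sketch: print the topic names and one example question.
from datasets import load_dataset

yahoo = load_dataset("yahoo_answers_topics", split="train")
topics = yahoo.features["topic"].names    # e.g. 'Society & Culture', 'Sports', ...
print(topics)
print(yahoo[0]["question_title"], "->", topics[yahoo[0]["topic"]])
```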
7. Hate Speech Dataset
CONTENT WARNING: Note that this dataset contains offensive text. The hate speech dataset contains a sample of text messages obtained from the Stormfront forum. Each data point is labeled as either a hate or non-hate message depending on its contents. As the name implies, this type of dataset can be used to train a model to detect hate speech across different online forums.
A similar dataset is the hate speech offensive dataset, which contains the same type of content. Either one can be used to train a model that filters and blocks offensive language on forums, in video games aimed at younger audiences, and in search bar queries.
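A hedged sketch of a very simple classifier on the Stormfront-based dataset follows. The `hate_speech18` ID, the `text`/`label` columns, and the assumption that labels 0 and 1 correspond to the non-hate/hate classes all come from the public dataset card; a production filter would need far more care than this.

```python
# Sketch: bag-of-words classifier for hate vs. non-hate messages.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

ds = load_dataset("hate_speech18", split="train")
ds = ds.filter(lambda x: x["label"] in (0, 1))   # keep only the non-hate / hate rows

vectorizer = TfidfVectorizer(max_features=10_000)
X = vectorizer.fit_transform(ds["text"])
clf = LogisticRegression(max_iter=1_000).fit(X, ds["label"])
```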
8. Scan Dataset
The SCAN dataset is a set of simple language-driven navigation tasks for studying compositional learning and zero-shot generalization.
Each data point pairs a natural-language command with the action sequence it should produce. For example, the command "walk opposite left twice" maps to turning around (via two left turns) and walking, performed twice.
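A hedged sketch of looking at SCAN's command/action pairs follows. The `scan` ID, its `simple` configuration, and the `commands`/`actions` column names are assumptions based on the public dataset card.

```python
# Sketch: print one SCAN command and the action sequence it maps to.
from datasets import load_dataset

scan = load_dataset("scan", "simple", split="train")
example = scan[0]
print(example["commands"])   # e.g. "walk opposite left twice"
print(example["actions"])    # the corresponding low-level action sequence
```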
9. SMS Spam Dataset
The SMS spam dataset contains over 5,000 English SMS messages that are categorized as either spam or ham (non-spam) messages.
Filtering out spam messages is one of the main uses of NLP. You can also train an e-mail filtering system, or any other system that requires spam filtering, by using a labeled spam dataset.
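A minimal, hedged spam-filter sketch using a Naive Bayes classifier is shown below. The `sms_spam` ID, its single `train` split, and the `sms`/`label` columns are assumptions based on the public dataset card.

```python
# Sketch: word-count features + Naive Bayes for spam vs. ham messages.
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

spam = load_dataset("sms_spam", split="train")
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(spam["sms"])
clf = MultinomialNB().fit(X, spam["label"])

# Classify a new message (the 0/1 ham/spam meaning is assumed from the label order).
print(clf.predict(vectorizer.transform(["You won a free prize, click here!"])))
```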
10. Banking 77 Dataset
The Banking 77 dataset is more fine-grained: it contains over 13,000 customer messages (complaints and issues) sent to banks.
Each data point is categorized into one of seventy-seven different intents. Intents include the customer inquiring about card arrival, card not working issues, an extra charge on the card, and declined transfer issues.
Using this type of dataset would allow banks to respond quickly and sort different customer issues into a more organized structure for later use. Similar models can be built for any business that receives large amounts of customer requests daily. But first, a well-filtered and processed dataset needs to be provided to train the model.
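A hedged sketch of inspecting the 77 intents follows; the `banking77` ID and the `text`/`label` columns are assumptions based on the public dataset card.

```python
# Sketch: list a few Banking 77 intents and label one customer message.
from datasets import load_dataset

banking = load_dataset("banking77", split="train")
intents = banking.features["label"].names
print(len(intents), "intents, for example:", intents[:5])
print(banking[0]["text"], "->", intents[banking[0]["label"]])
```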
Other Interesting Hugging Face Datasets
Here are three additional interesting datasets from Hugging Face to explore.
1. LIAR Dataset
The LIAR dataset contains more than 12,000 labeled statements made by politicians and other public figures.
Each statement is classified by its truthfulness, for example as false, half true, mostly true, or true.
Using the LIAR dataset, a machine learning model may be able to judge the truthfulness of similar future statements.
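A hedged sketch of loading LIAR follows; the `liar` dataset ID and the `statement`/`label` columns are assumptions based on the public dataset card.

```python
# Sketch: read one statement and the dataset's truthfulness categories.
from datasets import load_dataset

liar = load_dataset("liar", split="train")
print(liar[0]["statement"])
print(liar.features["label"].names)   # the truthfulness labels used by the dataset
```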
2. Google Well-Formed Query Dataset
Created by crowdsourcing "well-formed" annotations for 25,100 queries from the Paralex corpus, this Google query dataset labels every data point by how well-formed the query is.
Five annotators rate each query as either well-formed or not.
By using this dataset, machine learning models can learn to predict how well-formed a given query is.
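A hedged sketch of reading the query ratings follows; the `google_wellformed_query` ID and the `content`/`rating` columns (where the rating averages the annotators' judgments) are assumptions based on the public dataset card.

```python
# Sketch: print a query and its aggregated well-formedness rating.
from datasets import load_dataset

queries = load_dataset("google_wellformed_query", split="train")
print(queries[0]["content"], "->", queries[0]["rating"])
```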
3. Jfleg Dataset
Considered a gold-standard benchmark, the JFLEG dataset is an English grammatical error correction dataset. Every data point contains a written sentence (with multiple grammatical and spelling mistakes) along with four corrected versions written by four different humans.
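A hedged sketch of inspecting JFLEG follows; the `jfleg` ID, its `validation`/`test` splits, and the `sentence`/`corrections` columns are assumptions based on the public dataset card.

```python
# Sketch: show one noisy sentence and its four human corrections.
from datasets import load_dataset

jfleg = load_dataset("jfleg", split="validation")
print(jfleg[0]["sentence"])        # the original sentence with errors
print(jfleg[0]["corrections"])     # four corrected versions from different annotators
```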
Training with this type of dataset would allow a model to detect and correct the grammatical errors it finds. Note that, as with most machine learning models, perfect grammatical and spelling corrections cannot be guaranteed in every case. One more note: depending on the desired task (spam filter, hate speech detector, review classification), choosing the correct dataset will significantly affect model performance.
Try running your model on a couple of the above-mentioned datasets and then check the achieved performance. You can also search for your own datasets and compare them to the ones covered here.
Using Hugging Face Datasets
With so many potential uses, such as organizing text into categories (for further recommendation-system processing), detecting hate speech, and filtering out spam e-mails, working with NLP is a skill worth learning.
In this article, we explored Hugging Face, an open-source provider of a massive number of datasets dedicated mainly to NLP machine learning models, and covered 10 datasets to help you advance your machine learning career.
We recommend trying some of the examples above and learning how to use these datasets with your machine learning model. Feel free to look for other datasets on Hugging Face or elsewhere to fulfill your model's requirements.