Simple Sentiment Analysis With NLP
Look at a simple application of sentiment analysis using Natural Language Processing techniques.
In this article, I will develop a simple application of sentiment analysis using natural language processing (NLP) techniques.
With the rapid developments in Artificial Intelligence, the number of applications built for natural language processing grows day by day. Applications developed with NLP let us replace manual effort in many jobs with infrastructure that works faster and more accurately. Common examples of applications developed with NLP include the following:
- Text Classification (e.g., spam detection)
- Sentiment Analysis
- Author Recognition
- Machine Translation
- Chatbots
Sentiment analysis is one of the most common applications in natural language processing. With sentiment analysis, we can determine the emotion with which a text was written.
With the widespread use of social media, the need to analyze the content that people share keeps growing. Given the volume of data flowing through social media, it is quite difficult to do this manually, so the need for applications that can quickly detect and respond to the positive or negative comments that people write is increasing. In this article, we will develop a baseline model for simple sentiment analysis.
First of all, here is some information about the data set on which we will perform sentiment analysis.
Data Set Name: Sentiment Labelled Sentences Data Set
Data Set Source: UCI Machine Learning Repository
Data Set Info: This data set was created from user reviews collected via three different websites (Amazon, Yelp, IMDb). The comments consist of restaurant, film, and product reviews. Each record in the data set is labeled with one of two sentiment classes: 1: Positive, 0: Negative.
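Each file in the data set is plain text with one review per line: the sentence and its label, separated by a tab. Illustrative (made-up) lines look like this:
The battery life on this phone is amazing.	1
The screen cracked after two days.	0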
We will create a sentiment analysis model using the data set described above.
We will build the machine learning model in Python using the sklearn and nltk libraries.
Now we can move on to writing our code.
First, let's import the libraries we will use.
import pandas as pd
import numpy as np
import pickle
import sys
import os
import io
import re
from sys import path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
import matplotlib.pyplot as plt
from string import punctuation, digits
from IPython.core.display import display, HTML
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
Now let's load and view our data set.
#Amazon Data
input_file = "../data/amazon_cells_labelled.txt"
amazon = pd.read_csv(input_file,delimiter='\t',header=None)
amazon.columns = ['Sentence','Class']
#Yelp Data
input_file = "../data/yelp_labelled.txt"
yelp = pd.read_csv(input_file,delimiter='\t',header=None)
yelp.columns = ['Sentence','Class']
#Imdb Data
input_file = "../data/imdb_labelled.txt"
imdb = pd.read_csv(input_file,delimiter='\t',header=None)
imdb.columns = ['Sentence','Class']
#combine all data sets
data = pd.DataFrame()
data = pd.concat([amazon, yelp, imdb])
data['index'] = data.index
data
Yes, we imported the data and viewed it. Now, let's look at the statistics about the data.
#Total Count of Each Category
pd.set_option('display.width', 4000)
pd.set_option('display.max_rows', 1000)
distOfDetails = data.groupby(by='Class', as_index=False).agg({'index': pd.Series.nunique}).sort_values(by='index', ascending=False)
distOfDetails.columns =['Class', 'COUNT']
print(distOfDetails)
#Distribution of All Categories
plt.pie(distOfDetails['COUNT'],autopct='%1.0f%%',shadow=True, startangle=360)
plt.show()
As you can see, the data set is very balanced. There are almost equal numbers of positive and negative classes.
Now, before using the data set in the model, let's do a few things to clean the text (preprocessing).
#Text Preprocessing
columns = ['index','Class', 'Sentence']
df_ = pd.DataFrame(columns=columns)
#lower string
data['Sentence'] = data['Sentence'].str.lower()
#remove email addresses
data['Sentence'] = data['Sentence'].replace('[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+', '', regex=True)
#remove IP addresses
data['Sentence'] = data['Sentence'].replace(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}', '', regex=True)
#remove punctuation and special characters
data['Sentence'] = data['Sentence'].str.replace(r'[^\w\s]', '', regex=True)
#remove numbers
data['Sentence'] = data['Sentence'].replace(r'\d', '', regex=True)
#remove stop words
for index, row in data.iterrows():
    word_tokens = word_tokenize(row['Sentence'])
    filtered_sentence = [w for w in word_tokens if not w in stopwords.words('english')]
    df_ = df_.append({"index": row['index'], "Class": row['Class'], "Sentence": " ".join(filtered_sentence)}, ignore_index=True)
data = df_
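To see what these steps do, here is a minimal sketch that applies the same cleaning to a single made-up review (the exact output depends on the NLTK stop-word list):
#sanity check: apply the same cleaning steps to one hypothetical review
sample = "I LOVED this phone!!! Contact me at user@example.com - 10/10"
sample = sample.lower()
sample = re.sub('[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+', '', sample)  #remove email addresses
sample = re.sub(r'[^\w\s]', '', sample)  #remove punctuation
sample = re.sub(r'\d', '', sample)  #remove numbers
tokens = [w for w in word_tokenize(sample) if w not in stopwords.words('english')]
print(" ".join(tokens))  #prints something like: loved phone contact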
We have cleaned the data and made it ready for use in the model. Now, before we build our model, let's split the data set into test (10%) and training (90%) sets.
X_train, X_test, y_train, y_test = train_test_split(data['Sentence'].values.astype('U'),data['Class'].values.astype('int32'), test_size=0.10, random_state=0)
classes = data['Class'].unique()
Now we can create our model using the training data. For the model, I will use TF-IDF as the vectorizer and the Stochastic Gradient Descent (SGD) algorithm as the classifier. These methods and their parameters were chosen using grid search, sketched briefly below (I will not cover grid search in detail in this article).
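For reference, here is a minimal sketch of the kind of grid search that could produce these settings; the parameter ranges shown are illustrative, not the exact grid used.
#illustrative grid search over TF-IDF and SGD hyperparameters
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', use_idf=True, norm='l2')),
    ('clf', SGDClassifier()),
])
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.5, 0.75, 1.0],
    'clf__alpha': [1e-4, 1e-5, 1e-6],
    'clf__penalty': ['l2', 'elasticnet'],
}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)  #inspect the winning combination
With the parameters chosen, we can now train the actual model: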
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
#grid search result
vectorizer = TfidfVectorizer(analyzer='word',ngram_range=(1,2), max_features=50000,max_df=0.5,use_idf=True, norm='l2')
counts = vectorizer.fit_transform(X_train)
vocab = vectorizer.vocabulary_
classifier = SGDClassifier(alpha=1e-05,max_iter=50,penalty='elasticnet')
targets = y_train
classifier = classifier.fit(counts, targets)
example_counts = vectorizer.transform(X_test)
predictions = classifier.predict(example_counts)
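Before the formal evaluation, here is a quick sanity check that reuses the fitted vectorizer and classifier on a couple of made-up sentences:
#quick sanity check on new, made-up sentences
new_sentences = ["the food was absolutely wonderful", "worst purchase i have ever made"]
new_counts = vectorizer.transform(new_sentences)
print(classifier.predict(new_counts))  #should print something like [1 0] (positive, negative)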
Our model is now trained. Let's evaluate it on the test data and examine the accuracy, precision, recall, and F1 results.
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
#Model Evaluation
acc = accuracy_score(y_test, predictions, normalize=True)
hit = precision_score(y_test, predictions, average=None,labels=classes)
capture = recall_score(y_test, predictions, average=None,labels=classes)
print('Model Accuracy:%.2f'%acc)
print(classification_report(y_test, predictions))
Model Accuracy: 0.83

             precision    recall  f1-score   support

          0       0.83      0.84      0.84       139
          1       0.84      0.82      0.83       136

avg / total       0.83      0.83      0.83       275
As we have seen, the accuracy of our model is 83%. Now let's look at the confusion matrix, where we can see more clearly how accurate our predictions are.
#source: https://www.kaggle.com/grfiv4/plot-a-confusion-matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    else:
        print()

    plt.imshow(cm, interpolation='nearest', cmap=cmap, aspect='auto')
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
plt.figure(figsize=(150,100))

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, predictions, labels=classes)
np.set_printoptions(precision=2)
class_names = range(1, classes.size + 1)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, without normalization')

classInfo = pd.DataFrame(data=[])
for i in range(0, classes.size):
    classInfo = classInfo.append([[classes[i], i+1]], ignore_index=True)
classInfo.columns = ['Category', 'Index']
classInfo
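Although pickle is imported at the top, it was not used above. Here is a minimal sketch of how the fitted vectorizer and classifier could be saved for later reuse; the file names are hypothetical.
#persist the fitted vectorizer and classifier for later reuse (hypothetical file names)
with open('sentiment_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(classifier, f)

#later: load them back and classify a new sentence
with open('sentiment_vectorizer.pkl', 'rb') as f:
    loaded_vectorizer = pickle.load(f)
with open('sentiment_classifier.pkl', 'rb') as f:
    loaded_classifier = pickle.load(f)
print(loaded_classifier.predict(loaded_vectorizer.transform(["great product, works perfectly"])))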
With this study, we have developed a simple Natural Language Processing project. As I said at the beginning of the article, our model is a baseline model. The aim of this article was to develop an application that can serve as an introduction to Natural Language Processing, and I hope it has been useful.