Sentiment analysis for Yelp review classification

The dataset

Importing the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
yelp = pd.read_csv('yelp.csv')
yelp.shape
Output: (10000, 10)
yelp.head()
The first 5 rows of our dataset.
yelp.info()
Basic information about each column in our dataset.
yelp.describe()
More information about the numeric columns in our dataset.
yelp['text length'] = yelp['text'].apply(len)
yelp.head()
The first 5 rows of the yelp dataframe with text length feature added at the end.

Exploring the dataset

g = sns.FacetGrid(data=yelp, col='stars')
g.map(plt.hist, 'text length', bins=50)
Histograms of text length distributions for each star rating. Notice that there is a high number of 4-star and 5-star reviews.
sns.boxplot(x='stars', y='text length', data=yelp)
Box plot of text length against star ratings.
stars = yelp.groupby('stars').mean(numeric_only=True)  # numeric_only skips the text columns (required in newer pandas)
stars.corr()
Correlations between cool, useful, funny, and text length.
sns.heatmap(data=stars.corr(), annot=True)
Heat map of correlations between cool, useful, funny, and text length.

Independent and dependent variables

yelp_class = yelp[(yelp['stars'] == 1) | (yelp['stars'] == 5)]
yelp_class.shape
Output: (4086, 11)
X = yelp_class['text']
y = yelp_class['stars']

Text pre-processing

X[0]
Output: 'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'
import string

# stopwords.words('english') requires a one-time nltk.download('stopwords')
def text_process(text):
    '''
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the cleaned text as a list of words
    '''
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
sample_text = "Hey there! This is a sample review, which happens to contain punctuations."
print(text_process(sample_text))
Output: ['Hey', 'sample', 'review', 'happens', 'contain', 'punctuations']
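The same cleaning steps can be sketched without NLTK, using a small hard-coded stopword list in place of stopwords.words('english') (a self-contained illustration, not the article's exact code):

```python
import string

# Tiny stand-in for NLTK's English stopword list (illustrative only)
STOPWORDS = {'this', 'is', 'a', 'which', 'to', 'there'}

def clean(text):
    # 1. Strip punctuation character by character
    no_punc = ''.join(ch for ch in text if ch not in string.punctuation)
    # 2. Drop stopwords (case-insensitive), keep the rest as a word list
    return [w for w in no_punc.split() if w.lower() not in STOPWORDS]

print(clean("Hey there! This is a sample review, which happens to contain punctuations."))
# ['Hey', 'sample', 'review', 'happens', 'contain', 'punctuations']
```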

Vectorisation

CountVectorizer converts each review into a vector of token counts, indicating how many instances of a particular word appear in that review.
from sklearn.feature_extraction.text import CountVectorizer
bow_transformer = CountVectorizer(analyzer=text_process).fit(X)
len(bow_transformer.vocabulary_)
Output: 26435
review_25 = X[24]
review_25
Output: 'I love this place! I have been coming here for ages.
My favorites: Elsa's Chicken sandwich, any of their burgers, dragon chicken wings, china's little chicken sandwich, and the hot pepper chicken sandwich. The atmosphere is always fun and the art they display is very abstract but totally cool!'
bow_25 = bow_transformer.transform([review_25])
bow_25
Output:(0, 2099) 1
(0, 3006) 1
(0, 8909) 1
(0, 9151) 1
(0, 9295) 1
(0, 9616) 1
(0, 9727) 1
(0, 10847) 1
(0, 11443) 3
(0, 11492) 1
(0, 11878) 1
(0, 12221) 1
(0, 13323) 1
(0, 13520) 1
(0, 14481) 1
(0, 15165) 1
(0, 16379) 1
(0, 17812) 1
(0, 17951) 1
(0, 20044) 1
(0, 20298) 1
(0, 22077) 3
(0, 24797) 1
(0, 26102) 1
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(bow_transformer.get_feature_names_out()[11443])
print(bow_transformer.get_feature_names_out()[22077])
Output:
chicken
sandwich
X = bow_transformer.transform(X)
print('Shape of Sparse Matrix: ', X.shape)
print('Amount of Non-Zero occurrences: ', X.nnz)
# Percentage of non-zero values
density = (100.0 * X.nnz / (X.shape[0] * X.shape[1]))
print('Density: {}'.format(density))
Output:
Shape of Sparse Matrix: (4086, 26435)
Amount of Non-Zero occurrences: 222391
Density: 0.2058920276658241
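The density formula is worth sanity-checking on a small sparse matrix where the answer is obvious (a toy example, unrelated to the Yelp data):

```python
import numpy as np
from scipy.sparse import csr_matrix

# 2 x 4 matrix with 3 non-zero entries out of 8 cells
M = csr_matrix(np.array([[0, 1, 0, 2],
                         [0, 0, 3, 0]]))

density = 100.0 * M.nnz / (M.shape[0] * M.shape[1])
print(density)  # 37.5
```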

Training data and test data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

Training our model

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train, y_train)
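The vectoriser-plus-classifier combination can also be packaged as a single scikit-learn Pipeline, so raw text goes in and a star rating comes out (a sketch on made-up reviews, not the Yelp data):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['excellent food, amazing service', 'terrible food, awful service',
         'amazing place, excellent', 'awful experience, terrible']
labels = [5, 1, 5, 1]

model = Pipeline([
    ('bow', CountVectorizer()),   # raw text -> token counts
    ('nb', MultinomialNB()),      # token counts -> star prediction
])
model.fit(texts, labels)

print(model.predict(['excellent service']))  # [5]
```

A Pipeline also avoids a common leak: the vectoriser is fitted only on whatever data fit() receives, never on the test set.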

Testing and evaluating our model

preds = nb.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, preds))
print('\n')
print(classification_report(y_test, preds))
Output:
[[157  71]
 [ 24 974]]

             precision    recall  f1-score   support

          1       0.87      0.69      0.77       228
          5       0.93      0.98      0.95       998

avg / total       0.92      0.92      0.92      1226
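These headline numbers can be recovered by hand from the confusion matrix, where rows are the true labels (1 then 5) and columns the predictions:

```python
# Confusion matrix from above: [[157, 71], [24, 974]]
tp1, fn1 = 157, 71    # true 1-star: 157 predicted 1, 71 predicted 5
fp1, tn1 = 24, 974    # true 5-star: 24 predicted 1, 974 predicted 5

precision_1 = tp1 / (tp1 + fp1)                    # 157 / 181  ~ 0.87
recall_1 = tp1 / (tp1 + fn1)                       # 157 / 228  ~ 0.69
accuracy = (tp1 + tn1) / (tp1 + fn1 + fp1 + tn1)   # 1131 / 1226 ~ 0.92

print(round(precision_1, 2), round(recall_1, 2), round(accuracy, 2))
```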

Data Bias

Note that the test set is heavily skewed towards 5-star reviews (998 versus 228 1-star reviews). This imbalance is why recall on the 1-star class is noticeably lower: the model sees far more positive examples and leans towards predicting 5.

Predicting a singular positive review

positive_review = yelp_class['text'][59]
positive_review
Output: 'This restaurant is incredible, and has the best pasta carbonara and the best tiramisu I've had in my life. All the food is wonderful, though. The calamari is not fried. The bread served with dinner comes right out of the oven, and the tomatoes are the freshest I've tasted outside of my mom's own garden. This is great attention to detail.\n\nI can no longer eat at any other Italian restaurant without feeling slighted. This is the first place I want take out-of-town visitors I'm looking to impress.\n\nThe owner, Jon, is helpful, friendly, and really cares about providing a positive dining experience. He's spot on with his wine recommendations, and he organizes wine tasting events which you can find out about by joining the mailing list or Facebook page.'
positive_review_transformed = bow_transformer.transform([positive_review])
nb.predict(positive_review_transformed)[0]
Output: 5

Predicting a singular negative review

negative_review = yelp_class['text'][281]
negative_review
Output: 'Still quite poor both in service and food. maybe I made a mistake and ordered Sichuan Gong Bao ji ding for what seemed like people from canton district. Unfortunately to get the good service U have to speak Mandarin/Cantonese. I do speak a smattering but try not to use it as I never feel confident about the intonation. \n\nThe dish came out with zichini and bell peppers (what!??) Where is the peanuts the dried fried red peppers and the large pieces of scallion. On pointing this out all I got was " Oh you like peanuts.. ok I will put some on" and she then proceeded to get some peanuts and sprinkle it on the chicken.\n\nWell at that point I was happy that atleast the chicken pieces were present else she would probably end up sprinkling raw chicken pieces on it like the raw peanuts she dumped on top of the food. \n\nWell then I spoke a few chinese words and the scowl turned into a smile and she then became a bit more friendlier. \n\nUnfortunately I do not condone this type of behavior. It is all in poor taste...'
negative_review_transformed = bow_transformer.transform([negative_review])
nb.predict(negative_review_transformed)[0]
Output: 1

Where the model goes wrong…

another_negative_review = yelp_class['text'][140]
another_negative_review
Output: 'Other than the really great happy hour prices, its hit or miss with this place. More often a miss. :(\n\nThe food is less than average, the drinks NOT strong ( at least they are inexpensive) , but the service is truly hit or miss.\n\nI'll pass.'
another_negative_review_transformed = bow_transformer.transform([another_negative_review])
nb.predict(another_negative_review_transformed)[0]
Output: 5

Why the incorrect prediction?

A bag-of-words model treats every word in isolation, so the strongly positive tokens in this review ("really great happy hour prices") outweigh the negative sentiment, and phrases like "hit or miss" or "I'll pass" carry no special meaning once they are split into individual word counts.
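This bag-of-words limitation is easy to demonstrate: two sentences with opposite meanings can produce identical count vectors (a tiny made-up pair):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
pair = ['the service was good, not bad', 'the service was bad, not good']
bow = vec.fit_transform(pair)

# Both sentences contain exactly the same words, so their count
# vectors are identical and no classifier can tell them apart
print((bow[0] != bow[1]).nnz == 0)  # True: the two rows are equal
```

N-gram features (e.g. CountVectorizer(ngram_range=(1, 2))) are one common way to let the model see short phrases such as "not good".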

Rayudu yarlagadda

Engineer, problem solver, tech enthusiast