Amazon Fine Food Reviews Dataset

Rayudu yarlagadda
5 min read · Aug 18, 2018

Objective: Given a review, determine whether the review is positive or negative.

The following are the steps we follow:

  1. Dataset overview
  2. Loading the dataset
  3. Exploratory data analysis
  4. Data Preprocessing
  5. Training and Testing
  6. Modeling
  7. Predictions

1. Dataset overview

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Source: https://www.kaggle.com/amanai/amazon-fine-food-review-sentiment-analysis

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 — Oct 2012
Number of Attributes/Columns in data: 10

The columns or features in the dataset:

  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — brief summary of the review
  10. Text — text of the review

2. Loading the dataset

Before loading the dataset, let's import a couple of libraries we will need throughout the problem.

We load the dataset using pandas:

import pandas as pd
import numpy as np

# Read the reviews CSV into a DataFrame
df = pd.read_csv('Reviews.csv')
df.head()

In the code above, the .head() function displays the first five rows of our dataset.

First five rows of the dataset

3. Exploratory Data Analysis:

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

In this step we remove duplicate and missing values, and we focus on the ‘Text’ and ‘Score’ columns, because these two columns are what we need to predict review sentiment.

The Score column ranges from 1 to 5, and we remove all rows where the Score equals 3, because we assume these are neutral and do not provide any useful information. We then add a new column called “Positivity”: a score above 3 is represented as a 1, meaning the review was positively rated. Otherwise it is represented as a 0, indicating it was negatively rated.

df.dropna(inplace=True)                             # drop rows with missing values
df = df.drop_duplicates()                           # drop duplicate reviews
df = df[df['Score'] != 3]                           # drop neutral (score 3) reviews
df['Positivity'] = np.where(df['Score'] > 3, 1, 0)  # 1 = positive, 0 = negative
df.head()

In the output above we can see that the ‘Positivity’ column has been appended to the dataset, encoded so that scores below 3 are represented as ‘0’ and scores above 3 as ‘1’.

4. Data Preprocessing:

Now that we have finished deduplication, our data requires some preprocessing before we go further with the analysis and build the prediction model.

Hence, in the preprocessing phase we do the following:

  1. Begin by removing the HTML tags
  2. Remove any punctuation or limited set of special characters like , or . or # etc.
  3. Check if the word is made up of English letters and is not alphanumeric
  4. Check to see if the length of the word is greater than 2 (as it has been researched that there are no 2-letter adjectives)
  5. Convert the word to lowercase
  6. Remove stopwords
  7. Finally, Snowball stem the word (it was observed to be better than Porter stemming)

The snippet below sketches the functions we need: removing stopwords, punctuation, and HTML tags, plus a stemming step. After that we use n-grams to convert the text into numeric form; n-gram extraction is built into scikit-learn’s CountVectorizer.
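A minimal sketch of these preprocessing steps, assuming NLTK’s English stopword list and Snowball stemmer (the helper name clean_review is our own):

import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

stop_words = set(stopwords.words('english'))  # requires nltk.download('stopwords')
stemmer = SnowballStemmer('english')

def clean_review(text):
    text = re.sub(r'<[^>]+>', ' ', text)    # 1. remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # 2. remove punctuation/special chars
    words = [stemmer.stem(w)                # 7. Snowball stemming
             for w in text.lower().split()  # 5. lowercase
             if len(w) > 2                  # 3-4. alphabetic words longer than 2
             and w not in stop_words]       # 6. remove stopwords
    return ' '.join(words)

df['Text'] = df['Text'].apply(clean_review)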

The advantage of using n-grams is that they consider pairs of consecutive words. A simple bag of words with stopwords removed can lose negations (for example, “not tasty” collapses to “tasty”), which leads to misclassification in some cases, so it is usually better to use n-grams.
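To see what bigrams add, here is a quick toy check (the input sentence is our own example):

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) extracts both single words and pairs of adjacent words
demo = CountVectorizer(ngram_range=(1, 2)).fit(['this pasta is not tasty'])
print(demo.get_feature_names())  # on scikit-learn >= 1.0, use get_feature_names_out()
# ['is', 'is not', 'not', 'not tasty', 'pasta', 'pasta is', 'tasty', 'this', 'this pasta']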

from sklearn.feature_extraction.text import CountVectorizer

# Keep terms that appear in at least 5 documents; use unigrams and bigrams.
# (X_train comes from the train/test split shown in the next section.)
vect = CountVectorizer(min_df=5, ngram_range=(1, 2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
len(vect.get_feature_names())

5. Training and Testing:

As we have finished processing the review text, it's time to split our data into a training set and a test set using train_test_split from scikit-learn. We will use 30% of the dataset for testing; the remaining 70% is for training.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Positivity'],
                                                    test_size=0.3, random_state=0)
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)
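Because this dataset is heavily skewed toward positive reviews, it is also worth a quick look at the class balance of the target:

# Fraction of positive (1) vs. negative (0) reviews in the training set
print(y_train.value_counts(normalize=True))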

6. Modeling

Logistic Regression

Here we can apply logistic regression straight away, because we know that logistic regression works well for binary classification. It also works well for high-dimensional sparse data like our n-gram count matrix.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
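As a quick sanity check (this goes a step beyond the walkthrough), we can inspect which n-grams receive the smallest and largest coefficients, using the vect and model objects from above:

import numpy as np

feature_names = np.array(vect.get_feature_names())
sorted_coef_index = model.coef_[0].argsort()

# n-grams that push a prediction toward negative (smallest coefficients)
print('Smallest coefs:\n', feature_names[sorted_coef_index[:10]])
# n-grams that push a prediction toward positive (largest coefficients)
print('Largest coefs:\n', feature_names[sorted_coef_index[:-11:-1]])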

7. Predictions

Here, we will make predictions on X_test and compute the area under the ROC curve (AUC) on y_test.

from sklearn.metrics import roc_auc_score

# The model is already fitted above; transform X_test with the same vectorizer
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))

AUC: 0.838772959029
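Note that we passed hard 0/1 predictions to roc_auc_score; feeding it a continuous score usually gives a more faithful AUC. A small sketch with the same fitted model:

# decision_function returns the signed distance from the separating hyperplane
scores = model.decision_function(vect.transform(X_test))
print('AUC (continuous scores): ', roc_auc_score(y_test, scores))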

Let’s test our model:

print(model.predict(vect.transform(['This pasta is not tasty','The pasta is not worst , I will eat them again'])))

[0 1]

With logistic regression we reach an AUC of about 0.84, and the bigram features let the model classify the negated review “not worst” correctly.

Conclusion:

We successfully built a model that predicts whether a given review is positive or negative, with an AUC of about 0.84.
