Python | NLP analysis of restaurant reviews



# Importing libraries
import numpy as np
import pandas as pd

# Import dataset (tab-separated file)
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t')

To download the Restaurant_Reviews.tsv dataset used here, click here.

Step 2: Clean up or preprocess the text

  • Remove punctuation and numbers: they add little to text processing; if kept, they only enlarge the bag of words created in the last step and reduce the efficiency of the algorithm.
  • Stemming: reduce each word to its root.
  • Convert each word to lowercase: it is useless to keep the same word in different cases (for example, “Good” and “good”).
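
The clean-up steps above can be sketched with the standard library alone. This is a minimal illustration, not the NLTK pipeline used in this tutorial: the `clean_review` helper and the tiny `TOY_STOPWORDS` set are hypothetical, and stemming is left out because it needs a real stemmer such as NLTK's PorterStemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the tutorial itself relies on NLTK's English stop-word list.
TOY_STOPWORDS = {"the", "is", "a", "was", "this"}


def clean_review(text):
    """Keep letters only, lowercase, split into words, drop toy stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # punctuation/digits -> spaces
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in TOY_STOPWORDS)


print(clean_review("Wow... The food was GOOD, 10/10!"))  # wow food good
```

Replacing unwanted characters with a space rather than an empty string keeps neighbouring words from being glued together before the split.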

# Library to clean up data
import re

# Natural Language Toolkit
import nltk

# Download the stop-word list
nltk.download('stopwords')

# To remove stop words
from nltk.corpus import stopwords

# For stemming
from nltk.stem.porter import PorterStemmer

# Initialize an empty list
# to hold the cleaned text
corpus = []

  
# Create a PorterStemmer object to
# take the main stem of each word
ps = PorterStemmer()

# Build the set of English stop words once
stop_words = set(stopwords.words('english'))

# Clean the 1000 reviews (rows)
for i in range(0, 1000):

    # Column "Review", row i: replace every
    # non-letter character with a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

    # Convert everything to lowercase
    review = review.lower()

    # Split into a list of words (the default delimiter is whitespace)
    review = review.split()

    # Stem every word of the i-th review
    # that is not a stop word
    review = [ps.stem(word) for word in review
              if word not in stop_words]

    # Join the list of words
    # back into a single string
    review = ' '.join(review)

    # Append each cleaned string to build
    # the list of clean text
    corpus.append(review)

Examples: before and after applying the above code (reviews => before, corpus => after)

Step 3: Creating the bag-of-words model

For this we need the CountVectorizer class from sklearn.feature_extraction.text.
We can also cap the number of features with the "max_features" attribute, which keeps only the most frequent words. Fit the vectorizer on the corpus and apply the same transformation to it in one call, ".fit_transform(corpus)", then convert the result to an array. Whether a review is positive or negative is stored in the second column of the dataset, dataset.iloc[:, 1]: all rows, column 1 (indexed from zero).
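
To make the idea of a bag of words concrete, here is a minimal pure-Python sketch of the counting it performs. It is a simplified stand-in, not CountVectorizer's implementation: the `bag_of_words` helper and the toy corpus are made up for illustration, and words are split on whitespace only.

```python
from collections import Counter


def bag_of_words(corpus, max_features):
    """Return (vocabulary, matrix) of word counts for a list of cleaned strings."""
    # Count every word across the whole corpus.
    totals = Counter(word for doc in corpus for word in doc.split())
    # Keep the most frequent words, then sort them to fix the column order.
    vocabulary = sorted(word for word, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary word.
    matrix = []
    for doc in corpus:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


toy_corpus = ["love place", "good place", "love good food place"]
vocabulary, X_toy = bag_of_words(toy_corpus, max_features=3)
print(vocabulary)  # ['good', 'love', 'place']
print(X_toy)       # [[0, 1, 1], [1, 0, 1], [1, 1, 1]]
```

The word "food" occurs only once, so it is dropped when only the three most frequent words are kept.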

# Create the bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer

# Extract at most 1500 features.
# "max_features" is an attribute to
# experiment with for better results
cv = CountVectorizer(max_features=1500)

# X contains the vectorized corpus (independent variable)
X = cv.fit_transform(corpus).toarray()

# y contains the answer: whether each review
# is positive or negative (dependent variable)
y = dataset.iloc[:, 1].values

Description of the dataset to be used:

  • Columns are separated by \t (tab)
  • The first column contains the text of each review
  • In the second column, 0 marks a negative review and 1 a positive review
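
The layout described above can be illustrated with a small in-memory sample parsed by the standard csv module. The two sample rows and the second column name ("Liked") are assumptions made for this sketch, and csv.DictReader is used only as a dependency-free stand-in for pd.read_csv.

```python
import csv
import io

# Made-up sample in the layout described above: a header row, then one
# review per line with a tab between the text and the 0/1 label.
# The column name "Liked" is an assumption for this sketch.
sample_tsv = "Review\tLiked\nLoved the pasta\t1\nService was slow\t0\n"

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = [(row["Review"], int(row["Liked"])) for row in reader]
print(rows)  # [('Loved the pasta', 1), ('Service was slow', 0)]
```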

Step 5: Splitting the corpus into training and test sets. For this we need the train_test_split function from sklearn.model_selection (older scikit-learn releases exposed it in sklearn.cross_validation, which has since been removed). The split can be 70/30, 80/20, 85/15 or 75/25; here I choose 75/25 via "test_size".
X is the bag of words, y is 0 or 1 (negative or positive).

# Splitting the dataset into
# the training set and the test set
from sklearn.model_selection import train_test_split

# Experiment with "test_size"
# to get better results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
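
What such a split does can be sketched in a few lines of plain Python. The `simple_train_test_split` helper below is hypothetical and much simpler than scikit-learn's train_test_split; it only shows the idea of shuffling the rows and holding out a fraction of them while keeping each row paired with its label.

```python
import math
import random


def simple_train_test_split(X, y, test_size=0.25, seed=0):
    """Shuffle row indices, then hold out a test_size fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


X_demo = [[i] for i in range(8)]    # eight one-feature rows
y_demo = [i % 2 for i in range(8)]  # alternating 0/1 labels
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te))  # 6 2
```

Fixing the seed makes the shuffle repeatable, which helps when comparing results across runs.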

Step 6: Fitting a predictive model (here, random forest)

  • Since random forest is an ensemble model (made of many trees), import the RandomForestClassifier class from sklearn.ensemble
  • Use 501 trees ("n_estimators") and the entropy criterion
  • Fit the model using the .fit() method with X_train and y_train as arguments

# Fitting random forest classification
# to the training set
from sklearn.ensemble import RandomForestClassifier

# n_estimators is the number of
# trees; experiment with n_estimators
# to get better results
model = RandomForestClassifier(n_estimators=501,
                               criterion='entropy')

model.fit(X_train, y_train)
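
As background on the entropy criterion, the sketch below computes the Shannon entropy of a list of class labels. It is only an illustration of the quantity, not scikit-learn's tree-building code, and the `entropy` helper is hypothetical. Broadly, a tree prefers splits whose resulting groups have lower entropy, meaning they are closer to containing a single class.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


print(entropy([0, 0, 1, 1]))            # 1.0 (evenly mixed labels)
print(round(entropy([0, 1, 1, 1]), 3))  # 0.811 (mostly one class)
```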

Step 7: Predicting the final results using the .predict() method with X_test as the argument

# Predict the test set results
y_pred = model.predict(X_test)

y_pred

Note: Accuracy with random forest was 72% (it may vary with a different test size; here test_size = 0.25).

Step 8: To assess the accuracy, build a confusion matrix.

For two classes, the confusion matrix is a 2×2 matrix.

TRUE POSITIVE: the number of actual positives that are correctly identified as positive.
TRUE NEGATIVE: the number of actual negatives that are correctly identified as negative.
FALSE POSITIVE: the number of actual negatives that are incorrectly identified as positive.
FALSE NEGATIVE: the number of actual positives that are incorrectly identified as negative.

Note: True or false indicates whether the assigned classification is correct or incorrect, while positive or negative refers to the category that was assigned.
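
The four cells can be counted by hand, as in the minimal sketch below. The `confusion_counts` helper and the demo labels are made up for illustration; the returned layout follows the convention of sklearn's confusion_matrix for labels 0 and 1, where rows are actual classes and columns are predicted classes.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels where 1 means positive."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    # Rows are actual classes, columns are predicted classes.
    return [[tn, fp], [fn, tp]]


y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true_demo, y_pred_demo))  # [[2, 1], [1, 2]]
```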

# Create the confusion matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

cm
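
Accuracy can be read off a confusion matrix as the share of predictions on the diagonal. The sketch below assumes the matrix is a square nested list (or array) with actual classes as rows; the `accuracy_from_cm` helper is hypothetical, and the numbers in `cm_demo` are made up for illustration and are not the output of the model above.

```python
def accuracy_from_cm(cm):
    """Accuracy = correct predictions (the diagonal) / all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Made-up matrix for illustration, not the result of the model above.
cm_demo = [[90, 35], [30, 95]]
print(accuracy_from_cm(cm_demo))  # 0.74
```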