Applying the multinomial Naive Bayes approach to NLP problems

Naive Bayes classifiers are based on Bayes' theorem:

P(c | x) = P(x | c) * P(c) / P(x)

where c is a class (tag) and x is the observed data (the text).

Naive Bayes is widely used in natural language processing (NLP) tasks. A Naive Bayes classifier predicts the tag of a text: it computes the probability of each tag for a given text and outputs the tag with the highest value.

How does the Naive Bayes algorithm work?

Let's take an example: classifying a movie review as positive or negative.

Training dataset:

Text                                                                 Review
"I liked the movie"                                                  positive
"It's a good movie. Nice story"                                      positive
"Nice songs. But sadly boring ending."                               negative
"Hero's acting is bad but heroine looks good. Overall nice movie"    positive
"Sad, boring movie"                                                  negative

We want to classify whether the text "overall liked the movie" is a positive review or a negative review. We have to calculate
P(positive | overall liked the movie): the probability that the tag of the sentence is positive, given the sentence "overall liked the movie".
P(negative | overall liked the movie): the probability that the tag of the sentence is negative, given the sentence "overall liked the movie".

Before that, we first apply stop-word removal and stemming to the text.

Removing stop words: these are common words that add nothing to the classification, such as "the", "a", "is", and so on.

Stemming: reducing a word to its root form (for example, "liked" becomes "like").
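As a rough sketch of these two steps (assuming NLTK's English stop-word list and the Porter stemmer, the same tools used in the implementation at the end of this article):

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def clean(text):
    # lowercase, split into words, drop stop words, then stem what is left
    words = text.lower().split()
    return ' '.join(ps.stem(w) for w in words if w.isalpha() and w not in stop_words)

print(clean('I liked the movie'))  # -> 'like movi'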

Now, after applying these two methods, our text becomes

Text                                                     Review
"ilikedthemovi"                                          positive
"itsagoodmovienicestori"                                 positive
"nicesongsbutsadlyboringend"                             negative
"herosactingisbadbutheroinelooksgoodoverallnicemovi"     positive
"sadboringmovi"                                          negative

Feature extraction:
The important part is finding the features of the data that make machine learning algorithms work. In this case we have text, and we need to convert it into numbers we can compute with. We use word frequencies: each document is treated as the set of words it contains, and our features are the counts of each of these words.
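A minimal sketch of this bag-of-words conversion, using scikit-learn's CountVectorizer (the same class used in the implementation below); the two example sentences are purely illustrative:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I liked the movie", "Sad, boring movie"]

cv = CountVectorizer()
X = cv.fit_transform(docs).toarray()

# vocabulary learned from the documents (recent scikit-learn versions),
# e.g. ['boring' 'liked' 'movie' 'sad' 'the']
print(cv.get_feature_names_out())
# per-document word counts
print(X)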

In our case, we compute P(positive | overall liked the movie) using Bayes' theorem:

P(positive | overall liked the movie) = P(overall liked the movie | positive) * P(positive) / P(overall liked the movie)

Since our classifier only has to decide which tag has the higher probability, we can drop the divisor, which is the same for both tags, and compare

P(overall liked the movie | positive) * P(positive) with P(overall liked the movie | negative) * P(negative)

However, there is a problem: "overall liked the movie" does not appear in our training dataset, so that probability is zero. This is where the "naive" assumption comes in: we assume that every word in a sentence is independent of the others, which means we can now look at individual words.

We can write this as:

P(overall liked the movie) = P(overall) * P(liked) * P(the) * P(movie)

The next step is to apply the same assumption conditioned on each tag:

P(overall liked the movie | positive) = P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive)

And these individual words do appear in our training data, so we can compute their probabilities!

Calculating probabilities:

First we calculate the prior probability of each tag: for a sentence taken from our training data, the probability of it being positive, P(positive), is 3/5, and P(negative) is 2/5.

Then, calculating P(overall | positive) means counting how many times the word "overall" occurs in positive texts (1) and dividing by the total number of words in the positive texts (17). Therefore, P(overall | positive) = 1/17, P(liked | positive) = 1/17, P(the | positive) = 2/17, P(movie | positive) = 3/17.

If a probability turns out to be zero (for example, "overall" never occurs in a negative text), we use Laplace smoothing: we add 1 to every count so that it is never zero. To balance this out, we add the number of possible words to the divisor, so the result never exceeds 1. In our case, the total number of possible words is 21.

Applying Laplace smoothing, the results are:

Word       P(word | positive)     P(word | negative)
overall    (1 + 1) / (17 + 21)    (0 + 1) / (7 + 21)
liked      (1 + 1) / (17 + 21)    (0 + 1) / (7 + 21)
the        (2 + 1) / (17 + 21)    (0 + 1) / (7 + 21)
movie      (3 + 1) / (17 + 21)    (1 + 1) / (7 + 21)

Now we just multiply all the probabilities and see which result is larger:

P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive) * P(positive) = 1.38 * 10^-5 = 0.0000138

P(overall | negative) * P(liked | negative) * P(the | negative) * P(movie | negative) * P(negative) = 0.13 * 10^-5 = 0.0000013

Our classifier gives "overall liked the movie" the positive tag.
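As a sanity check, here is a minimal sketch that reproduces the hand calculation above (the word counts, totals, and priors are taken directly from the tables in this section):

# word counts taken from the worked example above
positive_counts = {"overall": 1, "liked": 1, "the": 2, "movie": 3}
negative_counts = {"overall": 0, "liked": 0, "the": 0, "movie": 1}

total_positive_words = 17   # total words in positive reviews (from the example)
total_negative_words = 7    # total words in negative reviews (from the example)
vocabulary_size = 21        # total number of possible words

prior_positive = 3 / 5
prior_negative = 2 / 5

def score(counts, total_words, prior):
    # Laplace-smoothed likelihood of each word, multiplied by the class prior
    result = prior
    for word in ["overall", "liked", "the", "movie"]:
        result *= (counts[word] + 1) / (total_words + vocabulary_size)
    return result

print(score(positive_counts, total_positive_words, prior_positive))  # ~1.38e-05
print(score(negative_counts, total_negative_words, prior_negative))  # ~1.30e-06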

Below is the implementation:

# text cleaning
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

dataset = [["I liked the movie", "positive"],
           ["It's a good movie. Nice story", "positive"],
           ["Hero's acting is bad but heroine looks good. Overall nice movie", "positive"],
           ["Nice songs. But sadly boring ending.", "negative"],
           ["sad movie, boring movie", "negative"]]

dataset = pd.DataFrame(dataset)
dataset.columns = ["Text", "Reviews"]

nltk.download('stopwords')

corpus = []

for i in range(0, 5):
    # keep letters only, lowercase, and split into words
    text = re.sub('[^a-zA-Z]', ' ', dataset['Text'][i])
    text = text.lower()
    text = text.split()
    # remove stop words and stem the remaining words
    ps = PorterStemmer()
    text = [ps.stem(word) for word in text
            if word not in set(stopwords.words('english'))]
    text = ' '.join(text)
    corpus.append(text)

# creating the bag-of-words model
cv = CountVectorizer(max_features=1500)

X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

# splitting the dataset into training and test sets
# (train_test_split lives in sklearn.model_selection in current scikit-learn)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)



# fitting Naive Bayes to the training set
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

classifier = GaussianNB()
classifier.fit(X_train, y_train)

# predicting the test set results
y_pred = classifier.predict(X_test)

# creating the confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm
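Note that the snippet above uses GaussianNB. Since the features here are word counts, the multinomial variant named in the title is usually the more natural fit; a minimal sketch of the swap, assuming the same X_train, y_train, and X_test produced above:

from sklearn.naive_bayes import MultinomialNB

# multinomial Naive Bayes models word-count features directly
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)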