Naive Bayesian classifiers

Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. This is not a single algorithm but a family of algorithms that all share a common principle: every pair of features being classified is independent of every other.

First, let's look at a dataset.

Consider a fictional dataset that describes the weather conditions for a golf game. Given the weather conditions, each tuple classifies the conditions as good ("Yes") or not good ("No") for golf.

Here is a tabular view of our dataset.

     Outlook    Temperature  Humidity  Windy  Play Golf
 0   Rainy      Hot          High      False  No
 1   Rainy      Hot          High      True   No
 2   Overcast   Hot          High      False  Yes
 3   Sunny      Mild         High      False  Yes
 4   Sunny      Cool         Normal    False  Yes
 5   Sunny      Cool         Normal    True   No
 6   Overcast   Cool         Normal    True   Yes
 7   Rainy      Mild         High      False  No
 8   Rainy      Cool         Normal    False  Yes
 9   Sunny      Mild         Normal    False  Yes
10   Rainy      Mild         Normal    True   Yes
11   Overcast   Mild         High      True   Yes
12   Overcast   Hot          Normal    False  Yes
13   Sunny      Mild         High      True   No

The dataset is split into two parts, namely the feature matrix and response vector .

  • The feature matrix contains all the vectors (rows) of the dataset, where each vector consists of the values of the dependent features. In the above dataset, the features are: Outlook, Temperature, Humidity, and Windy.
  • The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix. In the above dataset, the class variable is named Play Golf.


The fundamental naive Bayes assumption is that each feature makes an:

  • independent
  • equal

contribution to the outcome.

In relation to our dataset, this concept can be understood as:

  • We assume that no pair of features is dependent. For example, the temperature being "Hot" has nothing to do with the humidity, and the outlook being "Rainy" has no effect on the winds. Hence, the features are assumed to be independent.
  • Secondly, each feature is given the same weight (or importance). For example, knowing only the temperature and the humidity cannot predict the outcome accurately. None of the attributes is irrelevant, and all are assumed to contribute equally to the outcome.

Note: the assumptions made by naive Bayes are generally not true in real-world situations. The independence assumption is, in fact, never correct, but it often works well in practice.

Now, before moving on to the formula for naive Bayes, it is important to know about Bayes' theorem.

Bayes' Theorem

Bayes' theorem finds the probability of an event occurring given the probability of another event that has already occurred. Mathematically, it is stated as the following equation:

  P(A | B) = P(B | A) P(A) / P(B)

where A and B are events and P(B) ≠ 0.

  • Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
  • P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, event B).
  • P(B | A) is the likelihood: the probability of the evidence given that event A holds.
  • P(A | B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
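As a quick numeric sanity check, the theorem can be evaluated directly. The probabilities below are made-up illustrative values, not drawn from the golf dataset:

```python
# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
def posterior(p_b_given_a, p_a, p_b):
    """Return the posterior P(A | B) from the likelihood, prior, and evidence."""
    return p_b_given_a * p_a / p_b

# hypothetical numbers purely for illustration:
# likelihood P(B | A) = 0.8, prior P(A) = 0.3, evidence P(B) = 0.5
print(posterior(0.8, 0.3, 0.5))  # 0.48
```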

Now, for our dataset, we can apply Bayes' theorem in the following way:

  P(y | X) = P(X | y) P(y) / P(X)

where y is the class variable and X is a dependent feature vector (of size n), where:

  X = (x_1, x_2, x_3, ..., x_n)

Just to clarify, an example of a feature vector and its corresponding class variable would be: (see the 1st row of the dataset)

  X = (Rainy, Hot, High, False) y = No 

So basically, P(y | X) here means the probability of "not playing golf" given that the weather conditions are "rainy outlook", "temperature is hot", "high humidity" and "no wind".

Naive assumption

Now it is time to put the naive assumption into Bayes' theorem: independence among the features. So we now split the evidence into its independent parts.

Now, if any two events A and B are independent, then

  P(A, B) = P(A) P(B)

Hence, we reach the result:

  P(y | x_1, ..., x_n) = [ P(x_1 | y) P(x_2 | y) ... P(x_n | y) P(y) ] / [ P(x_1) P(x_2) ... P(x_n) ]

which can be expressed as:

  P(y | x_1, ..., x_n) = P(y) ∏_{i=1}^{n} P(x_i | y) / [ P(x_1) P(x_2) ... P(x_n) ]

Now, since the denominator remains constant for a given input, we can remove that term:

  P(y | x_1, ..., x_n) ∝ P(y) ∏_{i=1}^{n} P(x_i | y)

Now we need to build a classifier model. For this, we find the probability of the given set of inputs for all possible values of the class variable y and pick the output with maximum probability. This can be expressed mathematically as:

  y = argmax_y P(y) ∏_{i=1}^{n} P(x_i | y)
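This decision rule can be sketched in a few lines of Python. The priors and per-feature likelihoods below are hypothetical toy numbers for a two-feature input, just to show the argmax over classes:

```python
import math

# Minimal sketch of the rule y_hat = argmax_y P(y) * prod_i P(x_i | y).
# The priors and likelihoods here are hypothetical toy numbers.
priors = {"Yes": 9 / 14, "No": 5 / 14}
likelihoods = {
    "Yes": [3 / 9, 6 / 9],  # P(x_1 | Yes), P(x_2 | Yes)
    "No":  [2 / 5, 1 / 5],  # P(x_1 | No),  P(x_2 | No)
}

def predict():
    # score each class by its prior times the product of its likelihoods
    scores = {y: priors[y] * math.prod(likelihoods[y]) for y in priors}
    return max(scores, key=scores.get)

print(predict())  # "Yes"
```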

So finally, we are left with the task of calculating P(y) and P(x_i | y).

Note that P(y) is also called the class probability, and P(x_i | y) is called the conditional probability.

The different naive Bayes classifiers differ mainly in the assumptions they make about the distribution of P(x_i | y).

Let's try to apply the above formula manually to our weather dataset. To do this, we need to do some preliminary calculations for our dataset.

We need to find P(x_i | y_j) for each x_i in X and y_j in y. These counts, tallied from the 14 rows of our dataset, are shown in Tables 1-4 below:

  Table 1: Outlook       Yes  No   P(·|Yes)  P(·|No)
    Overcast              4    0     4/9       0/5
    Rainy                 2    3     2/9       3/5
    Sunny                 3    2     3/9       2/5

  Table 2: Temperature   Yes  No   P(·|Yes)  P(·|No)
    Hot                   2    2     2/9       2/5
    Mild                  4    2     4/9       2/5
    Cool                  3    1     3/9       1/5

  Table 3: Humidity      Yes  No   P(·|Yes)  P(·|No)
    High                  3    4     3/9       4/5
    Normal                6    1     6/9       1/5

  Table 4: Windy         Yes  No   P(·|Yes)  P(·|No)
    True                  3    3     3/9       3/5
    False                 6    2     6/9       2/5

So, in Tables 1-4 we have computed P(x_i | y_j) for each x_i in X and y_j in y. For example, the probability of playing golf given that the temperature is cool, i.e. P(temp. = Cool | play golf = Yes) = 3/9.

We also need the class probabilities P(y), computed in Table 5. For example, P(play golf = Yes) = 9/14.

  Table 5: Play Golf     P(Yes)  P(No)
                          9/14    5/14

So now we have finished our preliminary calculations and the classifier is ready!
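As a cross-check, the same preliminary counts can be recomputed programmatically. This is a minimal sketch that tallies the likelihoods straight from the 14-row table above (the `likelihood` helper is ours, not part of any library):

```python
from collections import Counter

# The 14 rows of the golf dataset: (Outlook, Temperature, Humidity, Windy, Play Golf)
data = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]

class_counts = Counter(row[-1] for row in data)  # {"Yes": 9, "No": 5}

def likelihood(feature_index, value, label):
    """P(x_i = value | y = label), estimated by counting rows."""
    matches = sum(1 for row in data
                  if row[feature_index] == value and row[-1] == label)
    return matches / class_counts[label]

print(likelihood(1, "Cool", "Yes"))       # 3/9, as in the tables
print(class_counts["Yes"] / len(data))    # P(Yes) = 9/14
```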

Let's test it for a new set of features (let's call it today) :

  today = (Sunny, Hot, Normal, False) 

So, the probability of playing golf is given by:

  P(Yes | today) = P(Sunny | Yes) P(Hot | Yes) P(Normal | Yes) P(False | Yes) P(Yes) / P(today)

and the probability of not playing golf is given by:

  P(No | today) = P(Sunny | No) P(Hot | No) P(Normal | No) P(False | No) P(No) / P(today)

Since P(today) is common to both probabilities, we can ignore P(today) and find the proportional probabilities as:

  P(Yes | today) ∝ (3/9) · (2/9) · (6/9) · (6/9) · (9/14) ≈ 0.0212
  P(No | today) ∝ (2/5) · (2/5) · (1/5) · (2/5) · (5/14) ≈ 0.0046

These numbers can be converted into probabilities by making their sum equal to 1 (normalization):

  P(Yes | today) = 0.0212 / (0.0212 + 0.0046) ≈ 0.82
  P(No | today) = 0.0046 / (0.0212 + 0.0046) ≈ 0.18

Since P(Yes | today) > P(No | today), the prediction is that golf would be played.
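The arithmetic can be verified in a couple of lines, plugging in the fractions read off the likelihood tables:

```python
# Proportional scores for today = (Sunny, Hot, Normal, False),
# using the per-class likelihoods and priors counted from the dataset.
p_yes = (3 / 9) * (2 / 9) * (6 / 9) * (6 / 9) * (9 / 14)  # ∝ P(Yes | today)
p_no = (2 / 5) * (2 / 5) * (1 / 5) * (2 / 5) * (5 / 14)   # ∝ P(No | today)

total = p_yes + p_no
print(round(p_yes / total, 2), round(p_no / total, 2))  # 0.82 0.18
```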

The method discussed above is applicable to discrete data. In the case of continuous data, we need to make some assumptions about the distribution of values of each feature. The different naive Bayes classifiers differ mainly in the assumptions they make about the distribution of P(x_i | y).

We now discuss one such classifier here.

Gaussian Naive Bayesian Classifier

In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to follow a Gaussian distribution. A Gaussian distribution is also called a normal distribution. When plotted, it gives a bell-shaped curve that is symmetric about the mean of the feature values, as shown below:

The likelihood of the features is assumed to be Gaussian, hence the conditional probability is given by:

  P(x_i | y) = (1 / √(2π σ_y²)) exp( −(x_i − μ_y)² / (2 σ_y²) )

where μ_y and σ_y² are the mean and variance of the feature estimated from the training samples of class y.
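This density is straightforward to code up. A small sketch (the helper name `gaussian_pdf` is ours):

```python
import math

# Gaussian conditional density:
# P(x_i | y) = (1 / sqrt(2*pi*sigma^2)) * exp(-(x_i - mu)^2 / (2*sigma^2)),
# where mu and sigma are estimated per class from the training data.
def gaussian_pdf(x, mu, sigma):
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

print(gaussian_pdf(0.0, 0.0, 1.0))  # ≈ 0.3989, the standard normal's peak
```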

We now look at implementing a Gaussian naive Bayesian classifier using scikit-learn.

  # load the iris dataset
  from sklearn.datasets import load_iris
  iris = load_iris()

  # store the feature matrix (X) and response vector (y)
  X = iris.data
  y = iris.target

  # split X and y into training and testing sets
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

  # train the model on the training set
  from sklearn.naive_bayes import GaussianNB
  gnb = GaussianNB()
  gnb.fit(X_train, y_train)

  # make predictions on the testing set
  y_pred = gnb.predict(X_test)

  # compare actual response values (y_test) with predicted response values (y_pred)
  from sklearn import metrics
  print("Gaussian Naive Bayes model accuracy (in %):", metrics.accuracy_score(y_test, y_pred) * 100)


 Gaussian Naive Bayes model accuracy (in%): 95.0 

Other popular naive Bayes classifiers are:

  • Multinomial Naive Bayes: feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification.
  • Bernoulli Naive Bayes: in the multivariate Bernoulli event model, the features are independent Booleans (binary variables) describing the inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term-occurrence features (i.e. whether a word occurs in the document or not) are used rather than term frequencies (i.e. how often a word occurs in the document).
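A hedged sketch of how these two variants look in scikit-learn, using tiny made-up count matrices in place of a real document corpus:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy "documents": rows are documents, columns are word counts for a
# 3-word vocabulary; the counts and labels are invented for illustration.
X_counts = np.array([[2, 1, 0], [0, 3, 1], [1, 0, 2], [0, 1, 3]])
X_binary = (X_counts > 0).astype(int)  # word present or not
y = np.array([0, 1, 0, 1])

# MultinomialNB consumes the frequencies, BernoulliNB the binary occurrences.
mnb = MultinomialNB().fit(X_counts, y)
bnb = BernoulliNB().fit(X_binary, y)
print(mnb.predict([[2, 0, 1]]), bnb.predict([[1, 0, 1]]))
```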

As we reach the end of this article, here are some important points to ponder:

  • In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require only a small amount of training data to estimate the necessary parameters.
  • Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This, in turn, helps to alleviate problems stemming from the curse of dimensionality.


This article is courtesy of Nikhil Kumar.

Please post comments if you find anything wrong or if you would like to share more information on the topic discussed above.