K Nearest Neighbors with Python | ML

The K-Nearest Neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.

The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some math we might have learned as children: calculating the distance between points on a graph. There are other ways to calculate distance, and one of them may be preferable depending on the problem we are solving. However, straight-line distance (also called Euclidean distance) is a popular and familiar choice.
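
For intuition, here is a minimal sketch (using NumPy and two made-up points; the values are purely illustrative) of how the Euclidean distance between two points is computed:

import numpy as np

# Two hypothetical points in a 2-D feature space
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean (straight-line) distance: square root of the sum of squared differences
distance = np.sqrt(np.sum((p - q) ** 2))
print(distance)   # 5.0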

KNN is widely used in real-world scenarios because it is non-parametric, meaning it makes no underlying assumptions about the distribution of the data (unlike other algorithms, such as GMM, which assume a Gaussian distribution of the data).

This article demonstrates K-Nearest Neighbors on a sample of random data using the sklearn library.

Libraries used: NumPy, Pandas, sklearn

We have been given a random dataset with several features and one column of target classes. We will try to use KNN to create a model that directly predicts the class of a new data point based on its features.
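
To make the idea concrete before we switch to sklearn, here is a minimal from-scratch sketch of KNN classification; the toy points, labels, and the helper name knn_predict are made up for illustration and are not part of the dataset used below:

from collections import Counter
import numpy as np

def knn_predict(X, y, x_new, k=3):
    # Euclidean distance from the new point to every known point
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))
    # Labels of the k closest points, then a majority vote
    nearest = y[np.argsort(distances)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy data: two clusters labelled 0 and 1
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([7, 7]), k=3))   # -> 1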

Importing libraries:

Let's first take a look at our data, which has several features.

Get data:

Set index_col=0 to use the first column as the index.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("Data", index_col=0)
df.head()

Output:

Standardize the variables:
Since the KNN classifier predicts the class of a given test observation by identifying the observations closest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between observations, and hence on the KNN classifier, than variables that are on a small scale.
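
As a quick, hypothetical illustration of why this matters (the numbers below are made up), notice how a feature measured on a much larger scale dominates the Euclidean distance until both features are standardized:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 1 ranges roughly 0-1, feature 2 roughly 100-900
X = np.array([[0.1, 100.0],
              [0.9, 110.0],
              [0.2, 900.0]])

# Raw distances: feature 2 dominates, feature 1 is effectively ignored
print(np.linalg.norm(X[0] - X[1]))   # ~10, driven almost entirely by feature 2
print(np.linalg.norm(X[0] - X[2]))   # ~800

# After standardization both features contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))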

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(df.drop('TARGET CLASS', axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS', axis=1))

df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])
df_feat.head()

Output:

Split the data into training and test sets, and use the KNN model from the sklearn library:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    scaled_features, df['TARGET CLASS'], test_size=0.30)

# Remember that we are trying to build a model
# that predicts whether someone belongs to
# TARGET CLASS or not.
# Let's start with k = 1.

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

# Predictions and evaluations
# Let's evaluate our KNN model!
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

Output:

[[133  16]
 [ 15 136]]

              precision    recall  f1-score   support

           0       0.90      0.89      0.90       149
           1       0.89      0.90      0.90       151

    accuracy                           0.90       300
   macro avg       0.90      0.90      0.90       300
weighted avg       0.90      0.90      0.90       300

Selecting a K value:

Let's go ahead and use the elbow method to pick a good K value.

error_rate = []

# Will take some time
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue',
         linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)

plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

Output:

Here we can see that after about K > 15 the error rate just hovers around 0.07-0.08. Let's retrain the model with this value and check the classification report.

# FIRST, A QUICK COMPARISON WITH OUR ORIGINAL K = 1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print('WITH K = 1')
print('')
print(confusion_matrix(y_test, pred))
print('')
print(classification_report(y_test, pred))


# NOW WITH K = 15
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print('WITH K = 15')
print('')
print(confusion_matrix(y_test, pred))
print('')
print(classification_report(y_test, pred))

Output:

WITH K = 1

[[133  16]
 [ 15 136]]

              precision    recall  f1-score   support

           0       0.90      0.89      0.90       149
           1       0.89      0.90      0.90       151

    accuracy                           0.90       300
   macro avg       0.90      0.90      0.90       300
weighted avg       0.90      0.90      0.90       300


WITH K = 15

[[133  16]
 [  6 145]]

              precision    recall  f1-score   support

           0       0.96      0.89      0.92       149
           1       0.90      0.96      0.93       151

    accuracy                           0.93       300
   macro avg       0.93      0.93      0.93       300
weighted avg       0.93      0.93      0.93       300

Great! We were able to improve the performance of our model by tuning it to a better K value.