ML | Implementing the KNN classifier using Sklearn



K-Nearest Neighbors (KNN) is one of the most basic yet important classification algorithms in machine learning. It belongs to the supervised learning domain and finds wide application in pattern recognition, data mining, and intrusion detection. It is widely applicable in real-world scenarios because it is non-parametric, meaning it makes no underlying assumptions about the distribution of the data (unlike algorithms such as GMM, which assume a Gaussian distribution of the given data).
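At its core, KNN simply stores the training data and classifies a new point by a majority vote among its k nearest training points. As a quick illustration of that idea, here is a toy sketch (not the Scikit-Learn implementation used in this article):

import numpy as np
from collections import Counter

def knn_predict(points, labels, query, k=3):
    # Euclidean distance from the query to every stored point
    dists = np.linalg.norm(points - query, axis=1)
    # Labels of the k closest points; majority vote decides the class
    nearest = labels[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

points = np.array([[0, 0], [1, 1], [5, 5], [6, 5]])
labels = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(points, labels, np.array([0.5, 0.5]), k=3))  # prints 'A'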

This article will demonstrate how to implement the K-Nearest Neighbors classifier algorithm using Python's Scikit-Learn (sklearn) library.

Step 1: Import required libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Read the dataset


cd C:\Users\Dev\Desktop\Kaggle\Breast_Cancer
# Change this path to the folder containing the dataset
# (the `cd` command works in IPython/Jupyter; in a script, pass the full path to read_csv instead)

df = pd.read_csv('data.csv')

y = df['diagnosis']
X = df.drop('diagnosis', axis=1)
X = X.drop('Unnamed: 32', axis=1)
X = X.drop('id', axis=1)
# Separate the dependent and independent variables

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
# Split the data into training and testing sets
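Since the benign and malignant classes may be imbalanced, it can help to stratify the split so that both sets keep the same class proportions. This is an optional tweak, not part of the original walkthrough:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
# stratify=y preserves the class ratio in both the training and test sets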

Step 3: Train the model

K = []
training = []
test = []
scores = {}

for k in range(2, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)

    training_score = clf.score(X_train, y_train)
    test_score = clf.score(X_test, y_test)
    K.append(k)
    training.append(training_score)
    test.append(test_score)
    scores[k] = [training_score, test_score]
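A single train/test split can make the score for each k somewhat noisy. As an optional alternative to the loop above, the same sweep can be done with k-fold cross-validation, which averages the accuracy over several splits. A minimal sketch using Scikit-Learn's cross_val_score:

from sklearn.model_selection import cross_val_score

cv_scores = {}
for k in range(2, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy across 5 folds for each candidate k
    cv_scores[k] = cross_val_score(clf, X, y, cv=5).mean()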

Step 4: Model Evaluation

for keys, values in scores.items():
    print(keys, ':', values)

Now let's try to find the optimal value of k, i.e., the number of nearest neighbors.
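One way to pick it programmatically from the scores dictionary built above (a small addition to the article's approach) is to take the k with the highest test score:

# scores[k] holds [training_score, test_score], so index 1 is the test score
best_k = max(scores, key=lambda k: scores[k][1])
print('Best k by test score:', best_k, scores[best_k])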

Step 5: Plot the training and test scores

ax = sns.stripplot(x=K, y=training)
ax.set(xlabel='values of k', ylabel='Training Score')
plt.show()
# Display the training-score plot

ax = sns.stripplot(x=K, y=test)
ax.set(xlabel='values of k', ylabel='Test Score')
plt.show()

plt.scatter(K, training, color='k')
plt.scatter(K, test, color='g')
plt.show()
# Overlap the training and test scores in one scatter plot
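If the trend is easier to read as lines, the same data can also be drawn with plt.plot (a small variation on the article's scatter plots):

plt.plot(K, training, color='k', label='Training Score')
plt.plot(K, test, color='g', label='Test Score')
plt.xlabel('values of k')
plt.ylabel('Score')
plt.legend()
plt.show()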


From the above scatter plot, we can conclude that the optimal value of k is around 5.
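With a value chosen, the final classifier can be refit and used for predictions. A minimal sketch, assuming k = 5 based on the plot above:

# Refit with the chosen number of neighbors
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict diagnoses for the held-out test set
y_pred = clf.predict(X_test)
print('Test accuracy:', clf.score(X_test, y_test))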