K-Nearest Neighbor Algorithm in Python



Supervised Learning falls into two categories:

  1. Classification : here our target variable consists of categories.
  2. Regression : here our target variable is continuous, and we typically try to find the line (or curve) of best fit (see the sketch after this list).
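
To make the distinction concrete, here is a minimal sketch using scikit-learn's neighbor-based estimators for both categories (the toy arrays below are made up purely for illustration):

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy data, made up for illustration: two numeric features per sample
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]

# Classification: the target is a category
y_class = [0, 0, 1, 1]
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict([[1.2, 1.9]]))  # a class label, e.g. [0]

# Regression: the target is a continuous value
y_reg = [1.1, 1.3, 5.2, 6.0]
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
print(reg.predict([[1.2, 1.9]]))  # a continuous estimate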

We have seen that supervised learning requires labeled data. How can we get labeled data? There are various ways:

  1. Historically labeled data
  2. Experiments: we can conduct experiments, such as A/B tests, to generate labeled data.
  3. Crowdsourcing

Now is the time to understand the algorithms that can be used to solve supervised machine learning problems. In this post, we will be using the popular scikit-learn library.

Note: there are a few other packages as well, such as TensorFlow and Keras, that can be used for supervised learning.

K-Nearest Neighbor Algorithm:

This algorithm is used to solve classification problems. The k-nearest neighbor (k-NN) algorithm in effect draws a decision boundary that separates the classes. When a new data point arrives, the algorithm predicts its class from the k training points closest to it.

Therefore, a larger k value means smoother decision boundaries, resulting in a less complex model, whereas a smaller k value tends to overfit the data and lead to a more complex model.

Note: it is very important to choose the right k value when analyzing a dataset, to avoid overfitting and underfitting.
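
To make the prediction rule concrete, here is a minimal from-scratch sketch of k-NN (the helper name knn_predict and the toy data are our own, for illustration only):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data, made up for illustration
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # prints 0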

Using the k-nearest neighbor algorithm, we fit the model to historical data (i.e., train it) and then predict labels for new data.

Example of k-nearest neighbor algorithm

# Import required modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Loading data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(X_train, y_train)

# Predict on a dataset that the model has not seen before
print(knn.predict(X_test))
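
The printed predictions are class indices (0, 1, or 2). load_iris also exposes the matching species names via irisData.target_names, so, for example, irisData.target_names[knn.predict(X_test)] would print the predicted species names instead.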

The above example takes the following steps:

  1. The k-nearest neighbor algorithm is imported from the scikit-learn package.
  2. Create feature and target variables.
  3. Split the data into training and test sets.
  4. Create a k-NN model with the chosen number of neighbors.
  5. Train (fit) the model on the training data.
  6. Predict labels for unseen data.

We have seen how we can use the k-NN algorithm to solve supervised machine learning problems. But how do we measure the accuracy of the model?

Consider the example below, where we evaluate the performance of the above model:

# Import required modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Loading data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(X_train, y_train)

# Calculate the accuracy of the model
print(knn.score(X_test, y_test))
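
Here, score returns the mean accuracy on the test set, that is, the fraction of test samples whose predicted label matches the true label.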

Model accuracy:
So far, so good. But how do you determine the right k value for a dataset? Obviously, we need to be familiar with the data to get a range of plausible k values, but to find the exact k value we have to test the model for each candidate k value. Refer to the example shown below.

# Import required modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

# Loading data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over k values
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    # Compute accuracy on the training and test sets
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.plot(neighbors, test_accuracy, label='Testing dataset accuracy')
plt.plot(neighbors, train_accuracy, label='Training dataset accuracy')

plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()

Output:

Here, in the example shown above, we create a graph to see which k values give high accuracy on the test set.

Note: this is not the method used in industry to select the correct value of n_neighbors. Instead, we perform hyperparameter tuning to select the value that provides the best performance. We will cover this in detail in future posts, but a minimal sketch follows.
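
As a preview, here is a minimal sketch of what such tuning might look like with scikit-learn's GridSearchCV (the candidate range of k values below is an assumption chosen for illustration):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

irisData = load_iris()
X, y = irisData.data, irisData.target

# Search over candidate k values with 5-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 9))}  # candidate range is an assumption
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # the k value with the best cross-validated accuracy
print(grid.best_score_)   # the corresponding mean cross-validation score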

Summary —
In this post, we covered what supervised learning is and what its categories are. With a basic understanding of supervised learning, we examined the k-nearest neighbor algorithm, which is used to solve supervised machine learning problems. We also explored how to measure the accuracy of the model.