
Python k-nearest neighbor algorithm


Supervised learning falls into two categories:

  1. Classification: here the target variable consists of categories.
  2. Regression: here the target variable is continuous, and we usually try to fit a line or curve to the data.
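To make the difference between the two kinds of target concrete, here is a small sketch using two datasets bundled with scikit-learn. The choice of the iris and diabetes datasets is ours, purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_iris

# Classification: the target takes a small set of discrete class labels.
iris_target = load_iris().target
print(np.unique(iris_target))   # the distinct class labels

# Regression: the target is a continuous quantity.
diabetes_target = load_diabetes().target
print(diabetes_target[:5])      # first few continuous target values
```

The iris target has only three distinct values (one per species), while the diabetes target is a floating-point measurement.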

Supervised learning requires labeled data. How can we get it? There are several ways:

  1. Historical data that is already labeled.
  2. Experiments: we can run experiments, such as A/B tests, to create labeled data.
  3. Crowdsourcing.

Now it is time to look at algorithms that can be used to solve supervised machine learning problems. In this post, we will use the popular scikit-learn package.

Note: A few other packages, such as TensorFlow and Keras, can also be used for supervised learning.

K-Nearest Neighbor Algorithm:

This algorithm is used to solve classification problems. The k-nearest neighbor (k-NN) algorithm effectively creates an imaginary boundary between the classes. When a new data point arrives, the algorithm predicts its class from the k closest training points, usually by majority vote.

Therefore, a larger k value produces smoother decision boundaries and a less complex model, whereas a smaller k value tends to overfit the data and produces a more complex model.

Note: It is very important to choose a suitable k value for the dataset in order to avoid both overfitting and underfitting.
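The idea behind the algorithm can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `knn_predict`, the use of Euclidean distance, and the toy data are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3))  # 1
```

Each query point is assigned the label shared by most of its three nearest neighbors, which here all lie in the same cluster as the query.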

Using the k-nearest neighbor algorithm, we fit the model on historical data (that is, train it) and then predict labels for new data.

Example of k-nearest neighbor algorithm

# Import required modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a k-NN classifier with 7 neighbors and fit it to the training data
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Predict labels for data the model has not seen before
print(knn.predict(X_test))

The above example takes the following steps:

  1. Import the k-nearest neighbor classifier from the scikit-learn package.
  2. Create the feature and target variables.
  3. Split the data into training and test sets.
  4. Create a k-NN model with a chosen number of neighbors (n_neighbors).
  5. Train (fit) the model on the training data.
  6. Predict labels for the unseen test data.
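The same steps also work for a single new observation. The sketch below repeats the setup so that it runs on its own, then classifies one hypothetical flower measurement (the values in `new_flower` are made up for illustration) and maps the predicted class index back to a species name via `target_names`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Hypothetical measurement, in cm:
# sepal length, sepal width, petal length, petal width
new_flower = [[5.1, 3.5, 1.4, 0.2]]

predicted_class = knn.predict(new_flower)[0]   # integer class index
print(iris.target_names[predicted_class])      # human-readable species name
```

Note that `predict` expects a 2-D input (one row per sample), which is why the single measurement is wrapped in an outer list.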

We have seen how to use the k-NN algorithm to solve supervised machine learning problems. But how do you measure the accuracy of a model?

Consider the example below, where we measure the accuracy of the model above:

# Import required modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a k-NN classifier with 7 neighbors and fit it to the training data
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Calculate the model's accuracy on the test set
print(knn.score(X_test, y_test))

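The printed score is the model's accuracy: the fraction of test samples whose predicted label matches the true label.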
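For a classifier, `score` reports mean accuracy, so it should agree with a manual count of correct predictions and with `sklearn.metrics.accuracy_score`. The following sketch checks this on the same split; the variable names are ours.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
y_pred = knn.predict(X_test)

manual_accuracy = np.mean(y_pred == y_test)        # fraction of correct predictions
metric_accuracy = accuracy_score(y_test, y_pred)   # same quantity via sklearn.metrics
model_score = knn.score(X_test, y_test)            # same quantity via the estimator

print(manual_accuracy, metric_accuracy, model_score)
```

All three printed values should be identical.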
So far so good. But how do you choose the right k value for a dataset? Familiarity with the data helps narrow down a range of plausible k values, but to pick a specific value we need to evaluate the model for each candidate k. See the example below.

# Import required modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

# Load the data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Candidate k values (1 to 8) and arrays to hold the accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over k values
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    # Compute accuracy on the training and test sets
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Create the plot
plt.plot(neighbors, test_accuracy, label='Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label='Training dataset Accuracy')

plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()

Output: a plot of testing and training accuracy against n_neighbors.

In the example above, we plot accuracy against k to see which k values give high accuracy.

Note: This is not how the value of n_neighbors is usually chosen in practice, because picking k from test-set accuracy lets information from the test set influence the model. Instead, we tune the hyperparameter and select the value that gives the best performance. We will cover this in future posts.
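As a preview, one common way to tune n_neighbors is a cross-validated grid search over the training set. The sketch below uses scikit-learn's `GridSearchCV`; the candidate range and the choice of 5 folds are assumptions made for illustration, not recommendations from this post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Candidate k values (same range as the plot above; an arbitrary choice)
param_grid = {"n_neighbors": list(range(1, 9))}

# 5-fold cross-validation on the training set only; the test set stays untouched
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)             # k with the best mean cross-validated accuracy
print(search.score(X_test, y_test))    # accuracy of the refitted best model on the test set
```

Because k is selected using only the training folds, the final test-set score remains an honest estimate of performance on unseen data.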

Summary
In this post, we covered what supervised learning is and what its categories are. With that basic understanding in place, we examined the k-nearest neighbor algorithm, which is used to solve supervised machine learning problems. We also looked at how to measure the accuracy of the model.
