ML | Label Encoding datasets in Python

Label encoding refers to converting labels to a numeric form in order to convert it to a machine-readable form. Machine learning algorithms can then better figure out how these labels should work. This is an important preprocessing step for a structured dataset in supervised learning.

Suppose we have a column in some dataset. 

After applying the label encoding, the Height column is converted to:

where 0 — label for tall, 1 — label for middle and 2 — for short stature.

We are applying the tag encoding to the iris dataset in the destination column, which is the View. Contains three species Iris-setosa, Iris-versicolor, Iris-virginica .

# Library import

import numpy as np

import pandas as pd

# Dataset Import

df = pd. read_csv ( `../../ data / Iris.csv` )


df [ `species` ]. unique ()


 array ([`Iris-setosa`,` Iris-versicolor`, `Iris-virginica`], dtype = object)  

After applying Label Encoding —

# Import the label encoder

from sklearn import preprocessing

# label_encoder object knows how to understand word labels.

label_encoder = preprocessing.LabelEncoder ()

# Encode labels in the views column.

df [ `species` ] = label_encoder.fit_transform (df [ `species` ])


df [ `species` ]. unique ()


 array ([0, 1, 2], dtype = int64)  

Label constraint Encoding
An encoding label converts data into machine-readable form, but assigns a unique number (starting at 0) to each data class. This can lead to the formation of a priority problem when training datasets. A high value label is considered to have higher priority than a lower value label.


Attribute having output classes Mexico , Paris , Dubai . On the Coding label of this column, let mexico be replaced with 0 , Paris replaced with 1, and Dubai is replaced by 2.
It can be interpreted that Dubai has a higher priority when training the model than Mexico and Paris , but in fact there is no such priority relationship between these cities.