ML | One hot encoding of datasets in Python


Datasets sometimes contain columns whose numbers carry no order of preference. This typically happens when the values in a column denote categories, or when the column has been label encoded. A machine learning model can misread such numbers as ordered quantities; to avoid this, the column should be one-hot encoded.




One hot encoding:

One-hot encoding splits a column of numeric categorical data into multiple columns, one for each category present in that column. Each new column contains "1" in the rows that belong to its category and "0" in all other rows.

For example:

Consider the following data, which lists fruits with their categorical values and prices.

Fruit     Categorical value of fruit     Price
apple     1                              5
mango     2                              10
apple     1                              15
orange    3                              20

After one-hot encoding, the data looks as follows:

apple    mango    orange    price
1        0        0         5
0        1        0         10
1        0        0         15
0        0        1         20
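
The fruit table above can be reproduced in a few lines. This is a minimal sketch that uses `pandas.get_dummies` rather than the scikit-learn encoder shown in the examples; the column names `fruit` and `price` are labels chosen here for illustration.

```python
import pandas as pd

# The fruit table from the example above
fruits = pd.DataFrame({
    "fruit": ["apple", "mango", "apple", "orange"],
    "price": [5, 10, 15, 20],
})

# One indicator column per distinct fruit; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(fruits["fruit"], dtype=int)

# Put the indicator columns next to the price column
encoded = pd.concat([one_hot, fruits["price"]], axis=1)
print(encoded)
```

The indicator columns come out in sorted order of the category names, which here matches the apple, mango, orange order of the table.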

The examples below show the implementation in Python:

Example 1:

The following example uses data on customer zones and credit ratings. The zone is a categorical value that should be one-hot encoded.

# Program to demonstrate one-hot encoding

# import libraries
import numpy as np
import pandas as pd

# load the required data
data = pd.read_csv(r"../../onehotenc_data.csv")
print(data)

Output:

To one-hot encode the zone column:

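# import the one-hot encoder and the column transformer from sklearn
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode only the first column (index 0) and keep the other columns as they are;
# the older OneHotEncoder(categorical_features=[0]) form was removed in scikit-learn 0.22
transformer = ColumnTransformer([("onehot", OneHotEncoder(), [0])],
                                remainder="passthrough", sparse_threshold=0)
data = transformer.fit_transform(data)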

Output:

The output contains 5 columns, one column for price, and the remaining 4 columns represent 4 zones.
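
Encoding a single column while leaving the rest untouched can be tried without the CSV file. The following is a minimal sketch for current scikit-learn releases, where the `categorical_features` argument is no longer available and `ColumnTransformer` is used in its place. The `sample` DataFrame and its column names are hypothetical stand-ins for the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for onehotenc_data.csv: a numeric zone code and one numeric value
sample = pd.DataFrame({
    "zone": [1, 2, 3, 4, 1],
    "value": [5, 10, 15, 20, 25],
})

# Encode column 0 only; remainder="passthrough" keeps the other column unchanged.
# sparse_threshold=0 asks for a dense NumPy array as the result.
transformer = ColumnTransformer(
    [("zone_onehot", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = transformer.fit_transform(sample)
print(encoded.shape)  # (5, 5): four zone columns followed by the value column
```

The one-hot columns are placed first and the passed-through column last, so the column order of the result differs from the order in the input.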

Example 2:

In scikit-learn versions before 0.20, the one-hot encoder accepts only numeric categorical values, so any string value must be label encoded before it is one-hot encoded; later versions accept string categories directly.
The example below contains customers' geography and gender data, which is label encoded first.

# import libraries
import numpy as np
import pandas as pd

# after importing the required data
print(data)

Output:

Label encoding the data:

# label encode the data
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# fit_transform refits the encoder on each call, so one object can be reused per column
data['Gender'] = le.fit_transform(data['Gender'])
data['Geography'] = le.fit_transform(data['Geography'])

Output:

One-hot encoding the Gender and Geography columns:

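# import the one-hot encoder from sklearn
from sklearn.preprocessing import OneHotEncoder

# create a one-hot encoder object with default settings;
# every column of the data is one-hot encoded
onehotencoder = OneHotEncoder()

# the encoder returns a sparse matrix; toarray() converts it to a dense array
data = onehotencoder.fit_transform(data).toarray()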

Output:

The output contains 5 columns: 2 columns represent the genders, male and female, and the remaining 3 columns represent the countries France, Germany and Spain.
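
The two steps of Example 2 can be run end to end on a small table. This is a self-contained sketch: the rows in `customers` are hypothetical, and only the column names `Gender` and `Geography` come from the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample rows; only the column names come from the example above
customers = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Geography": ["France", "Spain", "Germany", "France"],
})

# Label encode each string column; one encoder per column keeps every mapping available
encoders = {}
for column in ["Gender", "Geography"]:
    encoders[column] = LabelEncoder()
    customers[column] = encoders[column].fit_transform(customers[column])

# One-hot encode both integer columns; the result is sparse, so convert it to a dense array
onehotencoder = OneHotEncoder()
encoded = onehotencoder.fit_transform(customers).toarray()
print(encoded.shape)  # (4, 5): two gender columns followed by three country columns
```

Keeping a separate `LabelEncoder` per column is a design choice made for this sketch: it lets `encoders["Geography"].classes_` be inspected later to see which integer stands for which country.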

Notes:

  1. The one-hot encoder does not accept a one-dimensional array or a pandas Series; the input must always be two-dimensional.
  2. In scikit-learn versions before 0.20, data passed to the encoder must not contain strings; later versions accept string categories directly.
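
The first note can be handled with a reshape. The sketch below assumes a hypothetical one-dimensional array of category codes named `zones` and converts it to a single-column two-dimensional array before encoding.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical one-dimensional array of numeric category codes
zones = np.array([1, 2, 3, 1])

# reshape(-1, 1) turns shape (4,) into shape (4, 1): one column, one row per sample
zones_2d = zones.reshape(-1, 1)

# The encoder now receives two-dimensional input, as it requires
encoded = OneHotEncoder().fit_transform(zones_2d).toarray()
print(encoded)
```

For a pandas column, selecting with double brackets, as in `data[["zone"]]`, returns a two-dimensional DataFrame instead of a Series and serves the same purpose.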



