ML | One hot encoding of datasets in Python

Datasets sometimes contain columns whose numbers have no particular order of preference. Such a column usually denotes a category, either as raw category values or as the result of label encoding. A machine learning model can misread these numbers as ordered quantities, so the column should be one hot encoded to avoid this.

One hot encoding:

One hot encoding splits a column of numeric categorical data into multiple columns, one for each category present in that column. Each new column contains "1" in the rows that belong to its category and "0" in all other rows.

For example:

Consider data that lists fruits with their categorical values and prices.

Fruit     Categorical value of fruit     Price
apple     1                              5
mango     2                              10
apple     1                              15
orange    3                              20

The output after one hot encoding the data is as follows:

apple     mango     orange     price
1         0         0          5
0         1         0          10
1         0         0          15
0         0         1          20
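
The transformation in the table above can be sketched in plain Python, without any library. This is only an illustration of the idea, not how pandas or scikit-learn implement it; the helper name `one_hot` is made up for this example.

```python
def one_hot(values):
    """Return (categories, rows); each row has a 1 in its category's column."""
    # The sorted unique categories become the new column names.
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return categories, rows


fruits = ["apple", "mango", "apple", "orange"]
prices = [5, 10, 15, 20]

categories, rows = one_hot(fruits)

# Print the header, then each encoded row followed by its price.
print(categories + ["price"])
for row, price in zip(rows, prices):
    print(row + [price])
```

Each fruit name ends up as its own 0/1 column, and the price column is carried over unchanged.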

Below is the implementation in Python:

Example 1:

The following example shows customer zones and credit ratings. The zone is a categorical value that should be one hot encoded.

# Program to demonstrate one hot encoding

# import libraries
import numpy as np
import pandas as pd

# load the required data
data = pd.read_csv(r"../../onehotenc_data.csv")
print(data)

Output:

To one hot encode the zone column:

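# import the one hot encoder and the column transformer from sklearn
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# the categorical_features argument was removed in scikit-learn 0.22,
# so the first column (zone) is selected with a ColumnTransformer instead;
# sparse_threshold=0 makes the result a dense array
ct = ColumnTransformer([("zone", OneHotEncoder(), [0])],
                       remainder="passthrough", sparse_threshold=0)
data = ct.fit_transform(data)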

Output:

The output contains 5 columns: one column for the credit rating, and the remaining 4 columns represent the 4 zones.
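
As an alternative sketch, the same kind of result can be produced with pandas alone through `pd.get_dummies`. The CSV file is not reproduced here, so the frame below is a made-up stand-in: the column names `zone` and `credit_rating` and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for the CSV data; names and values are assumptions.
data = pd.DataFrame({
    "zone": ["north", "south", "east", "west", "north"],
    "credit_rating": [650, 720, 580, 690, 710],
})

# get_dummies replaces the "zone" column with one 0/1 column per zone
# and leaves the other columns untouched.
encoded = pd.get_dummies(data, columns=["zone"], dtype=int)
print(encoded)
```

With four distinct zones, the result has five columns: the untouched credit rating plus one indicator column per zone.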

Example 2:

In scikit-learn versions before 0.20, the one hot encoder only accepts numeric categorical values, so any string value must be label encoded before one hot encoding (later versions also accept strings directly).
The example below contains gender and geography data for customers, which is label encoded first.

# import libraries
import numpy as np
import pandas as pd

# after importing the required data
print(data)

Output:

Label encoding the data:

# label encode the string columns
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

data['Gender'] = le.fit_transform(data['Gender'])
data['Geography'] = le.fit_transform(data['Geography'])

Output:
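
To illustrate what label encoding produces, here is a small self-contained sketch. The country values are made up for this example and are not the article's data.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration.
geography = ["France", "Spain", "Germany", "France"]

le = LabelEncoder()
codes = le.fit_transform(geography)

# classes_ holds the categories in sorted order; each value is
# replaced by the index of its category in that list.
print(list(le.classes_))
print(list(codes))
```

Note that fitting the same encoder on a second column replaces `classes_`, so a separate `LabelEncoder` per column is needed if the codes should later be mapped back with `inverse_transform`.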

One hot encoding the Gender and Geography columns:

# import one hot encoder from sklearn
from sklearn.preprocessing import OneHotEncoder

# create a default one hot encoder object;
# every column of the data is one hot encoded
onehotencoder = OneHotEncoder()

data = onehotencoder.fit_transform(data).toarray()

Output:

The output contains 5 columns: 2 columns represent gender (male and female), and the remaining 3 columns represent the countries France, Germany and Spain.
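
With scikit-learn 0.20 or later, the label encoding step can be skipped, because `OneHotEncoder` accepts string columns directly. The sketch below assumes such a version and uses made-up customer rows; only the column names `Gender` and `Geography` come from the example above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up customer rows; the values are assumptions for illustration.
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Geography": ["France", "Spain", "Germany", "France"],
})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(data).toarray()

# categories_ lists the categories found in each input column,
# in the same order as the output columns.
print(encoder.categories_)
print(encoded)
```

The two gender categories and three countries again give five output columns.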

Notes:

  1. The one hot encoder does not accept a one-dimensional array or a pandas Series; the input must always be two-dimensional.
  2. In scikit-learn versions before 0.20, data passed to the encoder must not contain strings.
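
The first note can be sketched as follows: a single column has to be turned into a two-dimensional, single-column input before it is passed to the encoder. The zone codes below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A one-dimensional Series of made-up numeric zone codes.
zone = pd.Series([0, 1, 0], name="zone")

# Two ways to get a two-dimensional input with a single column.
as_frame = zone.to_frame()                 # DataFrame of shape (3, 1)
as_array = zone.to_numpy().reshape(-1, 1)  # ndarray of shape (3, 1)

encoded_from_frame = OneHotEncoder().fit_transform(as_frame).toarray()
encoded_from_array = OneHotEncoder().fit_transform(as_array).toarray()
print(encoded_from_frame)
```

Both inputs give the same encoded array; selecting the column with double brackets, as in `data[["zone"]]`, is another common way to keep the input two-dimensional.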
