Datasets sometimes contain columns whose numbers have no particular order of preference. Such numbers usually denote a category, either because the raw data is a category code or because the column has been label encoded. A machine learning model can misread these numbers as ordered values or magnitudes, so to avoid this the column should be one-hot encoded.
One-hot encoding splits a column of numeric categorical data into multiple columns, one for each category present in that column. Each new column contains a "1" in the rows that belong to its category and a "0" everywhere else.
Consider the data that lists fruits and their respective categorical values and prices.
|Fruit||Categorical value of fruit||Price|
|apple||1||5|
After one-hot encoding, the data looks as follows:
Implemented in Python:
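The original listing is not available here, so the following is only a minimal from-scratch sketch of the idea, using no libraries. Only the "apple, 1, 5" row comes from the table above; the other rows, the `code` and `price` key names, and the helper `one_hot_encode` are hypothetical and exist purely for illustration.

```python
# Illustrative data: only the first row comes from the table above;
# the remaining rows are hypothetical values added to complete the example.
rows = [
    {"fruit": "apple", "code": 1, "price": 5},
    {"fruit": "mango", "code": 2, "price": 10},
    {"fruit": "orange", "code": 3, "price": 20},
    {"fruit": "apple", "code": 1, "price": 15},
]


def one_hot_encode(rows, column):
    """Replace `column` with one 0/1 column per distinct category (hypothetical helper)."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per category: 1 if the row belongs to it, else 0.
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


encoded_rows = one_hot_encode(rows, "code")
for row in encoded_rows:
    print(row)
```

Each row ends up with exactly one "1" among the `code_1`, `code_2` and `code_3` columns, so the model no longer sees the category codes as ordered numbers.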
The following example contains customer zones and credit ratings. The zone is a categorical value that needs to be one-hot encoded.
One-hot encoding the zone column:
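As a hedged sketch of this step, the snippet below uses scikit-learn's `OneHotEncoder`. The zone codes and credit rating values are hypothetical, and the variable names are assumptions rather than the original listing. The encoder returns a sparse matrix by default, so `.toarray()` converts it to a dense array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: zone is an integer-coded category (4 zones),
# credit_rating is an ordinary numeric column.
zone = np.array([[0], [1], [2], [3], [1], [0]])
credit_rating = np.array([[650], [720], [580], [700], [690], [610]])

# Learn the categories and build one 0/1 column per zone.
encoder = OneHotEncoder()
zone_encoded = encoder.fit_transform(zone).toarray()

# Put the credit rating first, followed by the four zone indicator columns.
result = np.hstack([credit_rating, zone_encoded])

print(encoder.categories_)  # the zone values found during fitting
print(result)
```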
The output contains 5 columns: one column for the credit rating, and the remaining 4 columns represent the 4 zones.
Older versions of scikit-learn's OneHotEncoder (before 0.20) accepted only numeric categorical values, so any string value had to be label encoded before one-hot encoding. Recent versions also accept string values directly.
The example below contains customer geography and gender data, which must be label encoded first.
Label encoding the data:
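A minimal sketch of this step with scikit-learn's `LabelEncoder` follows. The sample values are hypothetical and the variable names are assumptions. `LabelEncoder` assigns integer codes in sorted order of the distinct values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical customer data with string categories.
geography = ["France", "Spain", "Germany", "France", "Spain"]
gender = ["Female", "Male", "Male", "Female", "Male"]

# One encoder per column, so each keeps its own mapping.
geography_encoder = LabelEncoder()
gender_encoder = LabelEncoder()

geography_codes = geography_encoder.fit_transform(geography)
gender_codes = gender_encoder.fit_transform(gender)

# classes_ lists the original strings; a string's index is its integer code.
print(list(geography_encoder.classes_), geography_codes)
print(list(gender_encoder.classes_), gender_codes)
```

Note that the scikit-learn documentation describes `LabelEncoder` as intended for target labels. For input features, `OrdinalEncoder` plays the same role.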
One-hot encoding the gender and geography columns:
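The snippet below sketches this step with scikit-learn's `OneHotEncoder` applied to two label-encoded columns at once. The input codes are hypothetical and assume an alphabetical coding (Female=0, Male=1; France=0, Germany=1, Spain=2). The variable names are assumptions as well.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label-encoded input, assuming alphabetical coding:
# gender: Female=0, Male=1; geography: France=0, Germany=1, Spain=2.
gender_codes = [0, 1, 1, 0, 1]
geography_codes = [0, 2, 1, 0, 2]

# One row per customer, one column per feature.
X = np.column_stack([gender_codes, geography_codes])

# Each input column is expanded independently into its own 0/1 columns.
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X).toarray()

print(X_encoded)
```

The encoded columns follow the order of the input columns: the gender indicators come first, followed by the geography indicators.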
The output contains 5 columns: 2 columns represent gender (male and female), and the remaining 3 columns represent the countries France, Germany and Spain.