
# ML | Handling Unbalanced Data with SMOTE and Near Miss Algorithm in Python


Standard ML methods such as decision trees and logistic regression have a bias toward the majority class and tend to ignore the minority class. They tend to predict only the majority class, and hence misclassify the minority class far more often. In other words, if there is an imbalanced distribution of the data in our dataset, the model becomes prone to producing very low recall for the minority class.
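This effect can be illustrated with a tiny hypothetical example: a degenerate classifier that always predicts the majority class achieves high accuracy while its recall for the minority class is zero.

```python
import numpy as np

# hypothetical toy labels: 990 majority (0), 10 minority (1)
y_true = np.array([0] * 990 + [1] * 10)

# a degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(accuracy)         # 0.99 -- looks great
print(minority_recall)  # 0.0  -- every fraud case is missed
```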

Unbalanced data handling techniques: there are mainly 2 algorithms that are widely used to handle an unbalanced class distribution.

1. SMOTE
2. Near Miss Algorithm

## SMOTE (Synthetic Minority Oversampling Technique) — Oversampling

SMOTE (Synthetic Minority Oversampling Technique) is one of the most commonly used oversampling methods to solve the imbalance problem.
It aims to balance the class distribution by increasing the number of minority class examples, not by replicating them but by synthesizing new ones.
SMOTE synthesizes new minority instances between existing minority instances. It generates virtual training records by linear interpolation for the minority class. These synthetic training records are generated by randomly selecting one or more of the k-nearest neighbors for each example in the minority class. After the oversampling process, the data is reconstructed, and several classification models can be applied to the processed data.
A deeper understanding of how the SMOTE algorithm works:

1. Step 1: Set the minority class set A. For each x ∈ A, the k-nearest neighbors of x are obtained by calculating the Euclidean distance between x and every other sample in A.
2. Step 2: The sampling rate N is set according to the imbalance ratio. For each x ∈ A, N examples (x1, x2, ..., xN) are randomly selected from its k-nearest neighbors, and they construct the set A1.
3. Step 3: For each example x_k ∈ A1 (k = 1, 2, ..., N), the following formula is used to generate a new example:

   x' = x + rand(0, 1) × (x_k − x)

   where rand(0, 1) represents a random number between 0 and 1.
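The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration on hypothetical toy data (all names and parameter values here are assumptions for the demo), not the imblearn implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# hypothetical toy minority class set A: 20 points in 2 dimensions
A = rng.normal(size=(20, 2))

k = 5   # number of nearest neighbors (Step 1)
N = 3   # synthetic samples generated per minority example (Step 2)

# Step 1: k-nearest neighbors of each x in A by Euclidean distance
nn = NearestNeighbors(n_neighbors=k + 1).fit(A)
_, idx = nn.kneighbors(A)          # idx[:, 0] is the point itself

synthetic = []
for i, x in enumerate(A):
    # Step 2: pick N neighbors at random from the k nearest
    chosen = rng.choice(idx[i, 1:], size=N, replace=True)
    for j in chosen:
        # Step 3: interpolate -> x' = x + rand(0, 1) * (x_k - x)
        x_new = x + rng.random() * (A[j] - x)
        synthetic.append(x_new)

synthetic = np.asarray(synthetic)
print(synthetic.shape)  # (60, 2): N new samples per original minority point
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbors, which is what makes SMOTE different from plain replication.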
## NearMiss Algorithm — Undersampling

NearMiss is an undersampling technique. It aims to balance the class distribution by eliminating majority class examples. When instances of two different classes are very close to each other, we remove the majority class instances to increase the space between the two classes. This helps in the classification process.
To prevent the problem of information loss common to most undersampling techniques, near-neighbor methods are widely used.
The basic intuition behind how near-neighbor methods work is as follows:

1. Step 1: The method first finds the distances between all instances of the majority class and the instances of the minority class. Here, the majority class is the one to be undersampled.
2. Step 2: Then, n instances of the majority class that have the smallest distances to the minority class instances are selected.
3. Step 3: If there are k instances in the minority class, the nearest method will result in k × n instances of the majority class.

There are several variations of the NearMiss algorithm for finding the n closest instances in the majority class:

1. NearMiss — version 1: selects samples from the majority class for which the average distance to the k closest minority class instances is the smallest.
2. NearMiss — version 2: selects samples from the majority class for which the average distance to the k farthest minority class instances is the smallest.
3. NearMiss — version 3: works in 2 stages. First, for each minority class instance, its M nearest neighbors are stored. Then, the majority class instances for which the average distance to the N nearest neighbors is the largest are selected.
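The selection rules for versions 1 and 2 can be sketched with plain NumPy. This is a minimal illustration on hypothetical toy data (the sizes and variable names are assumptions for the demo), not the imblearn implementation:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(1)

# hypothetical toy data: 50 majority points, 5 minority points
majority = rng.normal(size=(50, 2))
minority = rng.normal(loc=3.0, size=(5, 2))

k = 3          # number of minority neighbors to average over
n_keep = 5     # keep as many majority samples as there are minority samples

# distances from every majority instance to every minority instance
d = pairwise_distances(majority, minority)

# NearMiss-1: smallest average distance to the k CLOSEST minority instances
avg_closest = np.sort(d, axis=1)[:, :k].mean(axis=1)
keep_v1 = np.argsort(avg_closest)[:n_keep]

# NearMiss-2: smallest average distance to the k FARTHEST minority instances
avg_farthest = np.sort(d, axis=1)[:, -k:].mean(axis=1)
keep_v2 = np.argsort(avg_farthest)[:n_keep]

print(majority[keep_v1].shape)  # (5, 2) -- balanced against the minority class
```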

This article will help you understand and practice how to choose between the different methods of handling unbalanced data.

The dataset consists of transactions made using credit cards. It contains 492 fraudulent transactions out of 284,807 transactions. This makes it extremely unbalanced, with the positive class (fraud) accounting for 0.172% of all transactions.

```python
# import the required modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

# load the dataset
data = pd.read_csv('creditcard.csv')

# print column information of the data frame
print(data.info())
```

Output:

```
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
```


```python
# normalize the Amount column
data['normAmount'] = StandardScaler().fit_transform(
    np.array(data['Amount']).reshape(-1, 1))

# drop the Time and Amount columns as they are not relevant for prediction
data = data.drop(['Time', 'Amount'], axis=1)

# as you can see, there are 492 fraudulent transactions
data['Class'].value_counts()
```


Output:

```
0    284315
1       492
```

### Separate data into test and training sets

```python
from sklearn.model_selection import train_test_split

# separate the features and the target
# (X and y were not defined in the extracted snippet; they are derived
# here from the Class column, matching the shapes in the output below)
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# split in a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# describe the shapes of the train and test sets
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)
```

Output:

```
Number transactions X_train dataset:  (199364, 29)
Number transactions y_train dataset:  (199364, 1)
Number transactions X_test dataset:  (85443, 29)
Number transactions y_test dataset:  (85443, 1)
```

### Now train the model without handling unbalanced class distribution

```python
# logistic regression object
lr = LogisticRegression()

# train the model on the train set
lr.fit(X_train, y_train.ravel())

predictions = lr.predict(X_test)

# print classification report
print(classification_report(y_test, predictions))
```

Output:

```
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85296
           1       0.88      0.62      0.73       147

    accuracy                           1.00     85443
   macro avg       0.94      0.81      0.86     85443
weighted avg       1.00      1.00      1.00     85443
```

The accuracy is 100%, but did you notice something strange?
The recall of the minority class is very low. This proves that the model is biased toward the majority class, so it is not the best model.
We will now apply the different unbalanced data handling techniques and see their effect on accuracy and recall.

### Using the SMOTE algorithm

You can check all the parameters of SMOTE in the imbalanced-learn documentation.


```python
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {}".format(sum(y_train == 0)))

# import the SMOTE module from the imblearn library
# pip install imblearn (if you don't have imblearn on your system)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=2)
# note: in newer versions of imbalanced-learn this method is named fit_resample
X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {}'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))
```


Output:

```
Before OverSampling, counts of label '1': [345]
Before OverSampling, counts of label '0': [199019]
After OverSampling, the shape of train_X: (398038, 29)
After OverSampling, the shape of train_y: (398038,)
After OverSampling, counts of label '1': 199019
After OverSampling, counts of label '0': 199019
```

Look! The SMOTE algorithm has oversampled the minority class and made it equal to the majority class. Both categories now have the same number of records. Specifically, the minority class has been increased to the total count of the majority class.
Now see the accuracy and recall results after applying the SMOTE (oversampling) algorithm.

#### Prediction and recall

```python
lr1 = LogisticRegression()
lr1.fit(X_train_res, y_train_res.ravel())
predictions = lr1.predict(X_test)

# print classification report
print(classification_report(y_test, predictions))
```

Output:

```
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     85296
           1       0.06      0.92      0.11       147

    accuracy                           0.98     85443
   macro avg       0.53      0.95      0.55     85443
weighted avg       1.00      0.98      0.99     85443
```

Wow, the accuracy has dropped to 98% compared to the previous model, but the recall value of the minority class has improved to 92%. This is a good model compared to the previous one: the recall is great.
We will now apply the NearMiss technique to undersample the majority class and see its accuracy and recall results.

### Using the NearMiss algorithm

You can check all the parameters of NearMiss in the imbalanced-learn documentation.

```python
print("Before Undersampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label '0': {}".format(sum(y_train == 0)))

# apply NearMiss
from imblearn.under_sampling import NearMiss
nr = NearMiss()

X_train_miss, y_train_miss = nr.fit_sample(X_train, y_train.ravel())

print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape))
print('After Undersampling, the shape of train_y: {}'.format(y_train_miss.shape))

print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1)))
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))
```

Output:

```
Before Undersampling, counts of label '1': [345]
Before Undersampling, counts of label '0': [199019]
After Undersampling, the shape of train_X: (690, 29)
After Undersampling, the shape of train_y: (690,)
After Undersampling, counts of label '1': 345
After Undersampling, counts of label '0': 345
```

The NearMiss algorithm has undersampled the majority class and made it equal to the minority class. Here the majority class has been reduced to the total count of the minority class, so that both classes have an equal number of records.

#### Prediction and recall

```python
# train the model on the train set
lr2 = LogisticRegression()
lr2.fit(X_train_miss, y_train_miss.ravel())
predictions = lr2.predict(X_test)

# print classification report
print(classification_report(y_test, predictions))
```

Output:

```
              precision    recall  f1-score   support

           0       1.00      0.56      0.72     85296
           1       0.00      0.95      0.01       147

    accuracy                           0.56     85443
   macro avg       0.50      0.75      0.36     85443
weighted avg       1.00      0.56      0.72     85443
```

This model is better than the first model at classifying the minority class: its minority recall is 95%. But due to the undersampling of the majority class, the recall for class 0 dropped to 56%. So in this case SMOTE gives me both good accuracy and good recall, and I would use that model!
