ML | Handling Unbalanced Data with SMOTE and Near Miss Algorithm in Python



Standard ML methods such as decision trees and logistic regression are biased towards the majority class and tend to ignore the minority class. They tend to predict only the majority class, so the minority class is heavily misclassified compared to the majority class. In other words, if there is an imbalanced distribution of the data in our dataset, the model becomes prone to giving the minority class little or very low recall.

Unbalanced Data Handling Techniques: There are mainly 2 algorithms that are widely used to handle an unbalanced class distribution.

  1. SMOTE (Synthetic Minority Oversampling Technique) - oversampling
  2. NearMiss Algorithm - undersampling

    SMOTE (Synthetic Minority Oversampling Technique) - oversampling

    SMOTE is one of the most commonly used oversampling techniques to solve the imbalance problem.
    It aims to balance the class distribution by randomly increasing minority class examples, replicating them.
    SMOTE synthesizes new minority instances between existing minority instances. It generates virtual training records by linear interpolation for the minority class. These synthetic training records are generated by randomly selecting one or more of the k-nearest neighbors for each example in the minority class. After the resampling process, the data is reconstructed and several classification models can be applied to the processed data.
    For a deeper understanding of how the SMOTE algorithm works:

    Step 1: Setting the minority class set A, for each x in A, the k-nearest neighbors of x are obtained by calculating the Euclidean distance between x and every other sample in set A.
    Step 2: The sampling rate N is set according to the imbalanced proportion. For each x in A, N examples (i.e. x1, x2, ... xN) are randomly selected from its k-nearest neighbors, and they construct the set A1.
    Step 3: For each example x_k in A1 (k = 1, 2, 3 ... N), the following formula is used to generate a new example:

        x' = x + rand(0, 1) * |x - x_k|

    where rand(0, 1) represents a random number between 0 and 1. A minimal code sketch of this interpolation step is shown below.
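    As a rough illustration (not the imblearn implementation used later in this article), the interpolation step above can be sketched in a few lines of Python; x, x_k and the random generator below are hypothetical stand-ins for a minority sample and one of its k-nearest minority neighbors:

    import numpy as np

    def smote_sample(x, x_k, rng=np.random.default_rng(0)):
        # x' = x + rand(0, 1) * |x - x_k|, following the formula above
        return x + rng.random() * np.abs(x - x_k)

    # example: synthesize one point between two 2-D minority samples
    x = np.array([1.0, 2.0])
    x_k = np.array([2.0, 3.5])
    print(smote_sample(x, x_k))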
    NearMiss Algorithm - undersampling

    NearMiss is an undersampling technique. It aims to balance the class distribution by randomly eliminating majority class examples. When instances of two different classes are very close to each other, we remove the majority class instances to increase the space between the two classes. This helps in the classification process.
    To prevent the information loss that affects most undersampling techniques, near-neighbor methods are widely used.
    The basic intuition about how near-neighbor methods work is as follows:

    Step 1: The method first finds the distances between all instances of the majority class and the instances of the minority class. Here, the majority class is to be undersampled.
    Step 2: Then, the n instances of the majority class that have the smallest distances to instances of the minority class are selected.
    Step 3: If there are k instances in the minority class, the nearest method will result in k * n instances of the majority class.
    There are several versions of the NearMiss algorithm for finding the n closest instances of the majority class:

    1. NearMiss — version 1: it selects samples from the majority class for which the average distance to the k closest minority class instances is the smallest.
    2. NearMiss — version 2: it selects samples from the majority class for which the mean distance to the k farthest minority class instances is the smallest.
    3. NearMiss - version 3: it works in 2 steps. First, for each minority class instance, its M nearest neighbors are stored. Then, the majority class instances are selected for which the average distance to the N nearest neighbors is the largest. A minimal sketch of selecting these variants with imblearn follows this list.
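    For reference, a minimal sketch of how these three variants can be selected with the imblearn library (the same package used in the code later in this article); the version and n_neighbors parameters are the relevant options:

    from imblearn.under_sampling import NearMiss

    nm1 = NearMiss(version=1, n_neighbors=3)   # smallest average distance to the k closest minority samples
    nm2 = NearMiss(version=2, n_neighbors=3)   # smallest average distance to the k farthest minority samples
    nm3 = NearMiss(version=3)                  # two-step variant described above

    # X_res, y_res = nm1.fit_resample(X, y)    # X, y: features and labels of an imbalanced dataset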

    This article will help you better understand and practice how best to choose between these different methods for handling unbalanced data.

    Load the libraries and the data file

    The dataset consists of transactions made by credit cards. It contains 492 fraudulent transactions out of 284,807 transactions, which makes it highly unbalanced, with the positive class (fraud) accounting for 0.172% of all transactions.
    The dataset can be downloaded here.
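    As a quick check of the 0.172% figure quoted above, the share of fraudulent transactions can be computed directly:

    # fraction of fraudulent transactions in the dataset
    print(492 / 284807 * 100)   # roughly 0.1727 %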

    # import the required modules
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import confusion_matrix, classification_report

    # load the dataset
    data = pd.read_csv('creditcard.csv')

    # print column information of the data frame
    print(data.info())

    Output:

     RangeIndex: 284807 entries, 0 to 284806
     Data columns (total 31 columns):
     Time      284807 non-null float64
     V1        284807 non-null float64
     V2        284807 non-null float64
     V3        284807 non-null float64
     V4        284807 non-null float64
     V5        284807 non-null float64
     V6        284807 non-null float64
     V7        284807 non-null float64
     V8        284807 non-null float64
     V9        284807 non-null float64
     V10       284807 non-null float64
     V11       284807 non-null float64
     V12       284807 non-null float64
     V13       284807 non-null float64
     V14       284807 non-null float64
     V15       284807 non-null float64
     V16       284807 non-null float64
     V17       284807 non-null float64
     V18       284807 non-null float64
     V19       284807 non-null float64
     V20       284807 non-null float64
     V21       284807 non-null float64
     V22       284807 non-null float64
     V23       284807 non-null float64
     V24       284807 non-null float64
     V25       284807 non-null float64
     V26       284807 non-null float64
     V27       284807 non-null float64
     V28       284807 non-null float64
     Amount    284807 non-null float64
     Class     284807 non-null int64

    # standardize the Amount column
    data['normAmount'] = StandardScaler().fit_transform(np.array(data['Amount']).reshape(-1, 1))

    # drop the Time and Amount columns as they are not needed for prediction
    data = data.drop(['Time', 'Amount'], axis=1)

    # as you can see, there are 492 fraudulent transactions
    data['Class'].value_counts()

    Output:

     0    284315
     1       492

    Separate data into test and training sets
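    The split below uses a feature matrix X and a target y that are not defined in the original listing. A minimal sketch, assuming that every remaining column except Class is a feature and Class is the target (which matches the 29-column shapes printed below):

    # assumed feature/target split (not shown in the original listing)
    X = data.drop(['Class'], axis=1)              # all 29 feature columns
    y = np.array(data['Class']).reshape(-1, 1)    # target kept two-dimensional, hence the .ravel() calls below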

    from sklearn.model_selection import train_test_split

    # split into train and test sets in a 70:30 ratio
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # describe the shapes of the train and test sets
    print("Number transactions X_train dataset: ", X_train.shape)
    print("Number transactions y_train dataset: ", y_train.shape)
    print("Number transactions X_test dataset: ", X_test.shape)
    print("Number transactions y_test dataset: ", y_test.shape)

    Output:

     Number transactions X_train dataset:  (199364, 29)
     Number transactions y_train dataset:  (199364, 1)
     Number transactions X_test dataset:   (85443, 29)
     Number transactions y_test dataset:   (85443, 1)

    Now train the model without handling unbalanced class distribution

    # create the logistic regression object
    lr = LogisticRegression()

    # train the model on the training set
    lr.fit(X_train, y_train.ravel())

    predictions = lr.predict(X_test)

    # print the classification report
    print(classification_report(y_test, predictions))

    Output:

                   precision    recall  f1-score   support

                0       1.00      1.00      1.00     85296
                1       0.88      0.62      0.73       147

         accuracy                           1.00     85443
        macro avg       0.94      0.81      0.86     85443
     weighted avg       1.00      1.00      1.00     85443

    Accuracy comes out at 100%, but did you notice something strange?
    The recall of the minority class is very low. This proves that the model is biased towards the majority class, so it is not the best model.
    We will now apply different methods for handling the unbalanced data and look at their accuracy and recall results. A quick confusion-matrix check is sketched below.
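    Since confusion_matrix is already imported at the top, a quick way to see how many fraud cases this model actually catches is to print the confusion matrix for the same predictions. A minimal sketch, reusing y_test and predictions from the code above:

    from sklearn.metrics import confusion_matrix

    # rows are true classes, columns are predicted classes;
    # the second row shows how many fraudulent transactions are caught vs. missed
    print(confusion_matrix(y_test, predictions))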

    Using the SMOTE algorithm

    You can check all the parameters here.

    print ( "Before OverSampling, counts of label `1`: {}" . format ( sum (y_train = = 1 )))

    print ( "Before OverSampling, counts of label` 0`: {} " . format ( sum (y_train = = 0 )))

     
    # import of the SMOTE module from imblearn library
    # pip install imblearn (if you don`t have imblearn on your system)

    from imblearn.over_sampling import SMOTE

    sm = SMOTE (random_state = 2 )

    X_train_res, y_train_res = sm.fit_sample (X_train, y_train.ravel ())

     

    print ( `After OverSampling, the shape of train_X: {} ` . format (X_train_res.shape))

    print ( `After OverSampling, the shape of train_y: {} ` . format (y_train_res.shape) )

     

    print ( " After OverSampling, counts of label `1`: {}" . format ( sum (y_train_res = = 1 )))

    print ( "After OverSampling, counts of label `0`: {}" . format ( sum (y_train_res = = 0 )))

    Output:

     Before OverSampling, counts of label '1': [345]
     Before OverSampling, counts of label '0': [199019]
     After OverSampling, the shape of train_X: (398038, 29)
     After OverSampling, the shape of train_y: (398038,)
     After OverSampling, counts of label '1': 199019
     After OverSampling, counts of label '0': 199019

    Look! The SMOTE algorithm has oversampled the minority class and made it equal to the majority class. Both categories now have an equal number of records. In particular, the minority class has been increased to the total count of the majority class.
    Now look at the accuracy and recall results after applying the SMOTE (oversampling) algorithm.

    Prediction and recall

    lr1 = LogisticRegression()
    lr1.fit(X_train_res, y_train_res.ravel())
    predictions = lr1.predict(X_test)

    # print the classification report
    print(classification_report(y_test, predictions))

    Output:

                   precision    recall  f1-score   support

                0       1.00      0.98      0.99     85296
                1       0.06      0.92      0.11       147

         accuracy                           0.98     85443
        macro avg       0.53      0.95      0.55     85443
     weighted avg       1.00      0.98      0.99     85443

    Wow, we have reduced the accuracy to 98% compared to the previous model, but the recall value of the minority class has also improved to 92%. This is a good model compared to the previous one. Recall is great.
    We will now apply the NearMiss method to undersample the majority class and look at its accuracy and recall results.

    NearMiss algorithm:

    You can check all the parameters here.

    print ( "Before Undersampling, counts of label` 1`: {} " . format ( sum (y_train = = 1 )))

    print ( "Before Undersampling, counts of label` 0`: {} " . format ( sum (y_train = = 0 )))

     
    # apply about miss

    from imblearn.under_sampling import NearMiss

    nr = NearMiss ()

     

    X_train_miss, y_train_miss = nr.fit_sample (X_train, y_train.ravel () )

     

    print ( `After Undersampling, the shape of train_X: {}` . format (X_train_miss.shape))

    print ( `After Undersampling, the shape of train_y: {}` . format (y_train_miss.shape))

     

    print ( " After Undersampling, counts of label `1`: {}" . format ( sum (y_train_miss = = 1 )))

    print ( "After Undersampling, counts of label` 0`: {} " . format ( sum (y_train_miss = = 0 )))

    Output:

     Before Undersampling, counts of label '1': [345]
     Before Undersampling, counts of label '0': [199019]
     After Undersampling, the shape of train_X: (690, 29)
     After Undersampling, the shape of train_y: (690,)
     After Undersampling, counts of label '1': 345
     After Undersampling, counts of label '0': 345

    The NearMiss algorithm has undersampled the majority class and made it equal to the minority class. Here the majority class has been reduced to the total count of the minority class, so that both classes have an equal number of records.

    Prediction and recall

    # train the model on the undersampled training set
    lr2 = LogisticRegression()

    lr2.fit(X_train_miss, y_train_miss.ravel())

    predictions = lr2.predict(X_test)

    # print the classification report
    print(classification_report(y_test, predictions))

    Output:

                   precision    recall  f1-score   support

                0       1.00      0.56      0.72     85296
                1       0.00      0.95      0.01       147

         accuracy                           0.56     85443
        macro avg       0.50      0.75      0.36     85443
     weighted avg       1.00      0.56      0.72     85443

    This model is better than the first model because it classifies the minority class better, and its minority class recall value is 95%. But because of the undersampling of the majority class, its recall for the majority class has dropped to 56%. So in this case, SMOTE gives both high accuracy and high recall, and that is the model I would use!