Python | Binning method for data smoothing



The binning method is used to smooth data or to handle noisy data. In this method, the data is first sorted, and the sorted values are then distributed into a number of bins (also called buckets). Because binning methods consult the neighborhood of each value, they perform local smoothing.

There are three approaches to performing smoothing:

Smoothing by bin means: In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
Smoothing by bin median: In this method each bin value is replaced by its bin median value.
Smoothing by bin boundaries: In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
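As a minimal sketch of the three rules applied to a single bin: the function names and the example values are hypothetical, chosen only for illustration, and ties in the boundary rule are sent to the upper boundary, which is an assumption.

```python
import numpy as np


def smooth_by_mean(bin_values):
    """Replace every value in the bin with the bin mean."""
    values = np.asarray(bin_values, dtype=float)
    return np.full_like(values, values.mean())


def smooth_by_median(bin_values):
    """Replace every value in the bin with the bin median."""
    values = np.asarray(bin_values, dtype=float)
    return np.full_like(values, np.median(values))


def smooth_by_boundaries(bin_values):
    """Replace every value with the nearer of the bin minimum and maximum."""
    values = np.asarray(bin_values, dtype=float)
    low, high = values.min(), values.max()
    # Assumption: a value exactly halfway between the two goes to the maximum.
    return np.where(values - low < high - values, low, high)


example_bin = [2, 6, 7, 9, 13]  # hypothetical, already sorted
mean_smoothed = smooth_by_mean(example_bin)
median_smoothed = smooth_by_median(example_bin)
boundary_smoothed = smooth_by_boundaries(example_bin)

print("mean:      ", mean_smoothed)
print("median:    ", median_smoothed)
print("boundaries:", boundary_smoothed)
```

All three functions return an array of the same length as the bin, so the smoothed values can be written back in place of the originals.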

Approach:

  1. Sort the array of the given dataset.
  2. Divide the range into N bins, each containing approximately the same number of samples (equal-depth partitioning).
  3. Store the mean / median / boundaries of each bin in the corresponding row.
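Example:

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Equal-depth bins (four values each):
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

Smoothing by bin median (the values shown are the upper of the two middle values in each four-element bin):
- Bin 1: 9, 9, 9, 9
- Bin 2: 24, 24, 24, 24
- Bin 3: 29, 29, 29, 29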
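The price example can be checked with a short sketch. It assumes `np.array_split` as the way to form equal-depth bins; that is one convenient choice, not necessarily the partitioning used elsewhere in this article. The means are printed unrounded, so 22.75 and 29.25 appear where the example shows 23 and 29.

```python
import numpy as np

prices = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
n_bins = 3

# array_split gives bins whose sizes differ by at most one (equal depth)
bins = np.array_split(prices, n_bins)

bin_means = [float(b.mean()) for b in bins]

# For each value, keep whichever of the bin minimum or maximum is nearer
bin_boundaries = [
    np.where(b - b.min() < b.max() - b, b.min(), b.max()).tolist()
    for b in bins
]

for idx, (b, m, bd) in enumerate(zip(bins, bin_means, bin_boundaries), start=1):
    print(f"Bin {idx}: values={b.tolist()} mean={m} boundaries={bd}")
```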

Below is a Python implementation of the above algorithm:

    import numpy as np
    from sklearn.datasets import load_iris

    # load the iris dataset
    dataset = load_iris()
    a = dataset.data
    b = np.zeros(150)

    # take the second column (index 1, sepal width) of the 4 feature columns
    for i in range(150):
        b[i] = a[i, 1]

    b = np.sort(b)  # sort the array

     
    # create the output arrays: 30 bins of 5 values each
    bin1 = np.zeros((30, 5))
    bin2 = np.zeros((30, 5))
    bin3 = np.zeros((30, 5))

     
    # smoothing by bin means
    for i in range(0, 150, 5):
        k = i // 5  # index of the current bin
        mean = (b[i] + b[i + 1] + b[i + 2] + b[i + 3] + b[i + 4]) / 5
        for j in range(5):
            bin1[k, j] = mean
    print("Bin Mean:", bin1)

     
    # smoothing by bin boundaries
    for i in range(0, 150, 5):
        k = i // 5  # index of the current bin
        for j in range(5):
            # b[i] is the bin minimum and b[i + 4] the bin maximum
            if (b[i + j] - b[i]) < (b[i + 4] - b[i + j]):
                bin2[k, j] = b[i]
            else:
                bin2[k, j] = b[i + 4]
    print("Bin Boundaries:", bin2)

     
    # smoothing by bin median
    for i in range(0, 150, 5):
        k = i // 5  # index of the current bin
        for j in range(5):
            # the middle of 5 sorted values is the median
            bin3[k, j] = b[i + 2]
    print("Bin Median:", bin3)
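The same three results can also be computed without explicit loops. The following is a sketch, not a replacement for the code above: the function name `smooth_equal_depth` and the synthetic sample are hypothetical, and the function assumes the number of values is an exact multiple of the bin depth.

```python
import numpy as np


def smooth_equal_depth(values, depth):
    """Return (means, boundaries, medians), each of shape (n_bins, depth)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    if ordered.size % depth != 0:
        raise ValueError("number of values must be a multiple of depth")

    bins = ordered.reshape(-1, depth)  # one row per bin

    # Broadcast each per-bin statistic across the row
    means = np.repeat(bins.mean(axis=1, keepdims=True), depth, axis=1)
    medians = np.repeat(np.median(bins, axis=1, keepdims=True), depth, axis=1)

    # Rows are sorted, so the first and last columns are the boundaries
    low, high = bins[:, :1], bins[:, -1:]
    boundaries = np.where(bins - low < high - bins, low, high)
    return means, boundaries, medians


# Synthetic stand-in for a 150-value feature column (an assumption)
rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=0.4, size=150)

means, boundaries, medians = smooth_equal_depth(sample, depth=5)
print(means.shape, boundaries.shape, medians.shape)
```

Reshaping works here because every bin has the same size; with uneven bins, a per-bin loop like the one above is simpler.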