Binning in Data Mining

Data binning, also called bucketing, is a data pre-processing method used to minimize the effects of small observation errors. The original data values are divided into small intervals known as bins and then replaced by a single representative value calculated for that bin, such as its mean or median. This has a smoothing effect on the input data and can also reduce the chance of overfitting on small data sets.
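As a small illustration of replacing values with a value calculated per bin, the sketch below smooths data by bin means. The helper name `smooth_by_bin_means`, the sample values, and the choice of the mean as the representative value are assumptions made for this example; the median or the bin boundaries are other common choices.

```python
def smooth_by_bin_means(values, bin_size):
    """Replace each value with the mean of its bin.

    Bins are formed from `bin_size` consecutive values of the sorted data
    (the last bin may be shorter).
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        # every value in the bin is replaced by the same mean
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# hypothetical sample data, three bins of three values each
print(smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26], 3))
# [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
```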

There are two common methods of dividing data into bins:

  • Equal Frequency Binning: each bin holds the same number of values (or as close to it as possible).
  • Equal Width Binning: every bin spans the same width w = (max - min) / (no of bins), so the bin boundaries are min, min + w, min + 2w, …, min + nw, where n is the number of bins.

Equal Frequency binning

Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]

Equal Width binning

Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
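The equal-width result above follows directly from the bin boundaries: with min = 5, max = 215 and 3 bins, the width is (215 - 5) / 3 = 70. A minimal sketch of that calculation (the function name `equal_width_edges` is made up for this example):

```python
def equal_width_edges(values, num_bins):
    """Return the num_bins + 1 boundaries min, min + w, ..., min + num_bins * w."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + width * i for i in range(num_bins + 1)]

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(equal_width_edges(data, 3))  # [5.0, 75.0, 145.0, 215.0]
```

Values from 5 up to 75 land in the first bin, 92 is the only value between 75 and 145, and 204 and 215 fall in the last bin.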



Implementation of Binning Technique

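# equal frequency
def equifreq(arr1, m):
    # sort a copy so that each bin holds neighbouring values
    arr1 = sorted(arr1)
    a = len(arr1)
    # every bin gets n values; the first `extra` bins get one more,
    # so no value is dropped when a is not divisible by m
    n, extra = divmod(a, m)
    start = 0
    for i in range(m):
        end = start + n + (1 if i < extra else 0)
        print(arr1[start:end])
        start = end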
 
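# equal width
def equiwidth(arr1, m):
    min1 = min(arr1)
    # keep w as a float: truncating it could leave max(arr1) outside the last bin
    w = (max(arr1) - min1) / m
    # bin edges: min, min + w, ..., min + m * w
    arr = [min1 + w * i for i in range(m + 1)]
    arri = []

    for i in range(m):
        temp = []
        for j in arr1:
            # half-open bins [lower, upper) so an edge value is counted once;
            # the last bin also takes everything from its upper edge up (the max)
            if arr[i] <= j < arr[i + 1] or (i == m - 1 and j >= arr[i + 1]):
                temp += [j]
        arri += [temp]
    print(arri)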
 
# data to be binned
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
 
# no of bins
m = 3
 
print("equal frequency binning")
equifreq(data, m)
 
print("\n\nequal width binning")
equiwidth(data, m)

Output:

equal frequency binning
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]


equal width binning
[[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]] 



What is Data Binning?

Binning, also called discretization, is a technique for reducing the cardinality of continuous and discrete data. Binning groups related values together into bins to reduce the number of distinct values.

Binning can dramatically improve resource utilization and model build response time without significant loss in model quality. It can also improve model quality by strengthening the relationship between attributes.

Supervised binning is a form of intelligent binning in which important characteristics of the data are used to determine the bin boundaries. In supervised binning, the bin boundaries are identified by a single-predictor decision tree that takes into account the joint distribution with the target. Supervised binning can be used for both numeric and categorical attributes.
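To make the idea of a target-aware boundary concrete, here is a minimal sketch of the simplest possible single-predictor tree: one split on one numeric attribute, chosen to minimize weighted Gini impurity of a binary target. The function names, the Gini criterion, and the sample data are all assumptions for illustration; a real supervised binning tool would typically grow a deeper tree and produce several boundaries.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_boundary(x, y):
    """Return the split point on x that best separates the 0/1 target y."""
    pairs = sorted(zip(x, y))
    best_score, best_cut = float("inf"), None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical predictor values
        left = [label for _, label in pairs[:k]]
        right = [label for _, label in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            # place the boundary halfway between the two neighbouring values
            best_score, best_cut = score, (pairs[k - 1][0] + pairs[k][0]) / 2
    return best_cut

# hypothetical data: customer age and whether a purchase was made
ages = [22, 25, 31, 38, 45, 52, 60, 67]
bought = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_boundary(ages, bought))  # 41.5
```

Unlike equal-width or equal-frequency binning, the boundary here depends on the target: it falls where the target values change, not at a fixed fraction of the range.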





Image data processing

In the context of image processing, binning is the process of combining a group of adjacent pixels into a single pixel. With 2x2 binning, for example, an array of 4 pixels becomes a single larger pixel [1], reducing the total number of pixels to a quarter.

This aggregation, although it involves a loss of information, reduces the amount of data to be processed and thereby simplifies analysis. Binning can also reduce the effect of read noise on the processed image, at the cost of lower resolution.
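A minimal sketch of 2x2 pixel binning on an image stored as a list of rows. The function name `bin_pixels` and the choice to sum each block (rather than average it) are assumptions for this example; rows or columns that do not fill a complete block are dropped.

```python
def bin_pixels(image, factor=2):
    """Combine each factor x factor block of pixels into one pixel by summing."""
    rows, cols = len(image), len(image[0])
    binned = []
    for r in range(0, rows - rows % factor, factor):
        binned_row = []
        for c in range(0, cols - cols % factor, factor):
            block_sum = sum(
                image[r + dr][c + dc]
                for dr in range(factor)
                for dc in range(factor)
            )
            binned_row.append(block_sum)
        binned.append(binned_row)
    return binned

# hypothetical 4x4 image of pixel intensities
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(bin_pixels(image))  # [[14, 22], [46, 54]]
```

The 16 input pixels become 4 output pixels, each carrying the combined signal of its block.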

Example of use

Histograms are an example of data binning used to observe underlying frequency distributions. They are usually built over a one-dimensional space with equal-width intervals for ease of visualization.
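A histogram is equal-width binning where only the count per bin is kept. The sketch below is a minimal version under that reading; the function name `histogram` is illustrative, and the data is the same list used in the binning example earlier.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything goes in the first bin
        else:
            # the maximum would compute index num_bins, so clamp it into the last bin
            index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
counts = histogram(data, 3)
print(counts)  # [9, 1, 2]

# crude text rendering, one row per bin
for count in counts:
    print("#" * count)
```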

Data binning can be used when small instrumental shifts in the spectral dimension of mass spectrometry (MS) or nuclear magnetic resonance (NMR) experiments would otherwise be misinterpreted as representing different components when a set of data profiles is submitted to a pattern recognition analysis. A simple way to solve this problem is to use binning techniques that reduce the spectral resolution just enough to ensure that a given peak stays in its bin despite small spectral shifts between analyses. For example, in NMR, the chemical shift axis can be discretized and coarsely binned, and in MS, spectral accuracies can be rounded to integer atomic mass unit values. Additionally, some digital camera systems include automatic pixel binning to improve image contrast.
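The MS case can be sketched as follows: peaks are rounded to integer mass values and intensities that land in the same bin are summed. The function name `bin_spectrum` and the two sample runs are hypothetical, chosen only to show that slightly shifted peaks end up in the same bins.

```python
from collections import defaultdict

def bin_spectrum(peaks):
    """Sum the intensities of (mass, intensity) peaks that round to the same integer mass."""
    binned = defaultdict(float)
    for mass, intensity in peaks:
        binned[round(mass)] += intensity
    return dict(sorted(binned.items()))

# two hypothetical runs of the same sample with small instrumental shifts
run_a = [(100.02, 10.0), (150.98, 4.0)]
run_b = [(99.97, 9.0), (151.03, 5.0)]

print(bin_spectrum(run_a))  # {100: 10.0, 151: 4.0}
print(bin_spectrum(run_b))  # {100: 9.0, 151: 5.0}
```

Before binning, the two runs share no mass value exactly; after binning, both report peaks at 100 and 151, so a pattern recognition step sees the same components.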

Binning is also used in machine learning to speed up decision-tree boosting for supervised classification and regression, in algorithms such as Microsoft's LightGBM and scikit-learn's histogram-based gradient boosting classification tree.
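The speed-up comes from replacing each feature value with a small integer bin index before trees are built, so split search only has to consider bin boundaries rather than every distinct value. The sketch below illustrates that pre-binning step only; it is not the actual code of LightGBM or scikit-learn, and the function names and the quantile-based choice of thresholds are assumptions.

```python
from bisect import bisect_right

def quantile_thresholds(values, max_bins):
    """Pick up to max_bins - 1 thresholds at evenly spaced positions in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = {ordered[(n * k) // max_bins] for k in range(1, max_bins)}
    return sorted(cuts)

def bin_index(value, thresholds):
    """Map a raw feature value to the index of the bin it falls into."""
    return bisect_right(thresholds, value)

feature = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
thresholds = quantile_thresholds(feature, 3)
print(thresholds)  # [15, 72]
print([bin_index(v, thresholds) for v in feature])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

With 3 bins, a tree learner would need to evaluate only 2 candidate split points for this feature instead of 11.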




Advantages of data smoothing

Data smoothing makes important patterns that are hidden in the data set easier to see. The smoothed data can be used to identify and predict trends, and such predictions help in making the right decisions at the right time.

Because it suppresses the noise introduced by small observation errors, smoothing can also make results derived from the data more reliable.




Disadvantages of data smoothing

Data smoothing does not always give a clear explanation of the patterns in the data. By emphasizing some data points it can also cause others to be ignored; for example, a genuine outlier may be averaged away within its bin.




