  # ML | Binning or Discretization


There are three data smoothing methods:

1. Binning: binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it.
2. Regression: data smoothing by fitting the values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other; a minimal regression-smoothing sketch follows this list.
3. Outlier analysis: outliers can be detected by clustering, for example when similar values are organized into groups, or "clusters". Intuitively, values that fall outside the set of clusters may be considered outliers.
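
To make item 2 concrete, here is a minimal sketch of regression-based smoothing (illustrative only, not part of the binning implementation further below): a line is fitted to a noisy attribute with scikit-learn's LinearRegression, and each value is replaced by its fitted value. The data here is made up for the example.

```
# Minimal sketch of regression-based smoothing (illustrative, made-up data).
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10).reshape(-1, 1)                      # predictor attribute
noise = np.array([0.5, -0.3, 0.8, -0.6, 0.2,
                  -0.4, 0.7, -0.2, 0.3, -0.5])
y = 2.0 * x.ravel() + noise                           # noisy attribute to smooth

model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)                         # each value replaced by the fitted line
print(np.round(y_smoothed, 2))
```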

#### Binning method for data smoothing

Here we deal with the binning method for data smoothing. In this method, the data is first sorted and then the sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.

There are basically two types of binning:

1. Equal-width (or distance) binning: the simplest approach is to divide the range of the variable into k intervals of equal width. The interval width w is simply the range [A, B] of the variable divided by k:
`w = (B - A) / k`

Thus, the i-th interval spans `[A + (i-1)w, A + iw]`, where i = 1, 2, 3, ..., k.
Skewed data is not handled well by this method.

2. Equal-depth (or frequency) binning: the range [A, B] of the variable is divided into intervals that each contain (approximately) the same number of points; exactly equal frequencies may not be achievable because of duplicate values. A short NumPy sketch after this list illustrates both partitioning schemes.
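
As a quick illustration of the two partitioning schemes (a small sketch with made-up data, assuming NumPy is available; it is not part of the implementation further below), equal-width edges come directly from the `w = (B - A) / k` formula, while equal-frequency edges can be taken at evenly spaced quantiles:

```
# Sketch: compute equal-width and equal-frequency bin edges with NumPy.
import numpy as np

data = np.array([2, 6, 7, 9, 13, 20, 21, 24, 30], dtype=float)
k = 3

# Equal-width: k intervals of width w = (B - A) / k
A, B = data.min(), data.max()
w = (B - A) / k
equal_width_edges = A + w * np.arange(k + 1)            # [A, A+w, A+2w, ..., B]

# Equal-frequency: edges taken at evenly spaced quantiles of the data
equal_freq_edges = np.quantile(data, np.linspace(0, 1, k + 1))

print("equal-width edges:    ", equal_width_edges)
print("equal-frequency edges:", equal_freq_edges)
```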

#### How to smooth the data?

There are three approaches to performing smoothing:

1. Smoothing by bin means: each value in a bin is replaced by the mean of that bin.
2. Smoothing by bin medians: each value in a bin is replaced by the median of that bin.
3. Smoothing by bin boundaries: the minimum and maximum values in a given bin are taken as the bin boundaries, and each value in the bin is then replaced by the closest boundary value.

Sorted data by price (in dollars): 2, 6, 7, 9, 13, 20, 21, 24, 30

```
Partition using the equal-frequency approach:
  Bin 1: 2, 6, 7
  Bin 2: 9, 13, 20
  Bin 3: 21, 24, 30

Smoothing by bin means:
  Bin 1: 5, 5, 5
  Bin 2: 14, 14, 14
  Bin 3: 25, 25, 25

Smoothing by bin medians:
  Bin 1: 6, 6, 6
  Bin 2: 13, 13, 13
  Bin 3: 24, 24, 24

Smoothing by bin boundaries:
  Bin 1: 2, 7, 7
  Bin 2: 9, 9, 20
  Bin 3: 21, 21, 30
```
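
The hand-worked numbers above can be checked with a few lines of plain Python (a minimal sketch that assumes the equal-frequency bins have already been formed as listed):

```
# Sketch: verify the worked example, given the three equal-frequency bins.
import statistics

bins = [[2, 6, 7], [9, 13, 20], [21, 24, 30]]

by_mean = [[round(statistics.mean(b), 3)] * len(b) for b in bins]
by_median = [[statistics.median(b)] * len(b) for b in bins]
by_boundary = [[min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
               for b in bins]

print("means:     ", by_mean)        # expected: [5, 5, 5], [14, 14, 14], [25, 25, 25]
print("medians:   ", by_median)      # expected: [6, 6, 6], [13, 13, 13], [24, 24, 24]
print("boundaries:", by_boundary)    # expected: [2, 7, 7], [9, 9, 20], [21, 21, 30]
```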

Binning can also be used as a discretization technique. Here, discretization refers to the process of transforming or breaking down continuous attributes, features, or variables into discrete or nominal attributes/features/variables/intervals.
For example, attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin value with the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. Each continuous value is then converted to the nominal or discretized value corresponding to its bin.
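
For comparison, scikit-learn ships a ready-made discretizer, KBinsDiscretizer, which supports both equal-width (strategy="uniform") and equal-frequency (strategy="quantile") binning. The snippet below is a library-based alternative to the hand-rolled scripts that follow, not part of them:

```
# Sketch: discretization with scikit-learn's KBinsDiscretizer (equal-frequency bins).
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

data = np.array([2, 6, 7, 9, 13, 20, 21, 24, 30], dtype=float).reshape(-1, 1)

discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
bin_index = discretizer.fit_transform(data).ravel()    # nominal bin label per value

print("bin labels:", bin_index)
print("bin edges: ", discretizer.bin_edges_[0])
```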

Below are the Python implementations:

bin_mean

```
import math
from collections import OrderedDict

# read the data values and the desired number of bins
print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# X_dict will hold the data sorted by value, keyed by original index
X_dict = OrderedDict()
# x_old will store the original data
x_old = {}
# x_new will store the data after binning
x_new = {}

for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]

# sort by value so that equal-frequency bins take consecutive sorted values
X_dict = OrderedDict(sorted(X_dict.items(), key=lambda item: item[1]))

# binn holds the mean of each bin
binn = []
# running sum of the current bin
avrg = 0

i = 0
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# perform binning: accumulate values until a bin is full, then record its mean
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        avrg = avrg + h
        i = i + 1
    elif i == num_of_data_in_each_bin:
        binn.append(round(avrg / num_of_data_in_each_bin, 3))
        avrg = h
        i = 1

# the last bin may hold fewer values than the others
rem = len(x) % num_of_data_in_each_bin
if rem == 0:
    binn.append(round(avrg / num_of_data_in_each_bin, 3))
else:
    binn.append(round(avrg / rem, 3))

# assign to each data point the mean of its bin
i = 0
j = 0
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        x_new[g] = binn[j]
        i = i + 1
    else:
        i = 1
        j = j + 1
        x_new[g] = binn[j]

print("number of data in each bin")
print(num_of_data_in_each_bin)

for i in range(len(x)):
    print("index {2} old value {0} new value {1}".format(x_old[i], x_new[i], i))
```
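
For reference, a run of the bin_mean script on the worked-example data above should look roughly like this (user input shown after each prompt):

```
enter the data
2 6 7 9 13 20 21 24 30
enter the number of bins
3
number of data in each bin
3
index 0 old value 2.0 new value 5.0
index 1 old value 6.0 new value 5.0
index 2 old value 7.0 new value 5.0
index 3 old value 9.0 new value 14.0
index 4 old value 13.0 new value 14.0
index 5 old value 20.0 new value 14.0
index 6 old value 21.0 new value 25.0
index 7 old value 24.0 new value 25.0
index 8 old value 30.0 new value 25.0
```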

bin_median


```
import math
import statistics
from collections import OrderedDict

# read the data values and the desired number of bins
print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# X_dict will hold the data sorted by value, keyed by original index
X_dict = OrderedDict()
# x_old will store the original data
x_old = {}
# x_new will store the data after binning
x_new = {}

for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]

# sort by value so that equal-frequency bins take consecutive sorted values
X_dict = OrderedDict(sorted(X_dict.items(), key=lambda item: item[1]))

# binn holds the median of each bin
binn = []
# values collected for the current bin
current_bin = []

i = 0
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# perform binning: collect values until a bin is full, then record its median
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        current_bin.append(h)
        i = i + 1
    elif i == num_of_data_in_each_bin:
        binn.append(statistics.median(current_bin))
        current_bin = [h]
        i = 1

# record the median of the last (possibly smaller) bin
binn.append(statistics.median(current_bin))

# assign to each data point the median of its bin
i = 0
j = 0
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        x_new[g] = round(binn[j], 3)
        i = i + 1
    else:
        i = 1
        j = j + 1
        x_new[g] = round(binn[j], 3)

print("number of data in each bin")
print(num_of_data_in_each_bin)

for i in range(len(x)):
    print("index {2} old value {0} new value {1}".format(x_old[i], x_new[i], i))
```

bin_boundary


```
import math
from collections import OrderedDict

# read the data values and the desired number of bins
print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# X_dict will hold the data sorted by value, keyed by original index
X_dict = OrderedDict()
# x_old will store the original data
x_old = {}
# x_new will store the data after binning
x_new = {}

for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]

# sort by value so that equal-frequency bins take consecutive sorted values
X_dict = OrderedDict(sorted(X_dict.items(), key=lambda item: item[1]))

# binn holds the (min, max) boundaries of each bin
binn = []
# values collected for the current bin
current_bin = []

i = 0
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# perform binning: collect values until a bin is full, then record its boundaries
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        current_bin.append(h)
        i = i + 1
    elif i == num_of_data_in_each_bin:
        binn.append((min(current_bin), max(current_bin)))
        current_bin = [h]
        i = 1

# record the boundaries of the last (possibly smaller) bin
binn.append((min(current_bin), max(current_bin)))

# replace every value with the closer of its bin's two boundaries
i = 0
j = 0
for g, h in X_dict.items():
    if i == num_of_data_in_each_bin:
        i = 0
        j = j + 1
    lower, upper = binn[j]
    x_new[g] = lower if abs(h - lower) <= abs(h - upper) else upper
    i = i + 1

print("number of data in each bin")
print(num_of_data_in_each_bin)

for i in range(len(x)):
    print("index {2} old value {0} new value {1}".format(x_old[i], x_new[i], i))
```