ML | Binning or Discretization


There are three data smoothing methods:

  1. Binning: binning methods smooth sorted data values by consulting their "neighborhood", that is, the values around them.
  2. Regression: smooths the data by fitting the values to a function. Linear regression finds the "best" line through two attributes (or variables) so that one attribute can be used to predict the other.
  3. Outlier analysis: outliers can be detected by clustering, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside the set of clusters can be considered outliers.
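
To make the regression item concrete, here is a minimal sketch of smoothing with a least-squares line. The helper names (fit_line, smooth_by_regression) and the sample data are made up for illustration; the sketch assumes a single predictor whose values are not all identical.

```python
# Illustrative sketch: smooth y by replacing each value with its fitted value
# on the ordinary least-squares line y = slope * x + intercept.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx          # assumes the xs are not all equal
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace every y with the value the fitted line predicts for its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy data that roughly follows y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

print(fit_line(xs, ys))
print(smooth_by_regression(xs, ys))
```

The smoothed values all lie on one straight line, so the noise around the trend is removed while the trend itself is kept.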

Binning method for data smoothing
This section covers the binning method for data smoothing. The data is first sorted, and the sorted values are then distributed into a number of buckets, or bins. Because binning methods consult a neighborhood of values, they perform local smoothing.

There are two basic types of binning:

  1. Equal-width (or distance) binning. The simplest approach is to divide the range of the variable into k intervals of equal width. The interval width is the range [A, B] of the variable divided by k:
     w = (B - A) / k

    Thus, the i-th interval is [A + (i-1)w, A + iw], where i = 1, 2, 3, ..., k.
    This method does not handle skewed data well: most values end up in a few bins while the others stay nearly empty.

  2. Equal-depth (or frequency) binning. Here the range [A, B] of the variable is divided into intervals that each contain (approximately) the same number of points; exactly equal frequencies may not be possible because of duplicate values.
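
The two partitioning schemes can be sketched as follows. The function names and the sample price list are assumptions made for this illustration; the sketch assumes a non-empty list and k >= 1, and it sizes equal-frequency bins as ceil(n / k), so the last bin may be smaller.

```python
import math


def equal_width_edges(values, k):
    """Return the k + 1 boundaries of k equal-width intervals over [A, B]."""
    a, b = min(values), max(values)
    w = (b - a) / k                      # w = (B - A) / k
    return [a + i * w for i in range(k + 1)]


def equal_frequency_bins(values, k):
    """Split the sorted values into bins of ceil(n / k) values each."""
    ordered = sorted(values)
    size = math.ceil(len(ordered) / k)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Sample data (prices in dollars)
prices = [2, 6, 7, 9, 13, 20, 21, 24, 30]

print(equal_width_edges(prices, 3))      # boundaries spaced (30 - 2) / 3 apart
print(equal_frequency_bins(prices, 3))   # three bins of three values each
```

Equal-width binning fixes the interval boundaries and lets the counts vary, while equal-frequency binning fixes the counts and lets the boundaries vary.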

How to smooth the data?

There are three approaches to performing smoothing:

  1. Smoothing by bin means: each value in a bin is replaced by the mean of the bin.
  2. Smoothing by bin medians: each value in a bin is replaced by the median of the bin.
  3. Smoothing by bin boundaries: the minimum and maximum values in a bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.

Sorted data by price (in dollars): 2, 6, 7, 9, 13, 20, 21, 24, 30

Partition using the equal-frequency approach:
  Bin 1: 2, 6, 7
  Bin 2: 9, 13, 20
  Bin 3: 21, 24, 30

Smoothing by bin mean:
  Bin 1: 5, 5, 5
  Bin 2: 14, 14, 14
  Bin 3: 25, 25, 25

Smoothing by bin median:
  Bin 1: 6, 6, 6
  Bin 2: 13, 13, 13
  Bin 3: 24, 24, 24

Smoothing by bin boundary:
  Bin 1: 2, 7, 7
  Bin 2: 9, 9, 20
  Bin 3: 21, 21, 30
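
The worked example can be checked with a short sketch. The function names are illustrative, and sending ties to the lower boundary is an assumption of this sketch, since the text does not say how ties are resolved.

```python
import statistics


def smooth_by_mean(bins):
    """Replace every value in a bin with the bin's mean."""
    return [[statistics.mean(b)] * len(b) for b in bins]


def smooth_by_median(bins):
    """Replace every value in a bin with the bin's median."""
    return [[statistics.median(b)] * len(b) for b in bins]


def smooth_by_boundary(bins):
    """Replace every value with the closer of its bin's minimum and maximum.

    Assumption: a value exactly halfway between the two goes to the minimum.
    """
    result = []
    for b in bins:
        low, high = min(b), max(b)
        result.append([low if v - low <= high - v else high for v in b])
    return result


# Equal-frequency bins from the example
bins = [[2, 6, 7], [9, 13, 20], [21, 24, 30]]

print(smooth_by_mean(bins))
print(smooth_by_median(bins))
print(smooth_by_boundary(bins))
```

Each function returns bins of the same shape as its input, so the smoothed values line up with the original ones position by position.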

Binning can also be used as a discretization technique. Here discretization refers to the process of converting continuous attributes, features, or variables into discrete or nominal ones by splitting their range into intervals.
For example, attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each value with the bin mean or bin median, as in smoothing by bin means or smoothing by bin medians, respectively. Each continuous value is thereby converted to a nominal or discretized value that corresponds to its bin.
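
As a rough illustration of discretization, the sketch below maps each continuous value to a nominal label through its bin index. The edges, the labels, and the discretize helper are hypothetical choices for this example; values outside the outer edges are clamped to the first or last bin.

```python
from bisect import bisect_right


def discretize(value, edges):
    """Return the 0-based index of the interval in `edges` that holds value.

    Intervals are closed on the left, [edges[i], edges[i + 1]); the last
    interval also includes its right edge.
    """
    k = len(edges) - 1                       # number of intervals
    idx = bisect_right(edges, value) - 1
    return min(max(idx, 0), k - 1)           # clamp out-of-range values


# Hypothetical equal-width edges and nominal labels
edges = [0, 10, 20, 30]
labels = ["low", "medium", "high"]

prices = [2, 6, 7, 9, 13, 20, 21, 24, 30]
print([labels[discretize(p, edges)] for p in prices])
```

Replacing the label lookup with a lookup of per-bin means or medians gives the discretized numeric values described above.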

Below is the Python implementation:

bin_mean

import math

print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# x_old stores the original data, keyed by original index
x_old = {i: v for i, v in enumerate(x)}

# x_dict stores (index, value) pairs sorted by value
x_dict = sorted(x_old.items(), key=lambda item: item[1])

# x_new will store the data after binning
x_new = {}

# binn stores the mean of each bin
binn = []

# avrg is the running sum of the current bin, i counts its values
avrg = 0
i = 0

num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# perform the binning on the sorted data
for g, h in x_dict:
    if i == num_of_data_in_each_bin:
        # the current bin is full: store its mean and start a new bin
        binn.append(round(avrg / i, 3))
        avrg = 0
        i = 0
    avrg = avrg + h
    i = i + 1

# the last bin may hold fewer values than the others
if i > 0:
    binn.append(round(avrg / i, 3))

# store the new value of each data point
i = 0
j = 0
for g, h in x_dict:
    if i == num_of_data_in_each_bin:
        i = 0
        j = j + 1
    x_new[g] = binn[j]
    i = i + 1

print("number of data in each bin")
print(num_of_data_in_each_bin)

for i in range(len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))

bin_median

import math
import statistics

print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# x_old stores the original data, keyed by original index
x_old = {i: v for i, v in enumerate(x)}

# x_dict stores (index, value) pairs sorted by value
x_dict = sorted(x_old.items(), key=lambda item: item[1])

# x_new will store the data after binning
x_new = {}

# binn stores the median of each bin
binn = []

# avrg collects the values of the current bin, i counts them
avrg = []
i = 0

num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# perform the binning on the sorted data
for g, h in x_dict:
    if i == num_of_data_in_each_bin:
        # the current bin is full: store its median and start a new bin
        binn.append(statistics.median(avrg))
        avrg = []
        i = 0
    avrg.append(h)
    i = i + 1

# the last bin may hold fewer values than the others
if avrg:
    binn.append(statistics.median(avrg))

# store the new value of each data point
i = 0
j = 0
for g, h in x_dict:
    if i == num_of_data_in_each_bin:
        i = 0
        j = j + 1
    x_new[g] = round(binn[j], 3)
    i = i + 1

print("number of data in each bin")
print(num_of_data_in_each_bin)

for i in range(len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))

bin_boundary

import math

print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# x_old stores the original data, keyed by original index
x_old = {i: v for i, v in enumerate(x)}

# x_dict stores (index, value) pairs sorted by value
x_dict = sorted(x_old.items(), key=lambda item: item[1])

# x_new will store the data after binning
x_new = {}

# binn is a list of bins; each bin is a list of sorted values
binn = []

# avrg collects the values of the current bin, i counts them
avrg = []
i = 0

num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# perform the binning on the sorted data
for g, h in x_dict:
    if i == num_of_data_in_each_bin:
        # the current bin is full: store it and start a new bin
        binn.append(avrg)
        avrg = []
        i = 0
    avrg.append(h)
    i = i + 1

# the last bin may hold fewer values than the others
if avrg:
    binn.append(avrg)

# The source listing breaks off inside the loop above. The remainder is
# reconstructed from the description of smoothing by bin boundaries:
# each value is replaced by the closer of its bin's minimum and maximum
# (a tie goes to the minimum, which is an assumption).
i = 0
j = 0
for g, h in x_dict:
    if i == num_of_data_in_each_bin:
        i = 0
        j = j + 1
    low, high = binn[j][0], binn[j][-1]
    x_new[g] = low if h - low <= high - h else high
    i = i + 1

print("number of data in each bin")
print(num_of_data_in_each_bin)

for i in range(len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))


