
ML | Mathematical explanation of RMSE and R-squared error

RMSE (root mean square error) is a measure of how well the regression line fits the data points. RMSE can also be interpreted as the standard deviation of the residuals.
Consider these points: (1, 1), (2, 2), (2, 3), (3, 6).
We split the above data points into two lists.
Input:

  x = [1, 2, 2, 3]
  y = [1, 2, 3, 6]

Code: Regression Plot

import matplotlib.pyplot as plt
import math

x = [1, 2, 2, 3]
y = [1, 2, 3, 6]

# plot the data points
plt.plot(x, y)

# x-axis label
plt.xlabel('x - axis')

# y-axis label
plt.ylabel('y - axis')

# give the graph a title
plt.title('Regression Graph')

# show the plot
plt.show()


Code: Mean

# In the next step we will find the equation of the line of best fit.
# We use the least-squares slope formula to find the equation of the regression line.
# The line is represented by y = mx + c,
# where m is the slope, (change in y) / (change in x),
# and c is a constant that represents where the line crosses the y-axis.
# The slope m is given by:
"""
      N                              N
m  =  Σ (xi - x_mean)(yi - y_mean) / Σ (xi - x_mean)^2
     i=1                            i=1
"""
# calculate x_mean and y_mean

ct = len(x)

sum_x = 0
sum_y = 0

for i in x:
    sum_x = sum_x + i

x_mean = sum_x / ct
print('Value of X mean', x_mean)

for i in y:
    sum_y = sum_y + i

y_mean = sum_y / ct
print('Value of Y mean', y_mean)

# we now have x_mean and y_mean

Output:

 Value of X mean 2.0
 Value of Y mean 3.0
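The next section states that the slope is 2.5 without showing the computation; here is a minimal sketch of evaluating the slope formula above on our points (the variable names num and den are ours, not from the original listing):

# m = sum((xi - x_mean) * (yi - y_mean)) / sum((xi - x_mean) ** 2)
num = 0
den = 0
for xi, yi in zip(x, y):
    num += (xi - x_mean) * (yi - y_mean)
    den += (xi - x_mean) ** 2

m = num / den
print('Slope', m)   # prints: Slope 2.5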

Code: Linear Equation

# Below we derive the linear equation mathematically.
# Plugging our points into the slope formula gives m = 2.5
# (see the sketch above); now evaluate c to complete the equation.

m = 2.5

c = y_mean - m * x_mean

print('Intercept', c)

Output:

 Intercept -2.0 
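A comment in the next section mentions a regression line plot; here is a minimal sketch, not part of the original listing, that overlays the fitted line y = 2.5x - 2.0 on the data points:

import matplotlib.pyplot as plt

x = [1, 2, 2, 3]
y = [1, 2, 3, 6]

# draw the raw data points
plt.scatter(x, y)

# draw the fitted line y = 2.5x - 2.0 over the range of x
line_x = [min(x), max(x)]
line_y = [2.5 * xi - 2.0 for xi in line_x]
plt.plot(line_x, line_y)

plt.xlabel('x - axis')
plt.ylabel('y - axis')
plt.title('Regression Line')
plt.show()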

Code: Root Mean Square Error using sklearn

# Our regression line equation looks like this:
# y_pred = 2.5x - 2.0
# we name the line y_pred
# (a plot of this regression line is sketched above)

from sklearn.metrics import mean_squared_error

# y_pred for each of our data points is shown below

y = [1, 2, 3, 6]

y_pred = [0.5, 3, 3, 5.5]

# root mean square error via sklearn

rmse = math.sqrt(mean_squared_error(y, y_pred))

print('Root mean square error', rmse)

Output:

 Root mean square error 0.6123724356957945 
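Depending on your scikit-learn version, the math.sqrt step may be unnecessary: older releases accept squared=False in mean_squared_error, and releases from 1.4 onward provide a dedicated root_mean_squared_error function. A sketch under that version assumption:

# assumes scikit-learn >= 1.4
from sklearn.metrics import root_mean_squared_error

rmse = root_mean_squared_error(y, y_pred)
print('Root mean square error', rmse)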

Code: RMSE Calculation

# let's see how the root mean square error is calculated mathematically
# let's introduce a term called residuals
# a residual is basically the vertical distance from a data point to the regression line
# (in the original article's graphic the residuals are drawn as red lines)
# the RMSE and the residuals are calculated as shown below
# we have 4 data points
"""
for each point i, the residual is ri = yi - y_pred_i,
where y_pred_i = m * xi + c, so
ri = yi - (m * xi + c)
e.g. for x = 1 the actual y value is 1;
we evaluate what our model predicted for x = 1
"""
# point (1, 1), x = 1: y_pred1 = 2.5 * 1 - 2.0 = 0.5

r1 = 1 - (2.5 * 1 - 2.0)

# point (2, 2), x = 2: y_pred2 = 2.5 * 2 - 2.0 = 3

r2 = 2 - (2.5 * 2 - 2.0)

# point (2, 3), x = 2: y_pred3 = 2.5 * 2 - 2.0 = 3

r3 = 3 - (2.5 * 2 - 2.0)

# point (3, 6), x = 3: y_pred4 = 2.5 * 3 - 2.0 = 5.5

r4 = 6 - (2.5 * 3 - 2.0)

# from the calculations above we have the residual values

residuals = [0.5, -1, 0, 0.5]

# now calculate the root mean square error
# N = 4 data points

N = 4

rmse = math.sqrt((r1 ** 2 + r2 ** 2 + r3 ** 2 + r4 ** 2) / N)

print('Root Mean square error using maths', rmse)

# the RMSE computed by hand
# matches the sklearn value above

Output:

 Root Mean square error using maths 0.6123724356957945 
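The calculation above hardcodes the four residuals; it generalizes naturally to any number of points. A minimal sketch of that generalization (the function name rmse_manual is our own, not from the article):

import math

def rmse_manual(x, y, m, c):
    # residual for each point: ri = yi - (m * xi + c)
    squared_sum = 0
    for xi, yi in zip(x, y):
        r = yi - (m * xi + c)
        squared_sum += r ** 2
    return math.sqrt(squared_sum / len(x))

print(rmse_manual([1, 2, 2, 3], [1, 2, 3, 6], 2.5, -2.0))
# 0.6123724356957945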

R-squared error or coefficient of determination
The R² error answers the following question:
how much of the variation in y is explained by the variation in x? In other words, what percentage of the total variation in y is captured by the regression line? It is computed as R² = 1 - SE_line / SE_mean, where SE_line is the squared error of the regression line and SE_mean is the squared error around the mean of y.

Code: R-Squared Error

# SE_line = (y1 - (m*x1 + c))**2 + (y2 - (m*x2 + c))**2 + ... + (yn - (m*xn + c))**2
# SE_line = (1 - (2.5*1 + (-2)))**2 + (2 - (2.5*2 + (-2)))**2 + (3 - (2.5*2 + (-2)))**2 + (6 - (2.5*3 + (-2)))**2

val1 = (1 - (2.5 * 1 + (-2))) ** 2

val2 = (2 - (2.5 * 2 + (-2))) ** 2

val3 = (3 - (2.5 * 2 + (-2))) ** 2

val4 = (6 - (2.5 * 3 + (-2))) ** 2

SE_line = val1 + val2 + val3 + val4

print('val', val1, val2, val3, val4)

 
# next, the total deviation of y from its mean
# the variation in y is calculated as
# y_var = (y1 - y_mean)**2 + (y2 - y_mean)**2 + ... + (yn - y_mean)**2

y = [1, 2, 3, 6]

y_var = (1 - 3) ** 2 + (2 - 3) ** 2 + (3 - 3) ** 2 + (6 - 3) ** 2

SE_mean = y_var

# by computing y_var we measure the distance
# between the data points y and the mean of y
# so, to answer our question, the % of the total variation
# in y explained by x is given by:

r_squared = 1 - (SE_line / SE_mean)

# SE_line / SE_mean -> tells us what % of the variation
# is NOT explained by the regression line
# 1 - (SE_line / SE_mean) -> gives us exactly
# what % of the variation in y is explained by x

print('Rsquared error', r_squared)

Output:

 Rsquared error 0.8928571428571429 
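The same R-squared computation, generalized to arbitrary predictions (the helper name r_squared_manual is our own, not from the article):

def r_squared_manual(y, y_pred):
    y_mean = sum(y) / len(y)
    # squared error of the regression line
    se_line = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))
    # squared error around the mean
    se_mean = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - se_line / se_mean

print(r_squared_manual([1, 2, 3, 6], [0.5, 3, 3, 5.5]))
# 0.8928571428571429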

Code: R-Squared Error with sklearn

from sklearn.metrics import r2_score

# the r2 error calculated by sklearn matches
# our mathematically calculated r2 error
# compute the r2 error with sklearn
r2_score(y, y_pred)

Output:

0.8928571428571429 