Linear Regression (Python Implementation)



Linear regression is a statistical approach for modeling the relationship between a dependent variable and a given set of explanatory variables.

Note: in this article, we refer to the dependent variable as the response and the independent variables as features, for simplicity.

To provide a basic understanding of linear regression, we'll start with its most basic variation: simple linear regression.

Simple Linear Regression

Simple linear regression is an approach for predicting a response using a single feature.

The two variables are assumed to be linearly related. So, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).

Let's consider a dataset in which we have a response value y for each feature value x:

x: 0  1  2  3  4  5  6  7  8  9
y: 1  3  2  5  7  8  8  9  10 12

For generality, we define:

x as the feature vector, i.e. x = [x_1, x_2, …, x_n],

y as the response vector, i.e. y = [y_1, y_2, …, y_n]

for n observations (in the above example, n = 10).

A scatter plot of this dataset looks like this:
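Since the figure itself is not reproduced here, the following is a minimal sketch of how such a scatter plot could be generated with matplotlib, assuming the x and y arrays of the dataset above (the color and marker settings are just illustrative choices):

import numpy as np
import matplotlib.pyplot as plt

# the example dataset defined above
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# scatter plot of the response y against the feature x
plt.scatter(x, y, color="m", marker="o", s=30)
plt.xlabel('x')
plt.ylabel('y')
plt.show()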

Now the task is to find the line that fits best in the scatter plot above, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset).

This line is called the regression line.

The equation of the regression line is represented as:

h(x_i) = b_0 + b_1 * x_i

Here,

  • h(x_i) represents the predicted response value for the i-th observation.
  • b_0 and b_1 are regression coefficients and represent the y-intercept and the slope of the regression line, respectively.

To create our model, we must "learn" or estimate the values of the regression coefficients b_0 and b_1. Once we have estimated these coefficients, we can use the model to predict responses!

In this article, we are going to use the least squares technique.

Now consider:

y_i = b_0 + b_1 * x_i + e_i = h(x_i) + e_i, i.e. e_i = y_i - h(x_i)

Here, e_i is the residual error in the i-th observation.
So, our goal is to minimize the total residual error.

We define the squared error or cost function J as:

J(b_0, b_1) = (1 / 2n) * Σ_{i=1}^{n} e_i^2

and our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimal!
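To make the cost function concrete, here is a small sketch of how J(b_0, b_1) could be evaluated for candidate coefficients; the helper name cost_J is purely illustrative and not part of the listing below:

import numpy as np

def cost_J(x, y, b_0, b_1):
    # J(b_0, b_1) = (1 / 2n) * sum of squared residuals e_i = y_i - h(x_i)
    n = np.size(x)
    y_pred = b_0 + b_1 * x      # h(x_i) for every observation
    e = y - y_pred              # residual errors
    return np.sum(e ** 2) / (2 * n)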

Without going into the mathematical details, we present the result here:

b_1 = SS_xy / SS_xx
b_0 = m_y - b_1 * m_x

where SS_xy is the sum of cross-deviations of y and x:

SS_xy = Σ_{i=1}^{n} (x_i - m_x)(y_i - m_y) = Σ_{i=1}^{n} x_i * y_i - n * m_x * m_y

and SS_xx is the sum of squared deviations of x:

SS_xx = Σ_{i=1}^{n} (x_i - m_x)^2 = Σ_{i=1}^{n} x_i^2 - n * m_x^2

(here m_x and m_y are the means of x and y, respectively).

Note: the complete derivation of the least squares estimates in simple linear regression can be found here.

Below is a Python implementation of the above technique on our small example dataset:

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of the x and y vectors
    m_x, m_y = np.mean(x), np.mean(y)

    # calculating the cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x

    # calculating the regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1] * x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients: b_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting the regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output of the above piece of code:

 Estimated coefficients: b_0 = -0.0586206896552 b_1 = 1.45747126437 

And the resulting graph looks like this:
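As a quick sanity check (not part of the original listing), the same least-squares line can be fitted with NumPy's polyfit, which should agree with estimate_coef up to floating-point error:

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
b_1, b_0 = np.polyfit(x, y, 1)
print("b_0 = {}, b_1 = {}".format(b_0, b_1))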

Multiple Linear Regression

Multiple linear regression attempts to model the relationship between two or more features and the response by fitting a linear equation to the observed data.

Clearly, this is nothing more than an extension of simple linear regression.

Consider a dataset with p features (or independent variables) and one response (or dependent variable).
Also, the dataset contains n rows/observations.

We define:

X (feature matrix) = a matrix of size n x p where x_{ij} denotes the value of the j-th feature for the i-th observation.

So,

X = [[x_{11}, x_{12}, …, x_{1p}],
     [x_{21}, x_{22}, …, x_{2p}],
     …,
     [x_{n1}, x_{n2}, …, x_{np}]]

and

y (response vector) = a vector of size n where y_{i} denotes the response value for the i-th observation:

y = [y_1, y_2, …, y_n]

The regression line for p features is represented as:

h(x_i) = b_0 + b_1 * x_{i1} + b_2 * x_{i2} + … + b_p * x_{ip}

where h(x_i) is the predicted response value for the i-th observation and b_0, b_1, …, b_p are the regression coefficients.

We can also write:

y_i = b_0 + b_1 * x_{i1} + b_2 * x_{i2} + … + b_p * x_{ip} + e_i, i.e. y_i = h(x_i) + e_i

where e_i represents the residual error in the i-th observation.

We can generalize our linear model a little by augmenting the feature matrix X with a column of ones:

X = [[1, x_{11}, x_{12}, …, x_{1p}],
     [1, x_{21}, x_{22}, …, x_{2p}],
     …,
     [1, x_{n1}, x_{n2}, …, x_{np}]]

So now the linear model can be expressed in matrix form as:

y = Xb + e

where

b = [b_0, b_1, …, b_p]

and

e = [e_1, e_2, …, e_n]
Now, we determine the estimate of b, i.e. b', using the least squares method.

As already explained, the least squares method tends to determine b' for which the total residual error is minimized.

We present the result here:

b' = (X'X)^(-1) X'y

where ' represents the transpose of a matrix and ^(-1) represents the matrix inverse.

Knowing the least squares estimate b', the multiple linear regression model can now be estimated as:

y' = Xb'

where y' is the estimated response vector.

Note: the complete derivation of the least squares estimates in multiple linear regression can be found here.
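Before turning to scikit-learn, here is a minimal sketch of how the estimate b' = (X'X)^(-1) X'y could be computed directly with NumPy; the function name estimate_coef_multi is illustrative, and np.linalg.lstsq is used instead of forming the inverse explicitly, which is numerically safer but solves the same least squares problem:

import numpy as np

def estimate_coef_multi(X, y):
    # append a column of ones so that b_0 acts as the intercept
    n = X.shape[0]
    X_aug = np.column_stack((np.ones(n), X))

    # solve the least squares problem X_aug * b = y
    b, residuals, rank, sv = np.linalg.lstsq(X_aug, y, rcond=None)
    return b    # b[0] is b_0, b[1:] are b_1 ... b_p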

Below is the implementation of the multiple linear regression technique on the Boston house-price dataset using scikit-learn.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, metrics

# load the boston dataset
# (note: load_boston has been removed from recent scikit-learn releases;
#  this listing assumes an older version where it is still available)
boston = datasets.load_boston(return_X_y=False)

# defining the feature matrix (X) and the response vector (y)
X = boston.data
y = boston.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)

# create a linear regression object
reg = linear_model.LinearRegression()

# train the model using the training sets
reg.fit(X_train, y_train)

# regression coefficients
print('Coefficients:', reg.coef_)

# variance score: 1 means perfect prediction
print('Variance score: {}'.format(reg.score(X_test, y_test)))

# plot of residual errors

## setting plot style
plt.style.use('fivethirtyeight')

## plotting residual errors in training data
plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train,
            color="green", s=10, label='Train data')

## plotting residual errors in test data
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test,
            color="blue", s=10, label='Test data')

## plotting a line for zero residual error
plt.hlines(y=0, xmin=0, xmax=50, linewidth=2)

## plotting the legend
plt.legend(loc='upper right')

## plot title
plt.title("Residual errors")

## function to show plot
plt.show()

The output of the above program looks like this:

 Coefficients: [-8.80740828e-02 6.72507352e-02 5.10280463e-02 2.18879172e+00 -1.72283734e+01 3.62985243e+00 2.13933641e-03 -1.36531300e+00 2.88788067e-01 -1.22618657e-02 -8.36014969e-01 9.5305805061e-0316 e-01]
 Variance score: 0.720898784611

And the residual error plot looks like this:

In the above example, we determine the accuracy score using the explained variance score.
We define:
explained_variance_score = 1 - Var{y - y'} / Var{y}
where y' is the estimated target output, y the corresponding (correct) target output, and Var the variance, the square of the standard deviation.
The best possible score is 1.0; lower values are worse.
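As an aside, scikit-learn exposes this metric directly as explained_variance_score; a minimal sketch of computing it on the test predictions, reusing reg, X_test and y_test from the listing above:

from sklearn.metrics import explained_variance_score

# explained variance of the test-set predictions of the fitted model above
y_pred = reg.predict(X_test)
print("Explained variance score: {}".format(explained_variance_score(y_test, y_pred)))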

Assumptions

The following are the basic assumptions that the linear regression model makes for the dataset to which it is applied:

  • Linear relationship: the relationship between the response and the feature variables should be linear. The linearity assumption can be tested with scatter plots. As shown below, the 1st figure represents linearly related variables, whereas the variables in the 2nd and 3rd figures are most likely non-linear. So, the 1st figure would give better predictions using linear regression.
  • Little or no multicollinearity: it is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are not independent of each other (a simple correlation check is sketched after this list).
  • Little or no autocorrelation: another assumption is that there is little or no autocorrelation in the data. Autocorrelation occurs when the residual errors are not independent of each other. You can refer here for more insight into this topic.
  • Homoscedasticity: homoscedasticity describes a situation in which the error term (that is, the "noise" or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables. As shown below, Figure 1 has homoscedasticity while Figure 2 has heteroscedasticity.
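As one simple (and by no means exhaustive) way to eyeball the multicollinearity assumption mentioned above, the pairwise correlations between features can be inspected; a sketch assuming a feature matrix X such as the one used in the scikit-learn example:

import numpy as np

# pairwise correlations between the columns (features) of X;
# off-diagonal entries close to +1 or -1 hint at multicollinearity
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))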

As we reach the end of this article, we discuss some applications of linear regression below.

Applications:

1. Trend lines: a trend line represents the variation in some quantitative data with the passage of time (e.g. GDP, oil prices, etc.). These trends usually follow a linear relationship, so linear regression can be applied to predict future values. However, this method suffers from a lack of scientific validity in cases where other potential changes can affect the data.

2. Economics: linear regression is the predominant empirical tool in economics. For example, it is used to predict consumption spending, fixed investment spending, inventory investment, purchases of a country's exports, spending on imports, the demand to hold liquid assets, labor demand, and labor supply.

3. Finance: the capital asset pricing model uses linear regression to analyze and quantify the systematic risk of an investment.

4. Biology: linear regression is used to model causal relationships between parameters in biological systems.


This blog is contributed by Nikhil Kumar. If you like Python.Engineering and would like to contribute, you can also write an article using contrib.python.engineering or mail your article to [email protected] See your article appearing on the Python.Engineering home page and help other geeks.

Please post comments if you find anything wrong or if you would like to share more information on the topic discussed above.