Linear regression is a statistical approach for modeling the relationship between a dependent variable and a given set of explanatory variables.
Note: for simplicity, this article refers to the dependent variable as the response and to the independent variables as features.
To build a basic understanding of linear regression, we start with its most basic variation, simple linear regression.
Simple Linear Regression
Simple linear regression is an approach for predicting a response using a single feature.
The two variables are assumed to be linearly related. We therefore try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable (x).
Let’s consider a dataset in which we have a response value y for each feature value x:
For generality, we define:
x as the feature vector, i.e. x = [x_1, x_2, …, x_n],
y as the response vector, i.e. y = [y_1, y_2, …, y_n]
for n observations (in the above example, n = 10).
A scatter plot of the above dataset looks like this:
Now the task is to find the line that best fits the scatter plot above, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset).
This line is called the regression line.
The equation of the regression line is represented as:
$$ h(x_i) = b_0 + b_1 x_i $$
Here,
 h(x_i) represents the predicted response value for the ith observation.
 b_0 and b_1 are the regression coefficients and represent the y-intercept and the slope of the regression line, respectively.
To create our model, we must "learn" or estimate the values of the regression coefficients b_0 and b_1. And after we have estimated these coefficients, we can use the model to predict responses!
For this article, we are going to use the least squares technique.
Now consider:
$$ y_i = b_0 + b_1 x_i + e_i = h(x_i) + e_i \quad\Rightarrow\quad e_i = y_i - h(x_i) $$
Here, e_i is the residual error in the ith observation.
So, our goal is to minimize the total residual error.
We define the squared error, or cost function, J as:
$$ J(b_0, b_1) = \frac{1}{2n} \sum_{i=1}^{n} e_i^2 $$
and our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimal!
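To make the cost function concrete, here is a minimal sketch that evaluates J for two candidate lines. The function name `cost` and the data values are illustrative assumptions, and the 1/(2n) scaling is only one common convention: any positive constant factor leads to the same minimizing b_0 and b_1.

```python
def cost(b_0, b_1, x, y):
    """Halved mean of squared residuals: J = (1 / 2n) * sum(e_i ** 2)."""
    n = len(x)
    # Residual e_i = y_i - h(x_i), with h(x_i) = b_0 + b_1 * x_i.
    residuals = [y_i - (b_0 + b_1 * x_i) for x_i, y_i in zip(x, y)]
    return sum(e ** 2 for e in residuals) / (2 * n)


# Illustrative data (an assumption): the points lie exactly on y = 1 + 2x.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]

perfect = cost(1, 2, x, y)  # the true line leaves no residual error
worse = cost(0, 2, x, y)    # shifting the intercept by 1 raises the cost
print(perfect, worse)       # prints: 0.0 0.5
```

The further a candidate line is from the data, the larger J becomes, which is why minimizing J selects the best-fitting line.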
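Without going into the mathematical details, we present the result here:
$$ b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x} $$
where SS_xy is the sum of cross-deviations of y and x:
$$ SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y} $$
and SS_xx is the sum of squared deviations of x:
$$ SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 $$
Here \bar{x} and \bar{y} denote the means of x and y.
Note: the complete derivation of the least squares estimates in simple linear regression can be found here.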
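As a worked check of the closed-form estimates, the following sketch computes the cross-deviation and squared-deviation sums and derives the slope and intercept from them. The helper name `estimate_coefficients` and the ten data points are illustrative assumptions, so the coefficients it prints are not the ones quoted elsewhere in this article.

```python
def estimate_coefficients(x, y):
    """Return (b_0, b_1) of the least squares line y = b_0 + b_1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations of y and x about their means.
    ss_xy = sum((x_i - mean_x) * (y_i - mean_y) for x_i, y_i in zip(x, y))
    # Sum of squared deviations of x about its mean.
    ss_xx = sum((x_i - mean_x) ** 2 for x_i in x)
    b_1 = ss_xy / ss_xx
    b_0 = mean_y - b_1 * mean_x
    return b_0, b_1


# Illustrative data (an assumption), n = 10 observations.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 3, 2, 5, 7, 8, 8, 9, 10, 12]

b_0, b_1 = estimate_coefficients(x, y)
print("b_0 = {:.4f}, b_1 = {:.4f}".format(b_0, b_1))
# prints: b_0 = 1.2364, b_1 = 1.1697

# Predict the response for a feature value that is not in the dataset.
print(b_0 + b_1 * 10)
```

For this data SS_xy = 96.5 and SS_xx = 82.5, so b_1 = 96.5 / 82.5 and b_0 = 6.5 - 4.5 * b_1.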
Below is a Python implementation of the above technique on our small dataset:

Output of the above piece of code:
Estimated coefficients: b_0 = 0.0586206896552 b_1 = 1.45747126437
And the resulting graph looks like this:
Multiple Linear Regression
Multiple linear regression attempts to model the relationship between two or more features and a response by fitting a linear equation to the observed data.
Clearly, it is nothing more than an extension of simple linear regression.
Consider a dataset with p features (or independent variables) and one response (or dependent variable).
In addition, the dataset contains n rows (observations).
We define:
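X (feature matrix) = a matrix of size n × p, where x_{ij} denotes the value of the jth feature for the ith observation.
So,
$$ X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} $$
and
y (response vector) = a vector of size n, where y_i denotes the response value for the ith observation.
$$ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} $$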
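The regression line for p features is represented as:
$$ h(x_i) = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip} $$
where h(x_i) is the predicted response value for the ith observation, and b_0, b_1, …, b_p are the regression coefficients.
We can also write:
$$ y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip} + e_i = h(x_i) + e_i $$
where e_i represents the residual error in the ith observation.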
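We can generalize our linear model a little further by representing the feature matrix X with a leading column of ones:
$$ X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} $$
So now the linear model can be expressed in terms of matrices as:
$$ y = Xb + e $$
where
$$ b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{pmatrix} $$
and
$$ e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} $$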
Now we determine an estimate of b, i.e. b', using the least squares method.
As already explained, the least squares method determines the b' for which the total residual error is minimized.
We present the result right here:
$$ b' = (X^T X)^{-1} X^T y $$
where the superscript T denotes the transpose of a matrix and the superscript -1 denotes the matrix inverse.
Knowing the least squares estimate b', the multiple linear regression model can now be estimated as:
$$ y' = X b' $$
where y' is the estimated response vector.
Note: the complete derivation of the least squares estimates in multiple linear regression can be found here.
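The matrix form of the least squares estimate can be sketched in a few lines of NumPy. The data below is an illustrative assumption (five noise-free observations of two features generated from known coefficients), and the names `features`, `true_b` and `b_hat` are hypothetical. The sketch solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same estimate and is usually the numerically safer choice.

```python
import numpy as np

# Illustrative data (an assumption): n = 5 observations, p = 2 features.
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0],
                     [5.0, 6.0]])
true_b = np.array([1.0, 2.0, -0.5])        # b_0, b_1, b_2
y = true_b[0] + features @ true_b[1:]      # noise-free response vector

# Prepend a column of ones so that b_0 is estimated together with b_1..b_p.
X = np.column_stack([np.ones(len(features)), features])

# Least squares estimate: solve (X^T X) b = X^T y.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimated response vector y' = X b'.
y_hat = X @ b_hat
print(b_hat)
```

Because the response here contains no noise, the estimate recovers the generating coefficients up to floating-point error; with real data the residual vector y - y' would be nonzero.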
Below is an implementation of multiple linear regression on the Boston house pricing dataset using scikit-learn.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

# load the Boston dataset
# (load_boston was removed in scikit-learn 1.2, so this call needs an older release)
boston = datasets.load_boston(return_X_y=False)

# define the feature matrix (X) and the response vector (y)
X = boston.data
y = boston.target

# split X and y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# create a linear regression object
reg = linear_model.LinearRegression()

# train the model using the training set
reg.fit(X_train, y_train)

# regression coefficients
print('Coefficients:', reg.coef_)

# variance score: 1 means perfect prediction
print('Variance score: {}'.format(reg.score(X_test, y_test)))

# plot the residual errors

## set the plot style
plt.style.use('fivethirtyeight')

## plot residual errors (prediction minus true value) for the training data
plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train, color="green", s=10, label='Train data')

## plot residual errors (prediction minus true value) for the test data
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test, color="blue", s=10, label='Test data')

## plot the line of zero residual error
plt.hlines(y=0, xmin=0, xmax=50, linewidth=2)

## add the legend
plt.legend(loc='upper right')

## plot title
plt.title("Residual errors")

## show the plot
plt.show()
The output of the above program looks like this:
Coefficients: [-8.80740828e-02 6.72507352e-02 5.10280463e-02 2.18879172e+00 -1.72283734e+01 3.62985243e+00 2.13933641e-03 -1.36531300e+00 2.88788067e-01 -1.22618657e-02 -8.36014969e-01 9.53058061e-03 -5.05036163e-01]
Variance score: 0.720898784611
And the residual error plot looks like this:
In the above example, we determine the accuracy score using the explained variance score.
We define:
explained_variance_score = 1 - Var{y - y'} / Var{y}
where y' is the estimated target output, y is the corresponding (correct) target output, and Var is the variance, the square of the standard deviation.
The best possible score is 1.0; lower values are worse.
Strictly speaking, reg.score returns the coefficient of determination R^2, which equals the explained variance score whenever the residuals have zero mean.
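The score formula can be checked by hand with a short sketch. The function names `variance` and `explained_variance` and the sample values are illustrative assumptions; scikit-learn provides its own `sklearn.metrics.explained_variance_score` for the same quantity.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """1 - Var{y - y'} / Var{y}."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - variance(residuals) / variance(y_true)


# Illustrative target values (an assumption).
y_true = [3.0, 5.0, 7.0, 9.0]

# Predicting every value exactly gives the best possible score.
perfect = explained_variance(y_true, y_true)

# Always predicting the mean explains none of the variance.
constant = explained_variance(y_true, [6.0, 6.0, 6.0, 6.0])

print(perfect, constant)  # prints: 1.0 0.0
```

A model that does no better than always predicting the mean therefore scores 0, and scores can become negative for predictions that are worse than that.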
Assumptions
The following are the basic assumptions that a linear regression model makes about the dataset to which it is applied:
 Linear relationship: the relationship between the response and the feature variables should be linear. The linearity assumption can be tested using scatter plots. As shown below, the 1st figure represents linearly related variables, whereas the variables in the 2nd and 3rd figures are most likely nonlinear. So the 1st figure will give better predictions using linear regression.
 Little or no multicollinearity: it is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are not independent of each other.
 Little or no autocorrelation: another assumption is that there is little or no autocorrelation in the data. Autocorrelation occurs when the residual errors are not independent of each other. You can refer here for a deeper understanding of this topic.
 Homoscedasticity: homoscedasticity describes a situation in which the error term (that is, the "noise" or random disturbance in the relationship between the independent variables and the dependent variable) has the same variance across all values of the independent variables. As shown below, figure 1 has homoscedasticity, while figure 2 has heteroscedasticity.
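Two of the assumptions above can be probed numerically. The sketch below is one possible approach, not a method prescribed by this article: it uses the Pearson correlation between two features as a simple multicollinearity signal, and the Durbin-Watson statistic as a common autocorrelation diagnostic (values near 2 suggest little autocorrelation, values well below 2 positive autocorrelation, values well above 2 negative autocorrelation). All function names and data values are illustrative assumptions.

```python
def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def durbin_watson(residuals):
    """Squared successive differences divided by squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Multicollinearity: feature_b is almost exactly 2 * feature_a.
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [2.1, 4.0, 6.2, 7.9, 10.1]
corr = pearson_correlation(feature_a, feature_b)
print(corr)  # close to 1, so the two features carry nearly the same information

# Autocorrelation: residuals ordered by observation.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0]  # sign flips every step
trending = [1.0, 1.0, 1.0, -1.0, -1.0]     # long runs of the same sign
dw_alternating = durbin_watson(alternating)
dw_trending = durbin_watson(trending)
print(dw_alternating, dw_trending)  # prints: 3.2 0.8
```

A pairwise correlation only detects dependence between two features at a time; dependence among three or more features needs a stronger check.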
As we reach the end of this article, we discuss some applications of linear regression below.
Applications:
1. Trend lines. A trend line represents the variation in some quantitative data over time (e.g. GDP, oil prices, etc.). These trends usually follow a linear relationship. Hence, linear regression can be applied to predict future values. However, this method lacks scientific validity in cases where other potential changes can affect the data.
2. Economics. Linear regression is the predominant empirical tool in economics. For example, it is used to predict consumption spending, fixed investment spending, inventory investment, purchases of a country’s exports, spending on imports, the demand to hold liquid assets, labor demand, and labor supply.
3. Finance. The capital asset pricing model uses linear regression to analyze and quantify the systematic risks of an investment.
4. Biology. Linear regression is used to model causal relationships between parameters in biological systems.
This article was contributed by Nikhil Kumar.