  # Linear Regression (Python Implementation)


Linear regression is a statistical approach for modeling the relationship between a dependent variable and a given set of explanatory variables.

Note: for simplicity, this article refers to dependent variables as responses and to independent variables as features.

To provide a basic understanding of linear regression, we'll start with its most basic variation, simple linear regression.

## Simple Linear Regression

Simple linear regression is an approach for predicting a response using a single feature.

The two variables are assumed to be linearly related. Therefore, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable (x).

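Let's consider a dataset in which we have a response value y for each feature value x. The example used throughout this section is the one in the code further down: x = 0, 1, 2, ..., 9 and y = 1, 3, 2, 5, 7, 8, 8, 9, 10, 12.

For generality, we define:

x as the feature vector, i.e. $x = [x_1, x_2, \dots, x_n]$,

y as the response vector, i.e. $y = [y_1, y_2, \dots, y_n]$,

for n observations (in the above example, n = 10).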

A scatter plot of this dataset looks like this:

(Figure: scatter plot of the dataset.)

Now the task is to find the line that best fits the scatter plot above, so that we can predict the response for any new feature value (i.e. a value of x that is not present in the dataset).

This line is called the regression line.

The equation of the regression line is:

$$h(x_i) = b_0 + b_1 x_i$$

Here,

• $h(x_i)$ represents the predicted response value for the i-th observation.
• $b_0$ and $b_1$ are the regression coefficients; they represent the y-intercept and the slope of the regression line, respectively.

To create our model, we must "learn" or estimate the values of the regression coefficients b_0 and b_1. Once we have estimated these coefficients, we can use the model to predict responses!

For this article, we are going to use the least squares technique.

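Now consider:

$$y_i = b_0 + b_1 x_i + e_i = h(x_i) + e_i \quad\Rightarrow\quad e_i = y_i - h(x_i)$$

Here $e_i$ is the residual error in the i-th observation. So, our goal is to minimize the total residual error.

We define the squared error, or cost function, J as:

$$J(b_0, b_1) = \frac{1}{2n} \sum_{i=1}^{n} e_i^2$$

and our task is to find the values of $b_0$ and $b_1$ for which $J(b_0, b_1)$ is minimal!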
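To make the cost function concrete, here is a minimal sketch that evaluates $J(b_0, b_1) = \frac{1}{2n}\sum e_i^2$ on the ten observations used in this article. The helper name `cost` and the two candidate lines are illustrative choices, not fitted values.

```python
import numpy as np


def cost(b_0, b_1, x, y):
    """Squared-error cost J(b_0, b_1) = (1 / 2n) * sum of squared residuals."""
    residuals = y - (b_0 + b_1 * x)
    return np.sum(residuals ** 2) / (2 * x.size)


x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# Two hand-picked candidate lines (guesses for illustration only)
j_flat = cost(6.5, 0.0, x, y)    # horizontal line through the mean of y
j_sloped = cost(1.0, 1.2, x, y)  # a line that roughly follows the points

print("J(flat line)   =", j_flat)
print("J(sloped line) =", j_sloped)
```

The sloped line yields a much smaller cost than the flat one; least squares looks for the pair (b_0, b_1) that makes this quantity as small as possible.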

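Without going into the mathematical details, we present the result here:

$$b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the means of x and y, $SS_{xy}$ is the sum of cross-deviations of y and x:

$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} y_i x_i - n \bar{x} \bar{y}$$

and $SS_{xx}$ is the sum of squared deviations of x:

$$SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2$$

Note: the complete derivation of the least squares estimates in simple linear regression can be found here.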
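For readers who want a hint of where these estimates come from, here is a brief sketch of the standard calculus argument (an outline, not the full derivation). Writing the cost as $J(b_0, b_1) = \frac{1}{2n}\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$ and setting both partial derivatives to zero gives:

$$\frac{\partial J}{\partial b_0} = -\frac{1}{n}\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0 \quad\Rightarrow\quad b_0 = \bar{y} - b_1 \bar{x}$$

$$\frac{\partial J}{\partial b_1} = -\frac{1}{n}\sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0$$

Substituting $b_0 = \bar{y} - b_1 \bar{x}$ into the second equation:

$$\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y} = b_1 \left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) \quad\Rightarrow\quad b_1 = \frac{SS_{xy}}{SS_{xx}}$$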

Below is a Python implementation of the above technique on our small dataset:

```python
import numpy as np
import matplotlib.pyplot as plt


def estimate_coef(x, y):
    # number of observations / points
    n = np.size(x)

    # mean of the x and y vectors
    m_x, m_y = np.mean(x), np.mean(y)

    # calculating the cross-deviation and the deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x

    return (b_0, b_1)


def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1] * x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show the plot
    plt.show()


def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients: b_0 = {}  b_1 = {}".format(b[0], b[1]))

    # plotting the regression line
    plot_regression_line(x, y, b)


if __name__ == "__main__":
    main()
```

Output of the above piece of code (values shown rounded to four decimal places):

`Estimated coefficients: b_0 = 1.2364  b_1 = 1.1697`
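As a sanity check on the closed-form estimates, the same line can be fitted with NumPy's built-in `np.polyfit`. This is only a cross-check on the ten observations above, not part of the original implementation.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# A degree-1 polynomial fit is a least squares line. The coefficients are
# returned highest power first, i.e. [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

print("b_0 = {:.4f}  b_1 = {:.4f}".format(intercept, slope))
```

For this data, $SS_{xy} = 96.5$ and $SS_{xx} = 82.5$, so the slope is 96.5 / 82.5 ≈ 1.1697 and the intercept is 6.5 - 4.5 × 1.1697 ≈ 1.2364, which is what the fit returns.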

And the resulting graph looks like this:

(Figure: the data points with the fitted regression line.)

## Multiple Linear Regression

Multiple linear regression attempts to model the relationship between two or more features and a response by fitting a linear equation to the observed data.

Clearly, it is nothing more than an extension of simple linear regression.

Consider a dataset with p features (or independent variables) and one response (or dependent variable).
In addition, the dataset contains n rows/observations.

We define:

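X (feature matrix) = a matrix of size n × p, where $x_{ij}$ denotes the value of the j-th feature for the i-th observation.

So,

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$

and

y (response vector) = a vector of size n, where $y_i$ denotes the response value for the i-th observation.

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

The regression line for p features is represented as:

$$h(x_i) = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip}$$

where $h(x_i)$ is the predicted response value for the i-th observation and $b_0, b_1, \dots, b_p$ are the regression coefficients.

We can also write:

$$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + e_i = h(x_i) + e_i$$

where $e_i$ represents the residual error in the i-th observation.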

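We can generalize our linear model a little by representing the feature matrix X with a leading column of ones:

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}$$

So now the linear model can be expressed in matrix form as:

$$y = Xb + e$$

where

$$b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{pmatrix} \quad\text{and}\quad e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$

Now we determine an estimate of b, written $\hat{b}$, using the least squares method.

As already explained, the least squares method determines the $\hat{b}$ for which the total residual error is minimized.

We present the result directly here:

$$\hat{b} = (X'X)^{-1} X'y$$

where ' represents the transpose of the matrix and -1 represents the matrix inverse.

Knowing the least squares estimate $\hat{b}$, the multiple linear regression model can now be estimated as:

$$\hat{y} = X\hat{b}$$

where $\hat{y}$ is the estimated response vector.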

Note: the complete derivation of the least squares estimates in multiple linear regression can be found here.
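The matrix result above translates almost directly into NumPy. The following is a minimal sketch on synthetic data: the data, the "true" coefficients, and all variable names are assumptions made for illustration. It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which is the usual numerically safer choice, and cross-checks the answer against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): n = 50 observations, p = 2 features
n, p = 50, 2
features = rng.normal(size=(n, p))
true_b = np.array([3.0, 1.5, -2.0])  # hypothetical b_0, b_1, b_2
y = true_b[0] + features @ true_b[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that b_0 is estimated with the other coefficients
X = np.column_stack([np.ones(n), features])

# Least squares estimate: solve (X'X) b_hat = X'y rather than inverting X'X
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Estimated response vector y_hat = X b_hat
y_hat = X @ b_hat

# Cross-check against NumPy's own least squares solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("b_hat from normal equations:", b_hat)
print("b_hat from np.linalg.lstsq: ", b_lstsq)
```

Both routes should agree to within floating-point error, and with this little noise the estimates land close to the coefficients used to generate the data.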

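Below is an implementation of multiple linear regression on the Boston house pricing dataset using Scikit-learn. Note that `load_boston` was removed in scikit-learn 1.2, so the code as written requires an older release.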

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

# load the boston dataset
# (load_boston was removed in scikit-learn 1.2; this call needs an older version)
boston = datasets.load_boston(return_X_y=False)

# defining the feature matrix (X) and the response vector (y)
X = boston.data
y = boston.target

# splitting X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# create linear regression object
reg = linear_model.LinearRegression()

# train the model using the training sets
reg.fit(X_train, y_train)

# regression coefficients
print('Coefficients:', reg.coef_)

# variance score: 1 means perfect prediction
print('Variance score: {}'.format(reg.score(X_test, y_test)))

# plot for residual error

## setting plot style
plt.style.use('fivethirtyeight')

## plotting residual errors in training data
plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train,
            color="green", s=10, label='Train data')

## plotting residual errors in test data
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test,
            color="blue", s=10, label='Test data')

## plotting line for zero residual error
plt.hlines(y=0, xmin=0, xmax=50, linewidth=2)

## plotting legend
plt.legend(loc='upper right')

## plot title
plt.title("Residual errors")

## method call for showing the plot
plt.show()
```

The output of the above program looks like this:

```
Coefficients: [-8.80740828e-02  6.72507352e-02  5.10280463e-02  2.18879172e+00
 -1.72283734e+01  3.62985243e+00  2.13933641e-03 -1.36531300e+00
  2.88788067e-01 -1.22618657e-02 -8.36014969e-01 9.5305805061e-0316 e-01]
Variance score: 0.720898784611
```

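The residual error plot looks like this:

(Figure: residual errors for the training and test data, plotted against the predicted values.)

In the above example, we determine the accuracy score using the explained variance score.
We define:

$$\text{explained\_variance\_score} = 1 - \frac{\mathrm{Var}\{y - \hat{y}\}}{\mathrm{Var}\{y\}}$$

where $\hat{y}$ is the estimated target output, y is the corresponding (correct) target output, and Var is the variance, i.e. the square of the standard deviation.
The best possible score is 1.0; lower values are worse.

Strictly speaking, `reg.score` returns the coefficient of determination $R^2$, which coincides with the explained variance score when the residuals have zero mean.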
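The score defined above is easy to compute by hand. The sketch below uses a handful of made-up targets and predictions (purely illustrative values) and also computes the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$ for comparison. The two agree when the residuals average to zero and differ when the predictions carry a constant bias.

```python
import numpy as np


def explained_variance(y_true, y_pred):
    """Explained variance score: 1 - Var{y - y_hat} / Var{y}."""
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])  # residuals average to zero
y_biased = y_pred + 1.0                  # same predictions shifted by a constant

ev_unbiased = explained_variance(y_true, y_pred)
r2_unbiased = r_squared(y_true, y_pred)
ev_biased = explained_variance(y_true, y_biased)
r2_biased = r_squared(y_true, y_biased)

print("unbiased:", ev_unbiased, r2_unbiased)
print("biased:  ", ev_biased, r2_biased)
```

Because the variance ignores a constant shift, the explained variance score is unchanged by the bias, while $R^2$ drops.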

## Assumptions

The following are the basic assumptions that the linear regression model makes for the dataset to which it is applied:

• Linear relationship: the relationship between the response and the feature variables should be linear. The linearity assumption can be tested using scatter plots. In the accompanying figures, the first shows linearly related variables, whereas the variables in the second and third are most likely non-linear. So the first figure will give better predictions using linear regression.
• Little or no multicollinearity: it is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are not independent of each other.
• Little or no autocorrelation: another assumption is that there is little or no autocorrelation in the data. Autocorrelation occurs when the residual errors are not independent of each other. You can refer here for more insight into this topic.
• Homoscedasticity: homoscedasticity describes a situation in which the error term (that is, the "noise" or random disturbance in the relationship between the independent variables and the dependent variable) has the same variance across all values of the independent variables. In the accompanying figures, Figure 1 shows homoscedasticity, while Figure 2 shows heteroscedasticity.

As we reach the end of this article, we discuss some applications of linear regression below.
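Before turning to applications, two of the assumptions above can be checked numerically. The sketch below is a rough illustration on synthetic data (all arrays and names are assumptions): pairwise feature correlations as a simple multicollinearity check, and the Durbin-Watson statistic, $d = \sum_{t=2}^{n} (e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$, as a check for first-order autocorrelation in the residuals. Values of d near 2 suggest little autocorrelation, while values near 0 suggest strong positive autocorrelation.

```python
import numpy as np


def durbin_watson(residuals):
    """Durbin-Watson statistic for first-order autocorrelation in residuals."""
    diffs = np.diff(residuals)
    return np.sum(diffs ** 2) / np.sum(residuals ** 2)


rng = np.random.default_rng(0)

# Multicollinearity: pairwise correlations between feature columns
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)  # deliberately almost a copy of x1
x3 = rng.normal(size=200)                   # unrelated feature
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(np.round(corr, 2))

# Autocorrelation: independent residuals vs. a random walk (strongly correlated)
independent = rng.normal(size=200)
random_walk = np.cumsum(rng.normal(size=200))
dw_independent = durbin_watson(independent)
dw_random_walk = durbin_watson(random_walk)
print("Durbin-Watson, independent residuals:", dw_independent)
print("Durbin-Watson, random-walk residuals:", dw_random_walk)
```

A correlation close to 1 between two features, as between `x1` and `x2` here, is a warning sign for multicollinearity. In practice the residuals would come from a fitted model instead of being simulated.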

## Applications

1. Trend lines. A trend line represents the change in some quantitative data over time (e.g. GDP, oil prices, etc.). These trends usually follow a linear relationship. Hence, linear regression can be applied to predict future values. However, this method lacks scientific validity in cases where other potential changes can affect the data.

2. Economics. Linear regression is the predominant empirical tool in economics. For example, it is used to predict consumption spending, fixed investment spending, inventory investment, purchases of a country's exports, spending on imports, the demand to hold liquid assets, labor demand, and labor supply.

3. Finance. The capital asset pricing model uses linear regression to analyze and quantify the systematic risks of an investment.

4. Biology. Linear regression is used to model causal relationships between parameters in biological systems.