Contrary to popular belief, logistic regression is a regression model: it predicts the probability that a given data record belongs to the category labelled "1". Just as linear regression assumes that the data follow a linear function, logistic regression models the data using the sigmoid function.
Logistic regression becomes a classification technique only when a decision threshold is introduced into the picture. Setting the threshold is a very important aspect of logistic regression and depends on the classification problem itself.
The choice of threshold value depends mainly on precision and recall. Ideally, we want both precision and recall to be 1, but this is rarely the case. When a precision–recall trade-off must be made, we use the following arguments to decide the threshold:
1. Low precision / high recall: when we want to reduce the number of false negatives without necessarily reducing false positives, we choose a threshold value that yields low precision or high recall. For example, when diagnosing cancer we do not want any sick patient to be classified as healthy, even if some healthy patients are mistakenly diagnosed with cancer. This is because the absence of cancer can be confirmed by further medical tests, whereas the disease cannot be detected in a candidate who has already been rejected.
2. High precision / low recall: when we want to reduce the number of false positives without necessarily reducing false negatives, we choose a threshold value that yields high precision or low recall. For example, if we classify customers by whether they will respond positively or negatively to a personalized ad, we want to be very sure that the customer will respond positively, because otherwise a negative reaction could cause a loss of potential sales from that customer.
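To see how the threshold drives this trade-off, here is a minimal sketch in Python. The predicted probabilities and labels below are hypothetical, chosen only for illustration:

```python
# Hypothetical predicted probabilities and true labels (illustrative only).
probs = [0.10, 0.30, 0.45, 0.55, 0.70, 0.90]
labels = [0, 0, 1, 0, 1, 1]

def precision_recall(probs, labels, threshold):
    """Compute precision and recall for a given decision threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A low threshold favours recall; a high threshold favours precision.
print(precision_recall(probs, labels, 0.3))  # (0.6, 1.0): catches every positive
print(precision_recall(probs, labels, 0.8))  # (1.0, ~0.33): only the surest positives
```

Lowering the threshold predicts "1" more often (fewer false negatives, more false positives); raising it does the opposite.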
Based on the number of categories, logistic regression can be classified as:
1. Binomial: the target variable can have only two possible types, e.g. "0" or "1".
2. Multinomial: the target variable can have three or more possible types that are not ordered.
3. Ordinal: the target variable deals with ordered categories.
First of all, we will investigate the simplest form of logistic regression, i.e. binomial logistic regression.
Binomial Logistic Regression
Let's look at an example dataset that compares the number of study hours to exam results. The result can take only two values: passed (1) or failed (0):
| Hours (x) | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 2.00 | 2.25 | 2.50 | 2.75 | 3.00 | 3.25 | 3.50 | 3.75 | 4.00 | 4.25 | 4.50 | 4.75 | 5.00 | 5.50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pass (y) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
So we have

$$y \in \{0, 1\}$$

i.e. $y$ is a categorical target variable that can take only two possible values: "0" or "1".
To generalize our model, we assume that the dataset has $p$ feature variables and $n$ observations, with feature matrix

$$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}$$

where $x_{ij}$ denotes the value of the $j$-th feature for the $i$-th observation. Note that $x_{i0} = 1$ for all $i$.
If you have gone through linear regression, you will remember that there the hypothesis used for prediction was

$$h(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

where $\beta_0, \beta_1, \ldots, \beta_p$ are the regression coefficients.
Let the regression coefficients be collected in the matrix/vector

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}$$

Then, in a more compact form,

$$h(x) = \beta^T x$$
The reason for taking $x_0 = 1$ should now be clear: to write the hypothesis as the matrix product $\beta^T x$, we need something for $\beta_0$ to multiply, but no variable multiplies $\beta_0$ in the original hypothesis formula. So we defined $x_0 = 1$.
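The compact form can be sketched in a couple of lines of Python; the coefficient and feature values below are hypothetical, chosen only to show how prepending $x_0 = 1$ makes $\beta_0$ part of the dot product:

```python
# Hypothetical coefficients and a single observation (illustrative only).
beta = [0.5, 2.0]   # [beta_0, beta_1]
x = [1.0, 3.0]      # x_0 = 1 prepended so that beta_0 enters the dot product

# The dot product beta^T x reproduces beta_0 + beta_1 * x_1.
h = sum(b * xi for b, xi in zip(beta, x))
print(h)  # 0.5 + 2.0 * 3.0 = 6.5
```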
Now, if we try to apply linear regression to the above problem, we will get continuous values using the hypothesis discussed above. Moreover, it does not make sense for $h(x)$ to take values greater than 1 or less than 0.
So, we make a small change to the hypothesis for classification:

$$h(x) = g(\beta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$

where $g(z)$ is called the logistic function or sigmoid function.
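A minimal Python sketch of the sigmoid function:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```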
Here is a graph showing $g(z)$:
From the graph above, we can conclude that:
- $g(z)$ tends toward 1 as $z \to \infty$
- $g(z)$ tends toward 0 as $z \to -\infty$
- $g(z)$ is always bounded between 0 and 1
So now we can define the conditional probabilities of the two labels (0 and 1) for the $i$-th observation as:

$$P(y_i = 1 \mid x_i; \beta) = h(x_i), \qquad P(y_i = 0 \mid x_i; \beta) = 1 - h(x_i)$$

We can write this more compactly as:

$$P(y_i \mid x_i; \beta) = h(x_i)^{y_i} \, (1 - h(x_i))^{1 - y_i}$$
We now define another term, the likelihood of the parameters, as:

$$L(\beta) = \prod_{i=1}^{n} P(y_i \mid x_i; \beta) = \prod_{i=1}^{n} h(x_i)^{y_i} \, (1 - h(x_i))^{1 - y_i}$$

Likelihood is nothing but the probability of the data (the training examples) given the model and specific parameter values (here, $\beta$). It measures the support provided by the data for each possible value of $\beta$. We obtain it by multiplying $P(y_i \mid x_i; \beta)$ over all observations for a given $\beta$.
And for simpler calculations we take the log-likelihood:

$$\ell(\beta) = \log L(\beta) = \sum_{i=1}^{n} \left[ y_i \log h(x_i) + (1 - y_i) \log\left(1 - h(x_i)\right) \right]$$
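The log-likelihood can be sketched in a few lines of Python; the predicted probabilities below are hypothetical, chosen only for illustration:

```python
import math

def log_likelihood(h_vals, y_vals):
    """Sum of y*log(h) + (1-y)*log(1-h) over all observations."""
    return sum(y * math.log(h) + (1 - y) * math.log(1 - h)
               for h, y in zip(h_vals, y_vals))

# Hypothetical predicted probabilities h(x_i) for three observations.
h_vals = [0.9, 0.2, 0.8]
y_vals = [1, 0, 1]
print(log_likelihood(h_vals, y_vals))  # near 0 when predictions match labels
```

The log-likelihood is always non-positive and is larger (closer to 0) when the predicted probabilities agree with the labels.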
The cost function for logistic regression is proportional to the negative of the log-likelihood of the parameters. Therefore, using the log-likelihood equation, we can obtain an expression for the cost function $J$ as:

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log h(x_i) + (1 - y_i) \log\left(1 - h(x_i)\right) \right]$$
Our goal is to estimate $\beta$ so that the cost function is minimized!
Using the gradient descent algorithm
First, we take the partial derivative of $J(\beta)$ with respect to each $\beta_j$ to get the gradient descent rule (here we present only the final derived value):

$$\frac{\partial J(\beta)}{\partial \beta_j} = \frac{1}{n} \left( h(x) - y \right)^T x_j$$
Here, $y$ and $h(x)$ represent the response vector and the predicted response vector, respectively. Also, $x_j$ is the vector of observed values of the $j$-th feature across all $n$ observations.
Now, to reach the minimum of $J(\beta)$, we repeatedly apply the update rule

$$\beta_j := \beta_j - \alpha \, \frac{\partial J(\beta)}{\partial \beta_j}$$

where $\alpha$ is called the learning rate and must be set explicitly.
Let's see an implementation of the above technique in Python using an example dataset:
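A minimal sketch of such an implementation, using the hours-vs-pass table above and plain batch gradient descent (the learning rate and epoch count below are arbitrary illustrative choices, not tuned values):

```python
import math

# The hours-vs-pass dataset from the table above.
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
         3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.25, epochs=10000):
    """Batch gradient descent on J(beta); beta = [beta_0, beta_1]."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # h(x_i) = g(beta_0 + beta_1 * x_i)
        hs = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient components: (1/n) * sum((h - y) * x_j), with x_0 = 1.
        g0 = sum(h - y for h, y in zip(hs, ys)) / n
        g1 = sum((h - y) * x for h, y, x in zip(hs, ys, xs)) / n
        # Update rule: beta_j := beta_j - alpha * dJ/dbeta_j
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

b0, b1 = train(hours, passed)

def predict(x):
    """Predicted probability of passing after x hours of study."""
    return sigmoid(b0 + b1 * x)

print(predict(1.0))  # low probability of passing
print(predict(5.0))  # high probability of passing
```

A positive fitted $\beta_1$ means the probability of passing rises with study hours, as the table suggests; adding a decision threshold (e.g. 0.5) turns these probabilities into pass/fail predictions.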