Logistic Regression
Sigmoid
The logistic regression lesson comes immediately after Linear Regression. Instead of
predicting a scalar number, logistic regression is used for binary classification and predicts
the probability of an observation belonging to a particular class. It models the relationship
between the input features and a binary outcome (0 or 1).
The output of logistic regression is transformed using the logistic function
(sigmoid), which maps any real-valued number to a value between 0 and 1. This
transformed value can be interpreted as the probability of the observation
belonging to the positive class.
The logistic function is also known as the sigmoid function. It compresses the output of
the linear model to a value between 0 and 1. It is defined as:
\( z = \theta^T \cdot X \)
\( S(z) = \frac{1}{1 + e^{-z}} \)
where z is the output of the linear part, θ is the vector of learnable parameters, and X is the feature vector of your inputs.
In practice this means that the smaller (more negative) the output of the linear part, the closer the predicted probability is to 0 (class 0), and the larger the output, the closer it is to 1 (class 1).
The picture below demonstrates this effect.


The sigmoid output gives a natural decision rule for binary classification (for example, predict class 1 when the probability exceeds 0.5). It also produces continuous values strictly within the 0 to 1 range, which can be interpreted as predicted probabilities.
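As a quick illustration, here is a minimal NumPy sketch of the sigmoid applied to the output of a linear model; the parameter and feature values are made up purely for demonstration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and a single feature vector (made-up numbers)
theta = np.array([0.8, -0.4])   # learnable parameters
x = np.array([2.0, 1.5])        # input features
z = theta @ x                   # linear part: 0.8*2.0 - 0.4*1.5 = 1.0
print(sigmoid(z))               # ~0.731, read as P(class = 1)
```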
This function will also be used later to understand how some reinforcement learning algorithms work.
We will learn parameters θ that allow our agent to achieve the highest possible reward in an environment.
We will manually calculate the derivatives of this model/policy, which makes it transparent how one of the basic RL algorithms,
and my favorite, REINFORCE, works. Later this will also help us smoothly swap this model inside the algorithm for more complicated models, for example
neural networks, which we will learn about a few lessons ahead.
Later we will need not only the sigmoid function but also its derivative. The sigmoid comes in handy,
for example, as the last layer of a neural network: in the forward pass we apply the function itself, and
in the backward pass its derivative is used for the gradient calculation.
So I decided to pin its graph here already.


As a small home task, you can calculate its derivative manually using the chain rule and other calculus rules to obtain the same formula.
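For reference, applying the chain rule to \( S(z) = (1 + e^{-z})^{-1} \) should lead you to:
\( S'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = S(z)\bigl(1 - S(z)\bigr) \)
This compact form is exactly what makes the backward pass cheap: the derivative reuses the value S(z) already computed in the forward pass.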
An example of using logistic regression is presented below: predicting whether a person has
diabetes based on the sugar level in their blood. The more sugar in the blood, the higher the
predicted probability, and the model tries to find the most accurate separation
between the two classes.
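Here is a hedged scikit-learn sketch of this setup; the blood-sugar values and labels are made up purely for illustration, not real medical data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up blood sugar levels (mg/dL) and diabetes labels (1 = has diabetes)
sugar = np.array([[85], [90], [100], [110], [125], [140], [155], [170], [185], [200]])
label = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(sugar, label)

# Predicted probability of the positive class for two new measurements
print(model.predict_proba(np.array([[120], [160]]))[:, 1])
```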


What is a Cost Function?
A cost function is a mathematical function that measures the difference
between the actual target values (ground truth) and the values predicted by the
model. A function that assesses a machine learning model’s performance is also
referred to as a loss function or objective function. Usually, the objective of a
machine learning algorithm is to minimize the output of the cost function.
As you remember, in linear regression the conventional cost function is the Mean
Squared Error. The formula for one sample is below:
\( \text{Cost} (\hat{y}_i, y_i) = \frac{1}{2} (\hat{y}_i - y_i)^2 \)
The cost function J for m training samples can be written as:
\( \text{J(θ)} = \frac{1}{m} \sum_{i=1}^{m} \text{Cost} (\hat{y}_i, y_i) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2 \)
but it is not suitable for logistic regression because of the nonlinearity introduced by the sigmoid function.
In logistic regression, if we substitute the sigmoid function into the above MSE
equation, we get:
\( \text{J(θ)} = \frac{1}{2m} \sum_{i=1}^{m} \left( \frac{1}{1 + e^{-\theta^T X_i}} - y_i \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( \frac{1}{1 + e^{-(\theta_1 X_i + \theta_2)}} - y_i \right)^2 \)
This is a nonlinear transformation, and plugging it into the
Mean Squared Error formula results in a non-convex cost function. A non-convex
function can have multiple local minima, which makes it difficult to optimize using
traditional gradient descent algorithms, as shown below.
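A tiny NumPy sketch of this effect (the data and parameter grid are arbitrary): evaluating the MSE-with-sigmoid cost over a range of θ values produces a curve that is not a convex bowl; it flattens into near-horizontal plateaus for large |θ| where the gradient almost vanishes, and with other datasets genuine local minima can appear.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny made-up 1-D dataset with deliberately noisy labels
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([1, 0, 0, 0, 1, 1])

# MSE cost as a function of a single parameter theta (no bias term)
thetas = np.linspace(-10, 10, 401)
costs = [0.5 * np.mean((sigmoid(t * x) - y) ** 2) for t in thetas]

# The curve flattens into near-horizontal plateaus for large |theta|,
# where the gradient nearly vanishes and gradient descent stalls.
print(thetas[int(np.argmin(costs))])
```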


We should find another cost function instead, one with the same behavior
whose minimum is easier to find.
Let's sketch the desired cost function for our model.
The actual value y equals 0 or 1, and our model estimates it with \( \hat{y} \);
we want a simple cost function that compares the two.
For a moment, assume that the desired value of y is 1. This means our model performs
best if it estimates \( \hat{y} = 1 \). In this case, we need a cost function that
returns 0 when the model output is 1, which is the same as the actual
label; the cost should keep increasing as the model output gets
farther from 1, and it should be very large when the output is
close to 0.


Model output: \( \hat{y} \)
Actual value: y equals 0 or 1
if y = 1 and \( \hat{y} \) = 1 → cost = 0
if y = 0 and \( \hat{y} \) = 1 → cost = large
Let's recall the logarithm function and its behaviour; we are mostly interested in the range [0, 1].
If you multiply this function by -1, it closely matches the desired cost values
from the picture above.


We can see that the minus log function provides such a cost function for us.


This means that if the actual value is one and the model also predicts one, the minus
log function returns zero cost; but as the prediction moves away from one, the
minus log function returns a larger and larger cost. So we can use the minus log
function to calculate the cost of our logistic regression model. Recall that we
previously noted the MSE-based cost with the sigmoid plugged in is hard to
optimize; we can now replace it with the minus log of
our model's output. It is easy to check that when the desired y is one, the cost
can be calculated as minus log \( \hat{y} \), and when the desired y is zero,
the cost can be calculated as minus log \( (1 - \hat{y}) \). Now we can plug this into
our total cost function and rewrite it as follows.
\[
\text{Cost} (\hat{y}_i, y_i) =
\begin{cases}
-\log(\hat{y}_i) & \text{if } y_i = 1 \\
-\log(1-\hat{y}_i) & \text{if } y_i = 0
\end{cases}
\]
\( \text{J(θ)} = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right) \)
Notice that the loss consists of two parts. Since the labels are 0 or 1,
one of the two terms is always zero, so for each sample we work exactly with the term corresponding to its label.
This is the logistic regression cost function, called Log Loss
or Cross Entropy. As you can see for yourself, it heavily penalizes
situations in which the true class is zero and the model output is close to one, and vice versa.
- Case 1: If y = 1, the true class label is 1. The cost is 0 if the predicted value is 1 as well; but as the prediction deviates from 1 and approaches 0, the cost grows rapidly and tends to infinity.
- Case 2: If y = 0, the true class label is 0. The cost is 0 if the predicted value is 0 as well; but as the prediction deviates from 0 and approaches 1, the cost grows rapidly and tends to infinity. A quick numeric check of both cases follows below.
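For a quick sanity check of these two cases (using the natural logarithm):
- y = 1: if \( \hat{y} = 0.99 \), cost = \( -\log(0.99) \approx 0.01 \); if \( \hat{y} = 0.01 \), cost = \( -\log(0.01) \approx 4.6 \).
- y = 0: if \( \hat{y} = 0.01 \), cost = \( -\log(1 - 0.01) \approx 0.01 \); if \( \hat{y} = 0.99 \), cost = \( -\log(1 - 0.99) \approx 4.6 \).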
Remember, however, that the model does not return a class directly: its output
\( \hat{y} \) is a value between zero and one, which should be interpreted as a probability.
Now, we can easily use this function to find the parameters of our model in such
a way as to minimize the cost.
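A minimal NumPy sketch of this cost, using made-up predictions and labels:

```python
import numpy as np

def log_loss(y_hat, y, eps=1e-12):
    """Binary cross entropy (Log Loss) averaged over m samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1,   1,   0,   0])      # ground-truth labels
y_hat = np.array([0.9, 0.6, 0.2, 0.8])    # model probabilities (made up)
print(log_loss(y_hat, y))
```

The confidently wrong prediction (true label 0, predicted 0.8) contributes most of the total cost, which is exactly the penalizing behavior described above.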
Now we can use gradient descent to find the optimal parameters. The approach is identical
to the one we considered in the previous part for linear regression.
What is gradient descent?
Generally, gradient descent is an iterative approach to finding the minimum of a
function.
Specifically, in our case gradient descent is a technique that uses the derivative of
the cost function to change the parameter values so as to minimize the cost, or error.
How can gradient descent do that?
Think of the parameters, or weights, of our model as lying in a two-dimensional space,
for example θ1 and θ2 for two features.
We need to minimize the cost function J, which is a function of the variables θ1 and θ2.
So let's add a third dimension for the observed cost, or error, the value of J.
If we plot the cost function over all possible values of θ1 and θ2, we get a surface; gradient descent starts from some initial point on that surface and repeatedly moves the parameters in the direction of the negative gradient (steepest descent) until it reaches the minimum.
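To make this concrete, here is a minimal NumPy sketch of gradient descent for logistic regression; the data, learning rate, and iteration count are arbitrary choices. For the log loss above, the gradient with respect to θ works out to \( \frac{1}{m} X^T (\hat{y} - y) \).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: m samples, 2 features plus a bias column of ones
X = np.array([[1.0, 2.0, 1.0],
              [1.5, 0.5, 1.0],
              [3.0, 1.0, 1.0],
              [2.5, 3.0, 1.0]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(X.shape[1])
lr = 0.1                                  # learning rate (arbitrary choice)

for _ in range(1000):
    y_hat = sigmoid(X @ theta)            # forward pass: predicted probabilities
    grad = X.T @ (y_hat - y) / len(y)     # gradient of the log loss w.r.t. theta
    theta -= lr * grad                    # step against the gradient

print(theta, sigmoid(X @ theta))          # learned parameters and fitted probabilities
```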
Multinomial logistic regression
This approach is a classification method that generalizes logistic regression
to multiclass problems, i.e. with more than two possible discrete outcomes.
We use the softmax function, or normalized exponential function, which converts a vector
of K real numbers into a probability distribution over K possible outcomes.
\[
\text{Softmax}(z_i) = \frac{e^{z_i}}{\displaystyle\sum_{j=1}^{K} e^{z_j}}
\]
Softmax applies the standard exponential function to each element z_i of the input
vector z (consisting of real numbers) and normalizes these values by dividing by
the sum of all the exponentials. The normalization ensures that the
components of the output vector σ(z) sum to 1.
The multinomial logistic loss is actually the same as cross entropy.
\[
\text{Log Loss} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_{ik} \log(p_{ik})
\]
where m is the number of samples, K is the number of classes, \( y_{ik} \) is 1 if sample i belongs to class k (and 0 otherwise), and \( p_{ik} \) is the predicted probability that sample i belongs to class k.
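A hedged NumPy sketch of the softmax and this loss for a single sample; the logits and the one-hot label are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Convert a vector of K real numbers into a probability distribution."""
    e = np.exp(z - np.max(z))             # subtract the max for numerical stability
    return e / e.sum()

# Made-up logits for one sample with K = 3 classes, and its one-hot label
z = np.array([2.0, 1.0, 0.1])
y = np.array([1, 0, 0])

p = softmax(z)
loss = -np.sum(y * np.log(p))             # cross entropy for this single sample
print(p, loss)
```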
That is it for logistic regression. Like everything we learn here, it will be used again in the future.