
Lesson 1. Linear Regression

Why do we start with linear regression? First of all, because it is beautiful. It is also an easy way to introduce the basics of Machine Learning and the core concepts of deep learning with a single example. We will need deep learning later, because modern RL stands on that foundation.

Things you must understand after reading:

1) What derivatives and gradients are and why we need them;

2) The regression problem: how linear regression is built, which metrics it uses, its peculiarities and special cases, and why these things work the way they do;

3) How Stochastic Gradient Descent (SGD) works and the details of implementing it from scratch in plain Python with NumPy.

Part 1. Derivatives and gradients

Let us start with the definition of the derivative and where it comes from. There are two common ways to interpret a derivative: physical and geometrical. The definition below is the geometrical one.

The derivative of the function \( f(x) \) with respect to \( x \) is defined as:

\[ f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \]
So, what does it really mean? The derivative shows how fast the function changes. Note that the derivative is itself a function, while the gradient at a point is a scalar: the value of that function at a specific point x, which shows how fast the function grows or decreases exactly there. When the gradient is positive, the function increases; when it is negative, the function decreases; when it equals 0, the function neither grows nor declines. These statements hold in the limit as Δx tends to 0.
The formula for the derivative comes from the graph below. You should look at it and match the formula with the graph.
[Figure: the definition of the derivative on the graph of f(x)]
The faster and steeper the function changes at that particular point, the larger the ratio of numerator to denominator, and hence the gradient. The second observation: if \( f(x + \Delta x) - f(x) > 0 \), the function grows when we take a small step to the right, and the gradient is positive. Otherwise, \( f(x + \Delta x) - f(x) < 0 \): the numerator of the fraction is negative, the gradient is negative, and the function decreases.

Here we zoom in an important part of the graph.

[Figure: zoomed-in view of the graph near x]
\[ f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} = \tan(\alpha) = \frac{a}{b} \]

Here you have to clearly understand two things.

First, the higher the absolute value of the difference \( f(x + \Delta x) - f(x) \) (the value in the picture above), the greater the gradient in absolute value. A large difference means that the value \( f(x + \Delta x) \) at the point \( x + \Delta x \) differs significantly from the value at point \( x \), which indicates rapid growth or decline.
Second, we can determine whether the function is increasing or decreasing by the sign of the gradient. If the gradient is positive, the function is increasing; otherwise, it is decreasing or remains unchanged, as the short numeric check after this list illustrates:
A) \( f(x + \Delta x) - f(x)>0 \) function grows;
B) \( f(x + \Delta x) - f(x)<0 \) function decreases;
C) \( f(x + \Delta x) - f(x)=0 \) remains the same.
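
To make the sign rules concrete, here is a small numerical check (a sketch, not part of the lesson's code): we approximate the derivative of an example function \( f(x) = x^2 \) with a small step standing in for Δx and look at its sign at a few points.

# Numerical check of the sign rules above (a sketch; f is an assumed example function)
f = lambda x: x**2

def approx_derivative(f, x, dx=1e-6):
    return (f(x + dx) - f(x)) / dx

print(approx_derivative(f, 2.0))   # ~ 4  > 0: f grows at x = 2
print(approx_derivative(f, -2.0))  # ~ -4 < 0: f decreases at x = -2
print(approx_derivative(f, 0.0))   # ~ 0: f neither grows nor declines at x = 0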
Back in school, we were taught that the gradient of a line equals the tangent of the angle of inclination alpha. This explanation directly relates to the definition shown in the picture above, concerning the right triangle.
[Figure: lines with positive, negative, and zero slope]

In the pictures:
a) \( a>0 \) - the line grows;
b) \( a<0 \) - the line decreases;
c) \( a=0 \) - the line is parallel to the x-axis and stays the same.

An example with concrete numbers is below.

[Figure: slope examples with concrete numbers]
The main thing to remember about gradients is that they show us how a function changes: how fast it changes and whether it grows or not. This concludes the section on gradients and derivatives.
Now, let's apply our knowledge of gradient properties to a concrete example. The task of linear regression is to construct a line that best approximates points with the smallest error.
[Figure: a set of points approximated by a line]
Gradients can help us to adjust the line's parameters to minimize this error and thereby better approximate the points. We can calculate gradients for any differentiable function. The SGD algorithm aids us in minimizing the Mean Squared Error using these gradients. We can achieve this by leveraging gradient properties to update parameters. Gradients indicate whether the error will increase or decrease if we adjust a specific parameter of the line. We can apply this process to all parameters. Ultimately, our goal is to reduce the error, so we adjust all parameters in the direction opposite to the gradients. After n iterations, we check for any stopping conditions and halt the SGD algorithm.
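As a minimal preview of this idea (a toy sketch with an assumed one-parameter function, not the regression line itself), here is gradient descent on \( f(w) = (w - 3)^2 \), whose minimum is at \( w = 3 \):

# Toy gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3 (illustrative sketch)
w = 0.0    # starting guess for the parameter
lr = 0.1   # learning rate: the size of each step

for i in range(50):
    grad = 2 * (w - 3)   # derivative of f at the current w
    w -= lr * grad       # move opposite to the gradient, i.e. downhill

print(w)  # very close to 3, the minimizer of f

The same loop structure (compute the gradient, step in the opposite direction, repeat) is what we will apply to the two parameters of the line below.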
Now we are ready to build Linear Regression. Mathematically, the task is as follows: we have a set of points (possibly thousands), and we want to approximate them with a function that fits the data best, i.e. minimizes the error. This is crucial because once we have such a rule, we can predict values for new points where the answers are unknown.
The first step is to determine which type of approximation function to use: linear, polynomial, quadratic, trigonometric, etc. Different datasets exhibit different patterns. To better understand this, consider the Anscombe dataset (Anscombe's quartet), which illustrates when a linear fit is appropriate and when it is not.

You can load this dataset, for example, from the seaborn package:

import seaborn as sns
anscombe = sns.load_dataset("anscombe")
data = anscombe[anscombe['dataset'] == 'I']
x = data['x']
y = data['y']
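
To see all four panels at once, you can plot the quartet directly; this short sketch uses seaborn's standard lmplot call on the anscombe dataframe loaded above:

import matplotlib.pyplot as plt

# plot the four Anscombe panels side by side, each with its fitted regression line
sns.lmplot(data=anscombe, x="x", y="y", col="dataset",
           col_wrap=2, ci=None, height=3)
plt.show()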
[Figure: Anscombe's quartet]
The 11 points in each panel share the same summary statistics, such as the mean and standard deviation, but their shapes and patterns differ significantly.
  • The top-left graphic exhibits a linear pattern, making it a suitable candidate for Linear Regression;
  • The top-right graphic displays a quadratic pattern, best approximated by a parabolic function;
  • The bottom-left graphic includes an outlier. It's crucial to remove outliers and correlated features before applying machine learning algorithms, as they can significantly impact the final result;
  • The bottom-right example is also poorly described by a linear function.

Ultimately, the choice of function for point approximation depends on your specific task, its physical implications, and your desired outcomes.

Keep this meme in mind.

[Figure: meme]
Let’s consider a training dataset, illustrated by the orange points in the graphic below. Each orange point can be represented as (x, y), where x is a feature and y is a scalar target or outcome. In more complex scenarios, x may not be a single value but a vector of values, where each value describes a different attribute of an object. Our goal is to find a line that minimizes the error, such as the Mean Squared Error (MSE), which measures the average squared distance of all points from the line.
Another way to frame this task: imagine explaining the problem to children. Picture all the orange points as citizens of a large city or village. There is no river nearby, but it is essential to dig a canal to provide water for daily needs. Through a voting process, we need to decide where and how to dig the canal so that the citizens are generally satisfied and the overall satisfaction of the entire village is maximized. How do we measure the satisfaction of each citizen? By their distance to the canal (the blue line in the graphic): the greater the distance, the less satisfied the citizen. In our setting we use the square of the distance, since the derivative of a quadratic function is linear, which makes the gradient computationally straightforward. We need to determine the optimal canal location for any number of citizens (points) and any arrangement of their locations (x and y coordinates). Similar problems arise frequently, and if new citizens join the village, we should be able to adjust the canal plan accordingly. It is worth mentioning that the line parameters can also be obtained using statistics, without SGD; but for study purposes we will use SGD, because we will rely on this optimizer (and others) many times in our RL projects, so we have to understand how it works. That is one of the purposes of this linear regression lesson.
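For reference, the "statistics without SGD" remark refers to the closed-form least-squares solution. A minimal sketch with made-up points (the formulas are the standard covariance/variance estimates for the slope and intercept):

import numpy as np

# closed-form least-squares fit for y ≈ a*x + b, no SGD needed (made-up points)
x = np.array([0., 1., 2., 3., 4.])
y = np.array([8., 13., 19., 22., 28.])   # points lying roughly on y = 5x + 8

a_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_hat = y.mean() - a_hat * x.mean()
print(a_hat, b_hat)   # slope and intercept that minimize the MSE exactly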
Let’s assume the initial proposal for the canal’s path is depicted below. The orange points represent citizens, and the blue line indicates the proposed canal location.
[Figure: proposed canal location and citizens]
If we start with this initial suggestion, all the citizens might say it’s not quite right and suggest rotating and moving the canal closer to them. They repeat this process a few times until they reach a new configuration. Now, some citizens might start saying, "Hey, the location is fine, and we don’t even need to go out to get some water; the distance is 0 or close to it." However, most citizens are still not satisfied, and the total satisfaction of the village is not yet optimal. They suggest moving the canal even closer.
[Figure: improved canal location and citizens]

Now it’s better for most of the citizens. It’s the best solution for the village based on democracy, and the decision on where to dig the canal is found.

[Figure: the best canal location]

This is the right canal location, provided all votes are fair and every citizen's vote carries the same weight.

Now let’s understand how to find the best canal location in the general case. SGD will help us do that.

Stochastic Gradient Descent (SGD)

Now, we will discuss the algorithm for finding the best-fit line. We have a training data set (x and y values). Our goal is to build the blue line \( y_{\text{pred}} = a \cdot x + b \), which requires us to find the coefficients a and b. We initialize the values of a and b randomly; for instance, let them be -3 and 1 respectively. The graphic below shows the line we want to find, the points from the training data set, and the orange line from which we start our search. The final line equation will be close to \( y = 5 \cdot x + 8 \), but not exactly that, because we added some noise to the y values, so the points do not lie perfectly on one line and the error cannot reach 0.
[Figure: Stochastic Gradient Descent - the training points, the target line, and the initial line]
Each word in the title has a specific meaning. "Descent" indicates that we want to minimize the errors (distances to the line), and to do so we move the line coefficients \(a\) and \(b\) in the direction opposite to the one in which the gradient says the error grows. "Gradient" means we use gradient information in the algorithm. "Stochastic" means there is a probabilistic element in the algorithm. In the simple implementation below we won't use it; however, in more complicated examples we usually do not train each iteration on the full dataset but on a small random piece of it, for two reasons: the computation is cheaper, and the randomness helps to explore different local/global minima.
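A hedged sketch of what that stochastic, mini-batch behaviour could look like (it is not used in the simple implementation below; the variable names mirror the full code later in this lesson, and the batch size is an arbitrary choice):

import numpy as np

# one stochastic (mini-batch) update step
x = np.arange(20, dtype=float)
y = 5 * x + 8
a, b, lr = -3.0, 1.0, 0.0005
batch_size = 4

idx = np.random.choice(len(x), size=batch_size, replace=False)
xb, yb = x[idx], y[idx]                   # a small random piece of the training data
pred_b = a * xb + b
df_da = np.dot(xb, 2 * (pred_b - yb)) / batch_size
df_db = np.sum(2 * (pred_b - yb)) / batch_size
a -= lr * df_da                           # update using gradients from the batch only
b -= lr * df_db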
We need to figure out the correct line parameters a and b. We have the MSE metric that we want to minimize, and gradients will help us do that. We have a prediction, and we want it to be as close to the right answer as possible: the closer a and b are to the true underlying law, the smaller the error.


\( y_{\text{pred}} = a \cdot x + b \)

The Mean Squared Error (MSE) is given by the formula

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - y_{\text{pred},i}\right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (a \cdot x_i + b)\right)^2 \]
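A tiny numeric illustration with made-up numbers: for true values (1, 2, 3) and predictions (1, 3, 5), MSE \( = (0^2 + 1^2 + 2^2)/3 = 5/3 \approx 1.67 \).

import numpy as np

y_true = np.array([1., 2., 3.])
y_pred = np.array([1., 3., 5.])
print(np.mean((y_true - y_pred) ** 2))   # 1.666... = 5/3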


The most important part is the partial derivatives of the MSE with respect to the parameters a and b. Below are the general formulas; since we have a training data set, we compute the gradient for each point by inserting x_i and the true y_i into the formula, along with the current values of a and b, and then take the mean over all points, as you will see in the code below. Afterward, we update a and b using these gradients and the learning rate (lr).
The learning rate determines the size of the updates toward improvement. It matters because we can only trust the gradient locally, in a small area. If the learning rate is too small, learning will be very slow, since we make tiny updates. If it is too large, we might overshoot local and global minima and never find the line with the smallest error. Therefore, we need to find an optimal value for the learning rate.
The partial derivatives of the MSE with respect to \( a \) and \( b \) are:
\[ \frac{\partial \text{MSE}}{\partial a} = \frac{2}{n} \sum_{i=1}^{n} x_i \left((a \cdot x_i + b) - y_i\right) \] \[ \frac{\partial \text{MSE}}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} \left((a \cdot x_i + b) - y_i\right) \]
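
A quick way to convince yourself these formulas are right is a finite-difference check (a sketch with made-up data; the analytic gradients should match the numerical ones up to a small error):

import numpy as np

# finite-difference check of the analytic gradients
x = np.array([0., 1., 2., 3.])
y = np.array([8., 12., 19., 23.])
a, b, eps = 1.0, 0.0, 1e-6

mse = lambda a, b: np.mean((y - (a * x + b)) ** 2)

# analytic partial derivatives from the formulas above
df_da = 2 * np.mean(x * ((a * x + b) - y))
df_db = 2 * np.mean((a * x + b) - y)

# numerical partial derivatives
num_da = (mse(a + eps, b) - mse(a, b)) / eps
num_db = (mse(a, b + eps) - mse(a, b)) / eps
print(df_da, num_da)   # should be nearly equal
print(df_db, num_db)   # should be nearly equal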

Gradient Descent Algorithm Pseudocode:

Prepare the training data set (the orange points (x, y)). Initialize the line (parameters a and b) with random values. Initialize lr (the learning rate) to a small value such as 0.0005 - we make small steps in the direction opposite to the gradient. Initialize number_of_iterations to 10000 (sufficient for this task).

Start loop:

1) Calculate predictions using line parameters a and b and the training data set x and y;
2) Calculate error to monitor that it decreases and ensure the algorithm works properly;
3) Calculate the partial derivatives of MSE with respect to a and b over all training points and take the mean;
4) Update a and b slightly in the direction opposite to the gradient;
5) Repeat loop until any stop condition is met:
* Number of iterations is achieved;
* Error stops decreasing;
* All gradients are 0.

import numpy as np
import matplotlib.pyplot as plt
import random
from random import randint

# the line we will be searching for
linear = lambda x: 5*x + 8

# good practice is to monitor the error during optimization
def mse(y_true, y_pred):
    return sum((y_true - y_pred)**2) / len(y_true)


if __name__ == '__main__':
    random.seed(8)  # lock the seed to make the experiment repeatable

    # create the training data set x and y
    x = np.array(range(20))
    y = np.array([linear(i) + randint(-10, 10) for i in x])

    # initialize the parameters of the line we will be optimizing
    a, b = random.randint(-10, 10), random.randint(-10, 10)

    # SGD part starts, initialize
    lr = 0.0005             # controls the step size
    num_iteration = 10000   # number of times we update the line

    # iterative search for the line with the smallest MSE
    for i in range(num_iteration):
        pred = a*x + b
        error = mse(y, pred)

        # calculate partial derivatives
        df_da = np.dot(x, 2*(pred - y)) / len(pred)
        df_db = sum(2*(pred - y)) / len(pred)

        # update line parameters in the direction opposite to the gradients
        a -= lr*df_da
        b -= lr*df_db

        if i % 100 == 0:
            print(f"error {error}")

Notice that you can drop the factor of 2 from the derivatives: it is just multiplication by a constant, which can be incorporated into the learning rate.
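
Concretely, the variant below (reusing the names from the loop above) produces the same updates as the original code, because the dropped factor of 2 is absorbed into a doubled learning rate:

        # equivalent update: the constant 2 moves into the learning rate
        lr = 0.001                                 # 2 * 0.0005
        df_da = np.dot(x, pred - y) / len(pred)
        df_db = sum(pred - y) / len(pred)
        a -= lr*df_da
        b -= lr*df_db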

So, what do we have? The main metric to monitor during training is the error. If the error decreases, everything is going right (in most cases).

[Figures: MSE loss and the gradients of a and b during training]

You can see that the gradients were large at the beginning, so the parameters changed quickly, because the error was large: a big error produces big gradients. As the error gets smaller, the gradients needed to fix it get smaller too.

The second thing to note: if you look at the gradients of the parameters a and b, you can see that they flatten out well before the last iteration, so you could stop the iterations earlier.
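
One way such an early stop could be added to the training loop above (a sketch; the variables and the mse helper come from the full code, and the tolerance value is an arbitrary choice):

    # early-stopping variant of the loop: break once the error stops improving
    tol = 1e-6
    prev_error = float("inf")
    for i in range(num_iteration):
        pred = a*x + b
        error = mse(y, pred)
        if abs(prev_error - error) < tol:   # error has effectively stopped decreasing
            break
        prev_error = error
        df_da = np.dot(x, 2*(pred - y)) / len(pred)
        df_db = sum(2*(pred - y)) / len(pred)
        a -= lr*df_da
        b -= lr*df_db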

[Figure: line improvements during training]

That was Linear Regression for the general case.

Finally, let's look at some special cases. If we have only one point in the training data set, the angle of the line can be anything, and there is an infinite number of solutions.

[Figure: one training point - many different lines give the right answer]

If we have 2 points, there is only one solution: the error is 0 and the sum of squared distances is 0.

[Figure: case with 2 points]

For two points the right answer is a single line, just as the least-squares solution is unique for any larger number of points.
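
For two points, the unique line can be written down directly; a small sketch with made-up coordinates:

# the exact line through two points: slope from rise/run, intercept from either point
x1, y1 = 1.0, 13.0
x2, y2 = 3.0, 23.0
a = (y2 - y1) / (x2 - x1)   # slope = 5.0
b = y1 - a * x1             # intercept = 8.0
print(a, b)                 # this line passes through both points, so the MSE is 0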

For three points, the solution must be parallel to the longest side of the triangle, and \( L^2 \) must equal \( m_1^2 + m_2^2 \).

[Figure: case with 3 points]

The smallest MSE is achieved when the line is parallel to the longest side, and the sum of squared distances from the two points on that side equals the squared distance from the third point.

For four points there are even more cases, and so on.

Another thing to look at is what happens when we update only one of the two parameters: only a or only b.

[Figure: the line we start from]
[Figure: updating only b - only the line's intercept changes, not its angle]
[Figure: updating only a - in this sense a is the more important parameter for minimizing the error]
[Figure: both a and b updated - the general case, when everything works as intended]
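
A sketch of what "update only b" means in code (reusing the variables from the full training loop above; the analogous change for "update only a" would freeze b instead):

    # update only the intercept b; the slope a stays fixed, so the line shifts but never rotates
    for i in range(num_iteration):
        pred = a*x + b
        df_db = sum(2*(pred - y)) / len(pred)
        b -= lr*df_db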

Conclusion

We have reviewed how to build linear regression, as well as other concepts important for deep learning, such as gradients and optimizers (SGD being the simplest one). The next lesson will be devoted to neural networks: we will solve a spiral classification problem using PyTorch, a popular autograd package for deep learning in Python. The third important lesson is a video from Andrej Karpathy on how to build your own Python autograd package similar to PyTorch, in which you will also partially revisit what we covered in the first two lessons. This material is the base that must be learned in detail; there is no sense in moving forward to deep learning until you fully understand these concepts. See you in the second lesson!