Linear Regression using Gradient Descent

Regression Analysis is a statistical process that involves identifying the relationship between independent and dependent attributes. The first studies of it date to the early 19th Century by Legendre using Least Square Methods. Since then, various methods of estimating the coefficients of regression were developed. The purpose of this article is to build and understand the internal working of linear regression using a gradient descent approach.  The goal is two-fold- a) to give an intuitive understanding of linear regression internals using gradient descent approach and b) To be comfortable with the concept of gradient descent which is very most frequently used for building Neural Network architectures.

Linear Regression

Assume that we have data that has m data points and n features/attributes. Therefore, our data is now a matrix of size mXn. But, for computational purposes, it is better to arrange the samples by columns and features/attributes as rows

Click here to be part of INSOFE’s exciting research through our doctoral program for working professionals – World’s first Doctorate in Business Administration (DBA) in Data Science

Let’s understand our goal- in a supervised approach given X, we need to predict y. If the model is good, then the difference between the actual and the predicted values would be small else the difference would be large. This difference is known as error. For a regression problem, the difference is computed as a squared error. The squared error is chosen because of the convex nature of the function.

For each data point i, the regression equation that predicts the outcome is y ̂ = w1X1 + w2X2 + …… + wmXm + C. From the equation it is evident that if there are m independent attributes, then we need m W’s (weights/coefficients which signify the importance of X’s) and a constant/intercept. This y ̂ is technically a dot product between a vector of W’s and vector of X’s (remember vector of X’s are nothing but the value of each attribute for that data point. So, it can be vectorially represented as

where w1, w2 …. are the weights of the features. Xnm represents the nth feature value of mth data point, b is the intercept value to be added.

Observe the dimensions of the W’s matrix  (1 X n), samples matrix (n X m), intercept ( 1 X m) and predictions matrix (1 X m). Mathematically, the equation is written as

Here to make predictions, we have X’s (from the data) and all we need to compute are W’s and b. Now, the question boils down to how to get them such that they make a good model (the error is small). The loss function is defined as the error for a single data point is
 (yi-y ̂i)2 .  Since we focus on the overall error on the data and not for an individual data point, this is computed as

where m is the number of data points. This is known as cost function and it should be minimized. The parameters are W’s and b (from which y ̂ is computed) which impact the cost. 

Here is a step-by-step procedure to solve the problem

  1. We start with random initialization of W’s and b. Let’s say all W’s and b’s be initialized with 0
  2. With these initialized values, we compute the y ̂ using (eq. 1).
  3. Then compute the cost function using (eq.2)
  4. Observe that this cost function is impacted/influenced by all the W’s and b. Now, to what extent each of Wi or b are affecting the cost function i.e. How much is the impact of W1 on cost, how much is the impact of W2 on cost and so on. How can we compute it?
    a. This can be done by computing a partial derivative of the cost function w.r.t each of the Wi.
    b. This is important because we can use this computed value to update corresponding Wi and b, and the updated W and b are used to compute the y ̂ again. The intuition is that when the loss is used to update W’s and b, we are driving the W’s and b in the direction of minimizing the cost. This approach is known as gradient descent.
Figure1: Appropriate learning rate would help us reach the minimum quickly

Here is the ideal W is where cost is minimum. Observe that, this is for one variable Wi (hence is represented in a 2-D) but in our case, there are multiple Ws and hence the plot is difficult to put on paper, but the intuition remains the same.

5. So how do we update the weights?
a. As mentioned previously, we have computed the partial derivatives of cost w.r.t W’s and b. During the update phase, we need to decide how much should we update. Should add/subtract the entire value (∂cost/ ∂Wi) for all i which is equal to the number of features/attributes from their corresponding Wi or only a part of it?

i. Imagine if we need to reach point B from point A. If we take large strides, then there is a possibility that we might cross point B and if we take very small strides, then it might take a long time to reach point B. So, our step size should be appropriate. This step size in this context is known as Learning Rate (denoted by α).

Figure 2: Large learning rate: The values of W may oscillate instead of reaching the minimum cost function value.
Figure 3: Small learning rate: The descent would be slow and takes a greater number of iterations for reaching the minimum value

b. We now update the value of W’s and b as follows
              Wupdated(i) = Wi  – α*∂cost/ ∂Wi
               bupdated = bi – α*∂cost/ ∂b                                         (eq. 3 & 4)

c. Let’s assume we have a function z= xy. Then ∂z/ ∂x =y ∂x/ ∂x   and ∂z/ ∂y = x ∂y/ ∂y .  You might know this from differential calculus. Similarly, if W2 is coefficient for feature X2 then ∂cost/ ∂W2 = X2 and so on.

Here “*” represents the matrix multiplication between the matrix X and the vector of residuals or errors for all the data points (observe that the Y are in upper case which implies that it is a vector of y’s of all data points)  and similarly

6. With the updated values of W’s and b, the Y ̂ are predicted and cost is computed from (eq. 1) and (eq. 2) respectively. Then the W’s and b are updated using (eq. 3 & 4) and this process continues iteratively, till cost function is minimized.

7. The values of W’s and b at the least cost function value would then be the coefficients of the features/attributes and intercept respectively as shown in image 1.

Algorithm for Linear Regression using Gradient Descent

Step 1: Initialize W’s and b

Step 2: Compute the y ̂ using the function WTX + b 

Step 3: Compute the cost from the actual and predicted values using function

Step 4: Calculate gradient of cost function w.r.t w’s and b. Then update the w’s and b using the equations               

Wupdated(i) = Wi  – α*∂cost/ ∂Wi
bupdated = bi – α*∂cost/ ∂b

Step 5:  Repeat Step 2 to Step 4 until the cost is minimized

Implementation of Linear Regression using Gradient Descent in Python. Here is the Github link for code

Leave a Reply

Your email address will not be published. Required fields are marked *