Linear Regression

Machine learning is the science of getting computers to act without being explicitly programmed. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Recently we covered different techniques involved in machine learning. Here we will take a detailed look at linear regression, a supervised learning method, along with real-life use cases.

Sales estimation

A common problem among e-commerce and product firms is revenue projection. Forecasting sales helps companies make informed business decisions about cost cutting and expansion in specific areas.
Using regression analysis, one can predict the future sales of a company.

Shelf life of medicine

Ever wondered how the shelf life of a medicine is determined? To derive it, systematic stability testing is done by storing the medicine in different environments and extracting samples at predetermined time intervals. Regression analysis can then help determine the expected expiry time of the medicine from the data produced by this experiment.

Background

Regression analysis is a statistical method for deriving a relationship between a set of variables: one dependent variable and one or more explanatory (independent) variables. Independent variables are the features or predictors that help predict the value of the dependent variable. Let us consider a basic example in which there is only one independent variable and one dependent variable, with only two data points available.

Fig 1.1: Straight line between two points

In figure 1.1 there are two data points, P1 and P2, joined by a line. This line can be represented as:

$y = mx +c$

where y is the output or dependent variable, x is the explanatory variable, m is the slope or regression coefficient and c is the constant or intercept. With only two data points in the plane, we can derive the relationship between them by drawing a single line. With more than two non-collinear data points, however, no single line can pass through all of them; see figure 1.2.
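As a minimal sketch of the two-point case, the slope and intercept can be computed directly from the coordinates. The function name `line_through` and the coordinates used for P1 and P2 are hypothetical, chosen only for illustration:

```python
def line_through(p1, p2):
    """Return slope m and intercept c of the line y = m*x + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)   # rise over run
    c = y1 - m * x1             # solve y1 = m*x1 + c for c
    return m, c


# Hypothetical coordinates for P1 and P2
m, c = line_through((1.0, 3.0), (3.0, 7.0))
print(m, c)
```

With exactly two points the line fits both of them with zero error, which is why the regression problem only becomes interesting once a third, non-collinear point is added.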

Fig 1.2: Line graph passing through two points

In fig 1.2 the three data points P1, P2 and P3 are non-collinear, so they cannot all lie on a single line. Point P3 is not on the line, and if we project P3 vertically onto the line we get y3′ instead of y3. The difference between the two is considered the error, which is quite high in this case. Through regression analysis we can draw a more appropriate relationship (line) between the explanatory and dependent variables than the line above.

Fig 1.3: Line graph

In figure 1.3 an alternative line is drawn to reduce the difference between the actual and calculated y (this line is for illustration only and might not be optimal).
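To make the idea of a "better" line concrete, the sketch below compares the squared error of a line drawn through two of the points with that of a least-squares line fitted to all three. The coordinates are hypothetical (the figures give no values), and the closed-form formula for a single explanatory variable is a standard result that this article does not derive:

```python
def fit_line(points):
    """Least-squares slope and intercept for one explanatory variable."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are equal: slope is undefined")
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def sum_squared_error(points, m, c):
    """Sum of squared vertical distances between the points and y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in points)


# Hypothetical coordinates for P1, P2 and P3
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 5.0)]

# The line through P1 and P2 only is y = x, i.e. m = 1, c = 0
two_point_error = sum_squared_error(points, 1.0, 0.0)

m, c = fit_line(points)
fitted_error = sum_squared_error(points, m, c)
print(two_point_error, fitted_error)
```

The fitted line no longer passes exactly through P1 and P2, but its total squared error over all three points is smaller.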

Algorithm

Linear regression is a type of regression analysis in which the derived relationship between the variables is linear. It is used in predictive modeling where the predicted value is always a real number. It derives a linear relationship between the independent variables and the dependent variable, which is then used to estimate the dependent variable. It is a supervised learning algorithm: it learns from historical (training) data in which the value of the dependent variable is known.

Let us understand this with a simple example of predicting student marks.

Sample data set for linear regression
The table shows data about students, with features such as average hours of study and average of the last 5 test results, and a last column holding the dependent variable “marks”, which is to be predicted. Plotting this table on a graph is difficult because it has 3 features (explanatory variables), so it can also be called a 3-dimensional problem.

To visualize an n-dimensional problem we treat each record as a point in n-dimensional Euclidean space and draw a simplified 2-D representation of it. The data table above can be represented as shown in fig 1.4:

Fig 1.4: Euclidean space

The line in fig 1.4 predicts the marks awarded to a student for the given features.

Hypotheses

The theory of linear regression draws on several areas of mathematics, such as differential calculus, algebra and geometry. In this regression we try to find the linear equation that gives the best estimate of the output variable. Let us start with our first assumption:
With m instances of data and n features, we take a general linear equation as the model for the line.

$h_{\theta}(x^{(i)}) = \theta_{0} + \theta_{1}x_{1}^{(i)} + \theta_{2}x_{2}^{(i)} + \dots + \theta_{n}x_{n}^{(i)}$
in which the $\theta$’s are the constants or coefficients of the equation, the x’s are the features or attributes, i is an instance, n is the total number of features and $h_{\theta}(x^{(i)})$ is the predicted value. This is an n-dimensional linear equation because every $\theta$ and every feature appears with degree one.
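The hypothesis can be sketched as a small function that takes the coefficients and one instance's features. The function name `hypothesis`, the coefficient values and the student features below are all hypothetical, picked only to show how the terms add up:

```python
def hypothesis(thetas, features):
    """Predicted value: theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    if len(thetas) != len(features) + 1:
        raise ValueError("need one theta per feature plus the intercept theta_0")
    return thetas[0] + sum(t * x for t, x in zip(thetas[1:], features))


# Hypothetical coefficients: intercept, then one coefficient per feature
thetas = [10.0, 5.0, 0.5, 2.0]
# Hypothetical instance with 3 features
student = [4.0, 60.0, 3.0]
print(hypothesis(thetas, student))  # 10 + 5*4 + 0.5*60 + 2*3 = 66.0
```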

Loss Function

The linear regression algorithm finds the coefficients $\theta_0,\theta_1,\theta_2,\dots,\theta_n$ of the hypothesis for which the sum of squared differences between the actual and predicted y is smallest. This sum is considered the loss or error and can be formulated mathematically as:

$J(\theta_{0},\theta_{1},\dots,\theta_{n}) = \frac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2$
The expression above is called the loss function, and the goal is to find the $\theta$’s that minimize it, i.e. $\min_{\theta} J(\theta_{0},\theta_{1},\dots,\theta_{n})$. Here $h_{\theta}(x^{(i)})$ is the predicted value, $y^{(i)}$ is the actual value and m is the number of instances. As the name suggests, it measures the loss of our model, which needs to be reduced. This particular loss function is called squared loss and is the one most commonly used in linear regression. Being of degree two in the coefficients, it plots as a parabola when viewed as a function of a single coefficient and as a convex, bowl-shaped surface for several coefficients. Because the loss function is convex, any minimum that is found is a global minimum. In figure 1.5 a parabola is shown in which the Y-axis is the loss and the X-axis represents the coefficient.

Fig 1.5: Loss function curve
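The squared loss can be sketched in a few lines. The function name `squared_loss` and the toy data are hypothetical; the data follows y = 2x exactly, so the loss is zero at the true coefficients and grows as the coefficients move away from them:

```python
def squared_loss(thetas, X, y):
    """Squared loss: (1 / 2m) * sum of squared prediction errors."""
    m = len(X)
    total = 0.0
    for features, actual in zip(X, y):
        # Hypothesis: theta_0 + theta_1*x_1 + ... + theta_n*x_n
        predicted = thetas[0] + sum(t * x for t, x in zip(thetas[1:], features))
        total += (predicted - actual) ** 2
    return total / (2 * m)


# Hypothetical data with one feature, generated from y = 2x
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]

print(squared_loss([0.0, 2.0], X, y))  # true coefficients: loss is 0.0
print(squared_loss([0.0, 1.0], X, y))  # errors of -1, -2, -3: (1 + 4 + 9) / 6
```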

Gradient Descent

Gradient descent is an optimization algorithm for finding a local minimum of a function. As shown in the image above, a parabola is drawn for the loss function of single-variable linear regression. The minimum of the parabola is the minimum loss that can be achieved with this loss function. To find this minimum, the gradient descent technique is used:

repeat until convergence {
${\theta_{i}}^{(j+1)} \leftarrow {\theta_{i}}^{(j)} - \alpha(\frac{\partial J}{\partial\theta_i})$
}

in which $\alpha$ is the learning rate and $\frac{\partial J}{\partial\theta_i}$ is the partial derivative of the loss function with respect to $\theta_i$. In this update rule i indexes the coefficient (0 to n) and j counts the iterations.
To start gradient descent, first choose the initial coefficients ${\theta_{0}}^{(0)}, {\theta_{1}}^{(0)}, {\theta_{2}}^{(0)}, \dots, {\theta_{n}}^{(0)}$ randomly. Then calculate the new thetas from the expression above. An important point is that all coefficients must be updated simultaneously, i.e. compute all the new $\theta$ values from the current ones and only then substitute them into the equation. Repeat until they converge, i.e. until two consecutive values of the loss function are the same or differ by very little. Because every update uses the whole training set, this is also called batch gradient descent. Once we have all the $\theta$ values, we can use the model to predict the value for a test case by putting its features into the hypothesis.
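For the squared loss defined earlier, the partial derivative in the update rule can be written out explicitly. The following is a standard derivation sketch rather than something stated in the text above. It writes k for the instance index, to avoid clashing with the coefficient index i of the update rule, and assumes a constant feature $x_0^{(k)} = 1$ paired with $\theta_0$:

```latex
\frac{\partial J}{\partial \theta_i}
  = \frac{\partial}{\partial \theta_i}\,\frac{1}{2m}\sum_{k=1}^{m}\left(h_{\theta}(x^{(k)}) - y^{(k)}\right)^2
  = \frac{1}{m}\sum_{k=1}^{m}\left(h_{\theta}(x^{(k)}) - y^{(k)}\right)x_i^{(k)}
```

The factor of 2 produced by differentiating the square cancels the 1/2 in the loss, which is the usual reason for including that 1/2 in the first place.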

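import random

# Illustrative toy data (an assumption, not from the article): y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m = len(xs)

alpha = 0.01          # learning rate
min_range = 0.00001   # convergence threshold on the change in loss
converged = False
thetas = [random.random(), random.random()]  # random initial theta_0, theta_1

def predict(thetas, x):
    # Hypothesis for one feature: theta_0 + theta_1 * x
    return thetas[0] + thetas[1] * x

def derivative(thetas, i):
    # Partial derivative of the loss with respect to thetas[i];
    # theta_0 is paired with a constant feature of 1
    return sum((predict(thetas, x) - y) * (x if i else 1.0)
               for x, y in zip(xs, ys)) / m

def loss(thetas):
    # Squared loss: (1 / 2m) * sum of squared prediction errors
    return sum((predict(thetas, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

current_loss = loss(thetas)
while not converged:
    # Build every new theta from the old thetas (simultaneous update)
    new_thetas = [theta - alpha * derivative(thetas, i)
                  for i, theta in enumerate(thetas)]
    thetas = new_thetas
    new_loss = loss(thetas)
    # Convergence condition: the loss has stopped changing noticeably
    if abs(new_loss - current_loss) <= min_range:
        converged = True
    current_loss = new_loss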
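For more than one feature, the same loop can be wrapped in a function and followed by a prediction for a new test case. This is only a sketch under stated assumptions: the names `predict`, `loss` and `batch_gradient_descent`, the `max_iters` safety cap, the all-zero starting point (instead of random thetas) and the toy data are all hypothetical choices made for illustration.

```python
def predict(thetas, features):
    """Hypothesis: theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    return thetas[0] + sum(t * x for t, x in zip(thetas[1:], features))


def loss(thetas, X, y):
    """Squared loss: (1 / 2m) * sum of squared prediction errors."""
    m = len(X)
    return sum((predict(thetas, row) - target) ** 2
               for row, target in zip(X, y)) / (2 * m)


def batch_gradient_descent(X, y, alpha=0.1, min_range=1e-12, max_iters=100_000):
    """Fit thetas by batch gradient descent, starting from all zeros."""
    m, n = len(X), len(X[0])
    thetas = [0.0] * (n + 1)
    current_loss = loss(thetas, X, y)
    for _ in range(max_iters):
        errors = [predict(thetas, row) - target for row, target in zip(X, y)]
        # The gradient for theta_0 uses a constant feature of 1
        gradient = [sum(errors) / m]
        for k in range(n):
            gradient.append(sum(e * row[k] for e, row in zip(errors, X)) / m)
        # Simultaneous update: every new theta is built from the old ones
        thetas = [t - alpha * g for t, g in zip(thetas, gradient)]
        new_loss = loss(thetas, X, y)
        if abs(new_loss - current_loss) <= min_range:
            break
        current_loss = new_loss
    return thetas


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]

thetas = batch_gradient_descent(X, y)
print([round(t, 3) for t in thetas])

# Predict a test case by putting its features into the hypothesis
print(round(predict(thetas, [3.0, 2.0]), 3))
```

The learning rate here suits features of roughly unit scale. With features on very different scales (for example hours of study next to test scores out of 100), a fixed learning rate like this can diverge, so the features would typically be rescaled first.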

Summary

This blog attempts to explain linear regression using real-world examples. Linear regression helps derive the optimal linear relationship between variables using the mean squared loss function. Gradient descent is used to calculate the coefficients of the model such that the loss is minimal. Linear regression has a wide variety of use cases, such as risk estimation and trend evaluation in business, among many others.