Logistic Regression

Logistic regression is one of the most widely used classification algorithms in machine learning. It covers a variety of use cases, including credit card fraud detection, spam detection, and customer attrition. This blog continues the machine learning blog series and explains the logistic regression algorithm in detail.

Consider a credit card fraud detection system. Credit card companies lose billions to fraudulent transactions. A transaction is considered fraudulent if it is not executed by the rightful card holder. To counter such fraud, banks install real-time systems that detect anomalous transactions.

Spam detection is another use case that can be approached with logistic regression. It is very common among email services such as Gmail and Outlook. It is essentially a text categorization problem in which the features can be the frequencies of words and phrases that occur mostly in spam mails, such as “Won”, “Lottery”, “Price”, etc.

Logistic regression finds applicability in these and many more use cases.

Background

Logistic regression (also called the logit model) is a statistical method for categorizing data into classes. Like other regression models, such as linear regression, it is used in predictive analysis.

Logistic regression can be binomial, multinomial or mixed. Binary logistic regression is a special case of regression in which the dependent (outcome) variable is dichotomous, that is, binary. The goal of binary logit is to find the best-fitting model for categorizing the data into two classes. In this post, we will focus on binary logit and how it can be used to solve spam detection.

In Fig 1.1, red balls represent spam and green ones represent non-spam. Spaminess is a numerical quantity that indicates how likely an email is to be spam.

Fig 1.1 Plot of spam/ non-spam email

Because the outcome of spam detection is binary, it can be represented as shown in Fig 1.2, with 1 for spam and 0 for non-spam.

Fig 1.2 Plot of spam/ non-spam email

Hypothesis

The hypothesis of logistic regression, known as the logistic function, is represented as

$h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}}$

$h_{\theta}(x) = g(\theta^T x)$

$\theta^T x = \theta_0 + \theta_1x_1 + \theta_2x_2 + \cdots + \theta_nx_n$
Here, $h_\theta(x)$ is the logistic function, which can also be written as $g(\theta^T x)$. The $\theta$'s are the constants or coefficients of the equation, the $x$'s are the features or attributes, and $e$ is the base of the natural logarithm. $\theta^T x$ has the same form as the linear regression hypothesis. With a threshold of 0.5, the decision boundary between the two classes of data is the set of points where $\theta^T x = 0$.
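
The hypothesis above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names `sigmoid` and `hypothesis`, the coefficient values, and the feature values are all assumptions made up for the example. The two-branch form of `sigmoid` is one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # safe: z is negative, so e is in (0, 1)
    return e / (1.0 + e)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), where theta[0] is the intercept theta_0."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

# Hypothetical coefficients and feature values, chosen only for illustration.
theta = [-1.0, 2.0, 0.5]
x = [0.5, 0.0]

# theta^T x = -1 + 2*0.5 + 0.5*0 = 0, so the point lies on the decision boundary.
print(hypothesis(theta, x))  # 0.5
```

A point with $\theta^T x = 0$ gets probability exactly 0.5, which is why that set of points acts as the boundary when the threshold is 0.5.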

Fig 1.3 Decision boundary graph

The logistic function takes values between 0 and 1, so its output is interpreted as a probability. This probability is then used to classify the data into two classes by setting a threshold value.

$g(\theta^T x)$ can be represented as shown in Fig 1.4.

Fig 1.4 Plot of logistic function

Loss/Cost Function

The quality of a logistic regression model is measured by how well its decision boundary classifies the instances. To find the optimal decision boundary, we need to minimize the following loss function:
$cost(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

Fig 1.5 Cost Function
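
To get a feel for the per-example cost, the short sketch below evaluates it for a few predicted probabilities. The function name `cost` and the probability values are illustrative assumptions, not part of any library.

```python
import math

def cost(h, y):
    """Per-example cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# Illustrative predicted probabilities h_theta(x).
for h in (0.9, 0.5, 0.1):
    print(f"h={h}: cost if y=1 is {cost(h, 1):.3f}, cost if y=0 is {cost(h, 0):.3f}")
```

A confident correct prediction (h = 0.9 when y = 1) costs about 0.105, while a confident wrong one (h = 0.1 when y = 1) costs about 2.303. The cost grows without bound as the prediction moves toward the wrong label.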

Because $h_\theta(x)$ is a nonlinear (sigmoid) function of $\theta$, the squared loss is not a convex function of $\theta$, and gradient descent may not converge to the minimum we want. For logistic regression, a different loss function is therefore used. It is convex, which lets gradient descent converge to the global minimum.
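
The convexity claim can be checked numerically on a single training example. The sketch below assumes one hypothetical example with feature x = 1 and label y = 0, and compares each loss at the midpoint of two parameter values against the average of the loss at the endpoints. For a convex function the midpoint value can never exceed that average. All function names here are made up for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_loss(theta, x=1.0, y=0.0):
    """Squared error of the logistic hypothesis on one example."""
    return (sigmoid(theta * x) - y) ** 2

def log_loss(theta, x=1.0, y=0.0):
    """Logistic (log) loss on one example."""
    h = sigmoid(theta * x)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def midpoint_gap(f, a, b):
    """f(midpoint) minus the average of f(a) and f(b).

    A positive value means convexity is violated between a and b.
    """
    return f((a + b) / 2) - (f(a) + f(b)) / 2

print(midpoint_gap(squared_loss, 1.0, 5.0))  # positive: squared loss is not convex here
print(midpoint_gap(log_loss, 1.0, 5.0))      # negative: consistent with convexity
```

One positive gap is enough to show that the squared loss is not convex. The negative gap for the log loss is only consistent with convexity at this pair of points and does not prove it by itself.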

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} cost(h_\theta(x^{(i)}), y^{(i)})$
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left\{ y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right\}, \quad y^{(i)} \in \{0,1\}$
The above cost function is convex. Here $J(\theta)$ is the loss function, $m$ is the number of instances, $y^{(i)}$ is the actual value and $h_\theta(x^{(i)})$ is the predicted value of the dependent variable for the $i^{th}$ instance. Now we can predict the classes using the fitted model $h_\theta(x)$. Fig 1.6 shows the plot of $h_\theta(x)$ with the data points from Fig 1.2.

Fig 1.6 Hypothesis function with features vs probabilities for spam detection problem
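
The averaged cost $J(\theta)$ can be computed directly from its definition. The sketch below uses a made-up toy dataset with a single "spaminess" feature per email. The helper names `predict_proba` and `average_cost` and the two parameter vectors are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """h_theta(x), with theta[0] as the intercept."""
    return sigmoid(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))

def average_cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / m

# Made-up toy data: one feature per email, label 1 = spam, 0 = non-spam.
X = [[0.2], [0.9], [1.8], [2.5]]
y = [0, 0, 1, 1]

# With all-zero theta every prediction is 0.5, so J equals log(2), about 0.693.
print(average_cost([0.0, 0.0], X, y))
# A theta that separates the two classes gives a lower cost.
print(average_cost([-4.0, 3.0], X, y))
```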

From the fitted hypothesis, we can find the probability for a data point by plugging its feature values into $h_\theta(x)$. This probability is used to classify an email (data point) as spam or non-spam by choosing a threshold: an email with probability above the threshold is considered spam, and one below it is considered non-spam.
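
The thresholding step might look like the following sketch. The function name `classify`, the default threshold of 0.5, and the probabilities are hypothetical values picked for the example.

```python
def classify(probability, threshold=0.5):
    """Label an email 'spam' if its predicted probability is above the threshold."""
    return "spam" if probability > threshold else "non-spam"

# Hypothetical model outputs h_theta(x) for four emails.
probabilities = [0.03, 0.21, 0.80, 0.97]

print([classify(p) for p in probabilities])
# ['non-spam', 'non-spam', 'spam', 'spam']

print([classify(p, threshold=0.9) for p in probabilities])
# ['non-spam', 'non-spam', 'non-spam', 'spam']
```

Raising the threshold marks fewer legitimate emails as spam but lets more spam through, so the right value depends on which mistake is more costly for the application.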

Gradient Descent

Gradient descent is an optimization algorithm used to find the minimum of a given function. You can find a detailed explanation of gradient descent in the Linear Regression blog.
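
For the logistic loss, the partial derivative with respect to $\theta_j$ is $\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$, with $x_0^{(i)} = 1$ for the intercept. The sketch below applies plain batch gradient descent with that gradient to a made-up toy dataset. The learning rate, iteration count, data, and function names are all assumptions chosen for illustration, not tuned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """h_theta(x), with theta[0] as the intercept."""
    return sigmoid(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))

def gradient_descent(X, y, learning_rate=0.1, iterations=5000):
    """Batch gradient descent on the average log loss J(theta)."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)              # theta[0] is the intercept
    for _ in range(iterations):
        grad = [0.0] * (n + 1)
        for xi, yi in zip(X, y):
            error = predict_proba(theta, xi) - yi   # h_theta(x) - y
            grad[0] += error                        # intercept term, x_0 = 1
            for j in range(n):
                grad[j + 1] += error * xi[j]
        # Step against the averaged gradient.
        theta = [t - learning_rate * g / m for t, g in zip(theta, grad)]
    return theta

# Made-up toy data with overlapping classes: label 1 = spam, 0 = non-spam.
X = [[0.2], [0.9], [1.5], [1.2], [1.8], [2.5]]
y = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(X, y)
print(theta)
print([round(predict_proba(theta, xi), 2) for xi in X])
```

The toy classes overlap on purpose. If the data were perfectly separable, the loss would have no finite minimizer and the coefficients would keep growing as the iterations continue.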

Summary

In this blog, we explained logistic regression and its applicability in the real world. We talked about the logistic function, which varies from 0 to 1 and whose parameters are fitted by minimizing its loss function. Gradient descent is used to find the minimum loss.