Jun 17

Introduction To Machine Learning

Humans can learn and make decisions; some of these decisions are logical, while others are fuzzy by nature. Some of these abilities can be modelled as mathematical equations that mimic human behaviour. Just as humans learn from past experience, machines learn from historical data. For these models to work well, they need to process large amounts of data, which humans cannot do in real time. That is where machines come in. Machine learning is a field of artificial intelligence in which computers learn to make decisions without being explicitly programmed.

In this post, we explain basic machine learning techniques and the real-world use cases they can address. Before going further, here are some basic machine learning terms:

  • Model – A mathematical expression generated by a learning algorithm and used to produce results.
  • Features – Independent variables that the algorithm uses to estimate the output variable. Also known as explanatory variables or predictors.
  • Output Variable – The unknown variable that the learning algorithm attempts to estimate. Also known as the dependent or target variable.
  • Labeled Data – Data in which the output variable is present.
  • Unlabeled Data – Data in which the output variable is not present.
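
To make these terms concrete, here is a small sketch of what labeled and unlabeled records might look like. The field names (`word_count`, `contains_link`) and all values are hypothetical, chosen only for illustration.

```python
# Hypothetical records for illustration; field names and values are made up.
labeled_data = [
    {"features": {"word_count": 120, "contains_link": True}, "output": "spam"},
    {"features": {"word_count": 45, "contains_link": False}, "output": "non-spam"},
]

# Unlabeled records carry features only; the output variable is missing.
unlabeled_data = [
    {"features": {"word_count": 80, "contains_link": True}},
]


def is_labeled(record):
    """A record is labeled when its output variable is present."""
    return "output" in record


print([is_labeled(r) for r in labeled_data + unlabeled_data])
```

A model, in these terms, would be any function that takes the `features` of a record and returns an estimate of its `output`.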

Supervised Learning

In this category of learning, algorithms are trained on data annotated with the expected result for each input. Such data, together with its expected results, is called supervised (or labeled) data. Supervised learning can be further divided into regression and classification problems.

Regression

Regression is a statistical method for modelling the relationship between an output variable and one or more features, where the output is a continuous numeric value.

Regression has several popular use cases. One is estimating the shelf life of a medicine by assessing the stability of the drug's active component over time. The lifetime of a medicine is affected by temperature changes, oxidation, light (photoreaction), and so on. The potency of the medicine, which decreases with time, is modelled using these factors, and regression is used to derive the relationship between potency and time.

Popular regression algorithms include linear regression, nonlinear regression, and least absolute deviations.
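
As a rough illustration of the shelf-life example, the sketch below fits a straight line to potency measurements using ordinary least squares and solves for the time at which potency reaches a cut-off. The measurements, the 90% cut-off, and the assumption that potency falls linearly with time are all hypothetical; a real stability study would choose the model and threshold according to the drug.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical stability data: months in storage vs. potency (% of label claim).
months = [0, 6, 12, 18, 24]
potency = [100.0, 97.0, 94.0, 91.0, 88.0]

slope, intercept = fit_line(months, potency)

# Assumed cut-off: the medicine is considered expired below 90% potency.
threshold = 90.0
shelf_life_months = (threshold - intercept) / slope

print(f"potency = {slope:.2f} * months + {intercept:.2f}")
print(f"estimated shelf life: {shelf_life_months:.1f} months")
```

Here time is the only feature. Factors such as temperature or light exposure could be added as further features, which would turn this into a multiple regression.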

Classification

Classification is a technique for assigning a class, drawn from a fixed set of classes, to each input.

Consider spam detection in email: an email is considered spam if it contains irrelevant, unsolicited content. Spam detection is a classification problem in which the output variable can take only two values: spam or non-spam. Popular email services such as Gmail and Outlook have their own intelligent systems that categorize emails as spam or non-spam.
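
To show what a two-class spam classifier can look like, here is a minimal sketch of a Naive Bayes classifier over word counts, with add-one smoothing. The training messages are hypothetical, and this is not how any particular email service works; it only illustrates learning from labeled examples and then assigning one of two classes to new input.

```python
import math
from collections import Counter


def train_naive_bayes(examples):
    """Count words per class from (text, label) pairs."""
    word_counts = {}          # label -> Counter of words seen in that class
    label_counts = Counter()  # label -> number of training messages
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_messages)  # class prior
        words_in_class = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (words_in_class + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training data.
training = [
    ("win free prize now", "spam"),
    ("free money offer", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("project report attached", "non-spam"),
    ("lunch on monday", "non-spam"),
]

model = train_naive_bayes(training)
print(predict(model, "free prize offer"))
print(predict(model, "monday meeting report"))
```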

As another example of classification, consider how opinions can be extracted from product reviews. Product reviews express opinions in text. These reviews help a company judge overall public sentiment towards its product and work on improving it. Opinions can be of several types: satisfactory, unsatisfactory, good, bad, and so on. This is therefore a classification problem in which the set of opinion types forms the classes.

Popular classification algorithms include logistic regression, Naive Bayes, and support vector machines. Although logistic regression is a form of regression analysis that yields a real value (a probability between 0 and 1), a class can be inferred from this value, for example by comparing it with a threshold.
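
The step from a real value to a class in logistic regression can be sketched as follows. The weights, bias, and feature values are hypothetical stand-ins for what a trained model would provide, and the 0.5 threshold is a common default rather than a fixed rule.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, weights, bias, threshold=0.5):
    """Return the predicted probability and the class inferred from it."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    label = "spam" if probability >= threshold else "non-spam"
    return probability, label


# Hypothetical model parameters; in practice these are learned from labeled data.
weights = [1.2, 0.8]   # e.g. weight per count of two suspicious words
bias = -2.0

print(predict_class([3, 1], weights, bias))
print(predict_class([0, 0], weights, bias))
```

Raising the threshold makes the classifier more cautious about calling something spam, at the cost of letting more spam through.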

Unsupervised Learning

This category of learning is used when the training data does not contain an output (target) variable. Because the data has no predefined output, the algorithm tries to detect patterns and characteristics in the data set itself.
There are many approaches to unsupervised learning; the most common is “cluster analysis”, which groups similar data points and then draws inferences from those groups. All data points in a group are considered to share similar characteristics.

Grouping similar news articles from across the internet is a problem that clustering algorithms can solve. News text contains specific words that characterize the article. For example, if an article mentions election 2016, lower house seats, parliament, and so on, it is probably political news. Grouping news involves finding patterns in articles, clustering similar articles, and presenting each cluster's category. Google News is an example of news clustering: it gathers similar stories from many news sources and displays them together on a single page.

Popular unsupervised learning algorithms include k-means clustering, mixture models, and hierarchical clustering.
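
As an illustration of cluster analysis, here is a minimal k-means sketch in plain Python. The two-dimensional points are hypothetical; for news articles, each point would instead be a numeric representation of the text, such as word counts. Using the first k points as starting centroids is a simplification made here to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, assignments)."""
    # Simplified, deterministic initialisation: use the first k points.
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical unlabeled data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

Note that no output variable is used anywhere: the groups emerge from the distances between the points alone.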

Semi-Supervised Learning

In this type of learning, the data is partly labeled and partly unlabeled. Unlabeled data is very common, whereas labeled data is hard to obtain, and human annotation does not scale to large data sets.

Speech-to-text conversion is a problem in which spoken words are converted into text. Collecting speech recordings is not difficult, but transcribing them requires human effort. This makes it a semi-supervised learning problem: some recordings have transcripts and others do not.

Several heuristic techniques are available for these problems. One technique applies a supervised algorithm to the labeled data and uses the resulting model to assign labels to the unlabeled instances. Another clusters the data points and assigns each cluster the label that occurs most often among its labeled members. Other common approaches include generative models and low-density separation.
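
The two heuristics described above can be sketched roughly as follows. The first function trains on the labeled points, labels the unlabeled ones with that model, and refits on everything; a simple nearest-centroid classifier on one-dimensional data is assumed here as the supervised algorithm. The second function takes cluster ids that are assumed to come from some clustering step and gives each cluster its most frequent known label. All data, labels, and function names are hypothetical.

```python
from collections import Counter


def fit_centroids(xs, labels):
    """Nearest-centroid 'training': the mean feature value per label."""
    grouped = {}
    for x, label in zip(xs, labels):
        grouped.setdefault(label, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}


def predict_label(centroids, x):
    """Pick the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def self_train(labeled_x, labeled_y, unlabeled_x):
    """Heuristic 1: label the unlabeled points with a model fit on labeled data."""
    model = fit_centroids(labeled_x, labeled_y)
    pseudo_labels = [predict_label(model, x) for x in unlabeled_x]
    # Refit on the labeled data plus the newly pseudo-labeled data.
    final_model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_labels)
    return final_model, pseudo_labels


def label_clusters_by_majority(cluster_ids, known_labels):
    """Heuristic 2: give every point the most common known label in its cluster.

    known_labels[i] is None when point i is unlabeled.
    """
    votes = {}
    for cid, label in zip(cluster_ids, known_labels):
        if label is not None:
            votes.setdefault(cid, Counter())[label] += 1
    majority = {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}
    return [majority.get(cid) for cid in cluster_ids]


# Hypothetical data: a few labeled points and several unlabeled ones.
final_model, pseudo_labels = self_train(
    labeled_x=[1.0, 2.0, 9.0, 10.0],
    labeled_y=["A", "A", "B", "B"],
    unlabeled_x=[1.5, 3.0, 8.0, 11.0],
)
print(pseudo_labels)

# Cluster ids are assumed to come from an earlier clustering step.
print(label_clusters_by_majority(
    cluster_ids=[0, 0, 0, 1, 1, 1],
    known_labels=["A", None, "A", "B", None, None],
))
```

Both heuristics can propagate mistakes: a wrong pseudo-label or a badly formed cluster is treated as correct from then on, so they tend to work best when the labeled examples are representative.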

Summary

Machine learning addresses two broad categories of problems, depending on the data: supervised and unsupervised, with semi-supervised learning combining the two. Supervised learning can be further divided into classification and regression, based on the type of output variable.

This was a quick snapshot of machine learning and the different types of problems it can address. We will cover each of the algorithms in detail in upcoming posts.
