# Supervised Machine Learning

by keshav

## What Is Supervised Learning?

Supervised learning is a family of machine learning algorithms that learn from labeled data. After training on that data, the algorithm assigns a label to new, unlabeled data supplied by the user by matching it against the patterns it has learned.

Supervised learning algorithms fall into two categories: classification and regression.

Classification predicts the class or category to which the data belongs.

e.g.: Spam filtering and detection, Churn Prediction, Sentiment Analysis, image classification.

Regression predicts a numerical value based on previously observed data.

e.g.: House Price Prediction, Stock Price Prediction.

## Classification

Classification is one of the most widely used techniques for determining the class to which a dependent variable belongs, based on one or more independent variables. Put simply, a classification algorithm draws a decision boundary between data points (feature vectors), separating similar data points from dissimilar ones.

Some of the most common classification algorithms are discussed briefly below:

1. K-Nearest Neighbors (K-NN)

K-NN is one of the simplest yet most effective supervised learning algorithms, used for classification as well as regression. It predicts the class of a new sample from the labeled data points that lie closest to it. It is a non-parametric, lazy learning algorithm: it builds no explicit model during training and defers the work to prediction time. Closeness is judged with a similarity measure, usually a distance such as the Euclidean distance.

In this algorithm, ‘K’ is the number of neighbors considered for classification. For binary classification an odd value is usually chosen so that the vote cannot end in a tie. The value of ‘K’ must be selected carefully. If ‘K’ is small, the model has low bias and high variance, i.e. it overfits. If ‘K’ is very large, the model has high bias and low variance, i.e. it underfits. Much research has gone into selecting the right value of ‘K’; a common rule of thumb is ‘K’ = √n, where n is the total number of data points, which often gives a reasonable starting point. K-NN works well with a small number of input variables (p), but the chance of bad predictions grows when the number of inputs becomes very large.
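The voting procedure described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2.0, 2.0), k=3))
print(knn_predict(points, labels, (6.0, 6.5), k=3))
```

Because the algorithm is lazy, all the work happens inside `knn_predict`: every prediction scans the whole training set, which is one reason K-NN slows down on large datasets.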

2. Support Vector Machine (SVM)

Support Vector Machine is one of the more mathematically involved supervised learning algorithms, used for both regression and classification. It is built on the concept of decision planes (more generally called hyperplanes) that define the decision boundaries for classification. A decision plane separates sets of data points that have different class memberships. The SVM performs classification by finding the optimal hyperplane, the one that maximizes the margin between the two classes; the training points that lie closest to this hyperplane are the support vectors.

For linearly separable data, learning is done by finding an optimal hyperplane between the classes.
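To make the hyperplane idea concrete, here is a small sketch of how a linear decision boundary classifies points once it is known. The weights `w` and bias `b` are hypothetical values picked by hand; a real SVM would learn them from training data by maximizing the margin.

```python
import math

# Hypothetical hyperplane parameters, chosen by hand for illustration
# (a real SVM would learn them from training data).
w = (1.0, 1.0)
b = -3.0


def decision_value(x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(x) >= 0 else -1


# Distance between the two supporting hyperplanes w.x + b = +1 and w.x + b = -1.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))

print(predict((3.0, 2.0)), predict((0.5, 1.0)))
print(round(margin_width, 3))
```

The margin width depends only on the length of `w`, which is why maximizing the margin amounts to keeping `w` as short as possible while still classifying the training points correctly.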

Kernel SVM

A kernel SVM uses a kernel function to implicitly map the data into a higher-dimensional space in which the classes can be separated.

Some of the most common types of kernel function are:

- Linear kernel: $K(X_i, X_j) = X_i \cdot X_j$
- Polynomial kernel: $K(X_i, X_j) = (\gamma X_i \cdot X_j + C)^d$, where $d$ is the degree of the polynomial, which must be specified.
- RBF kernel: $K(X_i, X_j) = \exp(-\gamma \lVert X_i - X_j \rVert^2)$, used for data that is not linearly separable. The distance metric is the squared Euclidean distance.
- Sigmoid kernel: $K(X_i, X_j) = \tanh(\gamma X_i \cdot X_j + C)$. Its shape resembles the sigmoid function used in logistic regression, and it is used for binary classification.
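The four kernels listed above translate directly into code. The sketch below is a plain-Python illustration; the function names and the default values for `gamma`, `c` and `d` are assumptions made for the example, not values prescribed by any library.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, c=1.0, d=2):
    return (gamma * dot(x, y) + c) ** d


def rbf_kernel(x, y, gamma=1.0):
    # Squared Euclidean distance between the two vectors.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, c=0.0):
    return math.tanh(gamma * dot(x, y) + c)


x, y = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, y))
print(polynomial_kernel(x, y))
print(rbf_kernel(x, y))
print(sigmoid_kernel(x, y))
```

Note that the RBF kernel equals 1 when the two points coincide and decays toward 0 as they move apart, so it behaves like a similarity score.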

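The kernel trick uses the kernel function to compute inner products as if the data had been transformed into a higher-dimensional feature space, without ever computing that transformation explicitly. This makes it possible to perform linear separation on data that is not linearly separable in the original space.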
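A small numerical check can make the kernel trick less abstract. The sketch below assumes a degree-2 polynomial kernel with γ = 1 and C = 0 on 2-D inputs, for which an explicit feature map is known; the names `feature_map` and `poly2_kernel` are made up for the example.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def feature_map(x):
    """Explicit map from 2-D input to the 3-D feature space of the kernel (x.y)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly2_kernel(x, y):
    """Degree-2 polynomial kernel with gamma = 1 and C = 0."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, 4.0)

# Inner product computed in the higher-dimensional space...
explicit = dot(feature_map(x), feature_map(y))
# ...matches the kernel evaluated directly on the original 2-D points.
implicit = poly2_kernel(x, y)

print(explicit, implicit)
```

Both routes give the same number, but the kernel never builds the 3-D vectors. For kernels such as RBF, whose feature space is infinite-dimensional, the implicit route is the only practical one.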

So, it is better to use linear SVMs for linear problems, and non-linear kernels such as the sigmoid kernel or the Radial Basis Function kernel for non-linear problems.

3. Naive Bayes

The naive Bayes classifier is based on Bayes’ theorem of probability. According to Bayes’ theorem, the probability that we want to calculate, P(A|B), can be given in terms of P(A), P(B|A) and P(B) as

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

The principle of the naive Bayes classifier is that every feature is assumed to be independent of the value of every other feature, given the class. A naive Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets.
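The "no iterative estimation" point shows up clearly in code: training is just counting. The following is a minimal sketch for categorical features, with made-up function names and toy data. It applies no smoothing, so a feature value never seen with a class drives that class's score to zero. The denominator P(B) is omitted because it is the same for every class and does not change which class wins.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            feature_counts[label][index][value] += 1
    return class_counts, feature_counts


def predict(row, class_counts, feature_counts):
    """Return the class maximizing P(class) * product of P(feature | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for index, value in enumerate(row):
            # Likelihood P(feature = value | class), estimated from counts.
            score *= feature_counts[label][index][value] / count
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data (made up): (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool")]
labels = ["no", "no", "yes", "yes", "yes"]

class_counts, feature_counts = train_naive_bayes(rows, labels)
print(predict(("rainy", "mild"), class_counts, feature_counts))
print(predict(("sunny", "hot"), class_counts, feature_counts))
```

Multiplying the per-feature likelihoods together is exactly where the independence assumption enters.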

4. Decision Tree Classification

Decision trees are among the most powerful yet simple supervised learning algorithms, used for classification or regression in the form of a tree structure. Because they cover both tasks, they are also referred to by the umbrella term CART (Classification and Regression Trees).

A decision tree resembles a flowchart in which each internal node represents a ‘test’ on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. A classic algorithm for determining how to split the nodes is Iterative Dichotomiser 3 (ID3).
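The flowchart structure can be represented as nested dictionaries. The tree below is hand-written for illustration (it is not learned from data), and the attribute names and the `classify` helper are hypothetical.

```python
# A hand-written tree (hypothetical, not learned from data): internal nodes test
# one attribute, branches are the test outcomes, and leaves hold class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "attribute": "windy",
            "branches": {True: "no", False: "yes"},
        },
    },
}


def classify(node, sample):
    """Follow the branch matching each test until a leaf (class label) is reached."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node


print(classify(tree, {"outlook": "sunny", "humidity": "high"}))
print(classify(tree, {"outlook": "overcast"}))
print(classify(tree, {"outlook": "rainy", "windy": False}))
```

Prediction only walks one path from the root to a leaf, so its cost depends on the depth of the tree rather than on the size of the training set.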

The ID3 algorithm uses entropy and information gain to construct a decision tree.

Entropy

In layman's terms, entropy is a measure of disorder or uncertainty. In machine learning, entropy is used to measure the homogeneity of a sample: the lower the entropy of a sample, the higher its homogeneity. In other words, entropy describes how predictable an outcome is. It is denoted by H(S) or E(S).

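The mathematical formula to calculate the entropy is as follows:

$$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $c$ is the number of classes and $p_i$ is the proportion of samples in $S$ that belong to class $i$.

Information Gain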

Information gain is the key measure used by decision tree algorithms to construct a decision tree. The algorithm always tries to maximize information gain: the attribute with the highest information gain is tested/split first. Information gain is measured using the following formula:

$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$

where Gain(T, X) is the information gain from applying feature X, Entropy(T) is the entropy of the entire set, and the second term, Entropy(T, X), is the entropy after applying the feature X.
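Both measures are short to compute. The sketch below assumes base-2 Shannon entropy and takes the entropy after a split to be the size-weighted average of the entropies of the resulting subsets, as is standard for ID3; the function names and the toy data are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, feature_index):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    total = len(labels)
    # Group the labels by the value each row takes for the chosen feature.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Toy data (made up): (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]

print(entropy(labels))
print(information_gain(rows, labels, 0))  # split on outlook
print(information_gain(rows, labels, 1))  # split on temperature
```

On this toy data, splitting on outlook separates the classes perfectly, so it has the higher gain and would be chosen as the first split.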