Understanding K-Nearest Neighbor Algorithm (With Examples)

by keshav



The K-Nearest Neighbor (K-NN) algorithm is probably one of the simplest yet most powerful supervised learning algorithms, used for classification as well as regression. It is most commonly used to classify data points that are separated into several classes, in order to make predictions for new sample data points. It is a non-parametric, lazy learning algorithm that classifies data points based on a similarity measure (e.g. a distance measure, most often Euclidean distance).

Principle: The K-NN algorithm is based on the principle that similar things exist close to each other, or "like things are near to each other."

In this algorithm, 'K' refers to the number of neighbors considered for classification, and it should be an odd value. The value of 'K' must be selected carefully, otherwise it may hurt the model. If the value of 'K' is small, we get low bias and high variance, i.e. overfitting of the model. In the same way, if 'K' is very large, we get high bias and low variance, i.e. underfitting of the model. Much research has been done on selecting the right value of K, but in most cases taking K ≈ √n (the square root of the total number of data points n) gives a pretty good result. If this value of 'K' turns out to be odd, we keep it; otherwise, we make it odd by adding or subtracting 1.
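For instance (a small illustrative snippet, not from the original post), applying this rule of thumb to the 150 iris observations used later in this article:

import math

n = 150                    # total number of observations
k = int(math.sqrt(n))      # sqrt(150) ≈ 12.2, truncates to 12
if k % 2 == 0:             # if K is even, make it odd
    k += 1
print(k)                   # 13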

[Figure: K-NN classification example]

K-NN works pretty well with a small number of input variables (p), but the chance of prediction error grows as the number of inputs becomes very large.

 

It is based on simple mathematics, of the kind we used at high school to measure the distance between two points on a graph. Some of the distance measures (formulae) that we can use for K-NN classification are:

 

  • Euclidean Distance:

    d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

  • Manhattan Distance:

    d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

  • Minkowski Distance:

    d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

For p = 1 we get the Manhattan distance, and for p = 2 we get the Euclidean distance. So we can say that the Minkowski distance is the generalized form of both the Manhattan and Euclidean distances.

 

Among these, the Euclidean distance is the most widely used.
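As a quick illustration (a minimal NumPy sketch with arbitrary example points, not part of the original post), the three distances can be computed directly, and the Minkowski distance with p = 1 and p = 2 reproduces the Manhattan and Euclidean results:

import numpy as np

x = np.array([5.1, 3.5, 1.4, 0.2])   # arbitrary example points
y = np.array([6.8, 3.4, 4.8, 2.4])

def minkowski(a, b, p):
    # generalized Minkowski distance with parameter p
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

euclidean = np.sqrt(np.sum((x - y) ** 2))
manhattan = np.sum(np.abs(x - y))

print(euclidean, minkowski(x, y, 2))   # same value
print(manhattan, minkowski(x, y, 1))   # same value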

 

Algorithm for K-NN:

  1. Load the given data file into your program.
  2. Initialize the number of neighbors to consider, i.e. 'K' (must be odd).
  3. For the new data point (tuple) to be classified:
    1. Calculate the distance between it and each data point (entry) in the given data file.
    2. Attach each distance to the corresponding data point (for example, by adding a distance column).
    3. Sort the data file by distance, from smallest to largest (ascending order).
  4. Pick the first K entries from the sorted collection.
  5. Observe the labels of the selected K entries.
  6. For classification, return the mode of the K labels; for regression, return the mean of the K labels.

Decision Boundary for K-NN:

[Figure: K-NN decision boundary on the iris data]
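The boundary above appeared only as an image in the original post. As a rough sketch of how such a plot can be produced (assuming matplotlib and scikit-learn are available, and using only the first two iris features so the regions can be drawn in 2-D), one possible approach is:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target          # sepal length and sepal width only

clf = KNeighborsClassifier(n_neighbors=9).fit(X, y)

# predict the class over a dense grid covering the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                  # colored decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")   # training points
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.title("K-NN decision boundary (k = 9)")
plt.show()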

 

Implementation of KNN from Scratch:

There are libraries in Python (such as scikit-learn) that allow a programmer to build a KNN model easily without going deep into the mathematics. But implementing KNN from scratch is a bit trickier.

Following is an example implementation of the KNN algorithm from scratch. We are going to classify the iris data into its different species by observing four features: sepal length, sepal width, petal length and petal width. We have 150 observations (tuples) in total and will build the KNN classification model on the basis of these observations.

Link to download iris dataset- iris.csv

 

import pandas as pd
import numpy as np
import operator

# loading data file into the program. give the location of your csv file
dataset = pd.read_csv("E:/input/iris.csv")
print(dataset.head()) # prints first five tuples of your data.

# function for calculating the Euclidean distance between two points
def E_Distance(x1, x2, length):
    distance = 0
    for x in range(length):
        distance += np.square(x1[x] - x2[x])
    return np.sqrt(distance)

# function defining the K-NN model

def knn(trainingSet, testInstance, k):
    distances = {}
    length = testInstance.shape[1]  # number of features in the test instance
    # distance from the test instance to every tuple in the training set
    for x in range(len(trainingSet)):
        dist = E_Distance(testInstance.iloc[0].values, trainingSet.iloc[x].values, length)
        distances[x] = dist
    # sort the training tuples by distance (ascending)
    sortdist = sorted(distances.items(), key=operator.itemgetter(1))
    # pick the row indices of the K nearest neighbors
    neighbors = []
    for x in range(k):
        neighbors.append(sortdist[x][0])
    Count = {}  # to count the class label of each neighbor
    for x in range(len(neighbors)):
        response = trainingSet.iloc[neighbors[x], -1]  # label is the last column
        if response in Count:
            Count[response] += 1
        else:
            Count[response] = 1
    # the most frequent label among the K neighbors is the prediction
    sortcount = sorted(Count.items(), key=operator.itemgetter(1), reverse=True)
    return (sortcount[0][0], neighbors)

# making test data set
testSet = [[6.8, 3.4, 4.8, 2.4]]
test = pd.DataFrame(testSet)

# assigning different values to k
k = 1
k1 = 3
k2 = 11

# supplying test data to the model
result, neigh = knn(dataset, test, k)
result1, neigh1 = knn(dataset, test, k1)
result2, neigh2 = knn(dataset, test, k2)

# printing output prediction

print(result)
print(neigh)
print(result1)
print(neigh1)
print(result2)
print(neigh2)

The output of the above program is:

  sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
Virginica
[141]
Virginica
[141, 145, 110]
Virginica
[141, 145, 110, 115, 139, 147, 77, 148, 140, 112, 144]

 

Implementation of KNN using Sklearn: 

Implementing KNN from scratch is a bit tricky; however, libraries like sklearn (scikit-learn) in Python allow a programmer to build a KNN model easily without going deep into the mathematics.

We are going to classify the iris data into its different species by observing four features: sepal length, sepal width, petal length and petal width. We have 150 observations (tuples) in total and will build the KNN classification model on the basis of these observations. Link to download iris dataset- iris.csv

Let's see, step by step, how to implement KNN using scikit-learn (sklearn).

Step-1: First of all, we load our training dataset, either from the computer's hard disk or from a URL.

import pandas as pd

# loading data file into the program. give the location of your csv file
dataset = pd.read_csv("E:/input/iris.csv")
print(dataset.head()) # prints first five tuples of your data.


Step-2: Now, we split the data into attributes/features and their corresponding labels.

X = dataset.iloc[:, :-1].values # makes a separate array X holding the attribute columns.
y = dataset.iloc[:, 4].values   # makes a separate array y holding the corresponding labels.

 

Step-3: In this step, we divide the entire dataset into two subsets: one is used for training the model and the other for testing it. We split the data 80:20, i.e. 80% of the data is used for training and the remaining 20% for testing, dividing both the attributes and the labels. We do this to measure the accuracy of our model. This process of splitting the supplied dataset into training and testing subsets in order to estimate the model's accuracy and performance is known as hold-out validation (a simple alternative to full cross-validation).

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

 

Step-4: In this step, we perform normalization/standardization. This is the process of re-scaling the data so that differences in feature scale do not distort the model's accuracy. We have used the z-score standardization technique (StandardScaler) here.

 

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)


Step-5: Now it's time to define our K-NN model. We build the classifier, fit it on the training data, and then supply the attributes of the test subset for prediction.

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=9) #defining KNN classifier for k=9.
classifier.fit(X_train, y_train) #learning process i.e. supplying training data to model
y_pred = classifier.predict(X_test) #stores prediction result in y_pred

 

Step-6: Since the test data we supplied to the model is a held-out portion of the original dataset, we know its actual labels. In this step, we compute classification metrics such as precision, recall, and F1-score, along with the confusion matrix.

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

 

Step-7: Finally, we supply new, unseen data to the model. Note that these sample points are already on the standardized scale produced in Step-4.

# testing the model by supplying new (already standardized) sample data
x_random = [[-1.56697667, 1.22358774, -1.56980273, -1.33046652],
            [-2.21742620, 3.08669365, -1.29593102, -1.07025858]]
y_random = classifier.predict(x_random)
print(y_random)

 

Let's see the output of the above program.

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
[[11  0  0]
 [ 0  9  0]
 [ 0  0 10]]


Classification metrics for test data:
              precision    recall  f1-score   support

      Setosa       1.00      1.00      1.00        11
  Versicolor       1.00      1.00      1.00         9
   Virginica       1.00      1.00      1.00        10

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

For actual test data:

['Setosa' 'Setosa']

 

Advantages and Disadvantages of K-NN:

Advantages:

  1. The K-NN algorithm is quite easy and simple to implement, as it does not involve much mathematics.
  2. We can solve both classification and regression problems using the K-NN algorithm.

Disadvantages:

  1. The algorithm becomes significantly slower as the size of the data increases, because every prediction requires computing distances to all stored training points.

 

