Implementation of KNN (using scikit-learn, NumPy and pandas)

by keshav

The KNN (k-nearest neighbors) classifier is one of the simplest supervised machine learning algorithms to implement, and it still performs well on many problems. It can be used for both classification and regression. Implementing KNN from scratch can be a bit tricky; however, Python libraries such as sklearn allow a programmer to build a KNN model easily without going deep into the mathematics.
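
To make the idea concrete, here is a minimal from-scratch sketch of KNN classification using only NumPy. It is not how sklearn implements it internally; the function name knn_predict and the toy data are made up for illustration, and it assumes Euclidean distance with a simple majority vote.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_query, k=3):
    """Label each row of X_query by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    for point in X_query:
        # Euclidean distance from the query point to every training point
        distances = np.sqrt(((X_train - point) ** 2).sum(axis=1))
        # indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # most common label among those neighbours
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# toy data (hypothetical): two well-separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = ["a", "a", "a", "b", "b", "b"]
toy_predictions = knn_predict(X_toy, y_toy, [[1.1, 1.0], [5.1, 5.0]], k=3)
print(toy_predictions)  # ['a', 'b']
```

The loop recomputes every distance for every query, which is fine for 150 rows but slow on large data; that is one reason to prefer a library implementation.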

We are going to classify the iris data into its species using four features: sepal length, sepal width, petal length and petal width. We have 150 observations (tuples) in total, and we will build a KNN classification model from them. Link to download the iris dataset: iris.csv

Let's see step-by-step how to implement KNN using scikit learn(sklearn).

Step-1: First of all, we load/import our data set, either from the computer's hard disk or from a URL.

import pandas as pd

# Load the data file into the program. Give the location of your CSV file.
dataset = pd.read_csv("E:/input/iris.csv")
print(dataset.head())  # prints the first five tuples of your data

Step-2: Now, we split the data column-wise into attributes/features and their corresponding labels.

X = dataset.iloc[:, :-1].values  # all columns except the last: array X holds the attributes
y = dataset.iloc[:, -1].values   # last column: array y holds the corresponding labels

Step-3: In this step, we divide our entire dataset into two subsets. One of them is used for training our model and the other for testing it. We divide our data 80:20, i.e. a randomly chosen 80% of the rows becomes training data and the remaining 20% becomes test data (train_test_split shuffles the rows by default, so it is not simply the first 80%). We divide both attributes and labels. We do this to measure the accuracy of our model on data it has not seen during training. Splitting the supplied dataset once into training and testing subsets like this is called hold-out validation; cross-validation is the related technique that repeats the split over several folds and averages the results.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
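
Because the split is random, two runs of the snippet above will generally produce different subsets. The sketch below shows one way to make the split reproducible and keep the class proportions equal in both subsets, using the random_state and stratify parameters of train_test_split. The arrays X_demo and y_demo are stand-in data made up for illustration (150 rows, three classes of 50, like iris), not the contents of iris.csv, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in data (hypothetical): 150 rows with 4 features, 3 classes of 50 rows each
X_demo = np.arange(150 * 4).reshape(150, 4)
y_demo = np.repeat(["Setosa", "Versicolor", "Virginica"], 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,     # 20% of the rows go to the test subset
    random_state=42,    # fixed seed so the same split is produced on every run
    stratify=y_demo,    # keep the class proportions the same in both subsets
)

print(X_tr.shape, X_te.shape)   # (120, 4) (30, 4)
labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, counts)))  # 10 test rows per class
```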

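Step-4: In this step, we perform normalization/standardization. It is the process of re-scaling our data so that every feature has a comparable range. This matters for KNN because it is distance-based: a feature measured on a larger scale would otherwise dominate the distance and hurt the accuracy of the model. We have used the z-score normalization technique here. Note that the scaler is fitted on the training subset only and then applied to both subsets, so no information from the test data leaks into training.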

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
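
For intuition, z-score normalization replaces each value x with (x - mean) / std, computed per feature. The sketch below reproduces that arithmetic by hand with NumPy on a few made-up measurements; the array names are hypothetical and it uses the population standard deviation (ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

# made-up measurements for illustration (rows = samples, columns = features)
train = np.array([[5.0, 3.0], [6.0, 2.0], [7.0, 4.0]])
test = np.array([[6.0, 3.0]])

mean = train.mean(axis=0)  # per-feature mean, from the training rows only
std = train.std(axis=0)    # per-feature population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # the test rows reuse the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 for every feature
print(train_scaled.std(axis=0))   # 1 for every feature
print(test_scaled)                # [[0. 0.]] because this row equals the training mean
```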

Step-5: Now it's time to define our KNN model. We create the model, fit it on the training subset, and supply the attributes of the test subset for prediction.

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=9)  # define a KNN classifier with k=9
classifier.fit(X_train, y_train)  # learning process, i.e. supplying training data to the model
y_pred = classifier.predict(X_test)  # stores the prediction result in y_pred

Step-6: Since the test data we've supplied to the model is a held-out portion of the original labeled dataset, we have the actual labels for it. In this step, we compare the predictions with those labels and compute some classification metrics such as precision, recall and f1-score.

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
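
The numbers in the classification report can be derived directly from the confusion matrix. The sketch below does this by hand with NumPy, assuming sklearn's convention that rows are actual classes and columns are predicted classes. The matrix values are made up for illustration and are not the output of the program above.

```python
import numpy as np

# rows = actual class, columns = predicted class (hypothetical values)
cm = np.array([[10, 0, 0],
               [0, 8, 2],
               [0, 1, 9]])

true_positives = np.diag(cm)                 # correctly classified samples per class
precision = true_positives / cm.sum(axis=0)  # column sums = samples predicted as each class
recall = true_positives / cm.sum(axis=1)     # row sums = samples actually in each class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(precision.round(2))
print(recall.round(2))
print(f1.round(2))
print(accuracy)
```

Here the second class has 8 of its 10 samples recovered (recall 0.8), while 8 of the 9 samples predicted as that class are correct (precision about 0.89).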

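Step-7: Supply actual test data to the model. The values below are already standardized, because the model was trained on scaled features; raw measurements would first have to be passed through scaler.transform.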

# testing the model by supplying random data
x_random = [[-1.56697667, 1.22358774, -1.56980273, -1.33046652],
            [-2.21742620, 3.08669365, -1.29593102, -1.07025858]]
y_random = classifier.predict(x_random)
print(y_random)
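
In practice a new flower arrives as raw centimetre measurements, not as standardized values. The sketch below shows one way to handle that: apply the same fitted scaler to the raw row before calling predict. To stay self-contained it uses the copy of iris bundled with sklearn (load_iris) instead of iris.csv, so the species names are lower-case, and for brevity it fits on all 150 rows. The names demo_scaler, demo_classifier and raw_flower are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# bundled copy of the iris data (assumption: stands in for iris.csv)
iris = load_iris()
X_all = iris.data
y_all = iris.target_names[iris.target]  # species names instead of 0/1/2

# fit the scaler and the classifier on standardized features
demo_scaler = StandardScaler().fit(X_all)
demo_classifier = KNeighborsClassifier(n_neighbors=9)
demo_classifier.fit(demo_scaler.transform(X_all), y_all)

# raw measurements in cm: sepal length, sepal width, petal length, petal width
raw_flower = [[5.1, 3.5, 1.4, 0.2]]

# scale the raw row with the SAME scaler before predicting
new_prediction = demo_classifier.predict(demo_scaler.transform(raw_flower))
print(new_prediction)  # ['setosa']
```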

Let's see the output of the above program. Because the train/test split is random, the exact numbers can differ from run to run.

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
[[11  0  0]
 [ 0  9  0]
 [ 0  0 10]]

Classification metrics for test data:
              precision    recall  f1-score   support

      Setosa       1.00      1.00      1.00        11
  Versicolor       1.00      1.00      1.00         9
   Virginica       1.00      1.00      1.00        10

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

For actual test data:

['Setosa' 'Setosa']