K-Nearest Neighbours is an extremely simple algorithm, where the prediction for a new instance
is the plurality class among its k closest neighbours.
To made it easy, let us say you want to decide whether you want to go on a holiday and are utterly confused between Italy, and Paris. You ask 100 people abou
t these options (I know it is unrealistic, but still assume), but then you pick up your closest 5 people, who have similar preference as you, and then decide on the basis of their majority opinion you decide where to go.
Theoretically speaking, for a new instance Xi its distance (Euclidean, Manhattan, Minkowski etc.) is calculated with all the points in the training data. k points having minimum distance with Xi are considered its 'closest neighbours'. The class having majority in these k neighbours is considered as the predicted output for Xi.
Example
In the following example we have 15 observations and 3 variables X1, X2, and X3. Decision is our dependent variable : Italy or Paris.
Now we have a new instance, and thus we calculate the Euclidean distance of this new instance with each of the 15 points.
Rank the observations in ascending order (last column).
Choose closes 'k' neighbours with least Euclidean distance (i.e., say for k = 5, we chose ranks 1-5)
Get the value of outcome variable for these 'k' neighbours (highlighted in orange).
With majority voting of these 'k = 5' neighbours, we saw that 3 observations have the value Italy and 2 have Paris. Thus, with majority voting we have chosen the prediction for new instance as Italy
Standardising the variables
Since the distance can be influenced by difference in the scales of the variables (say one is in thousands, and other variable always taking the values less than 100), thus the variables are standardised to have mean 0 and variance 1 before KNN is implemented.
In the above example, all of our variables had the values between 1 to 8, thus we did not standardise it, but in real life, we standardise the data before applying KNN.
How to choose the best value of 'k'
The value of 'k' can highly impact the outcome decision, thus, we need to hypertune it. We can find the best value of 'k' by using K-Fold cross validation.
For more information on K-Fold cross validation you can refer to this article : K-Fold cross validation made easy.
Python Code
To run KNN in Python we will use iris dataset, which is inbuilt in Python. Iris dataset comprises of data for 150 flowers belonging to 3 different species: Setosa, Versicolor and Virginica. For these 150 flowers their Sepal Length, Sepal Width, Petal Length and Petal Width information is available.
Let us firstly load pandas library
import pandas as pd
Now we will load iris dataset from sklearn library
from sklearn.datasets import load_iris
iris = load_iris()
Following are the variable names in iris dataset
iris.feature_names
Output:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
Now we are storing the independent variables from iris dataset in X and dependent variable in y
X = iris.data
y = iris.target
We can see from the shape that X has 150 rows and 4 columns
X.shape
Output:
(150, 4)
We can see the number of occurences of different species:
pd.Series(y).value_counts()
Output:
2 50 1 50 0 50 dtype: int64
Now we are splitting the data in training set and test set. Note that we will build our model using the training set and we will use test set to check our performance of the algorithm. We are splitting our data into 80% training set and 20% test set. We can see that training set has got 120 rows and test set has 30 rows.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape);
print(X_test.shape);
print(y_train.shape);
print(y_test.shape)
Output:
(120, 4) (30, 4) (120,) (30,)
Let us build our KNN model for k = 5
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
We are not fitting KNN to training data
knn.fit(X_train,y_train)
Using predict( ) function we are making predictions for both training and test set
pred_train = knn.predict(X_train)
pred_test = knn.predict(X_test)
To test the performance of KNN for k = 5 on training set and test set, we are using confusion matrix and accuracy.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
We are now generating confusion matrix for training set. We can see that there are 4 misclassifications in training set.
confusion_matrix(y_train, pred_train)
Output: array([[40, 0, 0], [ 0, 38, 3], [ 0, 1, 38]], dtype=int64)
Now we are calculating the accuracy on training set. We are getting almost 97% accuracy on our training data.
accuracy_score(y_train,pred_train)
Output: 0.9666666666666667
Now we are generating the confusion matrix for test set. We can see that there are no misclassfications in our data.
confusion_matrix(y_test,pred_test)
Output:
array([[10, 0, 0], [ 0, 9, 0], [ 0, 0, 11]], dtype=int64)
Now we are calculating the accuracy on testset. We are getting 100% accuracy on our test data.
accuracy_score(y_test,pred_test)
Output: 1
KNN using cross validation
Above code holds true for a single value of 'k'. To find the best value of 'k' we can use cross validation.
from sklearn import metrics
from sklearn.model_selection import cross_val_score
We are specifying 5-fold cross-validation with K=5 for KNN
knn = KNeighborsClassifier(n_neighbors=5)
Now we are fitting the KNN for n_neighbours = 5 on our training set. Following are the accuracy numbers for each of the 5 folds.
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
print(scores)
Output: [0.95833333 0.95833333 0.83333333 1. 0.95833333]
We are now calculating the average accuracy for all the 5 folds of training data
print(scores.mean())
Output: 0.9416666666666667
To find for an optimal value of 'k' for KNN, we are creating a list of values for 'k' = 1 - 30. and then we will calculate the average score for 5 fold cross validation for each of the value of 'k'. The value of' 'k' which will have highest average accuracy across 5 folds will be the optimal value for 'k'.
k_range = list(range(1, 31))
k_scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
k_scores.append(scores.mean())
print(k_scores)
On plotting the average accuracy across 5-folds, for each of the 'k' we can see that for k = 3, we are getting the best average accuracy.
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(k_range, k_scores)
plt.xlabel('k')
plt.ylabel('Cross-Validated Average Accuracy')
KNN Using Grid search
The most optimal way to find the best value of 'k' is to use GridSearch and not a for loop for cross validation.
from sklearn.model_selection import GridSearchCV
We are now defining the range of 'k' = 1 to 30 as a parameter for our grid search
k_range = list(range(1, 31))
print(k_range)
We are now creating a parameter grid, where n_neighbors is one of the parameters in KNN algorithm, and we are specifying that our k_range (1-30) would be the values used for finding the best 'k'.
param_grid = {'n_neighbors' : k_range}
print(param_grid)
We are now running the grid search for 5-fold cross validation.
grid = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
The best estimator using grid search is k = 3, leading to a best score of 95.9%
grid.best_estimator_
Output: KNeighborsClassifier(n_neighbors=3)
grid.best_score_
Output: 0.9583333333333334
grid.best_params_
Output: {'n_neighbors': 3}
Now we are fitting out final KNN model using our best estimator i.e., KNN for k = 3
knn = grid.best_estimator_
knn.fit(X_train,y_train)
We are now making the predictions on both training and test set
pred_train = knn.predict(X_train)
pred_test = knn.predict(X_test)
We can see that for k=3 we have a test set accuracy of 1
accuracy_score(y_test,pred_test)
Output: 1