Suppose we are given a dataset whose variables have different scales (e.g. transaction amount is in the 1000s, while quantity is less than 10). Variables with a higher magnitude (e.g. transaction amount) can then be given more weight simply because of their scale. This can distort the calculation of Euclidean distance or lead to unreliable coefficients when building a model.
For these reasons some algorithms (e.g. hierarchical clustering, the K-means algorithm etc.) need the variables to be standardised or normalised, i.e. brought onto the same scale so that no variable dominates because of its range.
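To make the effect on Euclidean distance concrete, here is a minimal sketch. The two observations are made-up values (transaction amount, quantity) chosen only for illustration; they are not taken from any dataset used in this tutorial.

```python
import numpy as np

# Hypothetical observations: [transaction amount, quantity]
a = np.array([1000.0, 2.0])
b = np.array([1200.0, 9.0])

# Squared difference per variable and the resulting Euclidean distance
squared_diff = (a - b) ** 2
distance = np.sqrt(squared_diff.sum())

# Share of the squared distance contributed by each variable
share = squared_diff / squared_diff.sum()

print(distance)
print(share)
```

Almost all of the squared distance comes from the transaction amount, even though the two quantities differ a lot relative to their own range.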
There are 2 methods of scaling:
Standardisation
Normalisation
In this tutorial we will understand both of these concepts and will learn how to implement them in Python.
Standardisation:
To standardise a variable, we first calculate its mean and standard deviation. Then we subtract the mean from each observation and divide the difference by the standard deviation. Statistically, the resulting values are called Z scores. After standardising, the Z scores have mean 0 and variance 1. Note that standardising only shifts and rescales the data; it does not change the shape of the distribution, so the Z scores are normally distributed only if the original variable was.
If our series is X, with mean µ and standard deviation σ, then the standardised series Z is given by:
Z = (X - µ) / σ
For example, let our series be X = 1, 2, 3, 4, 5.
Here µ = 3 and σ = 1.41 (the population standard deviation, √2)
Then Z = (X - 3) / 1.41 = -1.41, -0.71, 0, 0.71, 1.41
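The worked example above can be reproduced with a few lines of NumPy. This is only a sketch of the manual calculation; the variable names `mu` and `sigma` are chosen here for illustration. It assumes the population standard deviation, which is NumPy's default (`ddof=0`).

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)

mu = X.mean()      # mean of the series
sigma = X.std()    # population standard deviation (ddof=0 by default)

# Z score: subtract the mean, then divide by the standard deviation
Z = (X - mu) / sigma

print(mu, np.round(sigma, 2))
print(np.round(Z, 2))
```

The standardised series has mean 0 and standard deviation 1, which can be checked with `Z.mean()` and `Z.std()`.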
Normalisation:
Some algorithms (e.g. Convolutional Neural Networks) tend to work better when the variables take values between 0 and 1. For this we normalise the observations by:
Z = (X - min(X)) / (max(X) - min(X))
Let us say we have a series X = 1, 2, 3, 4, 5.
min(X) = 1, and max(X) = 5
Then Z = (X - 1) / (5 - 1) = 0, 0.25, 0.5, 0.75, 1
Thus if a series is normalised then its minimum value will be 0 and maximum will be 1.
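The same normalisation can be sketched by hand in NumPy. The names `x_min` and `x_max` are chosen here for illustration, and the sketch assumes the series is not constant (otherwise the denominator would be 0).

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)

x_min = X.min()
x_max = X.max()

# Min-max normalisation: shift by the minimum, divide by the range
Z = (X - x_min) / (x_max - x_min)

print(Z)
```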
Python Code:
Let us first import all the necessary libraries:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler,MinMaxScaler
For this tutorial we will use the wine dataset that ships with scikit-learn. With the following code we load the dataset and save its feature matrix in X:
wine = datasets.load_wine()
X = wine.data
X
Output:
array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00, 1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00, 1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00, 1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00, 8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00, 8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00, 5.600e+02]])
Standardisation Using StandardScaler
We first standardise our variables using StandardScaler( ).
With the code below we initialise a StandardScaler as scaler.
scaler = StandardScaler()
Next we fit the scaler on the original X and transform it. The result is stored in X_scaled.
X_scaled = scaler.fit_transform(X)
X_scaled
Output:
array([[ 1.51861254, -0.5622498 ,  0.23205254, ...,  0.36217728,  1.84791957,  1.01300893],
       [ 0.24628963, -0.49941338, -0.82799632, ...,  0.40605066,  1.1134493 ,  0.96524152],
       [ 0.19687903,  0.02123125,  1.10933436, ...,  0.31830389,  0.78858745,  1.39514818],
       ...,
       [ 0.33275817,  1.74474449, -0.38935541, ..., -1.61212515, -1.48544548,  0.28057537],
       [ 0.20923168,  0.22769377,  0.01273209, ..., -1.56825176, -1.40069891,  0.29649784],
       [ 1.39508604,  1.58316512,  1.36520822, ..., -1.52437837, -1.42894777, -0.59516041]])
Now let us check the mean of every variable in X_scaled. Each one should be close to 0 (the values are rounded, so a tiny negative mean prints as -0.0).
for i in range(X_scaled.shape[1]):
    print(np.round(X_scaled[:, i].mean()))
Output:
0.0
0.0
-0.0
-0.0
-0.0
-0.0
0.0
-0.0
-0.0
-0.0
0.0
0.0
-0.0
Now let us check the standard deviation of every variable in X_scaled. Each one should be close to 1.
for i in range(X_scaled.shape[1]):
    print(np.round(X_scaled[:, i].std()))
Output:
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
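As a cross-check, the sketch below compares the StandardScaler result with the manual formula Z = (X - µ) / σ applied column by column. It reloads the wine data so that it runs on its own; `X_manual`, `col_means` and `col_stds` are names introduced here for illustration. It also uses `axis=0` to compute all column statistics at once instead of looping.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# Standardise with scikit-learn
X_scaled = StandardScaler().fit_transform(X)

# Standardise by hand: per-column mean and population standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# True if both approaches agree up to floating-point error
matches = np.allclose(X_scaled, X_manual)
print(matches)

# Column-wise mean and standard deviation without an explicit loop
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```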
Note that the overall minimum and maximum values in our dataset are now small in magnitude:
X_scaled.min()
Output: -3.6791622340370145
X_scaled.max()
Output: 4.371372139554767
Normalisation Using MinMaxScaler
We now normalise our variables using MinMaxScaler( ).
With the code below we initialise a MinMaxScaler as scaler.
scaler = MinMaxScaler()
Next we fit the scaler on the original X and transform it. The result is stored in X_scaled.
X_scaled = scaler.fit_transform(X)
X_scaled
Output:
array([[0.84210526, 0.1916996 , 0.57219251, ..., 0.45528455, 0.97069597, 0.56134094],
       [0.57105263, 0.2055336 , 0.4171123 , ..., 0.46341463, 0.78021978, 0.55064194],
       [0.56052632, 0.3201581 , 0.70053476, ..., 0.44715447, 0.6959707 , 0.64693295],
       ...,
       [0.58947368, 0.69960474, 0.48128342, ..., 0.08943089, 0.10622711, 0.39728959],
       [0.56315789, 0.36561265, 0.54010695, ..., 0.09756098, 0.12820513, 0.40085592],
       [0.81578947, 0.66403162, 0.73796791, ..., 0.10569106, 0.12087912, 0.20114123]])
Now, if we take the minimum value of each column, it should be 0.
for i in range(X_scaled.shape[1]):
    print(np.round(X_scaled[:, i].min()))
Output:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
Similarly, the maximum value of each column should be 1.
for i in range(X_scaled.shape[1]):
    print(np.round(X_scaled[:, i].max()))
Output:
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
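The MinMaxScaler result can likewise be compared with the manual formula Z = (X - min(X)) / (max(X) - min(X)) applied to each column. This is a sketch that reloads the wine data so it runs on its own and assumes the scaler's default feature range of 0 to 1; `X_manual`, `col_min` and `col_max` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler

X = load_wine().data

# Normalise with scikit-learn (default feature_range is (0, 1))
X_scaled = MinMaxScaler().fit_transform(X)

# Normalise by hand using per-column minimum and maximum
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

# True if both approaches agree up to floating-point error
matches = np.allclose(X_scaled, X_manual)
print(matches)

# Column-wise minimum and maximum without an explicit loop
print(X_scaled.min(axis=0))
print(X_scaled.max(axis=0))
```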