# Dimension Reduction Techniques (PCA vs LDA) in Machine Learning – Part 2

In the previous post (Part 1), we discussed the benefits of dimension reduction and gave an overview of dimension reduction techniques. We will now look in detail at the two key techniques for dimension reduction:

1. PCA (Principal Component Analysis)

2. LDA (Linear Discriminant Analysis)

The examples use the Wine dataset from the UCI Machine Learning Repository. This fairly small dataset has three classes of points and a thirteen-dimensional feature set (i.e. thirteen different features, such as alcohol content, malic acid and colour intensity, used to predict the type of wine).

###### How Does PCA Work?

Principal Component Analysis (PCA) is an unsupervised learning algorithm: it ignores the class labels and finds the directions (the so-called principal components) that maximize the variance in a dataset. In other words, PCA is basically a summarization of the data. For example, to determine the quality/type of a wine, we can use different features such as its pH value, alcohol content, colour and acidity. However, many of these features will be redundant (they can be derived from other features), so the model ends up being trained on unnecessary features. In short, we can get the type of wine with fewer feature variables, and this is what PCA actually does inside the box.

Note that PCA does not select a subset of features and discard the rest. Instead, it derives new features from the existing ones, each a linear combination of the original features, that best summarize the data (in our case, the wine measurements).

Previously, while deriving a formal definition, we came up with the phrase ‘maximize the variance in a dataset’. Now the question arises: what does the word ‘variance’ have to do with PCA? Remember, our main task is to define a feature set that distinguishes one type of wine from another. Imagine that you end up with a set of features that cannot distinguish the types of wine; such a set of features is useless. This kind of dimensionality reduction will hurt your model's accuracy and, in some cases, lead to under-fitting. Therefore, *PCA looks for properties that show as much variation across the dataset as possible.*

Now imagine that we compose a square matrix of numbers that describes the variance of each variable and the covariance between every pair of variables. This is the covariance matrix. It is an empirical description of the data we observe.
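To make the covariance matrix concrete, here is a minimal sketch that builds it with NumPy. It assumes the copy of the Wine data bundled with scikit-learn (`load_wine`) stands in for the UCI download, and it standardizes the features first; both choices are ours for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

# Assumption: scikit-learn's bundled Wine data (178 samples, 13 features).
X, y = load_wine(return_X_y=True)

# Standardize each feature so that features measured on large scales
# do not dominate the covariance matrix.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# rowvar=False tells NumPy that the columns (not the rows) are the variables.
cov_matrix = np.cov(X_std, rowvar=False)

print(cov_matrix.shape)  # one row and one column per feature
```

The diagonal holds the variance of each feature and the off-diagonal entries hold the covariance between pairs of features, so the matrix is square and symmetric.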

PCA works on the eigenvectors and eigenvalues of the covariance matrix, which is equivalent to fitting straight principal-component lines to the variance of the data. Why? Because the eigenvectors point along the directions in which the data varies the most, and each eigenvalue gives the amount of variance along its eigenvector. In other words, PCA determines the lines of variance in the dataset, which are called principal components: the first principal component has the maximum variance, the second has the second-largest variance, and so on.
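The idea can be sketched by hand with NumPy's eigendecomposition. This is an illustrative sketch rather than what sklearn does internally, and it again assumes scikit-learn's bundled Wine data and standardized features.

```python
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_matrix = np.cov(X_std, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Reorder so the direction with the largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the data onto the first two principal components.
X_projected = X_std @ eigenvectors[:, :2]

# The variance along each projected axis equals the matching eigenvalue.
projected_variance = X_projected.var(axis=0, ddof=1)
print(projected_variance, eigenvalues[:2])
```

The two printed arrays should agree, which is exactly the statement that the first principal component carries the largest variance, the second the next largest, and so on.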

[Do refer to our next blog, Machine Learning Cheatsheet, for more details on the terms eigenvalues / eigenvectors]

Now let’s jump to the implementation of PCA using sklearn on the wine dataset and understand how the data is transformed from a higher dimensionality to a lower one.

Let’s perform PCA on the wine dataset and analyse the graph:

1. Load the wine dataset into memory and extract the features.

2. Scale the dataset. Use standardization (for example, sklearn’s StandardScaler) to give each feature zero mean and unit standard deviation; min-max normalization would instead rescale each feature to a fixed range such as [0, 1].

3. Apply PCA with the built-in PCA class in sklearn:

a. from sklearn.decomposition import PCA

b. pca = PCA(n_components=2)

c. X_feature_reduced = pca.fit(X).transform(X)

Here, ‘X’ is the scaled training dataset and ‘n_components’ is the number of principal components we want to derive from our existing feature set.
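Put together, the three steps might look like the sketch below. It assumes scikit-learn's bundled `load_wine` as the data source and uses `StandardScaler` for the scaling step, since that is the scaler that yields zero mean and unit standard deviation; the variable name `X_raw` is ours.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Step 1: load the 13 features and the class labels (labels are unused by PCA).
X_raw, y = load_wine(return_X_y=True)

# Step 2: standardize every feature to zero mean and unit standard deviation.
X = StandardScaler().fit_transform(X_raw)

# Step 3: reduce the 13 scaled features to two principal components.
pca = PCA(n_components=2)
X_feature_reduced = pca.fit(X).transform(X)

print(X_raw.shape, "->", X_feature_reduced.shape)
```

`pca.fit(X).transform(X)` can also be written as `pca.fit_transform(X)`; the two forms give the same projection here.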

So with sklearn, PCA is like a black box (whose inner working is explained above): you give the scaled feature set as input to sklearn’s PCA and get the principal components as output, which can then be used as input to training algorithms.

The graph above plots the data points using two of the original features versus using two principal components, PCA1 and PCA2 (which are basically a summarization of all the features in our dataset). This reveals one more use of PCA: visualizing the dataset along with its class labels.

If you look at the graph, you can visualize the whole dataset with only two feature variables instead of thirteen. Although the two components do not differentiate well among the classes, they reduce the feature set while retaining as much of the variance as two components can hold.
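One way to check how much information the two components actually keep is the `explained_variance_ratio_` attribute of a fitted sklearn PCA object. The sketch below assumes the same bundled Wine data and standardization as before.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X_raw)

pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each principal component.
per_component = pca.explained_variance_ratio_
retained = per_component.sum()

print("Variance retained per component:", per_component)
print("Total variance retained by two components:", retained)
```

If the retained fraction is too low for your use case, increasing `n_components` trades a larger feature set for less information loss.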

###### How Does LDA Work?

Linear Discriminant Analysis (LDA) is a supervised algorithm, as it takes the class labels into consideration. It is a way to reduce dimensionality while preserving as much of the class-discrimination information as possible.

LDA helps you find the boundaries around clusters of classes. It projects your data points onto a line so that the clusters are as separated as possible, with the points of each cluster lying close to its centroid.

So the question arises: how are these clusters defined, and how do we get the reduced feature set in the case of LDA?

Basically, LDA finds the centroid of the data points of each class. For example, with thirteen different features, LDA finds the centroid of each class in the thirteen-dimensional feature space. On the basis of these centroids, it determines a new dimension, which is nothing but an axis that should satisfy two criteria:

1. Maximize the distance between the centroids of the classes.

2. Minimize the variation within each class (which LDA calls scatter and is represented by $s^2$).

Therefore, for two classes a and b, the gist is:

$$\frac{(\text{mean}_a - \text{mean}_b)^2}{s_a^2 + s_b^2} = \frac{\text{ideally large}}{\text{ideally small}}$$

Note: Here ‘mean’ is nothing but the centroid of the class, and variation is nothing but the spread of the class’s data around that centroid. So, if the variation within each class is small, there is less overlap between the classes and the separation between them is maximized.
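The ratio of squared centroid distance to summed within-class scatter can be computed directly for any candidate axis. The sketch below is an illustration under our own assumptions: it takes two of the three wine classes, uses the sample variance of the projected points as the scatter measure, and compares a single original feature against Fisher's direction (the within-class covariance inverse applied to the difference of centroids). The helper `fisher_ratio` is a hypothetical name, not a library function.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler


def fisher_ratio(projected_a, projected_b):
    """Squared gap between projected means over the summed projected variances."""
    mean_gap = (projected_a.mean() - projected_b.mean()) ** 2
    scatter = projected_a.var(ddof=1) + projected_b.var(ddof=1)
    return mean_gap / scatter


X_raw, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X_raw)
class_a, class_b = X[y == 0], X[y == 1]

# Candidate axis 1: just the first original feature.
axis_single = np.zeros(X.shape[1])
axis_single[0] = 1.0

# Candidate axis 2: Fisher's direction, which maximizes the ratio for two classes.
within = np.cov(class_a, rowvar=False) + np.cov(class_b, rowvar=False)
centroid_gap = class_a.mean(axis=0) - class_b.mean(axis=0)
axis_fisher = np.linalg.solve(within, centroid_gap)

ratio_single = fisher_ratio(class_a @ axis_single, class_b @ axis_single)
ratio_fisher = fisher_ratio(class_a @ axis_fisher, class_b @ axis_fisher)

print("Ratio along one original feature:", ratio_single)
print("Ratio along Fisher's direction:  ", ratio_fisher)
```

The second ratio is never smaller than the first, because Fisher's direction is the axis that maximizes this criterion for the two chosen classes.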

So whichever new axis best satisfies these two criteria forms the new dimension of the dataset.

Congratulations, you now understand the internal working of LDA.

Now let’s jump to the implementation of LDA using sklearn on the wine dataset and see how the data is transformed from a higher dimensionality to a lower one.

Let’s perform LDA on the wine dataset and analyse the graph:

1. Load the wine dataset into memory and extract the features.

2. Scale the dataset. Use standardization (for example, sklearn’s StandardScaler) to give each feature zero mean and unit standard deviation; min-max normalization would instead rescale each feature to a fixed range such as [0, 1].

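3. Apply LDA with the built-in LinearDiscriminantAnalysis class in sklearn:

a. from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

b. lda = LDA(n_components=2)

c. X_feature_reduced = lda.fit(X, y).transform(X)

Here, ‘y’ holds the class labels, which LDA needs because it is a supervised technique.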
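As a runnable sketch of these steps: in current scikit-learn releases the LDA class is `LinearDiscriminantAnalysis` in `sklearn.discriminant_analysis`, and because LDA is supervised, `fit` must receive the class labels as well. The data source (`load_wine`) and the use of `StandardScaler` are our assumptions, as before.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Step 1: load the 13 features and the three class labels.
X_raw, y = load_wine(return_X_y=True)

# Step 2: standardize every feature to zero mean and unit standard deviation.
X = StandardScaler().fit_transform(X_raw)

# Step 3: LDA can produce at most (number of classes - 1) components,
# which is two for the three wine classes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_feature_reduced = lda.fit(X, y).transform(X)

print(X_raw.shape, "->", X_feature_reduced.shape)
```

Asking for more than two components on this dataset would raise an error, since there are only three classes.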

Below is the graph of the two LDA components (LDA1 and LDA2) obtained by applying LDA to the wine dataset (plotted as a matplotlib scatter plot).

In the graph, you can visualize the whole dataset with the classes clearly differentiated. This is the major difference between PCA and LDA.
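Besides looking at the plots, the difference can be checked numerically. The sketch below is one possible comparison, not a definitive benchmark: it reduces the data to two dimensions with each technique and measures the cross-validated accuracy of a k-nearest-neighbours classifier on the result. The classifier, `n_neighbors=5`, the five folds and the helper name `mean_cv_accuracy` are all our choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)


def mean_cv_accuracy(reducer):
    """Scale, reduce to 2-D with the given reducer, classify, and average 5-fold accuracy."""
    model = make_pipeline(StandardScaler(), reducer, KNeighborsClassifier(n_neighbors=5))
    return cross_val_score(model, X, y, cv=5).mean()


pca_accuracy = mean_cv_accuracy(PCA(n_components=2))
lda_accuracy = mean_cv_accuracy(LinearDiscriminantAnalysis(n_components=2))

print("Accuracy on 2 PCA components:", pca_accuracy)
print("Accuracy on 2 LDA components:", lda_accuracy)
```

Putting the scaler and the reducer inside the pipeline means they are refitted on each training fold, so the held-out fold never influences the projection.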

To conclude, PCA tends to perform better when the number of samples per class is small, whereas LDA works better on large datasets with multiple classes, where class separability is an important factor while reducing dimensionality.

For any queries or questions connect with us on marketing@vfirst.com

**Published on:** 9 February, 2018
