Principal Components Analysis

We use here an example of decathlon data which refers to athletes' performance during two athletic meetings.
You can load the data set as a text file here.

Presentation of the data

The data set is made of 41 rows and 13 columns.
The columns 1 to 12 are continuous variables: the first ten columns correspond to the performance of the athletes for the 10 events of the decathlon and the columns 11 and 12 correspond respectively to the rank and the points obtained. The last column is a categorical variable corresponding to the athletic meeting (2004 Olympic Game or 2004 Decastar).

Dataset decathlon click to view

To load the package FactoMineR and the data set, write the following line code: library(FactoMineR)
data(decathlon)

Objectives

PCA allows to describe a dataset, to summarize a dataset, to reduce the dimensionality.
We want to perform a PCA on all the individuals of the data set to answer several questions:

  1. Individuals' study (athletes' study): two athletes will be close to each other if their results to the events are close. We want to see the variability between the individuals. Are there similarities between individuals for all the variables? Can we establish different profiles of individuals? Can we oppose a group of individuals to another one?
  2. Variables' study (performances' study): We want to see if there are linear relationships between variables. The two objectives are to summarize the correlation matrix and to look for synthetic variables: can we resume the performance of an athlete by a small number of variables?
  3. Link between this two studies: can we characterize groups of individuals by variables?

PCA

We are going to study athlete's profiles according to their performances only. The active variables will thus be only those which concern the ten events of the decathlon.
The other variables ("Rank", "Points" and "Competition") do not belong to this athletes' profiles and use an information already given by the other variables (in the case of "Rank" and "Points") but it is interesting to confront them to the principal components. We will use them as supplementary variables.

Here the variables are not measured in the same units. We have to scale them in order to give the same influence to each one.

Active individuals and variables

We type the following line code to perform a PCA on all the individuals, using only the active variables, i.e. the first ten: res.pca = PCA(decathlon[,1:10], scale.unit=TRUE, ncp=5, graph=T) #decathlon: the data set used
#scale.unit: to choose whether to scale the data or not
#ncp: number of dimensions kept in the result
#graph: to choose whether to plot the graphs or not

Principal Component Analysis: individuals scatterplot click to view Principal Component Analysis: variables scatterplot

The first two dimensions resume 50% of the total inertia (the inertia is the total variance of dataset i.e. the trace of the correlation matrix).

The variable "X100m" is correlated negatively to the variable "long.jump". When an ahtlete performs a short time when running 100m, he can jump a big distance. Here one has to be careful because a low value for the variables "X100m", "X400m", "X110m.hurdle" and "X1500m" means a high score: the shorter an athlete runs, the more points he scores.

The first axis opposes athletes who are "good everywhere" like Karpov during the Olympic Games between those who are "bad everywhere" like Bourguignon during the Decastar. This dimension is particularly linked to the variables of speed and long jump which constitute a homogeneous group.

The second axis opposes athletes who are strong (variables "Discus" and "Shot.put") between those who are not.
The variables "Discus", "Shot.put" and "High.jump" are not much correlated to the variables "X100m", "X400m", "X110m.hurdle" and "Long.jump". This means that strength is not much correlated to speed.

At the end of this first approach, we can divide the factorial plan into four parts: fast and strong athletes (like Sebrle), slow athletes (like Casarsa), fast but weak athletes (like Warners) and slow and weak (relatively speaking!) athletes (like Lorenzo).

supplementary variables

Supplementary variables have no influence on the principal components of the analysis. They are going to help to interpret the dimensions of variability.

We can add two different kinds of variables: continuous ones and categorical ones.

As supplementary continuous variables, we would like to add the variables "Rank" and "Points". We type the following line code: res.pca = PCA(decathlon[,1:12], scale.unit=TRUE, ncp=5, quanti.sup=c(11: 12), graph=T) #decathlon: the data set used
#scale.unit: to choose whether to scale the datas or not
#ncp: number of dimensions kept in the result
#quanti.sup: vector of the indexes of the quantitative supplementary variables
#graph: to choose whether to plot the graphs or not

Principale Component Analysis: variables scatterplot with supplementary continuous variables click to view

The winners of the decathlon are those who scored the most (or those whose rank is low).
The variables the most linked to the number of points are the variables which refer to the speed ("X100m", "X110m.hurdle", "X400m") and the long jump. On the contrary, "Pole-vault" and "X1500m" do not have a big influence on the number of points. Athletes who are strong for these two events are not favoured.

As a supplementary categorical variable, we add the variable "Competition": res.pca = PCA(decathlon, scale.unit=TRUE, ncp=5, quanti.sup=c(11: 12), quali.sup=13, graph=T) #decathlon: the data set used
#scale.unit: to choose whether to scale the datas or not
#ncp: number of dimensions kept in the result
#quanti.sup: vector of the indexes of the quantitative supplementary variables
#quali.sup: vector of the indexes of the qualitative supplementary variables
#graph: to choose whether to plot the graphs or not

The categories' centres of gravity of this new variable appear on the graph of the individuals. They are located at the barycentre of the individuals who took them and they represent an average individual.
We can also colour the individuals according to the categories' centres of gravity: plot.PCA(res.pca, axes=c(1, 2), choix="ind", habillage=13) #res.pca: the result of a PCA
#axes: the axes to plot
#choix: the graph to plot ("ind" for the individuals, "var" for the variables)
#habillage: to choose the colours of the individuals: no colour ("none"), a colour for each individual ("ind") or to colour the individuals according to a categorical variable (give the number of the qualitative variable)

Principal Component Analysis: individuals scatterplot with supplementary categoricall variable click to view

When looking at the points representing "Decastar" and "Olympic Games", we notice that this last one has higher coordinates on the first axis than the first one. This shows an evolution in the performances of the athletes. All the athletes who participated to the two competitions have then slightly better results for the Olympic Games.
However, there is no difference between the points "Decastar" and "Olympic Games" for the second axis. This means that the athletes have improved their performance but did not change profile (except for Zsivoczky who went from slow and strong during the Decastar to fast and weak during the Olympic Games).

We can see that the points which represent the same individual are in the same direction. For example, Sebrle got good results in both competition but the point which represents his performance during the O.G. is more extreme. Sebrle got more points during the O.G. than during the Decastar.

Two interpretations can be made:

  • Athletes who participate in the O.G. are better than those who participate in the Decastar
  • Athletes do their best for the O.G. (more motivated, more trained)

The dimdesc() function allows to describe the dimensions.
We can describe the dimensions given by the variables by writing: dimdesc(res.pca, axes=c(1,2)) #res.pca: the result of a PCA
#axes: the axes chosen

Principal Component Analysis: description of the dimensions; first axis click to view Principal Component Analysis: description of the dimensions; second axis

The dimdesc() function calculates the correlation coefficient between a variable an a dimension and performs a significance test.
These tables give the correlation coefficient and the p-value of the variables which are significantly correlated to the principal dimensions. Both active and supplementary variables whose p-value is smaller than 0.05 appear.

The tables of the description of the two principal axes show that the variables "Points" and "Long.jump" are the most correlated to the first dimension and "Discus" is the most correlated to the second one. This confirms our first interpretation.

This function is very useful in settings where there are many variables and allows to make the interpretation easier.

If we do not want an individual (or several ones) to participate in the analysis, it is possible to add it as a supplementary individual. It will not be active in the analysis but will bring supplementary information.
To add supplementary individuals, use the following argument of the PCA function: ind.sup

All the detailed results can be found in res.pca. We can get the eigenvalues, the results of the active and supplementary individuals, the results of the active variables and the results of the categorical and continuous supplementary variables by typing: res.pca names(res.pca) res.pca$eig, res.pca$ind, res.pca$ind.sup, res.pca$var, res.pca$quali.sup, res.pca$quanti.sup

To go further

In most PCAs, the weight of the individuals is equal to 1/(number of individuals). However, it is sometimes necessary to give a specific weight to some individuals.
To do so, the user will have to use the argument: row.w

It can also be interesting to give a specific weight to some variables.
Here, the user will have to use the argument: col.w

To see whether the categories of the supplementary variable are significantly different from each other, we can draw confidence ellipses around them.
To do so, type: concat = cbind.data.frame(decathlon[,13],res.pca$ind$coord)
ellipse.coord = coord.ellipse(concat,bary=T)
plot.PCA(res.pca,habillage=13,ellipse=ellipse.coord,cex=0.8)

Principal Component Analysis: confidence ellipsesClick to view