Hierarchical Clustering on Principal Components

The following article describes in detail why it is interesting to perform hierarchical clustering with principal component methods, and gives some examples:
Husson, F., Josse, J. & Pagès, J. (2010). Principal component methods - hierarchical clustering - partitional clustering: why would we need to choose for visualizing data? Technical report.

We are going to perform a hierarchical classification on the principal components of a factorial analysis. The dataset used is the "tea" dataset already used to illustrate Multiple Correspondence Analysis.

Objectives

We want to gather the 300 individuals of the dataset into a small number of clusters corresponding to different consumption profiles.

As the variables are categorical, we first perform an MCA and then use the coordinates of the individuals on the principal components for the hierarchical classification. The MCA is thus used as a preprocessing step that transforms the categorical variables into continuous ones.

HCPC

The first step is to perform an MCA on the individuals.

As previously (see the MCA page), we perform the MCA using the variables about consumption behaviour as active variables.
We do not use the last axes of the MCA because they are considered as noise and would make the clustering less stable. We thus keep only the first 20 axes of the MCA, which summarize 87% of the information.

Type: library(FactoMineR)
data(tea)
res.mca = MCA(tea, ncp=20, quanti.sup=19, quali.sup=c(20:36), graph=FALSE) #tea: the data set used
#ncp: number of dimensions which are kept for the analysis
#quanti.sup: vector of indexes of continuous supplementary variables
#quali.sup: vector of indexes of categorical supplementary variables
#graph: logical. If FALSE, no graph is plotted
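The cumulative percentage of variance summarized by the retained axes can be checked from the eigenvalues of the MCA (a short check, using the eig component of the MCA result):

res.mca$eig #eigenvalues, percentages of variance and cumulative percentages of variance for each dimension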

We then perform the hierarchical classification: res.hcpc = HCPC(res.mca) #res.mca: the result of an MCA

The hierarchical tree suggests a clustering into three clusters:

Figures: hierarchical tree; hierarchical tree cut into three clusters.
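The cut can also be chosen without the interactive step by imposing the number of clusters directly (a sketch using the nb.clust argument of HCPC):

res.hcpc = HCPC(res.mca, nb.clust=3, graph=FALSE)
#nb.clust: number of clusters; 0 (the default) asks for an interactive cut, -1 cuts the tree at the suggested level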

We get a three-dimensional view of the tree plotted on the factorial map, and a factorial map where the individuals are coloured according to the cluster they belong to.

Figures: 3D hierarchical tree; factorial map of the individuals coloured by cluster.
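These graphs can also be redrawn afterwards with the plot method for HCPC objects (a sketch; choice selects the type of graph):

plot(res.hcpc, choice="tree") #hierarchical tree
plot(res.hcpc, choice="3D.map") #tree plotted on the factorial map
plot(res.hcpc, choice="map") #factorial map with the individuals coloured by cluster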

Description of the clusters

Clusters can be described by:

  • Variables and/or categories
  • Factorial axes
  • Individuals

Description by Variables and/or categories

res.hcpc$desc.var$test.chi2 #variables that best characterize the partition (chi-square tests)
res.hcpc$desc.var$category #categories that characterize each cluster

Figures: chi² tests; description of the third cluster by the categories.

Variables "where" and "how" are those which characterize the most the partition in three clusters.

Each cluster is characterized by categories of the variables "where" and "how". Only the categories whose p-value is less than 0.02 are kept. For example, the individuals who belong to the third cluster buy tea both as tea bags and unpackaged, in chain stores as well as in tea shops.
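The p-value threshold used for these descriptions is controlled by the proba argument of HCPC; assuming we want to keep only categories with a p-value below 0.02, the call would be (a sketch):

res.hcpc = HCPC(res.mca, nb.clust=3, proba=0.02, graph=FALSE)
#proba: significance threshold used to select the variables, categories and axes describing the clusters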

Description by principal components

res.hcpc$desc.axes

Figure: description of the clusters by the factorial axes.

Individuals in cluster 1 have low coordinates on axes 1 and 2. Individuals in cluster 2 have high coordinates on the second axis and individuals who belong to the third cluster have high coordinates on the first axis. Here, a dimension is kept only when the v-test is higher than 3.
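The detail for a single cluster can be extracted from the quanti component of desc.axes (a sketch, assuming the three-cluster partition above):

res.hcpc$desc.axes$quanti[["1"]] #principal components whose mean in cluster 1 differs significantly from the overall mean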

Description by individuals

Two kinds of specific individuals are used to describe the clusters:

  • Individuals closest to their cluster's center
  • Individuals farthest from the other clusters' centers

res.hcpc$desc.ind

Figures: individuals closest to their cluster's center; individuals farthest from the other clusters' centers.

Individual 285 belongs to cluster 1 and is the individual closest to the center of cluster 1.
Individual 82 belongs to cluster 1 and is the individual farthest from the centers of clusters 2 and 3.
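These two lists are stored separately in the desc.ind component (a short sketch of how to access them):

res.hcpc$desc.ind$para #for each cluster, the individuals closest to its center (the paragons)
res.hcpc$desc.ind$dist #for each cluster, the individuals farthest from the centers of the other clusters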

To go further

Transformation of continuous variables into categorical ones

To cut a single continuous variable into clusters: vari = tea[,19]
res.hcpc = HCPC(vari, iter.max=10)
max.cla = unlist(by(res.hcpc$data.clust[,1], res.hcpc$data.clust[,2], max))
breaks = c(min(vari), max.cla)
aaQuali = cut(vari, breaks, include.lowest=TRUE)
summary(aaQuali)
#vari: the continuous variable to be cut into clusters (here the quantitative variable of the tea dataset)
#iter.max: the maximum number of iterations for the consolidation

To cut several continuous variables into clusters: data.cat = data
for (i in 1:ncol(data.cat)){
  vari = data.cat[,i]
  res.hcpc = HCPC(vari, nb.clust=-1, graph=FALSE)
  maxi = unlist(by(res.hcpc$data.clust[,1], res.hcpc$data.clust[,2], max))
  breaks = c(min(vari), maxi)
  aaQuali = cut(vari, breaks, include.lowest=TRUE)
  data.cat[,i] = aaQuali
}
#data: dataset with the continuous variables to be cut into clusters
#nb.clust=-1: the tree is automatically cut at the suggested level
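As a usage example, and assuming we want to categorize the ten performance variables of the decathlon dataset (also provided with FactoMineR), data could be defined as follows before running the loop above:

library(FactoMineR)
data(decathlon)
data = decathlon[, 1:10] #the ten continuous performance variables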