# Hierarchical Clustering on Principal Components

The following article describes in detail why it is worthwhile to combine hierarchical clustering with principal component methods. It also gives some examples.

Husson, F., Josse, J. & Pagès, J. (2010). Principal component methods - hierarchical
clustering - partitional clustering: why would we need to choose for visualizing data? *Technical report*.

We are going to perform a hierarchical classification on the principal components of a factorial analysis. The dataset is "tea", the one already used to illustrate Multiple Correspondence Analysis (MCA).

## Objectives

We want to group the 300 individuals of the dataset into a small number of clusters that correspond to different consumption profiles.

As the variables are categorical, we first perform an MCA and then use the coordinates of the individuals on the principal components for the hierarchical classification. MCA serves here as a preprocessing step that transforms categorical variables into continuous ones.

## HCPC

The first step is to perform an MCA on the individuals.

As before (see the MCA page), we perform the MCA using the variables about consumption behavior as active ones.

We do not use the last axes of the MCA because they are considered to be noise and would make the clustering less stable. We thus keep only the first 20 axes of the MCA, which summarize 87% of the information.

Type:
`library(FactoMineR)`

`data(tea)`

`res.mca = MCA(tea, ncp=20, quanti.sup=19, quali.sup=c(20:36), graph=FALSE)`

`#tea: the data set used`

`#ncp: number of dimensions which are kept for the analysis`

`#quanti.sup: vector of indexes of continuous supplementary variables`

`#quali.sup: vector of indexes of categorical supplementary variables`

`#graph: logical. If FALSE, no graph is plotted`
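The choice of `ncp` can be checked against the cumulative share of inertia carried by the axes. The following is a minimal base-R sketch on made-up eigenvalues: the vector `eig` and the helper `n_axes_for` are hypothetical names introduced for illustration, not part of FactoMineR. With a real analysis, the eigenvalues and their cumulative percentages are stored in `res.mca$eig`.

```r
# Hypothetical eigenvalues, for illustration only (not the tea results)
eig <- c(0.30, 0.20, 0.15, 0.10, 0.10, 0.08, 0.07)

# Smallest number of axes whose cumulative share of inertia reaches `threshold`
n_axes_for <- function(eig, threshold) {
  cum_share <- cumsum(eig) / sum(eig)
  which(cum_share >= threshold)[1]
}

n_keep <- n_axes_for(eig, 0.80)
print(n_keep)
```

The value returned by such a helper is only a starting point: the axes beyond it are the ones treated as noise and left out of the clustering.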

We then perform the hierarchical classification:
`res.hcpc = HCPC(res.mca)`

`#res.mca: the result of an MCA`

The hierarchical tree suggests a partition into three clusters.

We get a three-dimensional tree and a factorial map where individuals are coloured according to the cluster they belong to.
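To make the procedure more concrete, here is a rough base-R sketch of the same idea: a Ward tree built on coordinates, a cut into clusters, then a k-means consolidation started from the cluster centers. It is not the `HCPC` source code, only an approximation of its main steps under stated assumptions. The coordinates are simulated and the number of clusters `k` is fixed by hand; with a real analysis the coordinates would come from `res.mca$ind$coord`.

```r
set.seed(1)
# Simulated coordinates standing in for principal-component scores
coords <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# 1. Ward hierarchical tree on the coordinates
tree <- hclust(dist(coords), method = "ward.D2")

# 2. Cut the tree into k clusters (k chosen by hand in this sketch)
k <- 3
initial <- cutree(tree, k = k)

# 3. Consolidate the partition with k-means started from the cluster centers
centers <- apply(coords, 2, function(col) tapply(col, initial, mean))
consolidated <- kmeans(coords, centers = centers, iter.max = 10)

table(consolidated$cluster)
```

Starting k-means from the centers of the tree's partition, rather than from random points, keeps the consolidated clusters close to those suggested by the tree.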

## Description of the clusters

Clusters can be described by:

- Variables and/or categories
- Factorial axes
- Individuals

### Description by variables and/or categories

`res.hcpc$desc.var$test.chi2`

`res.hcpc$desc.var$category`

The variables *"where"* and *"how"* are the ones that best characterize the partition into three clusters.

Each cluster is characterized by a category of the variables *"where"* and *"how"*. Only the categories whose p-value is less than 0.02 are used. For example, individuals who belong to the third cluster buy both tea bags and unpackaged tea, in both chain stores and tea shops.
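The link between the partition and a categorical variable can be illustrated with a chi-square test of independence. The sketch below uses base R on toy data: the vectors `cluster` and `where` are invented for the example and are not the tea results, and the exact computation done inside FactoMineR may differ in its details.

```r
# Toy data, for illustration only: a cluster label and one categorical variable
cluster <- factor(rep(c("1", "2", "3"), each = 20))
where <- factor(c(
  rep("chain store", 18), rep("tea shop", 2),
  rep("tea shop", 17), rep("chain store", 3),
  rep("chain store+tea shop", 16), rep("chain store", 4)
))

# Cross-tabulate clusters and categories, then test independence
tab <- table(cluster, where)
test <- chisq.test(tab)
print(test$p.value)

# Share of each category within each cluster (rows sum to 1)
print(round(prop.table(tab, margin = 1), 2))
```

A small p-value indicates that the variable is strongly linked to the partition; the row proportions then show which category is over-represented in which cluster.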

### Description by principal components

`res.hcpc$desc.axes`

Individuals in cluster 1 have low coordinates on axes 1 and 2. Individuals in cluster 2 have high coordinates on the second axis, and individuals in cluster 3 have high coordinates on the first axis. Here, a dimension is kept only when its v-test exceeds 3 in absolute value.
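As a reminder of what a v-test measures, the sketch below computes one in base R on simulated data: the gap between the cluster mean and the overall mean on an axis, divided by the standard deviation that this gap would have if the cluster were drawn at random from the whole sample. The helper `v_test` and the data are hypothetical, and the variance estimator used here is an assumption, so values may differ slightly from those reported by FactoMineR.

```r
# v-test of a cluster on one axis: standardized gap between the cluster mean
# and the overall mean, with a finite-population correction
v_test <- function(x, in_cluster) {
  n <- length(x)
  n_q <- sum(in_cluster)
  s2 <- var(x)  # assumption: sample variance; another estimator may be used in practice
  (mean(x[in_cluster]) - mean(x)) / sqrt(s2 / n_q * (n - n_q) / (n - 1))
}

set.seed(2)
# Simulated coordinates on one axis: 30 individuals in the cluster, 70 outside
axis1 <- c(rnorm(30, mean = 2), rnorm(70, mean = 0))
in_cluster <- rep(c(TRUE, FALSE), times = c(30, 70))

v <- v_test(axis1, in_cluster)
print(v)
print(abs(v) > 3)
```

A large positive v-test corresponds to high coordinates for the cluster on that axis, a large negative one to low coordinates.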

### Description by individuals

Two kinds of specific individuals exist to describe the clusters:

- Individuals closest to their cluster's center
- Individuals farthest from the other clusters' centers

`res.hcpc$desc.ind`

Individual *285* belongs to cluster 1 and is the closest to cluster 1's center.

Individual *82* belongs to cluster 1 and is the farthest from the centers of clusters 2 and 3.
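The two kinds of specific individuals can be found with plain distance computations. The following base-R sketch uses simulated coordinates and only two clusters, so that "the other clusters" reduces to a single center; all names (`coords`, `cluster`, `closest`, `farthest`) are hypothetical and the code is an illustration, not the way `HCPC` computes `desc.ind`.

```r
set.seed(3)
# Simulated coordinates and cluster labels, for illustration only
coords <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 4), ncol = 2)
)
cluster <- rep(1:2, each = 10)

# Cluster centers: one row per cluster, one column per axis
centers <- apply(coords, 2, function(col) tapply(col, cluster, mean))

# Euclidean distance from every individual (rows) to every center (columns)
d <- sapply(seq_len(nrow(centers)), function(k) {
  sqrt(rowSums(sweep(coords, 2, centers[k, ])^2))
})

# Among the members of cluster 1:
members <- which(cluster == 1)
closest <- members[which.min(d[members, 1])]   # closest to its own center
farthest <- members[which.max(d[members, 2])]  # farthest from the other center

print(c(closest = closest, farthest = farthest))
```

The individual closest to the center is the most typical member of the cluster, while the one farthest from the other center is the most specific to it.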

## To go further

### Transformation of continuous variables into categorical ones

To cut a single continuous variable into clusters:
`vari = tea[,19]`

`res.hcpc = HCPC(vari, iter.max=10)`

`max.cla = unlist(by(res.hcpc$data.clust[,1], res.hcpc$data.clust[,2], max))`

`breaks = c(min(vari), max.cla)`

`aaQuali = cut(vari, breaks, include.lowest=TRUE)`

`summary(aaQuali)`

`#iter.max: the maximum number of iterations for the consolidation`
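The way the breaks are built (the overall minimum followed by the maximum of each cluster) can be tried without FactoMineR. In the sketch below, a Ward tree cut into three clusters stands in for `HCPC`, and the variable `x` is simulated; both are assumptions made only to keep the example self-contained.

```r
set.seed(4)
# Simulated continuous variable with three well-separated groups
x <- c(rnorm(30, mean = 10), rnorm(30, mean = 20), rnorm(30, mean = 30))

# Stand-in for HCPC: Ward tree on the single variable, cut into 3 clusters
cl <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

# Upper bound of each cluster, sorted so that the breaks are increasing
upper <- sort(tapply(x, cl, max))
breaks <- c(min(x), upper)

# Each interval now covers exactly one cluster
x_cat <- cut(x, breaks, include.lowest = TRUE)
print(table(x_cat))
```

Setting `include.lowest = TRUE` matters here: without it, the individual holding the minimum value would fall outside the first interval and be coded as missing.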

To cut several continuous variables into clusters:
`data.cat = data`

`for (i in 1:ncol(data.cat)){`

`vari = data.cat[,i]`

`res.hcpc = HCPC(vari, nb.clust=-1, graph=FALSE)`

`maxi = unlist(by(res.hcpc$data.clust[,1], res.hcpc$data.clust[,2], max))`

`breaks = c(min(vari), maxi)`

`aaQuali = cut(vari, breaks, include.lowest=TRUE)`

`data.cat[,i] = aaQuali`

`}`

`#data: dataset with the continuous variables to be cut into clusters`