Correspondence Analysis
As an example, we use a data set from a 1974 questionnaire about French women's work.
The data set is available at http://factominer.free.fr/classical-methods/datasets/women_work.txt
Presentation of the data
1724 women answered several questions about women's work, among which:
- What do you think the perfect family is?
  - Both husband and wife work
  - Husband works more than wife
  - Only husband works
- Which activity is the best for a mother when children go to school?
  - Stay at home
  - Part-time work
  - Full-time work
- What do you think of the following sentence: "Women who do not work feel cut off from the world"?
  - Totally agree
  - Quite agree
  - Quite disagree
  - Totally disagree
The data set consists of two contingency tables that cross the answers to the first question with the answers to each of the other two.
Each cell gives the number of women who gave both answers.
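To make this construction concrete, here is a minimal sketch in base R of how such a contingency table is obtained from individual answers. The six respondents and the object names (`answers`, `tab`) are made up for illustration; they are not taken from the survey.

```r
# Hypothetical individual answers (made up, NOT the survey data)
answers <- data.frame(
  perfect_family = c("Both work", "Only husband works", "Both work",
                     "Husband works more", "Only husband works",
                     "Only husband works"),
  mother_activity = c("Full-time work", "Stay at home", "Part-time work",
                      "Part-time work", "Stay at home", "Part-time work")
)

# Cross the two questions: each cell counts the respondents
# who gave both the row answer and the column answer
tab <- table(answers$perfect_family, answers$mother_activity)
print(tab)

# The cells add up to the number of respondents
print(sum(tab))
```

The real data set is distributed directly in this aggregated form, so this step is not needed to follow the rest of the example.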
To load the package and the data set, run the following lines:
library(FactoMineR)
women_work=read.table("http://factominer.free.fr/classical-methods/datasets/women_work.txt", header=TRUE, row.names=1, sep="\t")
Objectives
The objectives of CA are much the same as those of PCA: to obtain a typology of the rows and a typology of the columns, and to study the link between these two typologies.
However, the concept of similarity between rows or columns is different. Here, similarity is defined in exactly the same way for rows and for columns. Two rows (resp. columns) are close to each other if they associate with the columns (resp. rows) in the same way.
We look for the rows (resp. columns) whose distribution differs the most from the population's, and for those that are the most or the least alike.
Each group of rows (resp. columns) is characterized by the columns (resp. rows) with which it is associated too much or too little.
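The idea of comparing each row's distribution with the population's can be sketched in base R. The table below uses made-up counts and hypothetical labels (`row_A`, `col_1`, ...), not the survey data, and the chi-square distance shown is the usual distance between profiles in CA rather than something computed in the original text.

```r
# Hypothetical 3 x 3 contingency table (made-up counts, NOT the survey data)
tab <- matrix(c(20, 50, 30,
                10, 60, 30,
                60, 30, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("row_A", "row_B", "row_C"),
                              c("col_1", "col_2", "col_3")))

# Row profiles: each row divided by its total, so every row sums to 1
row_profiles <- prop.table(tab, margin = 1)

# Average profile: distribution of the columns over the whole population
avg_profile <- colSums(tab) / sum(tab)

# Chi-square distance between each row profile and the average profile
deviation <- sweep(row_profiles, 2, avg_profile, "-")
chi2_dist <- sqrt(rowSums(sweep(deviation^2, 2, avg_profile, "/")))

print(round(row_profiles, 3))
print(round(avg_profile, 3))
print(round(chi2_dist, 3))
```

A row with a large distance has a distribution that differs strongly from the population's. Column profiles are obtained in the same way with `margin = 2`.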
CA
We are going to use the first three columns (corresponding to the answers to the second question) as active variables and the last four (corresponding to the third question) as supplementary variables.
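Active columns are used to build the axes, whereas supplementary columns are only projected onto them afterwards. The following base R sketch illustrates this distinction under stated assumptions: the counts are made up, the names (`active`, `sup_col`, `project_column`) are hypothetical, and the computation is the textbook singular value decomposition of standardized residuals, not necessarily the exact code path of the `CA` function.

```r
# Hypothetical counts (made up for illustration, NOT the survey data)
active <- matrix(c(20, 50, 30,
                   10, 60, 30,
                   60, 30, 10),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("row_A", "row_B", "row_C"),
                                 c("col_1", "col_2", "col_3")))
sup_col <- c(row_A = 40, row_B = 35, row_C = 25)  # one supplementary column

# Correspondence matrix and margins (masses) of the active table only
P      <- active / sum(active)
r_mass <- rowSums(P)
c_mass <- colSums(P)

# Standardized residuals from independence, then singular value decomposition
E   <- outer(r_mass, c_mass)
S   <- (P - E) / sqrt(E)
dec <- svd(S)
eig <- dec$d^2                      # inertia carried by each axis

# Principal coordinates of active rows and columns on the first two axes
k <- 1:2
row_coord <- dec$u[, k] %*% diag(dec$d[k]) / sqrt(r_mass)
col_coord <- dec$v[, k] %*% diag(dec$d[k]) / sqrt(c_mass)

# Supplementary column: weighted average of the row coordinates
# (weights = its profile over the rows), rescaled by 1 / sqrt(eigenvalue)
project_column <- function(counts) {
  profile <- counts / sum(counts)
  as.vector(profile %*% row_coord) / dec$d[k]
}
sup_coord <- project_column(sup_col)

print(round(eig, 4))
print(round(sup_coord, 4))
```

Because the supplementary column never enters `P`, adding or removing it leaves the axes unchanged. Applying `project_column` to an active column returns that column's own coordinates, which is a convenient sanity check. The sign of each axis is arbitrary.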
Active rows and columns only
To see the scatterplots of rows and columns separately, type:
res.ca.rows = CA(women_work[,1:3], invisible="col")
res.ca.col = CA(women_work[,1:3], invisible="row")
#women_work: the data set used
#invisible: elements we do not want to be plotted
On the scatterplot of the columns, we can see that the first axis opposes "Stay at home" and "Full-time work", which means it contrasts two profiles of women.
Women who answered "Stay at home" answered "Only husband works" more often than the population and "Both husband and wife work" less often than the population.
In the same way, women who answered "Full-time work" answered "Only husband works" less often than the population and "Both husband and wife work" more often than the population. The first axis orders the categories of the second question from the least to the most in favour of women's work.
We can make the same interpretation for the first axis of the rows' scatterplot. The categories are sorted from the least ("Only husband works") to the most ("Both husband and wife work") in favour of women's work.
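Statements such as "more often than the population" can be checked numerically by comparing each observed count with the count expected if rows and columns were independent. The sketch below is in base R and uses a made-up table with hypothetical labels, not the survey data.

```r
# Hypothetical 3 x 3 contingency table (made-up counts, NOT the survey data)
tab <- matrix(c(20, 50, 30,
                10, 60, 30,
                60, 30, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("row_A", "row_B", "row_C"),
                              c("col_1", "col_2", "col_3")))

# Counts expected under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Ratio > 1: the pair of answers occurs more often than in the population
# Ratio < 1: the pair of answers occurs less often than in the population
ratio <- tab / expected
print(round(ratio, 2))
```

Applied to `women_work[,1:3]`, the same ratio would show which pairs of answers are over- or under-represented, which is what the positions on the first axis summarize.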
To display both rows and columns on the same plot, type:
res.ca = CA(women_work[,1:3])
#women_work: the data set used
"Stay at home" is strongly associated with "Only husband works" and weakly associated with the two other categories.
"Both husband and wife work" is associated with "Full-time work" and opposed to "Stay at home".
Addition of supplementary columns
We now add the columns corresponding to the third question as supplementary variables. Type:
res.ca = CA(women_work, col.sup=4:ncol(women_work))
#women_work: the data set used
#col.sup: vector of the indexes of the supplementary columns
"Totally agree" and "Quite agree" for "Women who do not work feel cut off from the world" are close to the categories in favour of women's work.
"Quite disagree" and "Totally disagree" are close to the categories opposed to women's work.