Continuous variables description
The condes() function characterizes a continuous variable by other continuous variables, by categorical variables, and by the categories of those categorical variables.
We are going to use the data set "wine" and characterize the "Overall quality" variable.
Which continuous variables, categorical variables and categories best describe Overall quality?
First load the package and the data set by typing:
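A minimal sketch of this loading step, assuming that condes() and the "wine" data set are the ones shipped with the FactoMineR package (the package name is not stated above, so treat it as an assumption):

```r
# Assumption: condes() and the wine data set come from the FactoMineR package
library(FactoMineR)

# Load the wine data set into the workspace
data(wine)

# Quick check of what was loaded: dimensions and variable names
dim(wine)
names(wine)
```

Listing the variable names is a convenient way to confirm which column index corresponds to the variable you want to characterize.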
Then launch the condes() function:
res = condes(wine, num.var=30, proba=0.05)
#wine: the data set used
#num.var: the index of the variable to characterize
#proba: the significance threshold considered to characterize the variable (by default 0.05)
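The object returned by condes() can then be inspected component by component. The sketch below assumes the FactoMineR implementation, whose result is a list with the elements quanti, quali and category; the loading lines are repeated only so that the block runs on its own.

```r
# Assumption: FactoMineR provides condes() and the wine data set
library(FactoMineR)
data(wine)

# Same call as above: characterize the 30th variable at the 5% threshold
res <- condes(wine, num.var = 30, proba = 0.05)

# Components of the result
names(res)

# Description by the continuous variables (correlations and p-values)
res$quanti

# Description by the categorical variables (one test per variable)
res$quali

# Description by the categories (one coefficient and test per category)
res$category
```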
Description by continuous variables
The correlation coefficient between each continuous variable and the Overall quality variable is calculated. Then, the correlation coefficients significantly different from zero are sorted and returned.
Overall quality is best described by Balance, then Smooth, then Harmony, and so on. Wines with high scores for these variables tend to have high scores for Overall quality too.
Plante is significantly and negatively correlated with Overall quality. This means that the more a wine smells of plants after shaking, the less it pleases the assessors.
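The screening described in this section can be sketched in base R. This is an illustration of the idea, not the internal code of condes(): it uses the built-in mtcars data set with mpg as the variable to characterize (both chosen only for the example), tests each correlation with cor.test(), keeps the significant ones and sorts them by decreasing correlation, which is an assumed ordering.

```r
# Illustrative data: built-in mtcars, with mpg as the variable to characterize
data(mtcars)
target <- "mpg"
proba <- 0.05

# Every other (continuous) variable is a candidate descriptor
others <- setdiff(names(mtcars), target)

# Pearson correlation with the target and p-value of the test "correlation = 0"
stats <- t(sapply(others, function(v) {
  ct <- cor.test(mtcars[[v]], mtcars[[target]])
  c(correlation = unname(ct$estimate), p.value = ct$p.value)
}))

# Keep only the correlations significantly different from zero
signif_rows <- stats[stats[, "p.value"] < proba, , drop = FALSE]

# Sort from the most positive to the most negative correlation
ord <- order(signif_rows[, "correlation"], decreasing = TRUE)
signif_rows <- signif_rows[ord, , drop = FALSE]

print(signif_rows)
```

Variables at the top of the table vary in the same direction as the target, while those at the bottom vary in the opposite direction, which is how Plante relates to Overall quality in the wine example.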
Description by categorical variables and categories
A one-way ANOVA model is fitted for each categorical variable, with Overall quality as the response and the categorical variable as the explanatory factor.
An F-test is used to see whether the variable has an influence on Overall quality, and t-tests are carried out category by category (under the constraint that the sum of the alpha_i equals 0).
The variables and the categories are sorted by p-value and only the significant ones are kept.
Soil is the only significant categorical variable for Overall quality.
Reference has a positive coefficient whereas Env4 has a negative one. This means that wines grown on the Reference soil are more appreciated (higher scores for Overall quality) and wines grown on Env4 are less appreciated (lower scores) than the average wine.
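The one-way ANOVA with the sum-to-zero constraint can be sketched in base R as follows. This is a hedged illustration rather than the code used inside condes(): it relies on the built-in PlantGrowth data set (weight explained by group), chosen only for the example, and recovers the coefficient of the last category from the constraint.

```r
# Illustrative data: built-in PlantGrowth (continuous weight, categorical group)
data(PlantGrowth)
d <- PlantGrowth

# One-way ANOVA with sum-to-zero contrasts (sum of alpha_i = 0)
fit <- lm(weight ~ group, data = d, contrasts = list(group = "contr.sum"))

# Global F-test: does the categorical variable influence the response?
f_p <- anova(fit)[["Pr(>F)"]][1]

# Coefficients alpha_1 ... alpha_(k-1) are estimated directly by lm()
k <- nlevels(d$group)
idx <- 2:k
alpha <- unname(coef(fit)[idx])
V <- vcov(fit)[idx, idx, drop = FALSE]

# The last coefficient follows from the constraint: alpha_k = -sum(others);
# its variance is the sum of all entries of the covariance matrix V
est <- c(alpha, -sum(alpha))
se <- c(sqrt(diag(V)), sqrt(sum(V)))

# t-test of "alpha_i = 0" for each category
t_val <- est / se
p_val <- 2 * pt(-abs(t_val), df = df.residual(fit))

categories <- data.frame(Estimate = est, p.value = unname(p_val),
                         row.names = levels(d$group))

# Sort the categories by p-value, most significant first
categories <- categories[order(categories$p.value), ]

print(f_p)
print(categories)
```

A positive estimate means the category scores above the overall average and a negative one means it scores below, which is the reading applied to Reference and Env4 above.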