Hands-on Exercise 5.2: Visual Correlation Analysis

Published

February 2, 2024

Modified

February 2, 2024

1 Getting Started

1.1 Install and loading R packages.

pacman::p_load(corrplot, ggstatsplot, tidyverse)

1.2 Importing the data

wine <- read_csv("../../data/wine_quality.csv")

2 Building Correlation Matrix: pairs() method

Reading the syntax description of pairs function.

2.1 Building a basic correlation matrix

Click to view the code.

pairs(wine[,1:11])

The required input of pairs() can be a matrix or data frame. The code chunk used to create the scatterplot matrix is relatively simple. It uses the default pairs function.

Columns 2 to 12 of wine dataframe is used to build the scatterplot matrix. The variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.

Click to view the code.

pairs(wine[,2:12])

2.2 Drawing the lower corner

pairs function of R Graphics provided many customisation arguments. For example, it is a common practice to show either the upper half or lower half of the correlation matrix instead of both. This is because a correlation matrix is symmetric.

To show the lower half of the correlation matrix, the upper.panel argument will be used as shown in the code chunk below.

Click to view the code.

pairs(wine[,2:12], upper.panel = NULL)

Similarly, the upper half of the correlation matrix can be displayed by the code chunk below.

Click to view the code.

pairs(wine[,2:12], lower.panel = NULL)

2.3 Including with correlation coefficients

To show the correlation coefficient of each pair of variables instead of a scatter plot, panel.cor function will be used. This will also show higher correlations in a larger font.

Click to view the code.

panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use="complete.obs"))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * (1 + r) / 2)
}

pairs(wine[,2:12], 
      upper.panel = panel.cor)

3 Visualising Correlation Matrix: ggcormat()

One of the major limitation of the correlation matrix is that the scatter plots appear very cluttered when the number of observations is relatively large (i.e. more than 500 observations). To over come this problem, Corrgram data visualisation technique suggested by D. J. Murdoch and E. D. Chow (1996) and Friendly, M (2002) and will be used.

The are at least three R packages provide function to plot corrgram, they are:

On top that, some R package like ggstatsplot package also provides functions for building corrgram.

3.1 The basic plot

On of the advantage of using ggcorrmat() over many other methods to visualise a correlation matrix is it’s ability to provide a comprehensive and yet professional statistical report as shown in the figure below.

Click to view the code.

ggstatsplot::ggcorrmat(
  data = wine, 
  cor.vars = 1:11)

Click to view the code.

ggstatsplot::ggcorrmat(
  data = wine, 
  cor.vars = 1:11,
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         tl.cex = 10),
  title    = "Correlogram for wine dataset",
  subtitle = "Four pairs are no significant at p < 0.05"
)

Things to learn from the code chunk above:

cor.vars argument is used to compute the correlation matrix needed to build the corrgram.
ggcorrplot.args argument provide additional (mostly aesthetic) arguments that will be passed to ggcorrplot::ggcorrplot function. The list should avoid any of the following arguments since they are already internally being used: corr, method, p.mat, sig.level, ggtheme, colors, lab, pch, legend.title, digits.

4 Building multiple plots

Since ggstasplot is an extension of ggplot2, it also supports faceting. However the feature is not available in ggcorrmat() but in the grouped_ggcorrmat() of ggstatsplot.

Click to view the code.

grouped_ggcorrmat(
  data = wine,
  cor.vars = 1:11,
  grouping.var = type,
  type = "robust",
  p.adjust.method = "holm",
  plotgrid.args = list(ncol = 2),
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         tl.cex = 10),
  annotation.args = list(
    tag_levels = "a",
    title = "Correlogram for wine dataset",
    subtitle = "The measures are: alcohol, sulphates, fixed acidity, citric acid, chlorides, residual sugar, density, free sulfur dioxide and volatile acidity",
    caption = "Dataset: UCI Machine Learning Repository"
  )
)

Things to learn from the code chunk above:

to build a facet plot, the only argument needed is grouping.var.
Behind group_ggcorrmat(), patchwork package is used to create the multiplot. plotgrid.args argument provides a list of additional arguments passed to patchwork::wrap_plots, except for guides argument which is already separately specified earlier.
Likewise, annotation.args argument is calling plot annotation arguments of patchwork package.

5 Visualising Correlation Matrix using corrplot Package

Reading An Introduction to corrplot Package in order to gain basic understanding of corrplot package.

5.1 Getting started with corrplot

Before we can plot a corrgram using corrplot(), we need to compute the correlation matrix of wine data frame.

In the code chunk below, cor() of R Stats is used to compute the correlation matrix of wine data frame.

Click to view the code.

wine.cor <- cor(wine[, 1:11])

Next, corrplot() is used to plot the corrgram by using all the default setting as shown in the code chunk below.

Click to view the code.

corrplot(wine.cor)

circle: default visual object used to plot the corrgram
default layout of the corrgram: symmetric matrix
default colour scheme - diverging blue-red

blue - represent pair variables with positive correlation coefficients

red - represent pair variables with negative correlation coefficients
the intensity of the colour (saturation) - represent the strength of the correlation coefficient

darker - relatively stronger linear relationship between the paired variables

lighter - relatively weaker linear relationship

5.2 Working with visual geometrics

In corrplot package, there are seven visual geometrics (parameter method) can be used to encode the attribute values. They are: circle, square, ellipse, number, shade, color and pie. The default is circle.

As shown in the previous section, the default visual geometric of corrplot matrix is circle. However, this default setting can be changed by using the method argument as shown in the code chunk below.

square
ellipse

Click to view the code.

corrplot(wine.cor, method = "square")

Click to view the code.

corrplot(wine.cor, method = "ellipse")

5.3 Working with layout

corrplor() supports three layout types, namely: “full”, “upper” or “lower”. The default is “full” which display full matrix. The default setting can be changed by using the type argument of corrplot().

Click to view the code.

corrplot(wine.cor, 
         method = "ellipse", 
         type="lower")

The default layout of the corrgram can be further customised. For example, arguments diag and tl.col are used to turn off the diagonal cells and to change the axis text label colour to black colour respectively as shown in the code chunk and figure below.

Click to view the code.

corrplot(wine.cor, 
         method = "ellipse", 
         type="lower",
         diag = FALSE,
         tl.col = "black")

5.4 Working with mixed layout

With corrplot package, it is possible to design corrgram with mixed visual matrix of one half and numerical matrix on the other half. In order to create a coorgram with mixed layout, the corrplot.mixed(), a wrapped function for mixed visualisation style will be used.

Figure below shows a mixed layout corrgram plotted using wine quality data.

Click to view the code.

corrplot.mixed(wine.cor, 
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Notice that argument lower and upper are used to define the visualisation method used. In this case ellipse is used to map the lower half of the corrgram and numerical matrix (i.e. number) is used to map the upper half of the corrgram. The argument tl.pos, on the other, is used to specify the placement of the axis label. Lastly, the diag argument is used to specify the glyph on the principal diagonal of the corrgram.

5.5 Combining corrgram with the significant test

In statistical analysis, we are also interested to know which pair of variables their correlation coefficients are statistically significant.

Figure below shows a corrgram combined with the significant test. The corrgram reveals that not all correlation pairs are statistically significant. For example the correlation between total sulfur dioxide and free surfur dioxide is statistically significant at significant level of 0.1 but not the pair between total sulfur dioxide and citric acid.’

With corrplot package, we can use the cor.mtest() to compute the p-values and confidence interval for each pair of variables.

Click to view the code.

wine.sig = cor.mtest(wine.cor, conf.level= .95)

We can then use the p.mat argument of corrplot function as shown in the code chunk below.

Click to view the code.

corrplot(wine.cor,
         method = "number",
         type = "lower",
         diag = FALSE,
         tl.col = "black",
         tl.srt = 45,
         p.mat = wine.sig$p,
         sig.level = .05)

5.6 Reorder a corrgram

Matrix reorder is very important for mining the hiden structure and pattern in a corrgram. By default, the order of attributes of a corrgram is sorted according to the correlation matrix (i.e. “original”). The default setting can be over-write by using the order argument of corrplot(). Currently, corrplot package support four sorting methods, they are:

“AOE” is for the angular order of the eigenvectors. See Michael Friendly (2002) for details.
“FPC” for the first principal component order.
“hclust” for hierarchical clustering order, and “hclust.method” for the agglomeration method to be used.

“hclust.method” should be one of “ward”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”.
“alphabet” for alphabetical order.

“AOE”, “FPC”, “hclust”, “alphabet”. More algorithms can be found in seriation package.

Click to view the code.

corrplot.mixed(wine.cor, 
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               order="AOE",
               tl.col = "black")

5.7 Reordering a correlation matrix using hclust

If using hclust, corrplot() can draw rectangles around the corrgram based on the results of hierarchical clustering.

Click to view the code.

corrplot(wine.cor, 
         method = "ellipse", 
         tl.pos = "lt",
         tl.col = "black",
         order="hclust",
         hclust.method = "ward.D",
         addrect = 3)