Getting further into multivariate analysis:
The GUSTA ME website contains descriptions of many multivariate analyses aimed at microbial ecologists using R.
# Load packages
library(ggplot2) # Plotting
library(phyloseq) # Microbiome data handling
library(vegan) # Dissimilarities and PERMANOVA
iris: Data on Iris plant characteristics. A data.frame
data(iris)
head(iris)
phy: Our simulated gut microbiome dataset. A phyloseq object
load("../data/physeq.RData")
phy
How would you display 4 axes on a graph? What about 1000 axes? Microbiome datasets are high-dimensional, with maybe hundreds of samples and thousands of taxa, but we use 2-dimensional screens and paper, which limits our view of this data. Principal component analysis (PCA) is a way of projecting high-dimensional data into lower dimensions. While this sounds obscure, it is extremely powerful. Imagine a dataset with 20 samples and 1000 variables (e.g. ASVs): how would you make a plot showing whether your samples are different? PCA transforms the data and creates new axes, known as principal components (PCs). These PCs are created such that PC1 spans as much variation as possible in the dataset, PC2 spans as much variance as possible while being orthogonal (at a 90 degree angle) to PC1, PC3 spans as much variance as possible while being orthogonal to PC1 and PC2, and so on.
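As a tiny illustration of these properties, here is a hypothetical toy dataset (not part of the data used below): the resulting PCs are uncorrelated with each other, and the variance they capture decreases from PC1 onwards.
# Toy example: PCs are uncorrelated and ordered by the variance they capture
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- toy$x + rnorm(50, sd = 0.3)      # y is strongly correlated with x
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)
round(cor(toy_pca$x), 3)                  # off-diagonal ~ 0: PCs are uncorrelated
round(toy_pca$sdev^2, 3)                  # variance captured decreases from PC1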
The output from running a PCA is: the new coordinates of the samples along the PCs (the scores), the loadings describing how much each original variable contributes to each PC, and the standard deviation of each PC, from which the eigenvalues and variance explained can be calculated.
In R:
Let's run a PCA on the plant data, plot the first two axes, and color the points by the plant species:
# We only use the first 4 columns
iris_sub <- iris[, 1:4]
# Center and scale the variables (note the argument is called scale., with a trailing dot) so all variables contribute equally
pca <- prcomp(iris_sub, center=TRUE, scale.=TRUE)
str(pca)
# pca$sdev contains the standard deviation of the PCs (used to calculate eigenvalues and thereby variance explained)
# pca$x contains the PCs
# pca$rotation contains loadings: How much the original variables contribute to each PC
# Turn new columns into a dataframe
pc_df <- data.frame(pca$x)
# Add the species column
pc_df$Species <- iris$Species
Now the dataframe looks like this:
head(pc_df)
We can only get as many PCs as we have samples or variables, whichever is lower.
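A quick sanity check of this: iris_sub has 150 samples and 4 variables, so we get 4 PCs.
ncol(pca$x)   # number of PCs: 4
dim(iris_sub) # 150 samples, 4 variables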
# Plot it
p <- ggplot(pc_df, aes(x = PC1, y = PC2, color = Species)) +
theme_bw() +
geom_point()
p
Let's look at which variables drive the PCs (loadings).
pca$rotation
PC1 is mostly a combination of Sepal.Length, Petal.Length, and Petal.Width. PC2 is mostly driven by Sepal.Width.
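If you want to see the samples and the loadings in the same figure, base R's biplot() works directly on a prcomp object (a quick, unpolished complement to the ggplot above).
# Samples plotted together with arrows for the variable loadings
biplot(pca)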
We can get the eigenvalues from the sdev:
eigen <- pca$sdev^2
eigen
From the eigenvalues we can get the variance explained:
ve <- eigen/sum(eigen)
ve
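A scree plot is a common way to show how the variance explained drops off across the PCs. A minimal sketch with ggplot2 (scree_df is just a helper data.frame made for this plot):
# Scree plot: variance explained per PC
scree_df <- data.frame(PC = factor(paste0("PC", seq_along(ve)), levels = paste0("PC", seq_along(ve))),
                       VE = 100 * ve)
ggplot(scree_df, aes(x = PC, y = VE)) +
  theme_bw() +
  geom_col() +
  ylab("Variance explained (%)")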
We can put the variance explained on the plot axes:
p +
xlab(paste0("PC1 (", round(100*ve[1]), "%)")) +
ylab(paste0("PC2 (", round(100*ve[2]), "%)"))
PCA is based on linear combinations of the variables, and therefore assumes that the data are not too skewed and are approximately normally distributed. Principal Coordinate Analysis (PCoA) is a generalized version of PCA. The input is a dissimilarity matrix. If this dissimilarity matrix is based on Euclidean distances between samples, the PCs are the same as in a PCA (we check this below, after the PCoA example). The strength of PCoA is that we can use any dissimilarity metric as input. Various dissimilarity metrics appropriate for amplicon data are explained in the Beta diversity notebook.
Note on R code:
There are multiple functions for calculating a PCoA in R. The code presented below is different from what you would usually use when the input is an amplicon dataset (in a phyloseq object). Therefore, when you have a phyloseq object, use the code presented in the Beta diversity notebook (a minimal sketch of that route is shown below). The code below is used if your input is a data.frame with observations as rows and variables as columns. If you want to use the code below on an amplicon dataset, you would substitute iris_sub with, for example, t(otu_table(phy)) - the t() transposes the abundance table and is only necessary if taxa_are_rows(phy) returns TRUE.
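For reference, a minimal sketch of the phyloseq-native route, using phyloseq's ordinate() and plot_ordination(). This is only an illustration (it may not be identical to the Beta diversity notebook) and assumes sample_data(phy) contains a Time column, as used in the PERMANOVA section below.
# Bray-Curtis PCoA directly on a phyloseq object
ord <- ordinate(phy, method = "PCoA", distance = "bray")
plot_ordination(phy, ord, color = "Time") +
  theme_bw()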
In R:
Let's calculate the Bray-Curtis dissimilarity between samples, and then plot a PCoA:
library(vegan)
dd <- vegdist(iris_sub, method = "bray")
Have a look at the dissimilarity matrix (only the first 5 samples). Each value is the dissimilarity between a pair of samples. The diagonal is the dissimilarity between a sample and itself, and is, as expected, equal to zero.
as.matrix(dd)[1:5,1:5]
# PCoA
pcoa_data <- cmdscale(dd, eig = TRUE)
# Collect in data.frame
pcoa_df <- data.frame(PC1 = c(pcoa_data$points[,1]),
PC2 = c(pcoa_data$points[,2]),
Species = iris$Species)
# Variance explained
ve <- pcoa_data$eig/sum(pcoa_data$eig)
# Plot it
p <- ggplot(pcoa_df, aes(x = PC1, y = PC2, color = Species)) +
theme_bw() +
geom_point() +
xlab(paste0("PCoA 1 (",round(ve[1]*100,1),"%)")) +
ylab(paste0("PCoA 2 (",round(ve[2]*100,1),"%)"))
p
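As promised above, a quick check that a PCoA on Euclidean distances reproduces the PCA (a minimal sketch; we reuse the centered and scaled iris variables so the Euclidean distances match the PCA run earlier).
# PCoA on Euclidean distances of the centered, scaled data
pcoa_euc <- cmdscale(dist(scale(iris_sub)), eig = TRUE)
# The coordinates match the PCA scores up to a possible sign flip per axis
head(round(pcoa_euc$points, 3))
head(round(pca$x[, 1:2], 3))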
PERmutational Multivariate ANalysis Of VAriance (PERMANOVA) is a method to estimate the effect a variable has on the entire microbial community. For example, for our gut microbiome dataset, we could ask whether the microbial community is different at different time points. PERMANOVA does not tell us HOW the microbial communities differ; if we want to know specifically which taxa are different between different time points we would run a Differential Abundance analysis (see notebook on this topic).
The input to PERMANOVA is a dissimilarity matrix. The results of the PERMANOVA are therefore highly dependent on the dissimilarity metric that is used.
Note: The p-values are calculated by permutation, which is a random process, and will therefore vary a little each time you run it.
In R:
# Calculate unweighted UniFrac (our dissimilarity matrix)
UF <- UniFrac(phy, weighted = FALSE)
# Extract sample data
Samp <- data.frame(sample_data(phy))
# We use the adonis function from the vegan package to run PERMANOVA
adonis(UF ~ Time, data = Samp)
We can see that Time has a significant effect (P-value = 0.001) on the microbial community (as summarized by unweighted UniFrac). The variance explained (R2) by the Time variable is 22.4%.
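Because the p-values come from random permutations, you can set a seed to make the result reproducible, and increase the number of permutations for a more precise p-value (a small sketch; permutations is an argument of adonis).
set.seed(42) # make the permutations reproducible
adonis(UF ~ Time, data = Samp, permutations = 9999)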
Here we add Delivery mode as an additional variable:
adonis(UF ~ Time + Delivery, data = Samp)
We can also include an interaction term, here testing whether the effect of Delivery changes depending on the time point, or vice versa (Time:Delivery):
adonis(UF ~ Time * Delivery, data = Samp)
In this last model we see that Time has a significant effect, explaining 22.4% of the variance. Delivery mode also has a significant effect, explaining 1.8% of the variance. Furthermore, there is a significant interaction between Time and Delivery, explaining 2.9% of the variance. That there is an interaction means that the two variables do not have independent effects on the UniFrac dissimilarity. In other words, the effect of Delivery depends on the Time variable, and vice versa. This is probably because there is a large effect of delivery at the early time points, but a small effect at the later time points. One could test this by running a PERMANOVA on the effect of Delivery separately for each time point, as sketched below.
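A minimal sketch of that follow-up analysis (assuming, as above, that Samp contains Time and Delivery columns; we subset the distance matrix by hand and reuse adonis, although newer versions of vegan recommend adonis2):
# Test the effect of Delivery within each time point separately
UF_mat <- as.matrix(UF)
for (tp in unique(as.character(Samp$Time))) {
  keep <- rownames(Samp)[Samp$Time == tp]  # samples from this time point
  res <- adonis(as.dist(UF_mat[keep, keep]) ~ Delivery,
                data = Samp[keep, , drop = FALSE])
  cat("Time point:", tp, "\n")
  print(res$aov.tab)
}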