Neither alpha or beta diversity analysis tells us which taxa are different in our treatment groups. For this, we use differential abundance (DA) analysis. DA methods compare the abundance of each taxa between our treatment groups to tell which are differentially abundant.
There are many different tools for conducting DA analysis, and no gold standard. The problem is that there are many statistical challenges with analysing amplicon data, which the different tools try to solve in different ways:
# Load phyloseq object and phyloseq package
library(phyloseq)
load("../data/physeq.RData")
Here we will use the DAtest package for DA analysis. The advantage with DAtest is that all the different methods require the same standardized input, also making it simple to conduct many analysis and compare them.
library(DAtest)
# Load MicEco for helper functions
library(MicEco)
First step before doing differential abundance analysis is to decide at which taxonomic/phylogenetic level you want to test your hypothesis. Do you want to be specific and do it at the ASV level, or is it more fitting with your hypothesis to test at a higher taxonomic level.
Things to consider when choosing whether to agglomorate:
The choice depends entirely on the hypotheses of the study.
Removing low abundant taxa (ASVs or another level, depending on agglomoration), is a useful and widely used preparation step before performing DA analysis. The reason is that many taxa in microbiome studies are rare, and often many are so rare that there is no way they would be statistically significant. An obvious example is an ASV only present in one sample. Depending on your sample size and experimental setup we would often want ASVs to be present in at least 5 samples (unless your treatment groups are less than this size). Another threshold to filter by is the abundance/reads; if an ASV only have few reads the estimation of its abundance will be poor.
There are, unfortunately, no gold standard on how to set the thresholds when pruning low abundant ASVs. It is a trade-off between keeping rare ASVs which could be interesting to test, and removing ASVs to increase statistical power when testing the remaining ASVs.
In R:
We use the ps_prune function to filter, which is also presented in the phyloseq operations notebook.
# Agglomorate to Genus level
phy_genus <- tax_glom(phy, "Genus")
# Only keep genera present in at least 10 samples
phy_genus_pre <- ps_prune(phy_genus, min.samples = 10)
Wilcoxon Rank Sum test is used to test if the median of two statistical samples are different (See more in the Statistics 101 notebook). If we have a simple experimental design with for example a control and treatment group, we can conduct a Wilcoxon Rank Sum test for each taxa, comparing whether they are different in control versus treatment group.
In R:
res_wil <- DA.wil(phy_genus_pre, predictor = "Delivery")
head(res_wil)
Output explanation:
Usually we are only interested in those that are significantly abundant. So we can subset the data.frame with results:
res_wil[res_wil$pval.adj <= 0.05, ]
11 significantly abundant genera!
We can also order it, and see which have the highest fold change:
res_wil_ordered <- res_wil[rev(order(abs(res_wil$log2FC))), ]
head(res_wil_ordered)
MetagenomeSeq Featuretest was developed for microbiome data, and is usually the most powerful for finding differentially abundant taxa.
Let's run MetagenomeSeq to test the effect of Delivery:
res_msf <- DA.msf(phy_genus_pre, predictor = "Delivery")
MetagenomeSeq output:
Let's only look at significant ASVs:
res_msf[res_msf$pval.adj <= 0.05, ]
16 significantly abundant genera with MetagenomeSeq!
DEseq2 was originally developed for RNA-seq (transcriptomic) data, but this is very similar to 16S rRNA gene amplicon data, and DESeq2 has been shown to perform well on this kind of data. Furthermore, with DESeq2 you can create more complicated models, with for example multiple variables both categorical and quantitative.
Let's run DESeq2 to test the effect of Time:
res_ds2 <- DA.ds2(phy_genus_pre, predictor = "Time")
head(res_ds2)
It is important to remember that our time variable has three levels: 1w, 1m, and 1y. By default we only get the log fold change of the second level (1w here) against the first level (1m here). The p-value is an overall p-value telling us whether any of the levels (timepoints) are different.
Output explanation:
Let's see if there any significant:
res_ds2[res_ds2$pval.adj <= 0.05, ]
Many genera are significantly differentially abundant between the different time points
We can see that Clostridium, the first one, is more abundant in 1 week than in 1 month, but what about 1 year?
We can change the coeff argument. If we set coeff=3 then we will compare the third level (1 year) with the first level (1 month). The p-value is the same, but the fold change and ordering will change.
res2 <- DA.ds2(phy_genus_pre, predictor = "Time", coeff = 3)
See results for the Clostridium:
res2[res2$Feature == "03a722bba911de891254934069dc8959", ]
We can in the same way change the reference level for calculating the fold change. Let's change it to 2 (1 week):
res3 <- DA.ds2(phy_genus_pre, predictor = "Time", coeff = 3, coeff.ref = 2)
res3[res3$Feature == "03a722bba911de891254934069dc8959", ]
Based on this we can conclude that Clostridium is most abundant at 1 week, hereafter at 1 month, and least abundant at 1 year.
The log2 fold change (log2FC) is the log2 ratio between the abundance in one group versus another group. For example, in the above analysis it is the ratio in abundance of Clostridium at 1 year versus at 1 week. We can see that it is -2.8. A negative log2FC means that Clostridium is in higher abundance in the reference group (1 week). Since it is on a log2 scale, a log2FC of 1 means the abundance is double compared to the reference, 2 means it is quadruple that of the reference, -1 means half, and -2 means a quarter, and so on. It is quite common to ignore log2FC between -2 and 2, and only focus on large differences. This choice, however, depends on the hypothesis in question.
With DESeq2 we can include covariates in the analysis. If we want to test the effect of mode of Delivery, we can take into account the differences across time points. Here we would say that we "control" for the effect of time, when testing the effect of Delivery.
res_covar <- DA.ds2(phy_genus_pre, predictor = "Delivery", covars = "Time")
res_covar[res_covar$pval.adj <= 0.05, ]
This does not tells at which time points the different genera are significantly associated with mode of delivery. For this, we would subset the data to the different time points, and conduct the analysis separately for each time point.
A common way to plot results from a differential abundance analysis is the volcano plot. In a volcano plot we plot the p-value against the log2 fold change. One can then color according to for example taxonomy.
To make the plot easier to read, we take the log10 of the p-value and reverse it, such that significant taxa have a high value.
Let's plot the Delivery results from above:
# Plotting package
library(ggplot2)
# Define plot
p <- ggplot(res_covar, aes(x = log2FoldChange, y = -log10(pval.adj))) +
theme_bw() +
geom_point()
# View plot
p
We can add a line to discrimate between significant and non-significant. And we can color by Phylum
# Define plot
p <- ggplot(res_covar, aes(x = log2FoldChange, y = -log10(pval.adj), color = Phylum)) +
theme_bw() +
geom_point() +
geom_hline(yintercept = -log10(0.05))
# View plot
p
Annotate the significant ones:
# Annotation package (install first time)
# install.packages("ggrepel")
library(ggrepel)
# Subset only highly significant
res_covar_sig <- res_covar[res_covar$pval.adj <= 0.01, ]
# Define plot
p <- ggplot(res_covar, aes(x = log2FoldChange, y = -log10(pval.adj), color = Phylum)) +
theme_bw() +
geom_point() +
geom_hline(yintercept = -log10(0.05)) +
geom_text_repel(data = res_covar_sig,
aes(x = log2FoldChange, y = -log10(pval.adj), label = Genus),
size = 3)
# View plot
p