Alpha diversity is the diversity within a sample/community. There are tons of different metrics for estimating alpha diversity, and only a small subset of the metrics will be presented here. It is crucial that the dataset is not trimmed/pruned prior to calculating the alpha diversity.
# Load phyloseq object and phyloseq package
library(phyloseq)
load("../data/physeq.RData")
Richness is the number of different taxa in a community. Observed richness is therefore simply the observed number of different taxa in a sample, the simplest alpha diversity metric.
In R:
obs_rich <- estimate_richness(phy, measures = "Observed")
head(obs_rich)
or with apply:
obs_rich <- apply(otu_table(phy), MARGIN = 2, function(x) sum(x > 0))
head(obs_rich)
Note on R code: apply in R does something on each row (MARGIN = 1) or column (MARGIN = 2) of a matrix. The otu_table(phy) we give as input contains counts of taxa with one column per sample. We therefore run the function on each sample. sum(x > 0) means that we find all counts that are larger than zero and we sum (count) those. This is the observed richness; the number of taxa that are present (abundance > 0) in a sample.
Shannon diversity (AKA Shannon entropy, AKA Shannon index), is a widely used alpha diversity metric. This metric is a combination of the richness (see above) and the evenness in sample, where the evenness is a measure of how equally abundant different taxa are in a sample. If all taxa have the same abundance then evenness is high, if one or few taxa dominates evenness is low.
Shannon diversity is calcualted as: \begin{equation*} H' = - \sum_{i=1}^R p_i ln \left( p_i \right) \end{equation*}
where pi is the relative abundance of taxa i. ln is the natural log, and R is the richness.
It was developed as a measure of information, and can be thought of as the uncertainty in whether two random individuals in a sample/community are similar. Therefore, the more different taxa are observed in a sample, and the more even their abundance is, the higher is the Shannon diversity.
From the Shannon diversity we can calculate the effective number of taxa, also called true diversity of order 1 (1D), as follows: 1D=exp(H'). The 1D can be seen as an abundance-corrected observed richness, based on the assumption that microbial communities with high evenness are effectively more diverse than communities with low evenness. The observed richness is true diversity of order 0 (0D), since it is ignoring relative abundance altogether. There are diveristy indices of higher orders than 1, which weigh the relative abundance more than Shannon diversity, effectively weighing the common (as opposed to the rare) taxa even more.
In R:
shannon <- estimate_richness(phy, measures = "Shannon")
head(shannon)
or directly with the formula above:
shannon <- apply(otu_table(phy), MARGIN = 2, function(x) -sum(x/sum(x)*log(x/sum(x)), na.rm = TRUE))
head(shannon)
Calculate effective number of taxa:
D1 = exp(shannon)
head(D1)
Note on R code: As with the observed richness, we apply a function on each column (sample) of the otu_table. The function is the formula above, where x/sum(x) is a way to calculate the relative abundance. The log function will return NA for many taxa if their abundance is 0 (you cannot take the log of zero), and the na.rm = TRUE tells the function to ignore NA's in the final summation.
Phylogenetic diversity (PD) is an alpha diversity metric which incorporates the phylogenetic tree of the taxa. Faith's PD is the sum of all the branch lengths that are connecting all taxa observed in the sample.
In the tree below, the cyan circles marks the taxa present in a given sample. The red line is the path connecting these taxa (usually the root is also included if present). Faith's PD of this sample is then the branch length spanned by the red path; that is, all the red horizontal lines.
In R:
# We need the picante package
library(picante)
# PD: Phylogenetic Diversity
# SR: Species Richness (aka Observed richness)
fpd <- pd(t(otu_table(phy)), phy_tree(phy))
head(fpd)
Note on R code: The pd function of course needs our phylogenetic tree as input, but it also needs the otu_table to know which taxa are present in which samples. The picante package assumes that input tables have samples as rows, we therefore have to transpose our otu_table with the t() function.
We will use the ggplot2 package for plotting.
library(ggplot2)
First, we will calculate observed richness and shannon diversity in one line, then we will add our sample metadata:
alpha <- estimate_richness(phy, measures = c("Observed", "Shannon"))
alpha <- merge(alpha, sample_data(phy), by = "row.names")
The alpha data.frame now looks like this:
head(alpha)
# Define plot
plot_alpha <- ggplot(alpha, aes(x = Time, y = Observed)) +
geom_point()
# Show plot
plot_alpha
We can plot as boxplots with a black and white theme and with custom labels:
# Define plot
plot_alpha <- ggplot(alpha, aes(x = Time, y = Observed)) +
theme_bw() +
geom_boxplot() +
xlab("Time point") +
ylab("Observed richness")
# Show plot
plot_alpha
From the plots above it appears as the richness is higher at 1 year (1y) than the other time points. Can this be supported statistically?
To test if the richness differs between the time points we would run an ANOVA.
Let's first make Time into a factor, and set the levels chronologically:
alpha$Time <- factor(alpha$Time, levels = c("1w", "1m", "1y"))
Then we need to define the anova model:
A note on notation: The left-hand side of the formula (y ~ x) is the response (AKA dependent variable) and the right-hand side are the features (AKA independent variables, AKA covariates)
model <- aov(Observed ~ Time, data = alpha)
We can then get a summary of our model, and see that the time variable is highly significant.
summary(model)
We can also run the Tukey's Honest Significant Different test as a post-hoc test to compare all time points pairwise:
TukeyHSD(model)
This result fits with what we see in plot. The richness at 1 year is higher than at 1 month and 1 week.
The left-hand side of the output are the pairwise comparisons. Diff is the difference in means. Lwr and upr are lower and upper 95% confidence intervals (of the difference). p adj is the adjusted p-value.
If the dataset is not rarefied (see notebook on rarefaction for details), we should include the library size (the total number of reads in a given sample) in the model, as this could have substantial effect on the alpha diversity estimation. As the library size is a continuous variable we should run a linear model instead of an anova.
# First we include the library size in our 'alpha' data.frame
libSize <- data.frame(LibSize = sample_sums(phy))
alpha <- merge(alpha, libSize, by.x = "Row.names", by.y = "row.names")
# Then define the new model
model2 <- lm(Observed ~ Time + log10(LibSize), data = alpha)
In this linear model the time variable is split in three so-called dummy variables, one for each level, the first being the intercept. The interpretation here is that: the intercept represents the 1 week time point. The estimate next to Time1m shows that the richness at 1 month is 1.4 lower than at 1 week. The estimate next to Time1y shows that the richness at 1 year is 39.7 higher than at 1 week. The difference between 1 year and 1 week is significant (p < 2e-16), but not the difference between 1 week and 1 month (p = 0.3195). Furthermore, the log10(LibSize) variable shows that every time the library size gets 10 times larger the richness increases by 9.6, and this is significant (p = 2.14e-12).
summary(model2)
If we want to compare 1 month with 1 year, we can test all pairwise time points, as with the TukeyHSD above. We just need to use another function from the multcomp pacage.
# Load package
library(multcomp)
# Run Tukey contrasts on the time variable on our model2
summary(glht(model2, linfct=mcp(Time="Tukey")))
This result agrees with our first model; 1 month and 1 week do not have different richness, and at 1 year the richness is higher.
In this dataset we have repeated measurements from the same patients; we have samples from the same children at 1 week, 1 month, and 1 year. It is therefore best practice to take this into account in our statistical model. This analysis is outside the scope of this notebook, but more information can be found in the Coding Club.