Beta diversity

Beta diversity is the diversity between communities. Whereas alpha diversity is the local diversity, that is the diversity within a community, beta diversity is the diversity which differentiates communities. In microbiome science a community is almost a synonym with sample.

The output from running a beta diversity analysis is matrix with dissimilarities/distances between samples, such that for each pair of samples we will have 1 value denoting how similar or dissimilar the two samples are.

There are tons of different metrics for estimating beta diversity, and only a small subset (the most used) of the metrics will be presented here.

The three metrics included in this notebook, can be seen as equivalent to the three metrics presented in the alpha diversity notebook:

Alpha diversity Beta diversity Similarities
Observed richness equivalent to Jaccard Based on presence/absence of taxa
Shannon diversity equivalent to Bray-curtis Based on relative abundances of taxa
Phylogenetic diversity equivalent to UniFrac Uses the phylogenetic tree
In [1]:
# Load phyloseq object and phyloseq package
library(phyloseq)
load("../data/physeq.RData")

Jaccard

Jaccard dissimilarity is a simple dissimilarity metric, which is based solely on presence and absence of taxa:

Jaccard similarity is the the number of taxa that are shared in the two samples, divided by the number of taxa found in either of the two samples. Jaccard dissimilarity is then 1 minus jaccard similarity. Or more formally: \begin{equation*} D_{jaccard} = 1 - \frac {|A \cap B|} {|A \cup B|} \end{equation*} where A and B are the sets of taxa in the samples.

If the two samples have the exact same taxa, Jaccard dissimilarity will be 0. If the two samples do not share any taxa, Jaccard dissimilarity will be 1.

Note that as Jaccard is based on presence/absence of taxa, it is sensitive to the sequencing depth of the samples. If sequencing depths vary alot it is advised to rarefy before using the Jaccard dissimilarity, or in other ways account for the bias created by varying sequencing depths.

In R:

In [2]:
jac <- distance(phy, method="binary")

Let's look at the first five samples:

(the output is a dist object. It is easier to look at if we convert it to a matrix)

In [3]:
as.matrix(jac)[1:5, 1:5]
A matrix: 5 × 5 of type dbl
S1S2S3S4S5
S10.00000000.90384620.87755100.80487800.8222222
S20.90384620.00000000.90196080.93617020.9200000
S30.87755100.90196080.00000000.85714290.9600000
S40.80487800.93617020.85714290.00000000.8780488
S50.82222220.92000000.96000000.87804880.0000000

As expected, the diagonals are 0, because it is the Jaccard dissimilarity for each sample with itself. The remaining are quite high, showing that few taxa are shared across these samples.

Bray-Curtis

Bray-Curtis is a commonly used dissimilarity metric, which is based on the shared relative abundance of taxa between the samples.

It is calculated using the formula: \begin{equation*} D_{bray-curtis} = 1 - \frac {2*C_{ij}} {S_i + S_j} \end{equation*} where Si is the number of taxa in sample i, Sj is the number of taxa in sample j. To calculate Cij, all the shared taxa are found, then for each shared taxa the lowest relative abundance is found, and Cij is the sum of these relative abundances.

With Bray-Curtis, two samples can have a high dissimilarity even though they share most taxa, if the shared taxa have low relative abundance in any of the two samples.

Conversely, two samples can have a low Bray-Curtis dissimilarity even though they share few taxa, if these shared taxa have high relative abundance in both samples.

Bray-Curtis is therefore less sensitive to noise and sequencing depth than Jaccard dissimilarity.

In R:

In [4]:
bc <- distance(phy, method="bray")

Let's look at the first five samples:

In [5]:
as.matrix(bc)[1:5, 1:5]
A matrix: 5 × 5 of type dbl
S1S2S3S4S5
S10.00000000.73584040.84237440.73048510.6678875
S20.73584040.00000000.70138320.57596930.8606170
S30.84237440.70138320.00000000.43497180.9801876
S40.73048510.57596930.43497180.00000000.8956605
S50.66788750.86061700.98018760.89566050.0000000

UniFrac

UniFrac is popular dissimilarity metric, and it is based on the phylogenetic tree. There are two variants of UniFrac, an unweighted and a weighted version. The only difference between these is that the unweighted is based on presence/absence of taxa, and the weighted is incorporating the relative abundances of the taxa.

UniFrac measures dissimilarity by finding the total branch length of all taxa that not shared in the two samples, this is divided by the total branch length in the phylogenetic tree:

UniFrac example

In the weighted UniFrac, the branch lenghts are weighted by the relative abundances of the taxa.

With UniFrac, two samples can have a low dissimilarity even though no taxa are shared at all. But if the taxa in the two samples are closely related (close on the phylogenetic tree), the samples will be similar with the UniFrac.

Note that with unweighted UniFrac outliers can have a large influence. For example, a single Archaeal taxa in a tree with mostly Bacteria, will have a long branch length and thereby a large influence on the UniFrac dissimilarity.

Furthermore, UniFrac will change if the tree is rooted differently. Make sure the tree is correctly rooted before using the UniFrac dissimilarity.

In R:

In [6]:
uf <- UniFrac(phy, weighted=FALSE)

Let's look at the first five samples:

In [7]:
as.matrix(uf)[1:5, 1:5]
A matrix: 5 × 5 of type dbl
S1S2S3S4S5
S10.00000000.57010340.55226240.46285080.4256619
S20.57010340.00000000.62220920.58778810.6665934
S30.55226240.62220920.00000000.55753320.6027082
S40.46285080.58778810.55753320.00000000.6451262
S50.42566190.66659340.60270820.64512620.0000000

And the weighted UniFrac:

In [8]:
wuf <- UniFrac(phy, weighted=TRUE)
as.matrix(wuf)[1:5, 1:5]
A matrix: 5 × 5 of type dbl
S1S2S3S4S5
S10.00000000.32071400.361894970.318685360.2293554
S20.32071400.00000000.263718590.238807390.3286018
S30.36189500.26371860.000000000.095786850.3784816
S40.31868540.23880740.095786850.000000000.3481293
S50.22935540.32860180.378481600.348129270.0000000

It is apparent that the weighted UniFrac dissimilarities are much lower than all the other beta diversity metrics. Even though few taxa are shared across the samples, the most abundant taxa in the samples are closely related, and the samples are therefore phylogenetically similar.

Ordination

The usual way to visualize beta diversity is through ordination plots. PCoA is the most common ordination used for dissimilarity matrices and is explained in detail in the PCA, PCoA, PERMANOVA notebook.

In R:

There is a convinient function which does the ordination and plotting for us. Here we ordinate the unweighted UniFrac matrix and color by Time. The ellipses highlight the time points.

In [9]:
library(ggplot2)

wuf_ord <- ordinate(phy, "PCoA", "UniFrac", weighted=FALSE)

plot_ordination(phy, wuf_ord, type="samples", color="Time") +
    stat_ellipse()

PERMANOVA

PERMANOVA is the most used method to test statistical hypothesis on beta diversities. See more details in the PCA, PCoA, PERMANOVA notebook.

In R:

In [10]:
library(vegan)
adonis(jac ~ Time, data = data.frame(sample_data(phy)))
Indlæser krævet pakke: permute

Indlæser krævet pakke: lattice

This is vegan 2.5-7

Call:
adonis(formula = jac ~ Time, data = data.frame(sample_data(phy))) 

Permutation: free
Number of permutations: 999

Terms added sequentially (first to last)

           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
Time        2     4.320 2.16015  5.3231 0.06753  0.001 ***
Residuals 147    59.654 0.40581         0.93247           
Total     149    63.974                 1.00000           
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1