With a large set of samples we can infer co-occurrence and co-exclusion patterns between taxa. The interpretation of these associations depends on the experimental design. Say we include oral and fecal samples from human subjects, then the associations would largely be dominated by the environmental preference of the taxa, and the network would likely have two clusters; an oral cluster and a fecal cluster. These networks are often created to infer microbial interactions, but it is difficult to dicern between the effects of interactions on co-occurrence and the effect of shared environment (microbes preferring the same nutrients or aboitic condititions are likely to co-exist). Association networks can probably only be used to infer interactions if great care is taken in the experimental design. Read more here.
Networks can be based on correlations where we calculate the correlation between each pair of taxa, independent of all the other taxa. These correlations are not affected by which taxa we include in the network, except that the p-value multiple correction is affected by the number of included taxa.
Examples of correlation network methods:
With inverse covariance we seek to remove indirect associations. Say that taxa A has a positive effect on taxa B, and taxa B has a positive effect on taxa C, but taxa A has no effect on taxa C, and vice versa. Taxa A and C would look to be associated, but this is an indirect effect, mediated by taxa B. Inverse covariance seeks to remove these, and all associations are therefore contingent on which taxa are included in the network
Examples of inverse covariance methods:
library(phyloseq)
library(DAtest)
load("../data/physeq.RData")
# Agglomorate to Genus level
phy_genus <- tax_glom(phy, "Genus")
# Only keep genera present in at least 10 samples
phy_genus_pre <- preDA(phy_genus, min.samples = 10)
# Extract count table:
otu <- data.frame(otu_table(phy_genus_pre))
The simplest network is to run spearman correlations on all pairs of taxa on relative abundances.
Transform to relative abundances
otu_rel <- apply(otu, 2, function(x) x/sum(x))
Calculate correlations
net <- cor(t(otu_rel), method = "spearman")
Set redundant correlations to NA
net[lower.tri(net)] <- NA
Transform the correlation matrix to a data.frame:
net_df <- as.data.frame.table(net)
Remove correlations between similar taxa, and the redundant ones:
net_df <- net_df[net_df$Var1 != net_df$Var2 & !is.na(net_df$Freq), ]
Remove correlations below 0.6.
net_df <- net_df[abs(net_df$Freq) > 0.6, ]
This data.frame can then be used to plot the network in for example Gephi, CytoScape, or with ggnet2 in R
colnames(net_df) <- c("Source", "Target", "Cor")
net_df$Weight <- ifelse(net_df$Cor > 0, net_df$Cor, 0)
net_df$Sign <- ifelse(net_df$Cor > 0, "Positive", "Negative")
net_df
The weight can be used for clustering algorithms, for example Force Atlas in Gephi. In Gephi, negative weights are not supported, so we set all negative correlations to zero (they have no influence on clustering).
Proportionality is a correlation method specifically developed for compositional microbial data. It is included in the MicEco package and can be run directly on a phyloseq object:
library(MicEco)
net <- proportionality(phy_genus_pre)
We can then run the same steps as for the Spearman correlation:
net_df <- as.data.frame.table(net)
net_df <- net_df[net_df$Var1 != net_df$Var2, ]
net_df <- net_df[abs(net_df$Freq) > 0.6, ]
As for the Spearman correaltion, you can choose to just set an arbitrary correlation cut-off for including correlations, for example 0.6. It is also possible to calculate p-values through permutations if this is prefered.
colnames(net_df) <- c("Source", "Target", "Cor")
net_df$Weight <- ifelse(net_df$Cor > 0, net_df$Cor, 0)
net_df$Sign <- ifelse(net_df$Cor > 0, "Positive", "Negative")
net_df
SparCC is a correlation method which is using various compositional transformations and bootstrapping methods to give reliable correlations on compositional data. The original implemetation is in python, but it is included in the SPIEC-EASI R package.
library(SpiecEasi)
SparCC bootstraps (be patient..., it is best to use 1000 or 10000 bootstraps, but it takes too long for this example)
sparcc_boot <- sparccboot(t(otu), R = 100)
SparCC p-values
sparcc_pval <- pval.sparccboot(sparcc_boot)
Turn into a data.frame with taxa names
# Correlation data.frame
net <- matrix(NA, nrow(otu), nrow(otu))
net[upper.tri(net, diag=FALSE)] <- sparcc_pval$cors
rownames(net) <- colnames(net) <- rownames(otu)
net_df <- as.data.frame.table(net)
# P-value data.frame
netp <- matrix(NA, nrow(otu), nrow(otu))
netp[upper.tri(netp, diag=FALSE)] <- sparcc_pval$pval
rownames(netp) <- colnames(netp) <- rownames(otu)
netp_df <- as.data.frame.table(netp)
# Combine
sparcc_df <- merge(net_df, netp_df, by = c("Var1", "Var2"))
colnames(sparcc_df) <- c("Source", "Target", "Cor", "Pval")
# Remove redundant
sparcc_df <- sparcc_df[!is.na(sparcc_df$Cor), ]
Add adjusted p-values
sparcc_df$Pval_adj <- p.adjust(sparcc_df$Pval, method = "fdr")
Keep only significant
sparcc_df <- sparcc_df[sparcc_df$Pval_adj <= 0.05, ]
SPIEC-EASI is a method based on graphical model inference developed specifically for microbial sequence data. There are 3 steps in the SPIEC-EASI analysis: data transformation, graphical model inference, model selection.
SPIEC-EASI might at first seem much more complicated than the other methods, but it has a big strength in that one does not have to choose an arbitrary correlation or p-value cut-off for determning inclusion of edges, SPIEC-EASI tries to estimate this for us automatically.
Data are CLR-transformed, similar to what is done prior to calculating Proportionality.
The graphical model can be inferred through either neighborhood selection (mb) or inverse covariance selection (glasso). Inverse covariance is in itself simple to estimate, but since we are usually having many more taxa than samples, we cannot infer inverse covariance through the usual calculations. We therefore use methods for sparse inverse covariance estimation, relying on the assumption that most covariances are small or zero. The amount of sparsity is controlled with a lambda parameter. Both the mb and the glasso methods estimate models over a range of lambda values, and we need a way to pick the final model to use.
SPIEC-EASI uses the (bounded) stars criterion for model selection, to choose the lambda parameter which produces the best graphical model.
In R:
se <- spiec.easi(phy_genus_pre, method='glasso', lambda.min.ratio=1e-3, nlambda=30,
sel.criterion='bstars', pulsar.select=TRUE, pulsar.params=list(rep.num=50, seed=42))
# Get covariances and transform to correlations
netc <- cov2cor(as.matrix(getOptCov(se)))
rownames(netc) <- colnames(netc) <- taxa_names(phy_genus_pre)
netc_df <- as.data.frame.table(netc)
# Get chosen edges
net <- as.matrix(getRefit(se))
rownames(net) <- colnames(net) <- taxa_names(phy_genus_pre)
net_df <- as.data.frame.table(net)
# Combine
se_df <- merge(netc_df, net_df, by = c("Var1", "Var2"))
colnames(se_df) <- c("Source", "Target", "Cor", "Selected")
Subset to only the chosen edges
se_df <- se_df[se_df$Selected == 1, ]
Note that since the covariances are penalized, they are much smaller than from the other methods, and not directly comparable. They can, however, still be used as weights when plotting the network.
You can plot the networks as a heatmap with the pheatmap package:
library(pheatmap)
net <- proportionality(phy_genus_pre)
pheatmap(net)