Phyloseq is a package made for organizing and working with microbiome data in R. With the phyloseq package we can have all our microbiome amplicon sequence data in a single R object. With functions from the phyloseq package, most common operations for preparing data for analysis is possible with few simple commands.
This document is an overview on how phyloseq objects are organized and how they can be changed.
The paper presenting phyloseq: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0061217
A comprehensive documetation of the phyloseq package: https://joey711.github.io/phyloseq/
# Load package and phyloseq object
library(phyloseq)
load("../data/physeq.RData")
We can subset the samples with the subset_samples function. We can subset based on any column in the sample_data:
sample_variables(phy)
First argument to the subset_samples() function is the phyloseq object we want to subset. In the second argument we tell the function how to subset. Here we get the 1 week (1w) samples (always use two = signs)
phy_1w <- subset_samples(phy, Time == "1w")
Now we only have the 50 1 week samples in phy_1w
phy_1w
We can also subset both 1 week and 1 month samples:
phy_1w1m <- subset_samples(phy, Time %in% c("1w", "1m"))
phy_1w1m
We can also subset on different variables at the same time. Here we only take 1 week samples from children born by C-section (Use & for and, use | for or):
phy_1wS <- subset_samples(phy, Time == "1w" & Delivery == "Sectio")
phy_1wS
If you have NAs, you will often encounter problems when subsetting, so it's often a good idea to remove those with NAs before subsetting further. Below we assume that for some of the children the delivery mode is unknown (NA):
phy_nona <- subset_samples(phy, !is.na(Delivery))
phy_sectio <- subset_samples(phy_nona, Delivery == "Sectio")
This can also be done in one line:
phy_sectio <- subset_samples(phy, !is.na(Delivery) & Delivery == "Sectio")
It's always a good idea to check that you get the expected output. We can use the table() function to count the number of samples in each group. First look at the original phyloseq:
with(sample_data(phy), table(Time, Delivery))
Let's look at the sectio subset made above, and ensure that we only have sectio samples:
with(sample_data(phy_sectio), table(Time, Delivery))
We can also subset samples based on how many reads each sample have. sample_sums(phy) outputs the number of reads for each sample. Here we subset samples that have more than 5000 reads, and we can see that 10 samples have been thrown away:
phy_5k <- prune_samples(sample_sums(phy) > 5000, phy)
phy_5k
In the same way as we can subset samples, we can also subset taxa. E.g. only Firmicutes:
phy_1wfirms <- subset_taxa(phy_1w, Phylum == "Firmicutes")
We can subset based on all the different taxonomic ranks:
rank_names(phy)
Notice that we ran the above subset command on the phy_1w object that we created earlier. Now we only have 1 week samples and only Firmicutes ASVs. We can chain together all the different subsetting commands together to get exactly the subset of samples and taxa we want.
We can also prune taxa by how abundant they are. A convenient function to do this is ps_prune from the MicEco package. Load the package first:
library(MicEco)
We can filter low abundant taxa based on three criteria:
You don't have to use all three criteria. The filtered taxa are grouped in a new taxa called "Others".
Below we only want taxa that are:
phy_abund <- ps_prune(phy, min.samples = 5, min.reads = 10)
phy_abund
Note on pruning taxa: There are, unfortunately, no standards on how to set the thresholds when pruning low abundant ASVs. It is usually done before differential abundance analyses to lower the number of features tested. The thresholds depends on the dataset and the hypotheses you want to test. Pruning low abundant taxa is usually not done prior to alpha or beta-diversity analyses.
Amplicon data is relative data
Most of the time, we therefore want to transform or normalize the raw read counts. We can transform abundances with transform_sample_counts(). We have to give it a function which tells it how to transform the abundance for each sample. The most simple way to do this is relative abundance (everything sums to one):
phy_rel <- transform_sample_counts(phy, function(x) x/sum(x))
Let's look at the first 5 ASVs and 10 first samples. Now the otu_table contains relative abundances:
otu_table(phy_rel)[1:5, 1:10]
and the sum for each sample is 1 (100%):
sample_sums(phy_rel)
We often want to make new variables based on a single or a combination of existing variables. We might want to force a continuous variable into a binary, such as low/high BMI. Or combine variables, such as making a "Pet" variable if either the "Cat" or "Dog" variable is TRUE. How you "code" you variables depend on the hypothesis. Below are some examples of making new variables.
There are three steps in making a new variable:
# First step:
metadata <- data.frame(sample_data(phy))
# Here we make a new Time variable where we combine the 1w and 1m samples to a level we call "Early"
# ifelse takes three arguments: A logical, what to return if the logical is TRUE, what to return if the logical is FALSE
metadata$Time_new <- ifelse(metadata$Time == "1y", "Late", "Early")
# Here we combine the Time and Delivery variable to make a new variable.
metadata$New_variable <- ifelse(metadata$Time == "1y" & metadata$Delivery == "Sectio", "1y sectio", "Not 1y sectio")
# As there are no continuous variables in this example, I use the total read counts instead
metadata$Reads_binary <- ifelse(sample_sums(phy) > 10000, "High", "Low")
# Here we nest the ifelse() functions, so if the first logical is TRUE, then it is run through another ifelse()
metadata$Reads_cat <- ifelse(sample_sums(phy) > 10000, ifelse(sample_sums(phy) > 20000, "Very high", "High"), "Low")
sample_data(phy) <- sample_data(metadata)
It is often necessary to group counts of ASVs according to higher taxonomic levels. E.g. if we want to know how abundant different genera are, or we want to plot the most abundant phyla. We use the tax_glom function to do this
Here we agglomorate to Phylum level:
phy_phylum <- tax_glom(phy, "Phylum")
phy_phylum
We see in the output that we have 17 taxa. This is because we have 17 different phyla. Let's see how the otu_table looks like (only 10 first samples):
otu_table(phy_phylum)[, 1:10]
Note: The ASVs are not the same as before. You can see what the new "Phylum-ASVs" correspond to in the tax_table:
tax_table(phy_phylum)
Let's put together some different functions to plot the most abundant families in the 1 week samples
Agglomorate to familes:
phy_fam <- tax_glom(phy_1w, "Family")
Transform to relative abundance:
phy_fam_rel <- transform_sample_counts(phy_fam, function(x) x/sum(x))
Filter low abundant:
phy_fam_rel_abund <- ps_prune(phy_fam_rel, min.abundance = 0.03)
We can transform the whole phyloseq object into a data.frame useful for plotting:
phy_df <- psmelt(phy_fam_rel_abund)
library(ggplot2)
p <- ggplot(phy_df, aes(x = Sample, y = Abundance, fill = Family)) +
theme_bw() +
geom_bar(stat = "identity")
p
We can use facet's to split the plot depending on Delivery mode. And we angle the x labels and make them smaller.
p <- ggplot(phy_df, aes(x = Sample, y = Abundance, fill = Family)) +
theme_bw() +
geom_bar(stat = "identity") +
facet_grid(~ Delivery, space = "free", scales = "free") +
theme(axis.text.x = element_text(angle=90, size=6))
p