Phyloseq operations¶

Phyloseq is a package for organizing and working with microbiome data in R. With the phyloseq package we can keep all our microbiome amplicon sequence data in a single R object. With functions from the phyloseq package, most common operations for preparing data for analysis are possible with a few simple commands.

This document is an overview of how phyloseq objects are organized and how they can be changed.

The paper presenting phyloseq: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0061217

Comprehensive documentation of the phyloseq package: https://joey711.github.io/phyloseq/

# Load package and phyloseq object
library(phyloseq)
load("../data/physeq.RData")
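
The rest of this document uses an object called phy, which is assumed to be the name stored inside physeq.RData. Since load() restores objects under the names they were saved with, it can help to check what was actually loaded. The sketch below illustrates this with a temporary file and a toy data.frame (both are stand-ins, not part of the tutorial data):

```r
# Save a toy object to a temporary .RData file (stand-in for physeq.RData)
toy <- data.frame(x = 1:3)
tmp <- tempfile(fileext = ".RData")
save(toy, file = tmp)
rm(toy)

# load() invisibly returns the names of the objects it restored
loaded_names <- load(tmp)
loaded_names

# The object is available again under its original name
exists("toy")
```

For the tutorial file, the same pattern would be loaded_names <- load("../data/physeq.RData").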

Subset samples ¶

We can subset the samples with the subset_samples function. We can subset based on any column in the sample_data:

sample_variables(phy)

The first argument to subset_samples() is the phyloseq object we want to subset. The second argument tells the function how to subset. Here we get the 1 week (1w) samples (always use two = signs, ==, when testing for equality):

phy_1w <- subset_samples(phy, Time == "1w")

Now phy_1w contains only the 50 1-week samples:

phy_1w

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 50 samples ]
sample_data() Sample Data:       [ 50 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]

We can also subset both 1 week and 1 month samples:

phy_1w1m <- subset_samples(phy, Time %in% c("1w", "1m"))
phy_1w1m

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 100 samples ]
sample_data() Sample Data:       [ 100 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]

We can also subset on different variables at the same time. Here we only take 1 week samples from children born by C-section (use & for "and" and | for "or"):

phy_1wS <- subset_samples(phy, Time == "1w" & Delivery == "Sectio")
phy_1wS

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 25 samples ]
sample_data() Sample Data:       [ 25 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]
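
As far as I can tell from the phyloseq source, subset_samples() passes its condition to base R's subset() on the sample data, so the operators behave as they do on an ordinary data.frame. The sketch below is a toy illustration with made-up metadata (the sample names S1 to S6 and their values are assumptions), not the tutorial dataset:

```r
# Toy metadata mimicking the Time and Delivery columns (values are made up)
meta <- data.frame(
  Time      = c("1w", "1w", "1m", "1m", "1y", "1y"),
  Delivery  = c("Sectio", "Vaginal", "Sectio", "Vaginal", "Sectio", "Vaginal"),
  row.names = paste0("S", 1:6)
)

# == compares against a single value
one_week <- subset(meta, Time == "1w")

# %in% matches any of several values
early <- subset(meta, Time %in% c("1w", "1m"))

# & requires both conditions, | requires at least one of them
week_and_sectio <- subset(meta, Time == "1w" & Delivery == "Sectio")
week_or_sectio  <- subset(meta, Time == "1w" | Delivery == "Sectio")

rownames(week_and_sectio)
rownames(week_or_sectio)
```

Note how | returns more samples than &: every 1 week sample plus every sectio sample from the other time points.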

Handling NAs¶

If you have NAs, you will often encounter problems when subsetting, so it's often a good idea to remove the samples with NAs before subsetting further. Below we assume that the delivery mode is unknown (NA) for some of the children:

phy_nona <- subset_samples(phy, !is.na(Delivery))
phy_sectio <- subset_samples(phy_nona, Delivery == "Sectio")

This can also be done in one line:

phy_sectio <- subset_samples(phy, !is.na(Delivery) & Delivery == "Sectio")
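
To see why NAs can get in the way, it helps to look at how R compares values with NA. The sketch below uses a made-up Delivery vector (the sample names and values are assumptions) and plain base R, without any phyloseq object:

```r
# Made-up delivery modes; S2 is unknown
delivery <- c(S1 = "Sectio", S2 = NA, S3 = "Vaginal", S4 = "Sectio")

# A comparison with NA is neither TRUE nor FALSE; it stays NA
is_sectio <- delivery == "Sectio"
is_sectio

# Guarding with !is.na() turns the NA into a definite FALSE
is_sectio_safe <- !is.na(delivery) & delivery == "Sectio"
is_sectio_safe

# Indexing with a logical vector that contains NA yields an extra NA element
delivery[is_sectio]
delivery[is_sectio_safe]
```

The guarded version gives a clean TRUE/FALSE vector, which is what functions that expect a logical vector (such as prune_samples()) need.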

Checking your output¶

It's always a good idea to check that you get the expected output. We can use the table() function to count the number of samples in each group. First look at the original phyloseq:

with(sample_data(phy), table(Time, Delivery))

    Delivery
Time Sectio Vaginal
  1m     25      25
  1w     25      25
  1y     25      25

Let's look at the sectio subset made above, and ensure that we only have sectio samples:

with(sample_data(phy_sectio), table(Time, Delivery))

    Delivery
Time Sectio
  1m     25
  1w     25
  1y     25

Prune samples ¶

We can also subset samples based on how many reads each sample has. sample_sums(phy) outputs the number of reads for each sample. Here we keep the samples that have more than 5000 reads, and we can see that 10 samples have been thrown away:

phy_5k <- prune_samples(sample_sums(phy) > 5000, phy)
phy_5k

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 140 samples ]
sample_data() Sample Data:       [ 140 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]
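
Before pruning, it can be useful to count how many samples pass the threshold. The first argument to prune_samples() is a logical vector (or a vector of sample names) saying which samples to keep. A minimal sketch with made-up read depths standing in for sample_sums(phy):

```r
# Made-up read depths for six samples (stand-in for sample_sums(phy))
depth <- c(S1 = 12000, S2 = 4800, S3 = 25000, S4 = 5000, S5 = 900, S6 = 7300)

# Logical vector: TRUE for samples that would be kept
keep <- depth > 5000

# How many samples are kept (TRUE) and dropped (FALSE)
table(keep)

# Names of the samples that would be dropped
dropped <- names(depth)[!keep]
dropped
```

Note that S4, with exactly 5000 reads, is dropped because > is a strict comparison; use >= to keep it.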

Subset taxa ¶

In the same way as we can subset samples, we can also subset taxa. For example, to keep only Firmicutes:

phy_1wfirms <- subset_taxa(phy_1w, Phylum == "Firmicutes")

We can subset based on all the different taxonomic ranks:

rank_names(phy)

Notice that we ran the above subset command on the phy_1w object that we created earlier. Now we only have 1 week samples and only Firmicutes ASVs. We can chain the different subsetting commands together to get exactly the subset of samples and taxa we want.

Prune taxa ¶

We can also prune taxa by how abundant they are. A convenient function to do this is ps_prune from the MicEco package. Load the package first:

library(MicEco)

We can filter low-abundance taxa based on three criteria:

They should be present in a minimum number of samples (min.samples)
They should have a minimum number of reads (min.reads)
They should have a minimum average relative abundance (min.abundance)

You don't have to use all three criteria. The filtered taxa are grouped into a new taxon called "Others".

Below we only want taxa that:

are present in at least 5 samples
have a total of at least 10 reads

phy_abund <- ps_prune(phy, min.samples = 5, min.reads = 10)
phy_abund

985 features grouped as 'Others' in the output

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 326 taxa and 150 samples ]
sample_data() Sample Data:       [ 150 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 326 taxa by 7 taxonomic ranks ]
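
To make the two criteria concrete, here is a base R sketch that applies them by hand to a toy count matrix and pools the removed taxa into an "Others" row. This only illustrates the idea; it is not the ps_prune implementation, and the matrix values are made up:

```r
# Toy count matrix: taxa in rows, samples in columns (values are made up)
counts <- matrix(
  c(10, 0, 3,  5, 2, 8,   # ASV1: present in 5 samples, 28 reads
     0, 0, 0, 40, 0, 0,   # ASV2: present in 1 sample, 40 reads
     1, 1, 1,  1, 1, 0,   # ASV3: present in 5 samples, 5 reads
     7, 9, 4,  6, 3, 2),  # ASV4: present in 6 samples, 31 reads
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), paste0("S", 1:6))
)

# Criterion 1: number of samples in which each taxon is present
n_samples_present <- rowSums(counts > 0)

# Criterion 2: total number of reads for each taxon
n_reads <- rowSums(counts)

# Keep taxa that satisfy both criteria
keep <- n_samples_present >= 5 & n_reads >= 10

# Pool the removed taxa into one "Others" row so sample totals are unchanged
pruned <- rbind(
  counts[keep, , drop = FALSE],
  Others = colSums(counts[!keep, , drop = FALSE])
)
pruned
```

Pooling rather than discarding the filtered taxa keeps the total read count per sample intact, which matters if relative abundances are calculated afterwards.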

Note on pruning taxa: There are, unfortunately, no standards on how to set the thresholds when pruning low-abundance ASVs. It is usually done before differential abundance analyses to lower the number of features tested. The thresholds depend on the dataset and the hypotheses you want to test. Pruning low-abundance taxa is usually not done prior to alpha- or beta-diversity analyses.

Transform abundance ¶

Amplicon data is relative data: the number of reads in a sample reflects sequencing depth, not the absolute amount of bacteria in the sample.

Most of the time, we therefore want to transform or normalize the raw read counts. We can transform abundances with transform_sample_counts(). We have to give it a function that tells it how to transform the abundances of each sample. The simplest way to do this is relative abundance (everything sums to one):

phy_rel <- transform_sample_counts(phy, function(x) x/sum(x))

Let's look at the first 5 ASVs and the first 10 samples. Now the otu_table contains relative abundances:

otu_table(phy_rel)[1:5, 1:10]

and the sum for each sample is 1 (100%):

sample_sums(phy_rel)
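
The function x/sum(x) is applied to each sample separately. The base R sketch below does the same on a toy count matrix (values are made up) and shows one pitfall: a sample with zero reads gives NaN, because 0/0 is undefined.

```r
# Toy counts: taxa in rows, samples in columns; S3 has no reads at all
counts <- matrix(
  c(80, 20,  0,   # S1
     5,  5, 90,   # S2
     0,  0,  0),  # S3
  nrow = 3,
  dimnames = list(paste0("ASV", 1:3), paste0("S", 1:3))
)

# Divide every column (sample) by its own total
rel <- apply(counts, 2, function(x) x / sum(x))
rel

# Samples with reads sum to 1; the empty sample is NaN
colSums(rel)

# Dropping empty samples first avoids the NaN values
counts_ok <- counts[, colSums(counts) > 0, drop = FALSE]
rel_ok <- apply(counts_ok, 2, function(x) x / sum(x))
colSums(rel_ok)
```

With a phyloseq object, empty samples could be removed beforehand with the prune_samples() pattern shown earlier, using a threshold of 0.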

New variables ¶

We often want to make new variables based on a single existing variable or a combination of them. We might want to force a continuous variable into a binary one, such as low/high BMI. Or we might combine variables, such as making a "Pet" variable that is TRUE if either the "Cat" or "Dog" variable is TRUE. How you code your variables depends on the hypothesis. Below are some examples of making new variables.

There are three steps in making a new variable:

Extract sample_data to a data.frame
Add the new variable(s)
Put the new sample_data back into the phyloseq object

# First step:
metadata <- data.frame(sample_data(phy))

Combine levels of the same factor¶

# Here we make a new Time variable where we combine the 1w and 1m samples to a level we call "Early"
# ifelse takes three arguments: A logical, what to return if the logical is TRUE, what to return if the logical is FALSE
metadata$Time_new <- ifelse(metadata$Time == "1y", "Late", "Early")

Combine levels of different factors¶

# Here we combine the Time and Delivery variable to make a new variable.
metadata$New_variable <- ifelse(metadata$Time == "1y" & metadata$Delivery == "Sectio", "1y sectio", "Not 1y sectio")

Continuous variable to binary variable¶

# As there are no continuous variables in this example, I use the total read counts instead
metadata$Reads_binary <- ifelse(sample_sums(phy) > 10000, "High", "Low")

Continuous variable to categorical variable¶

# Here we nest the ifelse() functions, so if the first logical is TRUE, then it is run through another ifelse()
metadata$Reads_cat <- ifelse(sample_sums(phy) > 10000, ifelse(sample_sums(phy) > 20000, "Very high", "High"), "Low")
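
Nested ifelse() calls become hard to read with more than a few categories. As an alternative sketch, base R's cut() bins a numeric vector in one call. The read counts below are made up and stand in for sample_sums(phy):

```r
# Made-up read counts (stand-in for sample_sums(phy))
reads <- c(S1 = 4500, S2 = 10000, S3 = 15000, S4 = 20000, S5 = 32000)

# Nested ifelse(), as in the example above
nested <- ifelse(reads > 10000,
                 ifelse(reads > 20000, "Very high", "High"),
                 "Low")

# cut() with the same breakpoints; intervals are right-closed by default,
# so exactly 10000 is "Low" and exactly 20000 is "High", matching the > tests
binned <- cut(reads,
              breaks = c(-Inf, 10000, 20000, Inf),
              labels = c("Low", "High", "Very high"))

# Both approaches give the same categories
data.frame(reads, nested, binned)
```

cut() returns a factor whose levels are ordered Low, High, Very high, which is often convenient for plotting.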

Put back into phyloseq¶

sample_data(phy) <- sample_data(metadata)

Tax agglomeration ¶

It is often necessary to group counts of ASVs according to higher taxonomic levels, for example if we want to know how abundant different genera are, or we want to plot the most abundant phyla. We use the tax_glom function to do this.

Here we agglomerate to Phylum level:

phy_phylum <- tax_glom(phy, "Phylum")
phy_phylum

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 17 taxa and 150 samples ]
sample_data() Sample Data:       [ 150 samples by 7 sample variables ]
tax_table()   Taxonomy Table:    [ 17 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 17 tips and 16 internal nodes ]
refseq()      DNAStringSet:      [ 17 reference sequences ]

We see in the output that we have 17 taxa. This is because we have 17 different phyla. Let's see what the otu_table looks like (only the first 10 samples):

otu_table(phy_phylum)[, 1:10]

Note: The ASVs are not the same as before. You can see what the new "Phylum-ASVs" correspond to in the tax_table:

tax_table(phy_phylum)
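
Conceptually, agglomeration sums the counts of all ASVs that share the same taxonomy at the chosen rank. The base R sketch below shows this with rowsum() on a toy matrix (values and phylum assignments are made up). It is a simplification: tax_glom() also keeps the tax_table, tree and sequences in sync, and by default (NArm = TRUE) it drops taxa that are unclassified at the chosen rank, which the sketch imitates by removing the NA row.

```r
# Toy counts: four ASVs in three samples (values are made up)
counts <- matrix(
  c( 5, 0, 2,   # ASV1
     3, 1, 0,   # ASV2
    10, 4, 6,   # ASV3
     1, 1, 1),  # ASV4
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), paste0("S", 1:3))
)

# Phylum of each ASV; ASV4 is unclassified at Phylum level
phylum <- c("Firmicutes", "Firmicutes", "Bacteroidota", NA)

# Drop unclassified ASVs, then sum the counts within each phylum
classified <- !is.na(phylum)
agg <- rowsum(counts[classified, , drop = FALSE], group = phylum[classified])
agg
```

The result has one row per phylum, and the reads of the unclassified ASV are no longer included in the sample totals.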

Plotting abundances ¶

Let's put together some different functions to plot the most abundant families in the 1 week samples.

Agglomerate to families:

phy_fam <- tax_glom(phy_1w, "Family")

Transform to relative abundance:

phy_fam_rel <- transform_sample_counts(phy_fam, function(x) x/sum(x))

Filter low-abundance taxa:

phy_fam_rel_abund <- ps_prune(phy_fam_rel, min.abundance = 0.03)

113 features grouped as 'Others' in the output

We can transform the whole phyloseq object into a data.frame useful for plotting:

phy_df <- psmelt(phy_fam_rel_abund)
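
psmelt() returns a "long" data.frame with one row per taxon and sample, with columns such as OTU, Sample and Abundance plus the sample variables and taxonomic ranks. The base R sketch below shows the same reshaping idea on a toy relative-abundance matrix (the family names and values are made up); it is an illustration of the format, not of how psmelt() is implemented:

```r
# Toy relative abundances: families in rows, samples in columns
rel <- matrix(
  c(0.7, 0.3,   # S1
    0.4, 0.6),  # S2
  nrow = 2,
  dimnames = list(Family = c("FamA", "FamB"), Sample = c("S1", "S2"))
)

# Reshape to long format: one row per Family/Sample combination
long <- as.data.frame(as.table(rel), responseName = "Abundance")
long
```

This one-row-per-observation layout is what ggplot2 expects, which is why the melted data.frame can be passed directly to ggplot().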

Plot bar chart¶

library(ggplot2)

geom_bar makes a bar chart
fill tells ggplot how to color the bars
All the filtered taxa (grouped as "Others" by ps_prune) are shown as NA

p <- ggplot(phy_df, aes(x = Sample, y = Abundance, fill = Family)) +
  theme_bw() +
  geom_bar(stat = "identity")
p

We can use facets to split the plot by Delivery mode. We also rotate the x-axis labels and make them smaller.

p <- ggplot(phy_df, aes(x = Sample, y = Abundance, fill = Family)) +
  theme_bw() +
  geom_bar(stat = "identity") +
  facet_grid(~ Delivery, space = "free", scales = "free") +
  theme(axis.text.x = element_text(angle=90, size=6))
p

Relative abundances (excerpt of the otu_table of phy_rel, samples S1 and S4):
	S1	S4
dc467f0f8b8aa389aa106d751bb9a569	0.000000000	0.000000000
c387bc64fb22cd96d2b79dbfa932ce1e	0.000000000	0.000000000
42a23e6f4764f572f4d7c6d8e08769c3	0.000000000	0.000000000
2ea17744c7eeab459b7f41d4f9e22894	0.001002609	0.007007981
332ef16f5660bfe8ecaabda3404fc08b	0.000000000	0.000000000

Output of otu_table(phy_phylum)[, 1:10]:
	S1	S2	S3	S4	S5	S6	S7	S8	S9	S10
2ea17744c7eeab459b7f41d4f9e22894	83	37	0	468	0	0	0	0	0	0
e5199a623272b9b25c65f0455a1cd77b	0	0	0	0	0	0	0	0	0	0
4c304a27bc0520a7c398410713645502	0	0	0	0	0	0	0	0	0	0
6ec6d03fbef9f16e3581ccdc60e7d266	10687	24637	61061	41594	483	13500	2625	12337	7375	16225
7e8a6b8b1cad81e2fb27e397921a3c3b	0	0	0	0	0	0	0	0	0	0
7c928c5109b32c792d73dce9122b80a9	0	0	0	0	0	0	0	0	0	0
8600bbb0e5ffe0a260abd39547d07c68	0	0	0	0	0	0	0	0	0	0
b2495dec275b068c7545b642c4322cd7	0	0	0	0	0	0	0	0	0	0
0eca810e771f78df0bf7f7f92dc873f0	0	0	0	0	0	0	0	0	0	0
98ca3e41d8d589d9d94aad956b84e054	0	0	0	0	0	0	0	0	0	270
4e8b51098d3b598eadf069dcc96e81aa	16312	57879	6498	11113	60790	37404	13393	11109	10451	42325
3d53a81dc0bd2aed0641d255cbf060a3	0	0	0	0	4	0	0	0	5	13
08b21f8171b35ef001832ba1df9b8fbc	1545	2904	6687	14	5915	365	282	0	72	2621
cde00646e8aecf8aaac49a9bb9c96729	11951	3512	1110	3763	5576	349	1673	4818	3294	1652
b6b05223adf86d071fd279f79dc2533c	0	0	0	0	0	0	0	0	0	0
d20d3658de331c939f54d8acaed7c4c1	35010	1740	1253	9225	20319	4120	3846	21955	20354	2112
8c3ff6c4c4b125d5e72f5276d57be4d0	7196	258	10351	604	244	3747	1485	13519	9942	528

Output of tax_table(phy_phylum):
	Kingdom	Phylum	Class	Order	Family	Genus	Species
2ea17744c7eeab459b7f41d4f9e22894	Bacteria	Fusobacteriota	NA	NA	NA	NA	NA
e5199a623272b9b25c65f0455a1cd77b	Bacteria	Deinococcota	NA	NA	NA	NA	NA
4c304a27bc0520a7c398410713645502	Bacteria	Cyanobacteriota	NA	NA	NA	NA	NA
6ec6d03fbef9f16e3581ccdc60e7d266	Bacteria	Actinobacteriota	NA	NA	NA	NA	NA
7e8a6b8b1cad81e2fb27e397921a3c3b	Bacteria	Myxococcota	NA	NA	NA	NA	NA
7c928c5109b32c792d73dce9122b80a9	Bacteria	Chloroflexota	NA	NA	NA	NA	NA
8600bbb0e5ffe0a260abd39547d07c68	Bacteria	Acidobacteriota	NA	NA	NA	NA	NA
b2495dec275b068c7545b642c4322cd7	Bacteria	Planctomycetota	NA	NA	NA	NA	NA
0eca810e771f78df0bf7f7f92dc873f0	Bacteria	Patescibacteria	NA	NA	NA	NA	NA
98ca3e41d8d589d9d94aad956b84e054	Bacteria	Desulfobacterota	NA	NA	NA	NA	NA
4e8b51098d3b598eadf069dcc96e81aa	Bacteria	Proteobacteria	NA	NA	NA	NA	NA
3d53a81dc0bd2aed0641d255cbf060a3	Bacteria	Campylobacterota	NA	NA	NA	NA	NA
08b21f8171b35ef001832ba1df9b8fbc	Bacteria	Firmicutes_C	NA	NA	NA	NA	NA
cde00646e8aecf8aaac49a9bb9c96729	Bacteria	Firmicutes	NA	NA	NA	NA	NA
b6b05223adf86d071fd279f79dc2533c	Bacteria	Verrucomicrobiota	NA	NA	NA	NA	NA
d20d3658de331c939f54d8acaed7c4c1	Bacteria	Bacteroidota	NA	NA	NA	NA	NA
8c3ff6c4c4b125d5e72f5276d57be4d0	Bacteria	Firmicutes_A	NA	NA	NA	NA	NA
