Phyloseq operations

Phyloseq is a package made for organizing and working with microbiome data in R. With the phyloseq package we can have all our microbiome amplicon sequence data in a single R object. With functions from the phyloseq package, most common operations for preparing data for analysis is possible with few simple commands.

This document is an overview on how phyloseq objects are organized and how they can be changed.

The paper presenting phyloseq: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0061217

A comprehensive documetation of the phyloseq package: https://joey711.github.io/phyloseq/

In [1]:
# Load package and phyloseq object
library(phyloseq)
load("../data/physeq.RData")

Subset samples

We can subset the samples with the subset_samples function. We can subset based on any column in the sample_data:

In [2]:
sample_variables(phy)
  1. 'Patient'
  2. 'Time'
  3. 'Delivery'

First argument to the subset_samples() function is the phyloseq object we want to subset. In the second argument we tell the function how to subset. Here we get the 1 week (1w) samples (always use two = signs)

In [3]:
phy_1w <- subset_samples(phy, Time == "1w")

Now we only have the 50 1 week samples in phy_1w

In [4]:
phy_1w
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 50 samples ]
sample_data() Sample Data:       [ 50 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]

We can also subset both 1 week and 1 month samples:

In [5]:
phy_1w1m <- subset_samples(phy, Time %in% c("1w", "1m"))
phy_1w1m
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 100 samples ]
sample_data() Sample Data:       [ 100 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]

We can also subset on different variables at the same time. Here we only take 1 week samples from children born by C-section (Use & for and, use | for or):

In [6]:
phy_1wS <- subset_samples(phy, Time == "1w" & Delivery == "Sectio")
phy_1wS
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 25 samples ]
sample_data() Sample Data:       [ 25 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]

Handling NAs

If you have NAs, you will often encounter problems when subsetting, so it's often a good idea to remove those with NAs before subsetting further. Below we assume that for some of the children the delivery mode is unknown (NA):

In [7]:
phy_nona <- subset_samples(phy, !is.na(Delivery))
phy_sectio <- subset_samples(phy_nona, Delivery == "Sectio")

This can also be done in one line:

In [9]:
phy_sectio <- subset_samples(phy, !is.na(Delivery) & Delivery == "Sectio")

Checking your output

It's always a good idea to check that you get the expected output. We can use the table() function to count the number of samples in each group. First look at the original phyloseq:

In [10]:
with(sample_data(phy), table(Time, Delivery))
    Delivery
Time Sectio Vaginal
  1m     25      25
  1w     25      25
  1y     25      25

Let's look at the sectio subset made above, and ensure that we only have sectio samples:

In [11]:
with(sample_data(phy_sectio), table(Time, Delivery))
    Delivery
Time Sectio
  1m     25
  1w     25
  1y     25

Prune samples

We can also subset samples based on how many reads each sample have. sample_sums(phy) outputs the number of reads for each sample. Here we subset samples that have more than 5000 reads, and we can see that 10 samples have been thrown away:

In [12]:
phy_5k <- prune_samples(sample_sums(phy) > 5000, phy)
phy_5k
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 140 samples ]
sample_data() Sample Data:       [ 140 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]

Subset taxa

In the same way as we can subset samples, we can also subset taxa. E.g. only Firmicutes:

In [13]:
phy_1wfirms <- subset_taxa(phy_1w, Phylum == "Firmicutes")

We can subset based on all the different taxonomic ranks:

In [14]:
rank_names(phy)
  1. 'Kingdom'
  2. 'Phylum'
  3. 'Class'
  4. 'Order'
  5. 'Family'
  6. 'Genus'
  7. 'Species'

Notice that we ran the above subset command on the phy_1w object that we created earlier. Now we only have 1 week samples and only Firmicutes ASVs. We can chain together all the different subsetting commands together to get exactly the subset of samples and taxa we want.

Prune taxa

We can also prune taxa by how abundant they are. A convenient function to do this is ps_prune from the MicEco package. Load the package first:

In [15]:
library(MicEco)

We can filter low abundant taxa based on three criteria:

  • They should be present in a minimum amount of samples (min.samples)
  • They should have a minimum amount of reads (min.reads)
  • They should have a minimum average relative abundance (min.abundance)

You don't have to use all three criteria. The filtered taxa are grouped in a new taxa called "Others".

Below we only want taxa that are:

  • at least present in 5 samples
  • at least have a total of 10 reads
In [16]:
phy_abund <- ps_prune(phy, min.samples = 5, min.reads = 10)
phy_abund
985 features grouped as 'Others' in the output

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 326 taxa and 150 samples ]
sample_data() Sample Data:       [ 150 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 326 taxa by 7 taxonomic ranks ]

Note on pruning taxa: There are, unfortunately, no standards on how to set the thresholds when pruning low abundant ASVs. It is usually done before differential abundance analyses to lower the number of features tested. The thresholds depends on the dataset and the hypotheses you want to test. Pruning low abundant taxa is usually not done prior to alpha or beta-diversity analyses.

Transform abundance

Amplicon data is relative data

Most of the time, we therefore want to transform or normalize the raw read counts. We can transform abundances with transform_sample_counts(). We have to give it a function which tells it how to transform the abundance for each sample. The most simple way to do this is relative abundance (everything sums to one):

In [17]:
phy_rel <- transform_sample_counts(phy, function(x) x/sum(x))

Let's look at the first 5 ASVs and 10 first samples. Now the otu_table contains relative abundances:

In [18]:
otu_table(phy_rel)[1:5, 1:10]
A otu_table: 5 × 10 of type dbl
S1S2S3S4S5S6S7S8S9S10
dc467f0f8b8aa389aa106d751bb9a5690.000000000000.000000000000000
c387bc64fb22cd96d2b79dbfa932ce1e0.000000000000.000000000000000
42a23e6f4764f572f4d7c6d8e08769c30.000000000000.000000000000000
2ea17744c7eeab459b7f41d4f9e228940.001002609000.007007981000000
332ef16f5660bfe8ecaabda3404fc08b0.000000000000.000000000000000

and the sum for each sample is 1 (100%):

In [19]:
sample_sums(phy_rel)
S1
1
S2
1
S3
1
S4
1
S5
1
S6
1
S7
1
S8
1
S9
1
S10
1
S11
1
S12
1
S13
1
S14
1
S15
1
S16
1
S17
1
S18
1
S19
1
S20
1
S21
1
S22
1
S23
1
S24
1
S25
1
S26
1
S27
1
S28
1
S29
1
S30
1
S31
1
S32
1
S33
1
S34
1
S35
1
S36
1
S37
1
S38
1
S39
1
S40
1
S41
1
S42
1
S43
1
S44
1
S45
1
S46
1
S47
1
S48
1
S49
1
S50
1
S51
1
S52
1
S53
1
S54
1
S55
1
S56
1
S57
1
S58
1
S59
1
S60
1
S61
1
S62
1
S63
1
S64
1
S65
1
S66
1
S67
1
S68
1
S69
1
S70
1
S71
1
S72
1
S73
1
S74
1
S75
1
S76
1
S77
1
S78
1
S79
1
S80
1
S81
1
S82
1
S83
1
S84
1
S85
1
S86
1
S87
1
S88
1
S89
1
S90
1
S91
1
S92
1
S93
1
S94
1
S95
1
S96
1
S97
1
S98
1
S99
1
S100
1
S101
1
S102
1
S103
1
S104
1
S105
1
S106
1
S107
1
S108
1
S109
1
S110
1
S111
1
S112
1
S113
1
S114
1
S115
1
S116
1
S117
1
S118
1
S119
1
S120
1
S121
1
S122
1
S123
1
S124
1
S125
1
S126
1
S127
1
S128
1
S129
1
S130
1
S131
1
S132
1
S133
1
S134
1
S135
1
S136
1
S137
1
S138
1
S139
1
S140
1
S141
1
S142
1
S143
1
S144
1
S145
1
S146
1
S147
1
S148
1
S149
1
S150
1

New variables

We often want to make new variables based on a single or a combination of existing variables. We might want to force a continuous variable into a binary, such as low/high BMI. Or combine variables, such as making a "Pet" variable if either the "Cat" or "Dog" variable is TRUE. How you "code" you variables depend on the hypothesis. Below are some examples of making new variables.

There are three steps in making a new variable:

  • Extract sample_data to a data.frame
  • Add the new variable(s)
  • Put the new sample_data back into the phyloseq object
In [20]:
# First step:
metadata <- data.frame(sample_data(phy))

Combine levels of the same factor

In [21]:
# Here we make a new Time variable where we combine the 1w and 1m samples to a level we call "Early"
# ifelse takes three arguments: A logical, what to return if the logical is TRUE, what to return if the logical is FALSE
metadata$Time_new <- ifelse(metadata$Time == "1y", "Late", "Early")

Combine levels of different factors

In [22]:
# Here we combine the Time and Delivery variable to make a new variable.
metadata$New_variable <- ifelse(metadata$Time == "1y" & metadata$Delivery == "Sectio", "1y sectio", "Not 1y sectio")

Continuous variable to binary variable

In [23]:
# As there are no continuous variables in this example, I use the total read counts instead
metadata$Reads_binary <- ifelse(sample_sums(phy) > 10000, "High", "Low")

Continuous variable to categorical variable

In [25]:
# Here we nest the ifelse() functions, so if the first logical is TRUE, then it is run through another ifelse()
metadata$Reads_cat <- ifelse(sample_sums(phy) > 10000, ifelse(sample_sums(phy) > 20000, "Very high", "High"), "Low")

Put back into phyloseq

In [26]:
sample_data(phy) <- sample_data(metadata)

Tax agglomoration

It is often necessary to group counts of ASVs according to higher taxonomic levels. E.g. if we want to know how abundant different genera are, or we want to plot the most abundant phyla. We use the tax_glom function to do this

Here we agglomorate to Phylum level:

In [27]:
phy_phylum <- tax_glom(phy, "Phylum")
phy_phylum
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 17 taxa and 150 samples ]
sample_data() Sample Data:       [ 150 samples by 7 sample variables ]
tax_table()   Taxonomy Table:    [ 17 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 17 tips and 16 internal nodes ]
refseq()      DNAStringSet:      [ 17 reference sequences ]

We see in the output that we have 17 taxa. This is because we have 17 different phyla. Let's see how the otu_table looks like (only 10 first samples):

In [28]:
otu_table(phy_phylum)[, 1:10]
A otu_table: 17 × 10 of type dbl
S1S2S3S4S5S6S7S8S9S10
2ea17744c7eeab459b7f41d4f9e22894 83 37 0 468 0 0 0 0 0 0
e5199a623272b9b25c65f0455a1cd77b 0 0 0 0 0 0 0 0 0 0
4c304a27bc0520a7c398410713645502 0 0 0 0 0 0 0 0 0 0
6ec6d03fbef9f16e3581ccdc60e7d26610687246376106141594 48313500 262512337 737516225
7e8a6b8b1cad81e2fb27e397921a3c3b 0 0 0 0 0 0 0 0 0 0
7c928c5109b32c792d73dce9122b80a9 0 0 0 0 0 0 0 0 0 0
8600bbb0e5ffe0a260abd39547d07c68 0 0 0 0 0 0 0 0 0 0
b2495dec275b068c7545b642c4322cd7 0 0 0 0 0 0 0 0 0 0
0eca810e771f78df0bf7f7f92dc873f0 0 0 0 0 0 0 0 0 0 0
98ca3e41d8d589d9d94aad956b84e054 0 0 0 0 0 0 0 0 0 270
4e8b51098d3b598eadf069dcc96e81aa1631257879 649811113607903740413393111091045142325
3d53a81dc0bd2aed0641d255cbf060a3 0 0 0 0 4 0 0 0 5 13
08b21f8171b35ef001832ba1df9b8fbc 1545 2904 6687 14 5915 365 282 0 72 2621
cde00646e8aecf8aaac49a9bb9c9672911951 3512 1110 3763 5576 349 1673 4818 3294 1652
b6b05223adf86d071fd279f79dc2533c 0 0 0 0 0 0 0 0 0 0
d20d3658de331c939f54d8acaed7c4c135010 1740 1253 922520319 4120 38462195520354 2112
8c3ff6c4c4b125d5e72f5276d57be4d0 7196 25810351 604 244 3747 148513519 9942 528

Note: The ASVs are not the same as before. You can see what the new "Phylum-ASVs" correspond to in the tax_table:

In [29]:
tax_table(phy_phylum)
A taxonomyTable: 17 × 7 of type chr
KingdomPhylumClassOrderFamilyGenusSpecies
2ea17744c7eeab459b7f41d4f9e22894BacteriaFusobacteriota NANANANANA
e5199a623272b9b25c65f0455a1cd77bBacteriaDeinococcota NANANANANA
4c304a27bc0520a7c398410713645502BacteriaCyanobacteriota NANANANANA
6ec6d03fbef9f16e3581ccdc60e7d266BacteriaActinobacteriota NANANANANA
7e8a6b8b1cad81e2fb27e397921a3c3bBacteriaMyxococcota NANANANANA
7c928c5109b32c792d73dce9122b80a9BacteriaChloroflexota NANANANANA
8600bbb0e5ffe0a260abd39547d07c68BacteriaAcidobacteriota NANANANANA
b2495dec275b068c7545b642c4322cd7BacteriaPlanctomycetota NANANANANA
0eca810e771f78df0bf7f7f92dc873f0BacteriaPatescibacteria NANANANANA
98ca3e41d8d589d9d94aad956b84e054BacteriaDesulfobacterota NANANANANA
4e8b51098d3b598eadf069dcc96e81aaBacteriaProteobacteria NANANANANA
3d53a81dc0bd2aed0641d255cbf060a3BacteriaCampylobacterota NANANANANA
08b21f8171b35ef001832ba1df9b8fbcBacteriaFirmicutes_C NANANANANA
cde00646e8aecf8aaac49a9bb9c96729BacteriaFirmicutes NANANANANA
b6b05223adf86d071fd279f79dc2533cBacteriaVerrucomicrobiotaNANANANANA
d20d3658de331c939f54d8acaed7c4c1BacteriaBacteroidota NANANANANA
8c3ff6c4c4b125d5e72f5276d57be4d0BacteriaFirmicutes_A NANANANANA

Plotting abundances

Let's put together some different functions to plot the most abundant families in the 1 week samples

Agglomorate to familes:

In [30]:
phy_fam <- tax_glom(phy_1w, "Family")

Transform to relative abundance:

In [31]:
phy_fam_rel <- transform_sample_counts(phy_fam, function(x) x/sum(x))

Filter low abundant:

In [32]:
phy_fam_rel_abund <- ps_prune(phy_fam_rel, min.abundance = 0.03)
113 features grouped as 'Others' in the output

We can transform the whole phyloseq object into a data.frame useful for plotting:

In [33]:
phy_df <- psmelt(phy_fam_rel_abund)

Plot bar chart

In [34]:
library(ggplot2)
  • geom_bar makes a barchart
  • fill tells ggplot how to color the bars
  • All the filtered taxa are grouped as NA
In [35]:
p <- ggplot(phy_df, aes(x = Sample, y = Abundance, fill = Family)) +
  theme_bw() +
  geom_bar(stat = "identity")
p

We can use facet's to split the plot depending on Delivery mode. And we angle the x labels and make them smaller.

In [36]:
p <- ggplot(phy_df, aes(x = Sample, y = Abundance, fill = Family)) +
  theme_bw() +
  geom_bar(stat = "identity") +
  facet_grid(~ Delivery, space = "free", scales = "free") +
  theme(axis.text.x = element_text(angle=90, size=6))
p