The phyloseq package

Phyloseq is a package made for organizing and working with microbiome data in R. With the phyloseq package we can have all our microbiome amplicon sequence data in a single R object. With functions from the phyloseq package, most common operations for preparing data for analysis is possible with few simple commands.

This document is an overview on how phyloseq objects are organized and how they can be accessed.

The paper presenting phyloseq: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0061217

A comprehensive documetation of the phyloseq package: https://joey711.github.io/phyloseq/

To work with phyloseq objects we first have to load the package

In [1]:
library(phyloseq)

The phyloseq object

Let's load our test dataset, and see how phyloseq is organized.

In [2]:
load("../data/physeq.RData")

If we print the name of the phyloseq object, we can see what it contains

In [3]:
phy
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 150 samples ]
sample_data() Sample Data:       [ 150 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]

The phy object contains all our data and associated metadata. This is organized in 5 different sub-objects:

  • otu_table: Contains a matrix with the abundance of each taxa (ASV) for each sample
  • sample_data: Contains the metadata for each sample
  • tax_table: Contains the taxonomical annotation for each taxon (ASV)
  • phy_tree: Contains a phylogenetic tree
  • refseq: Contains sequences (16S rRNA gene sequence) for each taxon (ASV)

Note: "phy" is an arbitrary name, it could be anything else

Below is a section on each of the objects describing what they contain and how to access them.

otu_table

The otu_table contains the abundance of each OTU/ASV for each sample. We can see from above that it contains data for 1310 taxa and 150 samples. We can access it with the otu_table() function

In [4]:
otu_table(phy)
A otu_table: 1310 × 150 of type dbl
S1S2S3S4S5S6S7S8S9S10⋯S141S142S143S144S145S146S147S148S149S150
dc467f0f8b8aa389aa106d751bb9a569 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 019 00 0
c387bc64fb22cd96d2b79dbfa932ce1e 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
42a23e6f4764f572f4d7c6d8e08769c3 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
2ea17744c7eeab459b7f41d4f9e22894 83 0 0 4680 0 0 0 0 0⋯ 0 0 0 012 0 0 00 0
332ef16f5660bfe8ecaabda3404fc08b 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
0a39c4f216e370df5d1c10b2ac1f5286 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
2928490ca329c46e1649ff8d8c072efc 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 052 0 00 0
a805200f08abbfa1a4679a264e851398 0 37 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
095f00e04148b0db5e970c32232272f1 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
bb243e746b385279c86c7c8b365601d3 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
8557aacf799775fc8664c28e239764fc 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
48520a6dc94cba12ffb6452e1335b5c5 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
e5199a623272b9b25c65f0455a1cd77b 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
f3e368bdade206389143a53691dd3403 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
4ecaf2314cf477a3c8f559db0812131f 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
d00281a614fbe1834f8ae99f0ca8a4b8 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
4c304a27bc0520a7c398410713645502 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
10304568dd5843ac94284f138298496d 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
891a50a9672c9fe4fb071d53349a1087 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
cf2732becb3d1d66dfcaf9e85416a3bd 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
806c67148f3ec58a0f5a4c0aff5b18db 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
93d3df70f762ddfbb08ec06ae5ea894c 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 70 0
29f6c8a92557d221576220686357171d 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
d50b1472ab5d1b933604d13377ea8141 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
5805c09d874afccb8923b15c48bce80b 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
6ec6d03fbef9f16e3581ccdc60e7d26610564245804902041283013425231912337658816213⋯1956 0149811625 0 0 01870913
8acb42cec03ba45f8b7e7dbc90595383 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
327333264d2d3fd218350b8201a128fa 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
8534c358254b1ce9593b641dfc2cb314 0 0 0 00 0 0 0 0 0⋯ 064 0 0 0 0 0 00 0
57ac1b63c5ca76f2e29fb844c9f804e3 0 0 0 00 0 0 0 0 0⋯ 0 0 0 0 0 0 0 00 0
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋱⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
108aeef6545d4a8f2694af63c9b704bb0 00000 000 0⋯ 0 0000 00 0 0 0
35b60479cc1958c19f27e3e6266e3fcd0 00000 000 0⋯ 0 0000 00 0 0 0
05924d3a2fbe70d70ed0e7ab11d21c700 00000 000 0⋯ 0 0000 00 0 0 0
592ef2acca5a8061c8012f3bd4791c060 00000 000 0⋯ 0 0000 00 0 0 0
2f3bdd821f386dc4926db3596aeeeaca0 00000 000 0⋯ 0 0000 00 0 0 0
51653ce8e5aef59c159c8cc40e3a7a470 00000 000 0⋯ 0 0000 00 0 0 0
47376d68d57c18330cf01210f9dae3990 00000 000 0⋯ 0 0000 00 0 0 0
529eb774e3e9fe2a75c4e5e3fa08c0210 00000 000 0⋯ 0 0000 00 0 0 0
b79d081871ba6301d1b26d1d900e20a20 00000 000 0⋯ 5 12000 00 0 0 0
458d301b7e2fbbed866d30fbd47353ed0 00000 000 0⋯ 0 0000 00 0 0 0
ccd2d2e1ebdbda1ee4bf03fb8f4013070 00000 000 0⋯ 0 0000140 0 0 0
ae3db49f681a714480b1b6042e1203460 00000 000 0⋯ 0 0000 00 0 0 0
20bddea1062c34fe82e1cd18adfbf32b0 00000 000 0⋯ 0 0000 00 0 0 0
aa4045a865a1a92f2ad7c9911bd5bee70 00000 000 0⋯ 0 0000 00 0 0 0
c9cf0dccbc3516d0cd12bb9899048e120 00000 000 0⋯ 0 0000 00 0 0 0
3f05c76103eade8bd104421817eadad70 00000 000 0⋯ 0 0000 00 0 0 0
e2f7aa19c11e3adf96a7030321b7e2e40 00000 000 0⋯ 0368000 00 0152 0
a581dd58cef1f78d892b8bc14a1f5c4e0 00000 000 0⋯ 0 0000 0027 0 0
b16519ba8dcea827c9839d43ef1c15800 00000 000 0⋯ 0 0000 00 0 0 0
ce74dce376a5dedddcc0ff2342dc82e70 00000 000 0⋯ 0 0000 00 0 0 0
bd930ec718da906ce0ae573b6ed850220 00000 000 0⋯ 0 0000 00 0 0 0
36e52ff1c5961d932d931d0f148ed5910 00000 000 0⋯ 0 0000 00 0 0 0
ff011bda3c950839aa99faf6cedc0ec50 00000 000 0⋯ 0 0000 00 0 085
34e55c0f0706224ff8a1aca1ede0a5050 00000 000 0⋯ 0 0000 00 0 0 0
30ef39708ebcdcddbcdc2f2439c3c3980550000 000 0⋯ 0 0000 00 0 063
f43c36b22c1bc8798a8d7f25125919f60 00000770011⋯ 0 47000 00 0 0 0
7342bde0691e85a0a6fac0728be2aa560 00000 000 0⋯ 0 0000 00 0 0 0
b2c5f0d55453383ac1deda19b8c3bed20 00000 000 0⋯ 0 0000 00 0 0 0
ab8cf03d7d5eb6d8aedd296ab42254140 00000 000 0⋯33 0000 0012 0 0
e5e51d9940a3c45538fd7d9542c958eb0 00000 000 0⋯ 0 0000 00 0 85 0

Here we can see that ASV a805200f08abbfa1a4679a264e851398 was not detected in sample S1, but that 37 reads from sample S2 was assigned to that ASV, and so on.

We can subset specific taxa with the object[subset] notation

In [5]:
otu_table(phy)["6ec6d03fbef9f16e3581ccdc60e7d266"]
A otu_table: 1 × 150 of type dbl
S1S2S3S4S5S6S7S8S9S10⋯S141S142S143S144S145S146S147S148S149S150
6ec6d03fbef9f16e3581ccdc60e7d26610564245804902041283013425231912337658816213⋯195601498116250001870913
In [6]:
otu_table(phy)[c("6ec6d03fbef9f16e3581ccdc60e7d266", "a805200f08abbfa1a4679a264e851398")]
A otu_table: 2 × 150 of type dbl
S1S2S3S4S5S6S7S8S9S10⋯S141S142S143S144S145S146S147S148S149S150
6ec6d03fbef9f16e3581ccdc60e7d26610564245804902041283013425231912337658816213⋯195601498116250001870913
a805200f08abbfa1a4679a264e851398 0 37 0 00 0 0 0 0 0⋯ 00 0 0000 00 0

Similarly for samples by preceeding a , inside the [ ]. (For the sake of this tutoial, we use head() to only print the first 6 rows)

In [7]:
head(otu_table(phy)[, "S6"])
A otu_table: 6 × 1 of type dbl
S6
dc467f0f8b8aa389aa106d751bb9a5690
c387bc64fb22cd96d2b79dbfa932ce1e0
42a23e6f4764f572f4d7c6d8e08769c30
2ea17744c7eeab459b7f41d4f9e228940
332ef16f5660bfe8ecaabda3404fc08b0
0a39c4f216e370df5d1c10b2ac1f52860

These operations can be combined:

In [8]:
otu_table(phy)["6ec6d03fbef9f16e3581ccdc60e7d266", c("S6", "S144")]
A otu_table: 1 × 2 of type dbl
S6S144
6ec6d03fbef9f16e3581ccdc60e7d2661342511625

sample_data

The sample_data object contains metadata for our samples. We can access it with the sample_data() function. (For the sake of this tutoial, we use head() to only print the first 6 rows).

Note than in contrast to the otu_table, the samples are now rows.

In [9]:
head(sample_data(phy))
A sample_data: 6 × 3
PatientTimeDelivery
<chr><chr><chr>
S18701mVaginal
S2b331mVaginal
S39b11mSectio
S49ba1mVaginal
S57281mVaginal
S69601mSectio

We can subset it in the same way as we did with the otu_table.

In [10]:
sample_data(phy)["S11",]
A sample_data: 1 × 3
PatientTimeDelivery
<chr><chr><chr>
S11a701mVaginal
In [11]:
sample_data(phy)[c("S2", "S150"), c("Patient", "Time")]
A sample_data: 2 × 2
PatientTime
<chr><chr>
S2b331m
S150c611y

tax_table

The tax_table contains the taxonomical annotations of our taxa/ASVs. It can optionally also contain other metadata on our taxa/ASVs. We can access it with the tax_table() function.

Subsetting is done as with the other objects

In [12]:
tax_table(phy)[c("6ec6d03fbef9f16e3581ccdc60e7d266")]
A taxonomyTable: 1 × 7 of type chr
KingdomPhylumClassOrderFamilyGenusSpecies
6ec6d03fbef9f16e3581ccdc60e7d266BacteriaActinobacteriotaActinobacteriaActinomycetalesBifidobacteriaceaeBifidobacteriumGenus_Bifidobacterium

phy_tree

The phy_tree contains our phylogenetic tree, constructed from an aligment of the 16S rRNA gene sequences of our ASVs. We can access it with the phy_tree() function.

In [13]:
phy_tree(phy)
Phylogenetic tree with 1310 tips and 1309 internal nodes.

Tip labels:
  dc467f0f8b8aa389aa106d751bb9a569, c387bc64fb22cd96d2b79dbfa932ce1e, 42a23e6f4764f572f4d7c6d8e08769c3, 2ea17744c7eeab459b7f41d4f9e22894, 332ef16f5660bfe8ecaabda3404fc08b, 0a39c4f216e370df5d1c10b2ac1f5286, ...
Node labels:
  Root, 0.766, 0.917, 0.895, 0.747, 0.870, ...

Rooted; includes branch lengths.

This prints some basic info about our tree, which we can access with the $ notation

In [14]:
# The 10 first labels:
phy_tree(phy)$tip.label[1:10]
  1. 'dc467f0f8b8aa389aa106d751bb9a569'
  2. 'c387bc64fb22cd96d2b79dbfa932ce1e'
  3. '42a23e6f4764f572f4d7c6d8e08769c3'
  4. '2ea17744c7eeab459b7f41d4f9e22894'
  5. '332ef16f5660bfe8ecaabda3404fc08b'
  6. '0a39c4f216e370df5d1c10b2ac1f5286'
  7. '2928490ca329c46e1649ff8d8c072efc'
  8. 'a805200f08abbfa1a4679a264e851398'
  9. '095f00e04148b0db5e970c32232272f1'
  10. 'bb243e746b385279c86c7c8b365601d3'

and we can plot it (cex sets the size of the labels):

In [15]:
plot(phy_tree(phy), cex = 0.5)

refseq

refseq contains the actual DNA sequences of our ASVs (or alternatively the reference sequences of OTUs). We can access it with the refseq() function.

In [16]:
refseq(phy)
DNAStringSet object of length 1310:
       width seq                                            names               
   [1]   237 TGGCGAGCGTTGTTCGGATTTA...CAGGGACGAAAGCATGGGGAG dc467f0f8b8aa389a...
   [2]   236 GTGCAAGCGTTAATCGGAATTA...GAAGCGCGAAAGCGTGGGTAG c387bc64fb22cd96d...
   [3]   236 TCGCAAGCGTTATCCGGATTTA...AAAGCGCGAAAGCGTGGGTAG 42a23e6f4764f572f...
   [4]   236 TCACAAGCGTTATCCGGATTTA...AAAGCGCGAAAGCGTGGGTAG 2ea17744c7eeab459...
   [5]   236 TCACGAGCGTTATCCGGATTTA...GAAGCGCGAAAGCGTGGGTAG 332ef16f5660bfe8e...
   ...   ... ...
[1306]   237 GGGCAAGCGTTATCCGGATTTA...GAGGCTCGAAAGCGTGGGGAG f43c36b22c1bc8798...
[1307]   237 GGGCCAGCGTTATCCGGATTTA...GAGGCTCGAAAGCGTGGGGAG 7342bde0691e85a0a...
[1308]   237 GGGCAAGCGTTATCCGGATTTA...GAGGCTCGAAAGCGTGGGGAG b2c5f0d55453383ac...
[1309]   237 GGGCAAGCGTTATCCGGATTTA...GAGGCTCGAAAGCGTGGGGAG ab8cf03d7d5eb6d8a...
[1310]   237 GGGCAAGCGTTATCCGGATTTA...GAGGCTCGAAAGCGTGGGGAG e5e51d9940a3c4553...

Again, we can subset with the [ ] notation

In [17]:
refseq(phy)["6ec6d03fbef9f16e3581ccdc60e7d266"]
DNAStringSet object of length 1:
    width seq                                               names               
[1]   237 GTGCAAGCGTTATCCGGAATTAT...CTGAGGAGCGAAAGCGTGGGGAG 6ec6d03fbef9f16e3...

To see the entire sequence, convert it to a string ("character" in R jargon)

In [18]:
as.character(refseq(phy)[c("6ec6d03fbef9f16e3581ccdc60e7d266")])
6ec6d03fbef9f16e3581ccdc60e7d266: 'GTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAG'