DADA2

The DADA2 software finds Amplicon Sequence Variants (ASVs) from our raw reads. It does this by seeking to remove the errors produced by the sequencing platform.

This notebook describes the entire DADA2 workflow, and includes construction of a phylogenetic tree, and putting it all together in a phyloseq object.

Workflow

Load packages

In [1]:
library(dada2)
library(ips)
library(phyloseq)
Indlæser krævet pakke: Rcpp

Indlæser krævet pakke: ape


Vedhæfter pakke: 'ips'


Det følgende objekt er maskeret fra 'package:dada2':

    rc


Reads and sample names

Path to reads:

In [2]:
# Where are the (primer-trimmed) sequences
path <- "Seqs/Trimmed"

# Check that the files are there
list.files(path)
  1. 'Filtered'
  2. 'SRX2198605_1.fastq'
  3. 'SRX2198605_2.fastq'
  4. 'SRX2198606_1.fastq'
  5. 'SRX2198606_2.fastq'

Split names in forward and reverse reads. fnFs and fnRs now contains the names of the forward and reverse reads, respectively

In [3]:
fnFs <- sort(list.files(path, pattern="_1.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_2.fastq", full.names = TRUE))

Get sample names:

In [4]:
# Get sample names
sample.names <- sapply(strsplit(basename(fnFs), "_"), "[", 1)
sample.names
  1. 'SRX2198605'
  2. 'SRX2198606'

Filter and trim

First step is to remove bad quality reads, and possibly truncate reads to remove bad quality endings

Inspect read quality (of the first sample only). First plot is forward reads, second plot is reverse reads. The cycle is equivalent to the basepair position in the read. The green line is the median (the red lines other quantiles), and the backround a heatmap (darker means more reads). We can see that the quality of the forward reads drops at around 230bp (which is expected given that our raw reads were 250 bp and we removed 19 bp of primer sequence), but for the reverse reads it's around 150bp (reverse reads are always lower quality). There also appears to be a sub-population of reads which have low quality (below 20)

In [5]:
plotQualityProfile(fnFs[1])
plotQualityProfile(fnRs[1])
Warning message:
"`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead."
Warning message:
"`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead."

Filter and trim reads

There are many ways to filter and trim reads. See ?filterAndTrim for details.

Here we use to following parameters:

  • Maximum expected error (maxEE) should be 2. I.e. reads with an overall low quality are removed.
  • Truncate forward reads to 200 bp. Reads shorter than 200 bp are removed, the rest are truncated to 200 bp
  • and truncate reverse reads to 150 bp. Reads shorter than 150 bp are removed, the rest are truncated to 150 bp
  • Trim 6 bp from the left of (5') of each read. The 5' ends of reads in 16S rRNA gene amplicon sequences are highly conserved. Removing some of it usually improves chimera detection.
  • Remove sequences that match phiX genome. In Illumina sequencing we always spike with phiX dna (to improve sequence quality), some of this phiX sequence might spillover to actual samples.
In [6]:
# Make names for where to put filtered reads
filtFs <- file.path(path, "Filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "Filtered", paste0(sample.names, "_R_filt.fastq.gz"))

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     maxEE=2,
                     truncLen=c(200,150),
                     trimLeft=6,
                     rm.phix=TRUE)
In [7]:
# How many reads were removed
out
A matrix: 2 × 2 of type dbl
reads.inreads.out
SRX2198605_1.fastq134449115160
SRX2198606_1.fastq 58532 18660

Choosing parameters

When choosing filtering and trimming parameters (especially truncation) always have the length of the sequenced region in mind; there should be enough length left to merge the forward and reverse read. For example, for these samples the 16S rRNA gene V4 region primers 515F and 806R were used. This region spans 291 bp (in E. coli), which is 252 bp after removing primers. Our reads have a combined length of 350 bp (after truncation, trimLeft should be ignored). This gives and overlap region of 98 bp. If we have sequenced the V3-V4 region using the 341F primer instead of the 515F, we would have no overlap left with these truncation parameters, which would give problems in the merging step.

Learn errors

DADA2 will try to learn the error rates from the reads, such that errors can be corrected

In [8]:
errF <- learnErrors(filtFs)
errR <- learnErrors(filtRs)
25961080 total bases in 133820 reads from 2 samples will be used for learning the error rates.
19270080 total bases in 133820 reads from 2 samples will be used for learning the error rates.

Plot Error profiles (of forward reads). Each panel is a subtitution error, for example A2C, is the error by which an A is converted to a C. The red lines are the expected error rates. The red points and line are observed. The expected and observed should follow each other somewhat.

In [9]:
plotErrors(errF, nominalQ=TRUE)
Warning message:
"Transformation introduced infinite values in continuous y-axis"
Warning message:
"Transformation introduced infinite values in continuous y-axis"

ASV inference

First step in inferring ASVs is to dereplicate, which means to remove duplicates

In [10]:
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)

# Name the derep objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names
Dereplicating sequence entries in Fastq file: Seqs/Trimmed/Filtered/SRX2198605_F_filt.fastq.gz

Encountered 14044 unique sequences from 115160 total sequences read.

Dereplicating sequence entries in Fastq file: Seqs/Trimmed/Filtered/SRX2198606_F_filt.fastq.gz

Encountered 7064 unique sequences from 18660 total sequences read.

Dereplicating sequence entries in Fastq file: Seqs/Trimmed/Filtered/SRX2198605_R_filt.fastq.gz

Encountered 9183 unique sequences from 115160 total sequences read.

Dereplicating sequence entries in Fastq file: Seqs/Trimmed/Filtered/SRX2198606_R_filt.fastq.gz

Encountered 1663 unique sequences from 18660 total sequences read.

We can then infer the ASVs from the dereplicated reads, and the error rates

In [11]:
dadaFs <- dada(derepFs, err=errF)
dadaRs <- dada(derepRs, err=errR)
Sample 1 - 115160 reads in 14044 unique sequences.
Sample 2 - 18660 reads in 7064 unique sequences.
Sample 1 - 115160 reads in 9183 unique sequences.
Sample 2 - 18660 reads in 1663 unique sequences.

Merge reads

We will then merge the forward and reverse reads. Remember that the minimum overlap should be in accordance with the filtering parameters used. See section on filtering above

In [12]:
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap=20)

Convert to a sequence table:

In [13]:
seqtab <- makeSequenceTable(mergers)

Dimensions of table. First number is samples, second number is ASVs

In [14]:
dim(seqtab)
  1. 2
  2. 239

Chimera removal

Chimeras are the accidental merging of sequences during sequencing. One part of the read could come from one bacterium, and the other part from another bacterium. It is possible to remove most of these bioinformatically. As we can assume that chimeras are relatively rare, we can remove rare ASVs whose sequence seems to be a mix of two common ASVs.

In [15]:
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", verbose=TRUE)
Identified 176 bimeras out of 239 input sequences.

Check how many ASVs are left after chimera removal

In [16]:
dim(seqtab.nochim)
  1. 2
  2. 63

Checking dada2 workflow

It is important to check how many reads that have been removed by the different steps. If a single step removes many more reads than the others, something might be wrong in the parameters of the different steps.

How many reads are left after each step:

In [17]:
getN <- function(x) sum(getUniques(x))
track <- cbind(out, 
               sapply(dadaFs, getN), 
               sapply(dadaRs, getN), 
               sapply(mergers, getN), 
               rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", 
                     "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
track
A matrix: 2 × 6 of type dbl
inputfiltereddenoisedFdenoisedRmergednonchim
SRX2198605134449115160114971114902110786105691
SRX2198606 58532 18660 18647 18638 18595 18335

Convert to percentages:

In [18]:
track[, 1:6] /track[, 1] * 100
A matrix: 2 × 6 of type dbl
inputfiltereddenoisedFdenoisedRmergednonchim
SRX219860510085.653385.5127285.4614082.4000278.61048
SRX219860610031.880031.8577931.8424131.7689531.32475

It is clearly the filtering step which removes most. We might have been too harsh in removing low quality reads. However, the two samples are also different, and possibly the second sample is just of worse quality.

Some rules of thump:

  • Filtering step removes many reads: maxEE could be too high, or truncation too high.
  • Denoising step removes many reads: Something is up with the error rates.
  • Merging step removes many reads: minOverlap too high, or truncation too high.
  • Chimera step removes many reads: trimLeft could be increased.

Taxonomy

Let's assign taxonomy to our ASVs

In [19]:
taxa <- assignTaxonomy(seqtab.nochim, "GTDB_bac120_arc122_ssu_r89.fa.gz", tryRC = TRUE)

Let's view check the taxonomy of first three ASVs. The taxonomy is to the right in the table.

In [20]:
taxa
A matrix: 63 × 7 of type chr
KingdomPhylumClassOrderFamilyGenusSpecies
GGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGCTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCABacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus NA
GGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCABacteriaFirmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus NA
GGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTAATAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGTTAAACTTGAGTGCAGGAGAGAAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAGGAACACCAGTGGCGAAGGCGGCTTTTTGGCCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCABacteriaFirmicutes Bacilli Staphylococcales Gemellaceae Gemella_A Gemella_A_asaccharolytica(RS_GCF_001553005.1)
GGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCACGTCGTCTGTGAAATTCCACAGCTTAACTGTGGGCGTGCAGGCGATACGGGCTGACTTGAGTACTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCABacteriaActinobacteriotaActinobacteria Corynebacteriales CorynebacteriaceaeCorynebacteriumCorynebacterium_pseudodiphtheriticum(RS_GCF_000477935.1)
GGGCGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTTGTAGGCGGTTTGTCGCGTCTGCTGTGAAAGGCCGGGGCTTAACTCCGTGTATTGCAGTGGGTACGGGCAGACTAGAGTGCAGTAGGGGAGACTGGAACTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAAGAACACCGATGGCGAAGGCAGGTCTCTGGGCTGTAACTGACGCTGAGAAGCGAAAGCATGGGGAGCGBacteriaActinobacteriotaActinobacteria Actinomycetales Micrococcaceae Rothia Rothia_mucilaginosa_A(RS_GCF_001809565.1)
GGTGACAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAGCGCAGGCGGTCTGTTTAGTCTAATGTGAAAGCCCACGGCTTAACCGTGGAACGGCATTGGAAACTGACAGACTTGAATGTAGAAGAGGAAAATGGAATTCCAAGTGTAGCGGTGGAATGCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGATTTTCTGGTCTAACATTGACGCTGAGGCTCGAAAGCGTGGGGAGCGBacteriaFirmicutes Bacilli Lactobacillales Carnobacteriaceae Alloiococcus Alloiococcus_otitis(RS_GCF_000315445.1)
GGGTGCGAGCGTTAATCGGAATAACTGGGCGTAAAGGGCACGCAGGCGGTGACTTAAGTGAGGTGTGAAAGCCCCGGGCTTAACCTGGGAATTGCATTTCATACTGGGTCGCTAGAGTACTTTAGGGAGGGGTAGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGAAGGCGAAGGCAGCCCCTTGGGAATGTACTGACGCTCATGTGCGAAAGCGTGGGGAGCABacteriaProteobacteria GammaproteobacteriaEnterobacterales Pasteurellaceae Haemophilus_D NA
GGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGAGCGCAGACGGTTACTTAAGCAGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCGTTCTGAACTGGGTGACTAGAGTGTGTCAGAGGGAGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCTCCTGGGATAACACTGACGTTCATGCTCGAAAGCGTGGGTAGCABacteriaProteobacteria GammaproteobacteriaBetaproteobacterialesNeisseriaceae Neisseria Neisseria_flavescens(RS_GCF_002847985.1)
GGTCCCGAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGCTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCABacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus NA
GGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGAGTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGCTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCABacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus NA
GGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGGGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATTCTGGGGCTTAACTCCGGGCGTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCGBacteriaActinobacteriotaActinobacteria Corynebacteriales CorynebacteriaceaeCorynebacteriumNA
GGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTAATAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGTTAAACTTGAGTGCAGGAGAGAAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAGGAACACCAGTGGCGAAGGCGGCTTTTTGGTCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCABacteriaFirmicutes Bacilli Staphylococcales Gemellaceae Gemella_A Gemella_A_asaccharolytica(RS_GCF_001553005.1)
GGGTGCGAGCGTTAATCGGAATAACTGGGCGTAAAGGGCACGCAGGCGGTTATTTAAGTGAGGTGTGAAAGCCCCGGGCTTAACCTGGGAATTGCATTTCAGACTGGGTAACTAGAGTACTTTAGGGAGGGGTAGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGAAGGCGAAGGCAGCCCCTTGGGAATGTACTGACGCTCATGTGCGAAAGCGTGGGGAGCABacteriaProteobacteria GammaproteobacteriaEnterobacterales Pasteurellaceae Haemophilus NA
GGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGCCTATCCAGTCTGTCTTAAAAGTTCGGGGCTCAACCCCGTGATGGGATGGAAACTAGTAGGCTAGAGTATCGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAAGAACACCAGTGGCGAAGGCGGCTTTCTGGACGAAAACTGACGCTGAGGCGCGAAAGCCAGGGGAGCGBacteriaFirmicutes_C Negativicutes Veillonellales Veillonellaceae F0422 F0422_sp1(RS_GCF_000221605.1)
GGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAGAATAAGTCTGATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGBacteriaFirmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus NA
GGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGCCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCABacteriaFirmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus NA
GGTGGCAAGCGTTGTCCAGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTAATAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGTTAAACTTGAGTGCAGGAGAGAAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAGGAACACCAGTGGCGAAGGCGGCTTTTTGGCCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCABacteriaFirmicutes Bacilli Staphylococcales Gemellaceae Gemella_A Gemella_A_asaccharolytica(RS_GCF_001553005.1)
GGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCABacteriaProteobacteria GammaproteobacteriaEnterobacterales EnterobacteriaceaeEscherichia NA
GGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTAATAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGGGGGTCATTGGAAACTGTTAAACTTGAGTGCAGGAGAGAAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAGGAACACCAGTGGCGAAGGCGGCTTTTTGGCCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCABacteriaFirmicutes Bacilli Staphylococcales Gemellaceae Gemella_A Gemella_A_asaccharolytica(RS_GCF_001553005.1)
TAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGACTACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGAAACCCCAGTAGTCCGTTCTGTCTCTBacteriaNA NA NA NA NA NA
TGTCACAAGCGTTATCCGGATTTATTGGGCGTAAAGCGCGTCTAGGTGGTTATGTAAGTCTGATGTGAAAATGCAGGGCTCAACTCTGTATTGCGTTGGAAACTGTGTAACTAGAGTACTGGAGAGGTAAGCGGAACTACAAGTGTAGAGGTGAAATTCGTAGATATTTGTAGGAATGCCGATGGGGAAGCCAGCTTACTGGACAGATACTGACGCTGAAGCGCGAAAGCGTGGGTAGCABacteriaFusobacteriota Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium (RS_GCF_000524395.1)
GGGTGCGAGCGTTAATCGGAATAACTGGGCGTAAAGGGCACGCAGGCGGCTATTTAAGTGAGGTGTGAAATCCCCGGGCTTAACCTGGGAATTGCATTTCAGACTGGGTAGCTAGAGTACTTTAGGGAGGGGTAGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGAAGGCGAAGGCAGCCCCTTGGGAATGTACTGACGCTCATGTGCGAAAGCGTGGGGAGCABacteriaProteobacteria GammaproteobacteriaEnterobacterales Pasteurellaceae Aggregatibacter(RS_GCF_000466335.1)
GGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGATTGGTCAGTCTGTCTTAAAAGTTCGGGGCTTAACCCCGTGATGGGATGGAAACTGCCAATCTAGAGTATCGGAGAGGAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAAGAACACCAGTGGCGAAGGCGACTTTCTGGACGAAAACTGACGCTGAGGCGCGAAAGCCAGGGGAGCGBacteriaFirmicutes_C Negativicutes Veillonellales Veillonellaceae Veillonella Veillonella_dispar(RS_GCF_000160015.1)
GGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGATTTGTCACGTCGTCTGTGAAATTCCACAGCTTAACTGTGGGCGTGCAGGCGATACGGGCTGACTTGAGTACTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCABacteriaActinobacteriotaActinobacteria Corynebacteriales CorynebacteriaceaeCorynebacteriumCorynebacterium_pseudodiphtheriticum(RS_GCF_000477935.1)
TGTCACAAGCGTTATCCGGATTTATTGGGCGTAAAGCGCGTCTAGGTGGTTATATAAGTCTGATGTGAAAATGCAGGGCTCAACTCTGTATTGCGTTGGAAACTGTGTAACTAGAGTACTGGAGAGGTAAGCGGAACTACAAGTGTAGAGGTGAAATTCGTAGATATTTGTAGGAATGCCGATGGGGAAGCCAGCTTACTGGACAGATACTGACGCTGAAGCGCGAAAGCGTGGGTAGCABacteriaFusobacteriota Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium (RS_GCF_000524395.1)
GGTGGCAAGCGTTGTCCGGATTTACTGGGCGTAAAGGGAGCGTAGGCGGATTTTTAAGTGGGATGTGAAATACCCGGGCTCAACCTGGGTGCTGCATTCCAAACTGGAAATCTAGAGTGCAGGAGGGGAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAAGAACACCAGTGGCGAAGGCGACTTTCTGGACTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCABacteriaFirmicutes_A Clostridia Clostridiales Clostridiaceae Clostridium Clostridium_paraputrificum(RS_GCF_001679805.1)
GGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAGCGCAGGCGGTTAGAAAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTAGGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGCTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGATCABacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus NA
GGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTATTTAAGTCAGATGTGAAAGCCCCGGGCTTAACCTGGGAACTGCATCTGATACTGGATAACTAGAGTAGGTGAGAGGGGAGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCTCCCTGGCATCATACTGACACTGAGGTGCGAAAGCGTGGGTAGCABacteriaProteobacteria GammaproteobacteriaPseudomonadales Moraxellaceae Moraxella Moraxella_catarrhalis(RS_GCF_000092265.1)
TAACCCAAGTCAATAGAAACCGGCATAAAGAGTGTTTTAGATCAATTCCCCTCAATAAAGCTAAAATTCACGTGAGTTGTAAAAAACTCCAGTTGATACAAAATAAGCTACGAAAGTGGCTTTAATGCATCTGAACACAGAATAACTAAGACCCAAACTGGGATTAGAAACCCCAGTAGTCCGTTCTGTCTCTTBacteriaNA NA NA NA NA NA
TGTCACAAGCGTTATCCGGATTTATTGGGCGTAAAGCGCGTCTAGGTGGTTATATAAGTCTGATGTGAAAATGCAGGGCTCAACTCTGTATTGCGTTGGAAACTGTGTAACTAGAGTACTGGAGAGGTAAGCGGAACTACAAGTGTAGAGGTGAAATTCGTAGATATTTGTAGGAATGCCGATGGGGAAGCCAGCTTACTGGACAAATACTGACGCTGAAGCGCGAAAGCGTGGGTAGCABacteriaFusobacteriota Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium NA
GGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTCAGCGGTGAAATCTAGAGGCTCAACCTCGAAACTGCCGTTGAAACTGGCGAACTTGAGTGTAGATGAGGTAGGCGGAATTCGTGGTGTAGCGGTGAAATGCATAGATATCACGAGGAACTCCGATTGCGAAGGCAGCTTACTAAGGTACAACTGACGCTGAAGCACGAAAGCGTGGGTATCABacteriaBacteroidota Bacteroidia Bacteroidales PorphyromonadaceaePorphyromonas NA
GGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGAGCGCAGACGGTTACTTAAGCAGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCGTTCTGAACTGGGTGACTAGAGTGTGTCAGAGGGAGATAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCTCCTGGGATAACACTGACGTTCATGCTCGAAAGCGTGGGTAGCABacteriaProteobacteria GammaproteobacteriaBetaproteobacterialesNeisseriaceae Neisseria Neisseria_flavescens(RS_GCF_002847985.1)
TAACCCAAGTCAATAGAAACCGGCATAAAGAGTGTTTTAGATCAATTCCCCTCAATAAAGCTAAAATTCACGTGAGTTGTAAAAAACTCCAGTTGATACAAAATAAGCTACGAAAGTGGCTTTAATGCATCTGAACACAGAATAACTAAGACCCAAACTGGGATTAGATACCCTTGTAGTCCGTTCTGTCTCTTBacteriaNA NA NA NA NA NA
GGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGAGCGCAGACGGTTACTTAAGCAGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCGTTCTGAACTGGGTGACTAGAGTGTGTCAGAGGGAGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCTCCTGGGATAACACTGACGTTCATGCCCGAAAGCGTGGGTAGCABacteriaProteobacteria GammaproteobacteriaBetaproteobacterialesNeisseriaceae Morococcus NA
GGGCGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTTGTAGGCGGTTTGTCGCGTCTGCTGTGAAAGGCCGGAGCTTAACTCCGTGTATTGCAGTGGGTACGGGCAGACTAGAGTGCAGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCTGTAACTGACGCTGAGAAGCGAAAGCATGGGGAGCGBacteriaActinobacteriotaActinobacteria Actinomycetales Micrococcaceae Rothia Rothia_mucilaginosa(GB_GCA_002861015.1)
GGGTGCGAGCGTTGTCCAGAATTACTGGGCGTAAAGGGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATTCTGGGGCTTAACTCCGGGCGTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCGBacteriaActinobacteriotaActinobacteria Corynebacteriales CorynebacteriaceaeCorynebacterium NA
GGGAGCGAGCGTTATCCGGATTTATTGGGTGTAAAGGGTGCGTAGACGGAAATGCAAGTTAGTTGTGAAATACCTCGGCTTAACTGAGGAACTGCAACTAAAACTATATTTCTTGAGTATCGGAGGGGTTTGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAAGAACACCGGTGGCGAAGGCGACAAACTGGACGATAACTGACGTTGAGGCACGAAAGTGTGGGGAGCABacteriaFirmicutes_A Clostridia NA NA NA NA
TGTCACGAGCGTTATCCGGATTTATTGGGCGTAAAGCGCGTCTAGGTGGTTATGTAAGTCTGATGTGAAAATGCAGGGCTCAACTCTGTATTGCGTTGGAAACTGTATAACTAGAGTACTGGAGAGGTAAGCGGAACTACAAGTGTAGAGGTGAAATTCGTAGATATTTGTAGGAATGCCGATGGGGAAGCCAGCTTACTGGACAGATACTGACGCTAAAGCGCGAAAGCGTGGGTAGCABacteriaFusobacteriota Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium NA
TAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGACTACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGATACCCCGGTAGTCCGTTCTGTCTCTBacteriaNA NA NA NA NA NA
GGGTGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTCCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCCTGCAGCTTAACTGTAGAATTGCCATTGATACTGCTAGTCTTGAGTGTATTTGAAGTAGCTGGAATAAGTAGTGTAGCGGTGAAATGCATAGATATTACTTAGAACACCAATTGCGAAGGCAGGTTACTAAGATACAACTGACGCTGATGGACGAAAGCGTGGGGAGCGBacteriaBacteroidota Bacteroidia Flavobacteriales Weeksellaceae ChryseobacteriumNA
GGGTGCGAGCGTTAATCGGAATAACTGGGCGTAAAGGGCACGCAGGCGGACTTTTAAGTGAGGTGTGAAATCCCCGGGCTTAACCTGGGAATTGCATTTCAGACTGGGAGTCTAGAGTACTTTAGGGAGGGGTAGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGAAGGCGAAGGCAGCCCCTTGGGAATGTACTGACGCTCATGTGCGAAAGCGTGGGGAGCABacteriaProteobacteria GammaproteobacteriaEnterobacterales Pasteurellaceae Mannheimia NA
TAACCCAAGTCAATAGAAACCGGCATAAAGAGTGTTTTAGATCAATTCCCCTCAATAAAGCTAAAATTCACGTGAGTTGTAAAAAACTCCAGTTGATACAAAATAAGCTACGAAAGTGGCTTTAATGCATCTGAACACAGAATAACTAAGACCCAAACTGGGATTAGATACCCCTGTAGTCCGTTCTGTCTCTTBacteriaNA NA NA NA NA NA
GGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCCATTCAAGTCAGAGGTGAAAGCCCGGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATGGCTTGAATCTTGGAGAGGCGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGACTCGCTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCABacteriaProteobacteria AlphaproteobacteriaSphingomonadales Sphingomonadaceae Blastomonas Blastomonas_donghaensis(RS_GCF_001556315.1)
AGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGGGATTAAGTGTGTTGTGAAATGTAGGCGCCCAACGTCTGACTTGCAGCGCATACTGGTTCCCTTGAGTACGCGCAACGCCGGCGGAATTCGTCGTGTAGCGGTGAAATGCTTAGATATGACGAAGAACCCCGATTGCGAAGGCAGCCGGCGGGAGCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCGBacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella Prevotella_fusca(RS_GCF_000614245.1)
GGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGTAGACGGTTACATAAGTCGGGTGTGAAAGCCCCGGGCTCAACCTGGGAATTGCATTCGAGACTGCGTAGCTAGGGTGCGGAAGAGGGAAGCGGAATTTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAACACCAGTGGCGAAAGCGGCTTCCTGGTCCAGCACCGACGTTCAGGCACGAAAGCGTGGGGAGCABacteriaProteobacteria GammaproteobacteriaNA NA NA NA
AGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCCGTTTGATAAGCGTGCTGTGAAATATAGTGGCTCAACCTCTATCGTGCAGCGCGAACTGTCGAACTTGAGTGCGTAGTAGGTAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTACCGTAACGTTACTGACGCTTAAGCACGAAGGTGCGGGTATCGBacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella (RS_GCF_000163055.2)
GGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGACCGGAAAGTTGGGGGTGAAATCCCGGGGCTCAACCTCGGAACTGCCTTCAAAACTACTGGTCTGGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGAGACTGACGCTGAGGTGCGAAAGCGTGGGGAGCABacteriaProteobacteria AlphaproteobacteriaRhodobacterales Rhodobacteraceae Paracoccus NA
TAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGACTACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGATACCCGAGTAGTCCGTTCTGTCTCTBacteriaNA NA NA NA NA NA
GGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTATGCAAGACAGAGGTGAAATCCCCGGGCTCAACCTGGGAACTGCCTTTGTGACTGCATGGCTAGAGTACGGTAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGACCTGTACTGACGCTCATGCACGAAAGCGTGGGGAGCABacteriaProteobacteria GammaproteobacteriaBetaproteobacterialesBurkholderiaceae Paucibacter NA
GGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTGATAAGTCTGAAGTTAAAGGCTGTGGCTCAACCATAGTTCGCTTTGGAAACTGTCAAACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCGBacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus NA
GGGTGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCGGTTTGTTGCGTCTGGTGTGAAAGCTTACTGCTTAACGGTAGGTTGCGCTGGATACGGGCAGGCTTGAGTGCAGTAGGGGAGACTGGAATTCTCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCTATGGCGAAGGCAGGTCTCTGGGCTGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGBacteriaActinobacteriotaActinobacteria Actinomycetales BifidobacteriaceaeScardovia Scardovia_wiggsiae(RS_GCF_000275805.1)
GGATGCGAGCGTTATCCGGAATCATTGGGTTTAAAGGGTCCGTAGGCGGGCTAATAAGTCAGAGGTGAAAGCGCTCAGCTCAACTGAGCAACTGCCTTTGAAACTGTTAGTCTTGAATGGTTGTGAAGTAGTTGGAATGTGTAGTGTAGCGGTGAAATGCTTAGATATTACACAGAACACCGATAGCGAAGGCATATTACTAACAATTAATTGACGCTGATGGACGAAAGCGTGGGGAGCGBacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Paraprevotella Paraprevotella_clara(RS_GCF_000233955.1)
AGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCCTTTTAAGCGTGTTGTGAAATTCAGTTGCTCAACATCTGACTTGCAGCGCGAACTGGGGGGCTTGAGTATGCTGAAGGTAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTACTGCAGCTTAACTGACGCTGATGCTCGAAAGCGTGGGTATCGBacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella Prevotella_conceptionensis(RS_GCF_000312305.1)
GGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGTTAGTTAAGTTGGATGTGAAAGCCCCGGGCTCAACCTGGGAATTGCATTCAAAACTGGCTAACTAGAGTATGTGAGAGGGGGGTAGAATTCCAAGTGTAGCGGTGAAATGCGTAGAGATTTGGAGGAATACCAGTGGCGAAGGCGGCCCCCTGGCACAATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCABacteriaProteobacteria GammaproteobacteriaEnterobacterales Alteromonadaceae Rheinheimera (RS_GCF_001513785.1)
GGGTGCAAGCGTTACTCGGAATCACTGGGCGTAAAGGACGCGTAGGCGGATTATCAAGTCTCTTGTGAAATCCTATGGCTTAACCATAGAACTGCTTGGGAAACTGATAATCTAGAGTGAGGGAGAGGCAGATGGAATTGGTGGTGTAGGGGTAAAATCCGTAGAGATCACCAGGAATACCCATTGCGAAGGCGATCTGCTGGAACTCAACTGACGCTAATGCGTGAAAGCGTGGGGAGCABacteriaCampylobacterotaCampylobacteria Campylobacterales CampylobacteraceaeCampylobacter_A Campylobacter_A_concisus_A(RS_GCF_000017725.2)
GGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTTTGTAAGTTAGAGGTGAAAGCCTGGAGCTCAACTCCAGAACTGCCTTTAAGACTGCATCGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCABacteriaProteobacteria AlphaproteobacteriaSphingomonadales Sphingomonadaceae Sphingomonas NA
GGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGBacteriaActinobacteriotaActinobacteria Actinomycetales BifidobacteriaceaeBifidobacterium NA
GGGCGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTTGTCGCGTCTAGCGTTTAAGGCTCGGGCTTAACCCGGGTTTGCGTTGGGTACGGGCAGGCTTGAGTGCGGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTTACTGGGCCGTTACTGACGCTGAGGAGCGAGAGCGTGGGGAGCGBacteriaActinobacteriotaActinobacteria Actinomycetales Actinomycetaceae NA NA
GGGCGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTTGTAGGCGGTTTGTCGCGTCTGCTGTGAAAGGCCGGAGCTTAACTCCGTGTATTGCAGTGGGTACGGGCAGACTAGAGTGCAGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCTGTAACTGACGCTGAGGAGCGAAAGCATGGGGAGCGBacteriaActinobacteriotaActinobacteria Actinomycetales Micrococcaceae Rothia Rothia_mucilaginosa(GB_GCA_002861015.1)
GGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACGCTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGAAACTGGCAGTCTTGAGTACAGTAGAGGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTAGACTGCAACTGACACTGATGCTCGAAAGTGTGGGTATCABacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides_xylanisolvens(RS_GCF_000577955.1)

Note that the ASVs are named by their sequence. It is easier if the ASVs have shorter names, and the sequences are therefore usually run through a hash algorithm to give them unique shorter reproducible names.

In [21]:
# Extract sequences (to save them)
seqs <- Biostrings::DNAStringSet(getSequences(seqtab.nochim))
In [22]:
# New names
asv_new <- openssl::md5(getSequences(seqtab.nochim))
In [23]:
# Put in the new names for all objects
colnames(seqtab.nochim) <- asv_new
names(seqs) <- asv_new
rownames(taxa) <- asv_new

Look at taxonomy table again:

In [24]:
taxa[1:3, ]
A matrix: 3 × 7 of type chr
KingdomPhylumClassOrderFamilyGenusSpecies
59220fa2a44ac51f2cb31b68a2e2ca90BacteriaFirmicutesBacilliLactobacillales Streptococcaceae Streptococcus NA
a9e6fcda282cc26c0f9df33b908dcc17BacteriaFirmicutesBacilliStaphylococcalesStaphylococcaceaeStaphylococcusNA
86da44252df24bfdd187eeb0b0839bbcBacteriaFirmicutesBacilliStaphylococcalesGemellaceae Gemella_A Gemella_A_asaccharolytica(RS_GCF_001553005.1)

Phylogenetic tree

Let's make a phylogenetic tree. First we have to align the sequences. Then, based on this alignment we can infer our tree.

Align sequences with mafft. mafft is a multiple alignment software. It is run outside of R, but we can use the mafft function from the ips package to run it from R.

In [25]:
seq.dna <- as.DNAbin(seqs)
align <- mafft(seq.dna, exec = "mafft-win/mafft.bat")

Look at alignment. Each row is an ASV sequence, each column is a position in the alignment. The length of the alignment is expected to be around the length of the sequenced 16S rRNA gene region.

In [26]:
image(align)

6 of the ASVs clearly have weird sequences which do not align with the others. This is a good step to identify false ASVs.

You can check for false ASVs by checking:

  • Sequence length: Are some surprisingly short or long
  • Taxonomy: Is it likely that you have an ASV which comes from an unknown Kingdom? Or unknown Phylum?
  • Phylogenetic tree outliers: Is the sequences just very different from all the other 16S rRNA gene sequences

Count ASVs with different sequence lengths:

In [27]:
table(nchar(as.character(seqs)))
194 240 241 242 
  6  14  38   5 

6(!) are much shorter (194 bp) than the rest. What are their names:

In [28]:
names(seqs)[nchar(as.character(seqs)) < 200]
  1. '058413938b264e2113f57e1e9dad7d05'
  2. 'ef698664814b0cd6fa8e071b070b5df7'
  3. 'ca9373151510889c3f7590f93e6b957f'
  4. '6de1ee5e57a6f85761a46cbbe7e77894'
  5. '5cd2256e4a92f40b8bea27530254a372'
  6. 'acf6b55ddce92c35ca86c34d8bb525e4'

Are there some ASVs with unknown Phylum?:

In [29]:
taxa[is.na(taxa[, "Phylum"]), ]
A matrix: 6 × 7 of type chr
KingdomPhylumClassOrderFamilyGenusSpecies
058413938b264e2113f57e1e9dad7d05BacteriaNANANANANANA
ef698664814b0cd6fa8e071b070b5df7BacteriaNANANANANANA
ca9373151510889c3f7590f93e6b957fBacteriaNANANANANANA
6de1ee5e57a6f85761a46cbbe7e77894BacteriaNANANANANANA
5cd2256e4a92f40b8bea27530254a372BacteriaNANANANANANA
acf6b55ddce92c35ca86c34d8bb525e4BacteriaNANANANANANA

Exactly the same 6 ASVs which were short also have unknown Phylum annotation. These are probably false.

Let's remove the 6 bad ASVs and redo the alignment:

In [30]:
# Get names of bad ASVs
bad_asvs <- names(seqs)[nchar(as.character(seqs)) < 200]

# Remove from objects
seqtab.nochim <- seqtab.nochim[, !colnames(seqtab.nochim) %in% bad_asvs]
seqs <- seqs[!names(seqs) %in% bad_asvs]
taxa <- taxa[!rownames(taxa) %in% bad_asvs, ]
In [31]:
# Redo alignment
seq.dna <- as.DNAbin(seqs)
align <- mafft(seq.dna, exec = "mafft-win/mafft.bat")
In [32]:
# Plot alignment
image(align)

Much better!

Let's create a phylogenetic tree with FastTree2

In [33]:
# We save the alignment as a fasta file
write.fas(align, file = "alignment.fasta")
In [34]:
# FastTree is also an external software that we run from within R
system2("FastTree.exe", args = c("-nt", "-out tree.nwk", "alignment.fasta"))
In [35]:
# Read the tree that FastTree made and plot it
# cex controls is size of labels in the plot
tree <- read.tree("tree.nwk")
plot(tree, cex = 0.5)

Root tree

We should always root the tree. For this we need an outgroup. If we had Archaea in our samples, it would be a good choice. Here we use the Campylobacter phylum.

In [36]:
campy_asvs <- rownames(taxa)[taxa[, "Phylum"] == "Campylobacterota"]
In [37]:
# Root
tree <- root(tree, outgroup=campy_asvs, resolve.root=TRUE)
In [38]:
# Color by phylum
cols <- c("red","blue","green","purple","gold","black","cyan","grey")[factor(taxa[tree$tip.label, 2])]
In [39]:
# Plot
plot(tree, cex = 0.5, tip.color = cols)

Make phyloseq object

Let's put all together in a phyloseq object

First, we need one more thing, the sample metadata:

In [40]:
# Read sample data
# Here we have information for the different samples
samp <- read.table("sampleData.csv", sep = ";", header = TRUE, row.names = 1)
print(samp)
            Type Patient
SRX2198605 Trach       A
SRX2198606 Trach       B

Fix taxonomy table

If the taxonomic classifier cannot assign a certain taxonomic level it will write NA in the taxonomy table. NAs are a problem in the later analysis, expecially for tax agglomoration, and we therefore want to replace NAs with the most specific known taxonomic level.

In [41]:
taxa <- t(apply(taxa, 1, function(x) 
    if(sum(is.na(x))>0){
        c(x[!is.na(x)], 
          paste(colnames(taxa)[max(which(!is.na(x)))], rep(x[max(which(!is.na(x)))], sum(is.na(x))), sep="_"))} 
    else{x}))

Now we have a final taxonomic table:

In [42]:
taxa
A matrix: 57 × 7 of type chr
59220fa2a44ac51f2cb31b68a2e2ca90BacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Genus_Streptococcus
a9e6fcda282cc26c0f9df33b908dcc17BacteriaFirmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus Genus_Staphylococcus
86da44252df24bfdd187eeb0b0839bbcBacteriaFirmicutes Bacilli Staphylococcales Gemellaceae Gemella_A Gemella_A_asaccharolytica(RS_GCF_001553005.1)
116baf6accde73ae581dd5712be3ea54BacteriaActinobacteriotaActinobacteria Corynebacteriales Corynebacteriaceae Corynebacterium Corynebacterium_pseudodiphtheriticum(RS_GCF_000477935.1)
c8ddf38e1bd6da7b0195674843f3e5caBacteriaActinobacteriotaActinobacteria Actinomycetales Micrococcaceae Rothia Rothia_mucilaginosa_A(RS_GCF_001809565.1)
8e07fe4b6e4e60417bc851c2bfb26aceBacteriaFirmicutes Bacilli Lactobacillales Carnobacteriaceae Alloiococcus Alloiococcus_otitis(RS_GCF_000315445.1)
863d2f1f3be1881e4d81e0c8f25cdb3cBacteriaProteobacteria GammaproteobacteriaEnterobacterales Pasteurellaceae Haemophilus_D Genus_Haemophilus_D
5e414eb0c58c2713959cdd8389e2365bBacteriaProteobacteria GammaproteobacteriaBetaproteobacteriales Neisseriaceae Neisseria Neisseria_flavescens(RS_GCF_002847985.1)
76153252d53e23f38245371c203c8fb0BacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Genus_Streptococcus
65ad7f40898432867ad5c89c4d00eacdBacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Genus_Streptococcus
f6d7af2d233c03589964e073f7ef4db9BacteriaActinobacteriotaActinobacteria Corynebacteriales Corynebacteriaceae Corynebacterium Genus_Corynebacterium
ed70899ac215395026a1c3682cb32da6BacteriaFirmicutes Bacilli Staphylococcales Gemellaceae Gemella_A Gemella_A_asaccharolytica(RS_GCF_001553005.1)
6a41d9bd1a74554f390f368d656ffdfcBacteriaProteobacteria GammaproteobacteriaEnterobacterales Pasteurellaceae Haemophilus Genus_Haemophilus
c4db60525fdf82e14915da21b6e5708eBacteriaFirmicutes_C Negativicutes Veillonellales Veillonellaceae F0422 F0422_sp1(RS_GCF_000221605.1)
635e927075bb00379a690569bdcf833bBacteriaFirmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus Genus_Lactobacillus
f2cc0a573c2f0c83dd38af6de05a7b24BacteriaFirmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus Genus_Staphylococcus
b06b16c1b752f4b54cf5415de45733e9BacteriaFirmicutes Bacilli Staphylococcales Gemellaceae Gemella_A Gemella_A_asaccharolytica(RS_GCF_001553005.1)
7779089c2297b45efc17daaef1562eaeBacteriaProteobacteria GammaproteobacteriaEnterobacterales Enterobacteriaceae Escherichia Genus_Escherichia
b81674dd710957a9892b7d88c73cc3c4BacteriaFirmicutes Bacilli Staphylococcales Gemellaceae Gemella_A Gemella_A_asaccharolytica(RS_GCF_001553005.1)
97fb36c45da15e52e8eefd3c67e6335cBacteriaFusobacteriota Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium (RS_GCF_000524395.1)
af6d1b0fc47087d746f5170b4dfcaa46BacteriaProteobacteria GammaproteobacteriaEnterobacterales Pasteurellaceae Aggregatibacter (RS_GCF_000466335.1)
60831b2e8e37bdec3874f46b9949d27bBacteriaFirmicutes_C Negativicutes Veillonellales Veillonellaceae Veillonella Veillonella_dispar(RS_GCF_000160015.1)
b8a65a71566f4da4100f0c4a9b47ee24BacteriaActinobacteriotaActinobacteria Corynebacteriales Corynebacteriaceae Corynebacterium Corynebacterium_pseudodiphtheriticum(RS_GCF_000477935.1)
b90f71389fce291287f46f3ca2ef0747BacteriaFusobacteriota Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium (RS_GCF_000524395.1)
a8d2b371dd0b05611fe3e3e827b4d1d6BacteriaFirmicutes_A Clostridia Clostridiales Clostridiaceae Clostridium Clostridium_paraputrificum(RS_GCF_001679805.1)
f028255e7ec2a25c8ab91b8e1e6d5febBacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Genus_Streptococcus
068459b847683936637173828c8a8335BacteriaProteobacteria GammaproteobacteriaPseudomonadales Moraxellaceae Moraxella Moraxella_catarrhalis(RS_GCF_000092265.1)
9a570609b25b41c6b45f1e162321dec9BacteriaFusobacteriota Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium Genus_Fusobacterium
1c9f84095d51d1304c779f26ad67a076BacteriaFirmicutes Bacilli Lactobacillales Aerococcaceae Granulicatella Granulicatella_elegans(RS_GCF_000162475.2)
8826e4107a5e521678483364e9c897b6BacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Genus_Streptococcus
be4a4d8de8e1293a420544443962f6cfBacteriaActinobacteriotaActinobacteria Actinomycetales Micrococcaceae Micrococcus Micrococcus_luteus(RS_GCF_000023205.1)
883e949248f928b91fa17e7bfbf88e5eBacteriaBacteroidota Bacteroidia Bacteroidales Porphyromonadaceae Porphyromonas Genus_Porphyromonas
8b229e09e83ac74ff9979d18f1fed6f0BacteriaProteobacteria GammaproteobacteriaBetaproteobacteriales Neisseriaceae Neisseria Neisseria_flavescens(RS_GCF_002847985.1)
9c27d2d75560598c816a2c8f1373cb66BacteriaProteobacteria GammaproteobacteriaBetaproteobacteriales Neisseriaceae Morococcus Genus_Morococcus
a3ba533e5c624d750c52b6434098f3d9BacteriaActinobacteriotaActinobacteria Actinomycetales Micrococcaceae Rothia Rothia_mucilaginosa(GB_GCA_002861015.1)
41a67de3226a694b8afed03b76c8ddc5BacteriaActinobacteriotaActinobacteria Corynebacteriales Corynebacteriaceae Corynebacterium Genus_Corynebacterium
5b4de3d47bac1b400e865c5ef0c91ce8BacteriaFirmicutes_A Clostridia Class_Clostridia Class_Clostridia Class_Clostridia Class_Clostridia
deb7b05edd27bb0738a74a466da225b1BacteriaFusobacteriota Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium Genus_Fusobacterium
a8f1e84066c6cd3198b2da2f99c18d46BacteriaBacteroidota Bacteroidia Flavobacteriales Weeksellaceae Chryseobacterium Genus_Chryseobacterium
f35ad08d437446b5ec4d54b8fce1963fBacteriaProteobacteria GammaproteobacteriaEnterobacterales Pasteurellaceae Mannheimia Genus_Mannheimia
9628394c60a37190e13a58e5bc5d619dBacteriaProteobacteria AlphaproteobacteriaSphingomonadales Sphingomonadaceae Blastomonas Blastomonas_donghaensis(RS_GCF_001556315.1)
c10698841b8254071a980bb1dd2e7f1cBacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella Prevotella_fusca(RS_GCF_000614245.1)
07dc6d4120e600331c45d5c303db9cf2BacteriaProteobacteria GammaproteobacteriaClass_GammaproteobacteriaClass_GammaproteobacteriaClass_GammaproteobacteriaClass_Gammaproteobacteria
7f053e967026f8d4d193ef56570df5d7BacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella (RS_GCF_000163055.2)
e3d14ad7a536b525b59ce834747724b4BacteriaProteobacteria AlphaproteobacteriaRhodobacterales Rhodobacteraceae Paracoccus Genus_Paracoccus
d558cc48cfdadce2de07e6528e8f9decBacteriaProteobacteria GammaproteobacteriaBetaproteobacteriales Burkholderiaceae Paucibacter Genus_Paucibacter
62c0535f3185014432800dcc769deea9BacteriaFirmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Genus_Streptococcus
04bf641b9edbf6aff94573e26c98a671BacteriaActinobacteriotaActinobacteria Actinomycetales Bifidobacteriaceae Scardovia Scardovia_wiggsiae(RS_GCF_000275805.1)
52aed0212e3860b37fafd2326330cbfcBacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Paraprevotella Paraprevotella_clara(RS_GCF_000233955.1)
793647161405b0c55a674e2a4b8b283bBacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella Prevotella_conceptionensis(RS_GCF_000312305.1)
237ab206b54232ee0939b84cbe43640aBacteriaProteobacteria GammaproteobacteriaEnterobacterales Alteromonadaceae Rheinheimera (RS_GCF_001513785.1)
53cac6eb1e12feb79af42bd416242835BacteriaCampylobacterotaCampylobacteria Campylobacterales Campylobacteraceae Campylobacter_A Campylobacter_A_concisus_A(RS_GCF_000017725.2)
8b3a6e686233573da5eed0db44f2fdddBacteriaProteobacteria AlphaproteobacteriaSphingomonadales Sphingomonadaceae Sphingomonas Genus_Sphingomonas
1dba26afd5b4b5c8c41147bb753dd538BacteriaActinobacteriotaActinobacteria Actinomycetales Bifidobacteriaceae Bifidobacterium Genus_Bifidobacterium
0327fc569806f72ddff578e5a7ab08b4BacteriaActinobacteriotaActinobacteria Actinomycetales Actinomycetaceae Family_Actinomycetaceae Family_Actinomycetaceae
487a1c8878a3890f993689934fbfdf8eBacteriaActinobacteriotaActinobacteria Actinomycetales Micrococcaceae Rothia Rothia_mucilaginosa(GB_GCA_002861015.1)
73e9ed99e67d102c0e483db9998c7958BacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides_xylanisolvens(RS_GCF_000577955.1)

Put it all together

In [43]:
# Collect everything in a phyloseq object
phy <- phyloseq(otu_table(t(seqtab.nochim), taxa_are_rows = TRUE),
                sample_data(samp),
                tax_table(taxa),
                phy_tree(tree),
                seqs)

# Check what the phyloseq object contains
phy
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 57 taxa and 2 samples ]
sample_data() Sample Data:       [ 2 samples by 2 sample variables ]
tax_table()   Taxonomy Table:    [ 57 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 57 tips and 56 internal nodes ]
refseq()      DNAStringSet:      [ 57 reference sequences ]

Save it for use in other scripts

In [44]:
# Save phyloseq object for later analysis
save(phy, file = "physeq_test.RData")

# Save everything
#save.image("dada2.RData")