The DADA2 software finds Amplicon Sequence Variants (ASVs) from our raw reads. It does this by seeking to remove the errors produced by the sequencing platform.
This notebook describes the entire DADA2 workflow, and includes construction of a phylogenetic tree, and putting it all together in a phyloseq object.
library(dada2)
library(ips)
library(phyloseq)
Path to reads:
# Where are the (primer-trimmed) sequences
path <- "Seqs/Trimmed"
# Check that the files are there
list.files(path)
Split names in forward and reverse reads. fnFs and fnRs now contains the names of the forward and reverse reads, respectively
fnFs <- sort(list.files(path, pattern="_1.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_2.fastq", full.names = TRUE))
Get sample names:
# Get sample names
sample.names <- sapply(strsplit(basename(fnFs), "_"), "[", 1)
sample.names
First step is to remove bad quality reads, and possibly truncate reads to remove bad quality endings
Inspect read quality (of the first sample only). First plot is forward reads, second plot is reverse reads. The cycle is equivalent to the basepair position in the read. The green line is the median (the red lines other quantiles), and the backround a heatmap (darker means more reads). We can see that the quality of the forward reads drops at around 230bp (which is expected given that our raw reads were 250 bp and we removed 19 bp of primer sequence), but for the reverse reads it's around 150bp (reverse reads are always lower quality). There also appears to be a sub-population of reads which have low quality (below 20)
plotQualityProfile(fnFs[1])
plotQualityProfile(fnRs[1])
Filter and trim reads
There are many ways to filter and trim reads. See ?filterAndTrim for details.
Here we use to following parameters:
# Make names for where to put filtered reads
filtFs <- file.path(path, "Filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "Filtered", paste0(sample.names, "_R_filt.fastq.gz"))
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
maxEE=2,
truncLen=c(200,150),
trimLeft=6,
rm.phix=TRUE)
# How many reads were removed
out
When choosing filtering and trimming parameters (especially truncation) always have the length of the sequenced region in mind; there should be enough length left to merge the forward and reverse read. For example, for these samples the 16S rRNA gene V4 region primers 515F and 806R were used. This region spans 291 bp (in E. coli), which is 252 bp after removing primers. Our reads have a combined length of 350 bp (after truncation, trimLeft should be ignored). This gives and overlap region of 98 bp. If we have sequenced the V3-V4 region using the 341F primer instead of the 515F, we would have no overlap left with these truncation parameters, which would give problems in the merging step.
DADA2 will try to learn the error rates from the reads, such that errors can be corrected
errF <- learnErrors(filtFs)
errR <- learnErrors(filtRs)
Plot Error profiles (of forward reads). Each panel is a subtitution error, for example A2C, is the error by which an A is converted to a C. The red lines are the expected error rates. The red points and line are observed. The expected and observed should follow each other somewhat.
plotErrors(errF, nominalQ=TRUE)
First step in inferring ASVs is to dereplicate, which means to remove duplicates
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names
We can then infer the ASVs from the dereplicated reads, and the error rates
dadaFs <- dada(derepFs, err=errF)
dadaRs <- dada(derepRs, err=errR)
We will then merge the forward and reverse reads. Remember that the minimum overlap should be in accordance with the filtering parameters used. See section on filtering above
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap=20)
Convert to a sequence table:
seqtab <- makeSequenceTable(mergers)
Dimensions of table. First number is samples, second number is ASVs
dim(seqtab)
Chimeras are the accidental merging of sequences during sequencing. One part of the read could come from one bacterium, and the other part from another bacterium. It is possible to remove most of these bioinformatically. As we can assume that chimeras are relatively rare, we can remove rare ASVs whose sequence seems to be a mix of two common ASVs.
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", verbose=TRUE)
Check how many ASVs are left after chimera removal
dim(seqtab.nochim)
It is important to check how many reads that have been removed by the different steps. If a single step removes many more reads than the others, something might be wrong in the parameters of the different steps.
How many reads are left after each step:
getN <- function(x) sum(getUniques(x))
track <- cbind(out,
sapply(dadaFs, getN),
sapply(dadaRs, getN),
sapply(mergers, getN),
rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF",
"denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
track
Convert to percentages:
track[, 1:6] /track[, 1] * 100
It is clearly the filtering step which removes most. We might have been too harsh in removing low quality reads. However, the two samples are also different, and possibly the second sample is just of worse quality.
Some rules of thump:
Let's assign taxonomy to our ASVs
taxa <- assignTaxonomy(seqtab.nochim, "GTDB_bac120_arc122_ssu_r89.fa.gz", tryRC = TRUE)
Let's view check the taxonomy of first three ASVs. The taxonomy is to the right in the table.
taxa
Note that the ASVs are named by their sequence. It is easier if the ASVs have shorter names, and the sequences are therefore usually run through a hash algorithm to give them unique shorter reproducible names.
# Extract sequences (to save them)
seqs <- Biostrings::DNAStringSet(getSequences(seqtab.nochim))
# New names
asv_new <- openssl::md5(getSequences(seqtab.nochim))
# Put in the new names for all objects
colnames(seqtab.nochim) <- asv_new
names(seqs) <- asv_new
rownames(taxa) <- asv_new
Look at taxonomy table again:
taxa[1:3, ]
Let's make a phylogenetic tree. First we have to align the sequences. Then, based on this alignment we can infer our tree.
Align sequences with mafft. mafft is a multiple alignment software. It is run outside of R, but we can use the mafft function from the ips package to run it from R.
seq.dna <- as.DNAbin(seqs)
align <- mafft(seq.dna, exec = "mafft-win/mafft.bat")
Look at alignment. Each row is an ASV sequence, each column is a position in the alignment. The length of the alignment is expected to be around the length of the sequenced 16S rRNA gene region.
image(align)
6 of the ASVs clearly have weird sequences which do not align with the others. This is a good step to identify false ASVs.
You can check for false ASVs by checking:
Count ASVs with different sequence lengths:
table(nchar(as.character(seqs)))
6(!) are much shorter (194 bp) than the rest. What are their names:
names(seqs)[nchar(as.character(seqs)) < 200]
Are there some ASVs with unknown Phylum?:
taxa[is.na(taxa[, "Phylum"]), ]
Exactly the same 6 ASVs which were short also have unknown Phylum annotation. These are probably false.
Let's remove the 6 bad ASVs and redo the alignment:
# Get names of bad ASVs
bad_asvs <- names(seqs)[nchar(as.character(seqs)) < 200]
# Remove from objects
seqtab.nochim <- seqtab.nochim[, !colnames(seqtab.nochim) %in% bad_asvs]
seqs <- seqs[!names(seqs) %in% bad_asvs]
taxa <- taxa[!rownames(taxa) %in% bad_asvs, ]
# Redo alignment
seq.dna <- as.DNAbin(seqs)
align <- mafft(seq.dna, exec = "mafft-win/mafft.bat")
# Plot alignment
image(align)
Much better!
Let's create a phylogenetic tree with FastTree2
# We save the alignment as a fasta file
write.fas(align, file = "alignment.fasta")
# FastTree is also an external software that we run from within R
system2("FastTree.exe", args = c("-nt", "-out tree.nwk", "alignment.fasta"))
# Read the tree that FastTree made and plot it
# cex controls is size of labels in the plot
tree <- read.tree("tree.nwk")
plot(tree, cex = 0.5)
We should always root the tree. For this we need an outgroup. If we had Archaea in our samples, it would be a good choice. Here we use the Campylobacter phylum.
campy_asvs <- rownames(taxa)[taxa[, "Phylum"] == "Campylobacterota"]
# Root
tree <- root(tree, outgroup=campy_asvs, resolve.root=TRUE)
# Color by phylum
cols <- c("red","blue","green","purple","gold","black","cyan","grey")[factor(taxa[tree$tip.label, 2])]
# Plot
plot(tree, cex = 0.5, tip.color = cols)
Let's put all together in a phyloseq object
First, we need one more thing, the sample metadata:
# Read sample data
# Here we have information for the different samples
samp <- read.table("sampleData.csv", sep = ";", header = TRUE, row.names = 1)
print(samp)
If the taxonomic classifier cannot assign a certain taxonomic level it will write NA in the taxonomy table. NAs are a problem in the later analysis, expecially for tax agglomoration, and we therefore want to replace NAs with the most specific known taxonomic level.
taxa <- t(apply(taxa, 1, function(x)
if(sum(is.na(x))>0){
c(x[!is.na(x)],
paste(colnames(taxa)[max(which(!is.na(x)))], rep(x[max(which(!is.na(x)))], sum(is.na(x))), sep="_"))}
else{x}))
Now we have a final taxonomic table:
taxa
Put it all together
# Collect everything in a phyloseq object
phy <- phyloseq(otu_table(t(seqtab.nochim), taxa_are_rows = TRUE),
sample_data(samp),
tax_table(taxa),
phy_tree(tree),
seqs)
# Check what the phyloseq object contains
phy
Save it for use in other scripts
# Save phyloseq object for later analysis
save(phy, file = "physeq_test.RData")
# Save everything
#save.image("dada2.RData")