Before we can cluster our reads into Amplicon Sequence Variants (ASVs) or Operational Taxonomic Units (OTUs), we have to prepare a few things. First, we have to remove the primer sequences from the reads. Second, we have to download the taxonomic database.
Note that the following code is run in shell, not in R. Shell (bash) is a command-line interpreter, usually run through a terminal, and is the common interface for working with servers and computing clusters.
We have 2 samples as an example dataset in a directory called Seqs. For each sample we have two fastq files, one for the forward reads (_1), and one for the reverse reads (_2).
ls lists the files:
ls -lh Seqs
Each read is spread across 4 lines in the fastq file.
Let's look at the first read in the SRX2198605_1.fastq file.
head -n4 Seqs/SRX2198605_1.fastq
Quality scores are encoded as ASCII characters on the fourth line of each record, where each character represents the Phred quality score of the base at the same position in the sequence.
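As a quick illustration of the encoding (assuming the standard Phred+33 scheme used by modern Illumina fastq files), the quality score of a character is its ASCII code minus 33. This can be checked directly in the shell:

```shell
# Phred+33 encoding: quality score = ASCII code of the character minus 33.
# For example, the character 'I' (ASCII 73) encodes a quality score of 40.
char="I"
ascii=$(printf '%d' "'$char")
echo $(( ascii - 33 ))
# prints 40
```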
The first step of preparation is to remove the primers from the sequences, as these cause problems in the downstream analysis. Furthermore, if we cannot find the primer sequence in a read, the read is likely noisy and should be removed. Luckily there is a tool which does all this for us, cutadapt:
First we make a directory where we will put our trimmed reads:
mkdir Seqs/Trimmed
Then we make a text file which contains our sample names:
ls Seqs | grep "\.fastq$" | sed 's/_[1,2].fastq//' | sort | uniq > samples.txt
Which looks like this:
cat samples.txt
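To see what the pipeline above does to a single filename, we can run the same sed expression on an example (using a filename from our dataset): the read-direction suffix is stripped, so the forward and reverse file of a pair collapse to the same sample name, and sort | uniq then keeps one copy per sample.

```shell
# The sed expression removes the "_1.fastq" / "_2.fastq" suffix,
# leaving only the sample name.
echo "SRX2198605_1.fastq" | sed 's/_[1,2].fastq//'   # prints SRX2198605
echo "SRX2198605_2.fastq" | sed 's/_[1,2].fastq//'   # prints SRX2198605
```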
cutadapt prints a detailed report on how many reads were kept and where it found the primers in the reads. Here we save that report for each sample in a file called cutadapt_SAMPLE.log:
# -g: forward primer (at the 5' end of the forward reads)
# -G: reverse primer (at the 5' end of the reverse reads)
# --discard-untrimmed: drop read pairs in which a primer was not found
for i in $(cat samples.txt)
do
  cutadapt -g GTGCCAGCMGCCGCGGTAA -G GGACTACHVGGGTWTCTAAT --discard-untrimmed \
    -o Seqs/Trimmed/${i}_1.fastq \
    -p Seqs/Trimmed/${i}_2.fastq Seqs/${i}_1.fastq Seqs/${i}_2.fastq > cutadapt_${i}.log
done
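As a quick sanity check (assuming the raw and trimmed files from the steps above exist), we can count reads before and after trimming. Each fastq record is exactly 4 lines, so the line count divided by 4 is the number of reads; the difference shows how many read pairs --discard-untrimmed removed:

```shell
# Count reads (lines / 4) in the raw and trimmed forward files per sample.
for i in $(cat samples.txt)
do
  raw=$(( $(wc -l < Seqs/${i}_1.fastq) / 4 ))
  trimmed=$(( $(wc -l < Seqs/Trimmed/${i}_1.fastq) / 4 ))
  echo "${i}: ${raw} raw reads, ${trimmed} after trimming"
done
```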
To assign a taxonomy to our sequence data later, we need to download and prepare a taxonomic database. There are many different ones, but here we will use GTDB. Remember to always check if a newer version has been published.
Download the database:
wget https://zenodo.org/record/3266798/files/GTDB_bac120_arc122_ssu_r89.fa.gz
The database is just a fasta file with sequences whose names contain the taxonomic assignments. It is gzip-compressed, which is why we run gzip -dc to look at it:
gzip -dc GTDB_bac120_arc122_ssu_r89.fa.gz | head
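Since each fasta record starts with a '>' header line, we can also count how many reference sequences the database contains without decompressing it to disk (assuming the file downloaded above is present):

```shell
# Count the '>' header lines, i.e. the number of reference sequences.
gzip -dc GTDB_bac120_arc122_ssu_r89.fa.gz | grep -c "^>"
```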