Prepare reads and database

Before we can cluster our reads into Amplicon Sequence Variants (ASVs) or Operational Taxonomic Units (OTUs) we have to prepare a few things. First, we have to remove the primer sequences from the reads. Second, we have to download the taxonomic database.

Note that the following code is run in the shell, not in R. The shell (bash) is a command-line interpreter, usually run through a terminal, and it is the standard interface for working with servers and computing clusters.

Look at reads

We have an example dataset of two samples in a directory called Seqs. For each sample there are two fastq files: one with the forward reads (_1) and one with the reverse reads (_2).

ls lists the files:

In [1]:
ls -lh Seqs
total 275M

-rw-r--r-- 1 russel mibi 72M Feb 21 00:19 SRX2198605_1.fastq

-rw-r--r-- 1 russel mibi 72M Feb 21 00:19 SRX2198605_2.fastq

-rw-r--r-- 1 russel mibi 32M Feb 21 00:19 SRX2198606_1.fastq

-rw-r--r-- 1 russel mibi 32M Feb 21 00:19 SRX2198606_2.fastq

Each read is spread across 4 lines in the fastq file.

  • First line: begins with @, followed by the read name (header for the sequence)
  • Second line: the sequence of the read
  • Third line: begins with +, optionally followed by the read name again (header for the quality)
  • Fourth line: the quality string of the read
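Because each read occupies exactly four lines, dividing a fastq file's line count by 4 gives its number of reads. A minimal sketch, demonstrated on a tiny inline file (on the real data you would point `wc -l` at e.g. Seqs/SRX2198605_1.fastq):

```shell
# Write a mock fastq file with two reads (4 lines each)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > example.fastq

# Number of reads = number of lines / 4
echo $(( $(wc -l < example.fastq) / 4 ))
```

This prints 2, the number of reads in the mock file.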

Let's look at the first read in the SRX2198605_1.fastq file.

In [2]:
head -n4 Seqs/SRX2198605_1.fastq
@SRX2198605.1

GTAGTGCCAGCCGCCGCGGTAATACGTAGGTTCCGAGCGTTATTCGGATTTACTGGGCGTAAAGCGCACGCAGGCGGTTATTTAAGTCTGGTGTTAAAGCCTCTGGCTTAACCTTGGTATTCTTTTTCAGCTTTTTTACTTAGATGCCTTTAGGGAGGGGTAGATTTCCTCGTTTACCGGTGAATTCCTTAGTGTATTGGGGGAATACCGTAGGCGAAGCCGGCCTCTTGGTTATGTCCTGACGTTCCTTT

+SRX2198605.1

A1>AABDDFBAAEEGEECEGGCHFHA1AF/111AA//A/BE//2D00//B1F2D@1//EE?/AG1E//>EE?ECFGGGG/1BB2>BG22B010?/222F1?10111?<F11<F111?<11>111=1101<111111-.00<0<000000000;B.C?-C-;.9B0;09000.....00.---.:0////;//9//////;-;-@--///-9;9-/99>@-----9-;B///:-/;////////-----9//

Quality scores are encoded as ASCII characters: subtracting 33 from a character's ASCII value gives its Phred quality score (the so-called Phred+33 encoding).
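The Phred+33 encoding can be checked directly in the shell. A minimal sketch, using the quality character I as an example:

```shell
# printf '%d' with a leading quote prints the ASCII code of a character
q=$(printf '%d' "'I")   # 'I' has ASCII code 73
echo $(( q - 33 ))      # Phred+33: 73 - 33 = quality score 40
```

A score of 40 corresponds to a 1 in 10,000 probability that the base call is wrong.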

Remove primers

The first step of preparation is to remove the primers from the sequences, as they cause problems in the downstream analysis. Furthermore, if we cannot find the primer sequence in a read, the read is likely noisy and should be removed. Luckily, there is a tool which does all this for us: cutadapt.

First we make a directory where we will put our trimmed reads:

In [3]:
mkdir Seqs/Trimmed

Then we make a text file which contains our sample names:

In [4]:
ls Seqs | grep "\.fastq$" | sed 's/_[12]\.fastq//' | sort | uniq > samples.txt

Which looks like this:

In [5]:
cat samples.txt
SRX2198605

SRX2198606
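To see what the pipeline above is doing, it can be run on mock file names: sed strips the _1/_2.fastq suffix, and sort plus uniq collapse the forward and reverse files into one name per sample.

```shell
# Mock file listing in place of `ls Seqs`; the rest of the pipeline is unchanged
printf 'A_1.fastq\nA_2.fastq\nB_1.fastq\nB_2.fastq\n' \
    | sed 's/_[12]\.fastq//' | sort | uniq
```

This prints A and B, one sample name per line.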

Now we loop through the sample names and remove the primers from the reads of each sample:
  • The sequence after -g is the forward primer
  • The sequence after -G is the reverse primer
  • --discard-untrimmed tells the software to remove reads if the primer was not found
  • The rest tells the software where to save the trimmed reads, and where to find the raw reads

The output gives detailed information on how many reads were kept and where the primers were found in the reads. Here we save the output report of each sample in a file called cutadapt_SAMPLE.log.

In [6]:
for i in $(cat samples.txt)
    do cutadapt -g GTGCCAGCMGCCGCGGTAA -G GGACTACHVGGGTWTCTAAT --discard-untrimmed \
        -o Seqs/Trimmed/${i}_1.fastq \
        -p Seqs/Trimmed/${i}_2.fastq Seqs/${i}_1.fastq Seqs/${i}_2.fastq > cutadapt_${i}.log 
done
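Since --discard-untrimmed drops reads, it is worth checking how many read pairs survived. The four-lines-per-read rule gives a quick sanity check; a sketch on mock data (the demo/ paths and S1 sample name are placeholders, on the real data you would use Seqs/, Seqs/Trimmed/, and samples.txt):

```shell
# Mock the directory layout: 2 raw read pairs, of which 1 survives trimming
mkdir -p demo/Seqs/Trimmed
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > demo/Seqs/S1_1.fastq
printf '@r1\nACGT\n+\nIIII\n' > demo/Seqs/Trimmed/S1_1.fastq
echo "S1" > demo/samples.txt

# Count reads (lines / 4) before and after trimming for each sample
for i in $(cat demo/samples.txt); do
    raw=$(( $(wc -l < demo/Seqs/${i}_1.fastq) / 4 ))
    kept=$(( $(wc -l < demo/Seqs/Trimmed/${i}_1.fastq) / 4 ))
    echo "${i}: ${raw} raw read pairs, ${kept} kept after trimming"
done
```

If the kept fraction is very low for a sample, the primer sequences or their orientation should be double-checked.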

Taxonomic database

For assigning taxonomy to our sequences later, we need to download and prepare a taxonomic database. There are many to choose from; here we will use GTDB. Remember to always check whether a newer version has been published.

Download the database:

In [7]:
wget https://zenodo.org/record/3266798/files/GTDB_bac120_arc122_ssu_r89.fa.gz
--2020-02-21 01:21:19--  https://zenodo.org/record/3266798/files/GTDB_bac120_arc122_ssu_r89.fa.gz
Resolving zenodo.org (zenodo.org)... 188.184.95.95
Connecting to zenodo.org (zenodo.org)|188.184.95.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7522880 (7.2M) [application/octet-stream]
Saving to: ‘GTDB_bac120_arc122_ssu_r89.fa.gz’

GTDB_bac120_arc122_ 100%[===================>]   7.17M  19.3MB/s    in 0.4s    

2020-02-21 01:21:20 (19.3 MB/s) - ‘GTDB_bac120_arc122_ssu_r89.fa.gz’ saved [7522880/7522880]

The database is just a fasta file whose sequence names hold the taxonomic assignments (it is compressed, which is why we run gzip -dc to look at it):

In [10]:
gzip -dc GTDB_bac120_arc122_ssu_r89.fa.gz | head
>Archaea;Halobacterota;Archaeoglobi;Archaeoglobales;Archaeoglobaceae;Archaeoglobus_C;Archaeoglobus_C_veneficus(RS_GCF_000194625.1)

TACCGGCGGCCCGAGTGGCGGCCACTCTTATTGGGCCTAAAGCGTCCGTAGCCGGTCTGGTAAGTCCCCCGGGAAATCTGGCGGCTTAACCGTCAGACTGCCGGGGGATACTGCCAGACTAGGGACCGGGAGAGGCCGAGGGTATTCCCGGGGTAGGGGTGAAATCCTGTAATCCCGGGAGGACCACCTGTGGCGAAGGCGCTCGGCTGGAACGGGTCCGACGGTGAGGGACGAAGGCCTGGGGAGCGAACCGG

>Archaea;Halobacterota;Halobacteria;Halobacteriales;Haloferacaceae;Halorubrum;Halorubrum_sp002286985(RS_GCF_002286985.1)

TACCGGCAGCCCGAGTGATGGCCGATAATATTGGGCCTAAAGCGTCCGTACCTGGCCGCGCAAGTCCATCCCAAAATCCACCCGCCCAACGGGTGGGCGTCCGGTGGAAACTGCGTGGCTTGGGACCGGAAGGCGCGACGGGTACGTCCGGGGTAGGAGTGAAATCCCGTAATCCTGGACGGACCGCCGATGGCGAAAGCACGTCGCGAGGACGGATCCGACAGTGAGGGACGAAAGCCAGGGTCTCGAACCGG

>Archaea;Halobacterota;Halobacteria;Halobacteriales;Halobacteriaceae;Halobacterium;Halobacterium_hubeiense(RS_GCF_001488575.1)

TACCGGCAGCCCGAGTGATGGCCGATATTATTGGGCCTAAAGCGTCCGTAGCTGGCCGGACAAGTCCGTTGGGAAATCTGTTCGCTTAACGAGCAGGCGTCCAGCGGAAACTGTTCGGCTTGGGACCGGAAGACCTGAGGGGTACGTCCGGGGTAGGAGTGAAATCCTGTAATCCTGGACGGACCACCGGTGGCGAAAGCGCCTCAGGAGGACGGATCCGACAGTGAGGGACGAAAGCTAGGGTCTCGAACCGG

>Archaea;Halobacterota;Halobacteria;Halobacteriales;Natrialbaceae;Halovivax;Halovivax_asiaticus(RS_GCF_000337515.1)

TACCGGCAGCACGAGTGATGACCGCTATTATTGGGCCTAAAGCGTCCGTAGCCGGCCGGGCAAGTCCATCGGGAAATCCGCACGCCTAACGTGCGGGCGTCCGGTGGAAACTGCTTGGCTTGGGACCGGAAGATCCAGAGGGTACGTCTGGGGTAGGAGTGAAATCCTGTAATCCTGGACGGACCACCGGTGGCGAAAGCGCTCTGGAAAGACGGATCCGACGGTGAGGGACGAAAGCTTGGGTCACGAACCGG

>Archaea;Thermoplasmatota;Thermoplasmata;Thermoplasmatales;Thermoplasmataceae;Ferroplasma;Ferroplasma_acidiphilum(RS_GCF_002078355.1)

CACCCGCAGCACGAGTAGTGGTCACTTTTATTGAGCCTAAAGCGTTCGTAGCCGGTTTTGTAAATCTTCAGATAAAGCCTGAAGCTTAACTCCAGAAAGTCTGAAGAGACTGCAAGACTTGAGATCGGGTGAGGTTAAACGTACTTTCAGGGTAGGGGTAAAATCCTGTAATCCCGGAAGGACGACCAGTGGCGAAAGCGTTTAACTAGAACGAATCTGACGGTAAGGAACGAAGGCTAGGGTAGCAAACCGG



gzip: stdout: Broken pipe

The Broken pipe message is expected and harmless: head exits after printing ten lines, which closes the pipe while gzip is still writing to it.
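Because the headers are semicolon-separated taxonomy strings (Domain;Phylum;Class;Order;Family;Genus;Species(accession)), standard text tools can summarise the database. A sketch on a mock two-record file (on the real file, replace mini_db.fa.gz with GTDB_bac120_arc122_ssu_r89.fa.gz):

```shell
# Mock database with two records, mimicking the GTDB header layout
printf '>Archaea;Halobacterota;C;O;F;G;s1(A1)\nACGT\n>Archaea;Thermoplasmatota;C;O;F;G;s2(A2)\nTTTT\n' \
    | gzip > mini_db.fa.gz

# Number of sequences = number of fasta headers
gzip -dc mini_db.fa.gz | grep -c "^>"

# Tally sequences per phylum (second semicolon-separated field)
gzip -dc mini_db.fa.gz | grep "^>" | cut -d';' -f2 | sort | uniq -c
```

On the real database the same commands report the total number of reference sequences and how they are distributed across phyla.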