Paired-end sequencing - 97% OTU =============================== This tutorial describes a standard micca pipeline for the analysis of overlapping paired-end sequences. The main products of this pipeline are: * the Operational Taxonomic Units (OTUs), defined clustering the processed sequences at a similarity threshold of 97%; * an OTU table, containining the number of times each OTU is observed in each sample; * a taxonomic classification for each OTU; * an OTU phylogenetic tree. This pipeline is intended for different platforms, such as **Illumina MiSeq** and **Illumina HiSeq**. Although this tutorial explains how to apply the pipeline to **16S rRNA** amplicons, it can be adapted to others markers gene/spacers, e.g. **Internal Transcribed Spacer (ITS)**, **18S** or **28S**. .. contents:: Table of Contents :local: Dataset download ---------------- The following paired-end 16S rRNA dataset contains **34 samples** from in FASTQ format (V3-V4 region, 341F 5'-CCTACGGGNGGCWGCAG-3', 806Rmod 5'-GACTACNVGGGTWTCTAATCC-3'). Samples comes from the paper "`Diversity and Cyclical Seasonal Transitions in the Bacterial Community in a Large and Deep Perialpine Lake `_" were seasonal variations in the bacterioplankton community composition in the lake Garda were analized. Sampling was carried out at monthly intervals in three layers representative of the epilimnetic and euphotic zones of the lake, 1, 10, and 20 m. The dataset contains only a subset of the entire study (2014 samples only) and raw data were randomly subsampled at 3000 sequences per sample. The 2x300-bp paired-end sequencing was carried out on an Illumina MiSeq. Open a terminal, download the data and prepare the working directory: .. code-block:: sh wget ftp://ftp.fmach.it/metagenomics/micca/examples/garda.tar.gz tar -zxvf garda.tar.gz cd garda Merge paired-end sequences -------------------------- Overlapping paired-end sequences must be merged to obtain consensus sequences (sometimes called `assembly`). This operation can be performed with the :doc:`/commands/mergepairs` command. .. image:: /images/readpaired.png :align: center :scale: 50% Moreover, the command merges the samples in a single file where sample names are appended to the sequence identifier, as in :doc:`/commands/merge` and :doc:`/commands/split` commands. Since the sequenced region is about of 465-bp (806-341) and the reads are of 300-bp, the overlap region is quite large ((2x300)-465=135 bp), as rule of thumb we set a minimum overlap length of 100 and maximum number of allowed mismatches of about 1/3, say 30: .. code-block:: sh micca mergepairs -i fastq/*_R1*.fastq -o merged.fastq -l 100 -d 30 .. Note:: Starting from micca 1.6.0 staggered read pairs (staggered pairs are pairs where the 3' end of the reverse read has an overhang to the left of the 5’ end of the forward read) will be merged by default. To override this feature (and therefore to discard staggered alignments) set the ``-n/--nostagger`` option. .. Note:: :doc:`/commands/mergepairs` works with FASTQ files only. .. Note:: Reverse file names will be constructed by replacing the string ``_R1`` in the forward file name with ``_R2`` (typical in Illumina file names, see options ``-p/--pattern`` and ``-e/--repl``). Compute reads statistics ^^^^^^^^^^^^^^^^^^^^^^^^ We can report sequences statistics computed on the file ``merged.fastq``. Run the command :doc:`/commands/stats`: .. code-block:: sh micca stats -i merged.fastq -o stats_merged The command reports in 3 text files and in the relative plots (in PNG format) the length distribution, the Q score distribution and a quality summary. The quality summary plot (`stats_merged/stats_qualsumm_plot.png`) is reported below: .. image:: /images/garda_stats_qualsumm_plot.png :align: center :scale: 20% Primer trimming --------------- Segments which match PCR primers should be now removed. For paired-end (already merged) reads, we recommend to **trim both forward and reverse primers and discard reads that do not contain the forward OR the reverse primer**. These operations can be performed with the :doc:`/commands/trim` command: .. code-block:: sh micca trim -i merged.fastq -o trimmed.fastq -w CCTACGGGNGGCWGCAG -r GACTACNVGGGTWTCTAATCC -W -R -c The option ``-W/--duforward`` and ``-R/--dureverse`` ensures that reads that do not contain the forward or the reverse primer will be discarded. With the option ``-c/--searchrc`` the command searches reverse complement primers too. Quality filtering ----------------- Producing high-quality OTUs requires high-quality reads. :doc:`/commands/filter` filters sequences according to the maximum allowed expected error (EE) rate %. We recommend values <=1%. For paired-end reads, we recommend to merge pairs first, then quality filter using a maximum EE threshold with **no length truncation**. .. Warning:: Parameters for the :doc:`/commands/filter` command should be chosen using the tool :doc:`/commands/filterstats`. Choosing parameters for filtering ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The command :doc:`/commands/filterstats` reports the fraction of reads that would pass for each specified maximum expected error (EE) rate %: .. code-block:: sh micca filterstats -i trimmed.fastq -o filterstats Open the PNG file ``filterstats/stats_plot.png``: .. image:: /images/garda_stats_plot.png :align: center :scale: 20% In this case (overlapping paired paired-end reads) we are interested in the plot on top (minimum length filtering only). A minimum read length (L) of **400** and a maximum error rate of **0.75%** seems to be a good compromise between the expected error rate and the number of reads remaining. Inspecting the file ``filterstats/minlen_stats.txt``, you can see that more than **85%** reads will pass the filter:: L 0.25 0.5 0.75 1.0 1.25 1.5 ... 399 63.856 77.766 85.664 90.844 94.484 96.853 400 63.856 77.765 85.661 90.842 94.481 96.850 401 63.842 77.747 85.643 90.822 94.459 96.827 ... .. Note:: To obtain general sequencing statistics, run the micca command :doc:`/commands/stats` on the file ``trimmed.fastq``. Filter sequences ^^^^^^^^^^^^^^^^ Now we can run the :doc:`/commands/filter` command with the selected parameters: .. code-block:: sh micca filter -i trimmed.fastq -o filtered.fasta -e 0.75 -m 400 .. Note:: The maximum number of allowed Ns after truncation can be also specified in :doc:`/commands/filterstats` and in :doc:`/commands/filter`. OTU picking ----------- To characterize the taxonomic structure of the samples, the sequences are now organized into `Operational Taxonomic Units (OTUs) `_ at varying levels of identity. An identity of **97%** represent the common working definition of bacterial species. The :doc:`/commands/otu` command implements several state-of-the-art approaches for OTU clustering, but in this tutorial we will focus on the **de novo greedy clustering** (see :doc:`/otu`): .. code-block:: sh micca otu -m denovo_greedy -i filtered.fasta -o denovo_greedy_otus -t 4 -c The :doc:`/commands/otu` command returns several files in the output directory, including the **SV table** (``otutable.txt``) and a FASTA file containing the **representative sequences** (``otus.fasta``). .. Note:: See :doc:`/otu` to see how to apply the **de novo swarm**, **closed-reference** and the **open-reference** OTU picking strategies to these data. .. _pairedend_97-taxonomy: Assign taxonomy --------------- Now we can assign taxonomy to each representative sequence using the :doc:`/commands/classify` command. In this tutorial we use the RDP (https://doi.org/10.1128/AEM.00062-07) classifier. .. Note:: See :doc:`/install` on how to install the RDP classifier on your system. .. code-block:: sh micca classify -m rdp -i denovo_greedy_otus/otus.fasta -o denovo_greedy_otus/taxa.txt :doc:`/commands/classify` returns a taxonomy file like this:: DENOVO1 Bacteria;Cyanobacteria/Chloroplast;Cyanobacteria DENOVO2 Bacteria;Cyanobacteria/Chloroplast;Cyanobacteria;Family II;Family II;GpIIa DENOVO3 Bacteria;Chloroflexi;Anaerolineae;Anaerolineales;Anaerolineaceae DENOVO4 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Limnohabitans ... .. _pairedend_97-tree: Infer the phylogenetic tree --------------------------- These steps are necessary if you want to use phylogenetic-based metrics such as the UniFrac distance (https://doi.org/10.1128/AEM.01996-06) in the downstream analysis. Multiple Sequence Alignment (MSA) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The :doc:`/commands/msa` command provides two approaches for `MSA `_: MUSCLE (https://doi.org/10.1093/nar/gkh340) (de novo alignment) and Nearest Alignment Space Termination (NAST) (https://doi.org/10.1093/nar/gkl244) (which uses a template alignment). In this tutorial we will use the NAST alignment method. For 16S rRNA sequences, a good template alignment is the Greengenes Core Set: .. code-block:: sh wget ftp://ftp.fmach.it/metagenomics/micca/dbs/core_set.tar.gz tar -zxvf core_set.tar.gz At this point we can run the :doc:`/commands/msa` command: .. code-block:: sh micca msa -m nast -i denovo_greedy_otus/otus.fasta -o denovo_greedy_otus/msa.fasta \ --nast-template core_set_aligned.fasta.imputed --nast-threads 4 Build the phylogenetic tree ^^^^^^^^^^^^^^^^^^^^^^^^^^^ At this point we can build the phylogenetic tree from the MSA using :doc:`/commands/tree`: .. code-block:: sh micca tree -i denovo_greedy_otus/msa.fasta -o denovo_greedy_otus/tree.tree .. Note:: The output tree is in `Newick format `_. Midpoint rooting ^^^^^^^^^^^^^^^^ UniFrac metrics require phylogenetic trees to be rooted. The tree can be rooted (in this case at midpoint between the two most distant tips of the tree) using the :doc:`/commands/root` command: .. code-block:: sh micca root -i denovo_greedy_otus/tree.tree -o denovo_greedy_otus/tree_rooted.tree .. Note:: Tree can also be rooted with the outgroup clade containing selected targets, see :doc:`/commands/root`. .. _pairedend_97-biom: Build the BIOM file ------------------- The `Biological Observation Matrix (BIOM) `_ is a common format for representing OTU tables and metadata and is the core data type for downstream analyses in `QIIME `_ and in `phyloseq `_. :doc:`/commands/tobiom` converts the OTU table and the taxonomy table produced by the previous steps to the BIOM format. In addition, the :ref:`formats-sample_data` can be added: .. code-block:: sh micca tobiom -i denovo_greedy_otus/otutable.txt -o denovo_greedy_otus/tables.biom \ -t denovo_greedy_otus/taxa.txt -s sampledata.txt Further steps ------------- * :doc:`/phyloseq` * :doc:`/table`