.. _quickstart_polya: Quickstart - how to estimate poly(A) tail lengths from nanopore native RNA reads ================================================================================= Owing to homopolymer effects and the proximity to the sequencing adapters, the polyadenylated tails of reads obtained from nanopore native RNA sequencing are improperly basecalled, making their lengths difficult to quantify. We developed the `polya` subprogram to use an alternative statistical model to estimate these tail lengths. In this quickstart tutorial, we'll show you how to estimate polyadenylated tail lengths step-by-step, starting from nothing but raw fast5 files. We'll basecall the fast5 files with Oxford Nanopore Technologies' *albacore* basecaller, before aligning the resulting reads with *minimap2*, indexing the files with *nanopolish index*, and finally segmenting the reads and calling the tail lengths with *nanopolish polya*. We'll be following the steps taken in our `benchmarking analysis `_ workflow that accompanies `our publication `_. In each step below, we'll generate files in the working directory instead of making subdirectories, except in the case of large datasets. Software requirements --------------------- * `nanopolish `_ `>= v10.2` * `minimap2 `_ `>= v2.12` * `samtools `_ `>= v1.9` * `albacore >= v2.3.3` (from Oxford Nanopore's private customer portal) Download raw fast5 data and basecall ------------------------------------ Let's start by downloading a dataset of fast5 files from the European Nucleotide Archive. We'll download a tarball of fast5 files containing reads that are known to have polyadenylated tail lengths of 30nt. :: mkdir data && mkdir data/fastqs wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR276/ERR2764784/30xpolyA.tar.gz -O 30xpolyA.tar.gz && mv 30xpolyA.tar.gz data/ tar -xzf data/30xpolyA.tar.gz -C data/ guppy_basecaller -c rna_r9.4.1_70bps_hac.cfg -i data/30xpolyA/fast5/pass -s data/fastqs cat data/fastqs/*.fastq > data/30xpolyA.fastq In the above, change the value of the `-f` and `-k` arguments based on your flow-cell and sequencing kit, as the basecaller's accuracy is highly dependent upon these settings. Our directory structure should now look something like this: :: (current directory) |- data/ |-- 30xpolyA.fastq |-- 30xpolyA.tar.gz |-- 30xpolyA/ |--- fast5/ |---- pass/ |----- raw_read1.fast5 |----- (... more raw fast5's here ...) |-- fastqs/ |--- workspace/ |---- pass/ Index with nanopolish index --------------------------- Let's construct an index for our reads with nanopolish's `index` subprogram. This constructs a fast lookup table that tells our program where to find the raw fast5 file for each basecalled read. :: nanopolish index --directory=data/30xpolyA/fast5/pass --sequencing-summary=data/fastqs/sequencing_summary.txt data/30xpolyA.fastq This should generate a collection of files in the same directory that contains the `30xpolyA.fastq`. Align with minimap2 and format the BAM file ------------------------------------------- Now we'll align our basecalled mRNA sequences in the fastq file with our reference. First download a reference fasta: :: wget https://raw.githubusercontent.com/paultsw/polya_analysis/master/data/references/enolase_reference.fas wget https://raw.githubusercontent.com/paultsw/polya_analysis/master/data/references/enolase_reference.fas.fai mv enolase_reference.* data/ Note that our directory structure should now look like this: :: (current directory) |- data/ |-- enolase_reference.fas |-- enolase_reference.fai |-- (... same structure as above ...) Let's run `minimap2` to align our basecalled reads to this reference and generate a collection of SAM/BAM files with `samtools`: :: minimap2 -a -x map-ont data/enolase_reference.fas data/30xpolyA.fastq | samtools view -b - -o data/30xpolyA.bam cd data && samtools sort -T tmp -o 30xpolyA.sorted.bam 30xpolyA.bam && samtools index 30xpolyA.sorted.bam && cd .. Note that we used `-x map-ont`, which is typically for unspliced reads (e.g. genomic DNA) coming from nanopore sequencers. In typical native mRNA sequencing, you should use `-x splice`, which is a splice-aware setting (and uses a different gap cost in the alignment). We're using `map-ont` due to the fact that our reads come from a control dataset with no splicing. There should be three more files in the `data` directory now: `30xpolyA.bam`, `30xpolyA.sorted.bam`, and `30xpolyA.sorted.bam.bai`. Segmentation and tail length estimation with nanopolish polya ------------------------------------------------------------- Finally, we can run the polyadenylation estimator: :: nanopolish polya --threads=8 --reads=data/30xpolyA.fastq --bam=data/30xpolyA.sorted.bam --genome=data/enolase_reference.fas > polya_results.tsv Set the `--threads` flag to the number of parallel threads you want to use. Generally speaking, a larger number of threads tends to lower the compute time, but there are diminishing returns to a higher value and performance can actually decrease if your CPU is incapable of supporting your desired number of parallel threads. The best number of threads to use is highly dependent upon your hardware. Interpreting the output TSV file -------------------------------- We'll end this quickstart with a look at the output of the polya program. Let's look at the top five lines of the `polya_results.tsv` file we've just generated: :: head -20 polya_results.tsv | column -t readname contig position leader_start adapter_start polya_start transcript_start read_rate polya_length qc_tag d6f42b79-90c6-4edd-8c8f-8a7ce0ac6ecb YHR174W 0 54.0 3446.0 7216.0 8211.0 130.96 38.22 PASS 453f3f3e-d22f-4d9c-81a6-8576e23390ed YHR174W 0 228.0 5542.0 10298.0 11046.0 130.96 27.48 PASS e02d9858-0c04-4d86-8dba-18a47d9ac005 YHR174W 0 221.0 1812.0 7715.0 8775.0 97.16 29.16 PASS b588dee2-2c5b-410c-91e1-fe8140f4f837 YHR174W 0 22.0 8338.0 13432.0 14294.0 130.96 32.43 PASS af9dfee2-1711-4083-b109-487b99895e0a YHR174W 0 889.0 3679.0 6140.0 7168.0 130.96 39.65 PASS 93f98a86-3b18-48cf-8c4d-15cf277911e2 YHR174W 0 144.0 1464.0 5615.0 6515.0 120.48 30.96 SUFFCLIP af9dfee2-1711-4083-b109-487b99895e0a YHR174W 0 889.0 3679.0 6140.0 7168.0 130.96 39.65 SUFFCLIP 93f98a86-3b18-48cf-8c4d-15cf277911e2 YHR174W 0 144.0 1464.0 5615.0 6515.0 120.48 30.96 PASS ca8d4059-9d82-45ee-aa07-4b8b351618b3 YHR174W 0 1.0 2157.0 4255.0 5862.0 111.56 54.48 PASS 3493c123-78d4-4f7c-add0-cbb249aef00a YHR174W 0 78.0 1938.0 4829.0 5491.0 136.91 25.05 PASS f5ff1802-3fdd-479a-8888-c72de01bbaea YHR174W 0 150.0 3476.0 7233.0 7932.0 130.96 25.35 PASS bb929728-2ed8-42b0-a5a5-eea4bfd62673 YHR174W 0 91.0 1061.0 6241.0 6910.0 111.56 19.74 PASS 17cf3fef-1acb-4045-8252-e9c00fedfb7c YHR174W 0 447.0 6004.0 20058.0 20964.0 100.40 25.17 ADAPTER e3e38de6-8a99-4029-a067-261f470517ca YHR174W 0 41.0 1588.0 4057.0 5303.0 130.96 49.13 PASS 66f55b56-c22e-4e6d-999e-50687bed6fb7 YHR174W 0 191.0 3160.0 9218.0 10030.0 125.50 28.79 PASS 56c116d7-9286-4b57-8329-e74928b11b38 YHR174W 0 13.0 1516.0 5845.0 6773.0 130.96 35.30 PASS 5ca1392c-c48f-4135-85d3-271bd4ee7a13 YHR174W 0 1.0 1976.0 4854.0 5947.0 136.91 44.64 PASS 66b5a0ef-b8e6-475e-bf20-04b96154a67f YHR174W 0 98.0 3847.0 7066.0 7925.0 120.48 29.32 PASS 34bf2187-5816-4744-8e6a-3250b5247e02 YHR174W 0 1.0 2897.0 6885.0 7547.0 125.50 22.54 PASS Each row corresponds to the output for a given read. The columns have the following interpretation: * `readname` refers to the unique ID associated to this particular read. This string is also used to look up the corresponding fast5 file, e.g. by looking through the `readdb` file generated by `nanopolish index`. * `contig` refers to the reference sequence that this read aligns to, and is taken from the BAM file. * `position` is the 5' starting position of the alignment to the reference sequence, and also comes from the BAM file. * `leader_start`, `adapter_start`, `polya_start`, and `transcript_start` are the indices of the signal segmentation generated by the underlying model within `nanopolish`. Briefly, there are four biologically-meaningful regions of the raw sequence of electrical current readings within each fast5 file; these four entries denote the starting index of each of these consecutive regions. The indices start from 0 and are oriented in the 3'-to-5' direction (due to the inherent orientation of the native RNA nanopore sequencing protocol). A full exposition of this segmentation algorithm is available in the `supplementary note`_ to our associated publication. * `read_rate` is the estimated translocation rate (in units of nucleotides/second) of the read through the pore. The translocation rate is non-uniform during the sequencing process of even a single molecule, so this is ultimately a summary statistic of the dynamic, time-varying rate. * `polya_length` is the estimated polyadenylated tail length, in number of nucleotides. That this value is a float rather than an integer reflects the fact that our estimated tail length is the output of an estimator based on the translocation rate. * `qc_tag` is an additional flag used to indicate the validity of the estimate. Generally speaking, you should only use rows of the output file with this value set to `PASS`; all other rows with (e.g.) the `qc_tag` set to `SUFFCLIP`, `ADAPTER`, etc. display signs of irregularity that indicate that we believe the estimate to be unreliable. You can easily filter away all rows with the tag set to anything other than `PASS` using a `grep`: :: grep 'PASS' polya_results.tsv > polya_results.pass_only.tsv