.. _quickstart_call_methylation: Quickstart - calling methylation with nanopolish ===================================================== Oxford Nanopore sequencers are sensitive to base modifications. Here we provide a step-by-step tutorial to help you get started with detecting base modifications using nanopolish. **For more information about our approach:** * Simpson, Jared T., et al. `"Detecting DNA cytosine methylation using nanopore sequencing." `_ Nature Methods (2017). **Requirements**: * `nanopolish v0.8.4 `_ * `samtools v1.2 `_ * `minimap2 `_ Download example dataset ------------------------------------ In this tutorial we will use a subset of the `NA12878 WGS Consortium data `_. You can download the example dataset we will use here (warning: the file is about 2GB): :: wget http://s3.climb.ac.uk/nanopolish_tutorial/methylation_example.tar.gz tar -xvf methylation_example.tar.gz cd methylation_example **Details**: * Sample : Human cell line (NA12878) * Basecaller : Guppy v2.3.5 * Region: chr20:5,000,000-10,000,000 In the extracted example data you should find the following files: * ``albacore_output.fastq`` : the subset of the basecalled reads * ``reference.fasta`` : the chromsome 20 reference sequence * ``fast5_files/`` : a directory containing signal-level FAST5 files The reads were basecalled using this albacore command: :: guppy_basecaller -c dna_r9.4.1_450bps_flipflop.cfg -i fast5_files/ -s basecalled/ After the basecaller finished, we merged all of the fastq files together into a single file: :: cat basecalled/*.fastq > output.fastq Data preprocessing ------------------------------------ nanopolish needs access to the signal-level data measured by the nanopore sequencer. To begin, we need to create an index file that links read ids with their signal-level data in the FAST5 files: :: nanopolish index -d fast5_files/ output.fastq We get the following files: ``output.fastq.index``, ``output.fastq.index.fai``, ``output.fastq.fastq.index.gzi``, and ``output.fastq.index.readdb``. Aligning reads to the reference genome -------------------------------------- Next, we need to align the basecalled reads to the reference genome. We use minimap2 as it is fast enough to map reads to the human genome. In this example we'll pipe the output directly into ``samtools sort`` to get a sorted bam file: :: minimap2 -a -x map-ont reference.fasta output.fastq | samtools sort -T tmp -o output.sorted.bam samtools index output.sorted.bam Calling methylation ------------------- Now we're ready to use nanopolish to detect methylated bases (in this case 5-methylcytosine in a CpG context). The command is fairly straightforward - we have to tell it what reads to use (``output.fastq``), where the alignments are (``output.sorted.bam``), the reference genome (``reference.fasta``) and what region of the genome we're interested in (``chr20:5,000,000-10,000,000``):: nanopolish call-methylation -t 8 -r output.fastq -b output.sorted.bam -g reference.fasta -w "chr20:5,000,000-10,000,000" > methylation_calls.tsv The output file contains a lot of information including the position of the CG dinucleotide on the reference genome, the ID of the read that was used to make the call, and the log-likelihood ratio calculated by our model: :: chromosome start end read_name log_lik_ratio log_lik_methylated log_lik_unmethylated num_calling_strands num_cpgs sequence chr20 4980553 4980553 c1e202f4-e8f9-4eb8-b9a6-d79e6fab1e9a 3.70 -167.47 -171.17 1 1 TGAGACGGGGT chr20 4980599 4980599 c1e202f4-e8f9-4eb8-b9a6-d79e6fab1e9a 2.64 -98.87 -101.51 1 1 AATCTCGGCTC chr20 4980616 4980616 c1e202f4-e8f9-4eb8-b9a6-d79e6fab1e9a -0.61 -95.35 -94.75 1 1 ACCTCCGCCTC chr20 4980690 4980690 c1e202f4-e8f9-4eb8-b9a6-d79e6fab1e9a -2.99 -99.58 -96.59 1 1 ACACCCGGCTA chr20 4980780 4980780 c1e202f4-e8f9-4eb8-b9a6-d79e6fab1e9a 5.27 -135.45 -140.72 1 1 CACCTCGGCCT chr20 4980807 4980807 c1e202f4-e8f9-4eb8-b9a6-d79e6fab1e9a -2.95 -89.20 -86.26 1 1 ATTACCGGTGT chr20 4980820 4980822 c1e202f4-e8f9-4eb8-b9a6-d79e6fab1e9a 7.47 -90.63 -98.10 1 2 GCCACCGCGCCCA chr20 4980899 4980901 c1e202f4-e8f9-4eb8-b9a6-d79e6fab1e9a 3.17 -96.40 -99.57 1 2 GTATACGCGTTCC chr20 4980955 4980955 c1e202f4-e8f9-4eb8-b9a6-d79e6fab1e9a 0.33 -92.14 -92.47 1 1 AGTCCCGATAT A positive value in the ``log_lik_ratio`` column indicates support for methylation. We have provided a helper script that can be used to calculate how often each reference position was methylated: :: scripts/calculate_methylation_frequency.py methylation_calls.tsv > methylation_frequency.tsv The output is another tab-separated file, this time summarized by genomic position: :: chromosome start end num_cpgs_in_group called_sites called_sites_methylated methylated_frequency group_sequence chr20 5036763 5036763 1 21 20 0.952 split-group chr20 5036770 5036770 1 21 20 0.952 split-group chr20 5036780 5036780 1 21 20 0.952 split-group chr20 5037173 5037173 1 13 5 0.385 AAGGACGTTAT In the example data set we have also included bisulfite data from ENCODE for the same region of chromosome 20. We can use the included ``compare_methylation.py`` helper script to do a quick comparison between the nanopolish methylation output and bisulfite: :: python3 compare_methylation.py bisulfite.ENCFF835NTC.example.tsv methylation_frequency.tsv > bisulfite_vs_nanopolish.tsv We can use R to visualize the results - we observe good correlation between the nanopolish methylation calls and bisulfite: :: library(ggplot2) library(RColorBrewer) data <- read.table("bisulfite_vs_nanopolish.tsv", header=T) # Set color palette for 2D heatmap rf <- colorRampPalette(rev(brewer.pal(11,'Spectral'))) r <- rf(32) c <- cor(data$frequency_1, data$frequency_2) title <- sprintf("N = %d r = %.3f", nrow(data), c) ggplot(data, aes(frequency_1, frequency_2)) + geom_bin2d(bins=25) + scale_fill_gradientn(colors=r, trans="log10") + xlab("Bisulfite Methylation Frequency") + ylab("Nanopolish Methylation Frequency") + theme_bw(base_size=20) + ggtitle(title) Here's what the output should look like: .. figure:: _static/quickstart_methylation_results.png :scale: 80% :alt: quickstart_methylation_results