Dependencies | Installation and execution | Usage | Notes | Example | References | Citations
nanodna (nanopore de novo assembly) is a command line tool written in Bash to ease the de novo assembly of prokaryote genomes from high-quality nanopore sequencing reads (e.g. Q20 > 80%).
Every pre- and post-processing step is managed by nanodna (e.g. read filtering, de novo assembly, assembled sequence enhancement). Its main purpose is to efficiently use different methods, programs and tools to quickly infer accurate genome assemblies (e.g. from 10 to 30 minutes using 12 threads). This mini-workflow can therefore be very useful to deal with large batches of whole-genome shotgun sequencing data.
nanodna runs on UNIX, Linux and most OS X operating systems.
You will need to install the required programs and tools listed in the following tables, or to verify that they are already installed with the required version.
| program | package | version | sources |
|---|---|---|---|
| AlienDiscover | - | ≥ 0.3 | gitlab.pasteur.fr/GIPhy/AlienDiscover |
| AlienTrimmer | - | ≥ 3.1 | gitlab.pasteur.fr/GIPhy/AlienTrimmer |
| CoPro | - | ≥ 0.3 | gitlab.pasteur.fr/GIPhy/CoPro |
| dnaapler | - | ≥ 1.3.0 | github.com/gbouras13/dnaapler |
| Flye | - | ≥ 2.9.6 | github.com/mikolmogorov/Flye |
| FQsum | - | ≥ 0.5 | gitlab.pasteur.fr/GIPhy/FQsum |
| minidna | - | ≥ 25.12 | gitlab.pasteur.fr/GIPhy/minidna |
| minimap2 | - | ≥ 2.30 | github.com/lh3/minimap2 |
| ntCard | - | ≥ 1.2.2 | github.com/bcgsc/ntCard |
| ROCK | - | ≥ 3.0 | gitlab.pasteur.fr/vlegrand/ROCK |
| program | package | version | sources |
|---|---|---|---|
| bzip2 | - | > 1.0.0 | sourceware.org/bzip2/downloads.html |
| DSRC | - | ≥ 2.0 | github.com/refresh-bio/DSRC |
| gzip | - | > 1.5.0 | ftp.gnu.org/gnu/gzip |
| pigz | - | > 2.7 | github.com/madler/pigz |
| program | package | version | sources |
|---|---|---|---|
| cat cp du echo mkdir mktemp mv rm paste sort tr wc | coreutils★ | > 8.0 | ftp.gnu.org/gnu/coreutils |
| gawk | - | > 4.0.0 | ftp.gnu.org/gnu/gawk |
| grep | - | > 2.0 | ftp.gnu.org/gnu/grep |
| sed | - | > 4.2 | ftp.gnu.org/gnu/sed |
★ On OS X, the package coreutils can be installed via Homebrew (e.g. brew install coreutils).
A. Clone this repository with the following command line:

    git clone https://gitlab.pasteur.fr/GIPhy/nanodna.git

B. Go to the created directory and give the execute permission to the file nanodna.sh:

    cd nanodna/
    chmod +x nanodna.sh

C. Check the dependencies (and their version) using the following command line:

    ./nanodna.sh -d

If a required program is not available in your $PATH variable (or if one compiled binary has a different
default name), it should be manually specified. To specify the location
of a specific binary, edit the file nanodna.sh and indicate
the local path to the corresponding binary(ies) within the code block
DEPENDENCIES (approximately lines 110-180). For each
required program, the table below reports the corresponding variable
assignment instruction to edit (if needed) within this code block.
| program | variable assignment | program | variable assignment |
|---|---|---|---|
| AlienDiscover | ALIENDISCOVER_BIN=AlienDiscover; | gawk | GAWK_BIN=gawk; |
| AlienTrimmer | ALIENTRIMMER_BIN=AlienTrimmer; | gzip★ | GZIP_BIN=gzip; |
| bzip2★ | BZIP2_BIN=bzip2; | minidna | MINIDNA_BIN=minidna; |
| CoPro | COPRO_BIN=CoPro; | minimap2 | MINIMAP2_BIN=minimap2; |
| dnaapler | DNAAPLER_BIN=dnaapler; | ntcard | NTCARD_BIN=ntcard; |
| DSRC★ | DSRC_BIN=dsrc; | pigz★ | PIGZ_BIN=pigz; |
| Flye | FLYE_BIN=flye; | ROCK | ROCK_BIN=rock; |
| FQsum | FQSUM_BIN=FQsum; | | |

★ Optional programs are not required to be installed/accessible for nanodna to run.
Note that depending on the installation of some required programs,
the corresponding variable can be assigned with complex commands. For
example, as FQsum is a Java tool that can be (quickly) run
using a Java virtual machine, the executable jar file
FQsum.jar can be used by nanodna by editing the
corresponding variable assignment instruction as follows:
FQSUM_BIN="java -jar $path_to/FQsum.jar".
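
The variable-based mechanism described above can be illustrated with a minimal sketch. The helper `check_bin` is hypothetical and is not taken from nanodna.sh; it only relies on the standard `command -v` shell builtin to test whether the first word of an assignment (e.g. `java` in `java -jar ...`) resolves to an executable.

```shell
# Hypothetical helper (not part of nanodna): report whether the program named
# by the first word of a *_BIN-style value can be found by the shell.
check_bin() {
  # $1: value of a *_BIN variable, e.g. "minimap2" or "java -jar /path/FQsum.jar"
  set -- $1                       # split the value into words; $1 is now the program
  if command -v "$1" >/dev/null 2>&1; then
    echo "[ok]   $1"
  else
    echo "[fail] $1"
    return 1
  fi
}

# Same variable names as in the table above, with their documented default values
MINIMAP2_BIN=minimap2
GAWK_BIN=gawk
for bin in "$MINIMAP2_BIN" "$GAWK_BIN"; do
  check_bin "$bin" || true        # keep going so that every program is reported
done
```

The authoritative check remains `./nanodna.sh -d`, which also reports versions; the sketch above only tests availability.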
D. Execute nanodna with the following command line model:

    ./nanodna.sh [options]

Run nanodna without option to read the following documentation:

    Processing and assembling high-quality nanopore sequencing reads (Q20>80%)
    https://gitlab.pasteur.fr/GIPhy/nanodna

    USAGE:  nanodna  [options]

    OPTIONS:
     -i <infile>  FASTA/FASTQ input file name(s); multiple files can be specified, separated by
                  commas; FASTA/FASTQ file(s) can be uncompressed (file extensions: fa or fasta /
                  fq or fastq); FASTQ file(s) can be compressed using either gzip (gz), bzip2 (bz
                  or bz2), or DSRC (dsrc or dsrc2)
     -o <outdir>  name of the output directory (mandatory option)
     -b <string>  base name for output files (mandatory option)
     -n           already trimmed input data (default: not set)
     -c <int>     digital normalization coverage depth (default: 80)
     -f           force (slower) deterministic assembly
     -t <int>     thread numbers (default: 12)
     -w <dir>     path to the tmp directory (default: $TMPDIR, otherwise /tmp)
     -x           to not remove (after completing) the tmp directory (default: not set)
     -d           checks dependencies and exit
     -h           prints this help and exit

nanodna can consider up to 15 input files (mandatory option -i). Input files should be in FASTA (file extensions .fa or .fasta) or FASTQ format (file extensions .fq or .fastq). FASTQ files can also be compressed using several tools, provided that the file extension matches the expected one (i.e. bzip2: .bz or .bz2; DSRC: .dsrc or .dsrc2; gzip: .gz) and that the corresponding binary is installed.
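
As an illustration of the extension rules above, the following sketch maps an input file name to the compression tool it requires and splits a comma-separated `-i` value. The function name `compression_tool` and the file names are hypothetical; this is not the code used in nanodna.sh.

```shell
# Hypothetical helper (not part of nanodna): name the tool expected for a given
# input file, following the extension rules stated above.
compression_tool() {
  case "$1" in
    *.gz)                      echo gzip  ;;
    *.bz|*.bz2)                echo bzip2 ;;
    *.dsrc|*.dsrc2)            echo dsrc  ;;
    *.fa|*.fasta|*.fq|*.fastq) echo none  ;;   # uncompressed input
    *)                         echo unknown; return 1 ;;
  esac
}

# Option -i takes a comma-separated list; split it and inspect each file name
# (the file names below are placeholders)
infiles="run1.fastq.gz,run2.fq.bz2,run3.fasta"
old_ifs=$IFS; IFS=,
for f in $infiles; do
  echo "$f -> $(compression_tool "$f")"
done
IFS=$old_ifs
```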
Read clipping/trimming, as well as quality- and length-based
filtering, are automatically performed using AlienTrimmer
(Criscuolo & Brisse 2013). However, it is recommended to set the
option -n when technical oligonucleotides were already
clipped.
One important step of nanodna is the digital
normalization of the filtered reads. It selects the longest and the
least erroneous reads using ROCK
(Legrand et al. 2022) in order to build a subset that ensures a reduced
and uniform coverage depth across every position of the sequenced
genome. By default, the expected coverage depth is set to 80× to obtain
a good trade-off between high accuracy and short running times; however,
this default coverage depth can be modified using option
-c.
A genome assembly is next inferred using Flye (Lin et
al. 2016, Kolmogorov et al. 2019). This initial assembly is then
processed to (i) cut small circular sequences that are fully duplicated
within a single contig, (ii) reorient contigs using dnaapler
(Bouras et al. 2024), and (iii) polish the modified contigs using Flye. In
parallel, small circular contigs (i.e. < 8000 bps; if any) are
independently assembled using minidna.
These different steps are finally followed by the computation of the
genome coverage profile using CoPro from minimap2 alignments
(Li 2018, 2021) in order to assess the overall accuracy of the assembled
sequences and to detect their weaknesses (e.g. under-covered regions,
multi-allelic positions). The required running times of every step can
be reduced by using multiple threads (option -t; default:
12) or by setting a temporary directory (option -w) located
on a fast solid-state drive (SSD).
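
For batches of samples, the options -t and -w can be combined in a simple loop. The sketch below is a dry run that only prints one command line per sample; the sample names, the directory layout (reads/, assemblies/), the thread count and the temporary directory are all placeholders to adapt.

```shell
# Dry run: print (do not execute) one nanodna command line per sample.
# Sample names, directories and the thread count are placeholders.
NTHREADS=24
TMP_ROOT=/scratch/tmp            # assumed to be located on a fast SSD
nanodna_cmd() {
  # $1: sample name; reads are assumed to be in reads/<sample>.fastq.gz
  echo "./nanodna.sh -i reads/$1.fastq.gz -o assemblies/$1 -b $1 -t $NTHREADS -w $TMP_ROOT"
}

for sample in isolateA isolateB isolateC; do
  nanodna_cmd "$sample"          # pipe the printed lines into 'sh' to launch the runs
done
```

Printing the commands first makes it easy to review them before launching long-running assemblies.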
Output files are all defined by the same specified prefix
(mandatory option -b) and written into a specified output
directory (mandatory option -o). Each output file content
is determined by its file extension:
| output file | file content |
|---|---|
| <prefix>.aliens.fasta | technical/artefactual oligonucleotides inferred using AlienDiscover in FASTA format |
| <prefix>.0.fasta | initial genome assembly inferred using Flye in FASTA format |
| <prefix>.1.cov.info.txt | genome coverage profile estimated by CoPro from minimap2 read alignments |
| <prefix>.1.fasta | enhanced genome assembly in FASTA format |
| <prefix>.1.seq.info.txt | descriptive statistics of the sequence(s) in <prefix>.1.fasta |
| <prefix>.1.amb.info.txt | ambiguous (multi-allelic) positions in <prefix>.1.fasta with approximated base compositions |
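
The naming scheme in the table above can be sketched as follows, e.g. to verify after a run that each file was written. The helper `expected_outputs` is hypothetical (not part of nanodna), and some files may legitimately be absent depending on the options used.

```shell
# Hypothetical helper (not part of nanodna): list the output file names that
# follow from the table above for a given output directory and prefix.
expected_outputs() {
  # $1: output directory (option -o); $2: prefix (option -b)
  for ext in aliens.fasta 0.fasta 1.cov.info.txt 1.fasta 1.seq.info.txt 1.amb.info.txt; do
    echo "$1/$2.$ext"
  done
}

# Report which of the expected files exist and are non-empty
expected_outputs example Ecoli.I | while read -r f; do
  if [ -s "$f" ]; then echo "[ok]      $f"; else echo "[missing] $f"; fi
done
```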
In order to illustrate the usefulness of nanodna and to better describe its output files, the following example describes its usage for assembling the genome of Escherichia coli isolate I (Lerminiaux et al. 2024).
All output files are available in the directory example/ (genome sequence files are compressed using gzip), as well as the version of every used tool and program (see program.versions.txt).
Long-read sequencing of this genome was performed using a MinION apparatus, and the resulting (compressed) FASTQ file (750 Mb) can be downloaded using the following command lines:

    EBIURL="https://ftp.sra.ebi.ac.uk/vol1/fastq";
    wget -O SRR26162837.fastq.gz $EBIURL/SRR261/037/SRR26162837/SRR26162837_1.fastq.gz ;

Below is a typical command line to run nanodna (default: 12 threads), by specifying the input file (option -i), the output directory (option -o) and the output file name prefix (option -b):

    nanodna -i SRR26162837.fastq.gz -o example -b Ecoli.I -n

Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies). Below is the one observed when using nanodna together with the program and tool versions listed in example/program.versions.txt (note that different outputs and statistics can be obtained, as the de novo assembly tool Flye is not fully deterministic when used on multiple threads):

    # nanodna v25.12
    # Copyright (C) 2025 Institut Pasteur
    + https://gitlab.pasteur.fr/GIPhy/nanodna
    > Syst: x86_64-redhat-linux-gnu
    > Bash: 4.4.20(1)-release
    > NTHREADS=12
    > NFILE=1
    + OUTDIR=zz
    + TMPDIR=/tmp/nanodna.Ecoli.I.7HhnkcU7F
    [00:00] reading input file ... [ok]
    + [748.4 Mb] SRR26162837.fastq.gz
    [00:09] approximating genome size ... [ok]
    > 5026618 bps
    [00:21] input file processing ... [ok]
    + Nreads=111542 Nbases=795978646 N90=3121 %Q20=80.73 cov=158x
    [00:37] data filtering ...... [ok]
    + Nreads=110689 Nbases=787409761 N90=3122 %Q20=81.17 cov=156x
    [01:32] digital normalization ...... [ok]
    + Nreads=36298 Nbases=492923768 N90=7774 %Q20=82.22 cov=98x
    [03:50] checking small plasmid(s) ....... [ok]
    > Nseq=4 Nres=12212 Ncirc=4 %GC=49.4
    [04:24] de novo assembly ... [ok]
    > Nseq=3 Nres=5043921 Ncirc=3 %GC=50.8
    [16:35] assembly processing ...... [ok]
    > Nseq=7 Nres=5056122 Namb=213 %weak=3.78 accuracy=0.7724
    [21:58] output files:
    + genome assembly (initial): zz/Ecoli.I.0.fasta
    + coverage profile (enhanced): zz/Ecoli.I.1.cov.info.txt
    + genome assembly (enhanced): zz/Ecoli.I.1.fasta
    + sequence stats (enhanced): zz/Ecoli.I.1.seq.info.txt
    + ambiguous positions: zz/Ecoli.I.1.amb.info.txt
    [22:04] exit

This log output shows that the complete analysis was performed on 12
threads in less than 25 minutes. Based on the approximated genome length
(i.e. 5.03 Mbps), the initial raw read set corresponds to an average
coverage depth of 158×. The filtering and digital normalization steps
led to a reduced subset of 36,298 high-quality long reads, while
increasing the Q20 metric from 80.73% to 82.22% and the N90 one from
3,121 to 7,774. Of note, as this datafile was considered as already
trimmed (option -n), the inferred alien sequences are only
low residue content ones (see
example/Ecoli.I.aliens.fasta).
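
The coverage depths reported in the log can be recomputed from the Nbases values and the approximated genome size. The sketch below assumes that the reported value is the integer part of Nbases divided by the genome size; this assumption reproduces the three cov values of the log, but it is not taken from the nanodna source code.

```shell
# Average coverage depth = sequenced bases / genome size
# (integer part only; this truncation is an assumption, see above)
coverage() { echo $(( $1 / $2 )); }

GSIZE=5026618                     # approximated genome size (bps) from the log
coverage 795978646 "$GSIZE"       # raw reads:                   prints 158
coverage 787409761 "$GSIZE"       # after filtering:             prints 156
coverage 492923768 "$GSIZE"       # after digital normalization: prints 98
```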
The dedicated assembly of small circular contigs (i.e. < 8000 bps) led to four plasmid sequences in ~1 minute (minidna), whereas the other three large contigs were assembled in ~12 minutes (Flye). Each of the seven assembled sequences is circular, with lengths comparable to the expected ones, i.e. 4,914,387, 76,261, 53,262, 5,167, 4,074, 1,538 and 1,433 bps for SEQ0001-SEQ0007, respectively (see example/Ecoli.I.1.seq.info.txt), whereas the (hybrid) public assembly consists of one chromosome (NZ_CP135458; 4,914,339 bps) and six plasmids: NZ_CP135459 (76,262 bps), NZ_CP135460 (53,262 bps), NZ_CP135461 (5,167 bps), NZ_CP135462 (4,074 bps), NZ_CP135463 (1,538 bps) and NZ_CP135464 (1,433 bps). Reoriented SEQ0001 starts with a dnaA gene, showing that this contig corresponds to a chromosome, whereas SEQ0002-SEQ0007 start with repA genes, indicating that each of them corresponds to a plasmid.
The assembly contains 213 multi-allelic positions (mainly in the chromosome SEQ0001; see example/Ecoli.I.1.amb.info.txt) and 3.78% of “weak” positions, i.e. assembled positions (written in lower case in example/Ecoli.I.1.fasta) that are significantly undercovered (i.e. < 70×) by high-quality sequenced bases. It is worth noting that the genome coverage profile is slightly negatively skewed (see example/Ecoli.I.1.cov.info.txt), confirming that some regions of the genome are undercovered by the sequencing data. This yields an overall accuracy of 0.7724 (obtained by multiplying the momo and R2 quality metrics provided by the tool CoPro; see example/Ecoli.I.1.cov.info.txt), which remains very acceptable.
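
Since weak positions are written in lower case, their proportion can be estimated directly from a FASTA file. The sketch below is a hypothetical helper (not part of nanodna) demonstrated on a toy file; the example assembly would first need to be decompressed with gzip, and the result may differ slightly from the %weak value of the log depending on how nanodna computes it.

```shell
# Hypothetical helper (not part of nanodna): percentage of lower-case residues
# among all residues of a FASTA file (header lines are ignored).
pct_weak() {
  total=$(grep -v '^>' "$1" | tr -d '\n' | wc -c)
  weak=$(grep -v '^>' "$1" | tr -cd '[:lower:]' | wc -c)
  awk -v w="$weak" -v t="$total" 'BEGIN { if (t > 0) printf "%.2f\n", 100 * w / t }'
}

# Toy example: 20 residues, 4 of them in lower case
toy=$(mktemp)
printf '>SEQ0001\nACGTacgtAC\nGTACGTACGT\n' > "$toy"
pct_weak "$toy"                   # prints 20.00
rm -f "$toy"
```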
Bouras G, Grigson SR, Papudeshi B, Mallawaarachchi V, Roach MJ (2024) Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93):5968. doi:10.21105/joss.05968
Criscuolo A, Brisse S (2013) AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics, 102(5-6):500-506. doi:10.1016/j.ygeno.2013.07.011
Kolmogorov M, Yuan J, Lin Y, Pevzner P (2019) Assembly of Long Error-Prone Reads Using Repeat Graphs. Nature Biotechnology, 37:540-546. doi:10.1038/s41587-019-0072-8
Legrand V, Kergrohen T, Joly N, Criscuolo A (2022) ROCK: digital normalization of whole genome sequencing data. Journal of Open Source Software, 7(73):3790. doi:10.21105/joss.03790
Lerminiaux N, Fakharuddin K, Mulvey MR, Mataseje L (2024) Do we still need Illumina sequencing data? Evaluating Oxford Nanopore Technologies R10.4.1 flow cells and the Rapid v14 library prep kit for Gram negative bacteria whole genome assemblies. Canadian Journal of Microbiology, 70(5):178-189. doi:10.1139/cjm-2023-0175
Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191
Li H (2021) New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705
Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, Pevzner P (2016) Assembly of Long Error-Prone Reads Using de Bruijn Graphs. Proceedings of the National Academy of Sciences of the United States of America, 113(52):E8396-E8405. doi:10.1073/pnas.1604560113