nanodna

nanodna (nanopore de novo assembly) is a command line tool written in Bash to ease the de novo assembly of prokaryote genomes from high-quality nanopore sequencing reads (e.g. Q20 > 80%).

Every pre- and post-processing step is managed by nanodna (e.g. read filtering, de novo assembly, assembled sequence enhancement). Its main purpose is to efficiently use different methods, programs and tools to quickly infer accurate genome assemblies (e.g. from 10 to 30 minutes using 12 threads). This mini-workflow can therefore be very useful to deal with large batches of whole-genome shotgun sequencing data.

nanodna runs on UNIX, Linux and most OS X operating systems.

Dependencies

You will need to install the required programs and tools listed in the following tables, or to verify that they are already installed with the required version.

Mandatory programs

program	package	version	sources
AlienDiscover	-	≥ 0.3	gitlab.pasteur.fr/GIPhy/AlienDiscover
AlienTrimmer	-	≥ 3.1	gitlab.pasteur.fr/GIPhy/AlienTrimmer
CoPro	-	≥ 0.3	gitlab.pasteur.fr/GIPhy/CoPro
dnaapler	-	≥ 1.3.0	github.com/gbouras13/dnaapler
Flye	-	≥ 2.9.6	github.com/mikolmogorov/Flye
FQsum	-	≥ 0.5	gitlab.pasteur.fr/GIPhy/FQsum
minidna	-	≥ 25.12	gitlab.pasteur.fr/GIPhy/minidna
minimap2	-	≥ 2.30	github.com/lh3/minimap2
ntCard	-	≥ 1.2.2	github.com/bcgsc/ntCard
ROCK	-	≥ 3.0	gitlab.pasteur.fr/vlegrand/ROCK

Optional programs

program	package	version	sources
bzip2	-	> 1.0.0	sourceware.org/bzip2/downloads.html
DSRC	-	≥ 2.0	github.com/refresh-bio/DSRC
gzip	-	> 1.5.0	ftp.gnu.org/gnu/gzip
pigz	-	> 2.7	github.com/madler/pigz

Standard GNU packages and utilities

program	package	version	sources
cat cp du echo mkdir mktemp mv rm paste sort tr wc	coreutils^★	> 8.0	ftp.gnu.org/gnu/coreutils
gawk	-	> 4.0.0	ftp.gnu.org/gnu/gawk
grep	-	> 2.0	ftp.gnu.org/gnu/grep
sed	-	> 4.2	ftp.gnu.org/gnu/sed

^★ On OS X, the package coreutils can be installed via Homebrew (e.g. brew install coreutils).

Installation and execution

A. Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/nanodna.git

B. Go to the created directory and give the execute permission to the file nanodna.sh:

cd nanodna/
chmod +x nanodna.sh

C. Check the dependencies (and their version) using the following command line:

./nanodna.sh  -d

If at least one of the mandatory programs (see Dependencies) is not available in your $PATH variable (or if one compiled binary has a different default name), it should be manually specified. To specify the location of a specific binary, edit the file nanodna.sh and indicate the local path to the corresponding binary(ies) within the code block DEPENDENCIES (approximately lines 110-180). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block REQUIREMENTS.

program	variable assignment	program	variable assignment
AlienDiscover	`ALIENDISCOVER_BIN=AlienDiscover;`	gawk	`GAWK_BIN=gawk;`
AlienTrimmer	`ALIENTRIMMER_BIN=AlienTrimmer;`	gzip^★	`GZIP_BIN=gzip;`
bzip2^★	`BZIP2_BIN=bzip2;`	minidna	`MINIDNA_BIN=minidna;`
CoPro	`COPRO_BIN=CoPro;`	minimap2	`MINIMAP2_BIN=minimap2;`
dnaapler	`DNAAPLER_BIN=dnaapler;`	ntCard	`NTCARD_BIN=ntcard;`
DSRC^★	`DSRC_BIN=dsrc;`	pigz^★	`PIGZ_BIN=pigz;`
Flye	`FLYE_BIN=flye;`	ROCK	`ROCK_BIN=rock;`
FQsum	`FQSUM_BIN=FQsum;`

^★ Optional programs are not required to be installed/accessible for nanodna to run.

Note that depending on the installation of some required programs, the corresponding variable can be assigned with complex commands. For example, as FQsum is a Java tool that can be (quickly) run using a Java virtual machine, the executable jar file FQsum.jar can be used by nanodna by editing the corresponding variable assignment instruction as follows: FQSUM_BIN="java -jar $path_to/FQsum.jar".
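
Before editing these assignments, it can help to confirm which commands actually resolve in your $PATH. The following is a minimal sketch, not the check performed by nanodna -d: the function name check_bin is hypothetical, and it only tests the first word of the command, so that a multi-word value such as "java -jar $path_to/FQsum.jar" is checked through its java part.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nanodna): report whether the first word of
# a *_BIN-style value resolves to an executable command in $PATH.
check_bin() {
  local cmd="$1"
  local first="${cmd%% *}"   # e.g. "java" from "java -jar /some/path/FQsum.jar"
  if command -v "$first" >/dev/null 2>&1; then
    echo "[ok]   $cmd"
  else
    echo "[fail] $cmd"
    return 1
  fi
}

# A few default values taken from the table above; extend the list as needed.
for bin in "gawk" "minimap2" "flye"; do
  check_bin "$bin" || true
done
```

Any command reported as [fail] is a candidate for a manual path assignment in nanodna.sh.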

D. Execute nanodna with the following command line model:

./nanodna.sh  [options]

Usage

Run nanodna without any option to read the following documentation:

 Processing and assembling high-quality nanopore sequencing reads (Q20>80%)
 https://gitlab.pasteur.fr/GIPhy/nanodna

 USAGE:  nanodna  [options]

 OPTIONS:
  -i <infile>  FASTA/FASTQ input file name(s);  multiple files can be specified,  separated by
               commas; FASTA/FASTQ file(s) can be uncompressed (file extensions: fa or fasta /
               fq or fastq); FASTQ file(s) can be compressed using either gzip (gz), bzip2 (bz
               or bz2), or DSRC (dsrc or dsrc2)
  -o <outdir>  name of the output directory (mandatory option)
  -b <string>  base name for output files (mandatory option)
  -n           already trimmed input data (default: not set)
  -c <int>     digital normalization coverage depth (default: 80)
  -f           force (slower) deterministic assembly
  -t <int>     thread numbers (default: 12)
  -w <dir>     path to the tmp directory (default: $TMPDIR, otherwise /tmp)
  -x           to not remove (after completing) the tmp directory (default: not set)
  -d           checks dependencies and exit
  -h           prints this help and exit
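
The extension rules stated for option -i can be illustrated with a short sketch. It is an assumption-laden illustration rather than the code used by nanodna: the function names required_tool and list_inputs are hypothetical, and the mapping relies only on the file extensions listed in the help text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the tool needed to read a file, judging only
# from its extension ("none" for uncompressed FASTA/FASTQ files).
required_tool() {
  case "$1" in
    *.fa|*.fasta|*.fq|*.fastq) echo "none" ;;
    *.gz)                      echo "gzip" ;;
    *.bz|*.bz2)                echo "bzip2" ;;
    *.dsrc|*.dsrc2)            echo "dsrc" ;;
    *)                         echo "unsupported"; return 1 ;;
  esac
}

# Hypothetical helper: split a comma-separated -i value and print one
# "file<TAB>tool" line per input file; fails when more than 15 files are given.
list_inputs() {
  local IFS=','
  local n=0 f
  for f in $1; do
    n=$((n + 1))
    printf '%s\t%s\n' "$f" "$(required_tool "$f")"
  done
  [ "$n" -le 15 ]
}

list_inputs "run1.fastq.gz,run2.fq,run3.fastq.dsrc2"
```

The file names in the last line are placeholders; the files do not need to exist, since only their names are inspected.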

Notes

nanodna can consider up to 15 input files (mandatory option -i). Input files should be in FASTA (file extensions .fa or .fasta) or FASTQ format (file extensions .fq or .fastq). FASTQ files can also be compressed using several tools, provided that the file extension matches the expected one (i.e. bzip2: .bz or .bz2; DSRC: .dsrc or .dsrc2; gzip: .gz) and that the corresponding binary is installed.
Read clipping/trimming, as well as quality- and length-based filtering, are automatically performed using AlienTrimmer (Criscuolo & Brisse 2013). However, it is recommended to set the option -n when technical oligonucleotides were already clipped.
One important step of nanodna is the digital normalization of the filtered reads. It selects the longest and the least erroneous reads using ROCK (Legrand et al. 2022) in order to build a subset that ensures a reduced and uniform coverage depth across every position of the sequenced genome. By default, the expected coverage depth is set to 80× to obtain a good trade-off between high accuracy and short running times; however, this default coverage depth can be modified using option -c.
A genome assembly is next inferred using Flye (Lin et al. 2016, Kolmogorov et al. 2019). This initial assembly is then processed to (i) cut small circular sequences that are fully duplicated within a single contig, (ii) reorient contigs using dnaapler (Bouras et al. 2024), and (iii) polish the modified contigs using Flye. In parallel, small circular contigs (i.e. < 8000 bps; if any) are independently assembled using minidna. These different steps are finally followed by the computation of the genome coverage profile using CoPro from minimap2 alignments (Li 2018, 2021) in order to assess the overall accuracy of the assembled sequences and to detect their weaknesses (e.g. under-covered regions, multi-allelic positions). The required running times of every step can be reduced by using multiple threads (option -t; default: 12) or by setting a temporary directory (option -w) located on a fast solid-state drive (SSD).
Output files are all defined by the same specified prefix (mandatory option -b) and written into a specified output directory (mandatory option -o). Each output file content is determined by its file extension:

output file	file content
<prefix>.aliens.fasta	technical/artefactual oligonucleotides inferred using AlienDiscover in FASTA format
<prefix>.0.fasta	initial genome assembly inferred using Flye in FASTA format
<prefix>.1.cov.info.txt	genome coverage profile estimated by CoPro from minimap2 read alignments
<prefix>.1.fasta	enhanced genome assembly in FASTA format
<prefix>.1.seq.info.txt	descriptive statistics of the sequence(s) in <prefix>.1.fasta
<prefix>.1.amb.info.txt	ambiguous (multi-allelic) positions in <prefix>.1.fasta with approximated base compositions

Example

In order to illustrate the usefulness of nanodna and to better describe its output files, the following example describes its usage for assembling the genome of Escherichia coli isolate I (Lerminiaux et al. 2024).

All output files are available in the directory example/ (genome sequence files are compressed using gzip), as well as the version of every used tool and program (see program.versions.txt).
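
As a quick sanity check after a run, the file names listed in the output file table of the Notes can be derived from the output directory and the prefix. The sketch below assumes that all six files are expected for every run; the function names expected_outputs and missing_outputs are hypothetical and not part of nanodna.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the six output file paths expected for a given
# output directory (option -o) and prefix (option -b).
expected_outputs() {
  local outdir="$1" prefix="$2" ext
  for ext in aliens.fasta 0.fasta 1.cov.info.txt 1.fasta 1.seq.info.txt 1.amb.info.txt; do
    echo "$outdir/$prefix.$ext"
  done
}

# Hypothetical helper: print only the expected files that do not exist.
missing_outputs() {
  local f
  expected_outputs "$1" "$2" | while IFS= read -r f; do
    [ -e "$f" ] || echo "$f"
  done
}

# Example call with the directory and prefix used in this section.
missing_outputs example Ecoli.I
```

Note that the genome sequence files in example/ are stored gzip-compressed, so they would be reported as missing there unless decompressed first.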

Downloading input file

Long-read sequencing of this genome was performed using a MinION apparatus, and the resulting (compressed) FASTQ file (750 Mb) can be downloaded using the following command lines:

EBIURL="https://ftp.sra.ebi.ac.uk/vol1/fastq";
wget -O SRR26162837.fastq.gz $EBIURL/SRR261/037/SRR26162837/SRR26162837_1.fastq.gz ;

Running nanodna

Below is a typical command line to run nanodna (default: 12 threads), specifying the input file (option -i), the output directory (option -o) and the output file name prefix (option -b), and indicating that the input reads are already trimmed (option -n):

nanodna  -i SRR26162837.fastq.gz  -o example  -b Ecoli.I  -n
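
For batches of samples, one such command line can be generated per input file. The following is a minimal sketch under stated assumptions: the function name batch_commands and the directory names are hypothetical, each input is assumed to be a single .fastq.gz file, and the commands are only printed so that they can be reviewed before being run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) one nanodna command per *.fastq.gz file
# found in a directory, using the file base name as the output prefix.
batch_commands() {
  local indir="$1" outdir="$2" f base
  for f in "$indir"/*.fastq.gz; do
    [ -e "$f" ] || continue              # skip when the glob matches nothing
    base="$(basename "$f" .fastq.gz)"
    echo "./nanodna.sh -i $f -o $outdir/$base -b $base -t 12"
  done
}

# Placeholder directories; prints nothing when ./reads holds no .fastq.gz file.
batch_commands ./reads ./assemblies
```

Piping the printed lines to a shell (or to a job scheduler) runs them; add -n to each command when the reads are already trimmed.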

Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies). Below is the one observed when using nanodna together with program and tool versions listed in example/program.versions.txt (note that different outputs and statistics can be obtained as the de novo assembly tool Flye is not fully deterministic when used on multiple threads):

# nanodna v25.12
# Copyright (C) 2025 Institut Pasteur
+ https://gitlab.pasteur.fr/GIPhy/nanodna
> Syst: x86_64-redhat-linux-gnu
> Bash: 4.4.20(1)-release
> NTHREADS=12
> NFILE=1
+ OUTDIR=zz
+ TMPDIR=/tmp/nanodna.Ecoli.I.7HhnkcU7F
[00:00] reading input file ... [ok]
+ [748.4 Mb]  SRR26162837.fastq.gz
[00:09] approximating genome size ... [ok]
> 5026618 bps
[00:21] input file processing ... [ok]
+ Nreads=111542  Nbases=795978646  N90=3121  %Q20=80.73  cov=158x
[00:37] data filtering ...... [ok]
+ Nreads=110689  Nbases=787409761  N90=3122  %Q20=81.17  cov=156x
[01:32] digital normalization ...... [ok]
+ Nreads=36298  Nbases=492923768  N90=7774  %Q20=82.22  cov=98x
[03:50] checking small plasmid(s) ....... [ok]
> Nseq=4  Nres=12212  Ncirc=4  %GC=49.4
[04:24] de novo assembly ... [ok]
> Nseq=3  Nres=5043921  Ncirc=3  %GC=50.8
[16:35] assembly processing ...... [ok]
> Nseq=7  Nres=5056122  Namb=213  %weak=3.78  accuracy=0.7724
[21:58] output files:
+ genome assembly (initial):   zz/Ecoli.I.0.fasta
+ coverage profile (enhanced): zz/Ecoli.I.1.cov.info.txt
+ genome assembly (enhanced):  zz/Ecoli.I.1.fasta
+ sequence stats (enhanced):   zz/Ecoli.I.1.seq.info.txt
+ ambiguous positions:         zz/Ecoli.I.1.amb.info.txt
[22:04] exit

This log output shows that the complete analysis was performed on 12 threads in less than 25 minutes. Based on the approximated genome length (i.e. 5.03 Mbps), the initial raw read set corresponds to an average coverage depth of 158×. The filtering and digital normalization steps led to a reduced subset of 36,298 high-quality long reads, while increasing the Q20 metric from 80.73% to 82.22% and the N90 metric from 3,121 to 7,774. Of note, as this data file was considered already trimmed (option -n), the inferred alien sequences are only low residue content ones (see example/Ecoli.I.aliens.fasta).
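
The coverage depths reported in the log above can be recomputed from the number of bases and the approximated genome size. The sketch below assumes plain integer division (an assumption, although it reproduces the three logged values); the function name coverage is hypothetical.

```shell
#!/usr/bin/env bash
# Average coverage depth = total number of sequenced bases / genome size.
# Integer (truncating) division is assumed here.
coverage() {
  echo "$(( $1 / $2 ))x"
}

GSIZE=5026618                  # approximated genome size from the log (bps)
coverage 795978646 "$GSIZE"    # raw reads                    -> 158x
coverage 787409761 "$GSIZE"    # after filtering              -> 156x
coverage 492923768 "$GSIZE"    # after digital normalization  -> 98x
```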

The dedicated assembly of small circular contigs (i.e. < 8000 bps) led to four plasmid sequences in ~1 minute (minidna), whereas the other three large contigs were assembled in ~12 minutes (Flye). Each of the seven assembled sequences is circular, with respective lengths comparable to the expected results, i.e. 4,914,387, 76,261, 53,262, 5,167, 4,074, 1,538 and 1,433 bps for SEQ0001-SEQ0007, respectively (see example/Ecoli.I.1.seq.info.txt), whereas the (hybrid) public assembly consists of one chromosome (NZ_CP135458; 4,914,339 bps) and six plasmids: NZ_CP135459 (76,262 bps), NZ_CP135460 (53,262 bps), NZ_CP135461 (5,167 bps), NZ_CP135462 (4,074 bps), NZ_CP135463 (1,538 bps) and NZ_CP135464 (1,433 bps). Reoriented SEQ0001 starts with a dnaA gene, showing that this contig corresponds to a chromosome, whereas SEQ0002-SEQ0007 start with repA genes, indicating that they each correspond to a plasmid.
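
The sequence lengths quoted above are consistent with the Nres values of the log, as the following arithmetic sketch shows (the function name sum_lengths is hypothetical):

```shell
#!/usr/bin/env bash
# Sum a list of sequence lengths (in bps).
sum_lengths() {
  local total=0 len
  for len in "$@"; do
    total=$((total + len))
  done
  echo "$total"
}

sum_lengths 4914387 76261 53262 5167 4074 1538 1433   # all seven sequences: 5056122
sum_lengths 5167 4074 1538 1433                       # four small plasmids: 12212
```

The first total matches Nres=5056122 (assembly processing step) and the second matches Nres=12212 (small plasmid step).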

The assembly contains 213 multi-allelic positions (mainly in the chromosome SEQ0001; see example/Ecoli.I.1.amb.info.txt) and 3.78% of “weak” positions, i.e. assembled positions that are significantly undercovered (i.e. < 70×) by high quality sequenced bases (written in lower case in example/Ecoli.I.1.fasta). It is worth noting that the genome coverage profile is slightly negatively skewed (see example/Ecoli.I.1.cov.info.txt), confirming that some regions of the genome are quite undercovered by the sequencing data. This yields an overall accuracy of 0.7724 (obtained by multiplying the momo and R² quality metrics provided by the tool CoPro; see example/Ecoli.I.1.cov.info.txt), which remains very acceptable.
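
Since weak positions are written in lower case in the enhanced assembly, their proportion can be estimated directly from the FASTA file. The sketch below is an illustration only: the function name weak_percent is hypothetical, it counts every lower-case character on sequence lines, and it has not been checked against the %weak value computed by nanodna.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of lower-case characters among all
# characters of the sequence lines (header lines starting with ">" are skipped).
weak_percent() {
  awk '!/^>/ { total += length($0); weak += gsub(/[a-z]/, "") }
       END   { if (total > 0) printf "%.2f\n", 100 * weak / total
               else           print "0.00" }' "$1"
}

# Usage on an uncompressed copy of the enhanced assembly, e.g.:
#   weak_percent Ecoli.I.1.fasta
```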

References

Bouras G, Grigson SR, Papudeshi B, Mallawaarachchi V, Roach MJ (2024) Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93):5968. doi:10.21105/joss.05968

Criscuolo A, Brisse S (2013) AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics, 102(5-6):500-506. doi:10.1016/j.ygeno.2013.07.011

Kolmogorov M, Yuan J, Lin Y, Pevzner P (2019) Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology, 37:540-546. doi:10.1038/s41587-019-0072-8

Legrand V, Kergrohen T, Joly N, Criscuolo A (2022) ROCK: digital normalization of whole genome sequencing data. Journal of Open Source Software, 7(73):3790. doi:10.21105/joss.03790

Lerminiaux N, Fakharuddin K, Mulvey MR, Mataseje L (2024) Do we still need Illumina sequencing data? Evaluating Oxford Nanopore Technologies R10.4.1 flow cells and the Rapid v14 library prep kit for Gram negative bacteria whole genome assemblies. Canadian Journal of Microbiology, 70(5):178-189. doi:10.1139/cjm-2023-0175

Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191

Li H (2021) New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705

Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, Pevzner P (2016) Assembly of long error-prone reads using de Bruijn graphs. Proceedings of the National Academy of Sciences of the United States of America, 113(52):E8396-E8405. doi:10.1073/pnas.1604560113