Dependencies | Installation and execution | Usage | Notes | Example | References | Citations


GPLv3 license Bash

nanodna

nanodna (nanopore de novo assembly) is a command line tool written in Bash to ease the de novo assembly of prokaryote genomes from high-quality nanopore sequencing reads (e.g. Q20 > 80%).

Every pre- and post-processing step is managed by nanodna (e.g. read filtering, de novo assembly, assembled sequence enhancement). Its main purpose is to efficiently use different methods, programs and tools to quickly infer accurate genome assemblies (e.g. from 10 to 30 minutes using 12 threads). This mini-workflow can therefore be very useful to deal with large batches of whole-genome shotgun sequencing data.

nanodna runs on UNIX, Linux and most OS X operating systems.


Dependencies

You will need to install the required programs and tools listed in the following tables, or to verify that they are already installed with the required version.

Mandatory programs
program package version sources
AlienDiscover - ≥ 0.3 gitlab.pasteur.fr/GIPhy/AlienDiscover
AlienTrimmer - ≥ 3.1 gitlab.pasteur.fr/GIPhy/AlienTrimmer
CoPro - ≥ 0.3 gitlab.pasteur.fr/GIPhy/CoPro
dnaapler - ≥ 1.3.0 github.com/gbouras13/dnaapler
Flye - ≥ 2.9.6 github.com/mikolmogorov/Flye
FQsum - ≥ 0.5 gitlab.pasteur.fr/GIPhy/FQsum
minidna - ≥ 25.12 gitlab.pasteur.fr/GIPhy/minidna
minimap2 - ≥ 2.30 github.com/lh3/minimap2
ntCard - ≥ 1.2.2 github.com/bcgsc/ntCard
ROCK - ≥ 3.0 gitlab.pasteur.fr/vlegrand/ROCK

Optional programs
program package version sources
bzip2 - > 1.0.0 sourceware.org/bzip2/downloads.html
DSRC - ≥ 2.0 github.com/refresh-bio/DSRC
gzip - > 1.5.0 ftp.gnu.org/gnu/gzip
pigz - > 2.7 github.com/madler/pigz

Standard GNU packages and utilities
program package version sources
cat, cp, du, echo, mkdir, mktemp, mv, rm, paste, sort, tr, wc coreutils > 8.0 ftp.gnu.org/gnu/coreutils
gawk - > 4.0.0 ftp.gnu.org/gnu/gawk
grep - > 2.0 ftp.gnu.org/gnu/grep
sed - > 4.2 ftp.gnu.org/gnu/sed

On OS X, the package coreutils can be installed via Homebrew (e.g. brew install coreutils).

Installation and execution

A. Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/nanodna.git

B. Go to the created directory and give the execute permission to the file nanodna.sh:

cd nanodna/
chmod +x nanodna.sh

C. Check the dependencies (and their version) using the following command line:

./nanodna.sh  -d

If at least one of the mandatory programs (see Dependencies) is not available in your $PATH variable (or if one compiled binary has a different default name), it should be manually specified. To specify the location of a specific binary, edit the file nanodna.sh and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS (approximately lines 110-180). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within this code block.

program variable assignment program variable assignment
AlienDiscover ALIENDISCOVER_BIN=AlienDiscover; gawk GAWK_BIN=gawk;
AlienTrimmer ALIENTRIMMER_BIN=AlienTrimmer; gzip GZIP_BIN=gzip;
bzip2 BZIP2_BIN=bzip2; minidna MINIDNA_BIN=minidna;
CoPro COPRO_BIN=CoPro; minimap2 MINIMAP2_BIN=minimap2;
dnaapler DNAAPLER_BIN=dnaapler; ntcard NTCARD_BIN=ntcard;
DSRC DSRC_BIN=dsrc; pigz PIGZ_BIN=pigz;
Flye FLYE_BIN=flye; ROCK ROCK_BIN=rock;
FQsum FQSUM_BIN=FQsum;

Optional programs are not required to be installed/accessible for nanodna to run.

Note that depending on the installation of some required programs, the corresponding variable can be assigned with complex commands. For example, as FQsum is a Java tool that can be (quickly) run using a Java virtual machine, the executable jar file FQsum.jar can be used by nanodna by editing the corresponding variable assignment instruction as follows: FQSUM_BIN="java -jar $path_to/FQsum.jar".
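Before editing nanodna.sh, it can help to see which default binary names resolve in $PATH. The sketch below is a hypothetical helper (check_bins is not part of nanodna, and option -d remains the reference check); it only assumes the default binary names listed in the table above.

```bash
# Hypothetical helper, not part of nanodna: reports which binaries resolve in $PATH.
# Returns 0 only when every name passed as argument is found.
check_bins() {
  local missing=0 bin
  for bin in "$@"; do
    if command -v "$bin" >/dev/null 2>&1; then
      printf '[ok]      %s -> %s\n' "$bin" "$(command -v "$bin")"
    else
      printf '[missing] %s\n' "$bin"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Default names of the mandatory programs, as listed in the table above
check_bins AlienDiscover AlienTrimmer CoPro dnaapler flye FQsum minidna minimap2 ntcard rock \
  || echo "at least one mandatory program needs an explicit path in nanodna.sh"
```

Any name reported as missing is a candidate for an explicit path (or a full command, as in the FQsum example) in the corresponding variable assignment.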

D. Execute nanodna with the following command line model:

./nanodna.sh  [options]

Usage

Run nanodna without any option to read the following documentation:

 Processing and assembling high-quality nanopore sequencing reads (Q20>80%)
 https://gitlab.pasteur.fr/GIPhy/nanodna

 USAGE:  nanodna  [options]

 OPTIONS:
  -i <infile>  FASTA/FASTQ input file name(s);  multiple files can be specified,  separated by
               commas; FASTA/FASTQ file(s) can be uncompressed (file extensions: fa or fasta /
               fq or fastq); FASTQ file(s) can be compressed using either gzip (gz), bzip2 (bz
               or bz2), or DSRC (dsrc or dsrc2)
  -o <outdir>  name of the output directory (mandatory option)
  -b <string>  base name for output files (mandatory option)
  -n           already trimmed input data (default: not set)
  -c <int>     digital normalization coverage depth (default: 80)
  -f           force (slower) deterministic assembly
  -t <int>     thread numbers (default: 12)
  -w <dir>     path to the tmp directory (default: $TMPDIR, otherwise /tmp)
  -x           to not remove (after completing) the tmp directory (default: not set)
  -d           checks dependencies and exit
  -h           prints this help and exit
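Option -i expects several input files as a single comma-separated string. The sketch below shows one way to build that string from a list of file names; join_by_comma and the file names run1/run2/run3 are hypothetical, and the final command is only printed, not executed.

```bash
# Hypothetical helper: joins its arguments with commas.
# "$*" joins the arguments using the first character of IFS.
join_by_comma() {
  local IFS=,
  printf '%s\n' "$*"
}

# Hypothetical input file names; replace with your own FASTQ files
infiles=$(join_by_comma run1.fastq.gz run2.fastq.gz run3.fastq.gz)

# Print the resulting command line instead of running it
echo "./nanodna.sh  -i $infiles  -o outdir  -b sample"
```

Since the comma is the separator, file names that themselves contain a comma cannot be passed this way.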

Notes

output file file content
<prefix>.aliens.fasta technical/artefactual oligonucleotides inferred using AlienDiscover in FASTA format
<prefix>.0.fasta initial genome assembly inferred using Flye in FASTA format
<prefix>.1.cov.info.txt genome coverage profile estimated by CoPro from minimap2 read alignments
<prefix>.1.fasta enhanced genome assembly in FASTA format
<prefix>.1.seq.info.txt descriptive statistics of the sequence(s) in <prefix>.1.fasta
<prefix>.1.amb.info.txt ambiguous (multi-allelic) positions in <prefix>.1.fasta with approximated base compositions
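After a run, the presence of these files can be checked with a few lines of Bash. This is a minimal sketch, assuming that <prefix> is the base name given with option -b and that all six files are written into the output directory given with option -o; expected_outputs is a hypothetical helper, not part of nanodna.

```bash
# Hypothetical helper: prints the six output file names listed in the table above
# for a given output directory (option -o) and base name (option -b).
expected_outputs() {
  local outdir="$1" base="$2" suffix
  for suffix in aliens.fasta 0.fasta 1.cov.info.txt 1.fasta 1.seq.info.txt 1.amb.info.txt; do
    printf '%s/%s.%s\n' "$outdir" "$base" "$suffix"
  done
}

# Report which expected files exist and are non-empty
expected_outputs example Ecoli.I | while IFS= read -r f; do
  if [ -s "$f" ]; then echo "[ok]      $f"; else echo "[missing] $f"; fi
done
```

Files that were compressed afterwards (such as the gzip-compressed sequence files in example/) would be reported as missing by this check.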

Example

In order to illustrate the usefulness of nanodna and to better describe its output files, the following example describes its usage for assembling the genome of Escherichia coli isolate I (Lerminiaux et al. 2024).

All output files are available in the directory example/ (genome sequence files are compressed using gzip), as well as the version of every used tool and program (see program.versions.txt).

Downloading input file

Long-read sequencing of this genome was performed using a MinION apparatus, and the resulting (compressed) FASTQ file (750 Mb) can be downloaded using the following command lines:

EBIURL="https://ftp.sra.ebi.ac.uk/vol1/fastq";
wget -O SRR26162837.fastq.gz $EBIURL/SRR261/037/SRR26162837/SRR26162837_1.fastq.gz ;

Running nanodna

Below is a typical command line to run nanodna (default: 12 threads), specifying the input file (option -i), the output directory (option -o) and the output file name prefix (option -b), and indicating that the input reads are already trimmed (option -n):

nanodna  -i SRR26162837.fastq.gz  -o example  -b Ecoli.I  -n

Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies). Below is the one observed when using nanodna together with the program and tool versions listed in example/program.versions.txt (note that different outputs and statistics can be obtained, as the de novo assembly tool Flye is not fully deterministic when used on multiple threads):

# nanodna v25.12
# Copyright (C) 2025 Institut Pasteur
+ https://gitlab.pasteur.fr/GIPhy/nanodna
> Syst: x86_64-redhat-linux-gnu
> Bash: 4.4.20(1)-release
> NTHREADS=12
> NFILE=1
+ OUTDIR=zz
+ TMPDIR=/tmp/nanodna.Ecoli.I.7HhnkcU7F
[00:00] reading input file ... [ok]
+ [748.4 Mb]  SRR26162837.fastq.gz
[00:09] approximating genome size ... [ok]
> 5026618 bps
[00:21] input file processing ... [ok]
+ Nreads=111542  Nbases=795978646  N90=3121  %Q20=80.73  cov=158x
[00:37] data filtering ...... [ok]
+ Nreads=110689  Nbases=787409761  N90=3122  %Q20=81.17  cov=156x
[01:32] digital normalization ...... [ok]
+ Nreads=36298  Nbases=492923768  N90=7774  %Q20=82.22  cov=98x
[03:50] checking small plasmid(s) ....... [ok]
> Nseq=4  Nres=12212  Ncirc=4  %GC=49.4
[04:24] de novo assembly ... [ok]
> Nseq=3  Nres=5043921  Ncirc=3  %GC=50.8
[16:35] assembly processing ...... [ok]
> Nseq=7  Nres=5056122  Namb=213  %weak=3.78  accuracy=0.7724
[21:58] output files:
+ genome assembly (initial):   zz/Ecoli.I.0.fasta
+ coverage profile (enhanced): zz/Ecoli.I.1.cov.info.txt
+ genome assembly (enhanced):  zz/Ecoli.I.1.fasta
+ sequence stats (enhanced):   zz/Ecoli.I.1.seq.info.txt
+ ambiguous positions:         zz/Ecoli.I.1.amb.info.txt
[22:04] exit

This log output shows that the complete analysis was performed on 12 threads in less than 25 minutes. Based on the approximated genome length (i.e. 5.03 Mbps), the initial raw read set corresponds to an average coverage depth of 158×. The filtering and digital normalization steps led to a reduced subset of 36,298 high-quality long reads, while increasing the Q20 metric from 80.73% to 82.22% and the N90 metric from 3,121 to 7,774. Of note, as this data file was treated as already trimmed (option -n), the inferred alien sequences are only low residue content ones (see example/Ecoli.I.aliens.fasta).
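The coverage depths reported in the log can be recomputed as the number of sequenced bases divided by the approximated genome size. The sketch below uses the values copied from the log above; cov_depth is a hypothetical helper, and the use of truncating integer division is an assumption that happens to reproduce the three logged values (158x, 156x and 98x).

```bash
# Hypothetical helper: average coverage depth = Nbases / genome size.
# Integer (truncating) division is an assumption, not nanodna's documented rounding.
cov_depth() {
  echo $(( $1 / $2 ))
}

genome_size=5026618   # approximated genome size from the log (bps)

# Nbases after input processing, data filtering and digital normalization
for nbases in 795978646 787409761 492923768; do
  echo "Nbases=$nbases  cov=$(cov_depth "$nbases" "$genome_size")x"
done
```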

The dedicated assembly of small circular contigs (i.e. < 8000 bps) led to four plasmid sequences in ~1 minute (minidna), whereas the other three large contigs were assembled in ~12 minutes (Flye). Each of the seven assembled sequences is circular, with lengths comparable to the expected ones, i.e. 4,914,387, 76,261, 53,262, 5,167, 4,074, 1,538 and 1,433 bps for SEQ0001-SEQ0007, respectively (see example/Ecoli.I.1.seq.info.txt). For comparison, the (hybrid) public assembly consists of one chromosome (NZ_CP135458; 4,914,339 bps) and six plasmids: NZ_CP135459 (76,262 bps), NZ_CP135460 (53,262 bps), NZ_CP135461 (5,167 bps), NZ_CP135462 (4,074 bps), NZ_CP135463 (1,538 bps) and NZ_CP135464 (1,433 bps). Reoriented SEQ0001 starts with a dnaA gene, showing that this contig corresponds to a chromosome, whereas SEQ0002-SEQ0007 start with repA genes, indicating that each of them corresponds to a plasmid.

The assembly contains 213 multi-allelic positions (mainly in the chromosome SEQ0001; see example/Ecoli.I.1.amb.info.txt) and 3.78% of “weak” positions, i.e. assembled positions that are significantly undercovered (< 70×) by high-quality sequenced bases; these positions are written in lower case in example/Ecoli.I.1.fasta. It is worth noting that the genome coverage profile is slightly negatively skewed (see example/Ecoli.I.1.cov.info.txt), confirming that some regions of the genome are quite undercovered by the sequencing data. This yields an overall accuracy of 0.7724 (obtained by multiplying the momo and R2 quality metrics provided by the tool CoPro; see example/Ecoli.I.1.cov.info.txt), which remains very acceptable.
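Because weak positions are written in lower case in the enhanced assembly, their proportion can be estimated directly from a FASTA file. The following is a minimal sketch, assuming that every lower-case residue is a weak position; weak_percent is a hypothetical helper, and the value reported by nanodna may be computed differently. The demonstration uses a tiny made-up sequence rather than the example files.

```bash
# Hypothetical helper: percentage of lower-case residues in FASTA input
# (file given as argument, or standard input when no argument is given).
weak_percent() {
  LC_ALL=C awk '!/^>/ { total += length($0); weak += gsub(/[a-z]/, "") }
                END   { if (total > 0) printf "%.2f\n", 100 * weak / total; else print "NA" }' "$@"
}

# Tiny made-up FASTA file: 4 lower-case residues out of 10
demo=$(mktemp)
printf '>demo\nACGTacgtAC\n' > "$demo"
weak_percent "$demo"   # prints 40.00
rm -f "$demo"
```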

References

Bouras G, Grigson SR, Papudeshi B, Mallawaarachchi V, Roach MJ (2024) Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93):5968. doi:10.21105/joss.05968

Criscuolo A, Brisse S (2013) AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics, 102(5-6):500-506. doi:10.1016/j.ygeno.2013.07.011

Kolmogorov M, Yuan J, Lin Y, Pevzner P (2019) Assembly of Long Error-Prone Reads Using Repeat Graphs. Nature Biotechnology, 37:540-546. doi:10.1038/s41587-019-0072-8

Legrand V, Kergrohen T, Joly N, Criscuolo A (2022) ROCK: digital normalization of whole genome sequencing data. Journal of Open Source Software, 7(73):3790. doi:10.21105/joss.03790

Lerminiaux N, Fakharuddin K, Mulvey MR, Mataseje L (2024) Do we still need Illumina sequencing data? Evaluating Oxford Nanopore Technologies R10.4.1 flow cells and the Rapid v14 library prep kit for Gram negative bacteria whole genome assemblies. Canadian Journal of Microbiology, 70(5):178-189. doi:10.1139/cjm-2023-0175

Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191

Li H (2021) New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705

Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, Pevzner P (2016) Assembly of Long Error-Prone Reads Using de Bruijn Graphs. Proceedings of the National Academy of Sciences of the United States of America, 113(52):E8396-E8405. doi:10.1073/pnas.1604560113