minidna

minidna (mini-contig de novo assembly) is a command line tool written in Bash to quickly infer small circular contigs (e.g. < 10,000 bps) from long-read sequencing data.

In brief, minidna first searches for subset(s) of circular reads deriving from the same contig, and then infers the circular contig corresponding to each long-read subset. This simple approach runs fast and requires few dependencies, making it a useful alternative to other tools such as Plassembler (Bouras et al. 2023). minidna can therefore be used alongside any long-read-only and/or long-read-first assemblers to compensate for their potential failure to recover small plasmids (e.g. Johnson et al. 2023).

minidna runs on UNIX, Linux and most OS X operating systems.

Dependencies

You will need to install the required programs and tools listed in the following tables, or to verify that they are already installed with the required version.

Mandatory programs

program	package	version	sources
minimap2	-	≥ 2.28	github.com/lh3/minimap2

Optional programs

program	package	version	sources
bzip2	-	> 1.0.0	sourceware.org/bzip2/downloads.html
DSRC	-	≥ 2.0	github.com/refresh-bio/DSRC
gzip	-	> 1.5.0	ftp.gnu.org/gnu/gzip
pigz	-	> 2.7	github.com/madler/pigz
dnaapler	-	≥ 1.3.0	github.com/gbouras13/dnaapler
Flye	-	≥ 2.9.6	github.com/mikolmogorov/Flye

Standard GNU packages and utilities

program	package	version	sources
cat cp du echo mkdir mktemp mv od rm paste sort tr wc	coreutils^★	> 8.0	ftp.gnu.org/gnu/coreutils
gawk	-	> 4.0.0	ftp.gnu.org/gnu/gawk
grep	-	> 2.0	ftp.gnu.org/gnu/grep
sed	-	> 4.2	ftp.gnu.org/gnu/sed

^★ On OS X, the package coreutils can be installed via Homebrew (e.g. brew install coreutils).
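
A minimal sketch for checking an installed version against a minimum listed in the tables above. The helper name `version_ge` is hypothetical (it is not part of minidna). The comparison assumes numeric, dot-separated version fields and ignores any suffix, such as the `-r1209` part of a minimap2 version string.

```bash
# Hypothetical helper (not part of minidna): succeeds when version $1 >= version $2.
version_ge() (
  # Keep only the leading dotted-numeric part, e.g. "2.28-r1209" -> "2.28".
  have=$(printf '%s\n' "$1" | sed 's/^[^0-9]*//; s/[^0-9.].*$//')
  need=$(printf '%s\n' "$2" | sed 's/^[^0-9]*//; s/[^0-9.].*$//')
  awk -v a="$have" -v b="$need" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    # Compare field by field; missing fields count as 0.
    for (i = 1; i <= n; i++) {
      xi = (i <= na) ? x[i] + 0 : 0
      yi = (i <= nb) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
)

# Example (requires minimap2 in $PATH):
#   version_ge "$(minimap2 --version)" 2.28 && echo "minimap2 is recent enough"
```

Fields are compared as integers, so 2.9 is correctly treated as older than 2.28.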

Installation and execution

A. Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/minidna.git

B. Go to the created directory and give the execute permission to the file minidna.sh:

cd minidna/
chmod +x minidna.sh

C. Check the dependencies (and their version) using the following command line:

./minidna.sh  -c

If at least one of the mandatory programs (see Dependencies) is not available in your $PATH variable (or if one compiled binary has a different default name), it should be manually specified. To specify the location of a specific binary, edit the file minidna.sh and indicate the local path to the corresponding binary(ies) within the code block DEPENDENCIES (approximately lines 110-180). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block REQUIREMENTS.

program	variable assignment	program	variable assignment
bzip2^★	`BZIP2_BIN=bzip2;`	gzip^★	`GZIP_BIN=gzip;`
dnaapler^★	`DNAAPLER_BIN=dnaapler;`	gawk	`GAWK_BIN=gawk;`
DSRC^★	`DSRC_BIN=dsrc;`	pigz^★	`PIGZ_BIN=pigz;`
Flye^★	`FLYE_BIN=flye;`	minimap2	`MINIMAP2_BIN=minimap2;`

^★ Optional programs are not required to be installed/accessible for minidna to run.

D. Execute minidna with the following command line model:

./minidna.sh  [options]

Usage

Run minidna without any option to read the following documentation:

 Inferring small circular contigs (< 10,000 bps) from long-read FASTQ files
 https://gitlab.pasteur.fr/GIPhy/minidna

 USAGE:  minidna  -i <fastq>[,<fastq>,...]  -o <outdir>  -b <basename>  [options]

 OPTIONS:
  -i <file>    FASTA/FASTQ input file name(s);  multiple files can be specified,  separated by
               commas; FASTA/FASTQ file(s) can be uncompressed (file extensions: fa or fasta /
               fq or fastq); FASTQ file(s) can be compressed using either gzip (gz), bzip2 (bz
               or bz2), or DSRC (dsrc or dsrc2)
  -o <dir>     name of the output directory (mandatory option)
  -b <string>  base name for output files (mandatory option)
  -l <int>     maximum contig length (default: 10000)
  -m <int>     minimum coverage depth per contig (default: 30)
  -d <int>     delta sensitivity parameter (default: 50)
  -s           subsample large read subsets (default: not set)
  -e           extensive inference (default: not set)
  -p           polish contigs (default: not set)
  -r           reorient contigs (default: not set)
  -f           output self-circularized reads for each contig (default: not set)
  -t <int>     thread numbers (default: 12)
  -w <dir>     path to the tmp directory (default: $TMPDIR, otherwise /tmp)
  -c           checks dependencies and exit
  -V           prints version and exit
  -h           prints this help and exit
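
The sketch below builds the comma-separated value expected by option -i from a list of file names, after checking that each extension is one of those listed in the help text above. Both function names (`compression_tool`, `build_input_arg`) are hypothetical helpers, not part of minidna, and the check looks only at the final extension (it does not verify that a compressed file is actually FASTQ).

```bash
# Hypothetical helper: print the tool implied by the file extension.
compression_tool() {
  case "$1" in
    *.gz)                       echo gzip ;;
    *.bz|*.bz2)                 echo bzip2 ;;
    *.dsrc|*.dsrc2)             echo dsrc ;;
    *.fq|*.fastq|*.fa|*.fasta)  echo none ;;
    *)                          echo unknown; return 1 ;;
  esac
}

# Hypothetical helper: validate every file name, then join them with commas.
build_input_arg() (
  for f in "$@"; do
    compression_tool "$f" > /dev/null || {
      echo "unsupported extension: $f" >&2
      exit 1
    }
  done
  IFS=,
  printf '%s\n' "$*"   # "$*" joins arguments with the first character of IFS
)

# Example (file names are placeholders):
#   ./minidna.sh -i "$(build_input_arg run1.fastq.gz run2.fastq.gz)" -o out -b sample
```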

Notes

minidna can read any number of input files (mandatory option -i). Input files should be in FASTQ (file extensions .fq or .fastq) or FASTA format (file extensions .fa or .fasta). FASTQ files can also be compressed using several tools, provided that the file extension matches the expected one (i.e. bzip2: .bz or .bz2; DSRC: .dsrc or .dsrc2; gzip: .gz) and that the corresponding binary is installed.
Putative candidate reads are determined by considering the different high modes (if any) within the length distribution induced by the selected reads (i.e. lengths lower than the specified upper-bound; option -l set to 10,000 bps by default). For each subset of candidate reads, pairwise alignments are carried out using minimap2 (Li 2018, 2021). Similar and circular reads are next clustered, and the resulting contig is returned from each read cluster of sufficient size (i.e. larger than the specified minimum coverage depth; option -m set to 30× by default). For each read cluster, the returned contig is simply the read with the maximum Phred score sum (note that each returned contig is trivially circular, as it is a circular read).
For accurate inference, it is recommended to use long reads that have been trimmed (i.e. low-quality ends removed) and clipped (i.e. technical/artefactual oligonucleotides removed) beforehand. Trimming and clipping can be performed using e.g. AlienTrimmer with alien oligonucleotides inferred by AlienDiscover.
The option -e enables a more extensive search for circular contigs. It complements the initial candidate lengths (i.e. those corresponding to high modes within the read length distribution, if any) with every detected local mode. This approach is slower, but can recover small plasmids that are not significantly overrepresented within the read dataset (e.g. SRR26162835). It is worth noting that this extensive inference can sometimes yield duplicate contigs (e.g. SRR31896212).
The sensitivity of the overall procedure is controlled by a single parameter Δ (= 50 by default; option -d). It is recommended not to modify this parameter; decreasing Δ can lead to redundancy in the inferred contigs (i.e. distinct read subsets corresponding to the same contig), whereas increasing Δ can yield incorrect contig inferences (e.g. mixture of circular reads corresponding to different contigs).
Each inferred contig can be reoriented by setting option -r, provided that the tool dnaapler (Bouras et al. 2024) is installed. Each inferred contig can also be polished using its corresponding circular read set by setting option -p, provided that the tool Flye (Lin et al. 2016, Kolmogorov et al. 2019) is installed. Of note, the circular reads associated with each inferred contig can be written into a FASTQ file (option -f), so that any alternative polishing tool can be used.
When dealing with large FASTQ files (e.g. > 1 Gbps), it is recommended to set the option -s to subsample the large subsets of candidate reads (e.g. > 2,000 reads); option -s significantly decreases the overall running time, generally with no impact on the returned contigs (except their estimated coverage depth). Running times can also be reduced by using a large number of threads (option -t set to 12 by default) and/or a temporary directory (option -w) located on a fast solid-state drive (SSD).
Output files are all defined by the same specified prefix (mandatory option -b) and written into the specified output directory (mandatory option -o). Each output file content is determined by its file extension:

output file	file content
<prefix>.lgt.txt	read length distribution, with max length determined by option `-l`
<prefix>.fasta	inferred contig(s) in FASTA format
<prefix>.info.txt	descriptive statistics of the inferred contigs
<prefix>.*.fastq	circular reads associated to each inferred contig in FASTQ format
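
To inspect a read length distribution independently of minidna, the following sketch counts reads per length from an uncompressed FASTQ stream. The function name `length_histogram` is hypothetical, the two-column output (length, count) is an assumption that need not match the layout of <prefix>.lgt.txt, and reads are kept when their length is at most the given bound. It assumes four-line FASTQ records.

```bash
# Hypothetical helper: print "length<TAB>count" for reads of length <= $1 (default 10000).
# Reads a four-line-per-record FASTQ stream on standard input.
length_histogram() (
  max="${1:-10000}"
  awk -v max="$max" '
    NR % 4 == 2 { n = length($0); if (n <= max) count[n]++ }  # sequence lines only
    END { for (n in count) print n "\t" count[n] }
  ' | sort -n
)

# Example (file name is a placeholder):
#   gzip -dc reads.fastq.gz | length_histogram 10000 | sort -k2,2nr | head
```

Sorting the output by the second column, as in the commented example, lists the most frequent read lengths first.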

Example

To illustrate the usefulness of minidna and to better describe its output files, the following example describes its usage for assembling the small plasmids of Escherichia coli isolate I from high-quality long reads generated by an Oxford Nanopore MinION instrument (Lerminiaux et al. 2024). This isolate is expected to contain four small plasmids of diverse lengths, i.e. 5,167 bps (CP135461), 4,074 bps (CP135462), 1,538 bps (CP135463), 1,433 bps (CP135464).

Downloading input files

The compressed FASTQ file (750 Mb) can be downloaded using the following command lines:

EBIURL="https://ftp.sra.ebi.ac.uk/vol1/fastq";
wget -O SRR26162837.fastq.gz $EBIURL/SRR261/037/SRR26162837/SRR26162837_1.fastq.gz ;

Running minidna

Below is one standard way to run minidna:

./minidna.sh -i SRR26162837.fastq.gz -o example -b SRR26162837 -p -r

Of note, both options -p and -r were set to infer high-quality contigs (i.e. polished and reoriented, respectively).

Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies):

# minidna v25.12
# Copyright (C) 2025 Institut Pasteur
+ https://gitlab.pasteur.fr/GIPhy/minidna
> Syst: x86_64-redhat-linux-gnu
> Bash: 4.4.20(1)-release
> NTHREADS=12
> NFILE=1
+ OUTDIR=example
+ TMPDIR=/tmp/minidna.SRR26162837.aRCNnbQPv
[00:00] reading input file ... [ok]
+ [748.4 Mb]  SRR26162837.fastq.gz
[00:06] read filtering ... [ok]
> Nreads=88060
> Ncandidates=4
[00:08] processing 4150 reads ~1410 bps ... [ok]
[00:49] inferring circular contig ...... [ok]
+ ctg0001  length=1433  coverage=3199
[00:55] processing 5822 reads ~1510 bps ... [ok]
[02:31] inferring circular contig ...... [ok]
+ ctg0002  length=1538  coverage=4844
[02:40] processing 1781 reads ~4050 bps ... [ok]
[02:56] inferring circular contig ...... [ok]
+ ctg0003  length=4074  coverage=1282
[03:05] processing 2239 reads ~5140 bps ... [ok]
[03:49] inferring circular contig ...... [ok]
+ ctg0004  length=5167  coverage=1853
> Ncontigs=4
[03:58] exit

As shown by the above log, minidna inferred the four small plasmids in ~4 minutes using the default 12 threads.
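
If the log is saved to a file, the contig summary lines can be turned into a small table. The sketch below assumes the line layout shown in the log above (`+ ctgNNNN  length=...  coverage=...`); the function name `contig_table` is hypothetical, and other minidna versions may format the log differently.

```bash
# Hypothetical helper: extract "contig<TAB>length<TAB>coverage" from a minidna log on stdin.
contig_table() {
  awk '$1 == "+" && $2 ~ /^ctg/ {
    sub(/^length=/, "", $3)      # keep only the numeric value
    sub(/^coverage=/, "", $4)
    print $2 "\t" $3 "\t" $4
  }'
}

# Example (the log file name is a placeholder):
#   contig_table < minidna.log
```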

After filtering the long reads to only keep those of length ≤ 10,000 bps (option -l), their length distribution was computed (see example/SRR26162837.lgt.txt). As expected (and especially when using rapid ONT preparation kits; Wick et al. 2021), the presence of small plasmids of length ℓ led to a clear overrepresentation of long reads of length near ℓ, as shown by the following graphical representation.

minidna then considered the four high modes m (i.e. 1,410, 1,510, 4,050 and 5,140 bps), and, for each of them, it gathered the reads of lengths m ± Δ/2. For each subset of gathered reads, near-identical and circular ones were identified by pairwise alignments and then clustered. For each cluster, an accurate read was selected as a complete circular contig, which was then reoriented and polished.
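
Two of these steps can be sketched with awk: gathering the reads of lengths m ± Δ/2, and picking the read with the maximum Phred score sum (the selection criterion given in the Notes). This is an illustration under stated assumptions, not minidna's actual code: the function names are hypothetical, the window bounds are taken as inclusive, qualities are assumed to be Phred+33 encoded, records are assumed to span four lines, and the alignment and clustering step that minidna performs between the two is omitted.

```bash
# Hypothetical helper: keep FASTQ records (stdin) whose read length lies in [m - d/2, m + d/2].
reads_near_mode() (
  m="$1"; delta="${2:-50}"
  awk -v m="$m" -v d="$delta" '
    { rec[NR % 4] = $0 }                      # buffer the current record
    NR % 4 == 0 {
      n = length(rec[2])
      if (n >= m - d / 2 && n <= m + d / 2)
        print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
    }
  '
)

# Hypothetical helper: print "name<TAB>sum" for the read (stdin) with the largest Phred sum.
best_read_by_phred_sum() {
  awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }  # character -> ASCII code
    NR % 4 == 1 { split(substr($0, 2), a, " "); name = a[1] }
    NR % 4 == 0 {
      s = 0; n = length($0)
      for (j = 1; j <= n; j++) s += ord[substr($0, j, 1)] - 33       # Phred+33 assumed
      if (best == "" || s > max) { max = s; best = name }
    }
    END { if (best != "") print best "\t" max }
  '
}

# Example (file name is a placeholder): reads around the 1,410 bps mode with the default delta.
#   gzip -dc reads.fastq.gz | reads_near_mode 1410 50 | best_read_by_phred_sum
```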

The four inferred contigs are available in example/SRR26162837.fasta. Of note, ctg0001, ctg0002, ctg0003 and ctg0004 are 100% identical to CP135464, CP135463, CP135462 and CP135461, respectively.
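
Comparing a circular contig with a reference is not a plain string comparison, because the two sequences may start at different positions or lie on opposite strands. The sketch below is one simple way to test exact identity up to rotation and reverse complementation; it is not necessarily how the identities reported above were obtained. The function name `same_circular_seq` is hypothetical, sequences are passed as plain strings (no FASTA header, no line breaks), only the letters A, C, G and T are complemented, and the `rev` utility is assumed to be available.

```bash
# Hypothetical helper: succeeds when $1 and $2 are the same circular sequence,
# i.e. equal up to rotation and/or reverse complementation.
same_circular_seq() (
  a=$(printf '%s' "$1" | tr 'acgt' 'ACGT')
  b=$(printf '%s' "$2" | tr 'acgt' 'ACGT')
  [ "${#a}" -eq "${#b}" ] || exit 1
  rc=$(printf '%s\n' "$b" | rev | tr 'ACGT' 'TGCA')   # reverse complement of b
  # Every rotation of a circular sequence is a substring of the sequence doubled.
  case "$a$a" in
    *"$b"*|*"$rc"*) exit 0 ;;
    *)              exit 1 ;;
  esac
)

# Example (file names are placeholders; each file holds a single sequence):
#   s1=$(grep -v '^>' contig.fasta | tr -d '\n')
#   s2=$(grep -v '^>' reference.fasta | tr -d '\n')
#   same_circular_seq "$s1" "$s2" && echo "identical"
```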

Note that the very same inference can be obtained using either option -e (requiring ~30 more seconds) or option -s (halving the initial running time).
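
How option -s subsamples reads is not detailed here. As an illustration of the general idea only, the sketch below draws a fixed number of reads uniformly at random from a FASTQ stream using reservoir sampling, a standard technique chosen for this example and not taken from minidna. The function name `subsample_fastq` and the seed argument are hypothetical, and four-line FASTQ records are assumed.

```bash
# Hypothetical helper: keep at most $1 randomly chosen FASTQ records from stdin.
# $2 is an optional random seed (default 42) so that runs are reproducible.
subsample_fastq() (
  n="$1"; seed="${2:-42}"
  awk -v n="$n" -v seed="$seed" '
    BEGIN { srand(seed) }
    { rec = rec $0 "\n" }                      # accumulate the current record
    NR % 4 == 0 {
      r++
      if (r <= n) {
        keep[r] = rec                          # fill the reservoir first
      } else {
        k = int(rand() * r) + 1                # uniform integer in 1..r
        if (k <= n) keep[k] = rec              # replace a kept record with probability n/r
      }
      rec = ""
    }
    END {
      m = (r < n) ? r : n
      for (i = 1; i <= m; i++) printf "%s", keep[i]
    }
  '
)

# Example (file names are placeholders): keep 2,000 reads.
#   gzip -dc subset.fastq.gz | subsample_fastq 2000 > subset.sub.fastq
```

The stream is read once and only the kept records are held in memory, which suits large inputs.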

References

Bouras G, Grigson SR, Papudeshi B, Mallawaarachchi V, Roach MJ (2024) Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93):5968. doi:10.21105/joss.05968

Bouras G, Sheppard AE, Mallawaarachchi V, Vreugde S (2023) Plassembler: an automated bacterial plasmid assembly tool. Bioinformatics, 39(7):btad409. doi:10.1093/bioinformatics/btad409

Johnson J, Soehnlen M, Blankenship HM (2023) Long read genome assemblers struggle with small plasmids. Microbial Genomics, 9(5):001024. doi:10.1099/mgen.0.001024

Kolmogorov M, Yuan J, Lin Y, Pevzner P (2019) Assembly of Long Error-Prone Reads Using Repeat Graphs. Nature Biotechnology, 37:540-546. doi:10.1038/s41587-019-0072-8

Lerminiaux N, Fakharuddin K, Mulvey MR, Mataseje L (2024) Do we still need Illumina sequencing data? Evaluating Oxford Nanopore Technologies R10.4.1 flow cells and the Rapid v14 library prep kit for Gram negative bacteria whole genome assemblies. Canadian Journal of Microbiology, 70(5):178-189. doi:10.1139/cjm-2023-0175

Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191

Li H (2021) New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705

Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, Pevzner P (2016) Assembly of Long Error-Prone Reads Using de Bruijn Graphs. Proceedings of the National Academy of Sciences of the United States of America, 113(52):E8396-E8405. doi:10.1073/pnas.1604560113

Wick RR, Judd LM, Wyres KL, Holt KE (2021) Recovery of small plasmid sequences via Oxford Nanopore sequencing. Microbial Genomics, 7:000631. doi:10.1099/mgen.0.000631