

GPLv3 license Bash

minidna

minidna (mini-contig de novo assembly) is a command line tool written in Bash to quickly infer small circular contigs (e.g. < 10,000 bps) from long-read sequencing data.

In brief, minidna first searches for subset(s) of circular reads deriving from the same contig, and then infers the circular contig corresponding to each long-read subset. This simple approach runs fast and requires few dependencies, which makes it a useful alternative to other tools such as Plassembler (Bouras et al. 2023). minidna can therefore be used in addition to any long-read-only and/or long-read-first assembler, in order to compensate for their potential failure to recover small plasmids (e.g. Johnson et al. 2023).

minidna runs on UNIX, Linux and most OS X operating systems.

Dependencies

You will need to install the required programs and tools listed in the following tables, or to verify that they are already installed with the required version.

Mandatory programs

program     package     version     sources
minimap2    -           ≥ 2.28      github.com/lh3/minimap2

Optional programs

program     package     version     sources
bzip2       -           > 1.0.0     sourceware.org/bzip2/downloads.html
DSRC        -           ≥ 2.0       github.com/refresh-bio/DSRC
gzip        -           > 1.5.0     ftp.gnu.org/gnu/gzip
pigz        -           > 2.7       github.com/madler/pigz
dnaapler    -           ≥ 1.3.0     github.com/gbouras13/dnaapler
Flye        -           ≥ 2.9.6     github.com/mikolmogorov/Flye

Standard GNU packages and utilities

program                           package     version     sources
cat cp du echo mkdir mktemp mv    coreutils   > 8.0       ftp.gnu.org/gnu/coreutils
od rm paste sort tr wc
gawk                              -           > 4.0.0     ftp.gnu.org/gnu/gawk
grep                              -           > 2.0       ftp.gnu.org/gnu/grep
sed                               -           > 4.2       ftp.gnu.org/gnu/sed

On OS X, the package coreutils can be installed via Homebrew (e.g. brew install coreutils).

Installation and execution

A. Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/minidna.git

B. Go to the created directory and give the execute permission to the file minidna.sh:

cd minidna/
chmod +x minidna.sh

C. Check the dependencies (and their version) using the following command line:

./minidna.sh  -c

If at least one of the mandatory programs (see Dependencies) is not available in your $PATH variable (or if one compiled binary has a different default name), it should be manually specified. To specify the location of a specific binary, edit the file minidna.sh and indicate the local path to the corresponding binary(ies) within the code block DEPENDENCIES (approximately lines 110-180). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block REQUIREMENTS.

program     variable assignment        program     variable assignment
bzip2       BZIP2_BIN=bzip2;           gzip        GZIP_BIN=gzip;
dnaapler    DNAAPLER_BIN=dnaapler;     gawk        GAWK_BIN=gawk;
DSRC        DSRC_BIN=dsrc;             pigz        PIGZ_BIN=pigz;
Flye        FLYE_BIN=flye;             minimap2    MINIMAP2_BIN=minimap2;

Optional programs are not required to be installed/accessible for minidna to run.
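Before editing any variable assignment, it can help to see which binaries are already reachable from $PATH. The sketch below is only an illustration and is not part of minidna (the built-in check remains option -c); the function names check_bin and check_all are hypothetical, and the binary names are the default ones listed in the table above.

```bash
# Hypothetical helper (not part of minidna): report whether a binary is on $PATH.
check_bin() {
  local name="$1"
  if command -v "$name" >/dev/null 2>&1; then
    printf '[ok]      %s -> %s\n' "$name" "$(command -v "$name")"
  else
    printf '[missing] %s\n' "$name"
    return 1
  fi
}

# Check the mandatory program first, then the optional ones.
# Only a missing mandatory program makes check_all fail.
check_all() {
  local status=0 bin
  check_bin minimap2 || status=1
  for bin in bzip2 dsrc gzip pigz dnaapler flye; do
    check_bin "$bin" || true   # optional: reported, never fatal
  done
  return "$status"
}
```

Calling check_all prints one line per program. Any binary reported as missing, or installed under a different name, is a candidate for the manual path assignment described above.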

D. Execute minidna with the following command line model:

./minidna.sh  [options]

Usage

Run minidna without any option to read the following documentation:

 Inferring small circular contigs (< 10,000 bps) from long-read FASTQ files
 https://gitlab.pasteur.fr/GIPhy/minidna

 USAGE:  minidna  -i <fastq>[,<fastq>,...]  -o <outdir>  -b <basename>  [options]

 OPTIONS:
  -i <file>    FASTA/FASTQ input file name(s);  multiple files can be specified,  separated by
               commas; FASTA/FASTQ file(s) can be uncompressed (file extensions: fa or fasta /
               fq or fastq); FASTQ file(s) can be compressed using either gzip (gz), bzip2 (bz
               or bz2), or DSRC (dsrc or dsrc2)
  -o <dir>     name of the output directory (mandatory option)
  -b <string>  base name for output files (mandatory option)
  -l <int>     maximum contig length (default: 10000)
  -m <int>     minimum coverage depth per contig (default: 30)
  -d <int>     delta sensitivity parameter (default: 50)
  -s           subsample large read subsets (default: not set)
  -e           extensive inference (default: not set)
  -p           polish contigs (default: not set)
  -r           reorient contigs (default: not set)
  -f           output self-circularized reads for each contig (default: not set)
  -t <int>     thread numbers (default: 12)
  -w <dir>     path to the tmp directory (default: $TMPDIR, otherwise /tmp)
  -c           checks dependencies and exit
  -V           prints version and exit
  -h           prints this help and exit
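Since option -i expects several input files as a single comma-separated string, a small helper can build that string from a list of file names. This is a minimal sketch: the function name join_by_comma and the file names in the commented usage are hypothetical, and file names that themselves contain commas are not handled.

```bash
# Hypothetical helper: join all arguments with commas, as expected by option -i.
join_by_comma() {
  local IFS=,
  printf '%s\n' "$*"
}

# Possible usage (file names, output directory and base name are placeholders):
#   inputs=$(join_by_comma run1.fastq.gz run2.fastq.gz)
#   ./minidna.sh -i "$inputs" -o outdir -b sample -t 8
```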

Notes

output file           file content
<prefix>.lgt.txt      read length distribution, with max length determined by option -l
<prefix>.fasta        inferred contig(s) in FASTA format
<prefix>.info.txt     descriptive statistics of the inferred contigs
<prefix>.*.fastq      circular reads associated to each inferred contig in FASTQ format

Example

In order to illustrate the usefulness of minidna and to better describe its output files, the following example describes its usage for assembling small plasmids of Escherichia coli isolate I from high-quality long reads generated by an Oxford Nanopore MinION instrument (Lerminiaux et al. 2024). This isolate is expected to contain four small plasmids of diverse lengths, i.e. 5,167 bps (CP135461), 4,074 bps (CP135462), 1,538 bps (CP135463), 1,433 bps (CP135464).

Downloading input files

The compressed FASTQ file (750 Mb) can be downloaded using the following command lines:

EBIURL="https://ftp.sra.ebi.ac.uk/vol1/fastq";
wget -O SRR26162837.fastq.gz $EBIURL/SRR261/037/SRR26162837/SRR26162837_1.fastq.gz ;

Running minidna

Below is one standard way to run minidna:

./minidna.sh -i SRR26162837.fastq.gz -o example -b SRR26162837 -p -r

Of note, both options -p and -r were set to infer high-quality contigs (i.e. polished and reoriented, respectively).

Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies):

# minidna v25.12
# Copyright (C) 2025 Institut Pasteur
+ https://gitlab.pasteur.fr/GIPhy/minidna
> Syst: x86_64-redhat-linux-gnu
> Bash: 4.4.20(1)-release
> NTHREADS=12
> NFILE=1
+ OUTDIR=example
+ TMPDIR=/tmp/minidna.SRR26162837.aRCNnbQPv
[00:00] reading input file ... [ok]
+ [748.4 Mb]  SRR26162837.fastq.gz
[00:06] read filtering ... [ok]
> Nreads=88060
> Ncandidates=4
[00:08] processing 4150 reads ~1410 bps ... [ok]
[00:49] inferring circular contig ...... [ok]
+ ctg0001  length=1433  coverage=3199
[00:55] processing 5822 reads ~1510 bps ... [ok]
[02:31] inferring circular contig ...... [ok]
+ ctg0002  length=1538  coverage=4844
[02:40] processing 1781 reads ~4050 bps ... [ok]
[02:56] inferring circular contig ...... [ok]
+ ctg0003  length=4074  coverage=1282
[03:05] processing 2239 reads ~5140 bps ... [ok]
[03:49] inferring circular contig ...... [ok]
+ ctg0004  length=5167  coverage=1853
> Ncontigs=4
[03:58] exit

As shown by the above log, minidna inferred the four small plasmids in ~4 minutes using the default 12 threads.

After filtering the long reads to only keep those of length ≤ 10,000 bps (option -l), their length distribution was computed (see example/SRR26162837.lgt.txt). As expected (and especially when using rapid ONT preparation kits; Wick et al. 2021), the presence of small plasmids of length ℓ led to a clear overrepresentation of long reads of length near ℓ, as shown by the following graphical representation.

[Figure: length distribution of the long reads of length ≤ 10,000 bps]
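The idea of a read length distribution can be sketched with a few lines of awk. This is an illustration only, not the code used by minidna: it assumes an uncompressed FASTQ file with exactly four lines per record, and the function name length_histogram, the bin width of 10 bps, and the tab-separated output are all assumptions (the actual format of the .lgt.txt file may differ).

```bash
# Illustrative sketch: binned read length histogram from a 4-line-per-record FASTQ.
#   $1 = FASTQ file, $2 = maximum read length (default 10000), $3 = bin width (default 10)
length_histogram() {
  local fastq="$1" maxlen="${2:-10000}" bin="${3:-10}"
  awk -v max="$maxlen" -v bin="$bin" '
    NR % 4 == 2 {                      # second line of each record = sequence
      n = length($0)
      if (n <= max) count[int(n / bin) * bin]++
    }
    END { for (b in count) print b "\t" count[b] }
  ' "$fastq" | sort -n
}
```

Each output line gives the lower bound of a length bin and the number of reads falling into it. With such a table, bins with unusually high counts correspond to the modes discussed in this example.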

minidna then considered the four high modes m (i.e. 1,410, 1,510, 4,050 and 5,140 bps), and, for each of them, it gathered the reads of lengths m ± Δ/2. For each subset of gathered reads, nearly identical and circular ones were assessed by pairwise alignments and then clustered. For each cluster, an accurate read was selected as a complete circular contig, which was then reoriented and polished.
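The gathering step, i.e. keeping the reads whose length lies within m ± Δ/2, can be sketched as follows. This is not the implementation used by minidna: it assumes an uncompressed FASTQ file with four lines per record, the function name reads_near_mode is hypothetical, and taking Δ to be the value of option -d (default: 50) is an assumption based on the usage text.

```bash
# Illustrative sketch: keep FASTQ records whose read length is within mode +/- delta/2.
#   $1 = FASTQ file, $2 = mode m (bps), $3 = delta (default 50)
reads_near_mode() {
  local fastq="$1" mode="$2" delta="${3:-50}"
  awk -v m="$mode" -v d="$delta" '
    { rec[NR % 4] = $0 }               # lines 1,2,3,4 of a record -> rec[1],rec[2],rec[3],rec[0]
    NR % 4 == 0 {                      # record complete: test the sequence length
      n = length(rec[2])
      if (n >= m - d / 2 && n <= m + d / 2)
        print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
    }
  ' "$fastq"
}
```

For instance, with m = 1410 and Δ = 50, only reads of length 1,385 to 1,435 bps would be written to the standard output.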

The four inferred contigs are available in example/SRR26162837.fasta. Of note, ctg0001, ctg0002, ctg0003 and ctg0004 are 100% identical to CP135464, CP135463, CP135462 and CP135461, respectively.
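A quick way to cross-check the contig lengths reported in the log is to recompute them from the FASTA file. The sketch below is a generic FASTA length counter, not part of minidna; the function name fasta_lengths is hypothetical, and the exact FASTA header format written by minidna is not assumed beyond the first word being the sequence name.

```bash
# Illustrative sketch: print "<name><TAB><length>" for each record of a FASTA file.
fasta_lengths() {
  awk '
    /^>/ { if (name != "") print name "\t" len     # flush the previous record
           name = substr($1, 2); len = 0; next }   # first word of the header, without ">"
         { len += length($0) }                     # sequences may span several lines
    END  { if (name != "") print name "\t" len }
  ' "$1"
}

# Possible usage on the example output:
#   fasta_lengths example/SRR26162837.fasta
```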

Note that the very same inference can be obtained using either option -e (requiring ~30 more seconds) or option -s (halving the initial running time).
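When comparing circular contigs, for instance between two runs or against a reference accession, two sequences can be identical even though they start at different positions or lie on opposite strands. The sketch below shows one simple way to test this for exact matches: a sequence b is a rotation of a when both have the same length and b occurs in a concatenated with itself. This is an illustration, not the procedure used by the authors; the function names are hypothetical, and each sequence is expected as a single string without line breaks (e.g. obtained with grep -v '^>' file.fasta | tr -d '\n').

```bash
# Reverse complement of a DNA string (A, C, G, T in upper or lower case).
revcomp() {
  printf '%s\n' "$1" |
    awk '{ for (i = length($0); i > 0; i--) printf "%s", substr($0, i, 1); print "" }' |
    tr 'ACGTacgt' 'TGCAtgca'
}

# True when $2 is a rotation of $1: same length, and $2 is a substring of $1$1.
is_rotation() {
  local a="$1" b="$2"
  [ "${#a}" -eq "${#b}" ] || return 1
  case "$a$a" in
    *"$b"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# True when the two circular sequences are identical up to rotation and strand.
same_circular_sequence() {
  is_rotation "$1" "$2" || is_rotation "$1" "$(revcomp "$2")"
}
```

This check only detects exact identity. Sequences that differ by even one base require an alignment-based comparison instead.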

References

Bouras G, Grigson SR, Papudeshi B, Mallawaarachchi V, Roach MJ (2024) Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93):5968. doi:10.21105/joss.05968

Bouras G, Sheppard AE, Mallawaarachchi V, Vreugde S (2023) Plassembler: an automated bacterial plasmid assembly tool. Bioinformatics, 39(7):btad409. doi:10.1093/bioinformatics/btad409

Johnson J, Soehnlen M, Blankenship HM (2023) Long read genome assemblers struggle with small plasmids. Microbial Genomics, 9(5):001024. doi:10.1099/mgen.0.001024

Kolmogorov M, Yuan J, Lin Y, Pevzner P (2019) Assembly of Long Error-Prone Reads Using Repeat Graphs. Nature Biotechnology, 37:540-546. doi:10.1038/s41587-019-0072-8

Lerminiaux N, Fakharuddin K, Mulvey MR, Mataseje L (2024) Do we still need Illumina sequencing data? Evaluating Oxford Nanopore Technologies R10.4.1 flow cells and the Rapid v14 library prep kit for Gram negative bacteria whole genome assemblies. Canadian Journal of Microbiology, 70(5):178-189. doi:10.1139/cjm-2023-0175

Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191

Li H (2021) New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705

Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, Pevzner P (2016) Assembly of Long Error-Prone Reads Using de Bruijn Graphs. Proceedings of the National Academy of Sciences of the United States of America, 113(52):E8396-E8405. doi:10.1073/pnas.1604560113

Wick RR, Judd LM, Wyres KL, Holt KE (2021) Recovery of small plasmid sequences via Oxford Nanopore sequencing. Microbial Genomics, 7:000631. doi:10.1099/mgen.0.000631