minidna (mini-contig de novo assembly) is a command line tool written in Bash to quickly infer small circular contigs (e.g. < 10,000 bps) from long-read sequencing data.
In brief, minidna first searches for subset(s) of circular reads deriving from the same contig, and then infers the circular contig corresponding to each long-read subset. This simple approach runs fast and requires few dependencies, making it a useful alternative to other tools such as Plassembler (Bouras et al. 2023). minidna can therefore be used in addition to any long-read-only and/or long-read-first assembler to compensate for their potential failure to recover small plasmids (e.g. Johnson et al. 2023).
minidna runs on UNIX, Linux and most OS X operating systems.
You will need to install the required programs and tools listed in the following tables, or to verify that they are already installed with the required version.
Mandatory program:
| program | package | version | sources |
|---|---|---|---|
| minimap2 | - | ≥ 2.28 | github.com/lh3/minimap2 |
Optional programs:
| program | package | version | sources |
|---|---|---|---|
| bzip2 | - | > 1.0.0 | sourceware.org/bzip2/downloads.html |
| DSRC | - | ≥ 2.0 | github.com/refresh-bio/DSRC |
| gzip | - | > 1.5.0 | ftp.gnu.org/gnu/gzip |
| pigz | - | > 2.7 | github.com/madler/pigz |
| dnaapler | - | ≥ 1.3.0 | github.com/gbouras13/dnaapler |
| Flye | - | ≥ 2.9.6 | github.com/mikolmogorov/Flye |
Standard GNU packages and utilities:
| program | package | version | sources |
|---|---|---|---|
| cat cp du echo mkdir mktemp mv od rm paste sort tr wc | coreutils★ | > 8.0 | ftp.gnu.org/gnu/coreutils |
| gawk | - | > 4.0.0 | ftp.gnu.org/gnu/gawk |
| grep | - | > 2.0 | ftp.gnu.org/gnu/grep |
| sed | - | > 4.2 | ftp.gnu.org/gnu/sed |
★ On OS X, the package coreutils can be installed via Homebrew (e.g. brew install coreutils).
A. Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/minidna.git
B. Go to the created directory and give the execute permission to the file minidna.sh:
cd minidna/
chmod +x minidna.sh
C. Check the dependencies (and their version) using the following command line:
./minidna.sh -c
If at least one of the mandatory programs (see Dependencies) is not available in your $PATH variable (or if one compiled binary has a different default name), it should be manually specified. To specify the location of a specific binary, edit the file minidna.sh and indicate the local path to the corresponding binary(ies) within the code block DEPENDENCIES (approximately lines 110-180). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block REQUIREMENTS.
| program | variable assignment | program | variable assignment |
|---|---|---|---|
| bzip2★ | BZIP2_BIN=bzip2; | gzip★ | GZIP_BIN=gzip; |
| dnaapler★ | DNAAPLER_BIN=dnaapler; | gawk | GAWK_BIN=gawk; |
| DSRC★ | DSRC_BIN=dsrc; | pigz★ | PIGZ_BIN=pigz; |
| Flye★ | FLYE_BIN=flye; | minimap2 | MINIMAP2_BIN=minimap2; |
★ Optional programs are not required to be installed/accessible for minidna to run.
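When diagnosing a missing binary, the check of step C can be approximated by hand. The following is a minimal sketch, not the code of minidna.sh: the helper name check_bin is hypothetical, and the program list simply repeats the default binary names from the table above.

```shell
# Hypothetical helper (not part of minidna.sh): report whether a program
# can be found in $PATH.
check_bin() {
  if command -v "$1" > /dev/null 2>&1; then
    echo "[ok]      $1 ($(command -v "$1"))"
  else
    echo "[missing] $1"
    return 1
  fi
}

# Mandatory programs first, then the optional ones (default binary names).
for bin in minimap2 gawk bzip2 gzip pigz dsrc dnaapler flye; do
  check_bin "$bin" || true   # keep going even when a program is missing
done
```

A program reported as missing here is a candidate for a manual path in the corresponding variable assignment.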
D. Execute minidna with the following command line model:
./minidna.sh [options]
Run minidna without any option to read the following documentation:
Inferring small circular contigs (< 10,000 bps) from long-read FASTQ files
https://gitlab.pasteur.fr/GIPhy/minidna
USAGE: minidna -i <fastq>[,<fastq>,...] -o <outdir> -b <basename> [options]
OPTIONS:
-i <file> FASTA/FASTQ input file name(s); multiple files can be specified, separated by
commas; FASTA/FASTQ file(s) can be uncompressed (file extensions: fa or fasta /
fq or fastq); FASTQ file(s) can be compressed using either gzip (gz), bzip2 (bz
or bz2), or DSRC (dsrc or dsrc2)
-o <dir> name of the output directory (mandatory option)
-b <string> base name for output files (mandatory option)
-l <int> maximum contig length (default: 10000)
-m <int> minimum coverage depth per contig (default: 30)
-d <int> delta sensitivity parameter (default: 50)
-s subsample large read subsets (default: not set)
-e extensive inference (default: not set)
-p polish contigs (default: not set)
-r reorient contigs (default: not set)
-f output self-circularized reads for each contig (default: not set)
-t <int> thread numbers (default: 12)
-w <dir> path to the tmp directory (default: $TMPDIR, otherwise /tmp)
-c checks dependencies and exit
-V prints version and exit
-h prints this help and exit
minidna can read any number of input files (mandatory option -i). Input files should be in FASTQ (file extensions .fq or .fastq) or FASTA format (file extensions .fa or .fasta). FASTQ files can also be compressed using several tools, provided that the file extension matches the expected one (i.e. bzip2: .bz or .bz2; DSRC: .dsrc or .dsrc2; gzip: .gz) and that the corresponding binary is installed.
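To see which optional binary a given set of input files would require, the extension rules above can be written as a small lookup. This is only an illustrative sketch: the helper name reader_for and the file names are hypothetical, and the way minidna.sh itself dispatches on extensions may differ.

```shell
# Hypothetical helper (not part of minidna.sh): name the program needed to
# read an input file, based only on its extension.
reader_for() {
  case "$1" in
    *.gz)                       echo gzip    ;;
    *.bz|*.bz2)                 echo bzip2   ;;
    *.dsrc|*.dsrc2)             echo dsrc    ;;
    *.fq|*.fastq|*.fa|*.fasta)  echo none    ;;
    *)                          echo unknown ;;
  esac
}

# Option -i takes a comma-separated list; the file names below are made up.
inputs="runA.fastq.gz,runB.fq.bz2,runC.fasta"
echo "$inputs" | tr ',' '\n' | while read -r f; do
  printf '%s\t%s\n' "$f" "$(reader_for "$f")"
done
```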
Putative candidate reads are determined by considering the
different high modes (if any) within the length distribution induced by
the selected reads (i.e. lengths lower than the specified upper-bound;
option -l set to 10,000 bps by default). For each subset of
candidate reads, pairwise alignments are carried out using minimap2 (Li 2018,
2021). Similar and circular reads are next clustered, and the resulting
contig is returned from each read cluster of sufficient size
(i.e. larger than the specified minimum coverage depth; option
-m set to 30× by default). For each read cluster, the
returned contig is simply the read with the maximum Phred score sum
(note that each returned contig is trivially circular, as it is a
circular read).
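The selection rule described above (keeping the read with the maximum Phred score sum) can be illustrated with a few lines of awk. This is a sketch under stated assumptions, not the implementation used by minidna: the helper name best_read is hypothetical, records are assumed to span exactly four lines, and quality strings are assumed to use the Phred+33 encoding.

```shell
# Hypothetical helper (not part of minidna.sh): read FASTQ records on stdin and
# print the name and Phred score sum of the read with the largest sum.
# Assumptions: 4-line records, Phred+33 quality encoding.
best_read() {
  awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 1 { name = substr($1, 2) }            # header line, "@" removed
    NR % 4 == 0 {                                   # quality line
      sum = 0
      n = length($0)
      for (j = 1; j <= n; j++) sum += ord[substr($0, j, 1)] - 33
      if (best_name == "" || sum > best_sum) { best_sum = sum; best_name = name }
    }
    END { if (best_name != "") printf "%s\t%d\n", best_name, best_sum }
  '
}

# Toy cluster of three reads: r1 (Q40 bases), r2 (Q20 bases), r3 (Q0 bases).
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n5555\n@r3\nACGTA\n+\n!!!!!\n' | best_read
```

On this toy input the helper prints r1 with a score sum of 160 (four bases at Q40). Note that a sum, unlike a mean, favours longer reads among reads of equal per-base quality.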
To ensure accurate inference, it is recommended to use long reads that have been preliminarily trimmed (i.e. cutting low-quality ends) and clipped (i.e. cutting technical/artefactual oligonucleotides). Trimming and clipping can be performed using e.g. AlienTrimmer with alien oligonucleotides inferred by AlienDiscover.
The option -e enables a more extensive search for circular contigs. It completes the initial candidate lengths (i.e. corresponding to high modes within the read length distribution, if any) with every detected local mode. This approach is slower, but allows the recovery of small plasmids that are not significantly overrepresented within the read dataset (e.g. SRR26162835). It is worth noting that this extensive inference can sometimes infer duplicate contigs (e.g. SRR31896212).
The sensitivity of the overall procedure is set by a single parameter Δ (= 50 by default; option -d). It is recommended not to modify this parameter; decreasing Δ can lead to redundancy in the inferred contigs (i.e. distinct read subsets corresponding to the same contig), whereas increasing Δ can yield incorrect contig inferences (e.g. mixture of circular reads corresponding to different contigs).
Each inferred contig can be reoriented by setting option -r, provided that the tool dnaapler (Bouras et al. 2024) is installed. Each inferred contig can also be polished using its corresponding circular read set by setting option -p, provided that the tool Flye (Lin et al. 2016, Kolmogorov et al. 2019) is installed. Of note, the circular reads associated with each inferred contig can be written into a FASTQ file (option -f) in order to use any alternative polishing tool.
When dealing with large FASTQ files (e.g. > 1 Gbps), it is recommended to set the option -s to subsample the large subsets of candidate reads (e.g. > 2,000 reads); option -s significantly decreases the overall running time, generally with no impact on the returned contigs (except their estimated coverage depth). Faster running times can also be observed when using a large number of threads (option -t, set to 12 by default) and/or a temporary directory (option -w) located on a fast solid-state drive (SSD).
Output files are all defined by the same specified prefix
(mandatory option -b) and written into the specified output
directory (mandatory option -o). Each output file content
is determined by its file extension:
| output file | file content |
|---|---|
| <prefix>.lgt.txt | read length distribution, with max length determined by option -l |
| <prefix>.fasta | inferred contig(s) in FASTA format |
| <prefix>.info.txt | descriptive statistics of the inferred contigs |
| <prefix>.*.fastq | circular reads associated to each inferred contig in FASTQ format |
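After a run, the table above can be turned into a quick sanity check. The sketch below is illustrative only: the helper names list_outputs and count_contigs are hypothetical, and the demonstration uses a throwaway directory with a fake contig rather than real minidna output.

```shell
# Hypothetical helper (not part of minidna.sh): report which of the expected
# output files exist for a given output directory and base name.
# usage: list_outputs <outdir> <basename>
list_outputs() {
  for ext in lgt.txt fasta info.txt; do
    f="$1/$2.$ext"
    if [ -s "$f" ]; then
      echo "found    $f"
    else
      echo "missing  $f"
    fi
  done
}

# Count the contigs in a FASTA file (one ">" header per contig).
count_contigs() {
  grep -c '^>' "$1"
}

# Demonstration on a throwaway directory holding a single fake contig.
demo=$(mktemp -d)
printf '>ctg0001\nACGT\n' > "$demo/toy.fasta"
list_outputs "$demo" toy
count_contigs "$demo/toy.fasta"
rm -r "$demo"
```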
In order to illustrate the usefulness of minidna and to better describe its output files, the following example describes its usage for assembling small plasmids of Escherichia coli isolate I from high-quality long reads generated by an Oxford Nanopore MinION instrument (Lerminiaux et al. 2024). This isolate is expected to contain four small plasmids of diverse lengths, i.e. 5,167 bps (CP135461), 4,074 bps (CP135462), 1,538 bps (CP135463) and 1,433 bps (CP135464).
The compressed FASTQ file (750 Mb) can be downloaded using the following command lines:
EBIURL="https://ftp.sra.ebi.ac.uk/vol1/fastq";
wget -O SRR26162837.fastq.gz $EBIURL/SRR261/037/SRR26162837/SRR26162837_1.fastq.gz ;
Below is one standard way to run minidna:
./minidna.sh -i SRR26162837.fastq.gz -o example -b SRR26162837 -p -r
Of note, both options -p and -r were set to infer high-quality contigs (i.e. polished and reoriented, respectively).
Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies):
# minidna v25.12
# Copyright (C) 2025 Institut Pasteur
+ https://gitlab.pasteur.fr/GIPhy/minidna
> Syst: x86_64-redhat-linux-gnu
> Bash: 4.4.20(1)-release
> NTHREADS=12
> NFILE=1
+ OUTDIR=example
+ TMPDIR=/tmp/minidna.SRR26162837.aRCNnbQPv
[00:00] reading input file ... [ok]
+ [748.4 Mb] SRR26162837.fastq.gz
[00:06] read filtering ... [ok]
> Nreads=88060
> Ncandidates=4
[00:08] processing 4150 reads ~1410 bps ... [ok]
[00:49] inferring circular contig ...... [ok]
+ ctg0001 length=1433 coverage=3199
[00:55] processing 5822 reads ~1510 bps ... [ok]
[02:31] inferring circular contig ...... [ok]
+ ctg0002 length=1538 coverage=4844
[02:40] processing 1781 reads ~4050 bps ... [ok]
[02:56] inferring circular contig ...... [ok]
+ ctg0003 length=4074 coverage=1282
[03:05] processing 2239 reads ~5140 bps ... [ok]
[03:49] inferring circular contig ...... [ok]
+ ctg0004 length=5167 coverage=1853
> Ncontigs=4
[03:58] exit
As shown by the above log, minidna inferred the four small plasmids in ~4 minutes using the default 12 threads.
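When many samples are processed, the contig lines of such logs can be collected into a table. The following sketch assumes that the log was saved to a file or piped, and that contig lines keep the "+ ctgNNNN length=... coverage=..." layout shown above; the helper name summarize_contigs is hypothetical.

```shell
# Hypothetical helper (not part of minidna.sh): turn the "+ ctgNNNN ..." lines
# of a saved log into a tab-separated table (contig, length, coverage).
summarize_contigs() {
  awk '$1 == "+" && $2 ~ /^ctg[0-9]+$/ {
    split($3, len, "=")
    split($4, cov, "=")
    print $2 "\t" len[2] "\t" cov[2]
  }'
}

# Two lines copied from the log above.
printf '+ ctg0001 length=1433 coverage=3199\n+ ctg0002 length=1538 coverage=4844\n' \
  | summarize_contigs
```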
After filtering the long reads to only keep those of length ≤ 10,000 bps (option -l), their length distribution was computed (see example/SRR26162837.lgt.txt). As expected (and especially when using rapid ONT preparation kits; Wick et al. 2021), the presence of small plasmids of length ℓ led to a clear overrepresentation of long reads of length near ℓ, as shown by the following graphical representation.
minidna then considered the four high modes m (i.e. 1,410, 1,510, 4,050 and 5,140 bps), and, for each of them, it gathered the reads of lengths m ± Δ/2. For each subset of gathered reads, nearly identical and circular ones were identified by pairwise alignments and then clustered. For each cluster, an accurate read was selected as a complete circular contig, which was next reoriented and polished.
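The gathering of reads of lengths m ± Δ/2 can be sketched with awk. With m = 1,410 and the default Δ = 50, the window spans 1,385 to 1,435 bps, which contains the 1,433 bps plasmid. The helper name reads_near_mode is hypothetical, and treating both bounds as inclusive is an assumption rather than a documented behaviour of minidna.

```shell
# Hypothetical helper (not part of minidna.sh): list the reads whose length
# lies within mode +/- delta/2 (bounds assumed inclusive).
# usage: reads_near_mode <mode> <delta> < reads.fastq
reads_near_mode() {
  awk -v m="$1" -v d="$2" '
    NR % 4 == 1 { name = substr($1, 2) }
    NR % 4 == 2 {
      len = length($0)
      if (len >= m - d / 2 && len <= m + d / 2) print name "\t" len
    }
  '
}

# Toy example with mode 10 and delta 4: reads of 8 to 12 bases are kept.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGTACG\n+\nIIIIIIIIIII\n@r3\nACGTACGTACGTA\n+\nIIIIIIIIIIIII\n' \
  | reads_near_mode 10 4
```

Here r1 (8 bases) and r2 (11 bases) are listed, whereas r3 (13 bases) falls outside the window.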
The four inferred contigs are available in example/SRR26162837.fasta. Of note, ctg0001, ctg0002, ctg0003 and ctg0004 are 100% identical to CP135464, CP135463, CP135462 and CP135461, respectively.
Note that the very same inference can be obtained using either option
-e (requiring ~30 more seconds) or option -s
(halving the initial running time).
Bouras G, Grigson SR, Papudeshi B, Mallawaarachchi V, Roach MJ (2024) Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93):5968. doi:10.21105/joss.05968
Bouras G, Sheppard AE, Mallawaarachchi V, Vreugde S (2023) Plassembler: an automated bacterial plasmid assembly tool. Bioinformatics, 39(7):btad409. doi:10.1093/bioinformatics/btad409
Johnson J, Soehnlen M, Blankenship HM (2023) Long read genome assemblers struggle with small plasmids. Microbial Genomics, 9(5):001024. doi:10.1099/mgen.0.001024
Kolmogorov M, Yuan J, Lin Y, Pevzner P (2019) Assembly of Long Error-Prone Reads Using Repeat Graphs. Nature Biotechnology, 37:540-546. doi:10.1038/s41587-019-0072-8
Lerminiaux N, Fakharuddin K, Mulvey MR, Mataseje L (2024) Do we still need Illumina sequencing data? Evaluating Oxford Nanopore Technologies R10.4.1 flow cells and the Rapid v14 library prep kit for Gram negative bacteria whole genome assemblies. Canadian Journal of Microbiology, 70(5):178-189. doi:10.1139/cjm-2023-0175
Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191
Li H (2021) New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705
Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, Pevzner P (2016) Assembly of Long Error-Prone Reads Using de Bruijn Graphs. Proceedings of the National Academy of Sciences of the United States of America, 113(52):E8396-E8405. doi:10.1073/pnas.1604560113
Wick RR, Judd LM, Wyres KL, Holt KE (2021) Recovery of small plasmid sequences via Oxford Nanopore sequencing. Microbial Genomics, 7:000631. doi:10.1099/mgen.0.000631