fq2dna (FASTQ files to de novo assembly) is a command line tool written in Bash to ease the de novo assembly of archaea, bacteria or virus genomes from raw high-throughput sequencing (HTS) paired-end (PE) reads.
Every data pre- and post-processing step is managed by fq2dna (e.g. HTS read filtering and enhancing, well-tuned de novo assemblies, scaffold sequence accuracy assessment). The main purpose of fq2dna is to efficiently combine different methods, programs and tools to quickly infer accurate genome assemblies (e.g. 5 to 20 minutes to process a bacterial HTS sample using 12 threads). This mini-workflow is therefore well suited to processing large batches of whole-genome shotgun sequencing data.
fq2dna runs on UNIX, Linux and most OS X operating systems.
You will need to install the required programs and tools listed in the following table, or to verify that they are already installed with the required version.
| program | package | version | sources |
|---|---|---|---|
| bwa-mem2 | - | ≥ 2.2.1 | github.com/bwa-mem2/bwa-mem2 |
| contig_info | - | ≥ 2.1 | gitlab.pasteur.fr/GIPhy/contig_info |
| CoPro | - | ≥ 0.3 | gitlab.pasteur.fr/GIPhy/CoPro |
| FASTA2AGP | - | ≥ 2.0 | gitlab.pasteur.fr/GIPhy/FASTA2AGP |
| fqCleanER | - | ≥ 23.12 | gitlab.pasteur.fr/GIPhy/fqCleanER |
| fqstats | fqtools | ≥ 1.2 | ftp.pasteur.fr/pub/gensoft/projects/fqtools |
| ntCard | - | > 1.2 | github.com/bcgsc/ntCard |
| samtools | - | ≥ 1.18 | github.com/samtools/samtools sourceforge.net/projects/samtools |
| SPAdes | - | ≥ 3.15.5 | github.com/ablab/spades |
| program | package | version | sources |
|---|---|---|---|
| Bakta | - | ≥ 1.11.4 | github.com/oschwengers/bakta |
| Platon | - | ≥ 1.6 | github.com/oschwengers/platon |
| program | package | version | sources |
|---|---|---|---|
| cat cp du echo md5sum mkdir mktemp mv paste rm seq sort tr wc | coreutils★ | > 8.0 | ftp.gnu.org/gnu/coreutils |
| gawk | - | > 4.0.0 | ftp.gnu.org/gnu/gawk |
| grep | - | > 2.0 | ftp.gnu.org/gnu/grep |
| sed | - | > 4.2 | ftp.gnu.org/gnu/sed |
★ On OS X, the package coreutils can be
installed via Homebrew
(e.g. brew install coreutils).
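The version constraints listed in the tables above can also be checked by hand. The following is a minimal sketch, not part of fq2dna (which has its own dependency check, option -d): the helper name `version_ge` is hypothetical, and it relies on the version sort (`-V`) provided by GNU `sort`.

```shell
# version_ge INSTALLED REQUIRED
# Succeeds (exit status 0) when INSTALLED >= REQUIRED, comparing dot-separated
# version strings with the version sort of GNU sort (-V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example with the minimum samtools version listed above (1.18); the installed
# version string (1.19.2) is made up for the illustration.
if version_ge "1.19.2" "1.18"; then
  echo "samtools version is recent enough"
fi
```

Note that this helper implements "greater than or equal to"; the rows marked with ">" in the tables would need a strict comparison.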
A. Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/fq2dna.git
B. Go to the created directory and give the execute
permission to the file fq2dna.sh:
cd fq2dna/
chmod +x fq2dna.sh
C. Check the dependencies (and their version) using the following command line:
./fq2dna.sh -d
If a required program is not available on your $PATH variable (or if one compiled binary has a different
default name), it should be manually specified. To specify the location
of a specific binary, edit the file fq2dna.sh and indicate
the local path to the corresponding binary(ies) within the code block
DEPENDENCIES (approximately lines 150-210). For each
required program, the table below reports the corresponding variable
assignment instruction to edit (if needed) within the code block
DEPENDENCIES
| program | variable assignment | program | variable assignment |
|---|---|---|---|
| Bakta★ | BAKTA_BIN=bakta; | fqstats | FQSTATS_BIN=fqstats; |
| bwa-mem2 | BWAMEM2_BIN=bwa-mem2; | gawk | GAWK_BIN=gawk; |
| contig_info | CONTIG_INFO_BIN=contig_info; | ntcard | NTCARD_BIN=ntcard; |
| CoPro | COPRO_BIN=CoPro; | Platon★ | PLATON_BIN=platon; |
| FASTA2AGP | FASTA2AGP_BIN=FASTA2AGP; | samtools | SAMTOOLS_BIN=samtools; |
| fqCleanER | FQCLEANER_BIN=fqCleanER; | SPAdes | SPADES_BIN=spades.py; |
★ Optional programs are not required to be installed/accessible for fq2dna to run.
Note that depending on the installation of some required programs,
the corresponding variable can be assigned with complex commands. For
example, as FASTA2AGP is a Java tool that can be run using a
Java virtual machine, the executable jar file FASTA2AGP.jar
can be used by fq2dna by editing the corresponding variable
assignment instruction as follows:
FASTA2AGP_BIN="java -jar $PathTo/FASTA2AGP.jar".
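The reason why a multi-word command can be stored in such a variable is sketched below. This is only an illustration of standard shell word splitting: the names `TOOL_BIN` and `run_tool` are hypothetical, `echo` stands in for the real program, and the way fq2dna actually expands its `*_BIN` variables is assumed here, not taken from its source code.

```shell
# Hypothetical stand-in for a multi-word command such as
# "java -jar $PathTo/FASTA2AGP.jar" (echo is used so the sketch runs anywhere).
TOOL_BIN="echo running:"

# run_tool ARGS...: the variable is expanded unquoted on purpose, so that the
# shell splits its value into the command name and its leading arguments.
run_tool() {
  $TOOL_BIN "$@"
}

run_tool scaffolds.fasta
```

One consequence of this mechanism is that a path containing blank characters cannot be stored in such a variable, since it would be split as well.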
D. Execute fq2dna with the following command line model:
./fq2dna.sh [options]
Run fq2dna without any option to read the following documentation:
USAGE: fq2dna.sh [options]
Processing and assembling high-throughput sequencing (HTS) paired-end (PE) reads:
+ processing HTS reads using different steps: deduplicating [D], trimming/clipping [T], error
correction [E], contaminant removal [C], merging [M], and/or digital normalization [N]
+ de novo assembly [dna] of the whole genome from processed HTS reads, followed by polishing
and annotation steps (depending on the specified strategy)
OPTIONS:
-1 <infile> fwd (R1) FASTQ input file name from PE library 1 || input files can be compressed
-2 <infile> rev (R2) FASTQ input file name from PE library 1 || using either gzip (file
-3 <infile> fwd (R1) FASTQ input file name from PE library 2 || extension .gz), bzip2 (.bz or
-4 <infile> rev (R2) FASTQ input file name from PE library 2 || .bz2), or DSRC (.dsrc or
-5 <infile> fwd (R1) FASTQ input file name from PE library 3 || .dsrc2), or uncompressed (.fq
-6 <infile> rev (R2) FASTQ input file name from PE library 3 || or .fastq)
-o <outdir> path and name of the output directory (mandatory option)
-b <string> base name for output files (mandatory option)
-s <char> to set a predefined strategy for processing HTS reads among the following ones:
A Archaea: DT(C)E + [N|MN] + dna + polishing + annotation
B Bacteria: DT(C)E + [N|MN] + dna + polishing + annotation
E Eukaryote: DT(C) + [N|MN] + dna
G Germs: DT(C)E + [N|MN] + dna + polishing + annotation + classification
P Prokaryote: DT(C)E + [N|MN] + dna + polishing
S Standard: DT(C) + [N|MN] + dna + polishing
V Virus: DT(C)E + [N|MN] + dna + polishing + annotation
(default: S)
-L <int> minimum required length for a contig (default: 300)
-T <"G S I"> Genus (G), Species (S) and Isolate (I) names to be used during annotation step;
should be set between quotation marks and separated by a blank space; only with
options -s A, -s B, -s G or -s V (default: "Genus sp. STRAIN")
-q <int> quality score threshold; all bases with Phred score below this threshold are
considered as non-confident during step [T] (default: 20)
-l <int> minimum required length for a read (default: half the average read length)
-p <int> maximum allowed percentage of non-confident bases (as ruled by option -q) per
read (default: 50)
-c <int> minimum allowed coverage depth during step [N] (default: 2)
-C <int> maximum allowed coverage depth during step [N] (default: 60)
-a <infile> to set a file containing every alien oligonucleotide sequence (one per line) to
be clipped during step [T] (see below)
-a <string> one or several key words (separated with commas), each corresponding to a set of
alien oligonucleotide sequences to be clipped during step [T]:
POLY nucleotide homopolymers
NEXTERA Illumina Nextera index Kits
IUDI Illumina Unique Dual index Kits
AMPLISEQ AmpliSeq for Illumina Panels
TRUSIGHT_PANCANCER Illumina TruSight RNA Pan-Cancer Kits
TRUSEQ_UD Illumina TruSeq Unique Dual index Kits
TRUSEQ_CD Illumina TruSeq Combinatorial Dual index Kits
TRUSEQ_SINGLE Illumina TruSeq Single index Kits
Note that these sets of alien sequences are not exhaustive and will never replace
the exact oligos used for library preparation (default: "POLY")
-a AUTO to infer 3' alien oligonucleotide sequence(s) for step [T]; inferred oligo(s) are
completed with those from "POLY" (see above)
-A <infile> to set sequence or k-mer model file(s) to carry out contaminant read removal
during step [C]; several comma-separated file names can be specified; allowed
file extensions: .fa, .fasta, .fna, .kmr or .kmz
-e extensive de novo assembly (slow; default: not set)
-t <int> number of threads (default: 12)
-w <dir> path to tmp directory (default: $TMPDIR, otherwise /tmp)
-x to not remove (after completing) the tmp directory inside the one set with option
-w (default: not set)
-d check dependencies and exit
-h print this help and exit
EXAMPLES:
fq2dna.sh -1 r1.fastq -2 r2.fastq -o out -b ek12 -s G -T "Escherichia coli K12" -a NEXTERA
fq2dna.sh -1 rA.1.fq -2 rA.2.fq -3 rB.1.fq.gz -4 rB.2.fq.gz -o out -b name -a AUTO
fq2dna.sh -1 hts.1.fq.gz -2 hts.2.fq.gz -o out -b bact -s B -a IUDI -t 24 -e
fq2dna is able to consider up to three paired-end
libraries (PE; options -1 to -6). FASTQ input
files are read by fqCleanER;
they can be compressed using gzip, bzip2 or DSRC (Roguski
and Deorowicz 2014).
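The mapping between file extensions and compression formats given in the help text can be summarized as follows. This is a minimal sketch (the function name `fastq_compression` is hypothetical); it only inspects file names and does not reflect how fqCleanER actually detects the format.

```shell
# fastq_compression FILE
# Prints the compression format implied by the file extension, following the
# extensions listed in the fq2dna help text; fails on any other extension.
fastq_compression() {
  case "$1" in
    *.gz)            echo "gzip" ;;
    *.bz|*.bz2)      echo "bzip2" ;;
    *.dsrc|*.dsrc2)  echo "DSRC" ;;
    *.fq|*.fastq)    echo "uncompressed" ;;
    *)               echo "unknown"; return 1 ;;
  esac
}

# File names borrowed from the EXAMPLES section of the help text.
for f in r1.fastq rB.1.fq.gz hts.2.fq.gz; do
  printf '%s\t%s\n' "$f" "$(fastq_compression "$f")"
done
```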
In brief, fq2dna first performs initial basic HTS read
preprocessing (i.e. step I) using fqCleanER,
i.e. deduplication (D), trimming/clipping (T; as ruled by options
-a, -q, -l and -p)
and error correction (E); when specified (option -A), a
contaminating HTS read removal step (C) can also be performed.
Next,
fq2dna creates two distinct datasets from the preprocessed HTS
reads: a first one (i.e. step N) obtained using a
digital normalization procedure (N; as ruled by options
-C and -c), and a second one (i.e. step
M) by merging the PE HTS reads with short insert size
(M) followed by a digital normalization procedure. Each
of these two HTS datasets is used to infer a de novo genome
assembly (dnaN and dnaM, respectively) using SPAdes
(Bankevich et al. 2012).
Among the two genome assemblies, the less
fragmented one is retained, and next polished (i.e. correcting putative
local assembly errors such as mismatches and short indels) using samtools on the aligned step
I HTS reads. Polished scaffolds are then used together
with their generating (step N or M)
HTS reads to infer a genome coverage profile using CoPro. Based
on the coverage confidence interval derived from the coverage profile,
insufficiently covered scaffold sequences are discarded, whereas the
remaining ones are enhanced.
Depending on the specified strategy
(option -s), selected scaffold sequences can be classified
as chromosome/plasmid/undetermined (i.e. CPU) using Platon
(Schwengers et al. 2020) and/or annotated using Bakta
(Schwengers et al. 2021), provided that these two optional tools are
available.
It is generally not recommended to modify the default values for
trimming options -q, -l and -p.
Increasing options -q and -l and/or decreasing
option -p can negatively affect the local coverage of some
sequenced regions.
Use option -a AUTO whenever you do not know which
DNA library preparation kit was used. Setting -a AUTO will run the
tool AlienDiscover
to infer the technical oligonucleotide(s) occurring at the 3’ end of the reads (if
any).
By default, fq2dna generally assembles scaffold
sequences with high N50 and auN metrics when run on
high-quality input data (e.g. high %Q20, coverage depth ≥ 100×).
If needed, option -e can be used to run SPAdes
more extensively, at the cost of longer running times.
Temporary files are written into a dedicated directory created
in the $TMPDIR directory (when defined, otherwise
/tmp/). When possible, it is highly recommended to set a
temporary directory with large capacity (option -w).
Output files are all defined by the same specified prefix
(mandatory option -b) and written into a specified output
directory (mandatory option -o). Each output file content
is determined by its file extension:
| output file | file content |
|---|---|
| <prefix>.stepI.log | fqCleanER log file of the step I |
| <prefix>.stepM.log | fqCleanER log file of the step M |
| <prefix>.stepN.log | fqCleanER log file of the step N |
| <prefix>.all.fasta | all the scaffolds assembled by SPAdes (FASTA format) |
| <prefix>.cov.info.txt | genome coverage profile summary generated using CoPro |
| <prefix>.scf.fasta | selected and polished scaffold sequences (FASTA format) |
| <prefix>.scf.info.txt | residue content of the selected scaffold sequences (tab-delimited) |
| <prefix>.scf.amb.txt | ambiguously assembled bases (tab-delimited) |
| <prefix>.agp.fasta | contigs derived from the scaffold sequences in <prefix>.scf.fasta (FASTA format) |
| <prefix>.agp | scaffolding information associated to <prefix>.agp.fasta (AGP format) |
| <prefix>.dna.info.txt | descriptive statistics of each FASTA file content (tab-delimited) |
| <prefix>.isd.txt | descriptive statistics of the insert size distribution for each specified PE FASTQ files |
| <prefix>.QCsum.txt | quality control metrics (tab-delimited) |
| <prefix>.gbk | annotated scaffold sequences (GenBank flat file format) |
| <prefix>.gbk.info.txt | annotation statistics generated by Bakta |
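The naming scheme in the table above makes it easy to check that a run completed. The following is a minimal sketch (the function name `missing_outputs` is hypothetical); it only considers the output files that do not depend on the optional annotation step, and leaves out the step log files.

```shell
# missing_outputs OUTDIR PREFIX
# Prints every expected output file that does not exist; the exit status is
# non-zero when at least one file is reported.
missing_outputs() {
  status=0
  for ext in all.fasta cov.info.txt scf.fasta scf.info.txt scf.amb.txt \
             agp.fasta agp dna.info.txt isd.txt QCsum.txt; do
    f="$1/$2.$ext"
    if [ ! -e "$f" ]; then
      echo "$f"
      status=1
    fi
  done
  return "$status"
}

# Example: output directory (option -o) and prefix (option -b) of a run.
missing_outputs 2HF33 Lm.2HF33 || echo "some output files are missing" >&2
```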
A large set of descriptive statistics and metrics is available in some output files. The descriptive statistics of the three FASTA files are written in <prefix>.dna.info.txt (e.g. numbers Nseq and Nres of assembled sequences and bases, respectively, auN, N50, …). The residue content of each final scaffold sequence is written in <prefix>.scf.info.txt (e.g. length, GC-content, …). Statistics associated to the genome coverage profile analysis are summarized in <prefix>.cov.info.txt (e.g. low coverage cutoff).
In order to quickly assess the overall accuracy of the assembled genome, different quality control metrics are summarized in <prefix>.QCsum.txt:
| QCsum field | QCsum field content |
|---|---|
| name | sample name (option -b) |
| Npe | number of PE reads |
| Nbps | number of sequenced bases |
| ISavg | average insert size |
| covI | initial coverage depth |
| Nctg | number of assembled contigs |
| Nres | total length of the assembled contigs |
| %GC | GC-content |
| Namb | number of ambiguous positions in the assembled scaffolds |
| N50 | N50 of the assembled contigs |
| auN | auN of the assembled contigs |
| momo | mono-modality index of the coverage profile |
| R2 | R2 metric derived from the coverage profile |
| covE | expected coverage depth of the genome assembly (option -C) |
| covO | observed coverage depth of the genome assembly |
| covF | mean of the theoretical distribution fitted to the coverage profile |
Default options lead to a de novo assembly inferred from
a subset of high-quality HTS reads corresponding to 60× coverage depth
(option -C), which is a good tradeoff to observe accurate
results with fast running times. It is therefore expected that the
resulting genome coverage profile (output file
<prefix>.cov.info.txt) is a unimodal histogram with both mean
(covO) and mode close to the default 60×. Strong deviation from
the specified coverage depth (option -C) can be indicative
of a sequencing problem, e.g. sample contamination (bimodal histogram,
assessed by e.g. momo ≤ 0.99), low coverage depth (i.e. small
covO and/or covF), locally insufficient coverage depth
(negative skewness, assessed by e.g. R2 ≤ 0.6). For more
details about these quality control metrics, see the documentation of CoPro.
Importantly, the genome coverage profile
(<prefix>.cov.info.txt) is also used to fit a theoretical (Generalized
Poisson; GP) distribution. Such a GP distribution makes it possible to
determine a coverage depth cutoff, under which any observed coverage
depth is considered as significantly low (p-value ≤ 0.00005;
for more details, see the documentation of CoPro). Every
assembled base with coverage depth lower than the estimated coverage
cutoff (low in <prefix>.cov.info.txt) is written in
lowercase into <prefix>.scf.fasta and should be considered with
caution. If the selected scaffold sequences (i.e. with average coverage
depth greater than the low coverage cutoff) do not seem to
represent the whole genome, consider the entire set of assembled
scaffolds (output file <prefix>.all.fasta) and/or rerun
fq2dna with a smaller expected coverage (option
-C).
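Since low-coverage bases are written in lowercase, their amount can be estimated per scaffold directly from the FASTA file. The following is a minimal sketch (the function name `count_lowercase` and the sequence names in the example are hypothetical); it assumes a standard FASTA layout and takes the first word of each header as the sequence name.

```shell
# count_lowercase: reads FASTA on standard input and prints, for each
# sequence, its name, its length and its number of lowercase residues.
count_lowercase() {
  LC_ALL=C awk '
    function report() { if (name != "") printf "%s\t%d\t%d\n", name, total, low }
    /^>/ { report(); name = substr($1, 2); total = 0; low = 0; next }
         { total += length($0); tmp = $0; low += gsub(/[a-z]/, "", tmp) }
    END  { report() }'
}

# Toy input: 5 lowercase residues out of 14 in the first sequence, none in
# the second one.
count_lowercase <<'EOF'
>scaffold_A
ACGTacgtAC
GTNn
>scaffold_B
ACGT
EOF
```

On real data, the same function can be fed with `<prefix>.scf.fasta`, e.g. `count_lowercase < 2HF33/Lm.2HF33.scf.fasta`.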
An ambiguous base N is a multi-allelic position that is sufficiently covered by its generating HTS reads. These Namb bases are listed in the tab-delimited output file <prefix>.scf.amb.txt (one per line) together with an estimate of the sequenced content corresponding to this assembled position (i.e. proportions of A, C, G, T and gaps, respectively, estimated from the alignment of the generating HTS reads). It is worth noting that a large number Namb of ambiguous bases can sometimes be the result of a contaminated dataset.
For more details about the different assembly and scaffold sequence statistics (output files <prefix>.dna.info.txt and <prefix>.scf.info.txt, respectively), see the documentation of contig_info.
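As a reminder of what the N50 and auN metrics measure, the sketch below computes them from a list of sequence lengths using their usual definitions: N50 is the largest length L such that the sequences of length at least L contain at least half of the residues, and auN is the sum of the squared lengths divided by the total length. The function name `n50_aun` is hypothetical, and the rounding may differ from the one applied by contig_info.

```shell
# n50_aun: reads one sequence length per line on standard input and prints
# "N50<TAB>auN" (auN rounded to the nearest integer).
n50_aun() {
  sort -rn | awk '
    { len[NR] = $1; total += $1; sumsq += $1 * $1 }
    END {
      if (total == 0) exit 1
      cum = 0
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (2 * cum >= total) { n50 = len[i]; break }
      }
      printf "%d\t%.0f\n", n50, sumsq / total
    }'
}

# Toy example: four sequences of total length 100.
# The two longest ones (40 + 30) reach half of the total, hence N50 = 30;
# auN = (1600 + 900 + 400 + 100) / 100 = 30.
printf '%s\n' 40 30 20 10 | n50_aun
```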
For more details about the chromosome/plasmid/undetermined
classification of each scaffold sequence (i.e. column CPU in
<prefix>.scf.info.txt when using option -s G), see
the documentation of Platon.
For more details about the annotation procedure when using
strategies A, B, G or V (option -s; output files
<prefix>.gbk and <prefix>.gbk.info.txt), see the
documentation of Bakta.
In order to illustrate the usefulness of fq2dna and to better describe its output files, the following use case example describes its usage for (re)assembling the draft genome of Listeria monocytogenes 2HF33 (Duru et al. 2020).
All output files are available in the directory example/ (the four large sequence files are compressed using gzip), as well as the version of every used tool and program (see program.versions.txt).
Downloading input files
Paired-end sequencing of this genome was performed using Illumina MiSeq, and the resulting pair of (compressed) FASTQ files (112 Mb and 128 Mb, respectively) can be downloaded using the following command lines:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR403/006/ERR4032786/ERR4032786_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR403/006/ERR4032786/ERR4032786_2.fastq.gz
As the phiX genome was used as spike-in during Illumina sequencing (Duru et al. 2020), this putative contaminating sequence (5.4 kb) can be downloaded using the following command line:
wget -O phiX.fasta "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_001422.1"
Running fq2dna
Below is a typical command line to run fq2dna on PE FASTQ files to deal with bacteria HTS data:
./fq2dna.sh -1 ERR4032786_1.fastq.gz -2 ERR4032786_2.fastq.gz -a AUTO -A phiX.fasta -s G \
            -o 2HF33 -b Lm.2HF33 -T "Listeria monocytogenes 2HF33" -t 12 -w /tmp
Of note, option -a AUTO was set to let fq2dna
infer the technical oligonucleotide used during library preparation,
whereas the putative contaminating sequence file phiX.fasta was
set using option -A. The strategy ‘Germs’ was set (option
-s G) in order to also perform both scaffold annotation and
classification.
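When many samples have to be processed, similar command lines can be generated automatically. The following is a minimal sketch (the function name `fq2dna_batch_cmds` is hypothetical); it assumes that every sample is stored as a pair of files named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, uses the strategy B as an example, and only prints the command lines without running them.

```shell
# fq2dna_batch_cmds DIR OUTROOT
# Prints (without executing) one fq2dna command line per pair of files named
# <sample>_1.fastq.gz / <sample>_2.fastq.gz found in DIR.
fq2dna_batch_cmds() {
  for r1 in "$1"/*_1.fastq.gz; do
    [ -e "$r1" ] || continue          # no match: the pattern is left as is
    r2="${r1%_1.fastq.gz}_2.fastq.gz"
    if [ ! -e "$r2" ]; then
      echo "missing R2 file for $r1" >&2
      continue
    fi
    sample="$(basename "$r1" _1.fastq.gz)"
    echo "./fq2dna.sh -1 $r1 -2 $r2 -a AUTO -s B -o $2/$sample -b $sample -t 12"
  done
}

# Dry run on the current directory; review the output before executing it.
fq2dna_batch_cmds . results
```

Printing the command lines first makes it possible to check them, and then to submit them one by one or through a job scheduler.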
Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies). Below is the one observed when using fq2dna together with program and tool versions listed in example/program.versions.txt:
# fq2dna v25.12
# Copyright (C) 2016-2025 Institut Pasteur
+ https://gitlab.pasteur.fr/GIPhy/fq2dna
> System: x86_64-redhat-linux-gnu
> Bash: 4.4.20(1)-release
# input file(s):
> PE lib 1
+ FQ11: [111.9 Mb] 541e697e87d7d80a70d29d3827f98d6d ERR4032786_1.fastq.gz
+ FQ12: [127.9 Mb] 651a7443ba56bb52cdf1b43ba9e9d844 ERR4032786_2.fastq.gz
# output directory
+ OUTDIR=2HF33
# tmp directory
+ TMP_DIR=/tmp/fq2dna.Lm.2HF33.lh617yARn
# STRATEGY G: Germs
[00:01] Deduplicating, Trimming, Clipping, Decontaminating and Correcting reads ... [ok]
[02:46] Merging and/or Normalizing reads ... [ok]
[04:01] Approximating genome size ... [ok]
> 3112666 bps (N=3112091 M=3113241)
[04:05] Assembling genome with/without PE read merging (dnaM/dnaN) .... [ok]
> [dnaN] covr=62 arl=285 k=21,33,55,77,99,121
> [dnaM] covr=60 arl=281 k=21,33,55,77,99,121
[06:26] Comparing genome assemblies ... [ok]
> [dnaN] Nseq=31 Nres=3099160 NG50=361897 auGN=462498
> [dnaM] Nseq=34 Nres=3099305 NG50=452690 auGN=346318
> selecting dnaN
> [dna] Nseq=31 Nres=3099160 N50=361897 auN=464514
[06:27] Aligning PE reads against scaffolds .... [ok]
[06:40] Polishing scaffolds ...
> [dna] Nseq=31 Nres=3099288 NG50=361909 auGN=462513
> [dna] Nseq=30 Nres=3099356 NG50=361921 auGN=462523
> [dna] Nseq=30 Nres=3099356 N50=361921 auN=464506
[07:24] Processing scaffolds ... [ok]
> expect. cov. 60
> avg. cov. 60.45
> cov. mode 60
> momo index 0.99913539
> accuracy 0.878995
> min. cov. cutoff 43
> ambiguous pos. 5
> [dna] Nseq=23 Nres=3096373 N50=361921 auN=464956
[07:43] Annotating scaffold sequences ... [ok]
> taxon: Listeria monocytogenes 2HF33
> locus tag: LISMON2HF33
# output files:
+ de novo assembly (all scaffolds): 2HF33/Lm.2HF33.all.fasta
+ coverage profile summary: 2HF33/Lm.2HF33.cov.info.txt
+ de novo assembly (selected scaffolds): 2HF33/Lm.2HF33.scf.fasta
+ sequence stats (selected scaffolds): 2HF33/Lm.2HF33.scf.info.txt
+ ambiguous positions (selected scaffolds): 2HF33/Lm.2HF33.scf.amb.txt
+ de novo assembly (selected contigs): 2HF33/Lm.2HF33.agp.fasta
+ scaffolding info: 2HF33/Lm.2HF33.agp
+ descriptive statistics: 2HF33/Lm.2HF33.dna.info.txt
+ insert size statistics: 2HF33/Lm.2HF33.isd.txt
+ annotation (selected scaffolds): 2HF33/Lm.2HF33.scf.gbk
+ annotation info: 2HF33/Lm.2HF33.scf.gbk.info.txt
+ quality control summary metrics: 2HF33/Lm.2HF33.QCsum.txt
[16:15] exit
This log output shows that the complete analysis was performed on 12 threads in ~16 minutes (HTS read processing in ~4 minutes, de novo assemblies in ~2 minutes, scaffold sequence post-processing in ~2 minutes, annotation and classification in ~8 minutes). One can also observe that the selected de novo assembly was dnaN (i.e. lower number of contigs, greater auGN metric), therefore showing that the PE HTS read merging step (M) does not always lead to better genome assemblies. The final genome assembly has a total size of 3.09 Mbps, represented in 23 contigs, with N50=361,921 bps (for comparison's sake, the assembly accession of the original assembly is GCF_902838845).
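The input section of the log reports a 32-character hexadecimal digest next to each FASTQ file, which appears to be an MD5 checksum (md5sum is among the listed dependencies). Under this assumption, downloaded files can be compared against the values of the log with a minimal sketch such as the following (the function name `check_md5` is hypothetical).

```shell
# check_md5 FILE EXPECTED
# Succeeds when the MD5 checksum of FILE (computed with GNU md5sum) is equal
# to EXPECTED.
check_md5() {
  observed="$(md5sum "$1" | cut -d ' ' -f 1)"
  [ "$observed" = "$2" ]
}

# Checksum copied from the log above; the test is skipped when the file has
# not been downloaded.
f=ERR4032786_1.fastq.gz
if [ -e "$f" ]; then
  if check_md5 "$f" 541e697e87d7d80a70d29d3827f98d6d; then
    echo "$f: OK"
  else
    echo "$f: checksum mismatch" >&2
  fi
fi
```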
Output files
The genome coverage profile summary file
(Lm.2HF33.cov.info.txt; see excerpt below) shows an average
coverage depth of ~60×, corresponding to a symmetrical coverage depth
distribution with mode 60×, identical to the expected value (default
option -C 60). The momo index is 0.9991 (above the
mono-modality cutoff, e.g. 0.99), therefore suggesting that no
contamination occurred in the HTS reads selected for assembly. The
fitted Generalized Poisson (GP) distribution has a mean of 60.09, and
its derived metric R2 (= 0.8798) suggests an expected uniform
coverage across the assembled positions.
== observed coverage distribution
no.pos 3101232
avg 60.45
mode 60
momo 0.99913539
** GP(lambda,theta) distribution
lambda 109.26055591
theta -0.81815798
mean 60.09
R2 0.87975610
-- confidence interval (p-value=0.00005)
low 43
sup 76
These different statistics therefore indicate that the assembled sequences are accurate and well supported by sufficient data. Finally, the GP distribution makes it possible to determine the low coverage cutoff (i.e. 43×) under which any assembled base is considered as putatively not trustworthy (lowercase characters in the scaffold sequence file Lm.2HF33.scf.fasta).
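The indicative cutoffs mentioned earlier (momo ≤ 0.99, R2 ≤ 0.6) can be checked automatically on such a summary. The following is a minimal sketch (the function names `coverage_flags` and `gp_mean` are hypothetical); it assumes that each line of the file starts with a field name followed by its value, as in the excerpt above. The second function recomputes the mean of the fitted distribution as lambda / (1 - theta), the usual expression for a Generalized Poisson distribution, which agrees with the values of the excerpt.

```shell
# coverage_flags: reads a coverage summary on standard input and prints one
# warning per metric at or below the indicative cutoffs quoted in the text.
coverage_flags() {
  awk '
    $1 == "momo" && $2 <= 0.99 { print "momo <= 0.99: possible contamination" }
    $1 == "R2"   && $2 <= 0.6  { print "R2 <= 0.6: locally insufficient coverage" }'
}

# gp_mean: prints lambda / (1 - theta), rounded to two decimal places.
gp_mean() {
  awk '
    $1 == "lambda" { lambda = $2 }
    $1 == "theta"  { theta = $2 }
    END { if (lambda != "" && theta != "") printf "%.2f\n", lambda / (1 - theta) }'
}

# Values copied from the excerpt above.
summary='momo 0.99913539
lambda 109.26055591
theta -0.81815798
R2 0.87975610'

printf '%s\n' "$summary" | coverage_flags   # prints nothing for this sample
printf '%s\n' "$summary" | gp_mean          # prints 60.09
```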
The tab-delimited file Lm.2HF33.dna.info.txt shows that the de novo assembly procedure led to quite few scaffold and contig sequences:
#File Nseq Nres A C G T N %A %C %G %T %N %AT %GC Min Q25 Med Q75 Max Avg auN N50 N75 N90 L50 L75 L90
Lm.2HF33.all.fasta 35 3101208 943528 604087 568021 985272 300 30.42% 19.47% 18.31% 31.77% 0.00% 62.21% 37.79% 237 372 1664 99066 1091349 88605.94 559194 532745 125907 97509 2 6 10
Lm.2HF33.scf.fasta 21 3096552 942169 603187 566991 984026 179 30.42% 19.47% 18.31% 31.77% 0.00% 62.21% 37.79% 628 4992 64421 125913 1091380 147454.85 560067 532745 125913 97509 2 6 10
Lm.2HF33.agp.fasta 23 3096385 942169 603187 566991 984026 12 30.42% 19.48% 18.31% 31.77% 0.00% 62.21% 37.79% 628 4992 64421 126576 935183 134625.43 464956 361921 125913 90129 3 7 11
Of note, 14 small (min. length = 300 bps; option -L)
and/or insufficiently covered sequences (cutoff = 43×; see
Lm.2HF33.cov.info.txt) were not selected for the final scaffold
sequence set (i.e. Lm.2HF33.scf.fasta), therefore leading to
the final assembled genome of total length Nres = 3,096,552
bps, represented in Nseq = 21 scaffold sequences (scaffold and
contig N50 = 532,745 and 361,921 bps, respectively).
The tab-delimited file Lm.2HF33.scf.info.txt (displayed below) shows different residue statistics for each final scaffold sequence. Note that each FASTA header in file Lm.2HF33.scf.fasta contains the k-mer coverage depth returned by SPAdes (covk), the read coverage depth estimated by fq2dna (covr), and the low coverage cutoff derived from the coverage profile analysis (cutoff).
#Seq Nres A C G T N %A %C %G %T %N %AT %GC Pval CPU
NODE_0001 1091380 330567 212931 195684 352105 93 30.28% 19.51% 17.92% 32.26% 0.00% 62.56% 37.44% 0.0116 C
NODE_0002 532745 154855 110898 91560 175432 0 29.06% 20.81% 17.18% 32.92% 0.00% 62.00% 38.00% 0.6892 C
NODE_0003 361921 113236 65663 74544 108478 0 31.28% 18.14% 20.59% 29.97% 0.00% 61.27% 38.73% 0.0032 C
NODE_0004 206455 68177 35148 42251 60874 5 33.02% 17.02% 20.46% 29.48% 0.00% 62.51% 37.49% 0.0091 C
NODE_0005 126576 37673 26289 21299 41315 0 29.76% 20.76% 16.82% 32.64% 0.00% 62.41% 37.59% 0.2115 C
NODE_0006 125913 41257 22119 25733 36804 0 32.76% 17.56% 20.43% 29.22% 0.00% 62.00% 38.00% 0.8299 C
NODE_0007 109341 32663 22787 17974 35916 1 29.87% 20.84% 16.43% 32.84% 0.00% 62.73% 37.27% 0.0147 C
NODE_0008 103533 29753 21604 16750 35350 76 28.73% 20.86% 16.17% 34.14% 0.07% 62.93% 37.07% 0.0003 C
NODE_0009 99090 32747 17157 20514 28672 0 33.04% 17.31% 20.70% 28.93% 0.00% 61.99% 38.01% 0.9410 C
NODE_0010 97509 28253 20285 17248 31723 0 28.97% 20.80% 17.68% 32.53% 0.00% 61.51% 38.49% 0.0245 C
NODE_0011 64421 21749 11088 13365 18219 0 33.76% 17.21% 20.74% 28.28% 0.00% 62.05% 37.95% 0.6433 C
NODE_0012 61185 17687 12639 9717 21142 0 28.90% 20.65% 15.88% 34.55% 0.00% 63.47% 36.53% 0.0000 P
NODE_0013 46823 14085 9566 7968 15204 0 30.08% 20.43% 17.01% 32.47% 0.00% 62.56% 37.44% 0.1143 C
NODE_0014 35062 10004 7708 6005 11345 0 28.53% 21.98% 17.12% 32.35% 0.00% 60.89% 39.11% 0.0229 C
NODE_0015 22536 6000 4813 3643 8080 0 26.62% 21.35% 16.16% 35.85% 0.00% 62.48% 37.52% 0.1343 U
NODE_0016 4992 1395 1052 1470 1073 2 27.94% 21.07% 29.44% 21.49% 0.04% 49.46% 50.54% 0.0000 U
NODE_0017 2806 724 661 386 1035 0 25.80% 23.55% 13.75% 36.88% 0.00% 62.69% 37.31% 0.4654 U
NODE_0018 1664 524 292 266 582 0 31.49% 17.54% 15.98% 34.97% 0.00% 66.47% 33.53% 0.0078 U
NODE_0019 1126 398 199 264 263 2 35.34% 17.67% 23.44% 23.35% 0.17% 58.81% 41.19% 0.1331 U
NODE_0020 846 182 200 236 228 0 21.51% 23.64% 27.89% 26.95% 0.00% 48.47% 51.53% 0.0017 U
NODE_0021 628 240 88 114 186 0 38.21% 14.01% 18.15% 29.61% 0.00% 67.84% 32.16% 0.0219 U
The last column CPU also shows the classification of each scaffold
sequence into the category ‘Chromosome’, ‘Plasmid’ or ‘Undetermined’
(i.e. C, P, U, respectively), as assessed by Platon
via the selected strategy -s G. Such a classification shows
that the 12th scaffold sequence (i.e. NODE_0012) is likely
a plasmid (subsequent BLAST searches indeed show that it is almost
identical to pLM6179).
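Putative plasmid scaffolds can be listed directly from this table. The following is a minimal sketch (the function name `plasmid_scaffolds` is hypothetical); it assumes that the CPU classification is the last column, as in the table displayed above, which is only the case when the classification step was run (option -s G).

```shell
# plasmid_scaffolds: reads a scaffold info table on standard input and prints
# the name and length of every scaffold whose last column (CPU) is "P".
plasmid_scaffolds() {
  awk '!/^#/ && $NF == "P" { print $1 "\t" $2 }'
}

# Three rows copied from the table above.
plasmid_scaffolds <<'EOF'
#Seq Nres A C G T N %A %C %G %T %N %AT %GC Pval CPU
NODE_0011 64421 21749 11088 13365 18219 0 33.76% 17.21% 20.74% 28.28% 0.00% 62.05% 37.95% 0.6433 C
NODE_0012 61185 17687 12639 9717 21142 0 28.90% 20.65% 15.88% 34.55% 0.00% 63.47% 36.53% 0.0000 P
NODE_0015 22536 6000 4813 3643 8080 0 26.62% 21.35% 16.16% 35.85% 0.00% 62.48% 37.52% 0.1343 U
EOF
```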
The tab-delimited file Lm.2HF33.scf.amb.txt (displayed below) lists the Namb = 5 assembled positions that are likely multi-allelic.
#seq lgt pos base %A %C %G %T %gap
NODE_0001 1091380 481973 N 50.0 0 50.0 0 0
NODE_0004 206455 203837 N 0 39.6 0 60.4 0
NODE_0004 206455 203838 N 0 42.6 0 57.4 0
NODE_0004 206455 203840 N 59.6 0 40.4 0 0
NODE_0004 206455 204068 N 37.0 0 63.0 0 0
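As noted earlier, many ambiguous positions can be a sign of contamination, so it can be useful to see how they are distributed among the scaffolds. The following is a minimal sketch (the function name `ambiguous_per_scaffold` is hypothetical); it assumes one position per line with the scaffold name in the first column, as in the file displayed above.

```shell
# ambiguous_per_scaffold: reads an ambiguous position table on standard input
# and prints the number of listed positions for each scaffold.
ambiguous_per_scaffold() {
  awk '!/^#/ && NF > 0 { n[$1]++ } END { for (s in n) print s "\t" n[s] }' | sort
}

# Rows copied from the file displayed above.
ambiguous_per_scaffold <<'EOF'
#seq lgt pos base %A %C %G %T %gap
NODE_0001 1091380 481973 N 50.0 0 50.0 0 0
NODE_0004 206455 203837 N 0 39.6 0 60.4 0
NODE_0004 206455 203838 N 0 42.6 0 57.4 0
NODE_0004 206455 203840 N 59.6 0 40.4 0 0
NODE_0004 206455 204068 N 37.0 0 63.0 0 0
EOF
```

Here, four of the five ambiguous positions lie within a short region close to the end of NODE_0004.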
Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov AS, Lesin V, Nikolenko S, Pham S, Prjibelski A, Pyshkin A, Sirotkin A, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA (2012) SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5):455-477. doi:10.1089/cmb.2012.0021.
Duru IC, Andreevskaya M, Laine P, Rode TN, Ylinen A, Løvdal T, Bar N, Crauwels P, Riedel CU, Bucur FI, Nicolau AI, Auvinen P (2020) Genomic characterization of the most barotolerant Listeria monocytogenes RO15 strain compared to reference strains used to evaluate food high pressure processing. BMC Genomics, 21:455. doi:10.1186/s12864-020-06819-0.
Lindner MS, Kollock M, Zickmann F, Renard BY (2013) Analyzing genome coverage profiles with applications to quality control in metagenomics. Bioinformatics, 29(10):1260-1267. doi:10.1093/bioinformatics/btt147.
Roguski L, Deorowicz S (2014) DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.
Schwengers O, Barth P, Falgenhauer L, Hain T, Chakraborty T, Goesmann A (2020) Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores. Microbial Genomics, 6(10):mgen000398. doi:10.1099/mgen.0.000398.
Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A (2021) Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11):000685. doi:10.1099/mgen.0.000685.
Abou Fayad A, Rafei R, Njamkepo E, Ezzeddine J, Hussein H, Sinno S, Gerges J-R, Barada S, Sleiman A, Assi M, Baakliny M, Hamedeh L, Mahfouz R, Dabboussi F, Feghali R, Mohsen Z, Rady A, Ghosn N, Abiad F, Abubakar A, Barakat A, Wauquier N, Quilici M-L, Hamze M, Weill F-X, Matar GM (2024) An unusual two-strain cholera outbreak in Lebanon, 2022-2023: a genomic epidemiology study. Nature Communications, 15:6963. doi:10.1038/s41467-024-51428-0
Crestani C, Arcari G, Landier A, Passet V, Garnier D, Brémont S, Armatys N, Carmi-Leroy A, Toubiana J, Badell E, Brisse S (2023) Corynebacterium ramonii sp. nov., a novel toxigenic member of the Corynebacterium diphtheriae species complex. Research in Microbiology, 174(7):104113. doi:10.1016/j.resmic.2023.104113
Crestani C, Passet V, Rethoret-Pasty M, Zidane N, Brémont S, Badell E, Criscuolo A, Brisse S (2025) Microevolution and genomic epidemiology of the diphtheria-causing zoonotic pathogen Corynebacterium ulcerans. Nature Communications, 16:4843. doi:10.1038/s41467-025-60065-0
Crippa C, Pasquali F, Rodrigues C, De Cesare A, Lucchi A, Gambi L, Manfreda G, Brisse S, Palma F (2023) Genomic features of Klebsiella isolates from artisanal ready-to-eat food production facilities. Scientific Reports, 13:10957. doi:10.1038/s41598-023-37821-7
Hawkey J, Frézal L, Tran Dien A, Zhukova A, Brown D, Chattaway MA, Simon S, Izumiya H, Fields PI, De Lappe N, Kaftyreva L, Xu X, Isobe J, Clermont D, Njamkepo E, Akeda Y, Issenhuth-Jeanjean S, Makarova M, Wang Y, Hunt M, Jenkins BM, Ravel M, Guibert V, Serre E, Matveeva Z, Fabre L, Cormican M, Yue M, Zhu B, Morita M, Iqbal Z, Nodari CS, Pardos de la Gandara M, Weill F-X (2024) Genomic perspective on the bacillus causing paratyphoid B fever. Nature Communications, 15:10143. doi:10.1038/s41467-024-54418-4
Kämpfer P, Glaeser SP, Busse H-J, McInroy JA, Clermont D, Criscuolo A (2022) Pseudoneobacillus rhizosphaerae gen. nov., sp. nov., isolated from maize root rhizosphere. International Journal of Systematic and Evolutionary Microbiology, 72:5. doi:10.1099/ijsem.0.005367
Kämpfer P, Lipski A, Lamothe L, Clermont D, Criscuolo A, McInroy JA, Glaeser SP (2022) Paenibacillus allorhizoplanae sp. nov. from the rhizoplane of a Zea mays root. Archives of Microbiology, 204:630. doi:10.1007/s00203-022-03225-w
Kämpfer P, Lipski A, Lamothe L, Clermont D, Criscuolo A, McInroy JA, Glaeser SP (2023) Paenibacillus plantiphilus sp. nov. from the plant environment of Zea mays. Antonie van Leeuwenhoek, 116:883-892. doi:10.1007/s10482-023-01852-x
Kämpfer P, Lipski A, McInroy JA, Clermont D, Criscuolo A, Glaeser SP (2022) Bacillus rhizoplanae sp. nov. from maize roots. International Journal of Systematic and Evolutionary Microbiology, 72:7. doi:10.1099/ijsem.0.005450
Kämpfer P, Glaeser SP, McInroy JA, Busse H-J, Clermont D, Criscuolo A (2024) Description of Cohnella rhizoplanae sp. nov., isolated from the root surface of soybean (Glycine max). Antonie van Leeuwenhoek, 118:41. doi:10.1007/s10482-024-02051-y
Palma F, Mangone I, Janowicz A, Moura A, Chiaverini A, Torresi M, Garofolo G, Criscuolo A, Brisse S, Di Pasquale A, Cammà C, Radomski N (2022) In vitro and in silico parameters for precise cgMLST typing of Listeria monocytogenes. BMC Genomics, 23:235. doi:10.1186/s12864-022-08437-4
Rahi P, Mühle E, Scandola C, Touak G, Clermont D (2024) Genome sequence-based identification of Enterobacter strains and description of Enterobacter pasteurii sp. nov. Microbiology Spectrum, 12:e03150-23. doi:10.1128/spectrum.03150-23
Shelomi M, Han C-J, Chen W-M, Chen H-K, Liaw S-J, Mühle E, Clermont D (2023) Chryseobacterium oryctis sp. nov., isolated from the gut of the beetle Oryctes rhinoceros, and Chryseobacterium kimseyorum sp. nov., isolated from a stick insect rearing cage. International Journal of Systematic and Evolutionary Microbiology, 73:4. doi:10.1099/ijsem.0.005813
Vautrin N, Alexandre K, Pestel-Caron M, Bernard E, Fabre R, Leoz M, Dahyot S, Caron F (2023) Contribution of Antibiotic Susceptibility Testing and CH Typing Compared to Next-Generation Sequencing for the Diagnosis of Recurrent Urinary Tract Infections Due to Genetically Identical Escherichia coli Isolates: a Prospective Cohort Study of Cystitis in Women. Microbiology Spectrum, 11(4):e02785-22. doi:10.1128/spectrum.02785-22