GPLv3 license Bash publication

fq2dna

fq2dna (FASTQ files to de novo assembly) is a command line tool written in Bash to ease the de novo assembly of archaea, bacteria or virus genomes from raw high-throughput sequencing (HTS) paired-end (PE) reads.

Every data pre- and post-processing step is managed by fq2dna (e.g. HTS read filtering and enhancing, well-tuned de novo assemblies, scaffold sequence accuracy assessment). The main purpose of fq2dna is to efficiently combine different methods, programs and tools to quickly infer accurate genome assemblies (e.g. 5 to 20 minutes to process a bacterial HTS sample using 12 threads). This mini-workflow can therefore be very useful to deal with large batches of whole-genome shotgun sequencing data.

fq2dna runs on UNIX, Linux and most OS X operating systems.

Dependencies

You will need to install the required programs and tools listed in the following table, or to verify that they are already installed with the required version.

Mandatory programs

program      package  version   sources
bwa-mem2     -        ≥ 2.2.1   github.com/bwa-mem2/bwa-mem2
contig_info  -        ≥ 2.1     gitlab.pasteur.fr/GIPhy/contig_info
CoPro        -        ≥ 0.3     gitlab.pasteur.fr/GIPhy/CoPro
FASTA2AGP    -        ≥ 2.0     gitlab.pasteur.fr/GIPhy/FASTA2AGP
fqCleanER    -        ≥ 23.12   gitlab.pasteur.fr/GIPhy/fqCleanER
fqstats      fqtools  ≥ 1.2     ftp.pasteur.fr/pub/gensoft/projects/fqtools
ntCard       -        > 1.2     github.com/bcgsc/ntCard
samtools     -        ≥ 1.18    github.com/samtools/samtools
                                sourceforge.net/projects/samtools
SPAdes       -        ≥ 3.15.5  github.com/ablab/spades

Optional programs

program      package  version   sources
Bakta        -        ≥ 1.11.4  github.com/oschwengers/bakta
Platon       -        ≥ 1.6     github.com/oschwengers/platon

Standard GNU packages and utilities

program                             package    version  sources
cat, cp, du, echo, md5sum, mkdir,   coreutils  > 8.0    ftp.gnu.org/gnu/coreutils
mktemp, mv, paste, rm, seq, sort,
tr, wc
gawk                                -          > 4.0.0  ftp.gnu.org/gnu/gawk
grep                                -          > 2.0    ftp.gnu.org/gnu/grep
sed                                 -          > 4.2    ftp.gnu.org/gnu/sed

On OS X, the package coreutils can be installed via Homebrew (e.g. brew install coreutils).
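As a quick complement, the sketch below lists which of the default binary names are not reachable through $PATH. It is only a minimal illustration based on the shell builtin command -v: the function name missing_programs is hypothetical (it is not part of fq2dna), the binary names are the default ones reported in the section Installation and execution, and no version is checked (use ./fq2dna.sh -d for that).

```bash
#!/usr/bin/env bash
# missing_programs (hypothetical helper): print every program name given as
# argument that cannot be found on $PATH; print nothing if all are available.
missing_programs() {
  local prog
  for prog in "$@"; do
    command -v "$prog" >/dev/null 2>&1 || echo "$prog"
  done
  return 0
}

# Default binary names of the mandatory programs (see the table of variable
# assignments in the section Installation and execution)
missing_programs bwa-mem2 contig_info CoPro FASTA2AGP fqCleanER fqstats \
                 ntcard samtools spades.py gawk
```

Any name printed by this sketch likely corresponds to a program that must be installed, or whose location must be specified in fq2dna.sh.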

Installation and execution

A. Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/fq2dna.git

B. Go to the created directory and give the execute permission to the file fq2dna.sh:

cd fq2dna/
chmod +x fq2dna.sh

C. Check the dependencies (and their version) using the following command line:

./fq2dna.sh  -d

If at least one of the required programs (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), it should be manually specified. To specify the location of a specific binary, edit the file fq2dna.sh and indicate the local path to the corresponding binary(ies) within the code block DEPENDENCIES (approximately lines 150-210). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block DEPENDENCIES.

program      variable assignment            program   variable assignment
Bakta        BAKTA_BIN=bakta;               fqstats   FQSTATS_BIN=fqstats;
bwa-mem2     BWAMEM2_BIN=bwa-mem2;          gawk      GAWK_BIN=gawk;
contig_info  CONTIG_INFO_BIN=contig_info;   ntcard    NTCARD_BIN=ntcard;
CoPro        COPRO_BIN=CoPro;               Platon    PLATON_BIN=platon;
FASTA2AGP    FASTA2AGP_BIN=FASTA2AGP;       samtools  SAMTOOLS_BIN=samtools;
fqCleanER    FQCLEANER_BIN=fqCleanER;       SPAdes    SPADES_BIN=spades.py;

Optional programs are not required to be installed/accessible for fq2dna to run.

Note that depending on the installation of some required programs, the corresponding variable can be assigned with complex commands. For example, as FASTA2AGP is a Java tool that can be run using a Java virtual machine, the executable jar file FASTA2AGP.jar can be used by fq2dna by editing the corresponding variable assignment instruction as follows: FASTA2AGP_BIN="java -jar $PathTo/FASTA2AGP.jar".
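The reason why a multi-word value can work in such an assignment is sketched below with a harmless stand-in command. This is only an illustration of Bash word splitting, under the assumption (not verified against the code of fq2dna.sh) that the variable is expanded unquoted when the tool is called; DEMO_BIN and run_demo are hypothetical names.

```bash
#!/usr/bin/env bash
# Stand-in for a multi-word command such as "java -jar /some/dir/Tool.jar"
# (DEMO_BIN and run_demo are hypothetical names used for illustration only).
DEMO_BIN="printf %s,"

run_demo() {
  # $DEMO_BIN is deliberately left unquoted: the shell splits it on blanks
  # into the command name (printf) and its first argument (%s,).
  $DEMO_BIN "$@"
}

run_demo a b   # runs: printf %s, a b
echo           # final newline
```

Because the splitting occurs on blank characters, a jar file path that itself contains blank spaces would probably not work with this kind of assignment.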

D. Execute fq2dna with the following command line model:

./fq2dna.sh  [options]

Usage

Run fq2dna without any option to read the following documentation:

 USAGE:  fq2dna.sh  [options]

 Processing and assembling high-throughput sequencing (HTS) paired-end (PE) reads:
  + processing HTS reads  using different steps:  deduplicating [D], trimming/clipping [T], error
    correction [E], contaminant removal [C], merging [M], and/or digital normalization [N]
  + de novo assembly [dna] of  the whole genome from  processed HTS reads,  followed by polishing
    and annotation steps (depending on the specified strategy)

 OPTIONS:
  -1 <infile>   fwd (R1) FASTQ input file name from PE library 1 || input files can be compressed
  -2 <infile>   rev (R2) FASTQ input file name from PE library 1 || using   either   gzip   (file
  -3 <infile>   fwd (R1) FASTQ input file name from PE library 2 || extension .gz), bzip2 (.bz or
  -4 <infile>   rev (R2) FASTQ input file name from PE library 2 || .bz2),   or  DSRC  (.dsrc  or
  -5 <infile>   fwd (R1) FASTQ input file name from PE library 3 || .dsrc2), or uncompressed (.fq
  -6 <infile>   rev (R2) FASTQ input file name from PE library 3 || or .fastq)
  -o <outdir>   path and name of the output directory (mandatory option)
  -b <string>   base name for output files (mandatory option)
  -s <char>     to set a predefined strategy for processing HTS reads among the following ones:
                  A   Archaea:    DT(C)E + [N|MN] + dna + polishing + annotation
                  B   Bacteria:   DT(C)E + [N|MN] + dna + polishing + annotation
                  E   Eukaryote:  DT(C)  + [N|MN] + dna
                  G   Germs:      DT(C)E + [N|MN] + dna + polishing + annotation + classification
                  P   Prokaryote: DT(C)E + [N|MN] + dna + polishing
                  S   Standard:   DT(C)  + [N|MN] + dna + polishing
                  V   Virus:      DT(C)E + [N|MN] + dna + polishing + annotation
                (default: S)
  -L <int>      minimum required length for a contig (default: 300)
  -T <"G S I">  Genus (G), Species (S) and Isolate (I)  names to be used  during annotation step;
                should be set between quotation  marks and separated by a blank space;  only with
                options -s A, -s B, -s G or -s V (default: "Genus sp. STRAIN")
  -q <int>      quality score threshold;  all bases  with Phred  score below  this threshold  are
                considered as non-confident during step [T] (default: 20)
  -l <int>      minimum required length for a read (default: half the average read length)
  -p <int>      maximum allowed percentage  of non-confident bases  (as ruled  by option -q)  per
                read (default: 50)
  -c <int>      minimum allowed coverage depth during step [N] (default: 2)
  -C <int>      maximum allowed coverage depth during step [N] (default: 60)
  -a <infile>   to set a file containing every  alien oligonucleotide sequence  (one per line) to
                be clipped during step [T] (see below)
  -a <string>   one or several key words (separated with commas),  each corresponding to a set of
                alien oligonucleotide sequences to be clipped during step [T]:
                  POLY                nucleotide homopolymers
                  NEXTERA             Illumina Nextera index Kits
                  IUDI                Illumina Unique Dual index Kits
                  AMPLISEQ            AmpliSeq for Illumina Panels
                  TRUSIGHT_PANCANCER  Illumina TruSight RNA Pan-Cancer Kits
                  TRUSEQ_UD           Illumina TruSeq Unique Dual index Kits
                  TRUSEQ_CD           Illumina TruSeq Combinatorial Dual index Kits
                  TRUSEQ_SINGLE       Illumina TruSeq Single index Kits
                Note that these sets of alien sequences are not exhaustive and will never replace
                the exact oligos used for library preparation (default: "POLY")
  -a AUTO       to infer 3' alien oligonucleotide sequence(s) for step [T]; inferred oligo(s) are
                completed with those from "POLY" (see above)
  -A <infile>   to set  sequence or  k-mer model  file(s) to carry  out contaminant  read removal
                during step [C];  several comma-separated  file names  can be specified;  allowed
                file extensions: .fa, .fasta, .fna, .kmr or .kmz
  -e            extensive de novo assembly (slow; default: not set)
  -t <int>      number of threads (default: 12)
  -w <dir>      path to tmp directory (default: $TMPDIR, otherwise /tmp)
  -x            to not remove (after completing) the tmp directory inside the one set with option
                -w (default: not set)
  -d            check dependencies and exit
  -h            print this help and exit

  EXAMPLES:
    fq2dna.sh  -1 r1.fastq -2 r2.fastq  -o out -b ek12  -s G -T "Escherichia coli K12" -a NEXTERA
    fq2dna.sh  -1 rA.1.fq -2 rA.2.fq -3 rB.1.fq.gz -4 rB.2.fq.gz  -o out -b name  -a AUTO
    fq2dna.sh  -1 hts.1.fq.gz -2 hts.2.fq.gz  -o out -b bact -s B  -a IUDI  -t 24  -e

Notes

output file            file content
<prefix>.stepI.log     fqCleanER log file of the step I
<prefix>.stepM.log     fqCleanER log file of the step M
<prefix>.stepN.log     fqCleanER log file of the step N
<prefix>.all.fasta     all the scaffolds assembled by SPAdes (FASTA format)
<prefix>.cov.info.txt  genome coverage profile summary generated using CoPro
<prefix>.scf.fasta     selected and polished scaffold sequences (FASTA format)
<prefix>.scf.info.txt  residue content of the selected scaffold sequences (tab-delimited)
<prefix>.scf.amb.txt   ambiguously assembled bases (tab-delimited)
<prefix>.agp.fasta     contigs derived from the scaffold sequences in <prefix>.scf.fasta (FASTA format)
<prefix>.agp           scaffolding information associated to <prefix>.agp.fasta (AGP format)
<prefix>.dna.info.txt  descriptive statistics of each FASTA file content (tab-delimited)
<prefix>.isd.txt       descriptive statistics of the insert size distribution for each specified pair of PE FASTQ files
<prefix>.QCsum.txt     quality control metrics (tab-delimited)
<prefix>.gbk           annotated scaffold sequences (GenBank flat file format)
<prefix>.gbk.info.txt  annotation statistics generated by Bakta

QCsum field  QCsum field content
name         sample name (option -b)
Npe          number of PE reads
Nbps         number of sequenced bases
ISavg        average insert size
covI         initial coverage depth
Nctg         number of assembled contigs
Nres         total length of the assembled contigs
%GC          GC-content
Namb         number of ambiguous positions in the assembled scaffolds
N50          N50 of the assembled contigs
auN          auN of the assembled contigs
momo         mono-modality index of the coverage profile
R2           R2 metric derived from the coverage profile
covE         expected coverage depth of the genome assembly (option -C)
covO         observed coverage depth of the genome assembly
covF         mean of the theoretical distribution fitted to the coverage profile

Example

In order to illustrate the usefulness of fq2dna and to better describe its output files, the following use case example describes its usage for (re)assembling the draft genome of Listeria monocytogenes 2HF33 (Duru et al. 2020).

All output files are available in the directory example/ (the four large sequence files are compressed using gzip), as well as the version of every used tool and program (see program.versions.txt).

Downloading input files

Paired-end sequencing of this genome was performed using Illumina MiSeq, and the resulting pair of (compressed) FASTQ files (112 Mb and 128 Mb, respectively) can be downloaded using the following command lines:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR403/006/ERR4032786/ERR4032786_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR403/006/ERR4032786/ERR4032786_2.fastq.gz

As the phiX genome was used as spike-in during Illumina sequencing (Duru et al. 2020), this putative contaminating sequence (5.4 kb) can be downloaded using the following command line:

wget -O phiX.fasta "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_001422.1"
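
Downloaded files can be checked against the 32-character checksums that fq2dna reports next to each input file in its log output (see Running fq2dna). The sketch below assumes that these checksums are MD5 digests (md5sum is among the listed dependencies, but this was not verified in the code); check_md5 is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# check_md5 (hypothetical helper): succeed only if the MD5 digest of a file
# equals the expected one.
#   $1: file name
#   $2: expected MD5 digest (32 hexadecimal characters)
check_md5() {
  local file="$1" expected="$2" observed
  # reading from stdin makes md5sum print "<digest>  -"
  observed="$(md5sum < "$file" | cut -d' ' -f1)"
  [ "$observed" = "$expected" ]
}

# Example, with the digests displayed in the log output of the use case:
#   check_md5 ERR4032786_1.fastq.gz 541e697e87d7d80a70d29d3827f98d6d && echo "R1 ok"
#   check_md5 ERR4032786_2.fastq.gz 651a7443ba56bb52cdf1b43ba9e9d844 && echo "R2 ok"
```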

Running fq2dna

Below is a typical command line to run fq2dna on PE FASTQ files to deal with bacterial HTS data:

./fq2dna.sh -1 ERR4032786_1.fastq.gz -2 ERR4032786_2.fastq.gz -a AUTO -A phiX.fasta -s G   \
            -o 2HF33 -b Lm.2HF33 -T "Listeria monocytogenes 2HF33" -t 12 -w /tmp

Of note, option -a AUTO was set to let fq2dna infer the technical oligonucleotide used during library preparation, whereas the putative contaminating sequence file phiX.fasta was set using option -A. The strategy ‘Germs’ was set (option -s G) in order to also perform both scaffold annotation and classification.

Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies). Below is the one observed when using fq2dna together with the program and tool versions listed in example/program.versions.txt:

# fq2dna v25.12
# Copyright (C) 2016-2025 Institut Pasteur
+ https://gitlab.pasteur.fr/GIPhy/fq2dna
> System: x86_64-redhat-linux-gnu
> Bash:   4.4.20(1)-release
# input file(s):
> PE lib 1
+ FQ11: [111.9 Mb]  541e697e87d7d80a70d29d3827f98d6d  ERR4032786_1.fastq.gz
+ FQ12: [127.9 Mb]  651a7443ba56bb52cdf1b43ba9e9d844  ERR4032786_2.fastq.gz
# output directory
+ OUTDIR=2HF33
# tmp directory
+ TMP_DIR=/tmp/fq2dna.Lm.2HF33.lh617yARn
# STRATEGY G: Germs
[00:01] Deduplicating, Trimming, Clipping, Decontaminating and Correcting reads ... [ok]
[02:46] Merging and/or Normalizing reads ... [ok]
[04:01] Approximating genome size ... [ok]
> 3112666 bps (N=3112091 M=3113241)
[04:05] Assembling genome with/without PE read merging (dnaM/dnaN) .... [ok]
> [dnaN]  covr=62   arl=285   k=21,33,55,77,99,121
> [dnaM]  covr=60   arl=281   k=21,33,55,77,99,121
[06:26] Comparing genome assemblies ... [ok]
> [dnaN]  Nseq=31   Nres=3099160   NG50=361897   auGN=462498
> [dnaM]  Nseq=34   Nres=3099305   NG50=452690   auGN=346318
> selecting dnaN
> [dna]   Nseq=31   Nres=3099160    N50=361897    auN=464514
[06:27] Aligning PE reads against scaffolds .... [ok]
[06:40] Polishing scaffolds ... 
> [dna]   Nseq=31   Nres=3099288   NG50=361909   auGN=462513
> [dna]   Nseq=30   Nres=3099356   NG50=361921   auGN=462523
> [dna]   Nseq=30   Nres=3099356    N50=361921    auN=464506
[07:24] Processing scaffolds ... [ok]
> expect. cov.        60
> avg. cov.           60.45
> cov. mode           60
> momo index          0.99913539
> accuracy            0.878995
> min. cov. cutoff    43
> ambiguous pos.      5
> [dna]   Nseq=23   Nres=3096373    N50=361921    auN=464956
[07:43] Annotating scaffold sequences ... [ok]
> taxon:     Listeria monocytogenes 2HF33
> locus tag: LISMON2HF33
# output files:
+ de novo assembly (all scaffolds):         2HF33/Lm.2HF33.all.fasta
+ coverage profile summary:                 2HF33/Lm.2HF33.cov.info.txt
+ de novo assembly (selected scaffolds):    2HF33/Lm.2HF33.scf.fasta
+ sequence stats (selected scaffolds):      2HF33/Lm.2HF33.scf.info.txt
+ ambiguous positions (selected scaffolds): 2HF33/Lm.2HF33.scf.amb.txt
+ de novo assembly (selected contigs):      2HF33/Lm.2HF33.agp.fasta
+ scaffolding info:                         2HF33/Lm.2HF33.agp
+ descriptive statistics:                   2HF33/Lm.2HF33.dna.info.txt
+ insert size statistics:                   2HF33/Lm.2HF33.isd.txt
+ annotation (selected scaffolds):          2HF33/Lm.2HF33.scf.gbk
+ annotation info:                          2HF33/Lm.2HF33.scf.gbk.info.txt
+ quality control summary metrics:          2HF33/Lm.2HF33.QCsum.txt
[16:15] exit 

This log output shows that the complete analysis was performed on 12 threads in ~16 minutes (HTS read processing in ~4 minutes, de novo assemblies in ~2 minutes, scaffold sequence post-processing in ~2 minutes, annotation and classification in ~8 minutes). The selected de novo assembly was dnaN (i.e. lower number of contigs, greater auGN metric), which shows that the PE HTS read merging step (M) does not always yield better genome assemblies. The final genome assembly has a total size of 3.09 Mbps, represented in 23 contigs, with N50 = 361,921 bps (for the sake of comparison, the accession of the original assembly is GCF_902838845).
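
Such a per-step timing can be derived from a saved copy of the log output. The sketch below assumes that each step line starts with an elapsed time stamp of the form [mm:ss], as in the log above (the format used for runs longer than one hour is not known here); step_durations is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# step_durations (hypothetical helper): read a fq2dna log on stdin and print
# "<seconds><TAB><step>" for every step, i.e. the time elapsed between one
# [mm:ss] time stamp and the next one.
step_durations() {
  awk '
    match($0, /^\[[0-9]+:[0-9]+\]/) {
      split(substr($0, 2, RLENGTH - 2), t, ":")   # t[1]=minutes, t[2]=seconds
      now = t[1] * 60 + t[2]
      if (label != "") printf "%d\t%s\n", now - prev, label
      prev = now
      label = substr($0, RLENGTH + 2)             # text after "[mm:ss] "
      sub(/ *\.\.\..*$/, "", label)               # drop the trailing " ... [ok]"
    }'
}

# Example: step_durations < fq2dna.log
```

The last time-stamped line (exit) only serves to close the duration of the preceding step.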

Output files

The genome coverage profile summary file (Lm.2HF33.cov.info.txt; see excerpt below) shows an average coverage depth of ~60×, corresponding to a symmetrical coverage depth distribution with mode 60×, identical to the expected value (default option -C 60). The momo index is 0.9991 (above the mono-modality cutoff, e.g. 0.99), therefore suggesting that no contamination occurred in the HTS reads selected for assembly. The fitted Generalized Poisson (GP) distribution has a mean of 60.09, and its derived metric R2 (= 0.8798) suggests an expected uniform coverage across the assembled positions.

== observed coverage distribution
   no.pos  3101232
   avg     60.45
   mode    60
   momo    0.99913539
** GP(lambda,theta) distribution
   lambda  109.26055591
   theta   -0.81815798
   mean    60.09
   R2      0.87975610
-- confidence interval (p-value=0.00005)
   low     43
   sup     76
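
The mean of the fitted distribution can be recomputed from the two parameters of the excerpt above. The sketch below assumes the usual parametrisation of the generalized Poisson distribution, for which the mean is lambda / (1 - theta); this appears consistent with the reported values, but was not checked against the CoPro code. gp_mean is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# gp_mean (hypothetical helper): mean of a generalized Poisson distribution
# GP(lambda, theta), assuming mean = lambda / (1 - theta).
#   $1: lambda
#   $2: theta
gp_mean() {
  awk -v lambda="$1" -v theta="$2" 'BEGIN { printf "%.2f\n", lambda / (1 - theta) }'
}

gp_mean 109.26055591 -0.81815798   # prints 60.09 (lambda and theta from the excerpt)
```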

Taken together, these statistics indicate that the assembled sequences are accurate and well supported by sufficient data. Finally, the GP distribution makes it possible to determine the low coverage cutoff (i.e. 43×) below which any assembled base is considered putatively untrustworthy (lowercase characters in the scaffold sequence file Lm.2HF33.scf.fasta).
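
The number of such lowercase bases can be obtained with standard utilities. The sketch below is a minimal illustration that counts every lowercase character outside the FASTA header lines; count_lowercase_bases is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# count_lowercase_bases (hypothetical helper): number of lowercase residues
# in the sequence lines (i.e. not starting with ">") of a FASTA file.
#   $1: FASTA file name
count_lowercase_bases() {
  grep -v '^>' "$1" | tr -cd 'a-z' | wc -c | tr -d ' '
}

# Example: count_lowercase_bases Lm.2HF33.scf.fasta
```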

The tab-delimited file Lm.2HF33.dna.info.txt shows that the de novo assembly procedure led to quite few scaffold and contig sequences:

#File               Nseq  Nres     A      C      G      T      N    %A     %C     %G     %T     %N     %AT    %GC     Min Q25  Med   Q75    Max     Avg       auN    N50    N75    N90    L50 L75 L90
Lm.2HF33.all.fasta  35    3101208  943528 604087 568021 985272 300  30.42% 19.47% 18.31% 31.77% 0.00%  62.21% 37.79%  237 372  1664  99066  1091349 88605.94  559194 532745 125907 97509  2   6   10
Lm.2HF33.scf.fasta  21    3096552  942169 603187 566991 984026 179  30.42% 19.47% 18.31% 31.77% 0.00%  62.21% 37.79%  628 4992 64421 125913 1091380 147454.85 560067 532745 125913 97509  2   6   10
Lm.2HF33.agp.fasta  23    3096385  942169 603187 566991 984026 12   30.42% 19.48% 18.31% 31.77% 0.00%  62.21% 37.79%  628 4992 64421 126576 935183  134625.43 464956 361921 125913 90129  3   7   11

Of note, 14 small (min. length = 300 bps; option -L) and/or insufficiently covered sequences (cutoff = 43×; see Lm.2HF33.cov.info.txt) were not selected for the final scaffold sequence set (i.e. Lm.2HF33.scf.fasta), therefore leading to the final assembled genome of total length Nres = 3,096,552 bps, represented in Nseq = 21 scaffold sequences (scaffold and contig N50 = 532,745 and 361,921 bps, respectively).
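
For readers unfamiliar with these metrics, the sketch below recomputes N50 and auN from a list of sequence lengths using their usual definitions (N50: length of the sequence at which the cumulative length, from longest to shortest, reaches half of the total; auN: sum of the squared lengths divided by the total length). These definitions are assumed to match the ones used to build the table above, and n50_aun is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# n50_aun (hypothetical helper): read one sequence length per line on stdin
# and print "<N50><TAB><auN>" (auN rounded to the nearest integer).
n50_aun() {
  sort -nr | awk '
    { len[NR] = $1; total += $1; sq += $1 * $1 }
    END {
      if (total == 0) exit 1
      cum = 0
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (2 * cum >= total) { n50 = len[i]; break }   # half of the total reached
      }
      printf "%d\t%.0f\n", n50, sq / total
    }'
}

# Example, using the second column (Nres) of the tab-delimited scaffold info file:
#   awk -F"\t" '!/^#/ { print $2 }' Lm.2HF33.scf.info.txt | n50_aun
```

Applied to the 21 scaffold lengths of Lm.2HF33.scf.info.txt, this sketch should give N50 = 532745 and auN = 560067, in agreement with the Lm.2HF33.scf.fasta row of the table above.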

The tab-delimited file Lm.2HF33.scf.info.txt (displayed below) shows different residue statistics for each final scaffold sequence. Note that each FASTA header in the file Lm.2HF33.scf.fasta contains the k-mer coverage depth returned by SPAdes (covk), the read coverage depth estimated by fq2dna (covr), and the low coverage cutoff derived from the coverage profile analysis (cutoff).

#Seq      Nres    A      C      G      T      N  %A     %C     %G     %T     %N    %AT    %GC    Pval   CPU
NODE_0001 1091380 330567 212931 195684 352105 93 30.28% 19.51% 17.92% 32.26% 0.00% 62.56% 37.44% 0.0116 C
NODE_0002 532745  154855 110898 91560  175432 0  29.06% 20.81% 17.18% 32.92% 0.00% 62.00% 38.00% 0.6892 C
NODE_0003 361921  113236 65663  74544  108478 0  31.28% 18.14% 20.59% 29.97% 0.00% 61.27% 38.73% 0.0032 C
NODE_0004 206455  68177  35148  42251  60874  5  33.02% 17.02% 20.46% 29.48% 0.00% 62.51% 37.49% 0.0091 C
NODE_0005 126576  37673  26289  21299  41315  0  29.76% 20.76% 16.82% 32.64% 0.00% 62.41% 37.59% 0.2115 C
NODE_0006 125913  41257  22119  25733  36804  0  32.76% 17.56% 20.43% 29.22% 0.00% 62.00% 38.00% 0.8299 C
NODE_0007 109341  32663  22787  17974  35916  1  29.87% 20.84% 16.43% 32.84% 0.00% 62.73% 37.27% 0.0147 C
NODE_0008 103533  29753  21604  16750  35350  76 28.73% 20.86% 16.17% 34.14% 0.07% 62.93% 37.07% 0.0003 C
NODE_0009 99090   32747  17157  20514  28672  0  33.04% 17.31% 20.70% 28.93% 0.00% 61.99% 38.01% 0.9410 C
NODE_0010 97509   28253  20285  17248  31723  0  28.97% 20.80% 17.68% 32.53% 0.00% 61.51% 38.49% 0.0245 C
NODE_0011 64421   21749  11088  13365  18219  0  33.76% 17.21% 20.74% 28.28% 0.00% 62.05% 37.95% 0.6433 C
NODE_0012 61185   17687  12639  9717   21142  0  28.90% 20.65% 15.88% 34.55% 0.00% 63.47% 36.53% 0.0000 P
NODE_0013 46823   14085  9566   7968   15204  0  30.08% 20.43% 17.01% 32.47% 0.00% 62.56% 37.44% 0.1143 C
NODE_0014 35062   10004  7708   6005   11345  0  28.53% 21.98% 17.12% 32.35% 0.00% 60.89% 39.11% 0.0229 C
NODE_0015 22536   6000   4813   3643   8080   0  26.62% 21.35% 16.16% 35.85% 0.00% 62.48% 37.52% 0.1343 U
NODE_0016 4992    1395   1052   1470   1073   2  27.94% 21.07% 29.44% 21.49% 0.04% 49.46% 50.54% 0.0000 U
NODE_0017 2806    724    661    386    1035   0  25.80% 23.55% 13.75% 36.88% 0.00% 62.69% 37.31% 0.4654 U
NODE_0018 1664    524    292    266    582    0  31.49% 17.54% 15.98% 34.97% 0.00% 66.47% 33.53% 0.0078 U
NODE_0019 1126    398    199    264    263    2  35.34% 17.67% 23.44% 23.35% 0.17% 58.81% 41.19% 0.1331 U
NODE_0020 846     182    200    236    228    0  21.51% 23.64% 27.89% 26.95% 0.00% 48.47% 51.53% 0.0017 U
NODE_0021 628     240    88     114    186    0  38.21% 14.01% 18.15% 29.61% 0.00% 67.84% 32.16% 0.0219 U

The last column CPU also shows the classification of each scaffold sequence into the category ‘Chromosome’, ‘Plasmid’ or ‘Undetermined’ (i.e. C, P, U, respectively), as assessed by Platon via the selected strategy -s G. Such a classification shows that the 12th scaffold sequence (i.e. NODE_0012) is likely a plasmid (subsequent BLAST searches indeed show that it is almost identical to pLM6179).
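
As the file is tab-delimited, the scaffolds of a given category can be listed with a one-line awk filter. The sketch below only relies on the layout displayed above (header line starting with #, scaffold name in the first column, category in the last one); scaffolds_in_class is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# scaffolds_in_class (hypothetical helper): print the name of every scaffold
# whose last column (CPU) equals the specified category.
#   $1: tab-delimited scaffold info file (e.g. Lm.2HF33.scf.info.txt)
#   $2: category (C, P or U)
scaffolds_in_class() {
  awk -F'\t' -v cls="$2" '!/^#/ && $NF == cls { print $1 }' "$1"
}

# Example: scaffolds_in_class Lm.2HF33.scf.info.txt P
```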

The tab-delimited file Lm.2HF33.scf.amb.txt (displayed below) lists the Namb = 5 assembled positions that are likely multi-allelic.

#seq      lgt     pos    base %A    %C    %G    %T    %gap
NODE_0001 1091380 481973 N    50.0  0     50.0  0     0
NODE_0004 206455  203837 N    0     39.6  0     60.4  0
NODE_0004 206455  203838 N    0     42.6  0     57.4  0
NODE_0004 206455  203840 N    59.6  0     40.4  0     0
NODE_0004 206455  204068 N    37.0  0     63.0  0     0
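
To quickly see how these positions are distributed, the sketch below counts the ambiguous positions per scaffold. It only assumes the tab-delimited layout displayed above (header line starting with #, scaffold name in the first column); amb_per_scaffold is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# amb_per_scaffold (hypothetical helper): print "<scaffold><TAB><count>" for
# every scaffold listed in a tab-delimited file of ambiguous positions.
#   $1: file of ambiguous positions (e.g. Lm.2HF33.scf.amb.txt)
amb_per_scaffold() {
  awk -F'\t' '
    !/^#/ && NF { n[$1]++ }
    END { for (s in n) printf "%s\t%d\n", s, n[s] }' "$1" | sort
}

# Example: amb_per_scaffold Lm.2HF33.scf.amb.txt
```

With the file displayed above, four of the five ambiguous positions belong to NODE_0004, within a region of less than 250 bps.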

References

Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov AS, Lesin V, Nikolenko S, Pham S, Prjibelski A, Pyshkin A, Sirotkin A, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA (2012) SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5):455-477. doi:10.1089/cmb.2012.0021.

Duru IC, Andreevskaya M, Laine P, Rode TN, Ylinen A, Løvdal T, Bar N, Crauwels P, Riedel CU, Bucur FI, Nicolau AI, Auvinen P (2020) Genomic characterization of the most barotolerant Listeria monocytogenes RO15 strain compared to reference strains used to evaluate food high pressure processing. BMC Genomics, 21:455. doi:10.1186/s12864-020-06819-0.

Lindner MS, Kollock M, Zickmann F, Renard BY (2013) Analyzing genome coverage profiles with applications to quality control in metagenomics. Bioinformatics, 29(10):1260-1267. doi:10.1093/bioinformatics/btt147.

Roguski L, Deorowicz S (2014) DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.

Schwengers O, Barth P, Falgenhauer L, Hain T, Chakraborty T, Goesmann A (2020) Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores. Microbial Genomics, 6(10):mgen000398. doi:10.1099/mgen.0.000398.

Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A (2021) Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11):000685. doi:10.1099/mgen.0.000685.

Citations

Abou Fayad A, Rafei R, Njamkepo E, Ezzeddine J, Hussein H, Sinno S, Gerges J-R, Barada S, Sleiman A, Assi M, Baakliny M, Hamedeh L, Mahfouz R, Dabboussi F, Feghali R, Mohsen Z, Rady A, Ghosn N, Abiad F, Abubakar A, Barakat A, Wauquier N, Quilici M-L, Hamze M, Weill F-X, Matar GM (2024) An unusual two-strain cholera outbreak in Lebanon, 2022-2023: a genomic epidemiology study. Nature Communications, 15:6963. doi:10.1038/s41467-024-51428-0

Crestani C, Arcari G, Landier A, Passet V, Garnier D, Brémont S, Armatys N, Carmi-Leroy A, Toubiana J, Badell E, Brisse S (2023) Corynebacterium ramonii sp. nov., a novel toxigenic member of the Corynebacterium diphtheriae species complex. Research in Microbiology, 174(7):104113. doi:10.1016/j.resmic.2023.104113

Crestani C, Passet V, Rethoret-Pasty M, Zidane N, Brémont S, Badell E, Criscuolo A, Brisse S (2025) Microevolution and genomic epidemiology of the diphtheria-causing zoonotic pathogen Corynebacterium ulcerans. Nature Communications, 16:4843. doi:10.1038/s41467-025-60065-0

Crippa C, Pasquali F, Rodrigues C, De Cesare A, Lucchi A, Gambi L, Manfreda G, Brisse S, Palma F (2023) Genomic features of Klebsiella isolates from artisanal ready-to-eat food production facilities. Scientific Reports, 13:10957. doi:10.1038/s41598-023-37821-7

Hawkey J, Frézal L, Tran Dien A, Zhukova A, Brown D, Chattaway MA, Simon S, Izumiya H, Fields PI, De Lappe N, Kaftyreva L, Xu X, Isobe J, Clermont D, Njamkepo E, Akeda Y, Issenhuth-Jeanjean S, Makarova M, Wang Y, Hunt M, Jenkins BM, Ravel M, Guibert V, Serre E, Matveeva Z, Fabre L, Cormican M, Yue M, Zhu B, Morita M, Iqbal Z, Nodari CS, Pardos de la Gandara M, Weill F-X (2024) Genomic perspective on the bacillus causing paratyphoid B fever. Nature Communications, 15:10143. doi:10.1038/s41467-024-54418-4

Kämpfer P, Glaeser SP, Busse H-J, McInroy JA, Clermont D, Criscuolo A (2022) Pseudoneobacillus rhizosphaerae gen. nov., sp. nov., isolated from maize root rhizosphere. International Journal of Systematic and Evolutionary Microbiology, 72:5. doi:10.1099/ijsem.0.005367

Kämpfer P, Lipski A, Lamothe L, Clermont D, Criscuolo A, McInroy JA, Glaeser SP (2022) Paenibacillus allorhizoplanae sp. nov. from the rhizoplane of a Zea mays root. Archives of Microbiology, 204:630. doi:10.1007/s00203-022-03225-w

Kämpfer P, Lipski A, Lamothe L, Clermont D, Criscuolo A, McInroy JA, Glaeser SP (2023) Paenibacillus plantiphilus sp. nov. from the plant environment of Zea mays. Antonie van Leeuwenhoek, 116:883-892. doi:10.1007/s10482-023-01852-x

Kämpfer P, Lipski A, McInroy JA, Clermont D, Criscuolo A, Glaeser SP (2022) Bacillus rhizoplanae sp. nov. from maize roots. International Journal of Systematic and Evolutionary Microbiology, 72:7. doi:10.1099/ijsem.0.005450

Kämpfer P, Glaeser SP, McInroy JA, Busse H-J, Clermont D, Criscuolo A (2024) Description of Cohnella rhizoplanae sp. nov., isolated from the root surface of soybean (Glycine max). Antonie van Leeuwenhoek, 118:41. doi:10.1007/s10482-024-02051-y

Palma F, Mangone I, Janowicz A, Moura A, Chiaverini A, Torresi M, Garofolo G, Criscuolo A, Brisse S, Di Pasquale A, Cammà C, Radomski N (2022) In vitro and in silico parameters for precise cgMLST typing of Listeria monocytogenes. BMC Genomics, 23:235. doi:10.1186/s12864-022-08437-4

Rahi P, Mühle E, Scandola C, Touak G, Clermont D (2024) Genome sequence-based identification of Enterobacter strains and description of Enterobacter pasteurii sp. nov. Microbiology Spectrum, 12:e03150-23. doi:10.1128/spectrum.03150-23

Shelomi M, Han C-J, Chen W-M, Chen H-K, Liaw S-J, Mühle E, Clermont D (2023) Chryseobacterium oryctis sp. nov., isolated from the gut of the beetle Oryctes rhinoceros, and Chryseobacterium kimseyorum sp. nov., isolated from a stick insect rearing cage. International Journal of Systematic and Evolutionary Microbiology, 73:4. doi:10.1099/ijsem.0.005813

Vautrin N, Alexandre K, Pestel-Caron M, Bernard E, Fabre R, Leoz M, Dahyot S, Caron F (2023) Contribution of Antibiotic Susceptibility Testing and CH Typing Compared to Next-Generation Sequencing for the Diagnosis of Recurrent Urinary Tract Infections Due to Genetically Identical Escherichia coli Isolates: a Prospective Cohort Study of Cystitis in Women. Microbiology Spectrum, 11(4):e02785-22. doi:10.1128/spectrum.02785-22