
fq2dna

fq2dna (FASTQ files to de novo assembly) is a command line tool written in Bash to ease the de novo assembly of archaea, bacteria or virus genomes from raw high-throughput sequencing (HTS) paired-end (PE) reads.

Every data pre- and post-processing step is managed by fq2dna (e.g. HTS read filtering and enhancing, well-tuned de novo assemblies, scaffold sequence accuracy assessment). The main purpose of fq2dna is to efficiently combine different methods, programs and tools to quickly infer accurate genome assemblies (e.g. 5 to 20 minutes to process a bacterial HTS sample using 12 threads). This mini-workflow can therefore be very useful for processing large batches of whole-genome shotgun sequencing data.

fq2dna runs on UNIX, Linux and most OS X operating systems.

Dependencies

You will need to install the required programs and tools listed in the following table, or to verify that they are already installed with the required version.

program      package  version    sources
gawk         -        > 4.0.0    ftp.gnu.org/gnu/gawk
bwa-mem2     -        ≥ 2.2.1    github.com/bwa-mem2/bwa-mem2
contig_info  -        > 2.0      gitlab.pasteur.fr/GIPhy/contig_info
FASTA2AGP    -        ≥ 2.0      gitlab.pasteur.fr/GIPhy/FASTA2AGP
fqCleanER    -        ≥ 23.12    gitlab.pasteur.fr/GIPhy/fqCleanER
fqstats      fqtools  ≥ 1.2      ftp.pasteur.fr/pub/gensoft/projects/fqtools
ntCard       -        > 1.2      github.com/bcgsc/ntCard
Platon       -        > 1.5      github.com/oschwengers/platon
Prokka       -        ≥ 1.14.5   github.com/tseemann/prokka
samtools     -        ≥ 1.18     github.com/samtools/samtools
                                 sourceforge.net/projects/samtools
SAM2MAP      SAM2MSA  ≥ 0.4.3.1  gitlab.pasteur.fr/GIPhy/SAM2MSA
SPAdes       -        ≥ 3.15.5   github.com/ablab/spades

Installation and execution

A. Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/fq2dna.git

B. Go to the created directory and give the execute permission to the file fq2dna.sh:

cd fq2dna/ 
chmod +x fq2dna.sh

C. Check the dependencies (and their version) using the following command line:

./fq2dna.sh  -d

D. If at least one of the required programs (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), it should be manually specified. To specify the location of a specific binary, edit the file fq2dna.sh and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS (approximately lines 80-200). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block REQUIREMENTS.

program      variable assignment            program   variable assignment
bwa-mem2     BWAMEM2_BIN=bwa-mem2;          ntcard    NTCARD_BIN=ntcard;
contig_info  CONTIG_INFO_BIN=contig_info;   Platon    PLATON_BIN=platon;
FASTA2AGP    FASTA2AGP_BIN=FASTA2AGP;       Prokka    PROKKA_BIN=prokka;
fqCleanER    FQCLEANER_BIN=fqCleanER;       samtools  SAMTOOLS_BIN=samtools;
fqstats      FQSTATS_BIN=fqstats;           SAM2MAP   SAM2MAP_BIN=SAM2MAP;
gawk         GAWK_BIN=gawk;                 SPAdes    SPADES_BIN=spades.py;

Note that depending on the installation of some required programs, the corresponding variable can be assigned with complex commands. For example, as FASTA2AGP is a Java tool that can be run using a Java virtual machine, the executable jar file FASTA2AGP.jar can be used by fq2dna by editing the corresponding variable assignment instruction as follows: FASTA2AGP_BIN="java -jar FASTA2AGP.jar".
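Before editing the code block REQUIREMENTS, it can help to see which default binary names cannot be found on $PATH. The sketch below is not part of fq2dna (option -d remains the reference check): the function name missing_bins and the variable FQ2DNA_DEFAULT_BINS are hypothetical, and the binary names are simply the defaults from the table above.

```shell
# Hypothetical helper (not part of fq2dna): print each given binary name
# that cannot be found on $PATH; returns non-zero if at least one is missing.
missing_bins() {
  local bin status
  status=0
  for bin in "$@"; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      echo "$bin"
      status=1
    fi
  done
  return "$status"
}

# Default binary names taken from the variable assignment table.
FQ2DNA_DEFAULT_BINS="bwa-mem2 contig_info FASTA2AGP fqCleanER fqstats gawk
                     ntcard platon prokka samtools SAM2MAP spades.py"

# Usage (word splitting of the list is intended here):
#   missing_bins $FQ2DNA_DEFAULT_BINS
```

Every name printed by this helper is a candidate for a manual edit of the corresponding variable assignment in fq2dna.sh.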

E. Execute fq2dna with the following command line model:

./fq2dna.sh  [options]

Usage

Run fq2dna without any option to read the following documentation:

 USAGE:  fq2dna.sh  [options] 

 Processing and assembling high-throughput sequencing (HTS) paired-end (PE) reads:
  + processing HTS reads  using different steps:  deduplicating [D], trimming/clipping [T], error
    correction [E], contaminant removal [C], merging [M], and/or digital normalization [N] 
  + de novo assembly [dna] of  the whole genome from  processed HTS reads,  followed by polishing
    and annotation steps (depending on the specified strategy) 

 OPTIONS:
  -1 <infile>   fwd (R1) FASTQ input file name from PE library 1 || input files can be compressed
  -2 <infile>   rev (R2) FASTQ input file name from PE library 1 || using   either   gzip   (file
  -3 <infile>   fwd (R1) FASTQ input file name from PE library 2 || extension .gz), bzip2 (.bz or
  -4 <infile>   rev (R2) FASTQ input file name from PE library 2 || .bz2),   or  DSRC  (.dsrc  or
  -5 <infile>   fwd (R1) FASTQ input file name from PE library 3 || .dsrc2), or uncompressed (.fq 
  -6 <infile>   rev (R2) FASTQ input file name from PE library 3 || or .fastq)                 
  -o <outdir>   path and name of the output directory (mandatory option)
  -b <string>   base name for output files (mandatory option)
  -s <char>     to set a predefined strategy for processing HTS reads among the following ones:
                  A   Archaea:     DT(C)E + [N|MN] + dna + polishing + annotation
                  B   Bacteria:    DT(C)E + [N|MN] + dna + polishing + annotation 
                  E   Eukaryote:   DT(C)  + [N|MN] + dna
                  P   Prokaryote:  DT(C)E + [N|MN] + dna + polishing
                  S   Standard:    DT(C)  + [N|MN] + dna + polishing
                  V   Virus:       DT(C)E + [N|MN] + dna + polishing + annotation
                (default: S)
  -L <int>      minimum required length for a contig (default: 300)
  -T <"G S I">  Genus (G), Species (S) and Isolate (I)  names to be used  during annotation step;
                should be set between quotation  marks and separated by a blank space;  only with
                options -s A, -s B or -s V (default: "Genus sp. STRAIN")
  -q <int>      quality score threshold;  all bases  with Phred  score below  this threshold  are 
                considered as non-confident during step [T] (default: 20)
  -l <int>      minimum required length for a read (default: half the average read length)
  -p <int>      maximum allowed percentage  of non-confident bases  (as ruled  by option -q)  per 
                read (default: 50) 
  -c <int>      minimum allowed coverage depth during step [N] (default: 3)
  -C <int>      maximum allowed coverage depth during step [N] (default: 60)
  -a <infile>   to set a file containing every  alien oligonucleotide sequence  (one per line) to
                be clipped during step [T] (see below)
  -a <string>   one or several key words (separated with commas),  each corresponding to a set of
                alien oligonucleotide sequences to be clipped during step [T]:
                  POLY                nucleotide homopolymers
                  NEXTERA             Illumina Nextera index Kits
                  IUDI                Illumina Unique Dual index Kits
                  AMPLISEQ            AmpliSeq for Illumina Panels
                  TRUSIGHT_PANCANCER  Illumina TruSight RNA Pan-Cancer Kits
                  TRUSEQ_UD           Illumina TruSeq Unique Dual index Kits
                  TRUSEQ_CD           Illumina TruSeq Combinatorial Dual index Kits
                  TRUSEQ_SINGLE       Illumina TruSeq Single index Kits
                  TRUSEQ_SMALLRNA     Illumina TruSeq Small RNA Kits
                Note that these sets of alien sequences are not exhaustive and will never replace
                the exact oligos used for library preparation (default: "POLY")
  -a AUTO       to (try to)  infer 3' alien  oligonucleotide  sequence(s) for step [T];  inferred
                oligo(s) are completed with those from "POLYS" (see above)                
  -A <infile>   to set  sequence or  k-mer model  file(s) to carry  out contaminant  read removal 
                during step [C];  several comma-separated  file names  can be specified;  allowed 
                file extensions: .fa, .fasta, .fna, .kmr or .kmz
  -t <int>      number of threads (default: 12)
  -w <dir>      path to tmp directory (default: $TMPDIR, otherwise /tmp)
  -x            to not remove (after completing) the tmp directory inside the one set with option
                -w (default: not set)
  -d            checks dependencies and exit
  -h            prints this help and exit

  EXAMPLES:
    fq2dna.sh  -1 r1.fastq -2 r2.fastq  -o out -b ek12  -s P -T "Escherichia coli K12" -a AUTO 
    fq2dna.sh  -1 rA.1.fq -2 rA.2.fq -3 rB.1.fq.gz -4 rB.2.fq.gz  -o out -b name  -a NEXTERA
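
For batches of samples, the command line model above can be generated in a loop. The following is a minimal sketch, not part of fq2dna: the function name fq2dna_batch is hypothetical, and it assumes that each sample is stored as a pair of files named <sample>_1.fastq.gz and <sample>_2.fastq.gz, with paths free of blank characters. Only options documented above (-1, -2, -o, -b, -s) are used.

```shell
# Hypothetical batch driver (not part of fq2dna). Assumes each sample is a
# pair of files named <sample>_1.fastq.gz and <sample>_2.fastq.gz, and that
# paths contain no whitespace. Prints one fq2dna command line per sample.
fq2dna_batch() {
  local indir="$1" strategy="${2:-S}" r1 r2 sample
  for r1 in "$indir"/*_1.fastq.gz; do
    [ -e "$r1" ] || continue              # no match: the glob stays unexpanded
    r2="${r1%_1.fastq.gz}_2.fastq.gz"
    if [ ! -e "$r2" ]; then
      echo "skipping $r1: missing $r2" >&2
      continue
    fi
    sample=$(basename "$r1" _1.fastq.gz)
    echo "./fq2dna.sh -1 $r1 -2 $r2 -o $sample -b $sample -s $strategy"
  done
}

# Usage: print the command lines for every pair in reads/ (strategy B);
# review them, then pipe the output to 'sh' to actually run them.
#   fq2dna_batch reads B
```

Printing the commands instead of running them directly keeps the sketch a dry run by default, so the generated command lines can be inspected first.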

Notes

output file             file content
<prefix>.stepI.log      fqCleanER log file of the step I
<prefix>.stepM.log      fqCleanER log file of the step M
<prefix>.stepN.log      fqCleanER log file of the step N
<prefix>.all.fasta      the less fragmented SPAdes assembly (FASTA format)
<prefix>.cov.info.txt   coverage profile summary generated using SAM2MAP
<prefix>.scf.fasta      selected and polished scaffold sequences (FASTA format)
<prefix>.scf.info.txt   residue content of the selected scaffold sequences (tab-delimited)
<prefix>.scf.amb.txt    ambiguously assembled bases (tab-delimited)
<prefix>.agp.fasta      contigs derived from the selected and polished scaffold sequences (FASTA format)
<prefix>.agp            scaffolding information associated to the contigs (AGP format)
<prefix>.dna.info.txt   descriptive statistics of each FASTA file content (tab-delimited)
<prefix>.isd.txt        descriptive statistics of the insert size distribution
<prefix>.gbk            assembled genome annotation (GenBank flat file format)
<prefix>.gbk.info.txt   annotation statistics generated by Prokka
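
After a run, the presence of the main output files can be verified with a few lines of shell. This is only a sketch under stated assumptions: the function name missing_outputs is hypothetical, the suffix list is copied from the table above, and log and annotation files are left out because which files are written may depend on the selected strategy.

```shell
# Hypothetical post-run check (not part of fq2dna): list the expected output
# files that are missing from <outdir>, given the base name set with option -b.
# Returns non-zero if at least one file is missing.
missing_outputs() {
  local outdir="$1" base="$2" suffix status
  status=0
  for suffix in all.fasta cov.info.txt scf.fasta scf.info.txt scf.amb.txt \
                agp.fasta agp dna.info.txt isd.txt; do
    if [ ! -e "$outdir/$base.$suffix" ]; then
      echo "$outdir/$base.$suffix"
      status=1
    fi
  done
  return "$status"
}

# Usage: missing_outputs <outdir> <basename>
```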

Example

In order to illustrate the usefulness of fq2dna and to better describe its output files, the following use case example describes its usage for (re)assembling the draft genome of Listeria monocytogenes 2HF33 (Duru et al. 2020).

All output files are available in the directory example/ (the four large sequence files are compressed using gzip), as well as the version of every used tool and program (see program.versions.txt).

Downloading input files

Paired-end sequencing of this genome was performed using Illumina MiSeq, and the resulting pair of (compressed) FASTQ files (112 Mb and 128 Mb, respectively) can be downloaded using the following command lines:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR403/006/ERR4032786/ERR4032786_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR403/006/ERR4032786/ERR4032786_2.fastq.gz

As the phiX genome was used as spike-in during Illumina sequencing (Duru et al. 2020), this putative contaminating sequence (5.4 kb) can be downloaded using the following command line:

wget -O phiX.fasta "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_001422.1"

Running fq2dna

Below is a typical command line to run fq2dna on PE FASTQ files to deal with bacteria HTS data:

./fq2dna.sh -1 ERR4032786_1.fastq.gz -2 ERR4032786_2.fastq.gz -a NEXTERA -A phiX.fasta   \
            -s B -T "Listeria monocytogenes 2HF33" -o 2HF33 -b Lm.2HF33 -t 12 -w /tmp

Of note, option -a NEXTERA was set to clip the Nextera XT technical oligonucleotides used during library preparation (see Duru et al. 2020), the putative contaminating sequence file phiX.fasta was set using option -A, and the strategy ‘Bacteria’ was set (option -s B).

Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies). Below is the one observed when using fq2dna together with the program and tool versions listed in example/program.versions.txt:

# fq2dna v24.02
# Copyright (C) 2016-2024 Institut Pasteur
+ https://gitlab.pasteur.fr/GIPhy/fq2dna
> System: x86_64-redhat-linux-gnu
> Bash:   4.4.20(1)-release
# input file(s):
> PE lib 1
+ FQ11: [111.9 Mb]  ERR4032786_1.fastq.gz
+ FQ12: [127.8 Mb]  ERR4032786_2.fastq.gz
# output directory
+ OUTDIR=2HF33
# tmp directory
+ TMP_DIR=/tmp/fq2dna.Lm.2HF33.3wOuEayqB
# STRATEGY B: Bacteria
[00:01] Deduplicating, Trimming, Clipping, Decontaminating and Correcting reads ... [ok]
[03:05] Merging and/or Normalizing reads ... [ok]
[04:26] Approximating genome size ... [ok]
> 3112890 bps (N=3112026 M=3113754)
[04:31] Assembling genome with/without PE read merging (dnaM/dnaN) ... [ok]
> [dnaN]  covr=67   arl=278   k=21,33,55,77,99,121
> [dnaM]  covr=60   arl=279   k=21,33,55,77,99,121
[06:07] Comparing genome assemblies ... [ok]
> [dnaN]  Nseq=30   Nres=3099162   NG50=532745   auGN=431350
> [dnaM]  Nseq=32   Nres=3099079   NG50=452691   auGN=346759
> selecting dnaN
> [dna]   Nseq=30   Nres=3099162    N50=532745    auN=433261
[06:09] Aligning PE reads against scaffolds .... [ok]
[06:20] Polishing scaffolds ... 
> [dna]   Nseq=30   Nres=3099325   NG50=532745   auGN=431353
> [dna]   Nseq=30   Nres=3099291   NG50=532745   auGN=431353
> [dna]   Nseq=30   Nres=3099291    N50=532745    auN=433241
[06:58] Processing scaffolds ... [ok]
> expect. cov.        60
> avg. cov.           65.3020
> cov. mode           65
> momo index          0.999245
> min. cov. cutoff    40
> ambiguous pos.      18
> [dna]   Nseq=26   Nres=3097544    N50=532745    auN=433485
[07:21] Annotating scaffold sequences ... [ok]
> taxon:     Listeria monocytogenes 2HF33
> locus tag: LISMON2HF33
# output files:
+ de novo assembly (all scaffolds):         2HF33/Lm.2HF33.all.fasta
+ coverage profile summary:                 2HF33/Lm.2HF33.cov.info.txt
+ de novo assembly (selected scaffolds):    2HF33/Lm.2HF33.scf.fasta
+ sequence stats (selected scaffolds):      2HF33/Lm.2HF33.scf.info.txt
+ ambiguous positions (selected scaffolds): 2HF33/Lm.2HF33.scf.amb.txt
+ de novo assembly (selected contigs):      2HF33/Lm.2HF33.agp.fasta
+ scaffolding info:                         2HF33/Lm.2HF33.agp
+ descriptive statistics:                   2HF33/Lm.2HF33.dna.info.txt
+ insert size statistics:                   2HF33/Lm.2HF33.isd.txt
+ annotation (selected scaffolds):          2HF33/Lm.2HF33.scf.gbk
+ annotation info:                          2HF33/Lm.2HF33.scf.gbk.info.txt
[11:47] exit 

This log output shows that the complete analysis was performed on 12 threads in less than 15 minutes (HTS read processing in < 5 minutes, de novo assemblies in ~2 minutes, and scaffold sequence post-processing in ~4 minutes). One can also observe that the selected de novo assembly was dnaN (i.e. greater NG50 and auGN metrics), showing that the PE HTS read merging step (M) does not always yield better genome assemblies.
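
The per-step running times can be derived from the timestamps of the log. The sketch below is not part of fq2dna: the function name step_durations is hypothetical, and it assumes that the log was saved to a file and that each timestamp has the form [MM:SS], counted from the start of the run.

```shell
# Hypothetical helper (not part of fq2dna): read a saved fq2dna log on stdin
# and print, for each "[MM:SS] step" line, the number of seconds elapsed until
# the next timestamped line, followed by a tab and the step label.
step_durations() {
  awk '
    match($0, /^\[[0-9]+:[0-9]+\]/) {
      split(substr($0, 2, RLENGTH - 2), t, ":")
      now = t[1] * 60 + t[2]
      if (label != "") printf "%d\t%s\n", now - prev, label
      prev = now
      label = substr($0, RLENGTH + 2)
      sub(/ *\.\.\..*$/, "", label)     # drop the trailing " ... [ok]"
    }
  '
}

# Usage: step_durations < saved_log.txt
```

Applied to the log above, the assembly step (from 04:31 to 06:07) would be reported as 96 seconds.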

Output files

The genome coverage profile summary file (Lm.2HF33.cov.info.txt; see excerpt below) shows an average coverage depth of ~65×, corresponding to a symmetrical coverage depth distribution with mode 65×, which is close to the expected value (60×; default option -C 60). The momo index is 0.999245 (above the mono-modality cutoff, e.g. 0.99), therefore suggesting that no contamination occurred in the HTS reads selected for assembly.

=  observed coverage distribution:  no.pos=3100527
   avg=65.3020  mode=65  momo=0.999245

#  Poisson(l) coverage tail distribution:  l=0.03409091  w=0.00014191
*  GP(l',r) coverage distribution:  l'=96.67650600  r=-0.47643024  1-w=0.99985809
   min.cov.cutoff=40 (p-value<=0.000001)
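
When many samples are processed, the momo index can be extracted from each coverage summary file and compared with a cutoff. This is a minimal sketch, not part of fq2dna: the function name momo_ok is hypothetical, the file layout is assumed to match the excerpt above, and the default cutoff 0.99 is only the illustrative value quoted in the text.

```shell
# Hypothetical check (not part of fq2dna): extract the momo index from a
# *.cov.info.txt file ($1) and compare it with a cutoff ($2, default 0.99).
# Exit status: 0 if momo >= cutoff, 1 if below, 2 if no momo value was found.
momo_ok() {
  awk -v cutoff="${2:-0.99}" '
    match($0, /momo=[0-9.]+/) {
      momo = substr($0, RSTART + 5, RLENGTH - 5)
      found = 1
      exit
    }
    END {
      if (!found) { print "momo not found"; exit 2 }
      printf "momo=%s cutoff=%s\n", momo, cutoff
      exit ((momo + 0 >= cutoff + 0) ? 0 : 1)
    }
  ' "$1"
}

# Usage: momo_ok 2HF33/Lm.2HF33.cov.info.txt || echo "check for contamination"
```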

These statistics therefore indicate that the assembled sequences are accurate and well supported by sufficient data. Finally, a generalized Poisson (GP) distribution (that fits the observed genome coverage profile) makes it possible to determine a minimum coverage cutoff (i.e. 40×) under which any assembled base is considered as putatively not trustworthy (lowercase characters in the scaffold sequence file Lm.2HF33.scf.fasta).
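
Since putatively untrustworthy bases are written in lowercase, their number can be counted per scaffold directly from the FASTA file. The sketch below is not part of fq2dna; the function name lowercase_bases is hypothetical, and it assumes a standard (possibly multi-line) FASTA file.

```shell
# Hypothetical helper (not part of fq2dna): for each sequence of a FASTA file,
# print its name, its length and its number of lowercase characters
# (tab-separated).
lowercase_bases() {
  LC_ALL=C awk '
    function report() { if (name != "") printf "%s\t%d\t%d\n", name, len, low }
    /^>/ { report(); name = substr($1, 2); len = 0; low = 0; next }
         { len += length($0); low += gsub(/[a-z]/, "&") }
    END  { report() }
  ' "$1"
}

# Usage: lowercase_bases 2HF33/Lm.2HF33.scf.fasta
```

Here gsub replaces each lowercase character by itself, so that its return value is the number of such characters on the line.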

The tab-delimited file Lm.2HF33.dna.info.txt shows that the de novo assembly procedure led to quite few scaffold and contig sequences:

#File               Nseq Nres    A      C      G      T      N   %A     %C     %G     %T     %N    %AT    %GC    Min Q25  Med   Q75    Max    Avg       auN    N50    N75    N90   L50 L75 L90
Lm.2HF33.all.fasta  32   3100379 950305 602779 569035 978060 200 30.65% 19.44% 18.35% 31.54% 0.00% 62.21% 37.79% 237 456  2806  99066  685549 96886.84  433875 532745 125907 97509 3   6   10
Lm.2HF33.scf.fasta  24   3097763 949601 602201 568428 977314 219 30.65% 19.43% 18.34% 31.54% 0.00% 62.21% 37.79% 416 1259 46823 108604 685549 129073.45 434244 532745 125913 97509 3   6   10
Lm.2HF33.agp.fasta  26   3097577 949601 602201 568428 977314 33  30.65% 19.44% 18.35% 31.55% 0.00% 62.21% 37.79% 416 1677 35062 108604 685549 119137.57 433485 532745 125913 90130 3   6   10

Of note, 8 small (min. length = 300 bps; option -L) and/or insufficiently covered sequences (cutoff = 40×; see Lm.2HF33.cov.info.txt) were not selected for the final scaffold sequence set (i.e. Lm.2HF33.scf.fasta), therefore leading to a final assembled genome of total size 3,097,577 bps (Nres), represented in 26 contigs (Nseq), with 37.79% GC-content.

The tab-delimited file Lm.2HF33.scf.info.txt (displayed below) shows different residue statistics for each final scaffold sequence. Note that each FASTA header (in files Lm.2HF33.all.fasta and Lm.2HF33.scf.fasta) contains the k-mer coverage depth returned by SPAdes (covk), the base coverage depth estimated by fq2dna (covr), and the (low) coverage cutoff derived from the coverage profile analysis (cutoff).

#Seq      Nres   A      C      G      T      N   %A     %C     %G     %T     %N    %AT    %GC     Pval    CPU
NODE_0001 685549 215815 127652 129467 212612 3   31.48% 18.62% 18.88% 31.01% 0.00% 62.50% 37.50%  0.3505  C
NODE_0002 612275 178849 125728 103197 204497 4   29.21% 20.53% 16.85% 33.39% 0.00% 62.62% 37.38%  0.0004  C
NODE_0003 532745 154855 110898 91560  175432 0   29.06% 20.81% 17.18% 32.92% 0.00% 62.00% 38.00%  0.7214  C
NODE_0004 361897 108473 74537  65659  113228 0   29.97% 20.59% 18.14% 31.28% 0.00% 61.27% 38.73%  0.0007  C
NODE_0005 126569 41312  21299  26286  37672  0   32.63% 16.82% 20.76% 29.76% 0.00% 62.41% 37.59%  0.0509  C
NODE_0006 125913 41257  22119  25733  36804  0   32.76% 17.56% 20.43% 29.22% 0.00% 62.00% 38.00%  0.8123  C
NODE_0007 108604 32364  22660  17840  35740  0   29.80% 20.86% 16.42% 32.90% 0.00% 62.71% 37.29%  0.0176  C
NODE_0008 103539 35348  16745  21600  29751  95  34.13% 16.17% 20.86% 28.73% 0.09% 62.94% 37.06%  0.0030  C
NODE_0009 99077  32743  17154  20512  28668  0   33.04% 17.31% 20.70% 28.93% 0.00% 61.99% 38.01%  0.9059  C
NODE_0010 97509  31723  17248  20285  28253  0   32.53% 17.68% 20.80% 28.97% 0.00% 61.51% 38.49%  0.0112  C
NODE_0011 64128  21645  11027  13314  18142  0   33.75% 17.19% 20.76% 28.29% 0.00% 62.05% 37.95%  0.8180  C
NODE_0012 61185  21142  9717   12639  17687  0   34.55% 15.88% 20.65% 28.90% 0.00% 63.47% 36.53%  0.0000  P
NODE_0013 46823  14085  9566   7968   15204  0   30.08% 20.43% 17.01% 32.47% 0.00% 62.56% 37.44%  0.1035  C
NODE_0014 35062  10004  7708   6005   11345  0   28.53% 21.98% 17.12% 32.35% 0.00% 60.89% 39.11%  0.0245  C
NODE_0015 22536  6000   4813   3643   8080   0   26.62% 21.35% 16.16% 35.85% 0.00% 62.48% 37.52%  0.1312  U
NODE_0016 5088   1071   1466   1052   1391   108 21.04% 28.81% 20.67% 27.33% 2.12% 49.44% 50.56%  0.0000  U
NODE_0017 2806   724    661    386    1035   0   25.80% 23.55% 13.75% 36.88% 0.00% 62.69% 37.31%  0.4608  U
NODE_0018 1677   529    293    266    589    0   31.54% 17.47% 15.86% 35.12% 0.00% 66.67% 33.33%  0.0076  U
NODE_0019 1259   499    220    218    320    2   39.63% 17.47% 17.31% 25.41% 0.15% 65.16% 34.84%  0.0603  C
NODE_0020 1116   397    198    260    261    0   35.57% 17.74% 23.29% 23.38% 0.00% 58.97% 41.03%  0.1412  U
NODE_0021 859    230    234    204    186    5   26.77% 27.24% 23.74% 21.65% 0.58% 48.72% 51.28%  0.0151  U
NODE_0022 622    239    84     114    183    2   38.42% 13.50% 18.32% 29.42% 0.32% 68.07% 31.93%  0.0646  U
NODE_0023 509    135    101    120    153    0   26.52% 19.84% 23.57% 30.05% 0.00% 56.59% 43.41%  0.0267  U
NODE_0027 416    162    73     100    81     0   38.94% 17.54% 24.03% 19.47% 0.00% 58.42% 41.58%  0.1963  C

The last column CPU also shows the classification of each scaffold sequence into the category ‘Chromosome’, ‘Plasmid’ or ‘Undetermined’ (i.e. C, P, U, respectively). Such a classification shows that the 12th scaffold sequence (i.e. NODE_0012) is likely a plasmid (subsequent BLAST searches indeed show that it is almost identical to pLM6179).
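
The scaffolds of a given category can be listed from this file with a short awk filter. This is only a sketch, not part of fq2dna: the function name scaffolds_by_category is hypothetical, and it assumes that the category (C, P or U) is the last column, as in the table above.

```shell
# Hypothetical helper (not part of fq2dna): print the name and length of every
# scaffold of a *.scf.info.txt file ($1) whose last column equals the given
# category ($2: C, P or U).
scaffolds_by_category() {
  awk -v cat="$2" '!/^#/ && $NF == cat { print $1 "\t" $2 }' "$1"
}

# Usage: scaffolds_by_category 2HF33/Lm.2HF33.scf.info.txt P
```

On the table displayed above, selecting category P would return the single scaffold NODE_0012 (61185 bps).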

The tab-delimited file Lm.2HF33.scf.amb.txt (displayed below) lists the different assembled positions that are likely non mono-allelic.

#seq       lgt     pos     base  %A    %C    %G    %T    %gap
NODE_0001  685549  203562  N     51.2  0     0     48.8  0
NODE_0001  685549  203577  N     0     53.3  0     46.7  0
NODE_0002  612275  2385    N     0     61.7  0     38.3  0
NODE_0002  612275  2613    N     2.1   36.2  0     61.7  0
NODE_0002  612275  2615    N     60.4  0     39.6  0     0
NODE_0002  612275  2616    N     60.4  0     39.6  0     0
NODE_0008  103539  89855   N     27.1  0     72.9  0     0
NODE_0008  103539  89906   N     0     73.1  0     26.9  0
NODE_0008  103539  89960   N     0     23.8  0     76.2  0
NODE_0016  5088    4713    N     0     78.6  0     21.4  0
NODE_0016  5088    4715    N     0     23.6  0     76.4  0
NODE_0019  1259    860     N     39.4  0     0     60.6  0
NODE_0021  859     349     N     0     76.9  0     0     23.1
NODE_0021  859     350     N     0     76.5  0     23.5  0
NODE_0021  859     352     N     24.0  0     0     76.0  0
NODE_0021  859     527     N     60.9  0     0     0     39.1
NODE_0022  622     468     N     0     72.9  0     27.1  0
NODE_0022  622     475     N     0     72.3  0     27.7  0

As the overall sequencing does not seem to suffer from contamination (see above), (some of) these ambiguous positions may be indicative of repeat regions of the genome.
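
To see where ambiguous positions cluster, they can be counted per scaffold. The sketch below is not part of fq2dna; the function name ambiguous_per_scaffold is hypothetical, and it assumes the layout displayed above (one position per line, scaffold name in the first column, header line starting with #).

```shell
# Hypothetical helper (not part of fq2dna): count the ambiguous positions
# listed for each scaffold in a *.scf.amb.txt file, keeping the file order.
ambiguous_per_scaffold() {
  awk '
    /^#/ || NF == 0 { next }
    !($1 in n)      { order[++k] = $1 }
                    { n[$1]++ }
    END { for (i = 1; i <= k; i++) print order[i] "\t" n[order[i]] }
  ' "$1"
}

# Usage: ambiguous_per_scaffold 2HF33/Lm.2HF33.scf.amb.txt
```

On the file displayed above, the per-scaffold counts add up to 18, in agreement with the number of ambiguous positions reported in the log.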

References

Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov AS, Lesin V, Nikolenko S, Pham S, Prjibelski A, Pyshkin A, Sirotkin A, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA (2012) SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5):455-477. doi:10.1089/cmb.2012.0021.

Duru IC, Andreevskaya M, Laine P, Rode TN, Ylinen A, Løvdal T, Bar N, Crauwels P, Riedel CU, Bucur FI, Nicolau AI, Auvinen P (2020) Genomic characterization of the most barotolerant Listeria monocytogenes RO15 strain compared to reference strains used to evaluate food high pressure processing. BMC Genomics, 21:455. doi:10.1186/s12864-020-06819-0.

Lindner MS, Kollock M, Zickmann F, Renard BY (2013) Analyzing genome coverage profiles with applications to quality control in metagenomics. Bioinformatics, 29(10):1260-1267. doi:10.1093/bioinformatics/btt147.

Roguski L, Deorowicz S (2014) DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.

Schwengers O, Barth P, Falgenhauer L, Hain T, Chakraborty T, Goesmann A (2020) Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores. Microbial Genomics, 6(10):mgen000398. doi:10.1099/mgen.0.000398.

Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068-2069. doi:10.1093/bioinformatics/btu153.