fq2dna

fq2dna (FASTQ files to de novo assembly) is a command line tool written in Bash to ease the de novo assembly of archaea, bacteria or virus genomes from raw high-throughput sequencing (HTS) paired-end (PE) reads.

Every data pre- and post-processing step is managed by fq2dna (e.g. HTS read filtering and enhancing, well-tuned de novo assemblies, scaffold sequence accuracy assessment). The main purpose of fq2dna is to efficiently use different methods, programs and tools to quickly infer accurate genome assemblies (e.g. from 5 to 15 minutes to deal with a bacteria HTS sample using 12 threads). This mini-workflow can therefore be very useful to deal with large batches of whole-genome shotgun sequencing data.

fq2dna runs on UNIX, Linux and most OS X operating systems.

Dependencies

You will need to install the required programs and tools listed in the following table, or to verify that they are already installed with the required version.

program package version sources
contig_info - > 2.0 gitlab.pasteur.fr/GIPhy/contig_info
FASTA2AGP - ≥ 2.0 gitlab.pasteur.fr/GIPhy/FASTA2AGP
fqCleanER - 21.06 gitlab.pasteur.fr/GIPhy/fqCleanER
fqstats fqtools 1.1a ftp.pasteur.fr/pub/gensoft/projects/fqtools
gawk - > 4.0.0 ftp.gnu.org/gnu/gawk
minimap 2 - > 2.17 github.com/lh3/minimap2
ntCard - > 1.2 github.com/bcgsc/ntCard
Platon - > 1.5 github.com/oschwengers/platon
Prokka - ≥ 1.14.5 github.com/tseemann/prokka
SAM2MAP SAM2MSA ≥ 0.3.3.1 gitlab.pasteur.fr/GIPhy/SAM2MSA
SPAdes - > 3.15.1 github.com/ablab/spades

Installation and execution

A. Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/fq2dna.git

B. Give the execute permission to the file fq2dna.sh:

chmod +x fq2dna.sh

C. Execute fq2dna with the following command line model:

./fq2dna.sh  [options]

D. If at least one of the required program (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), fq2dna will exit with an error message. When running fq2dna without option, a documentation should be displayed; otherwise, the name of the missing program is displayed before existing. In such a case, edit the file fq2dna.sh and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS (approximately lines 60-150). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block REQUIREMENTS

program variable assignment program variable assignment
contig_info CONTIG_INFO_BIN=contig_info; ntcard NTCARD_BIN=ntcard;
FASTA2AGP FASTA2AGP_BIN=FASTA2AGP; Platon PLATON_BIN=platon;
fqCleanER FQCLEANER_BIN=fqCleanER; Prokka PROKKA_BIN=prokka;
fqstats FQSTATS_BIN=fqstats; SAM2MAP SAM2MAP_BIN=SAM2MAP;
gawk GAWK_BIN=gawk; SPAdes SPADES_BIN=spades.py;

Note that depending on the installation of some required programs, the corresponding variable can be assigned with complex commands. For example, as FASTA2AGP is a Java tool that can be run using a Java virtual machine, the executable jar file FASTA2AGP.jar can be used by fq2dna by editing the corresponding variable assignment instruction as follows: FASTA2AGP_BIN="java -jar FASTA2AGP.jar".

Usage

Run fq2dna without option to read the following documentation:

 USAGE:  fq2dna.sh  [options] 

 Processing and assembling high-throughput sequencing (HTS) paired-end reads:
  + processing HTS reads using different steps: deduplicating [D], trimming/clipping [T], error
    correction [E], contaminant removal [C], merging [M], and/or digital normalization [N] 
  + de novo assembly [dna] of the whole genome from processed HTS reads

 OPTIONS:
  -1 <infile>   fwd (R1) FASTQ-formatted input file name from paired-ends (PE) library 1; every
                input file can  be uncompressed  (file extension: .fastq or .fq)  or compressed
                using either gzip (.gz), bzip2 (.bz or .bz2) or DSRC (.dsrc or .dsrc2)
  -2 <infile>   rev (R2) FASTQ-formatted input file name from PE library 1
  -3 <infile>   fwd (R1) FASTQ-formatted input file name from PE library 2
  -4 <infile>   rev (R2) FASTQ-formatted input file name from PE library 2
  -5 <infile>   fwd (R1) FASTQ-formatted input file name from PE library 3
  -6 <infile>   rev (R2) FASTQ-formatted input file name from PE library 3
  -o <outdir>   path and name of the output directory (mandatory option)
  -b <string>   base name for output files (mandatory option)
  -s <char>     to set a predefined strategy for processing HTS reads among the following ones:
                  A   Archaea:     DT(C)E + [N|MN] + dna + annotation
                  B   Bacteria:    DT(C)E + [N|MN] + dna + annotation 
                  E   Eukaryote:   DT(C)  + [N|MN] + dna
                  P   Prokaryote:  DT(C)E + [N|MN] + dna 
                  S   Standard:    DT(C)  + [N|MN] + dna 
                  V   Virus:       DT(C)E + [N|MN] + dna + annotation
                (mandatory option; default: S)
  -L <int>      minimum required length for a contig (default: 300)
  -T <"G S I">  Genus (G), Species (S) and Isolate (I) names to be used during annotation step;
                should be set between quotation marks and separated by a blank space; only with
                options -s A, -s B or -s V (default: "Genus sp. STRAIN")
  -q <int>      quality score threshold;  all bases with Phred  score below  this threshold are 
                considered as non-confident during step [T] (default: 15)
  -l <int>      minimum required length for a read (default: 50)
  -p <int>      maximum allowed percentage of non-confident bases  (as ruled by option -q)  per 
                read (default: 50) 
  -c <int>      minimum allowed coverage depth during step [N] (default: 2)
  -C <int>      maximum allowed coverage depth during step [N] (default: 60)
  -a <infile>   to set a file containing every alien oligonucleotide sequence (one per line) to
                be clipped during step [T] (see below)
  -a <string>   one or several key words  (separated with commas),  each corresponding to a set 
                of alien oligonucleotide sequences to be clipped during step [T]:
                  POLY                nucleotide homopolymers
                  NEXTERA             Illumina Nextera index Kits
                  IUDI                Illumina Unique Dual index Kits
                  AMPLISEQ            AmpliSeq for Illumina Panels
                  TRUSIGHT_PANCANCER  Illumina TruSight RNA Pan-Cancer Kits
                  TRUSEQ_UD           Illumina TruSeq Unique Dual index Kits
                  TRUSEQ_CD           Illumina TruSeq Combinatorial Dual index Kits
                  TRUSEQ_SINGLE       Illumina TruSeq Single index Kits
                  TRUSEQ_SMALLRNA     Illumina TruSeq Small RNA Kits
                Note that these sets of alien oligonucleotide  sequences are not exhaustive and
                will never  replace the  exact oligos  used for  library preparation  (default: 
                "POLY")
  -a AUTO       to perform de  novo inference of  3' alien  oligonucleotide  sequence(s)  of at 
                least 20 nucleotide length for step [T]; selected sequences are completed  with
                those from "POLYS" (see above)                
  -A <infile>   to set sequence or k-mer  model file(s)  to carry out  contaminant read removal 
                during step [C];  several comma-separated file names can be specified;  allowed 
                file extensions: .fa, .fasta, .fna, .kmr or .kmz
  -t <int>      number of threads (default: 12)
  -w <dir>      tmp directory (default: $TMPDIR, otherwise /tmp)
  -x            to not remove  (after completing)  the tmp directory  inside  the one  set with 
                option -w (default: not set)
  -h            prints this help and exit

  EXAMPLES:
    fq2dna.sh  -1 r1.fq -2 r2.fq  -o out -b echk12  -s P -T "Escherichia coli K12"  -a AUTO 
    fq2dna.sh  -1 rA.1.fq -2 rA.2.fq -3 rB.1.fq.gz -4 rB.2.fq.gz  -o out -b name  -a NEXTERA

Notes

Example

In order to illustrate the usefulness of fq2dna and to better describe its output files, the following use case example describes its usage for (re)assembling the draft genome of Listeria monocytogenes 2HF33 (Duru et al. 2020).

All output files are available in the directory example/ (the four sequence files were compressed using gzip).

Downloading input files

Paired-end sequencing of this genome was performed using Illumina Miseq, and the resulting pair of (compressed) FASTQ files (112 Mb and 128 Mb, respectively) can be downloaded using the following command lines:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR403/006/ERR4032786/ERR4032786_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR403/006/ERR4032786/ERR4032786_2.fastq.gz

As phiX genome was used as spike-in during Illumina sequencing (Duru et al. 2020), this putative contaminating sequence (5.4 kb) can be downloded using the following command line:

wget -O phiX.fasta "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_001422.1"

Running fq2dna

Below is a typical command line to run fq2dna on PE FASTQ files to deal with such a bacteria HTS data:

./fq2dna.sh -1 ERR4032786_1.fastq.gz -2 ERR4032786_2.fastq.gz -a NEXTERA -A phiX.fasta   \\
            -s B -T "Listeria monocytogenes 2HF33" -o 2HF33 -b Lm.2HF33 -t 12 -w /tmp

Of note, option -a NEXTERA was set to clip the Nextera XT technical oligonucleotides used during library preparation (see Duru et al. 2020), the putative contaminating sequence file phiX.fasta was set using option -A, and the strategy ‘Bacteria’ was trivially set (option -s B).

Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies). Below is the one observed when using fq2dna together with program and tool versions listed into example/program.versions.txt:

# fq2dna v21.06ac
# Copyright (C) 2016-2021 Institut Pasteur
# OS:   linux-gnu
# Bash: 4.4.19(1)-release
# input file(s):
> PE lib 1
+ FQ11: [111.9 Mb]  ERR4032786_1.fastq.gz
+ FQ12: [127.8 Mb]  ERR4032786_2.fastq.gz
# output directory
+ OUTDIR=2hf33
# tmp directory
+ TMP_DIR=/tmp/fq2dna.Lm.2HF33.Pe6bnviTj
# STRATEGY B: Bacteria
[00:01] Deduplicating, Trimming, Clipping, Decontaminating and Correcting reads ... [ok]
[02:37] Merging and/or Normalizing reads ... [ok]
[03:34] Approximating genome size ... [ok]
> 3077907 bps (N=3078225 M=3077589)
[03:38] Assembling genome with/without PE read merging (dnaM/dnaN) ... [ok]
> [dnaN]  covr=63   arl=239   k=21,33,55,77,99,121
> [dnaM]  covr=60   arl=261   k=21,33,55,77,99,121
[05:41] Comparing genome assemblies ... [ok]
> [dnaN]  Nseq=34   Nres=3099209   NG50=358956   auGN=325895
> [dnaM]  Nseq=35   Nres=3099182   NG50=358475   auGN=325314
> selecting dnaN
[05:42] Aligning reads against scaffolds ..... [ok]
> min. cov. cutoff: 39
> avg. cov. depth:  61.5760
> bimodality coef.: 0.222822
[06:07] Annotating scaffold sequences ... [ok]
> taxon:     Listeria monocytogenes 2HF33
> locus tag: LISMON2HF33
# output files:
+ de novo assembly (all scaffolds):       2hf33/Lm.2HF33.all.fasta
+ coverage profile summary:               2hf33/Lm.2HF33.cov.info.txt
+ de novo assembly (selected scaffolds):  2hf33/Lm.2HF33.scf.fasta
+ sequence stats (selected scaffolds):    2hf33/Lm.2HF33.scf.info.txt
+ de novo assembly (selected contigs):    2hf33/Lm.2HF33.agp.fasta
+ scaffolding info:                       2hf33/Lm.2HF33.agp
+ descriptive statistics:                 2hf33/Lm.2HF33.dna.info.txt
+ insert size statistics:                 2hf33/Lm.2HF33.isd.txt
+ annotation (selected scaffolds):        2hf33/Lm.2HF33.scf.gbk
+ annotation info:                        2hf33/Lm.2HF33.scf.gbk.info.txt
[08:31] exit 

This log output shows that the complete analysis was performed on 12 threads in less than 9 minutes (HTS read processing in < 4 minutes, de novo assemblies in ~2 minutes, and scaffold sequence post-processing in < 3 minutes). One can also observe that the selected de novo assembly was dnaN (i.e. greater NG50 and auNG metrics), therefore showing that the PE HTS read merging step (M) does not always lead to better genome assemblies.

Output files

The genome coverage profile summary file (Lm.2HF33.cov.info.txt; see excerpt below) shows an average coverage depth of ~61x, corresponding to a symmetrical coverage depth distribution with mode of 61x, as expected (default option -C 60). The bimodality coefficient is 0.223, far below the bimodality cutoff (i.e. 0.55; see e.g. Ellison 1987, Pfister et al. 2013), suggesting that no contamination occurred in the HTS reads selected for assembly.

=  observed coverage distribution:  no.pos=3099409
   avg=61.5760  skewness=-1.850311  excess.kurtosis=16.852882  bimodality.coef=0.222822

#  Poisson(l) coverage tail distribution:  l=0.26470588  w=0.00004388
*  GP(l',r) coverage distribution:  l'=91.38551745  r=-0.48001727  1-w=0.99995612
   min.cov.cutoff=39 (p-value<=0.00001)

These different statistics therefore assess that the assembled sequences are accurate and well-supported by sufficient data. Finally, a generalized Poisson (GP) distribution (that fits the observed genome coverage profile) enables to determine a minimum coverage cutoff (i.e. 39x) under which any assembled base is considered as putatively not trustworthy.

The tab-delimited file Lm.2HF33.dna.info.txt shows that the de novo assembly procedure led to quite few scaffold and contig sequences:

#File               Nseq Nres    A      C      G      T      N   %A     %C     %G     %T     %N    %AT    %GC    Min Q25  Med   Q75    Max    Avg       auN    N50    N75    N90   L50 L75 L90
Lm.2HF33.all.fasta  32   3099409 965680 586493 584967 962069 200 31.15% 18.92% 18.87% 31.04% 0.00% 62.21% 37.79% 325 467  5071  103424 609762 96856.53  369361 483167 125907 97509 3   7   11
Lm.2HF33.scf.fasta  25   3096280 964821 585834 584214 961211 200 31.16% 18.92% 18.86% 31.04% 0.00% 62.21% 37.79% 325 2377 61185 125907 609762 123851.20 369734 483167 125907 97509 3   7   11
Lm.2HF33.agp.fasta  27   3096080 964821 585834 584214 961211 0   31.16% 18.92% 18.86% 31.04% 0.00% 62.21% 37.79% 325 2334 61185 126569 532745 114669.62 323982 358956 125907 97509 4   8   12

Of note, seven small and/or insufficiently covered sequences (cutoff = 39x; see Lm.2HF33.cov.info.txt) were not selected for the final scaffold sequence set (i.e. Lm.2HF33.scf.fasta), therefore leading to a final assembled genome of total size 3,096,080 bps (Nres), represented in 25 scaffolds or 27 contigs (Nseq), with 37.79% GC-content.

The tab-delimited file Lm.2HF33.scf.info.txt (displayed below) shows different residue statistics for each final scaffold sequence. Note that each FASTA header (info files Lm.2HF33.all.fasta and Lm.2HF33.scf.fasta) contains the k-mer coverage depth returned by SPAdes (covk), the base coverage depth estimated by fq2dna (covr), and the (low) coverage cutoff derived from the coverage profile analysis (cutoff).

#Seq                                          Nres   A      C      G      T      N   %A     %C     %G     %T     %N    %AT    %GC     Pval    CPU
NODE_1_covk_30.406287_covr_61.662_cutoff_39   609762 203624 102734 125302 178002 100 33.39% 16.84% 20.54% 29.19% 0.01% 62.60% 37.40%  0.0034  C
NODE_2_covk_30.453294_covr_61.686_cutoff_39   532745 154855 110898 91560  175432 0   29.06% 20.81% 17.18% 32.92% 0.00% 62.00% 38.00%  0.6223  C
NODE_3_covk_30.272711_covr_61.550_cutoff_39   483167 152931 88007  93261  148968 0   31.65% 18.21% 19.30% 30.83% 0.00% 62.49% 37.51%  0.0385  C
NODE_4_covk_30.706364_covr_61.810_cutoff_39   358956 107395 74076  65004  112481 0   29.91% 20.63% 18.10% 31.33% 0.00% 61.26% 38.74%  0.0004  C
NODE_5_covk_30.542160_covr_61.815_cutoff_39   202873 66970  34564  41586  59753  0   33.01% 17.03% 20.49% 29.45% 0.00% 62.47% 37.53%  0.0148  C
NODE_6_covk_30.987260_covr_61.695_cutoff_39   126569 41312  21299  26286  37672  0   32.63% 16.82% 20.76% 29.76% 0.00% 62.41% 37.59%  0.0628  C
NODE_7_covk_30.597515_covr_61.821_cutoff_39   125907 36800  25733  22119  41255  0   29.22% 20.43% 17.56% 32.76% 0.00% 62.00% 38.00%  0.9722  C
NODE_8_covk_30.889345_covr_61.679_cutoff_39   109334 32660  22786  17973  35915  0   29.87% 20.84% 16.43% 32.84% 0.00% 62.73% 37.27%  0.0206  C
NODE_9_covk_30.877632_covr_61.422_cutoff_39   103424 29747  21597  16742  35338  0   28.76% 20.88% 16.18% 34.16% 0.00% 62.94% 37.06%  0.0079  C
NODE_10_covk_30.202183_covr_61.508_cutoff_39  99066  32740  17152  20508  28666  0   33.04% 17.31% 20.70% 28.93% 0.00% 61.99% 38.01%  0.9669  C
NODE_11_covk_30.532037_covr_61.653_cutoff_39  97509  28253  20285  17248  31723  0   28.97% 20.80% 17.68% 32.53% 0.00% 61.51% 38.49%  0.0206  C
NODE_12_covk_30.699550_covr_61.404_cutoff_39  64398  21741  11084  13359  18214  0   33.76% 17.21% 20.74% 28.28% 0.00% 62.05% 37.95%  0.7621  C
NODE_13_covk_29.544576_covr_60.693_cutoff_39  61185  21142  9717   12639  17687  0   34.55% 15.88% 20.65% 28.90% 0.00% 63.47% 36.53%  0.0000  P
NODE_14_covk_30.516466_covr_61.060_cutoff_39  46823  14085  9566   7968   15204  0   30.08% 20.43% 17.01% 32.47% 0.00% 62.56% 37.44%  0.1267  C
NODE_15_covk_30.758479_covr_61.756_cutoff_39  35062  10004  7708   6005   11345  0   28.53% 21.98% 17.12% 32.35% 0.00% 60.89% 39.11%  0.0190  C
NODE_16_covk_30.848137_covr_61.400_cutoff_39  22536  6000   4813   3643   8080   0   26.62% 21.35% 16.16% 35.85% 0.00% 62.48% 37.52%  0.1588  U
NODE_17_covk_35.036364_covr_60.303_cutoff_39  5071   1064   1467   1048   1392   100 20.98% 28.92% 20.66% 27.45% 1.97% 49.41% 50.59%  0.0000  U
NODE_18_covk_32.390317_covr_54.643_cutoff_39  2806   724    661    386    1035   0   25.80% 23.55% 13.75% 36.88% 0.00% 62.69% 37.31%  0.4884  U
NODE_19_covk_29.273936_covr_55.026_cutoff_39  2377   637    502    337    901    0   26.79% 21.11% 14.17% 37.90% 0.00% 64.71% 35.29%  0.0432  U
NODE_20_covk_30.051514_covr_56.107_cutoff_39  2334   775    370    416    773    0   33.20% 15.85% 17.82% 33.11% 0.00% 66.33% 33.67%  0.0042  U
NODE_21_covk_25.827869_covr_50.283_cutoff_39  2073   706    348    359    660    0   34.05% 16.78% 17.31% 31.83% 0.00% 65.90% 34.10%  0.0104  U
NODE_22_covk_34.048830_covr_54.113_cutoff_39  1104   395    198    256    255    0   35.77% 17.93% 23.18% 23.09% 0.00% 58.88% 41.12%  0.1347  U
NODE_25_covk_37.367052_covr_46.822_cutoff_39  467    133    88     61     185    0   28.47% 18.84% 13.06% 39.61% 0.00% 68.10% 31.90%  0.0461  U
NODE_29_covk_33.709790_covr_42.206_cutoff_39  407    80     99     70     158    0   19.65% 24.32% 17.19% 38.82% 0.00% 58.48% 41.52%  0.2084  C
NODE_32_covk_34.926471_covr_39.388_cutoff_39  325    48     82     78     117    0   14.76% 25.23% 24.00% 36.00% 0.00% 50.77% 49.23%  0.0171  U

The last column CPU also shows the classification of each scaffold sequence into the category ‘Chromosome’, ‘Plasmid’ or ‘Undetermined’ (i.e. C, P, U, respectively). Such a classification shows that the 13rd scaffold sequence (i.e. NODE_13) is likely a plasmid (subsequent BLAST searches indeed show that it is almost identical to pLM6179).

References

Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov AS, Lesin V, Nikolenko S, Pham S, Prjibelski A, Pyshkin A, Sirotkin A, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA (2012) SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5):455-477. doi:10.1089/cmb.2012.0021.

Duru IC, Andreevskaya M, Laine P, Rode TN, Ylinen A, Løvdal T, Bar N, Crauwels P, Riedel CU, Bucur FI, Nicolau AI, Auvinen P (2020) Genomic characterization of the most barotolerant Listeria monocytogenes RO15 strain compared to reference strains used to evaluate food high pressure processing. BMC Genomics, 21:455. doi:10.1186/s12864-020-06819-0.

Ellison AM (1987) Effect of seed dimorphism on the density-dependent dynamics of experimental populations of Atriplex triangularis (Chenopodiaceae). American Journal of Botany, 74(8):1280-1288. doi:10.2307/2444163.

Lindner MS, Kollock M, Zickmann F, Renard BY (2013) Analyzing genome coverage profiles with applications to quality control in metagenomics. Bioinformatics, 29(10):1260-1267. doi:10.1093/bioinformatics/btt147.

Pfister R, Schwarz KA, Janczyk M, Dale R, Freeman JB (2013) Good things peak in pairs: a note on the bimodality coefficient. Frontiers in Psychology, 4:700. doi:10.3389/fpsyg.2013.00700.

Roguski L, Deorowicz S (2014) DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.

Schwengers O, Barth P, Falgenhauer L, Hain T, Chakraborty T, Goesmann A (2020) Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores. Microbial Genomics, 6(10):mgen000398. doi:10.1099/mgen.0.000398.

Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068-2069. doi:10.1093/bioinformatics/btu153.