fqCleanER

fqCleanER (fastq Cleaning and Enhancing Routine) is a command line tool written in Bash to ease the different standard preprocessing steps of short high-throughput sequencing (HTS) reads.

Eight standard HTS read processing steps can be carried out using fqCleanER:

❶ contaminating HTS read removal, using AlienRemover,

❷ sequencing error correction, using Musket (Liu et al. 2013),

❸ HTS read deduplication, using fqduplicate from the fqtools package,

❹ low-coverage HTS read removal, using ROCK,

❺ digital normalization, using ROCK,

❻ paired-ends HTS read merging, using FLASh (Magoc and Salzberg 2011),

❼ high-coverage (redundant) HTS read reduction, using ROCK,

❽ HTS read trimming and clipping, using AlienTrimmer (Criscuolo and Brisse 2013).

All these steps can be performed in any order on up to three paired- and/or single-end FASTQ files (compressed or not).

fqCleanER runs on UNIX, Linux and most OS X operating systems.

Dependencies

You will need to install the required programs listed in the following table, or to verify that they are already installed with the required version.

program	package	version	sources
gawk	-	> 4.0.0	ftp.gnu.org/gnu/gawk

bzip2	-	> 1.0.0	sourceware.org/bzip2/downloads.html
DSRC	-	≥ 2.0	github.com/lrog/dsrc
gzip	-	> 1.5.0	ftp.gnu.org/gnu/gzip

AlienRemover	-	≥1.0	gitlab.pasteur.fr/GIPhy/AlienRemover
AlienTrimmer	-	> 2.0	gitlab.pasteur.fr/GIPhy/AlienTrimmer
FLASh	-	> 1.2.10	sourceforge.net/projects/flashpage
fqconvert fqduplicate fqextract fqstats	fqtools	≥ 1.1a	ftp.pasteur.fr/pub/gensoft/projects/fqtools
minion	kraken	15-065	wwwdev.ebi.ac.uk/enright-dev/kraken/reaper/src
Musket	-	≥ 1.1	sourceforge.net/projects/musket
ntCard	-	> 1.2	github.com/bcgsc/ntCard
ROCK	-	≥ 1.9.3	gitlab.pasteur.fr/vlegrand/ROCK

Installation and execution

A. Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/fqCleanER.git

B. Give the execute permission to the file fqCleanER.sh:

chmod +x fqCleanER.sh

C. Execute fqCleanER with the following command line model:

./fqCleanER.sh  [options]

D. If at least one of the required program (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), fqCleanER will exit with an error message. When running fqCleanER without option, a documentation should be displayed; otherwise, the name of the missing program is displayed before exiting. In such a case, edit the file fqCleanER.sh and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS (approximately lines 70-200). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block REQUIREMENTS

program	variable assignment	program	variable assignment
AlienRemover	`ALIENREMOVER_BIN=AlienRemover;`	fqduplicate	`FQDUPLICATE_BIN=fqduplicate;`
AlienTrimmer	`ALIENTRIMMER_BIN=AlienTrimmer;`	fqextract	`FQEXTRACT_BIN=fqextract;`
bzip2	`BZIP2_BIN=bzip2;`	fqstats	`FQSTATS_BIN=fqstats;`
DSRC	`DSRC2_BIN=dsrc;`	minion	`MINION_BIN=minion;`
FLASh	`FLASH_BIN=flash;`	Musket	`MUSKET_BIN=musket;`
gawk	`GAWK_BIN=gawk;`	ntCard	`NTCARD_BIN=ntcard;`
gzip	`GZIP_BIN=gzip;`	ROCK	`ROCK_BIN=rock;`
fqconvert	`FQCONVERT_BIN=fqconvert;`

Note that depending on the installation of some required programs, the corresponding variable can be assigned with complex commands. For example, as AlienTrimmer is a Java tool that can be run using a Java virtual machine, the executable jar file AlienTrimmer.jar can be used by fqCleanER after editing the corresponding variable assignment instruction as follows: ALIENTRIMMER_BIN="java -jar AlienTrimmer.jar".

Usage

Run fqCleanER without option to read the following documentation:

 USAGE:  fqCleanER.sh  [options] 

 OPTIONS:
  -1 <infile>   fwd (R1) FASTQ-formatted input file name from paired-ends (PE) library 1; every
                input file can  be uncompressed  (file extension: .fastq or .fq)  or compressed
                using either gzip (.gz), bzip2 (.bz or .bz2) or DSRC (.dsrc or .dsrc2)
  -2 <infile>   rev (R2) FASTQ-formatted input file name from PE library 1
  -3 <infile>   fwd (R1) FASTQ-formatted input file name from PE library 2
  -4 <infile>   rev (R2) FASTQ-formatted input file name from PE library 2
  -5 <infile>   fwd (R1) FASTQ-formatted input file name from PE library 3
  -6 <infile>   rev (R2) FASTQ-formatted input file name from PE library 3
  -7 <infile>   FASTQ-formatted input file name from single-end (SE) library 4
  -8 <infile>   FASTQ-formatted input file name from SE library 5
  -9 <infile>   FASTQ-formatted input file name from SE library 6
  -o <outdir>   path and name of the output directory (mandatory option)
  -b <string>   base name for output files (mandatory option)
  -a <infile>   to set a file containing every alien oligonucleotide sequence (one per line) to
                be clipped during step 'T' (see below)
  -a <string>   one or several key words  (separated with commas),  each corresponding to a set 
                of alien oligonucleotide sequences to be clipped during step 'T' (see below):
                   POLY               nucleotide homopolymers
                   NEXTERA            Illumina Nextera index Kits
                   IUDI               Illumina Unique Dual index Kits
                   AMPLISEQ           AmpliSeq for Illumina Panels
                   TRUSIGHT_PANCANCER Illumina TruSight RNA Pan-Cancer Kits
                   TRUSEQ_UD          Illumina TruSeq Unique Dual index Kits
                   TRUSEQ_CD          Illumina TruSeq Combinatorial Dual index Kits
                   TRUSEQ_SINGLE      Illumina TruSeq Single index Kits
                   TRUSEQ_SMALLRNA    Illumina TruSeq Small RNA Kits
                Note that these sets of alien oligonucleotide  sequences are not exhaustive and
                will never  replace the  exact oligos  used for  library preparation  (default:
                "POLY")
  -a AUTO       to perform  de novo  inference of  3' alien  oligonucleotide sequence(s)  of at 
                least 20 nucleotide length;  selected sequences  are completed  with those from 
                "POLYS" (see above)                
  -A <infile>   to set sequence or k-mer  model file(s)  to carry out  contaminant read removal 
                during step 'C';  several comma-separated file names can be specified;  allowed 
                file extensions: .fa, .fasta, .fna, .kmr or .kmz (default: phiX174 genome)
  -d <string>   displays the alien oligonucleotide sequences corresponding to the specified key
                word(s); see option -a for the list of available key words
  -q <int>      quality score threshold;  all bases with Phred  score below  this threshold are 
                considered as non-confident (default: 15)
  -l <int>      minimum required length for a read (default: 50)
  -p <int>      maximum allowed percentage  of non-confident bases  (as ruled by option -q) per 
                read (default: 50) 
  -c <int>      minimum allowed coverage depth for step 'L' or 'N' (default: 4)
  -C <int>      maximum allowed coverage depth for step 'R' or 'N' (default: 90)
  -s <string>   a sequence of tasks  to be iteratively performed,  each being defined by one of 
                the following uppercase characters:
                   C   discarding [C]ontaminating reads (as ruled by option -A)
                   E   correcting sequencing [E]rrors
                   D   [D]eduplicating reads
                   L   discarding [L]ow-coverage reads (as ruled by option -c)
                   N   digital [N]ormalization (i.e. same as consecutive steps "RL")
                   M   [M]erging overlapping PE reads
                   R   [R]educing redundancy (as ruled by option -C)
                   T   [T]rimming and clipping (as ruled by options -q, -l, -p, -a)
                (default: "T")
  -z <string>   compressed output  file(s) using  gzip ("gz"),  bzip2 ("bz2")  or DSRC ("dsrc")
                (default: not compressed)
  -t <int>      number of threads (default: 12)
  -w <dir>      tmp directory (default: \$TMPDIR, otherwise /tmp)
  -h            prints this help and exit

 EXAMPLES:
   fqCleanER.sh  -4 se.fq -o out -b se.flt
   fqCleanER.sh  -1 r1.fq -2 r2.fq -o out -b pe.flt
   fqCleanER.sh  -1 r1.fq -2 r2.fq -a NEXTERA -q 20 -s DTENM -o out -b flt -z gz

Notes

fqCleanER is able to consider up to three paired-ends libraries (PE; options -1 to -6) and three single-end libraries (SE; options -7 to -9). Input files should be in FASTQ format and can be compressed using gzip, bzip2 or DSRC (Roguski and Deorowicz 2014). All input files are always converted into Phred+33 format using fqconvert.
Output files are defined by a specified prefix (mandatory option -b) and written in a specified output directory (mandatory option -o). Output files can be compressed using gzip, bzip2 or DSRC (option -z).
Temporary files are written into a dedicated directory created into the $TMPDIR directory (when defined, otherwise /tmp). When possible, it is highly recommended to set a temp directory with large capacity (option -w).
The cleaning/enhancing steps can be specified using option -s in any order. The same step can be specified several times (e.g. -s DTDNEN).

[C] Contaminating HTS read removal (-s C) is performed using AlienRemover with default options. Contaminating sequences/k-mers are specified using option -A. When contaminating sequences are quite short (e.g. virus genomes), they can be directly specified via a FASTA-formatted file without affecting the overall running time. However, to consider large contaminating sequences (e.g. human genomes), it is highly recommended to precompute the corresponding k-mer set using AlienRemover and specify the corresponding k-mer file (kmr/kmz file extension) to observe fast running times. If the option -A is not set, calling step C enables to remove phi-X174 HTS reads.

[D] HTS read deduplication (-s D) is performed using fqduplicate. Note that a pair of PE reads (R1,R2) and (R1’,R2’) are considered as duplicated (i.e. identical) when R1 = R1’ and R2 = R2’.

[E] Sequencing error correction (-s E) is performed using Musket (Liu et al. 2013) with k-mer length k = 21. This step generally requires quite important running times and will benefit from a large number of threads (option -t).

[L][N][R] These three steps (-s L, -s N, -s R, respectively) are related to the digital normalization procedure (e.g. Brown et al. 2012, Wedemeyer et al. 2017, Durai and Schulz 2019), performed using ROCK with k-mer length k = 25. Given a lower-bound and a upper-bound coverage depth thresholds (options -c and -C, respectively), the digital normalization selects a subset of HTS reads such that every sequenced base has a coverage depth between these two bounds. When setting a moderate upper-bound (that is lower than the overall average coverage depth; default: -C 90), every sequenced base from the selected HTS read subset is expected to have a coverage depth close to this bound. When setting a small lower-bound (default: -c 4), all HTS reads corresponding to a sequenced region with coverage depth lower than this bound will be discarded (e.g. artefactual or erroneous HTS read, low-coverage contaminating HTS read). Step N (-s N) uses the two bounds (options -C and -c), whereas steps L/R (-s L and -s R, respectively) use only the lower-/upper-bound, respectively.

[M] PE HTS read merging (-s M, only with PE input files) is performed using FLASh (Magoc and Salzberg 2011) when the insert size is shorter than the sum of the two paired HTS read lengths. When using this step, dedicated output files are written (.M.fastq file extension).

[T] Trimming and clipping (-s T) are performed using AlienTrimmer (Criscuolo and Brisse 2013). Clipping is carried out based on the specified alien oligonucleotides (option -a), where alien oligonucleotide sequences can be (i) set using precomputed standard library names, (ii) specified via user-defined FASTA-formatted file, or (iii) directly estimated from the input files using minion (option -a AUTO). When step T is run without setting option -a, clipping is carried out with the four homopolymers as alien oligonucleotides. Trimming is carried out by deleting 5’ and 3’ regions containing many non-confident bases, where a base is considered as non-confident when its Phred score is lower than a Phred score threshold (set using option -q; default: 15). After trimming/clipping an HTS read, it can be discarded when the number of remaining bases is lower than a specified length threshold (option -l; default: 50 bases) or when the percentage of remaining non-confident bases is higher than another specified threshold (option -p; default: 50%). Note that when HTS read discarding breaks PE, singletons are written into dedicated output files (.S.fastq file extension).
Each predefined set of alien oligonucleotide sequences can be displayed using option -d. Some sets of alien oligonucleotide sequences are derived from ‘Illumina Adapter Sequences’ Document # 1000000002694 v16, i.e. options -a NEXTERA (Nextera DNA Indexes), -a IUDI (IDT for Illumina UD Indexes), -a AMPLISEQ (AmpliSeq for Illumina Panels), -a TRUSIGHT_PANCANCER (TruSight RNA Pan-Cancer Panel), -a TRUSEQ_UD (IDT for Illumina-TruSeq DNA and RNA UD Indexes), -a TRUSEQ_CD (TruSeq DNA and RNA CD Indexes), -a TRUSEQ_SINGLE (TruSeq Single Indexes), and -a TRUSEQ_SMALLRNA (TruSeq Small RNA).
^{_{[Oligonucleotide sequences © 2021 Illumina, Inc. All rights reserved. Derivative works created by Illumina customers are authorized for use with Illumina instruments and products only. All other uses are strictly prohibited.]}}

References

Brown TC, Howe A, Zhang Q, Pyrkosz AB, Brom TH (2012) A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv:1203.4802.

Criscuolo A, Brisse S (2013) AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics, 102(5-6):500-506. doi:10.1016/j.ygeno.2013.07.011.

Durai DA, Schulz MH (2019) Improving in-silico normalization using read weights. Scientific Reports, 9:5133. doi:10.1038/s41598-019-41502-9.

Liu Y, Schröder J, Schmidt B (2013) Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics, 29(3):308-315. doi:10.1093/bioinformatics/bts690.

Magoc T, Salzberg S (2011) FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 27:21:2957-2963. doi:10.1093/bioinformatics/btr507.

Roguski L, Deorowicz S (2014) DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.

Wedemeyer A, Kliemann L, Srivastav A, Schielke C, Reusch TB, Rosenstiel P (2017) An improved filtering algorithm for big read datasets and its application to single-cell assembly. BMC Bioinformatics, 18:324. doi:10.1186/s12859-017-1724-7.