fqCleanER (fastq Cleaning and Enhancing Routine) is a command line tool written in Bash to ease the different standard preprocessing steps of short high-throughput sequencing (HTS) reads.
Eight standard HTS read processing steps can be carried out:
❶ HTS read [D]eduplication, using
fqduplicate from the fqtools
package,
❷ HTS read [T]rimming and clipping,
using AlienTrimmer
(Criscuolo and Brisse 2013),
❸ paired-ends HTS
read [M]erging, using FLASh (Magoc and
Salzberg 2011),
❹ [C]ontaminating HTS read
removal, using AlienRemover,
❺ sequencing [E]rror correction, using Musket
(Liu et al. 2013),
❻ [L]ow-coverage HTS read
removal, using ROCK
(Legrand et al. 2022a, 2022b),
❼ digital
[N]ormalization, using ROCK
(Legrand et al. 2022a, 2022b),
❽ high-coverage
(redundant) HTS read [R]eduction, using ROCK
(Legrand et al. 2022a, 2022b).
All these steps can be performed in any order on up to three paired- and/or single-end FASTQ files (compressed or not).
fqCleanER runs on UNIX, Linux and most OS X operating systems.
You will need to install the required programs listed in the following table, or to verify that they are already installed with the required version.
program | package | version | sources |
---|---|---|---|
gawk | - | > 4.0.0 | ftp.gnu.org/gnu/gawk |
xargs ★ | GNU findutils | ≥ 4.6.0 | ftp.gnu.org/gnu/findutils |
bzip2 | - | > 1.0.0 | sourceware.org/bzip2/downloads.html |
DSRC | - | ≥ 2.0 | github.com/refresh-bio/DSRC |
gzip | - | > 1.5.0 | ftp.gnu.org/gnu/gzip |
AlienDiscover | - | ≥ 0.1 | gitlab.pasteur.fr/GIPhy/AlienDiscover |
AlienRemover | - | ≥ 1.0 | gitlab.pasteur.fr/GIPhy/AlienRemover |
AlienTrimmer | - | ≥ 2.1 | gitlab.pasteur.fr/GIPhy/AlienTrimmer |
FLASh | - | > 1.2.10 | sourceforge.net/projects/flashpage |
fqconvert fqduplicate fqextract fqstats |
fqtools | ≥ 1.1a | ftp.pasteur.fr/pub/gensoft/projects/fqtools |
Musket ✦ | - | ≥ 1.1 | sourceforge.net/projects/musket |
ntCard | - | > 1.2 | github.com/bcgsc/ntCard |
ROCK | - | ≥ 1.9.3 | gitlab.pasteur.fr/vlegrand/ROCK |
★ For some Mac OS X, it is
worth noting that the default BSD xargs
does not offer all the functionalities required by fqCleanER.
However, the expected GNU
xargs (here named gxargs
) can be easily
installed using homebrew
(i.e. brew install findutils
). Of note, fqCleanER
first looks for the gxargs
binary on the
$PATH
, and, if missing, for the xargs
binary.
✦ When
compiling the source code of Musket,
it is recommended to edit its Makefile to increase the value of
the macro MAX_SEQ_LENGTH
(e.g. 1000) in order to avoid any
problem during the execution of fqCleanER.
A. Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/fqCleanER.git
B. Give the execute permission to the file
fqCleanER.sh
:
chmod +x fqCleanER.sh
C. Execute fqCleanER with the following command line model:
./fqCleanER.sh [options]
D. If at least one of the required program (see Dependencies) is not available on your
$PATH
variable (or if one compiled binary has a different
default name), fqCleanER will exit with an error message. When
running fqCleanER without option, a usage documentation should
be displayed (see below); otherwise, the name of the missing program is
displayed before exiting. In such a case, edit the file
fqCleanER.sh
and indicate the local path to the
corresponding binary(ies) within the code block
REQUIREMENTS
(approximately lines 80-220). For each
required program, the table below reports the corresponding variable
assignment instruction to edit (if needed) within the code block
REQUIREMENTS
program | variable assignment | program | variable assignment | |
---|---|---|---|---|
AlienDiscover | ALIENDISCOVER_BIN=AlienDiscover; |
fqconvert | FQCONVERT_BIN=fqconvert; |
|
AlienRemover | ALIENREMOVER_BIN=AlienRemover; |
fqduplicate | FQDUPLICATE_BIN=fqduplicate; |
|
AlienTrimmer | ALIENTRIMMER_BIN=AlienTrimmer; |
fqextract | FQEXTRACT_BIN=fqextract; |
|
bzip2 | BZIP2_BIN=bzip2; |
fqstats | FQSTATS_BIN=fqstats; |
|
DSRC | DSRC2_BIN=dsrc; |
Musket | MUSKET_BIN=musket; |
|
FLASh | FLASH_BIN=flash; |
ntCard | NTCARD_BIN=ntcard; |
|
gawk | GAWK_BIN=gawk; |
ROCK | ROCK_BIN=rock; |
|
gzip | GZIP_BIN=gzip; |
xargs | XARGS_BIN=xargs; GXARGS_BIN=gxargs;
(OS X) |
Note that depending on the installation of some required programs,
the corresponding variable can be assigned with complex commands. For
example, as AlienTrimmer is a Java tool that can be run using a
Java virtual machine, the executable jar file
AlienTrimmer.jar
can be used by fqCleanER after
editing the corresponding variable assignment instruction as follows:
ALIENTRIMMER_BIN="java -jar AlienTrimmer.jar"
.
Run fqCleanER without option to read the following documentation:
USAGE: fqCleanER.sh [options]
OPTIONS:
-1 <infile> fwd (R1) FASTQ input file name from PE library 1 | input files can be |
-2 <infile> rev (R2) FASTQ input file name from PE library 1 | uncompressed (file |
-3 <infile> fwd (R1) FASTQ input file name from PE library 2 | extensions .fastq |
-4 <infile> rev (R2) FASTQ input file name from PE library 2 | or .fq), or |
-5 <infile> fwd (R1) FASTQ input file name from PE library 3 | compressed using |
-6 <infile> rev (R2) FASTQ input file name from PE library 3 | either gzip (.gz), |
-7 <infile> FASTQ input file name from SE library 4 | bzip2 (.bz2 or |
-8 <infile> FASTQ input file name from SE library 5 | .bz), or DSRC |
-9 <infile> FASTQ input file name from SE library 6 | (.dsrc or .dsrc2) |
-o <outdir> path and name of the output directory (mandatory option)
-b <string> base name for output files (mandatory option)
-a <infile> to set a file containing every alien oligonucleotide sequence (one per line) to
be clipped during step 'T' (see below)
-a <string> one or several key words (separated with commas), each corresponding to a set
of alien oligonucleotide sequences to be clipped during step 'T' (see below):
POLY nucleotide homopolymers
NEXTERA Illumina Nextera index Kits
IUDI Illumina Unique Dual index Kits
AMPLISEQ AmpliSeq for Illumina Panels
TRUSIGHT_PANCANCER Illumina TruSight RNA Pan-Cancer Kits
TRUSEQ_UD Illumina TruSeq Unique Dual index Kits
TRUSEQ_CD Illumina TruSeq Combinatorial Dual index Kits
TRUSEQ_SINGLE Illumina TruSeq Single index Kits
TRUSEQ_SMALLRNA Illumina TruSeq Small RNA Kits
Note that these sets of alien sequences are not exhaustive and will never
replace the exact oligos used for library preparation (default: "POLY")
-a AUTO to perform de novo inference of 3' alien oligonucleotide sequence(s) of at
least 20 nucleotide length; selected sequences are completed with those from
"POLY" (see above)
-A <infile> to set sequence or k-mer model file(s) to carry out contaminant read removal
during step 'C'; several comma-separated file names can be specified; allowed
file extensions: .fa, .fasta, .fna, .kmr or .kmz (default: phiX174 genome)
-d <string> displays the alien oligonucleotide sequences corresponding to the specified key
word(s); see option -a for the list of available key words
-q <int> quality score threshold; all bases with Phred score below this threshold are
considered as non-confident (default: 15)
-l <int> minimum required length for a read (default: half the average read length)
-p <int> maximum allowed percentage of non-confident bases (as ruled by option -q) per
read (default: 50)
-c <int> minimum allowed coverage depth for step 'L' or 'N' (default: 4)
-C <int> maximum allowed coverage depth for step 'R' or 'N' (default: 90)
-s <string> a sequence of tasks to be iteratively performed, each being defined by one of
the following uppercase characters:
C discarding [C]ontaminating reads (as ruled by option -A)
E correcting sequencing [E]rrors
D [D]eduplicating reads
L discarding [L]ow-coverage reads (as ruled by option -c)
N digital [N]ormalization (i.e. same as consecutive steps "RL")
M [M]erging overlapping PE reads
R [R]educing redundancy (as ruled by option -C)
T [T]rimming and clipping (as ruled by options -q, -l, -p, -a)
(default: "T")
-z <string> compressed output file(s) using gzip ("gz"), bzip2 ("bz2") or DSRC ("dsrc")
(default: not compressed)
-t <int> number of threads (default: 12)
-w <dir> tmp directory (default: $TMPDIR, otherwise /tmp)
-h prints this help and exit
EXAMPLES:
fqCleanER.sh -4 se.fq -o out -b se.flt
fqCleanER.sh -1 r1.fq -2 r2.fq -o out -b pe.flt
fqCleanER.sh -1 r1.fq -2 r2.fq -a NEXTERA -q 20 -s DTENM -o out -b flt -z gz
fqCleanER is able to consider up to three paired-ends
libraries (PE; options -1
to -6
) and three
single-end libraries (SE; options -7
to -9
).
Input files should be in FASTQ format and can be compressed using gzip, bzip2 or DSRC
(Roguski and Deorowicz 2014). All input files are always converted into
Phred+33 format using fqconvert.
Output files are defined by a specified prefix (mandatory option
-b
) and written in a specified output directory (mandatory
option -o
). Output files can be compressed using gzip, bzip2 or DSRC
(option -z
).
Temporary files are written into a dedicated directory created
into the $TMPDIR
directory (when defined, otherwise
/tmp
). When possible, it is highly recommended to set a
temp directory with large capacity (option -w
).
The cleaning/enhancing steps can be specified using option
-s
in any order. The same step can be specified several
times (e.g. -s DTDNEN
).
[C]
Contaminating HTS read removal (-s C
) is performed using AlienRemover
with default options. Contaminating sequences/k-mers are
specified using option -A
. When contaminating sequences are
quite short (e.g. virus genomes), they can be directly specified via a
FASTA-formatted file without affecting the overall running time.
However, to consider large contaminating sequences (e.g. human genomes),
it is highly recommended to precompute the corresponding k-mer
set using AlienRemover
and specify the corresponding k-mer file (kmr/kmz file
extension) to observe fast running times. If the option -A
is not set, calling step C enables to remove phi-X174
HTS reads.
[D] HTS
read deduplication (-s D
) is performed using fqduplicate.
Note that a pair of PE reads (R1,R2) and (R1’,R2’) are considered as
duplicated (i.e. identical) when R1 = R1’ and R2 = R2’.
[E]
Sequencing error correction (-s E
) is performed using Musket
(Liu et al. 2013) with k-mer length k = 21. This step
generally requires quite important running times and will benefit from a
large number of threads (option -t
).
[L][N][R]
These three steps (-s L
, -s N
,
-s R
, respectively) are related to the digital
normalization procedure (Brown et al. 2012), performed using ROCK
(Legrand et al. 2022a, 2022b) with k-mer length k =
25. Given a lower-bound and a upper-bound coverage depth thresholds
(options -c
and -C
, respectively), the digital
normalization selects a subset of HTS reads such that every sequenced
base has a coverage depth between these two bounds. When setting a
moderate upper-bound (that is lower than the overall average coverage
depth; default: -C 90
), every sequenced base from the
selected HTS read subset is expected to have a coverage depth close to
this bound. When setting a small lower-bound (default:
-c 4
), all HTS reads corresponding to a sequenced region
with coverage depth lower than this bound will be discarded
(e.g. artefactual or erroneous HTS read, low-coverage contaminating HTS
read). Step N (-s N
) uses the two bounds (options
-C
and -c
), whereas steps L/R
(-s L
and -s R
, respectively) use only the
lower-/upper-bound, respectively.
[M] PE
HTS read merging (-s M
, only with PE input files) is
performed using FLASh (Magoc and
Salzberg 2011) when the insert size is shorter than the sum of the two
paired HTS read lengths. When using this step, dedicated output files
are written ( .M.fastq file extension).
[T]
Trimming and clipping (-s T
) are performed using AlienTrimmer
(Criscuolo and Brisse 2013). Clipping is carried out based on the
specified alien oligonucleotides (option -a
), where alien
oligonucleotide sequences can be (i) set using precomputed standard
library names, (ii) specified via user-defined FASTA-formatted file, or
(iii) directly estimated from the input files using AlienDiscover
(option -a AUTO
). When step T is run without setting option
-a
, clipping is carried out with the four homopolymers
(POLY
) as alien oligonucleotides. Trimming is carried out
by deleting 5’ and 3’ regions containing many non-confident bases, where
a base is considered as non-confident when its Phred score is lower than
a Phred score threshold (set using option -q
; default: 15).
After trimming/clipping an HTS read, it can be discarded when the number
of remaining bases is lower than a specified length threshold (option
-l
; default: half the average read length) or when the
percentage of remaining non-confident bases is higher than another
specified threshold (option -p
; default: 50%). Note that
when HTS read discarding breaks PE, singletons are written into
dedicated output files ( .S.fastq file extension).
Each predefined set of alien oligonucleotide sequences can be
displayed using option -d
. Some sets of alien
oligonucleotide sequences are derived from ‘Illumina Adapter
Sequences’ Document
# 1000000002694 v16, i.e. options -a NEXTERA
(Nextera DNA Indexes), -a IUDI
(IDT for
Illumina UD Indexes), -a AMPLISEQ
(AmpliSeq for
Illumina Panels), -a TRUSIGHT_PANCANCER
(TruSight
RNA Pan-Cancer Panel), -a TRUSEQ_UD
(IDT for
Illumina-TruSeq DNA and RNA UD Indexes), -a TRUSEQ_CD
(TruSeq DNA and RNA CD Indexes), -a TRUSEQ_SINGLE
(TruSeq Single Indexes), and -a TRUSEQ_SMALLRNA
(TruSeq Small RNA).
[Oligonucleotide
sequences © 2021 Illumina, Inc. All rights reserved. Derivative works
created by Illumina customers are authorized for use with Illumina
instruments and products only. All other uses are strictly
prohibited.]
Brown TC, Howe A, Zhang Q, Pyrkosz AB, Brom TH (2012) A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv:1203.4802.
Criscuolo A, Brisse S (2013) AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics, 102(5-6):500-506. doi:10.1016/j.ygeno.2013.07.011.
Legrand V, Kergrohen T, Joly N, Criscuolo A (2022a) ROCK: digital normalization of whole genome sequencing data. Journal of Open Source Software, 7(73):3790. doi:10.21105/joss.03790.
Legrand V, Kergrohen T, Joly N, Criscuolo A (2022b) ROCK: digital normalization of whole genome sequencing data. In: Lemaitre C, Becker E, Derrien T (eds), Proceedings of JOBIM 2022, Rennes, France, p. 21. doi:10.14293/S2199-1006.1.SOR-.PPNAZX5.v1.
Liu Y, Schröder J, Schmidt B (2013) Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics, 29(3):308-315. doi:10.1093/bioinformatics/bts690.
Magoc T, Salzberg S (2011) FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 27:21:2957-2963. doi:10.1093/bioinformatics/btr507.
Roguski L, Deorowicz S (2014) DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.