Dependencies | Installation and execution | Usage | Notes | References | Citations

GPLv3 license Bash


fqCleanER (fastq Cleaning and Enhancing Routine) is a command line tool written in Bash to ease the different standard preprocessing steps of short high-throughput sequencing (HTS) reads.

Eight standard HTS read processing steps can be carried out:
  ❶   HTS read [D]eduplication, using fqduplicate from the fqtools package,
  ❷   HTS read [T]rimming and clipping, using AlienTrimmer (Criscuolo and Brisse 2013),
  ❸   paired-end HTS read [M]erging, using FLASh (Magoc and Salzberg 2011),
  ❹   [C]ontaminating HTS read removal, using AlienRemover,
  ❺   sequencing [E]rror correction, using Musket (Liu et al. 2013),
  ❻   [L]ow-coverage HTS read removal, using ROCK (Legrand et al. 2022a, 2022b),
  ❼   digital [N]ormalization, using ROCK (Legrand et al. 2022a, 2022b),
  ❽   high-coverage (redundant) HTS read [R]eduction, using ROCK (Legrand et al. 2022a, 2022b).

All these steps can be performed in any order on up to three paired- and/or single-end FASTQ files (compressed or not).
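As a minimal sketch (not part of fqCleanER itself), a sequence of steps can be checked before launching a run. The function name `validate_steps` is hypothetical; the sketch only assumes that each step is designated by one of the eight bracketed uppercase letters listed above.

```bash
# Hypothetical helper: succeeds only if the step string is non-empty and
# made exclusively of the eight step letters listed above (C D E L M N R T).
validate_steps() {
  case "$1" in
    ''|*[!CDELMNRT]*) return 1 ;;   # empty, or contains another character
    *)                return 0 ;;
  esac
}

for steps in "DTENM" "TXC"; do
  if validate_steps "$steps"; then
    echo "$steps: valid step sequence"
  else
    echo "$steps: invalid step sequence"
  fi
done
```

The check deliberately accepts any order and any repetition of letters, since the steps can be combined freely.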

fqCleanER runs on UNIX, Linux and most OS X operating systems.


Dependencies

You will need to install the required programs listed in the following table, or verify that they are already installed with the required version.

program                                      package         version
gawk                                         -               > 4.0.0
xargs ★                                      GNU findutils   ≥ 4.6.0
bzip2                                        -               > 1.0.0
DSRC                                         -               ≥ 2.0
gzip                                         -               > 1.5.0
AlienDiscover                                -               ≥ 0.1
AlienRemover                                 -               ≥ 1.0
AlienTrimmer                                 -               ≥ 2.1
FLASh                                        -               > 1.2.10
fqconvert, fqduplicate, fqextract, fqstats   fqtools         ≥ 1.1a
Musket ✦                                     -               ≥ 1.1
ntCard                                       -               > 1.2
ROCK                                         -               ≥ 2.1

 ★ On some Mac OS X systems, the default BSD xargs does not offer all the functionalities required by fqCleanER. However, the expected GNU xargs (named gxargs there) can be easily installed using homebrew (i.e. brew install findutils). Note that fqCleanER first looks for the gxargs binary on the $PATH and, if it is missing, for the xargs binary.
 ✦ When compiling the source code of Musket, it is recommended to edit its Makefile to increase the value of the macro MAX_SEQ_LENGTH (e.g. 1000) in order to avoid any problem during the execution of fqCleanER.
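Before running fqCleanER, the availability of the required programs can be checked by hand. The following is a minimal sketch, not taken from the fqCleanER code: the function name `first_available` is hypothetical, and the binary names are the default ones listed in this README (adapt them to your installation). It mirrors the gxargs-then-xargs lookup order described in the ★ note.

```bash
# Hypothetical helper: prints the first program name among its arguments
# that is found on the $PATH; returns non-zero if none is available.
first_available() {
  for prog in "$@"; do
    if command -v "$prog" >/dev/null 2>&1; then
      echo "$prog"
      return 0
    fi
  done
  return 1
}

# Prefer gxargs (GNU xargs installed via homebrew) over the default xargs
XARGS_BIN=$(first_available gxargs xargs) || echo "no xargs binary found" >&2
echo "xargs binary: ${XARGS_BIN:-none}"

# Report some other required programs that are missing from the $PATH
for prog in gawk bzip2 gzip; do
  command -v "$prog" >/dev/null 2>&1 || echo "missing: $prog" >&2
done
```

This only tests for presence on the $PATH; it does not verify the versions required in the table.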

Installation and execution

A. Clone this repository with the following command line:

git clone

B. Give the execute permission to the file

chmod +x

C. Execute fqCleanER with the following command line model:

./  [options]

D. If at least one of the required programs (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), fqCleanER will exit with an error message. When running fqCleanER without options, the usage documentation should be displayed (see below); otherwise, the name of the missing program is displayed before exiting. In such a case, edit the file and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS (approximately lines 80-220). For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block REQUIREMENTS.

program         variable assignment                 program       variable assignment
AlienDiscover   ALIENDISCOVER_BIN=AlienDiscover;    fqconvert     FQCONVERT_BIN=fqconvert;
AlienRemover    ALIENREMOVER_BIN=AlienRemover;      fqduplicate   FQDUPLICATE_BIN=fqduplicate;
AlienTrimmer    ALIENTRIMMER_BIN=AlienTrimmer;      fqextract     FQEXTRACT_BIN=fqextract;
bzip2           BZIP2_BIN=bzip2;                    fqstats       FQSTATS_BIN=fqstats;
DSRC            DSRC2_BIN=dsrc;                     Musket        MUSKET_BIN=musket;
FLASh           FLASH_BIN=flash;                    ntCard        NTCARD_BIN=ntcard;
gawk            GAWK_BIN=gawk;                      ROCK          ROCK_BIN=rock;
gzip            GZIP_BIN=gzip;                      xargs         XARGS_BIN=xargs;
                                                                  GXARGS_BIN=gxargs; (OS X)

Note that, depending on how a required program is installed, the corresponding variable can be assigned a more complex command. For example, as AlienTrimmer is a Java tool that can be run using a Java virtual machine, the executable jar file AlienTrimmer.jar can be used by fqCleanER after editing the corresponding variable assignment instruction as follows: ALIENTRIMMER_BIN="java -jar AlienTrimmer.jar".
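The following sketch illustrates why a multi-word command can be stored in such a variable: when the variable is expanded without quotes, the shell splits its value into a command and its arguments. It is a dry run in which a leading `echo` prints the command instead of executing it, and `arg1 arg2` are placeholder arguments. Whether fqCleanER expands its *_BIN variables exactly this way is an assumption that was not checked against the script.

```bash
# Dry-run stand-in for ALIENTRIMMER_BIN="java -jar AlienTrimmer.jar":
# the leading 'echo' prints the command line instead of running Java.
ALIENTRIMMER_BIN="echo java -jar AlienTrimmer.jar"

# Unquoted expansion: the value is split into words, so the first word is
# run as the command and the remaining words are passed as its arguments.
out=$($ALIENTRIMMER_BIN arg1 arg2)
echo "$out"
```

Quoting the expansion (i.e. "$ALIENTRIMMER_BIN") would instead make the shell look for a single program whose name is the whole string, which fails.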


Usage

Run fqCleanER without options to display the following documentation:

 USAGE:  [options] 

  -1 <infile>   fwd (R1) FASTQ input file name from PE library 1         | input files can be |
  -2 <infile>   rev (R2) FASTQ input file name from PE library 1         | uncompressed (file |
  -3 <infile>   fwd (R1) FASTQ input file name from PE library 2         | extensions  .fastq |
  -4 <infile>   rev (R2) FASTQ input file name from PE library 2         | or    .fq),     or |
  -5 <infile>   fwd (R1) FASTQ input file name from PE library 3         | compressed   using |
  -6 <infile>   rev (R2) FASTQ input file name from PE library 3         | either gzip (.gz), |
  -7 <infile>   FASTQ input file name from SE library 4                  | bzip2   (.bz2   or |
  -8 <infile>   FASTQ input file name from SE library 5                  | .bz),   or    DSRC |
  -9 <infile>   FASTQ input file name from SE library 6                  | (.dsrc  or .dsrc2) |
  -o <outdir>   path and name of the output directory (mandatory option)
  -b <string>   base name for output files (mandatory option)
  -a <infile>   to set a file containing every alien oligonucleotide sequence (one per line) to
                be clipped during step 'T' (see below)
  -a <string>   one or several key words  (separated with commas),  each corresponding to a set
                of alien oligonucleotide sequences to be clipped during step 'T' (see below):
                   POLY                nucleotide homopolymers
                   NEXTERA             Illumina Nextera index Kits
                   IUDI                Illumina Unique Dual index Kits
                   AMPLISEQ            AmpliSeq for Illumina Panels
                   TRUSIGHT_PANCANCER  Illumina TruSight RNA Pan-Cancer Kits
                   TRUSEQ_UD           Illumina TruSeq Unique Dual index Kits
                   TRUSEQ_CD           Illumina TruSeq Combinatorial Dual index Kits
                   TRUSEQ_SINGLE       Illumina TruSeq Single index Kits
                   TRUSEQ_SMALLRNA     Illumina TruSeq Small RNA Kits
                Note that  these sets  of alien  sequences are  not  exhaustive  and will never
                replace the exact oligos used for library preparation  (default: "POLY")
  -a AUTO       to perform  de novo  inference of  3' alien  oligonucleotide sequence(s)  of at
                least 20 nucleotide length;  selected sequences  are completed  with those from
                "POLY" (see above)
  -A <infile>   to set sequence or k-mer  model file(s)  to carry out  contaminant read removal
                during step 'C';  several comma-separated file names can be specified;  allowed
                file extensions: .fa, .fasta, .fna, .kmr or .kmz (default: phiX174 genome)
  -d <string>   displays the alien oligonucleotide sequences corresponding to the specified key
                word(s); see option -a for the list of available key words
  -q <int>      quality score threshold;  all bases with Phred  score below  this threshold are
                considered as non-confident (default: 15)
  -l <int>      minimum required length for a read (default: half the average read length)
  -p <int>      maximum allowed percentage  of non-confident bases  (as ruled by option -q) per
                read (default: 50) 
  -c <int>      minimum allowed coverage depth for step 'L' or 'N' (default: 4)
  -C <int>      maximum allowed coverage depth for step 'R' or 'N' (default: 90)
  -s <string>   a sequence of tasks  to be iteratively performed,  each being defined by one of
                the following uppercase characters:
                   C   discarding [C]ontaminating reads (as ruled by option -A)
                   E   correcting sequencing [E]rrors
                   D   [D]eduplicating reads
                   L   discarding [L]ow-coverage reads (as ruled by option -c)
                   N   digital [N]ormalization (i.e. same as consecutive steps "RL")
                   M   [M]erging overlapping PE reads
                   R   [R]educing redundancy (as ruled by option -C)
                   T   [T]rimming and clipping (as ruled by options -q, -l, -p, -a)
                (default: "T")
  -z <string>   compressed output  file(s) using  gzip ("gz"),  bzip2 ("bz2")  or DSRC ("dsrc")
                (default: not compressed)
  -t <int>      number of threads (default: 12)
  -w <dir>      tmp directory (default: $TMPDIR, otherwise /tmp)
  -h            prints this help and exit

 EXAMPLES:
    -4 se.fq -o out -b se.flt
    -1 r1.fq -2 r2.fq -o out -b pe.flt
    -1 r1.fq -2 r2.fq -a NEXTERA -q 20 -s DTENM -o out -b flt -z gz
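The default value of option -l is half the average read length. As a rough illustration, this value can be estimated with awk. The sketch assumes an uncompressed FASTQ file with exactly four lines per record and truncates the result to an integer; the exact computation and rounding used by fqCleanER may differ. The function name `half_avg_read_length` and the sample reads are made up for the example.

```bash
# Prints half the average read length of an uncompressed FASTQ file,
# assuming four lines per record (the sequence is the 2nd line of each).
half_avg_read_length() {
  awk 'NR % 4 == 2 { n++; total += length($0) }
       END { if (n > 0) print int(total / n / 2); else print 0 }' "$1"
}

# Tiny made-up FASTQ file containing two reads of lengths 8 and 12
tmpfq=$(mktemp)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' > "$tmpfq"
half_avg_read_length "$tmpfq"   # average length is 10, so this prints 5
rm -f "$tmpfq"
```

For compressed input, the file would first need to be decompressed (e.g. piped through gzip -dc) before being passed to awk.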



References

Brown TC, Howe A, Zhang Q, Pyrkosz AB, Brom TH (2012) A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv:1203.4802.

Criscuolo A, Brisse S (2013) AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics, 102(5-6):500-506. doi:10.1016/j.ygeno.2013.07.011.

Legrand V, Kergrohen T, Joly N, Criscuolo A (2022a) ROCK: digital normalization of whole genome sequencing data. Journal of Open Source Software, 7(73):3790. doi:10.21105/joss.03790.

Legrand V, Kergrohen T, Joly N, Criscuolo A (2022b) ROCK: digital normalization of whole genome sequencing data. In: Lemaitre C, Becker E, Derrien T (eds), Proceedings of JOBIM 2022, Rennes, France, p. 21. doi:10.14293/S2199-1006.1.SOR-.PPNAZX5.v1.

Liu Y, Schröder J, Schmidt B (2013) Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics, 29(3):308-315. doi:10.1093/bioinformatics/bts690.

Magoc T, Salzberg S (2011) FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 27(21):2957-2963. doi:10.1093/bioinformatics/btr507.

Roguski L, Deorowicz S (2014) DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.


Citations

Abou Fayad A, Rafei R, Njamkepo E, Ezzeddine J, Hussein H, Sinno S, Gerges J-R, Barada S, Sleiman A, Assi M, Baakliny M, Hamedeh L, Mahfouz R, Dabboussi F, Feghali R, Mohsen Z, Rady A, Ghosn N, Abiad F, Abubakar A, Barakat A, Wauquier N, Quilici M-L, Hamze M, Weill F-X, Matar GM (2024) An unusual two-strain cholera outbreak in Lebanon, 2022-2023: a genomic epidemiology study. Nature Communications, 15:6963. doi:10.1038/s41467-024-51428-0

Bouchier C, Touak G, Rei D, Clermont D (2024) Complete genome sequence of two Christensenella minuta strains CIP 112228 and CIP 112229, isolated from human fecal samples. Microbiology Resource Announcements, 13(12):e00766-24. doi:10.1128/mra.00766-24

Charlier C, Noel C, Hafner L, Moura A, Mathiaud C, Pitsch A, Meziane C, Jolly-Sanchez L, de Pontfarcy A, Diamantis S, Bracq-Dieye H, Disson O, Thouvenot P, Valès G, Tessaud-Rita N, Tourdjman M, Leclercq A, Lecuit M (2023) Fatal neonatal listeriosis following L. monocytogenes horizontal transmission highlights neonatal susceptibility to orally acquired listeriosis. Cell Reports Medicine, 4(7):101094. doi:10.1016/j.xcrm.2023.101094

Chavarría-Pizarro L, Rivera-Méndez W, Núñez-Montero K, Pizarro-Cerdá J (2024) Novel strains of Actinobacteria associated with neotropical social wasps (Vespidae; Polistinae, Epiponini) with antimicrobial potential for natural product discovery. FEMS Microbes, 5:xtae005. doi:10.1093/femsmc/xtae005

Gravey F, Sévin C, Castagnet S, Foucher N, Maillard K, Tapprest J, Léon A, Langlois B, Le Hello S, Petry S (2024) Antimicrobial resistance and genetic diversity of Klebsiella pneumoniae strains from different clinical sources in horses. Frontiers in Microbiology, 14:1334555. doi:10.3389/fmicb.2023.1334555

Jalalizadeh F, Njamkepo E, Weill F-X, Goodarzi F, Rahnamaye-Farzami M, Sabourian R, Bakhshi B (2024) Genetic approach toward linkage of Iran 2012–2016 cholera outbreaks with 7th pandemic Vibrio cholerae. BMC Microbiology, 24:33. doi:10.1186/s12866-024-03185-9

Markovich Y, Palacios-Gorba C, Gomis J, Gómez-Martín A, Ortolá S, Quereda JJ (2024) Phenotypic and genotypic antimicrobial resistance of Listeria spp. in Spain. Veterinary Microbiology, 293:110086. doi:10.1016/j.vetmic.2024.110086

Mornico D, Hon C-C, Koutero M, Weber C, Coppée J-Y, Clark CG, Dillies M-A, Guillen N (2022) RNA Sequencing Reveals Widespread Transcription of Natural Antisense RNAs in Entamoeba Species. Microorganisms, 10(2):396. doi:10.3390/microorganisms10020396

Moura A, Leclercq A, Vales G, Tessaud-Rita N, Bracq-Dieye H, Thouvenot P, Madec Y, Charlier C, Lecuit M (2023) Phenotypic and genotypic antimicrobial resistance of Listeria monocytogenes: an observational study in France. The Lancet Regional Health Europe, 37:100800. doi:10.1016/j.lanepe.2023.100800

Pottier M, Castagnet S, Gravey F, Leduc G, Sévin C, Petry S, Giard J-C, Le Hello S, Léon A (2023) Antimicrobial Resistance and Genetic Diversity of Pseudomonas aeruginosa Strains Isolated from Equine and Other Veterinary Samples. Pathogens, 12(1):64. doi:10.3390/pathogens12010064

Rouard C, Greig DR, Tauhid T, Dupke S, Njamkepo E, Amato E, van der Putten B, Naseer U, Blaschitz M, Mandilara GD, Cohen SJ, Indra A, Noël H, Sideroglou T, Heger F, van den Beld M, Wester AL, Quilici M-L, Scholz HC, Fröding I, Jenkins C, Weill F-X (2024) Genomic analysis of Vibrio cholerae O1 isolates from cholera cases, Europe, 2022. Eurosurveillance, 29(36):pii=2400069. doi:10.2807/1560-7917.ES.2024.29.36.2400069

Rouard C, Njamkepo E, Quilici M-L, Nguyen S, Knight-Connoni V, Šafránková R, Weill F-X (2024) Vibrio cholerae serogroup O5 was responsible for the outbreak of gastroenteritis in Czechoslovakia in 1965. Microbial Genomics, 10(9). doi:10.1099/mgen.0.001282

Vautrin N, Alexandre K, Pestel-Caron M, Bernard E, Fabre R, Leoz M, Dahyot S, Caron F (2023) Contribution of Antibiotic Susceptibility Testing and CH Typing Compared to Next-Generation Sequencing for the Diagnosis of Recurrent Urinary Tract Infections Due to Genetically Identical Escherichia coli Isolates: a Prospective Cohort Study of Cystitis in Women. Microbiology Spectrum, 11(4):e02785-22. doi:10.1128/spectrum.02785-22

Yassine I, Hansen EE, Lefèvre S, Ruckly C, Carle I, Lejay-Collin M, Fabre L, Rafei R, Pardos de la Gandara M, Daboussi F, Shahin A, Weill F-X (2023) ShigaPass: an in silico tool predicting Shigella serotypes from whole-genome sequencing assemblies. Microbial Genomics, 9(3). doi:10.1099/mgen.0.000961