
AlienTrimmer

AlienTrimmer is a command line program written in Java to quickly filter out non-confident bases and alien oligo-nucleotide sequences (e.g. adapters, primers, …) from sequencing reads using an accurate alien k-mer matching approach. For more details, see the associated publication (Criscuolo and Brisse 2013).

AlienTrimmer can be used to process short-read data files (paired or not), as well as long-read ones (provided that dedicated parameters are set).

Since AlienTrimmer v2.0, this GitLab repository replaces the previous FTP repository. AlienTrimmer v2.0 (and higher) implements the same alien k-mer search algorithm as the initial release (see S1.2 here), but uses slightly different trimming criteria. Of note, AlienTrimmer v2.0 has simplified options, runs faster (especially when compiled using GraalVM), and is able to read/write gzipped FASTQ files.

AlienTrimmer v3.0 implements low residue content clipping (sometimes called low complexity trimming) via option -c. AlienTrimmer v3.1 introduces option -d to discard reads containing internal alien sequence(s), as well as recommended preset parameters for nanopore data (available via option -N).

Note that alien oligo-nucleotide sequence(s) can be easily inferred without any prior knowledge using the accompanying tool AlienDiscover.

Usage

Run AlienTrimmer without any option to display the following documentation:

 AlienTrimmer

 Fast trimming to  filter out non-confident  nucleotides and alien  oligo-nucleotide sequences
 (e.g. adapters, primers, barcodes, indexes, homopolymers, ...) in both 5' and 3' read ends
 gitlab.pasteur.fr/GIPhy/AlienTrimmer                          doi:10.1016/j.ygeno.2013.07.011

 RECOMMENDED PARAMETERS:
   Illumina short reads (default)      -k 10  -m 9   -q 13  -l 50
   Nanopore  long reads (option -N)    -k 9   -m 20  -q 13  -l 500  -d

 USAGE:  AlienTrimmer  [options]

 OPTIONS:
    -i <infile>  [SE] FASTQ-formatted input file; filename should end with .gz when gzipped
    -1 <infile>  [PE] FASTQ-formatted R1 input file; filename should end with .gz when gzipped
    -2 <infile>  [PE] FASTQ-formatted R2 input file; filename should end with .gz when gzipped
    -a <infile>  [SE/PE] input file name containing alien sequence(s);  one line per sequence;
                 lines starting with '>', '%' or '#' are not considered
  --a1 <infile>  [PE] same as -a for only R1 reads
  --a2 <infile>  [PE] same as -a for only R2 reads
    -o <name>    outfile basename: [SE] <name>.fastq[.gz]  or  [PE] <name>.{1,2,S}.fastq[.gz];
                 .gz is added when using option -z
    -k [5-15]    k-mer length k for alien sequence occurence searching (default: 10)
    -m <int>     maximum allowed number of successive non-troublesome bases in trimmed regions
                 (default: k-1)
    -q [0-40]    Phred quality score cutoff to define low-quality bases (default: 13)
    -p [0-100]   maximum allowed percentage of low-quality bases per read (default: 50)
    -c <int>     minimum low residue content region length to clip (default: 0)
    -l <int>     minimum allowed read length (default: 50)
    -d           to discard all reads containing too many internal alien k-mers after trimming
                 (default: not set)
    -N           use preset parameters for nanopore reads: -k 9 -m 20 -q 13 -p 50 -l 500 -d
 --p64           Phred+64 FASTQ input file(s) (default: Phred+33)
    -z           gzipped output files (default: not set)
    -v           verbose mode (default: not set)

 EXAMPLES:
  [SE]  AlienTrimmer -i short-reads.fq       -a aliens.fa  -o trim -p 20 -l 30 -v
  [SE]  AlienTrimmer -i short-reads.fq.gz    -a aliens.txt -o trim -k 11 -q 7 -c 50
  [SE]  AlienTrimmer -i long-reads.fastq     -a aliens.fa  -o trim -N -l 1000
  [PE]  AlienTrimmer -1 r1.fq.gz -2 r2.fq.gz -a aliens.fa  -o trim -m 8 -p 25 -z
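
The input format of option -a and the role of the k-mer length (option -k) can be illustrated with a minimal Python sketch. It is not AlienTrimmer's Java implementation: the function names `read_alien_sequences` and `kmer_set` are hypothetical, the example lines are made up for illustration, and the number of k-mers reported by AlienTrimmer itself may be computed differently.

```python
def read_alien_sequences(lines):
    """Return the alien sequences found in an iterable of text lines.

    Follows the description of option -a: one sequence per line, and lines
    starting with '>', '%' or '#' are not considered. Skipping blank lines
    and upper-casing the sequences are assumptions of this sketch.
    """
    sequences = []
    for line in lines:
        line = line.strip()
        if not line or line[0] in ">%#":
            continue
        sequences.append(line.upper())
    return sequences


def kmer_set(sequences, k=10):
    """Collect every distinct substring of length k from the sequences."""
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


# Illustrative content only (not the aliens.fa file shipped in example/).
example_lines = [
    ">adapter fragment",
    "AGATCGGAAGAGC",
    "# a comment line",
    "AAAAAAAAAAAA",
]
aliens = read_alien_sequences(example_lines)
kmers = kmer_set(aliens, k=10)
print(len(aliens), len(kmers))  # prints: 2 5
```

The 13-base fragment yields four distinct 10-mers, whereas the 12-base poly-A yields only one, which is why homopolymers contribute few k-mers.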

Notes

Example

The two files example.fastq.gz and aliens.fa can be found in the directory example/. The gzipped file example.fastq.gz contains six FASTQ-formatted reads, and the FASTA-formatted file aliens.fa contains three oligonucleotide sequences to be trimmed off: an indexed TruSeq adapter and the two homopolymeric segments poly-A and poly-C. Of note, it is highly recommended to use AlienTrimmer with these last two homopolymeric segments as aliens; indeed, library preparation oligonucleotides occurring in sequencing reads are often followed by a stretch of A’s or G’s that should also be trimmed off (see e.g. Criscuolo and Brisse 2014).

The following command line can be run to filter out troublesome bases located at both 5’ and 3’ ends:

AlienTrimmer -i example.fastq.gz -a aliens.fa -o example.trim -v

As the verbose mode is set (option -v), the following information can be read:

AlienTrimmer

FASTQ file:         example.fastq.gz
main options:       -k 10  -m 9  -q 13  -l 50  -p 50
outfile:            example.trim.fastq
no. alien k-mers:   118

@NOHUB:69:GIPHY:2:1102:17743:8484 1:N:0:ATCCTTCC
                                                                                                               ========================================
TTCGTAATTGAGTTCCATCAAGAGCAAACTTATCGAGATCGAGTCAATTATTAACGTGTTCAATCAGTGCTTTTCCTAATTCAGCAGCTTCTGAATCGCCGCTATAGGTGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCCTT
             *                              *                                            *                                                   *  *      
                                                                                                               <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

@NOHUB:69:GIPHY:2:1103:25789:13463 1:N:0:ATCTTTCC
                                                                   ===================================== ============================       ===========
TTGCACATCAATGTAGTCAAACTCGCAAATGGAAAGAATTAGAAAAAGATTTCTTTAAAAAATTATGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCTTTCCATCTCGTATGCCGTCTTCTGCTTGAAAATGTGGGGGGGGGGG
  *        *    *             *      *           * *          ** ** *   *              *        *       *   * * *         ** * *        *  *  ** *  ***
                                                              <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

@NOHUB:69:GIPHY:2:1104:16893:2253 1:N:0:ATCCTTCC
                                   ===========                                                                                                         
ATTTGCGCTCAAAAAAAGACAACAAAGATAATTGATTTTTTTTTTTAAACATCAGAAGAAAACTTCTCCACACAACGAACAAACATTTCTACACCCATAGACAAAACAGTTTCATCAAAATCAAAACGAGGATGATGATGAGGATAAGCTA
 * * **  *** ****                                 *                 *                                   *                 ** *** * *** *  ***  *    *  
>>>>>>>>>>>>>>>>>                                                                                                         <<<<<<<<<<<<<<<<<<<<<<<<<<<<<

@NOHUB:69:GIPHY:2:1105:17562:2597 1:N:0:ATCCTTCC   [DISCARDED]
              ====================     =======================                                                 ========================================
TTGTTGAAACATCACCCCCCCCCCCCCCCCCCCCACCCACCCCCCCCCCCCCCCCCCCCCCCACCAATTAAAAAAAAACCAAAAATAGGAAACATAAAAATAATATATAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
             *                 *   ***********                ********** ** * ****** **** *** ** ** ****** **** * *   *   *  *             * *    *   *
             <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

infile:             6 reads   906 bases
outfile:            5 reads   580 bases

Total running time: 0min 00sec

Each trimmed read is displayed with highlighted troublesome bases, i.e. alien k-mers (=) and non-confident bases (*). Trimmed regions are indicated with > (5’) and < (3’).
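
The two marker lines can be mimicked with a short Python sketch. It is a simplified illustration under stated assumptions, not the algorithm of the publication: alien k-mers are searched by exact matching only, a base is taken as non-confident when its Phred score is strictly below the cutoff (option -q), and the function names, read and quality string are hypothetical.

```python
def mark_alien_kmers(read, alien_kmers, k):
    """Put '=' under every base covered by at least one alien k-mer."""
    marks = [" "] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in alien_kmers:
            for j in range(i, i + k):
                marks[j] = "="
    return "".join(marks)


def mark_low_quality(quality, cutoff=13, offset=33):
    """Put '*' under every base whose Phred score is below the cutoff.

    offset is 33 for Phred+33 input and 64 for Phred+64 input (option --p64).
    Using a strict comparison is an assumption of this sketch.
    """
    return "".join("*" if ord(c) - offset < cutoff else " " for c in quality)


# Hypothetical 17-base read ending with an adapter fragment.
adapter = "AGATCGGAAGAGC"
k = 10
alien_kmers = {adapter[i:i + k] for i in range(len(adapter) - k + 1)}
read = "ACGT" + adapter
quality = "IIII#IIIIIIII,,II"  # 'I' = Q40, '#' = Q2, ',' = Q11 in Phred+33

alien_marks = mark_alien_kmers(read, alien_kmers, k)
quality_marks = mark_low_quality(quality)
low_quality_percent = 100 * quality_marks.count("*") / len(read)

print(alien_marks)
print(read)
print(quality_marks)
```

The value `low_quality_percent` corresponds to the quantity that option -p bounds (maximum allowed percentage of low-quality bases per read).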

The first two reads end with the specified TruSeq adapter, and both are trimmed accordingly.

The third read contains many non-confident bases at both 5’ and 3’ ends, which are trimmed off accordingly.

The fourth read seems artefactual, as it is made up of homopolymeric regions and non-confident bases. After trimming, only 13 bases remain, so the read is discarded because it is shorter than the minimum allowed length of 50 bases (option -l).
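
The infile/outfile summary can be checked by hand, as in the Python sketch below. It assumes that all six reads are 151 bases long (906 bases / 6 reads) and that the two reads not displayed in the verbose output are left untrimmed; the trimmed lengths are counted from the > and < marker lines above.

```python
READ_LENGTH = 151  # assumption: 906 input bases / 6 reads
MIN_LENGTH = 50    # default value of option -l

# (bases trimmed at the 5' end, bases trimmed at the 3' end) for each read,
# counted from the '>' and '<' marker lines of the verbose output; the last
# two entries stand for the reads that are not displayed (assumed untrimmed).
trims = [(0, 40), (0, 89), (17, 29), (0, 138), (0, 0), (0, 0)]

# Remaining length of each read after trimming both ends.
kept = [READ_LENGTH - five_prime - three_prime
        for five_prime, three_prime in trims]

# Reads shorter than the minimum allowed length are discarded.
written = [n for n in kept if n >= MIN_LENGTH]

print(kept)                       # [111, 62, 105, 13, 151, 151]
print(len(written), sum(written))  # 5 580
```

The result agrees with the reported summary (5 reads, 580 bases): only the fourth read, with 13 remaining bases, falls below the minimum length.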

The previous command line can be extended with option -c to also filter out low residue content regions:

AlienTrimmer -i example.fastq.gz -a aliens.fa -o example.trim -c 50 -v

leading to the following output:

AlienTrimmer

FASTQ file:         example.fastq.gz
main options:       -k 10  -m 9  -q 13  -l 50  -p 50  -c 50
outfile:            example.trim.fastq
no. alien k-mers:   118

@NOHUB:69:GIPHY:2:1102:17743:8484 1:N:0:ATCCTTCC
                                                                                                               ========================================
TTCGTAATTGAGTTCCATCAAGAGCAAACTTATCGAGATCGAGTCAATTATTAACGTGTTCAATCAGTGCTTTTCCTAATTCAGCAGCTTCTGAATCGCCGCTATAGGTGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCCTT
             *                              *                                            *                                                   *  *      
                                                                                                               <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

@NOHUB:69:GIPHY:2:1103:25789:13463 1:N:0:ATCTTTCC
                                                                   ===================================== ============================       ===========
TTGCACATCAATGTAGTCAAACTCGCAAATGGAAAGAATTAGAAAAAGATTTCTTTAAAAAATTATGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCTTTCCATCTCGTATGCCGTCTTCTGCTTGAAAATGTGGGGGGGGGGG
  *        *    *             *      *           * *          ** ** *   *              *        *       *   * * *         ** * *        *  *  ** *  ***
                                                              <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

@NOHUB:69:GIPHY:2:1104:16893:2253 1:N:0:ATCCTTCC
                                   ===========                                                                                                         
ATTTGCGCTCAAAAAAAGACAACAAAGATAATTGATTTTTTTTTTTAAACATCAGAAGAAAACTTCTCCACACAACGAACAAACATTTCTACACCCATAGACAAAACAGTTTCATCAAAATCAAAACGAGGATGATGATGAGGATAAGCTA
 * * **  *** ****                                 *                 *                                   *                 ** *** * *** *  ***  *    *  
>>>>>>>>>>>>>>>>>                                                                                                         <<<<<<<<<<<<<<<<<<<<<<<<<<<<<

@NOHUB:69:GIPHY:2:1105:17562:2597 1:N:0:ATCCTTCC   [DISCARDED]
              ====================     =======================                                                 ========================================
TTGTTGAAACATCACCCCCCCCCCCCCCCCCCCCACCCACCCCCCCCCCCCCCCCCCCCCCCACCAATTAAAAAAAAACCAAAAATAGGAAACATAAAAATAATATATAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
      *************************************************************************************************************************************************
      <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

@NOHUB:69:GIPHY:2:1107:13475:6174 1:N:0:ATCCTTCC   [DISCARDED]
                                                                                                                                                       
CGCTGTGCGCCAACCTGGCCGAAGTGCGCGCCATGCGGCGGGGCGGGCGCGCCGCGCCGCGCGCGCGCGCGCGCGCGCGCGCGCGCCGCGCGCGCCGCGCGCGCGCGCCGCGCCGCGCGCCGCCGCCGCGCCGCGCACACACACACACACA
                ***************************************************************************************************************************************
                <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

infile:             6 reads   906 bases
outfile:            4 reads   429 bases

Total running time: 0min 00sec

Interestingly, option -c makes it possible to detect and discard another read, which is likely artefactual as it mainly consists of repeated heteropolymers.

References

Criscuolo A, Brisse S (2013) ALIENTRIMMER: A tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics, 102(5-6):500-506. doi:10.1016/j.ygeno.2013.07.011

Criscuolo A, Brisse S (2014) AlienTrimmer removes adapter oligonucleotides with high sensitivity in short-insert paired-end reads. Commentary on Turner (2014) Assessment of insert sizes and adapter content in FASTQ data from NexteraXT libraries. Frontiers in Genetics, 5:130. doi:10.3389/fgene.2014.00130

Chen S, Zhou Y, Chen Y, Gu J (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17):i884-i890. doi:10.1093/bioinformatics/bty560