Reaper is a program for demultiplexing, trimming and filtering short read sequencing data. It is intended to be suitable for dealing with a wide range of sequencing protocols, and will be updated to support new protocols where the current set of features does not suffice. It has the following features.
Choose the appropriate geometry for your data: one of no-bc (no barcode), 5p-bc (5' barcode) or 3p-bc (3' barcode). This will be the argument to the -geom option. The input file, normally (but not necessarily) a FASTQ file, is specified by giving its name as argument to the -i option.
If your files are barcoded, find out what the barcodes are.
If possible, try to find out the 3' adapter sequence and the 5' adapter sequence that were used. The former will be stripped from the read if present. The latter, if specified, is treated as a tabu sequence. A read will be discarded if it is found to contain any tabu sequence.
If you do not know the 3' adapter it is possible to search for it in the input files with the program minion, that is shipped with reaper. For a FASTQ file called input.fq.gz usage is as follows:
By default minion will take the first 2 million reads. Use the -do option to change this.
Create a metadata file describing barcodes and adapter sequences. The required columns in this file depend on the geometry, as described in Read geometry. A line must be present for each barcode. The 3' adapter must be specified in the 3p-ad column. If you do not know the adapter, supply either an empty field or a single hyphen. The tabu column is similarly required. Use either the 5' adapter if known, or supply an empty field or a single hyphen.
The metadata file must be tab-separated with all fields present as described in Read geometry, and each row must have the same number of fields. Neither missing fields nor spurious fields are allowed. If a field is allowed to be empty it can be achieved either by leaving the field empty, or by using a single hyphen. Below is a list of example metadata files, one for each geometry. Note that copying and pasting such an example will yield a file with spaces rather than tab characters, and this would need fixing.
In this example no sequence inserts were used, but the 5p-si (5' sequence insert) column is required to be present. All its fields are empty, as indicated by the use of a single hyphen. It can be further be noted that the only variable column is that containing the barcodes, as is to be expected.
The column headers (and hence associated columns) do not need to be in a particular order.
If the reads are barcoded, you will have created a metadata file. Suppose the geometry is 5p-bc, the FASTQ input is in a file called sample.fastq and the metadata file is called sample.txt. The basic reaper command line invocation then is:
or
Note that reaper places no requirements on file name suffixes, so both data file and metadata file may have arbitrary names. Additionally, the file containing the reads is allowed to be gzipped.
If the reads are not barcoded, it is possible to run reaper with or without a metadata file. The metadata file requires just the columns 3p-ad and tabu. Again, unknown fields can be specified by leaving the field empty or specifying a single hyphen. The invocation is similar as before, with just the -geom option changed:
In the absence of barcodes, it is possible to run reaper without a metadata file. In this case the options -3pa and -tabu are required. If the corresponding sequences are not known, specify an empty string again either as "" or as a single hyphen.
By default output files are created that share the prefix out. Use the -basename option to change this prefix. The supplied prefix may contain a path (i.e. directory components), but the path must already exist as reaper will not create it.
Three geometries are supported. A geometry is a description of what a read looks like, i.e. the read design. The supported geometries are described in greater detail below.
In the read representation below, a part enclosed in () indicates that the part must be present in the read, a part enclosed in [] is optional. A part enclosed in <> should not be present. This is unique to the 5' adapter part of a read, as the sequencing primer anneals to the 5' adapter, and hence the latter should not be seen in the raw read. Currently we have no means to strip a read of a such a match and proceed. The occurrence of this phenomenon and the quality of the remaining read should be quantified and any prospected gain evaluated before such a scheme is implemented.
The 5' sequence insert is special. If such a sequence insert is present in the read design, (a.k.a. geometry), it should be present in all reads. For simplicity, the corresponding 5p-si column is always required; if no sequence insert was used the fields in this column can be left empty. In the meta file, the following columns are required by each of the geometries, although some of them may be empty.
This data is used as follows, for each read.
If present, the meta file will normally have more than one data row, each row representing a different barcode, and the barcode column will normally be the only column that assumes different values.
If absent, the meta file should have only a single data row. For the 3p-bc geometry, reaper constructs the concatenation of the sequences in columns 3p-si, barcode, and 3p-ad (note: 3p-si may be empty). It does this for all different barcodes, and each such concatenation is aligned to the read. The best match, if it passes the alignment requirements, determines the barcode bucket to which the read is allocated.
For the 5p-bc geometry, a matching barcode at the moment is required to be present at the beginning of the read. Mismatches may occur and the extent and types of allowable mismatches are under control of the user. Conceivably people might start to add random sequence inserts before their barcodes; this is not yet supported.
For all geometries the tabu column is treated the same. It may be empty, or it may contain one stretch of sequence, or a list of comma-separated sequences. If a sequence is specified, it will be aligned to the read. If a match is found that exceeds the requirements as specified by the program option -mr-tabu, the read is discarded. There is currently no provision to simply remove the matching part. We currently suggest to supply the 5' adapter sequence, if known, as a tabu sequence.
The 3p-si column is only relevant for the 3p-bc geometry. It is part of the concatenated sequence constructed and aligned as described above.
For the 5p-si column, if present, an alignment is attempted against the leading part of the read. This alignment should pass the -5p-sinsert alignment criteria in order for the read to pass.
When aligning reads with bits of sequence (adapters, barcodes and inserts), reaper computes a full local alignment matrix for the two sequences, sometimes on a restricted part of one of the two sequences to be aligned. The alignment matrix is used in various ways.
For checking matches with the 3' adapter and tabu contamination, the highest scoring local alignment is considered. Neither read nor sequence to be matched is restricted.
Additionally, for the 3' adapter matches at the end of the read and the start of the adapter are considered.
For a 5' barcode and for a 5' sequence insert the aligment matrix is restricted to an initial part of the read. The length of this part is equal to the length of the sequence to match plus the maximal allowed edit distance.
A fundamental test used by Reaper is whether an alignment between two sequences is considered to be a true match. These criteria are to a very large extent under control of the user. The test is generically described as follows. Two sequences match if the highest scoring alignment satisfies:
The alignment stretches over at least L aligned bases.
Along this stretch at most E edits (substitutions or gaps) are found.
Along this stretch at most G gaps are found.
This is described in greater detail in the sections below. Importantly, if no gaps are allowed, the alignment algorithm will not generate alignments with gaps.
Three prime adapter contanimation almost always involves the leading part of the adapter, and using the full adapter sequence increases the chance of spurious matches. To this end, it is not advisable to use the full-length adapter. Reaper will by default only consider the first eighteen bases in the adapter. This number can be changed using the -restrict <int> option.
It is usually unnecessary to trim three prime adapters from long reads, as more often than not adapter contanimation will occur rarely in the raw read files. If adapter trimming is still desirable, there is a concern over false matches. With matching parameters of e.g. 14/2/0 (alignment over 14 bases with two mismatches and no indels allowed) each 14-mer in the adapter can be matched by 91 different 14-mers, and for an 18-base adapter this leads to 455 matching 14-mers. Whilst this is a concern for both short reads and long reads, in long reads the scope for false positives is further amplified. In transcriptomics data this could lead to gaps in transcript assemblies. As a very simple unsophisticated measure reaper now offers the -guard <int> option. This will hide the first <int> read bases from the matching process. This option is off by default.
An important part of reaper is the way in which a read/3p-adapter alignment is judged to resemble a true match. The complicating factor is that only a match to the initial part of the adapter may be present, possibly followed by poly-As or low complexity sequence where the sequencer ran out of sequence. The following reaper options control the matching behaviour:
This restricts the matching algorithm to only consider the first part of the adapter - the argument to this option is the number of bases considered.
This option specifies stringency criteria for the best alignment found between the 3' adapter and the read, anywhere. If the alignment passes the criteria (see below), the read is considered to have adapter sequence at that position.
This option specifies stringency criteria for an alignment that matches the start of the 3' adapter with the end of the read. Additionally, the alignment may occur elsewhere in the read if followed by low complexity sequence. If the alignment passes the criteria (see below), the read is considered to have adapter sequence at that position.
This option specifies the minimum length at which a perfect match at the end of the read to the beginning of the 3' adapter should be stripped. Examples:
Such a match is removed if it is present and has length at least k (the argument to this option). If k is zero no such match is attempted and this is the default setting. For data heavily contaminated with 3' adapter sequence it can be worthwhile to always strip a perfect match (for example using -3p-head-to-tail 1), or otherwise the last bases of cleaned reads will be severely biased towards the bases at the start of the adapter sequence. For such data this effect is worse than the effect of false positive matching, that is, those instances where a head-to-tail match exists and is stripped, and where the matching part is in fact properly part of the biological sequence fragment.
This specifies stringency criteria (see below) for the barcode section of the best alignment between the read and the concatenation 3p-si+barcode+3p-ad in the geometry 3p-bc. All barcodes are tried and the best scoring barcode is accepted. This is different from the 5' barcode case described below. The barcode is currently hardcoded to start either at the first or second base of the adapter.
This specifies stringency criteria (see below) for the alignment between a 5' barcode and the read. All barcodes are tried and the read is accepted only if a single barcode passes these criteria.
This specifies stringency criteria for a match between the read and a sequence insert that should be present at the beginning of the read (after stripping of the 5' barcode if present).
The following options all take a similar form of argument. Where an X is written below, reaper ignores the corresponding parameter due to the availability of a canonical value (e.g. length of barcode). It should ordinarily not be necessary to specify the offset parameter (below given as o).
The default value for o is 2.
The first (l) position is ignored, and the fourth (b) position is interpreted as a bit field (see below).
The default value for o is 10.
The argument to each is of the form l/e/g/o (example: 14/2/0/0), as explained below. Generally speaking, the first, second, third and fourth positions respectively denote parameters related to length, edit distance, gap size, and offset. Exceptions exist, such as the fourth position in the -5p-barcode argument.
It is possible to leave out any trailing part, so l/e or 12/2 is a valid form as well. The meaning of the separate parts are as follows. The context is that of the best local alignment found for a read and an adapter sequence, to decide whether that alignment represents a true match. It is judged to represent a true match if the alignment has a subpart satisfying the following criteria.
The fourth b field in the -5p-barcode option is a bit field. Bit 1 implies that a zip alignment is attempted first between barcode and read. A zip alignment is one where only mismatches are allowed and the beginning of the barcode is aligned precisely to the beginning of the read. This can be useful for short barcodes, where matching barcodes at an offset in the read may too easily lead to false matches. Bit 2 implies that an alignment is attempted where the latter is allowed. In this case the number of offset positions are counted as edits and contribute towards the total number of edits (that will be compared with the e part). It is possible to combine the two approaches. By default the fourth field is set to 2. For short barcodes it can be advisable to set it either to 1 or 3.
For adapter matching and tabu matching the offset position, if utilised, indicates how far from the start of the adapter the match is allowed to occur.
For -5p-sinsert the offset position indicates how far into the read the sequence insert is allowed to extend beyond the length of the sequence insert itself. For very short inserts it is useful to set this to zero or a very small value.
The way reads are processed is dependent on the presence or absence of barcodes and the geometry (i.e. lay-out) of the read.
For geometry no-bc, matching is attempted with up to three different types of oligo fragments, all of which can be optional.
If specified in the 5p-sinsert column or with the -5psi option, a match is attempted between this sequence insert and the start of the read. If no such match is found, the read is by default discarded. If the keep-all option is specified however such reads are truncated to be empty and kept as clean reads.
Subsequently, if specified in the 3p-ad column or with the -3pa option, a match is attempted with that adapter and the read. Matching is attempted in a cascading fashion, as described above under
If a match is present, it is removed. Naturally, a match is not required. Further trimming is attempted as described in section Further read processing. After this, if the read passes the length cutoff (by default no such cutoff is enforced), then the read is checked for contamination by tabu sequences, if any.
With the no-bc geometry reaper can be used either with or without a metadata file. If no metadata file is used the option -3pa is required. This option specifies the 3' adapter. It is still possible to turn off 3' adapter matching by passing either the empty string or a single hyphen as argument to -3pa.
Reaper processing in the no-bc geometry can be fully controlled without a metadata file by using the options -3pa, -tabu and -5psi.
For geometry 3p-bc, a match is attempted with the concatenation 3p-si+barcode+3p-ad, as described above under
A match is required for the read to pass, so that it can be allocated to a barcode bucket. For all barcodes, the full concatenation is aligned and checked against the criteria specified by -3p-global and -3p-prefix, in that order. One of these two checks should succeed for a concatenation to be considered further. Additionally, the barcode section of the concatenated oligonucleotide is separately required to pass the criteria specified by -3p-barcode.
It should be noted that it is possible for multiple barcodes to pass the procedure just described, especially if barcodes are short. The reason is that read errors are increasingly likely towards the end of a read and may occur in the barcode section. The match criteria thus have to allow for a positive edit distance. Hence, alignment scores are tracked and the concatenation (and associated barcode) with the best score is chosen. The case where two or more barcodes are tied is considered a conflict, and none is chosen.
Barcode resolution is followed by further attempted trimming as described in Further read processing. After this, if the read passes the length cutoff (by default no such cutoff is enforced), then tabu sequences are checked, if any.
For geometry 5p-bc a barcode must be found at the beginning of the read. All barcodes are aligned and checked using the criteria specified by the -5p-barcode option. Subsequently, a match with the 5' sequence insert is required if such an insert is specified in the 5p-si column of the metadata file. This match can be controlled using the option 5p-sinsert. Subsequently, matching of the 3' adapter is attempted as described above under
and this is combined with further read trimming as described in Further read processing.
Before and after the check for adapter presence, a number of other criteria are tested, refered to as trimming tests further below. These test may establish that trimming is necessary independent of whether an adapter match is present or not. They are described below, and include checks for low base call quality, occurrence of (too many) N-masked bases, and low-complexity sequence. Each test that is employed can potentially necessitate the removal of a suffix of certain length. If at least one of these tests succeeds, the longest such suffix will be removed.
Low quality sequence can be detected using the median quality value in a sliding window. The quality-based trimming test is defined by the first base where this median value drops below a specified threshold. This test is applied to the full read, bar an initial (adjustable) offset that is skipped. The cutoff value and window length are specified using -qqq-check <cut-off>/<length>. By default the trim cutoff is at the start of the window. Optionally an offset into the window can be specified using -qqq-check <cut-off>/<length>/<offset>. It is possible (and the default approach) to omit an initial part of the read from inspection by using -qqq-check <cut>/<length>/<offset>/<prefix-size>. This is useful for sequencers that require calibration during the first few sequencing cycles. The default value for <prefix-size> is 10.
For -qqq-check the cutoff relates to the raw ascii values found in the file and expressly not to the transformed P-values they represent.
B is a special Illumina quality score indicating the base at that position should not be trusted. Runs of trailing Bs can be detected by specifying either --bcq-early or --bcq-late. With the former option the trimming is applied to the full read (before the adapter is matched), and consequently can lead to stripping of bases that could aid in adapter recognition. With the latter option the trimming test is applied after adapter matching and handled simultaneously with the other trimming tests.
Trailing As can be detected using -polya <count>. This test only succeeds if the poly-A sequence has length at least <count>. It is applied to the full read, before the adapter is matched. It is worthwhile to be aware of the more powerful -dust-suffix and -dust-suffix-late options. These will also remove poly-A stretches interspersed with other bases, and it is possible to make these options exclusively sensitive to A-rich sequences by using for example -dust-suffix 20/A and -dust-suffix-late 20/A. At the recommended default DUST cutoff 20 the smallest poly-A stretch recognised by these options is of length 6. Hence, depending on the data, the -polya option may still be of use.
Bases may not be called and show up as N in the read. This trimming test uses a sliding window to identify whether a read suffix should be removed. Use -nnn-check <count>/<length> to test for the presence of a window of size <length> in which at least <count> Ns occur. The suffix starting at the first such window is the result of this test.
Reaper has two mechanisms to deal with low complexity sequence. The two use very similar principles, as explained below.
The first approach that can be taken is to test for low complexity sequence at the end of a read, before adapter sequence is removed. This is done by computing the same criterion as used by DUST. The read suffix with highest DUST score is considered. The trimming test succeeds and this suffix is used if the score exceeds or is equal to the threshold specified as -dust-suffix <threshold>.
A variant of this, still part of the first mechanism, is to apply this test after initial trimming by other means, such as trimming by adapter, quality, or occurrence of N-masked bases. Sequencing protocols now exist that require this. In some protocols it is possible that a 3' adapter is still preceded by low complexity sequence, and in that case the detection test that starts at the end of the read can not extend beyond the adapter. An adapter-aware test can be specified independently with -dust-suffix-late <threshold> and works identically to -dust-suffix. It is possible and can be useful to use both options simultaneously, and a good value to use is the default DUST threshold 20.
For both these options, -dust-suffix and -dust-suffix-late, it is possible to restrict them to specific types of low-complexity sequence, namely those dominated by a single base. Examples are poly-A stretches interspersed with other bases or poly-T stretches interspersed with other bases. To exclude only such stretches use -dust-suffix 20/A to exlude poly-A dominated sequence or -dust-suffix 20/AT to exclude sequence that is either poly-A dominated or poly-T dominated. Other combinations of bases are allowed. A sequence is recognised as a poly-base low complexity sequence if it satisfies the DUST criterion and at least half of the bases are the same nucleotide and that nucleotide has been specified. If no bases are specified, no such demand is made.
The second mechanism removes the final cleaned read if it does not satisify the threshold specified by -tri <threshold>. It is an accident of history that this criterion is a scaled version of the DUST criterion, applied to the full length of the clean read. It was proposed by Hervé Pagès, and a given DUST threshold (used by -dust-suffix and -dust-suffix-late) must be multiplied by two to obtain the corresponding -tri threshold. A reasonable threshold to choose for this option, if used, is thus -tri 40, corresponding with the default DUST threshold of 20.
It can be useful to refrain from tri-nucleotide filtering so that quality control plots give a comprehensive view. It is possible to output the trinucleotide score alongside the read with reaper (using the formatting directive %T) and filter at a later stage.
After adapter matching and trimming the length of a read may be short. With -clean-length <length> any read shorter than <length> will be discarded.
Again, it can be useful to refrain from length-filtering so that quality control plots give a comprehensive view. Empty cleaned reads generally correspond to adapter polymers, and a peak at this length could correlate with size selection or ligation issues.
The examples below assume a directory called out exists.
Below is the reaper version for this document and the corresponding help output, resulting from reaper --version and reaper -h respectively.
By default reaper expects FASTQ input and writes FASTQ output. A few options exist providing other formats. Additionally it is possible to specify custom input and output format using two simple format specification syntaxes.
FASTA input and output are specified using --fasta-in and --fasta-out. A slightly extended FASTQ format is specified with --fastqx-out. In this format the identifier line is extended with a recno=<NUM> field, where <num> indicates the record offset in the input file. Such a field is necessary for example to re-pair paired-end data after processing with reaper; tally reads this format when equipped with the --fastqx-in option.
The output format specification syntax is specified in the previous section, as part of the reaper help output under -format-clean/lint specification syntax. The input format specification language is given below. It can be accessed on the command line by issuing reaper --record-format.
Do not use -basename foo/bar/ -- then files such as foo/bar/.clean.gz will be created (and not be visible with default ls).
Reaper was written by Stijn van Dongen and benefited greatly from suggestions by Cei Abreu-Goodger, Anton Enright, Mat Davis, Sergei Manakhov, Harpreet Saini, Nenad Bartonicek and Leonor Quintais. For questions and feedback send e-mail to kraken @ ebi . ac . uk.
This page was generated from Aephea macros, http://micans.org/aephea.