April 17, 2015
reaper-15-065
Reaper
Demultiplexing, trimming and filtering sequencing data

Reaper is a program for demultiplexing, trimming and filtering short read sequencing data. It is intended to be suitable for dealing with a wide range of sequencing protocols, and will be updated to support new protocols where the current set of features does not suffice. It has the following features.

One read-at-time, written in C, file compression
Small memory footprint (order of megabytes).
1-4 million reads per minute (depending on read length, contamination check).
Twinned with tally for fast uniquifying of reads.
Can read and write gzip-compressed data.
Read trimming
3' adapter removal.
Sliding window median quality based trimming (-qqq-check).
Low complexity tail removal with DUST-type criterion (-dust-suffix and -dust-suffix-late).
Killer-B scrubbing (--bcq-early and --bcq-late).
Cluster-N cleansing (-nnn-check).
Each read is matched with the three prime adapter
Smith-Waterman local alignment over full read and adapter.
Read/adapter match is user controlled (-3p-global).
Handling of read-suffix adapter-prefix matching followed by read low complexity (-3p-prefix and -3p-head-to-tail).
Flexible input and output
By default the input is assumed to be in FASTQ format.
A simple built-in parser allows a variety of input file formats (-record-format and -record-format).
Reaper transparently reads input that was compressed by gzip.
Reaper writes gzip output by default and this can be overridden (--nozip).
A simple specification language allows flexible output file formats (-format-clean and -format-lint).
Rich selection of output field for comprehensive inspection.
Barcode filtering
Output split into per-barcode clean files and annotated discard files.
Recognise different layouts of barcodes and sequence inserts (-geom).
Mismatches can be controlled (-3p-barcode, -5p-barcode, -5p-sinsert).
Quality checks
Outputs per-barcode summary statistics for comprehensive QC plots.
Per-read characteristics available (e.g. alignment, trinucleotide score).
Reads can be checked for ligation contamination
Smith-Waterman local alignment.
Multiple checks for different oligos are possible.
Generically supported as match requirements for tabu sequences (-mr-tabu).
Paired end data
It can work in conjuction with tally to process paired end files.
This requires encoding the record offset in the output using the %J output syntax.

Choose the appropriate geometry for your data: one of no-bc (no barcode), 5p-bc (5' barcode) or 3p-bc (3' barcode). This will be the argument to the -geom option. The input file, normally (but not necessarily) a FASTQ file, is specified by giving its name as argument to the -i option.

If your files are barcoded, find out what the barcodes are.

If possible, try to find out the 3' adapter sequence and the 5' adapter sequence that were used. The former will be stripped from the read if present. The latter, if specified, is treated as a tabu sequence. A read will be discarded if it is found to contain any tabu sequence.

If you do not know the 3' adapter it is possible to search for it in the input files with the program minion, that is shipped with reaper. For a FASTQ file called input.fq.gz usage is as follows:

minion search-adapter -i input.fq.gz

By default minion will take the first 2 million reads. Use the -do option to change this.

Create a metadata file describing barcodes and adapter sequences. The required columns in this file depend on the geometry, as described in Read geometry. A line must be present for each barcode. The 3' adapter must be specified in the 3p-ad column. If you do not know the adapter, supply either an empty field or a single hyphen. The tabu column is similarly required. Use either the 5' adapter if known, or supply an empty field or a single hyphen.

The metadata file must be tab-separated with all fields present as described in Read geometry, and each row must have the same number of fields. Neither missing fields nor spurious fields are allowed. If a field is allowed to be empty it can be achieved either by leaving the field empty, or by using a single hyphen. Below is a list of example metadata files, one for each geometry. Note that copying and pasting such an example will yield a file with spaces rather than tab characters, and this would need fixing.

5' barcode example metadata file (geometry 5p-bc)

barcode 3p-ad tabu 5p-si ACTA ATCTCGTATGCCGTCTTCTGCTTG GTTCAGAGTTCTACAGTCCGACGATC - ACAA ATCTCGTATGCCGTCTTCTGCTTG GTTCAGAGTTCTACAGTCCGACGATC - CTAA ATCTCGTATGCCGTCTTCTGCTTG GTTCAGAGTTCTACAGTCCGACGATC -

In this example no sequence inserts were used, but the 5p-si (5' sequence insert) column is required to be present. All its fields are empty, as indicated by the use of a single hyphen. It can be further be noted that the only variable column is that containing the barcodes, as is to be expected.

3' barcode example metadata file (geometry 3p-bc)

3p-si 3p-ad barcode tabu T TCGTATGCCGTCTTCTGCTTGT ACCT GTTCAGAGTTCTACAGTCCGACGATC T TCGTATGCCGTCTTCTGCTTGT AGCA GTTCAGAGTTCTACAGTCCGACGATC T TCGTATGCCGTCTTCTGCTTGT AGGT GTTCAGAGTTCTACAGTCCGACGATC T TCGTATGCCGTCTTCTGCTTGT CAAG GTTCAGAGTTCTACAGTCCGACGATC

The column headers (and hence associated columns) do not need to be in a particular order.

no barcode example metadata file (geometry no-bc)

3p-ad tabu TCGTATGCCGTCTTCTGCTTGT GTTCAGAGTTCTACAGTCCGACGATC

If the reads are barcoded, you will have created a metadata file. Suppose the geometry is 5p-bc, the FASTQ input is in a file called sample.fastq and the metadata file is called sample.txt. The basic reaper command line invocation then is:

reaper -i sample.fastq -meta sample.txt -geom 5p-bc

or

reaper -i sample.fastq -meta sample.txt -geom 5p-bc

Note that reaper places no requirements on file name suffixes, so both data file and metadata file may have arbitrary names. Additionally, the file containing the reads is allowed to be gzipped.

If the reads are not barcoded, it is possible to run reaper with or without a metadata file. The metadata file requires just the columns 3p-ad and tabu. Again, unknown fields can be specified by leaving the field empty or specifying a single hyphen. The invocation is similar as before, with just the -geom option changed:

reaper -i sample.fastq -meta sample.txt -geom no-bc

In the absence of barcodes, it is possible to run reaper without a metadata file. In this case the options -3pa and -tabu are required. If the corresponding sequences are not known, specify an empty string again either as "" or as a single hyphen.

reaper -i sample.fastq -geom no-bc -3pa TCGTATGCCGTCTTCTGCTTGT -tabu GTTCAGAGTTCTACAGTCCGACGATC

By default output files are created that share the prefix out. Use the -basename option to change this prefix. The supplied prefix may contain a path (i.e. directory components), but the path must already exist as reaper will not create it.

Three geometries are supported. A geometry is a description of what a read looks like, i.e. the read design. The supported geometries are described in greater detail below.

In the read representation below, a part enclosed in () indicates that the part must be present in the read, a part enclosed in [] is optional. A part enclosed in <> should not be present. This is unique to the 5' adapter part of a read, as the sequencing primer anneals to the 5' adapter, and hence the latter should not be seen in the raw read. Currently we have no means to strip a read of a such a match and proceed. The occurrence of this phenomenon and the quality of the remaining read should be quantified and any prospected gain evaluated before such a scheme is implemented.

| < - - - - - - - - - - - - Read geometry - - - - - - - - - - - - - -> | Geometry | <---------- 5p region ---------> Message <---- 3p region ----> ---------+---------------------------------------------------------------------- no-bc | < 5p-ad > M [ 3p-ad ] 3p-bc | < 5p-ad > M ( 3p-si barcode 3p-ad ) 5p-bc | < 5p-ad > ( barcode ) [( 5p-si )] M [ 3p-ad ]

The 5' sequence insert is special. If such a sequence insert is present in the read design, (a.k.a. geometry), it should be present in all reads. For simplicity, the corresponding 5p-si column is always required; if no sequence insert was used the fields in this column can be left empty. In the meta file, the following columns are required by each of the geometries, although some of them may be empty.

Geometry tabu barcode 5p-si 3p-si 3p-ad ---------| ----+-------+-----+-----+-----+ no-bc | * | | | | * | 3p-bc | * | * | | * | * | 5p-bc | * | * | * | | * |

This data is used as follows, for each read.

The barcode column

If present, the meta file will normally have more than one data row, each row representing a different barcode, and the barcode column will normally be the only column that assumes different values.

If absent, the meta file should have only a single data row. For the 3p-bc geometry, reaper constructs the concatenation of the sequences in columns 3p-si, barcode, and 3p-ad (note: 3p-si may be empty). It does this for all different barcodes, and each such concatenation is aligned to the read. The best match, if it passes the alignment requirements, determines the barcode bucket to which the read is allocated.

For the 5p-bc geometry, a matching barcode at the moment is required to be present at the beginning of the read. Mismatches may occur and the extent and types of allowable mismatches are under control of the user. Conceivably people might start to add random sequence inserts before their barcodes; this is not yet supported.

The tabu column

For all geometries the tabu column is treated the same. It may be empty, or it may contain one stretch of sequence, or a list of comma-separated sequences. If a sequence is specified, it will be aligned to the read. If a match is found that exceeds the requirements as specified by the program option -mr-tabu, the read is discarded. There is currently no provision to simply remove the matching part. We currently suggest to supply the 5' adapter sequence, if known, as a tabu sequence.

The 5p-si and 3p-si columns

The 3p-si column is only relevant for the 3p-bc geometry. It is part of the concatenated sequence constructed and aligned as described above.

For the 5p-si column, if present, an alignment is attempted against the leading part of the read. This alignment should pass the -5p-sinsert alignment criteria in order for the read to pass.

When aligning reads with bits of sequence (adapters, barcodes and inserts), reaper computes a full local alignment matrix for the two sequences, sometimes on a restricted part of one of the two sequences to be aligned. The alignment matrix is used in various ways.

For checking matches with the 3' adapter and tabu contamination, the highest scoring local alignment is considered. Neither read nor sequence to be matched is restricted.

Additionally, for the 3' adapter matches at the end of the read and the start of the adapter are considered.

For a 5' barcode and for a 5' sequence insert the aligment matrix is restricted to an initial part of the read. The length of this part is equal to the length of the sequence to match plus the maximal allowed edit distance.

Alignment acceptance tests A fundamental test used by Reaper is whether an alignment between two sequences is considered to be a true match. These criteria are to a very large extent under control of the user. The test is generically described as follows. Two sequences match if the highest scoring alignment satisfies:

The alignment stretches over at least L aligned bases.

Along this stretch at most E edits (substitutions or gaps) are found.

Along this stretch at most G gaps are found.

This is described in greater detail in the sections below. Importantly, if no gaps are allowed, the alignment algorithm will not generate alignments with gaps.

Use only the start of the adapter Three prime adapter contanimation almost always involves the leading part of the adapter, and using the full adapter sequence increases the chance of spurious matches. To this end, it is not advisable to use the full-length adapter. Reaper will by default only consider the first eighteen bases in the adapter. This number can be changed using the -restrict <int> option.

Long reads It is usually unnecessary to trim three prime adapters from long reads, as more often than not adapter contanimation will occur rarely in the raw read files. If adapter trimming is still desirable, there is a concern over false matches. With matching parameters of e.g. 14/2/0 (alignment over 14 bases with two mismatches and no indels allowed) each 14-mer in the adapter can be matched by 91 different 14-mers, and for an 18-base adapter this leads to 455 matching 14-mers. Whilst this is a concern for both short reads and long reads, in long reads the scope for false positives is further amplified. In transcriptomics data this could lead to gaps in transcript assemblies. As a very simple unsophisticated measure reaper now offers the -guard <int> option. This will hide the first <int> read bases from the matching process. This option is off by default.

An important part of reaper is the way in which a read/3p-adapter alignment is judged to resemble a true match. The complicating factor is that only a match to the initial part of the adapter may be present, possibly followed by poly-As or low complexity sequence where the sequencer ran out of sequence. The following reaper options control the matching behaviour:

-restrict

This restricts the matching algorithm to only consider the first part of the adapter - the argument to this option is the number of bases considered.

-3p-global

This option specifies stringency criteria for the best alignment found between the 3' adapter and the read, anywhere. If the alignment passes the criteria (see below), the read is considered to have adapter sequence at that position.

-3p-prefix

This option specifies stringency criteria for an alignment that matches the start of the 3' adapter with the end of the read. Additionally, the alignment may occur elsewhere in the read if followed by low complexity sequence. If the alignment passes the criteria (see below), the read is considered to have adapter sequence at that position.

-3p-head-to-tail

This option specifies the minimum length at which a perfect match at the end of the read to the beginning of the 3' adapter should be stripped. Examples:

ACGTACGATGCTAGCTAGCTAGCTCGCGCTATCTTCTA ACGTACGATGCTAGCTAGCTAGCTCGCGCTATCTTCT| ACGTACGATGCTAGCTAGCTAGCTCGCGCTATCTTC|| ||| CTAGTCGTAC

Such a match is removed if it is present and has length at least k (the argument to this option). If k is zero no such match is attempted and this is the default setting. For data heavily contaminated with 3' adapter sequence it can be worthwhile to always strip a perfect match (for example using -3p-head-to-tail 1), or otherwise the last bases of cleaned reads will be severely biased towards the bases at the start of the adapter sequence. For such data this effect is worse than the effect of false positive matching, that is, those instances where a head-to-tail match exists and is stripped, and where the matching part is in fact properly part of the biological sequence fragment.

-3p-barcode

This specifies stringency criteria (see below) for the barcode section of the best alignment between the read and the concatenation 3p-si+barcode+3p-ad in the geometry 3p-bc. All barcodes are tried and the best scoring barcode is accepted. This is different from the 5' barcode case described below. The barcode is currently hardcoded to start either at the first or second base of the adapter.

-5p-barcode

This specifies stringency criteria (see below) for the alignment between a 5' barcode and the read. All barcodes are tried and the read is accepted only if a single barcode passes these criteria.

-5p-sinsert

This specifies stringency criteria for a match between the read and a sequence insert that should be present at the beginning of the read (after stripping of the 5' barcode if present).

The following options all take a similar form of argument. Where an X is written below, reaper ignores the corresponding parameter due to the availability of a canonical value (e.g. length of barcode). It should ordinarily not be necessary to specify the offset parameter (below given as o).

-3p-global l/e/g/o
-3p-prefix l/e/g/o

The default value for o is 2.

-3p-barcode X/e/g
-5p-barcode X/e/g/b

The first (l) position is ignored, and the fourth (b) position is interpreted as a bit field (see below).

-5p-sinsert X/e/g/o

The default value for o is 10.

-mr-tabu l/e/g/o

The argument to each is of the form l/e/g/o (example: 14/2/0/0), as explained below. Generally speaking, the first, second, third and fourth positions respectively denote parameters related to length, edit distance, gap size, and offset. Exceptions exist, such as the fourth position in the -5p-barcode argument.

It is possible to leave out any trailing part, so l/e or 12/2 is a valid form as well. The meaning of the separate parts are as follows. The context is that of the best local alignment found for a read and an adapter sequence, to decide whether that alignment represents a true match. It is judged to represent a true match if the alignment has a subpart satisfying the following criteria.

 l
The subpart stretches over at least l bases.
 e
There are no more than e edits in the subpart.
 g
The total gap length in the subpart does not exceed g.
 o
The match occurs within an offset of o bases. The exact meaning depends on the part being matched (see below). A zero value implies on offset requirement at all.

The fourth b field in the -5p-barcode option is a bit field. Bit 1 implies that a zip alignment is attempted first between barcode and read. A zip alignment is one where only mismatches are allowed and the beginning of the barcode is aligned precisely to the beginning of the read. This can be useful for short barcodes, where matching barcodes at an offset in the read may too easily lead to false matches. Bit 2 implies that an alignment is attempted where the latter is allowed. In this case the number of offset positions are counted as edits and contribute towards the total number of edits (that will be compared with the e part). It is possible to combine the two approaches. By default the fourth field is set to 2. For short barcodes it can be advisable to set it either to 1 or 3.

For adapter matching and tabu matching the offset position, if utilised, indicates how far from the start of the adapter the match is allowed to occur.

For -5p-sinsert the offset position indicates how far into the read the sequence insert is allowed to extend beyond the length of the sequence insert itself. For very short inserts it is useful to set this to zero or a very small value.

The way reads are processed is dependent on the presence or absence of barcodes and the geometry (i.e. lay-out) of the read.

For geometry no-bc, matching is attempted with up to three different types of oligo fragments, all of which can be optional.

If specified in the 5p-sinsert column or with the -5psi option, a match is attempted between this sequence insert and the start of the read. If no such match is found, the read is by default discarded. If the keep-all option is specified however such reads are truncated to be empty and kept as clean reads.

Subsequently, if specified in the 3p-ad column or with the -3pa option, a match is attempted with that adapter and the read. Matching is attempted in a cascading fashion, as described above under

-3p-global
-3p-prefix
-3p-head-to-tail

If a match is present, it is removed. Naturally, a match is not required. Further trimming is attempted as described in section Further read processing. After this, if the read passes the length cutoff (by default no such cutoff is enforced), then the read is checked for contamination by tabu sequences, if any.

With the no-bc geometry reaper can be used either with or without a metadata file. If no metadata file is used the option -3pa is required. This option specifies the 3' adapter. It is still possible to turn off 3' adapter matching by passing either the empty string or a single hyphen as argument to -3pa.

Reaper processing in the no-bc geometry can be fully controlled without a metadata file by using the options -3pa, -tabu and -5psi.

For geometry 3p-bc, a match is attempted with the concatenation 3p-si+barcode+3p-ad, as described above under

-3p-global
-3p-prefix
-3p-barcode

A match is required for the read to pass, so that it can be allocated to a barcode bucket. For all barcodes, the full concatenation is aligned and checked against the criteria specified by -3p-global and -3p-prefix, in that order. One of these two checks should succeed for a concatenation to be considered further. Additionally, the barcode section of the concatenated oligonucleotide is separately required to pass the criteria specified by -3p-barcode.

It should be noted that it is possible for multiple barcodes to pass the procedure just described, especially if barcodes are short. The reason is that read errors are increasingly likely towards the end of a read and may occur in the barcode section. The match criteria thus have to allow for a positive edit distance. Hence, alignment scores are tracked and the concatenation (and associated barcode) with the best score is chosen. The case where two or more barcodes are tied is considered a conflict, and none is chosen.

Barcode resolution is followed by further attempted trimming as described in Further read processing. After this, if the read passes the length cutoff (by default no such cutoff is enforced), then tabu sequences are checked, if any.

For geometry 5p-bc a barcode must be found at the beginning of the read. All barcodes are aligned and checked using the criteria specified by the -5p-barcode option. Subsequently, a match with the 5' sequence insert is required if such an insert is specified in the 5p-si column of the metadata file. This match can be controlled using the option 5p-sinsert. Subsequently, matching of the 3' adapter is attempted as described above under

-3p-global
-3p-prefix
-3p-head-to-tail

and this is combined with further read trimming as described in Further read processing.

Before and after the check for adapter presence, a number of other criteria are tested, refered to as trimming tests further below. These test may establish that trimming is necessary independent of whether an adapter match is present or not. They are described below, and include checks for low base call quality, occurrence of (too many) N-masked bases, and low-complexity sequence. Each test that is employed can potentially necessitate the removal of a suffix of certain length. If at least one of these tests succeeds, the longest such suffix will be removed.

Low quality sequence can be detected using the median quality value in a sliding window. The quality-based trimming test is defined by the first base where this median value drops below a specified threshold. This test is applied to the full read, bar an initial (adjustable) offset that is skipped. The cutoff value and window length are specified using -qqq-check <cut-off>/<length>. By default the trim cutoff is at the start of the window. Optionally an offset into the window can be specified using -qqq-check <cut-off>/<length>/<offset>. It is possible (and the default approach) to omit an initial part of the read from inspection by using -qqq-check <cut>/<length>/<offset>/<prefix-size>. This is useful for sequencers that require calibration during the first few sequencing cycles. The default value for <prefix-size> is 10.

For -qqq-check the cutoff relates to the raw ascii values found in the file and expressly not to the transformed P-values they represent.

B is a special Illumina quality score indicating the base at that position should not be trusted. Runs of trailing Bs can be detected by specifying either --bcq-early or --bcq-late. With the former option the trimming is applied to the full read (before the adapter is matched), and consequently can lead to stripping of bases that could aid in adapter recognition. With the latter option the trimming test is applied after adapter matching and handled simultaneously with the other trimming tests.

Trailing As can be detected using -polya <count>. This test only succeeds if the poly-A sequence has length at least <count>. It is applied to the full read, before the adapter is matched. It is worthwhile to be aware of the more powerful -dust-suffix and -dust-suffix-late options. These will also remove poly-A stretches interspersed with other bases, and it is possible to make these options exclusively sensitive to A-rich sequences by using for example -dust-suffix 20/A and -dust-suffix-late 20/A. At the recommended default DUST cutoff 20 the smallest poly-A stretch recognised by these options is of length 6. Hence, depending on the data, the -polya option may still be of use.

Bases may not be called and show up as N in the read. This trimming test uses a sliding window to identify whether a read suffix should be removed. Use -nnn-check <count>/<length> to test for the presence of a window of size <length> in which at least <count> Ns occur. The suffix starting at the first such window is the result of this test.

Reaper has two mechanisms to deal with low complexity sequence. The two use very similar principles, as explained below.

The first approach that can be taken is to test for low complexity sequence at the end of a read, before adapter sequence is removed. This is done by computing the same criterion as used by DUST. The read suffix with highest DUST score is considered. The trimming test succeeds and this suffix is used if the score exceeds or is equal to the threshold specified as -dust-suffix <threshold>.

A variant of this, still part of the first mechanism, is to apply this test after initial trimming by other means, such as trimming by adapter, quality, or occurrence of N-masked bases. Sequencing protocols now exist that require this. In some protocols it is possible that a 3' adapter is still preceded by low complexity sequence, and in that case the detection test that starts at the end of the read can not extend beyond the adapter. An adapter-aware test can be specified independently with -dust-suffix-late <threshold> and works identically to -dust-suffix. It is possible and can be useful to use both options simultaneously, and a good value to use is the default DUST threshold 20.

For both these options, -dust-suffix and -dust-suffix-late, it is possible to restrict them to specific types of low-complexity sequence, namely those dominated by a single base. Examples are poly-A stretches interspersed with other bases or poly-T stretches interspersed with other bases. To exclude only such stretches use -dust-suffix 20/A to exlude poly-A dominated sequence or -dust-suffix 20/AT to exclude sequence that is either poly-A dominated or poly-T dominated. Other combinations of bases are allowed. A sequence is recognised as a poly-base low complexity sequence if it satisfies the DUST criterion and at least half of the bases are the same nucleotide and that nucleotide has been specified. If no bases are specified, no such demand is made.

The second mechanism removes the final cleaned read if it does not satisify the threshold specified by -tri <threshold>. It is an accident of history that this criterion is a scaled version of the DUST criterion, applied to the full length of the clean read. It was proposed by Hervé Pagès, and a given DUST threshold (used by -dust-suffix and -dust-suffix-late) must be multiplied by two to obtain the corresponding -tri threshold. A reasonable threshold to choose for this option, if used, is thus -tri 40, corresponding with the default DUST threshold of 20.

It can be useful to refrain from tri-nucleotide filtering so that quality control plots give a comprehensive view. It is possible to output the trinucleotide score alongside the read with reaper (using the formatting directive %T) and filter at a later stage.

After adapter matching and trimming the length of a read may be short. With -clean-length <length> any read shorter than <length> will be discarded.

Again, it can be useful to refrain from length-filtering so that quality control plots give a comprehensive view. Empty cleaned reads generally correspond to adapter polymers, and a peak at this length could correlate with size selection or ligation issues.

The examples below assume a directory called out exists.

Baseline reaper invocation, 5' barcode.
reaper -geom 5p-bc -meta ex5pbc.meta -i ex5pbc.fastq.gz -basename out/ex5pbc
Baseline reaper invocation, 3' barcode.
reaper -geom 3p-bc -meta ex3pbc.meta -i ex3pbc.fastq.gz -basename out/ex3pbc
Baseline reaper invocation, no barcode, using metadata file.
reaper -geom no-bc -meta exnobc.meta -i exnobc.fastq.gz -basename out/nobcx3
Baseline reaper invocation, no barcode.
reaper -geom no-bc -i exnobc.fastq.gz -basename out/exnobc -3pa TCGTATGCCGTCTTCTGCTTGT -tabu GTTCAGAGTTCTACAGTCCGACGATC
Remove any read containing one or more Ns, do not produce QC files.
reaper -geom no-bc -i ex.fastq.gz --noqc --full-length -basename out/ex -nnn-check 1/1 -3pa ""
Reaper help — see also the following section.
reaper -h

Below is the reaper version for this document and the corresponding help output, resulting from reaper --version and reaper -h respectively.

Reaper version: reaper-15-065 Required options -geom <mode> mode in {no-bc, 3p-bc, 5p-bc} -meta <fname> file with geometry-dependent format. Required columns: Geometry Columns: no-bc 3p-ad - - - tabu 3p-bc 3p-ad barcode 3p-si - tabu 5p-bc 3p-ad barcode - 5p-si tabu bc=barcode, ad=adaptor, si=sequence insert Columns 3p-si, 5p-si, 3p-ad and tabu may all be empty Alternatively, to express absence, a single hyphen may be used Important options -i <fname> input stream (gzipped file allowed) (default STDIN) -clean-length <int> minimum allowed clean length (default 0) -guard <int> protect first <int> bases in read from adapter and tabu matching -restrict <int> only use the first <int> bases of adapter or tabu sequence (default 18) This is to avoid false positive matches -tri <threshold> filter out reads with tri-nt score > threshold a reasonable <threshold> is 35 -qqq-check <val>/<winlen> cut sequence when median quality drops below <val> -qqq-check <val>/<winlen>/<winofs> as above, cut at <winofs> (default 0) -qqq-check <val>/<winlen>/<winofs>/<readofs> as above, start at <readofs> -dust-suffix <threshold> dust theshold for read suffix (default 0, suggested 20) -nnn-check <count>/<outof> (default 0/0) disregard read onwards from seeing <count> N's in <outof> bases Alignment options Options to specify when part of an alignment triggers a match: -3p-global l/e[/g[/o]] (default 14/2/1/0) -3p-prefix l/e[/g[/o]] (default 8/2/0/2) -3p-barcode l/e[/g[/o]] (default 0/6/1/0) -5p-barcode l/e[/g[/o]] (default 0/0/0/2) -5p-sinsert l/e[/g[/o]] (default 0/2/1/10) -mr-tabu l/e[/g[/o]] (default 16/2/1/0) -3p-head-to-tail l minimal trailing perfect match length (default 0) syntax used in the above: l <int> minimum length required to count sub-alignment as match e <int> maximum allowed edit distance g <int> [optional, not active when set to 0] maximum allowed number of gaps o <int> [optional, not active when set to 0] offset: o= 5 requires alignment to start in the first five bases of adaptor o=-5 requires alignment to end in the last five bases of adaptor -swp M/S/G match/substitution/gap gain/cost/cost (default 4/1/3) Input/output options --fasta-in read FASTA input -record-format <format string> (record description, default @%I%A%n%R%n+%#%Q%n) [ -record-format syntax is output when supplying --record-format ] -record-format2 <format string> (simple line formats, one field per line): R read I read identifier Q quality scores D discard field -basename <pfx> pfx.lint.gz, pfx.clean.gz pfx.report etc will be constructed -format-clean <format string> (output for clean reads) -format-lint <format string> (output for filtered reads) -format-clean/lint specification syntax: %R read %C clean read %Z clean read padded with Ns if necessary %V reverse complement of clean read %I read identifier %Q clean or input read quality (for clean / lint file respectively) %X read count (only applicable if -record-format is used) %Y input read quality %q<c> clean input read quality padded with character <c> %A annotation field %L clean read length %M message describing cause for filtering (lint file) %T trinucleotide complexity score (clean/lint file) %U dUst sUffix complexity information %3 best read/3p-adaptor alignment %= alignment characteristics mt=matchtype sc=suffix-complexity ht=head-tail-match nn=N-match-offset bb=B-match-offset aa=Polya-offset qq=Quality-trim-offset %n newline %J record offset, unique for each read. Use to match paired-end reads %f fastq line number based on standard fastq format %t tab %% percent sign Anything else is copied verbatim -debug [acl]+ a=alignments c=clean l=lint -sample c/l if debug, sample every c/l clean/lint read --nozip do not output gzipped files --noqc do not output quality report files Miscellaneous options --bcq-early perform early 'B' quality filtering (when reading sequences) --bcq-late perform late 'B' quality filtering (before outputting sequences) --full-length only allow reads not shortened in any filter step --keep-all delete rather than discard reads (e.g. tabu match, missing 5p-sinsert) -trim-length <int> cut reads back to length <int> -polya <int> remove trailing A's if length exceeds <int> -sc-max <f> threshold for complexity of suffix following prefix match (default 0.25) Options for use when running reaper with -geom no-bc -3pa <three prime adaptor> -5psi <five prime sequence insert> -tabu <tabu sequence>

By default reaper expects FASTQ input and writes FASTQ output. A few options exist providing other formats. Additionally it is possible to specify custom input and output format using two simple format specification syntaxes.

FASTA input and output are specified using --fasta-in and --fasta-out. A slightly extended FASTQ format is specified with --fastqx-out. In this format the identifier line is extended with a recno=<NUM> field, where <num> indicates the record offset in the input file. Such a field is necessary for example to re-pair paired-end data after processing with reaper; tally reads this format when equipped with the --fastqx-in option.

The output format specification syntax is specified in the previous section, as part of the reaper help output under -format-clean/lint specification syntax. The input format specification language is given below. It can be accessed on the command line by issuing reaper --record-format.

-record-format <format string> (extended record description syntax) %R expect read (longest sequence found over [a-zA-Z]*) - empty read allowed %I expect identifier (longest sequence of non-blank) %A expect overflow annotation field (initial blanks skipped, rest of line) %a use previous %F or %G or %H field as annotation %B expect annotation field (longest sequence of non-blank) %Q expect quality (longest sequence of non-blank) %X expect count (a nonnegative integer number) %F expect and discard field (longest sequence of non-tab) %G expect and discard field (longest sequence of non-blank) %Hx (x is a placeholder) expect and discard up to x %# discard everything until end of line %b expect run of blanks (space or tab) %n expect end of line match %. expect and discard any character %s expect a space %t expect a tab %% expect a percent sign Anything else requires a literal match, see the FASTA example just below The string '>%I%#%R%n' will parse FASTA-ish input (but use --fasta-in instead) The string '%F%t%R%#' retrieves the second field from each line

Do not use -basename foo/bar/ -- then files such as foo/bar/.clean.gz will be created (and not be visible with default ls).

Reaper was written by Stijn van Dongen and benefited greatly from suggestions by Cei Abreu-Goodger, Anton Enright, Mat Davis, Sergei Manakhov, Harpreet Saini, Nenad Bartonicek and Leonor Quintais. For questions and feedback send e-mail to  kraken @ ebi . ac . uk.

This page was generated from Aephea macros, http://micans.org/aephea.