April 17, 2015
Deduplication of sequence fragments

Tally is a program for deduplicating sequence fragments. It minimises memory usage by compressing sequences and using compact memory allocation techniques. A built-in parser allows a variety of input file formats and a simple specification language allows flexible output file formats. It can be made aware of paired-end reads, and it can handle degenerate sequence inserts intended to reveal amplification biases. Tally comes with reaper, a program for demultiplexing, trimming and filtering short read sequencing data.

In its simplest form, tally is used as follows. The input file name in this example (the argument to the -i option) is the default reaper output name.

tally -i out.lane.clean.gz -o out.lane.unique.gz

By default tally writes gzipped files. It is possible to prevent this using the --nozip option.

tally -i out.lane.clean.gz -o out.lane.unique --nozip

By default tally expects FASTQ format. Other formats can be specified using the -record-format option (refer to Examples). When processing paired-end files with reaper and tally, the easiest approach is to use the reaper --fastqx-out option and the tally --fastqx-in option.

It is possible to retain quality information by supplying the --with-quality option. Tally will then keep track of read quality; for redundant (identical) reads it will, for a given base, keep the best quality score for that base among the reads.
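The per-base rule can be illustrated with a minimal Python sketch. It is not tally's implementation; the function name merge_quality is hypothetical, and the sketch assumes the usual FASTQ convention that a higher ASCII code means a higher quality score.

```python
def merge_quality(qual_a: str, qual_b: str) -> str:
    """Return the per-base best quality for two identical reads."""
    if len(qual_a) != len(qual_b):
        raise ValueError("identical reads must have quality strings of equal length")
    # Assumption: a higher ASCII code encodes a higher quality score.
    return "".join(max(a, b) for a, b in zip(qual_a, qual_b))


# Two observations of the same 4-base read with different qualities.
print(merge_quality("II5I", "5III"))  # IIII
```

Merging more than two observations amounts to applying the same function repeatedly, one observation at a time.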

Note that retaining quality increases the memory requirements approximately fivefold. The increase arises because quality scores are stored in raw form, whereas sequences are stored in compressed form; tally is therefore very memory-efficient when quality does not need to be tracked.

By default tally will try to set parameters automatically by inspecting the input file. This requires that the input file is seekable, i.e. not streamed via a pipe. This automatic behaviour can be turned off using the option --no-auto. It is possible to obtain a rough estimate of memory usage using --peek.

Tally can read paired end samples. Refer to sections Paired end read processing and Examples.

Tally can tally already counted reads, meaning that it can read in counts associated with reads. This requires usage of the -record-format option and the %X directive (see Examples).

Tally can read a variety of formats. The input syntax can be described with the -record-format option, accepting a syntax nearly identical to the identically named reaper option. It should be noted that tally assigns meaning to very few fields. These are the read itself (%R), a count associated with the read (%X), a record offset (%J — required when processing paired files) and a string identifier for paired-end safety checking (%I — refer to Paired end read processing).

Tally can handle degenerate sequence inserts. The -dsi <num> option causes tally to strip the first <num> bases from the read upon output. The output may thus still contain duplicated fragments; these are informative for amplification biases.
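The following Python sketch shows why duplicates can remain in the output. It is an illustration only: the function name tally_with_dsi is hypothetical, and it assumes that reads are deduplicated on the full sequence (insert included) and that the insert is removed only when records are written.

```python
from collections import Counter


def tally_with_dsi(reads, dsi):
    """Count identical full reads, then strip the first `dsi` bases on output."""
    counts = Counter(reads)  # assumption: deduplicate with the insert included
    return [(read[dsi:], n) for read, n in counts.items()]


# Same fragment TTGG, preceded by two different 3-base degenerate inserts.
reads = ["AAC" + "TTGG", "AAC" + "TTGG", "GTA" + "TTGG"]
for fragment, count in tally_with_dsi(reads, 3):
    print(fragment, count)
# TTGG 2
# TTGG 1
```

The fragment TTGG appears twice in the output because it was observed with two distinct inserts; the first line counts two reads that shared both the insert and the fragment.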

Tally generates summary statistics on the number of reads read and produced. By default these are sent to the diagnostic stream. It is possible to redirect them to file by using the -sumstat <file-name> option.

Tally supports paired end read processing in the following two scenarios. In the simplest case it is assumed that two sample files with the same number of reads are provided. Reads at identical offsets in the respective files correspond to paired ends. In this scenario the user need only specify the extra input file with the -j option and the extra output file with the -p option.

In the more involved case the paired files are processed with reaper or another program, and after processing the implicit record-by-record or line-by-line correspondence between the files may be lost. It is then necessary that record offset information is attached to the records; tally uses this information to pair up records. The reaper program can be instructed to include offset information with the --fastqx-out option, and tally can be instructed to read this format with the --fastqx-in option. Alternatively, custom formats can be created with the reaper output option -format-clean and the tally input option -record-format. These must match and must both use the record-offset directive %J. Record offset numbers are used as identifiers because a numerical, increasing identifier makes it possible to pair reads from different files efficiently.
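The efficiency argument can be sketched in Python as a single-pass merge over two streams sorted by offset. This is an illustration under stated assumptions, not tally's code: the function name pair_by_offset is hypothetical, and dropping records whose mate is missing is a choice made for this sketch, since the text above does not say how such records are treated.

```python
def pair_by_offset(first, second):
    """Yield (offset, rec_a, rec_b) for offsets present in both inputs.

    Both inputs are iterables of (offset, record) with increasing offsets.
    """
    it_a, it_b = iter(first), iter(second)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a[0] == b[0]:
            yield a[0], a[1], b[1]
            a, b = next(it_a, None), next(it_b, None)
        elif a[0] < b[0]:
            a = next(it_a, None)  # mate missing in second file; skipped here
        else:
            b = next(it_b, None)  # mate missing in first file; skipped here


file1 = [(1, "AC"), (2, "GG"), (4, "TT")]
file2 = [(2, "CC"), (3, "AA"), (4, "GA")]
print(list(pair_by_offset(file1, file2)))
# [(2, 'GG', 'CC'), (4, 'TT', 'GA')]
```

Because the offsets only ever increase, each stream is read once and no lookup table of identifiers has to be held in memory.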

If tally is instructed to parse FASTQ identifiers (this is the case when using --fastqx-in for example) it will check whether identifiers are identical for paired reads. The number of non-matching identifiers is reported after processing.

Paired end processing is enabled by supplying tally with two input file names as arguments to the -i and -j options. When processing paired end reads, output can be sent to a single file or to two files. To send output to a single file, use the -o file output option together with the -format option, using the %A and %B directives for the first and second fragment respectively. To send output to two files, additionally use the -p file output option. In this case, the %R directive becomes context aware: it refers to the first fragment when outputting to the first file (specified by -o) and to the second fragment when outputting to the second file (specified by -p).

Baseline tally invocation, FASTQ input and output
tally -i out.lane.clean.gz -o out.lane.unique.gz

The first three records will look similar to this:

@trn_1 31874
TAGCTTATCAGACTGATGTTGAC
+
~~~~~~~~~~~~~~~~~~~~~~~
@trn_2 26764
ACTCAAACTGGGGGCTCTTTT
+
~~~~~~~~~~~~~~~~~~~~~
@trn_3 11866
AGTGCCGCAGAGTTTGTAGTGT
+
~~~~~~~~~~~~~~~~~~~~~~

Note that FASTQ identifiers have disappeared. The output identifiers are formed by the fixed string trn (for tally record number) followed by the output record number. The number of times a read was observed is specified on the identifier line as the second field.
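How such records come about can be sketched in a few lines of Python. This is a rough illustration, not tally's implementation: the function name tally_fastq is hypothetical, the single space between identifier and count is an assumption, and ordering by decreasing count is inferred from the example records rather than stated by the documentation.

```python
from collections import Counter


def tally_fastq(reads):
    """Collapse identical reads into FASTQ-like records named trn_<n>."""
    records = []
    # most_common() orders by decreasing count; this ordering is an assumption.
    for n, (read, count) in enumerate(Counter(reads).most_common(), start=1):
        placeholder = "~" * len(read)  # quality is not tracked in this sketch
        records.append(f"@trn_{n} {count}\n{read}\n+\n{placeholder}\n")
    return "".join(records)


print(tally_fastq(["TTA", "ACGT", "ACGT"]), end="")
# @trn_1 2
# ACGT
# +
# ~~~~
# @trn_2 1
# TTA
# +
# ~~~
```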

Changing the output format

A simple line-based format can be obtained using e.g. -format '%R%t%X%n'. The output (containing reads and their counts) will look like this:


The set of supported directives available to the -format option is this:

%R   read
%L   length
%C   number of occurrences
%X   number of occurrences
%T   trinucleotide score
%I   read identifier - numerical identifier constructed on output
%t   tab
%n   newline
%%   percentage character
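As an illustration of how such a specification expands, the Python sketch below renders one output record from a format string. It is not tally's code: the function name render is hypothetical, and %T is left out because the trinucleotide score is not defined in this document.

```python
import re


def render(fmt, read, count, ident):
    """Expand a -format style string for one output record (sketch only)."""
    values = {
        "R": read, "L": str(len(read)),
        "C": str(count), "X": str(count),
        "I": str(ident),
        "t": "\t", "n": "\n", "%": "%",
    }

    def expand(match):
        key = match.group(1)
        if key not in values:
            raise ValueError(f"unsupported directive %{key}")
        return values[key]

    # Each %<char> pair is replaced; all other text is copied verbatim.
    return re.sub(r"%(.)", expand, fmt)


print(repr(render("%R%t%X%n", "ACGT", 7, 1)))  # 'ACGT\t7\n'
```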

To obtain FASTA output, use --fasta-out.

Changing the input format

FASTA format can be read using the --fasta-in option. Other formats can be read using the -record-format option (see Examples).

Reading in counted data

Tally can read in data that is already counted. This requires usage of -record-format and the %X directive. To read in a simple two-field line-based format consisting of read and count one would use:

-record-format '%R%b%X%n'

Retaining quality

To retain quality information, use --with-quality. Memory usage will increase approximately five-fold. For each base the best quality score is recorded. Example output:

Re-pairing and tallying reaper-processed paired-end files

In this scenario it is assumed that each of the paired files was processed independently by reaper, and as a result the record offset correspondence between the two files is lost. The invocation below assumes reaper was given the --fastqx-out option. This causes the introduction of an additional field on the identifier line containing the record offset number, allowing tally to pair up the two files. Options such as --with-quality can be given additionally.

tally -i out1.gz -j out2.gz -o out1.unique.gz -p out2.unique.gz --fastqx-in

Re-pairing reaper-processed files without tallying

In this scenario identical reads will not be collapsed. This simply requires the use of the --no-tally option. Please note that tally will in this case output the identifiers found in the input file. Options such as --with-quality can be given additionally.

tally -i out1.gz -j out2.gz -o out1.unique.gz -p out2.unique.gz --fastqx-in --no-tally

Tallying implicitly paired files

In this scenario two files are implicitly paired, such as is the case for unprocessed paired-end FASTQ files. Tally will process both files record-by-record and pair up records at the same offset. This requires the option --pair-by-offset.

tally -i out1.gz -j out2.gz -o out1.unique.gz -p out2.unique.gz --pair-by-offset

Re-pairing files in other formats

This can be achieved only if the record format can be recognised by the -record-format specification language, and requires the presence of a record offset field in the record format. This field can be read using the %J directive. The specification language is documented below.

%R   expect read (longest sequence found over [a-zA-Z]*) - empty read allowed
%J   expect record offset (integer)
%I   expect identifier (longest sequence of non-blank)
%Q   expect quality (longest sequence of non-blank)
%X   expect count (a nonnegative integer number)
%F   expect and discard field (longest sequence of non-tab)
%G   expect and discard field (longest sequence of non-blank)
%#   discard everything until end of line
%b   expect run of blanks (space or tab)
%n   expect end of line match
%.   expect and discard any character
%s   expect a space
%t   expect a tab
%%   expect a percent sign

The directives listed above are all placeholders that will be filled by tally with the appropriate character or field. All such placeholders start with a percent sign (%). Anything that is not a placeholder will be copied verbatim. An example is the standard count-extended FASTQ format output by reaper (when using --fastqx-out) and expected by tally (when using --fastqx-in). It corresponds to:

-record-format '@%I%brecno=%J%#%R%n+%n%Q%n'
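To make the specification concrete, the Python sketch below translates this one format string into an equivalent regular expression by hand. It is only an approximation of the directive semantics listed above: the name parse_records and the sample record are invented for illustration, and the sketch assumes that %# also consumes the end-of-line character, since %R follows it directly.

```python
import re

# Hand translation of '@%I%brecno=%J%#%R%n+%n%Q%n' (assumptions noted above).
RECORD = re.compile(
    r"@([^ \t\n]+)"    # @%I       identifier: longest run of non-blanks
    r"[ \t]+"          # %b        run of blanks
    r"recno=(\d+)"     # recno=%J  record offset
    r"[^\n]*\n"        # %#        discard the rest of the line (newline included)
    r"([a-zA-Z]*)\n"   # %R%n      read, possibly empty
    r"\+\n"            # +%n
    r"([^ \t\n]+)\n"   # %Q%n      quality
)


def parse_records(text):
    """Return (identifier, offset, read, quality) tuples found in text."""
    return [
        (m.group(1), int(m.group(2)), m.group(3), m.group(4))
        for m in RECORD.finditer(text)
    ]


sample = "@read7 recno=12 extra\nACGT\n+\nIIII\n"  # made-up record
print(parse_records(sample))
# [('read7', 12, 'ACGT', 'IIII')]
```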

Tally was written by Stijn van Dongen and benefited greatly from suggestions by Anton Enright, Mat Davis, Sergei Manakhov, Nenad Bartonicek and Leonor Quintais. For questions and feedback send e-mail to kraken @ ebi . ac . uk.