Minion

April 17, 2015

minion-15-065

Adapter prediction

Minion is a simple program for assembling sequences from sequencing data using De Bruijn graphs. It has two main uses.

• To infer 3' adapter sequence with minion search-adapter. This is especially useful if the experimental protocol metadata is no longer associated with a FASTQ file, and the 3' adapter is not known. A small list of candidate adapters can be cross-referenced (using the sibling program swan) with a FASTA file (again small) of known adapter sequences.

• To assemble the most highly expressed (small RNA) transcripts with minion assemble as a very quick quality control. The resulting FASTA file can be cross-referenced (using the sibling program swan) with a reference FASTA file. This is foremost applicable to small RNA data, for example to quickly identify highly expressed microRNAs.

When searching for the 3' adapter without prior knowledge of the sequence, minion will output one or two candidates, selected from a longer list of candidates and accompanied by metadata. If the adapter sequence is known, minion will compare this adapter with all the candidates it found and output the best match.

By default minion expects FASTQ input and will try to read two million sequences. Input may have been compressed using gzip. Other formats can be specified using the -record-format option. This option works the same as the identically named option in reaper, and is described in the reaper documentation. The number of input sequences read by minion can be changed with the -do <NUM> option.
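To illustrate what a limit on the number of reads amounts to, the following Python sketch yields at most a given number of four-line FASTQ records from a text handle. It is not minion's reader: the function name read_fastq_records and the toy records are hypothetical, and only the default of two million is taken from the text above. A gzip-compressed file could be supplied by opening it with gzip.open(path, "rt").

```python
import io
from itertools import islice

DEFAULT_LIMIT = 2_000_000  # default number of reads stated in the text above


def read_fastq_records(handle, limit=DEFAULT_LIMIT):
    """Yield (header, sequence, quality) for at most `limit` 4-line FASTQ records."""
    count = 0
    while count < limit:
        block = list(islice(handle, 4))
        if len(block) < 4:  # end of input, or a truncated trailing record
            return
        header, sequence, _plus, quality = (line.rstrip("\n") for line in block)
        yield header, sequence, quality
        count += 1


# Toy in-memory input standing in for an opened FASTQ file (hypothetical data).
toy = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nTTGA\n+\nIIII\n"
    "@r3\nGGCC\n+\nIIII\n"
)
records = list(read_fastq_records(toy, limit=2))
print([sequence for _, sequence, _ in records])
```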

To infer an adapter sequence, issue

minion search-adapter -i data.fq.gz

To compare a known adapter sequence to a list of minion-computed sequences, issue

minion search-adapter -i data.fq.gz -adapter <SEQUENCE>

To compare a list of known adapter sequences in a FASTA file ADAPTERS.fa to a list of minion-computed sequences, issue

minion search-adapter -i data.fq.gz -show 3 -write-fasta minion.fasta
swan -r ADAPTERS.fa -q minion.fasta
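The two-step cross-reference above can be wrapped in a small script. The Python sketch below assembles the two command lines using only the options shown above. The function names and file names are placeholders, and actually running the commands assumes that minion and swan are installed and on the PATH.

```python
import subprocess


def build_commands(fastq, adapters_fasta, candidates_fasta="minion.fasta", show=3):
    """Return the minion and swan argument lists for the cross-reference step."""
    minion_cmd = [
        "minion", "search-adapter",
        "-i", fastq,
        "-show", str(show),
        "-write-fasta", candidates_fasta,
    ]
    swan_cmd = ["swan", "-r", adapters_fasta, "-q", candidates_fasta]
    return minion_cmd, swan_cmd


def run_cross_reference(fastq, adapters_fasta):
    """Run both steps in order; raises CalledProcessError if either one fails."""
    for cmd in build_commands(fastq, adapters_fasta):
        subprocess.run(cmd, check=True)


# Only build and print the commands here; nothing is executed.
minion_cmd, swan_cmd = build_commands("data.fq.gz", "ADAPTERS.fa")
print(" ".join(minion_cmd))
print(" ".join(swan_cmd))
```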

These modes and additional options are described in greater detail below.

Description

Minion creates a De Bruijn graph from an input FASTQ file specified on the command line. In its default setting it uses 12-mers. In this setting it requires relatively little memory (approximately 0.6G) and is suitable for interactive use on modest computing resources. For each increment in k a fourfold increase in memory is required, leading to a usage of approximately 10G for 14-mers.
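The memory figures above follow from the fourfold growth per increment in k. A quick check in Python, taking the approximate 0.6G figure for 12-mers as the baseline; the function name is hypothetical, and the result is only the rough estimate implied by the text, not a measurement.

```python
BASE_K = 12
BASE_MEMORY_GB = 0.6  # approximate figure for 12-mers, from the text above


def estimated_memory_gb(k):
    """Scale the 12-mer baseline by a factor of four per increment in k."""
    return BASE_MEMORY_GB * 4 ** (k - BASE_K)


for k in (12, 13, 14):
    print(k, round(estimated_memory_gb(k), 1))
```

For k = 14 this gives 0.6 x 16 = 9.6G, in line with the approximately 10G quoted above.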

Minion can be run in different modes. A mode will set several parameters that influence the analysis of the De Bruijn graph. In the case of adapter analysis, these are set such that the transition from adapter sequence to biological sequence (reading from 3' to 5') will be detected.
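As a toy illustration of why the start of a 3' adapter stands out in a De Bruijn graph, the Python sketch below builds a k-mer graph from a few made-up reads that share an adapter, then counts the distinct k-mers that directly precede the first adapter k-mer. This is not minion's algorithm or its parameter settings; the reads, the small value of k, and the function name are all hypothetical.

```python
from collections import defaultdict


def build_predecessors(reads, k):
    """Map each k-mer to the set of k-mers that directly precede it in any read."""
    predecessors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            left, right = read[i:i + k], read[i + 1:i + k + 1]
            predecessors[right].add(left)
    return predecessors


ADAPTER = "TCGTATGC"  # toy adapter: first 8 bases of the example sequence in this document
INSERTS = ["ACGGA", "TTGCC", "GGATG", "CAAGT"]  # made-up biological inserts
reads = [insert + ADAPTER for insert in INSERTS]

K = 4
predecessors = build_predecessors(reads, K)

first_adapter_kmer = ADAPTER[:K]       # node where the adapter begins
inner_adapter_kmer = ADAPTER[1:K + 1]  # node one step further into the adapter

# Walking backwards (3' to 5'), the graph branches where the adapter meets the inserts.
print(len(predecessors[first_adapter_kmer]))
print(len(predecessors[inner_adapter_kmer]))
```

In this toy data the first adapter k-mer has four distinct predecessors, one per insert, while a k-mer inside the adapter has a single predecessor.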

Mode search-adapter

When searching for adapter sequence, minion can be run in two ways, namely inference and test mode. Both work with the same set of candidate sequences (computed by minion), but differ in the way in which these are ordered and presented.

When either inferring or testing an adapter sequence, more candidate sequences can be shown with the -show <NUM> option.

3.1 Inferring adapter sequence

In inference mode the candidates are ordered according to two different criteria, and the best entry according to each is displayed and output. The two criteria are unfortunately necessary due to the varying characteristics of 3' adapter sequence in different experimental protocols. The first criterion is frequency of occurrence; the second criterion incorporates a fan-out measure that captures the typical characteristic of 3' adapter sequence of being attached to a multitude of different prefixes. To infer adapter sequence from sequencing data with minion, issue the following:

minion search-adapter -i data.fq.gz

To output more candidates use -show <num>.

3.1.1 Writing FASTA output

To write FASTA output (in addition to the screen output) use -write-fasta <file-name>.

3.2 Testing a known adapter sequence

In test mode the user supplies a query adapter sequence on the command line with the -adapter option. This sequence is compared to all candidate sequences. Minion then outputs the candidate sequence that best matches the query sequence and shows the alignment between the two sequences. An example invocation is this:

minion search-adapter -i data.fq.gz -adapter <SEQUENCE>

To output more matches use -show <num>. To compute more candidate sequences use -N <num> (the default is 50).

3.3 Minion search-adapter output

Example minion output is shown below.

criterion=sequence-density sequence-density=25.68 sequence-density-rank=1 fanout-score=21.54 fanout-score-rank=9 prefix-density=49.06 prefix-fanout=21.5 sequence=TCGTATGCCGTCTTCTGCTTG

The sequence-density trait gives the prevalence of the inferred adapter as a percentage of the number of reads read. For sRNA sequencing data this number will often be very high, as in the example.

The prefix-fanout trait is a weighted measure that can be interpreted as the number of distinct prefixes, of length three, observed with this particular candidate sequence. The maximal possible value for this trait is 64. The prefix-density trait is the total number of these prefixes (not necessarily distinct), as a percentage of the number of reads read.
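To make the bound of 64 concrete: there are 4^3 = 64 possible DNA prefixes of length three. The Python sketch below counts the distinct three-base prefixes that immediately precede a candidate sequence in a set of reads. It is an unweighted simplification for illustration only; the weighting used by minion is not described here, and the reads and function name are hypothetical.

```python
PREFIX_LENGTH = 3
MAX_DISTINCT_PREFIXES = 4 ** PREFIX_LENGTH  # 64 possible prefixes over A, C, G, T


def distinct_prefixes(reads, candidate, prefix_length=PREFIX_LENGTH):
    """Collect the distinct prefixes found directly before the first
    occurrence of `candidate` in each read."""
    found = set()
    for read in reads:
        pos = read.find(candidate)
        if pos >= prefix_length:  # need a full-length prefix before the match
            found.add(read[pos - prefix_length:pos])
    return found


candidate = "TCGTATGCCGTCTTCTGCTTG"  # example sequence from this document
toy_reads = [                        # made-up reads: insert followed by candidate
    "ACGGA" + candidate,
    "TTGGA" + candidate,  # same 3-base prefix (GGA) as the first read
    "GGATC" + candidate,
    "CA" + candidate,     # insert too short to provide a 3-base prefix
]

prefixes = distinct_prefixes(toy_reads, candidate)
print(sorted(prefixes), len(prefixes), MAX_DISTINCT_PREFIXES)
```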

The fanout-score criterion is the prefix-fanout trait, multiplied by the ratio of prefix-density to sequence-density, provided that this ratio is larger than one.

When inferring adapters (i.e. without using -adapter) two candidate sequences will be shown. The second will start with the line criterion=fanout-score. The second should only be considered if the first candidate is clearly a biological sequence. This can be established by using one of the BLAST interfaces provided, for example, by NCBI and ENSEMBL.
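Because each candidate is reported as a series of key=value pairs, the output is easy to post-process. A minimal Python sketch, assuming the pairs of one record are separated by whitespace (spaces or newlines) as in the example above; the function name parse_record is hypothetical and this is not part of minion.

```python
STRING_KEYS = {"criterion", "sequence"}  # fields kept as text


def parse_record(text):
    """Split whitespace-separated key=value pairs into a dict,
    converting numeric values to int or float where possible."""
    record = {}
    for token in text.split():
        if "=" not in token:  # skip anything that is not a key=value pair
            continue
        key, _, value = token.partition("=")
        if key in STRING_KEYS:
            record[key] = value
            continue
        try:
            record[key] = int(value) if value.isdigit() else float(value)
        except ValueError:
            record[key] = value  # leave unexpected values untouched
    return record


example = (
    "criterion=sequence-density sequence-density=25.68 sequence-density-rank=1 "
    "fanout-score=21.54 fanout-score-rank=9 prefix-density=49.06 "
    "prefix-fanout=21.5 sequence=TCGTATGCCGTCTTCTGCTTG"
)
record = parse_record(example)
print(record["criterion"], record["sequence-density"], record["sequence"])
```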

If the user supplies a query adapter sequence, the output is prefixed with an alignment for each minion-derived sequence, as shown below.

(predicted sequence)
TCGTATGCCGTCTTCTGCTTG
|||||||||||||||||||||
TCGTATGCCGTCTTCTGCTTGT
(query sequence)
---
match-score=100 match-count=21 sequence-density=25.68 sequence-density-rank=1 fanout-score=21.54 fanout-score-rank=9 prefix-density=49.06 prefix-fanout=21.5 sequence=TCGTATGCCGTCTTCTGCTTG
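The match-count in this example is consistent with a simple position-by-position comparison of the two sequences, as in the Python sketch below. The ungapped comparison is an assumption made for illustration; how minion computes its alignment and its match-score is not described here, and the function name is hypothetical.

```python
def count_matches(predicted, query):
    """Count identical bases when both sequences are compared from their first position."""
    return sum(1 for a, b in zip(predicted, query) if a == b)


predicted = "TCGTATGCCGTCTTCTGCTTG"
query = "TCGTATGCCGTCTTCTGCTTGT"

matches = count_matches(predicted, query)
print(matches, len(predicted), len(query))
```

All 21 bases of the predicted sequence agree with the query; the query has one additional trailing base.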

Author/contact

Minion was written by Stijn van Dongen and benefited from suggestions by Anton Enright and Mat Davis. For questions and feedback send e-mail to kraken @ ebi . ac . uk.