Markov Background Model Format

Description

This file format is used by many programs in the MEME Suite to model background probabilities. Most MEME Suite programs use only the 0-order model; any higher-order information in the file is silently ignored.

Format Specification

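Each line is either empty, a comment, or a letter chain followed by a probability. In the examples below, comment lines begin with # and the chain is separated from its probability by whitespace.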

The letter chains are defined as one or more letters from either the DNA alphabet ACGT or the protein alphabet ACDEFGHIKLMNPQRSTVWY. If any letter chains specify letters that only appear in the protein alphabet then the background is assumed to be for the protein alphabet, otherwise it is assumed to be for the DNA alphabet.
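The alphabet rule above can be sketched in Python as follows. This is only an illustration, not the MEME Suite's own code: the function name guess_alphabet is hypothetical, and upper-casing the input is an assumption made because the 0-order example in this document uses lowercase letters.

```python
DNA = set("ACGT")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")


def guess_alphabet(chains):
    """Return "protein" if any chain uses a protein-only letter, else "dna".

    Assumption: letters are compared case-insensitively.
    """
    letters = set("".join(chains).upper())
    unknown = letters - PROTEIN  # DNA is a subset of PROTEIN
    if unknown:
        raise ValueError(f"letters outside both alphabets: {sorted(unknown)}")
    return "protein" if letters - DNA else "dna"
```

A file whose chains use only A, C, G and T is therefore always read as DNA, even if it was intended as a protein background over those four residues.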

The probabilities are numbers written in either simple decimal notation (e.g., 0.00015) or exponential notation (e.g., 1.5e-4). To be precise, each probability is a number p that matches the regular expression ^([0]|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?$ and lies in the range 0 ≤ p ≤ 1.
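As a minimal sketch, the check described above could be written in Python like this. The regular expression is copied from the text; the function name parse_probability and the use of ValueError are illustrative choices, not part of the format.

```python
import re

# Regular expression taken verbatim from the format description.
PROB_RE = re.compile(r"^([0]|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?$")


def parse_probability(text):
    """Convert a probability token to float, enforcing syntax and range."""
    # fullmatch is used so that a trailing newline is not accepted.
    if PROB_RE.fullmatch(text) is None:
        raise ValueError(f"not a valid probability token: {text!r}")
    p = float(text)
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {text!r}")
    return p
```

Note that the expression requires a leading digit and forbids extra leading zeros, so tokens such as .5, 01 and 1. are rejected, while 0.5, 1.5e-4 and 2.563e-01 are accepted.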

Each non-empty, non-comment line defines a letter chain and the probability of seeing that sequence of letters. For example, a line containing A 0.25 defines the probability of seeing an "A" as 25%. If the line instead contained CGA 0.015625, it would define the probability of seeing the sequence "CGA" as approximately 1.6%.
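A reader for such lines might look like the following Python sketch. It assumes that comment lines begin with # (as in the examples in this document) and that the chain and probability are separated by whitespace; the function name parse_background is hypothetical.

```python
def parse_background(text):
    """Return a dict mapping each letter chain to its probability."""
    model = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip empty lines and (assumed) "#" comment lines.
        if not line or line.startswith("#"):
            continue
        # Exactly two fields are expected; anything else raises ValueError.
        chain, prob = line.split()
        model[chain] = float(prob)
    return model
```

Because a dict preserves insertion order, the chains come back in the same order as they appear in the file.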

The summed probabilities of all chains of the same length should be approximately 1 (small allowances for rounding are made). Also, the summed probabilities of all chains of the same length that share a specific suffix should approximately equal the probability of that suffix.
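Both consistency conditions can be checked with a short Python sketch such as the one below. Two things here are assumptions rather than part of the format: the tolerance of 1e-3 (the text only says that small allowances are made) and the reading of "suffix" as the chain with its first letter removed. The function name check_sums is hypothetical.

```python
from collections import defaultdict


def check_sums(model, tol=1e-3):
    """Return a list of problems found in a chain -> probability mapping."""
    by_length = defaultdict(float)
    by_suffix = defaultdict(float)
    for chain, p in model.items():
        by_length[len(chain)] += p
        if len(chain) > 1:
            # Assumed reading: the suffix is the chain minus its first letter.
            by_suffix[chain[1:]] += p

    problems = []
    for length, total in sorted(by_length.items()):
        if abs(total - 1.0) > tol:
            problems.append(f"length {length} chains sum to {total:.5f}")
    for suffix, total in sorted(by_suffix.items()):
        if suffix not in model:
            problems.append(f"no entry for suffix {suffix}")
        elif abs(total - model[suffix]) > tol:
            problems.append(f"chains ending in {suffix} sum to {total:.5f}")
    return problems
```

For instance, with the 1st-order example in this document, the chains AA, CA, GA and TA sum to 0.25626, which agrees with the listed probability of A (0.2563) to within rounding.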

It is important to note that probabilities of zero (or one) are not allowed because they cause asymptotic conditions in the equations used by our programs. They are also unlikely to be correct: just because the dataset used to calculate a background does not contain any instances of "CGAAA" does not mean that the sequence is impossible. For this reason, the tool fasta-get-markov automatically adds pseudocounts unless it is specifically told not to.

The file must contain a chain and probability pair for every possible chain of each length from 1 up to N + 1, where N is the "order" of the Markov model. Additionally, the lines must be sorted by increasing chain length, so that the chain and probability pairs for the 0-order model (letter sequences of length 1) appear at the top and those for the Nth-order model (letter sequences of length N + 1) appear at the bottom.
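The completeness and ordering requirements can be illustrated with the Python sketch below. It assumes the chains have already been read in file order and are upper-case, and it reads the completeness rule as "every possible chain of every length up to N + 1 is present". The function name check_layout is hypothetical.

```python
from itertools import product


def check_layout(chains, alphabet="ACGT"):
    """Validate chain order and completeness; return the model order N."""
    if not chains:
        raise ValueError("no chains given")
    if len(set(chains)) != len(chains):
        raise ValueError("duplicate chains")
    lengths = [len(chain) for chain in chains]
    if lengths != sorted(lengths):
        raise ValueError("chains are not sorted by increasing length")
    seen = set(chains)
    max_len = lengths[-1]
    # Every chain of every length 1..max_len must be present.
    for k in range(1, max_len + 1):
        for letters in product(alphabet, repeat=k):
            chain = "".join(letters)
            if chain not in seen:
                raise ValueError(f"missing chain {chain}")
    return max_len - 1
```

Under this reading, a DNA model of order N has 4 + 16 + ... + 4^(N + 1) entries, so the 1st-order example in this document has 4 + 16 = 20 chain lines.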

For example, a 0-order Markov model file might contain:

#   order 0
a       0.324
c       0.176
g       0.176
t       0.324
      

A 1st-order Markov model file might contain:

#   order 0
A       2.563e-01
C       2.437e-01
G       2.437e-01
T       2.563e-01
#   order 1
AA      7.020e-02
AC      5.388e-02
AG      8.089e-02
AT      5.134e-02
CA      7.575e-02
CC      7.050e-02
CG      1.659e-02
CT      8.089e-02
GA      6.280e-02
GC      5.652e-02
GG      7.050e-02
GT      5.388e-02
TA      4.751e-02
TC      6.280e-02
TG      7.575e-02
TT      7.020e-02
      

See Also

A background model file can be created from any FASTA sequence file using the fasta-get-markov program.