The Program RNAalifold
Introduction
RNAalifold
generalizes the folding algorithm for multiple sequence
alignments (MSA), treating the entire alignment as a single generalized
sequence. To assign an energy to a structure on such a generalized
sequence, the energy is simply averaged over all sequences in the
alignment. This average energy is augmented by a covariance term, that
assigns a bonus or penalty to every possible base pair \((i,j)\)
based on the sequence variation in columns \(i\) and \(j\) of
the alignment.
Compensatory mutations are a strong indication of structural
conservation, while consistent mutations provide a weaker signal. The
covariance term used by RNAalifold
therefore assigns a bonus
of 1 kcal/mol to each consistent and 2 kcal/mol for each compensatory
mutation. Sequences that cannot form a standard base pair incur a
penalty of \(-1\) kcal/mol. Thus, for every possible consensus pair
between two columns \(i\) and \(j\) of the alignment a covariance score
\(C_{ij}\) is computed by counting the fraction of sequence pairs
exhibiting consistent and compensatory mutations, as well as the
fraction of sequences that are inconsistent with the pair. The weight
of the covariance term relative to the normal energy function, as well
as the penalty for inconsistent mutations can be changed via command
line parameters.
Apart from the covariance term, the folding algorithm in
RNAalifold
is essentially the same as for single sequence
folding. In particular, folding an alignment containing just one
sequence will give the same result as single sequence folding using
RNAfold
. For \(N\) sequences of length \(n\) the required CPU
time scales as \(\mathcal{O}(N\cdot n^2 + n^3)\) while memory
requirements grow as the square of the sequence length. Thus
RNAalifold
is in general faster than folding each sequence
individually. The main advantage, however, is that the accuracy of
consensus structure predictions is generally much higher than for
single sequence folding, where typically only between 40% and 70% of
the base pairs are predicted correctly.
Apart from prediction of MFE
structures RNAalifold
also implements an algorithm to compute the partition function over
all possible (consensus) structures and the thermodynamic equilibrium
probability for each possible pair. These base pairing probabilities
are useful to see structural alternatives, and to distinguish well
defined regions, where the predicted structure is most likely correct,
from ambiguous regions.
As a first example we’ll produce a consensus structure prediction for the following four tRNA sequences.
$ cat > four.seq
>M10740 Yeast-PHE
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
>K00349 Drosophila-PHE
GCCGAAAUAGCUCAGUUGGGAGAGCGUUAGACUGAAGAUCUAAAGGUCCCCGGUUCAAUCCCGGGUUUCGGCA
>K00283 Halobacterium volcanii Lys-tRNA-1
GGGCCGGUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCGCGUGUUCGAAUCGCGUCCGGCCCA
>AF346993
CAGAGUGUAGCUUAACACAAAGCACCCAACUUACACUUAGGAGAUUUCAACUUAACUUGACCGCUCUGA
RNAalifold
uses aligned sequences as input. Thus, our first step will be
to align the sequences. We use clustalw2
in this example, since it’s one
of the most widely used alignment programs and has been shown to work well on
structural RNAs. Other alignment programs can be used (including programs that
attempt to do structural alignment of RNAs), but for this example the resulting
multiple sequence alignment should be in Clustal
format. Get clustalw2
and install it as you have done it with the other packages:
http://www.clustal.org/clustal2.
RNAalifold
Output Files
Content of the alifold.out
file:
4 sequence; length of alignment 78
alifold output
6 72 0 99.8% 0.007 GC:2 GU:1 AU:1
33 43 0 98.9% 0.033 GC:2 GU:1 AU:1
31 45 0 99.0% 0.030 CG:3 UA:1
15 25 0 98.9% 0.045 CG:3 UA:1
5 73 1 99.7% 0.008 CG:2 GC:1
13 27 0 99.1% 0.042 CG:4
14 26 0 99.1% 0.042 UA:4
4 74 1 99.5% 0.015 CG:3
[...]
The last output file produced by RNAalifold -p
, named alifold.out
,
is a plain text file with detailed information on all plausible base pairs
sorted by the likelihood of the pair. In the example above we see that the
pair \((6,72)\) has no inconsistent sequences, is predicted almost with
probability 1, and occurs as a GC
pair in two sequences, a GU
pair
in one, and a AU
pair in another.
RNAalifold
automatically produces a drawing of the consensus structure
in Postscript format and writes it to the file alirna.ps
. In the
structure graph consistent and compensatory mutations are marked by a
circle around the variable base(s), i.e. pairs where one pairing partner
is encircled exhibit consistent mutations, whereas pairs supported by
compensatory mutations have both bases marked. Pairs that cannot be formed
by some of the sequences are shown gray instead of black.
The structure layout and dotplot files alirna.ps
and alidot.ps
should look as follows:
In the example given, many pairs show such inconsistencies. This is because
one of the sequences (AF346993) is not aligned well by clustalw
.
Note
Subsequent calls to RNAalifold
will overwrite any existing output
alirna.ps
(alidot.ps
, alifold.out
) files in the current
directory. Be sure to rename any files you want to keep.
Structure predictions for the individual sequences
The consensus structure computed by RNAalifold
will contain only
pairs that can be formed by most of the sequences. The structures of the
individual sequences will typically have additional base pairs that are not
part of the consensus structure. Moreover, ncRNA may exhibit a highly
conserved core structure while other regions are more variable. It may
therefore be desirable to produce structure predictions for one particular
sequence, while still using covariance information from other sequences.
This can be accomplished by first computing the consensus structure for all
sequences using RNAalifold
, then folding individual sequences using
RNAfold -C
with the consensus structure as a constraint. In
constraint folding mode RNAfold -C
allows only base pairs to form
which are compatible with the constraint structure. This resulting
structure typically contains most of the constraint (the consensus
structure) plus some additional pairs that are specific for this sequence.
The refold.pl
script removes gaps and maps the consensus structure to
each individual sequence.
$ RNAalifold RNaseP.aln > RNaseP.alifold
$ gv alirna.ps
$ refold.pl RNaseP.aln RNaseP.alifold | head -3 > RNaseP.cfold
$ RNAfold -C --noLP < RNaseP.cfold > RNaseP.refold
$ gv E-coli_ss.ps
If you compare the refolded structure (E-coli_ss.ps
) with the
structure you get by simply folding the E.coli sequence in the
RNaseP.seq
file (RNAfold --noLP
) you find a clear rearrangement.
In cases where constrained folding results in a structure that is very different from the consensus, or if the energy from constrained folding is much worse than from unconstrained folding, this may indicate that the sequence in question does not really share a common structure with the rest of the alignment or is misaligned. One should then either remove or re-align that sequence and recompute the consensus structure.
Note
Note that since RNase P forms sizable pseudo-knots, a perfect prediction is impossible in this case.