The Program RNAsubopt
Introduction
By default, RNAsubopt
calculates all suboptimal secondary structures within a
given energy range above the MFE
structure [Wuchty et al., 1999].
Note
Be careful, the number of structures returned grows exponentially with both sequence length and energy range.
Suboptimal folding
Generate all suboptimal structures within a certain energy range from the
MFE
specified by the-e
option:$ RNAsubopt -e 1 -s < test.seq CUACGGCGCGGCGCCCUUGGCGA -500 100 ...........((((...)))). -5.00 ....((((...))))........ -4.80 (((.((((...))))..)))... -4.20 ...((.((.((...)).)).)). -4.10
The text output shows an energy sorted list (option -s
) of all secondary
structures within 1~kcal/mol of the MFE
structure. Our sequence actually
has a ground state structure (-5.70) and three structures within 1~kcal/mol
range.
MFE
folding alone gives no indication that there are actually a number
of plausible structures. Remember that RNAsubopt
cannot automatically
plot structures, therefore you can use the tool RNAplot
. Note that you
can’t simply pipe the output of RNAsubopt
to RNAplot
using:
$ RNAsubopt < test.seq | RNAplot
You need to manually create a file for each structure you want to plot.
Here, for example we created a new file named suboptstructure.txt
:
> suboptstructure-4.20
CUACGGCGCGGCGCCCUUGGCGA
(((.((((...))))..)))...
The fasta header is optional, but useful (without it the outputfile will
be named rna.ps
).
The next two lines contain the sequence and the suboptimal structure you want to plot; in this case we plotted the structure with the folding energy of -4.20.
Then plot it with
$ RNAplot < suboptstructure.txt
Note that the number of suboptimal structures grows exponentially with
sequence length and therefore this approach is only tractable for
sequences with less than 100 nt. To keep the number of suboptimal
structures manageable the option --noLP
can be used, forcing
RNAsubopt
to produce only structures without isolated base
pairs. While RNAsubopt
produces all structures within an
energy range, mfold
produces only a few, hopefully representative,
structures. Try folding the sequence on the mfold
server at http://mfold.rna.albany.edu/?q=mfold.
Sometimes you want to get information about unusual properties of the Boltzmann ensemble (the sum of all RNA structures possible) for which no specialized program exists. For example you want to know all fractions of a bacterial mRNA in the Boltzmann ensemble where the Shine-Dalgarno (SD) sequence is unpaired. If the SD sequence is concealed by secondary structure the translation efficiency is reduced.
In such cases you can resort to drawing a representative sample of
structures from the Boltzmann ensemble by using the option
-p
. Now you can simply count how many structures in the sample
possess the feature you are looking for. This number divided by the
size of your sample gives you the desired fraction.
The following example calculates the fraction of structures in the ensemble that have bases 6 to 8 unpaired.
Sampling the Boltzmann Ensemble
RNAsubopt
also implements a statisctical sampling algorithm to
draw secondary structures from the ensemble according to their equilibrium
probability [Ding and Lawrence, 2003]:
Draw a sample of size 10,000 from the Boltzmann ensemble
Calculate the desired property, e.g. by using a
perl
script:$ RNAsubopt -p 10000 < test.seq > tt $ perl -nle '$h++ if substr($_,5,3) eq "..."; END {print $h/$.}' tt 0.391960803919608
A far better way to calculate this property is to use RNAfold -p
to get the ensemble free energy, which is related to the partition
function via \(F = -RT\ln(Q)\), for the unconstrained (\(F_u\))
and the constrained case (\(F_c\)), where the three bases are not
allowed to form base pairs (use option -C
), and evaluate
\(p_c = \exp((F_u - F_c)/RT)\) to get the desired probability.
So let’s do the calculation using RNAfold
:
$ RNAfold -p
Input string (upper or lower case); @ to quit
....,....1....,....2....,....3....,....4....,....5....,....6....,....7....,....8
CUACGGCGCGGCGCCCUUGGCGA
length = 23
CUACGGCGCGGCGCCCUUGGCGA
...........((((...)))).
minimum free energy = -5.00 kcal/mol
....{,{{...||||...)}}}.
free energy of ensemble = -5.72 kcal/mol
....................... { 0.00 d=4.66}
frequency of mfe structure in ensemble 0.311796; ensemble diversity 6.36
Now we have calculated the free ensemble energy of the ensemble over all structures \(F_u\), in the next step we have to calculate it for the structures using a constraint (\(F_c\)).
Following notation has to be used for defining the constraint:
|
: paired with another base.
: no constraint at allx
: base must not pair<
: base i is paired with a base j<i>
: base i is paired with a base j>imatching brackets
( )
: base i pairs base j
So our constraint should look like this:
.....xxx...............
Next call the application with following command and provide the sequence and constraint we just created:
$ RNAfold -p -C
The output should look like this:
length = 23
CUACGGCGCGGCGCCCUUGGCGA
...........((((...)))).
minimum free energy = -5.00 kcal/mol
...........((((...)))).
free energy of ensemble = -5.14 kcal/mol
...........((((...)))). { -5.00 d=0.42}
frequency of mfe structure in ensemble 0.792925; ensemble diversity 0.79
Afterwards evaluate the desired probability according to the formula given before
e.g. with a simple perl
script:
$ perl -e 'print exp(-(5.72-5.14)/(0.00198*310.15))."\n"'
You can see that there is a slight difference between the RNAsubopt
run with 10,000
samples and the RNAfold
run including all structures.