RNAlib-2.4.14
The standard representation of a secondary structure in our library is the Dot-Bracket Notation (a.k.a. Dot-Parenthesis Notation), where matching brackets symbolize base pairs and unpaired bases are shown as dots. Based on that notation, more elaborate representations have been developed to include additional information, such as the loop context a nucleotide belongs to, and to annotate pseudo-knots.
The Dot-Bracket notation, introduced in the early days of the ViennaRNA Package, denotes base pairs by matching pairs of parentheses '()' and unpaired nucleotides by dots '.'.
Example: A simple helix of size 4 enclosing a hairpin of size 4 is annotated as
((((....))))
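Matching the parentheses with a stack recovers the base pairs from such a string. The following is a minimal sketch and not the library's own parser: the function name pair_table_sketch and the 0-based indexing are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of RNAlib): returns an array pt of length
 * strlen(db) where pt[i] is the 0-based index of the pairing partner of
 * position i, or -1 if position i is unpaired. Returns NULL if the brackets
 * are unbalanced or allocation fails. The caller frees the result.
 */
int *pair_table_sketch(const char *db)
{
  assert(db != NULL);
  size_t n = strlen(db), top = 0;
  int *pt = malloc((n + 1) * sizeof *pt);
  int *stack = malloc((n + 1) * sizeof *stack);
  int ok = (pt != NULL && stack != NULL);

  for (size_t i = 0; ok && i < n; i++) {
    pt[i] = -1;                       /* unpaired until proven otherwise */
    if (db[i] == '(') {
      stack[top++] = (int)i;          /* remember the open position */
    } else if (db[i] == ')') {
      if (top == 0) {
        ok = 0;                       /* closing bracket without partner */
      } else {
        int j = stack[--top];         /* most recent unmatched '(' */
        pt[i] = j;
        pt[j] = (int)i;
      }
    }
  }
  if (top != 0)
    ok = 0;                           /* unmatched '(' left over */
  free(stack);
  if (!ok) {
    free(pt);
    return NULL;
  }
  return pt;
}
```

For the helix above, the outermost pair joins positions 0 and 11, and the four loop positions 4 to 7 stay at -1.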
A more generalized version of the original Dot-Bracket notation may use additional pairs of brackets, such as '<>', '{}', and '[]', and matching pairs of uppercase/lowercase letters. This allows for annotating pseudo-knots, since different pairs of brackets are not required to be nested.
Example: The following annotations of a simple structure with two crossing helices of size 4 are equivalent:
<<<<[[[[....>>>>]]]]
((((AAAA....))))aaaa
AAAA{{{{....aaaa}}}}
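The equivalence of these three annotations can be checked by matching each bracket type on its own stack, so that different types may cross. The sketch below is an illustration under assumptions, not library code: the names bracket_type and pair_table_ext_sketch are hypothetical, indices are 0-based, and every character that is neither a bracket nor an ASCII letter is treated as unpaired.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define N_TYPES 30 /* (), <>, {}, [] plus 26 upper-/lowercase letter pairs */

/* Map a character to its bracket type and report whether it opens a pair.
 * Returns -1 for characters that do not denote a paired position. */
int bracket_type(char c, int *is_open)
{
  static const char openers[] = "(<{[";
  static const char closers[] = ")>}]";
  const char *p;

  if (c != '\0' && (p = strchr(openers, c)) != NULL) { *is_open = 1; return (int)(p - openers); }
  if (c != '\0' && (p = strchr(closers, c)) != NULL) { *is_open = 0; return (int)(p - closers); }
  if (c >= 'A' && c <= 'Z') { *is_open = 1; return 4 + (c - 'A'); }
  if (c >= 'a' && c <= 'z') { *is_open = 0; return 4 + (c - 'a'); }
  return -1;
}

/* Hypothetical helper: pair table for extended dot-bracket strings.
 * pt[i] is the partner of position i or -1. NULL on unbalanced input. */
int *pair_table_ext_sketch(const char *s)
{
  assert(s != NULL);
  size_t n = strlen(s);
  int *pt = malloc((n + 1) * sizeof *pt);
  int *prev = malloc((n + 1) * sizeof *prev); /* links openers of one type */
  int top[N_TYPES];                           /* last open position per type */
  int ok = (pt != NULL && prev != NULL);

  for (int t = 0; t < N_TYPES; t++)
    top[t] = -1;

  for (size_t i = 0; ok && i < n; i++) {
    int is_open = 0;
    int t = bracket_type(s[i], &is_open);
    pt[i] = -1;
    if (t < 0)
      continue;                               /* unpaired position */
    if (is_open) {
      prev[i] = top[t];                       /* push onto the stack of type t */
      top[t] = (int)i;
    } else if (top[t] < 0) {
      ok = 0;                                 /* closer without opener */
    } else {
      int j = top[t];                         /* pop from the stack of type t */
      top[t] = prev[j];
      pt[i] = j;
      pt[j] = (int)i;
    }
  }
  for (int t = 0; ok && t < N_TYPES; t++)
    if (top[t] >= 0)
      ok = 0;                                 /* opener left unmatched */
  free(prev);
  if (!ok) {
    free(pt);
    return NULL;
  }
  return pt;
}
```

Applied to the three strings of the example, the sketch yields identical tables: positions 0 to 3 pair with 15 down to 12, and positions 4 to 7 pair with 19 down to 16.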
The WUSS notation, as frequently used for consensus secondary structures in Stockholm 1.0 format, allows for a fine-grained annotation of base pairs and unpaired nucleotides, including pseudo-knots.
Below, you'll find a list of secondary structure elements and their corresponding WUSS annotation (see also the Infernal user guide at http://eddylab.org/infernal/Userguide.pdf).
Base pairs
Nested base pairs are annotated by matching pairs of the symbols '<>', '()', '{}', and '[]'. Each type of matching pair has its own special meaning; however, when used as input to our programs, e.g. as a structure constraint, these details are usually ignored. Furthermore, base pairs that constitute a pseudo-knot are denoted by letters from the Latin alphabet and are, unless stated otherwise, ignored entirely by our programs.
Hairpin loops
Unpaired nucleotides that constitute the hairpin loop are indicated by underscores, '_'.
Example:
Bulges and interior loops
Residues that constitute a bulge or interior loop are denoted by dashes, '-'.
Example:
Multibranch loops
Unpaired nucleotides in multibranch loops are indicated by commas, ','.
Example:
External residues
Single stranded nucleotides in the exterior loop, i.e. those not enclosed by any base pair, are denoted by colons, ':'.
Example:
Insertions
In cases where an alignment represents the consensus with a known structure, insertions relative to the known structure are denoted by periods, '.'. Regions where local structural alignment was invoked, leaving regions of both target and query sequence unaligned, are indicated by tildes, '~'.
Pseudo-knots
The WUSS notation allows for annotation of pseudo-knots using pairs of upper-case/lower-case letters.
Example:
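Since the finer distinctions of WUSS are usually ignored on input, a WUSS string can be reduced to plain dot-bracket notation by mapping every nested bracket type to parentheses and everything else to a dot. The sketch below illustrates that reduction under assumptions: the function name wuss_to_db_sketch is hypothetical, pseudo-knot letters are dropped (treated as unpaired), and the WUSS strings mentioned afterwards are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not the library's converter): reduce a WUSS
 * annotation to plain dot-bracket notation. Returns a newly allocated
 * string that the caller frees, or NULL if allocation fails.
 */
char *wuss_to_db_sketch(const char *wuss)
{
  assert(wuss != NULL);
  size_t n = strlen(wuss);
  char *db = malloc(n + 1);

  if (db == NULL)
    return NULL;

  for (size_t i = 0; i < n; i++) {
    switch (wuss[i]) {
      case '<': case '(': case '{': case '[':
        db[i] = '(';                  /* any nested opening bracket */
        break;
      case '>': case ')': case '}': case ']':
        db[i] = ')';                  /* any nested closing bracket */
        break;
      default:
        db[i] = '.';                  /* loop symbols, insertions, pseudo-knot letters */
        break;
    }
  }
  db[n] = '\0';
  return db;
}
```

With this mapping, the made-up annotation :<<<-<<____>>->>>: becomes .(((.((....)).)))., and the pseudo-knot letters in <<AA__>>aa turn into unpaired positions, giving ((....))..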
Alternatively, one may find representations with two types of node labels, 'P' for paired and 'U' for unpaired; a dot is then replaced by '(U)', and each closed bracket is assigned an additional identifier 'P'. We call this the expanded notation. In [8], a condensed representation of the secondary structure is proposed, the so-called homeomorphically irreducible tree (HIT) representation. Here, a stack is represented as a single pair of matching brackets labeled 'P' and weighted by the number of base pairs. Correspondingly, a contiguous stretch of unpaired bases is shown as one pair of matching brackets labeled 'U' and weighted by its length. Generally, any string consisting of matching brackets and identifiers is equivalent to a plane tree with as many different types of nodes as there are identifiers.
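The expanded notation follows from the dot-bracket string by a character-wise substitution. The sketch below is an illustration only and not the library's expand_Full(): the name expand_sketch and the with_root flag are assumptions, the input is not checked for balanced brackets, and the optional root labeled 'R' is wrapped around the result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: rewrite a plain dot-bracket string in expanded
 * notation. '.' becomes "(U)", '(' stays "(", and ')' becomes "P)".
 * If with_root is non-zero the result is wrapped as "(" ... "R)".
 * Returns NULL on allocation failure or on any other input character.
 */
char *expand_sketch(const char *db, int with_root)
{
  assert(db != NULL);
  size_t n = strlen(db), k = 0;
  /* at most 3 output characters per input character, plus "(", "R)" and NUL */
  char *out = malloc(3 * n + 4);

  if (out == NULL)
    return NULL;

  if (with_root)
    out[k++] = '(';

  for (size_t i = 0; i < n; i++) {
    switch (db[i]) {
      case '.':
        memcpy(out + k, "(U)", 3);
        k += 3;
        break;
      case '(':
        out[k++] = '(';
        break;
      case ')':
        memcpy(out + k, "P)", 2);
        k += 2;
        break;
      default:                        /* not plain dot-bracket notation */
        free(out);
        return NULL;
    }
  }
  if (with_root) {
    out[k++] = 'R';
    out[k++] = ')';
  }
  out[k] = '\0';
  return out;
}
```

For instance, expand_sketch("(..)", 0) gives ((U)(U)P).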
Bruce Shapiro proposed a coarse grained representation [21], which does not retain the full information of the secondary structure. He represents the different structure elements by single matching brackets and labels them as 'H' (hairpin loop), 'I' (interior loop), 'B' (bulge), 'M' (multi-loop), and 'S' (stack). We extend his alphabet by an extra letter for external elements, 'E'. Again, these identifiers may be followed by a weight corresponding to the number of unpaired bases or base pairs in the structure element. All tree representations (except for the dot-bracket form) can be encapsulated into a virtual root (labeled 'R').
The following example illustrates the different linear tree representations used by the package:
Consider the secondary structure represented by the dot-bracket string (full tree)
.((..(((...)))..((..)))).
which is the most convenient condensed notation used by our programs and library functions.
Then, the following tree representations are equivalent:
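Expanded tree: ((U)(((U)(U)((((U)(U)(U)P)P)P)(U)(U)(((U)(U)P)P)P)P)(U)R)
HIT representation [8]: ((U1)((U2)((U3)P3)(U2)((U2)P2)P2)(U1)R)
Coarse grained tree (with root node R, without stem nodes S): ((H)((H)M)R)
Coarse grained tree (with root node R): (((((H)S)((H)S)M)S)R)
Coarse grained tree (with root node R, with external nodes E): ((((((H)S)((H)S)M)S)E)R)
Weighted coarse grained tree (with root node R, with external nodes E): ((((((H3)S3)((H2)S2)M4)S2)E2)R)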
The Expanded tree is rather clumsy and mostly included for the sake of completeness. The different versions of Coarse Grained Tree Representations are variations of Shapiro's linear tree notation.
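The HIT string can be derived from a pair table by collapsing each run of unpaired bases into one weighted 'U' node and each stack of directly nested pairs into one weighted 'P' node. The following sketch illustrates this and is not the library's b2HIT(): the names pair_table_sketch, hit_region and hit_sketch are hypothetical, indices are 0-based, and characters other than parentheses count as unpaired.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: pt[i] = partner of position i, or -1 if unpaired.
 * Returns NULL on unbalanced brackets or allocation failure. */
int *pair_table_sketch(const char *db)
{
  size_t n = strlen(db), top = 0;
  int *pt = malloc((n + 1) * sizeof *pt);
  int *stack = malloc((n + 1) * sizeof *stack);
  int ok = (pt != NULL && stack != NULL);

  for (size_t i = 0; ok && i < n; i++) {
    pt[i] = -1;
    if (db[i] == '(') {
      stack[top++] = (int)i;
    } else if (db[i] == ')') {
      if (top == 0) { ok = 0; }
      else { int j = stack[--top]; pt[i] = j; pt[j] = (int)i; }
    }
  }
  if (top != 0)
    ok = 0;
  free(stack);
  if (!ok) { free(pt); return NULL; }
  return pt;
}

/* Append the HIT encoding of positions i..j (inclusive) to out, starting at
 * offset k, and return the new offset. cap is the buffer capacity. */
size_t hit_region(const int *pt, int i, int j, char *out, size_t k, size_t cap)
{
  int p = i;
  while (p <= j) {
    if (pt[p] < 0) {                  /* run of unpaired bases -> (U<len>) */
      int len = 0;
      while (p <= j && pt[p] < 0) { p++; len++; }
      k += (size_t)snprintf(out + k, cap - k, "(U%d)", len);
    } else {                          /* stack of nested pairs -> (...P<pairs>) */
      int q = pt[p], pairs = 1;
      while (p + pairs < q - pairs && pt[p + pairs] == q - pairs)
        pairs++;
      out[k++] = '(';
      k = hit_region(pt, p + pairs, q - pairs, out, k, cap);
      k += (size_t)snprintf(out + k, cap - k, "P%d)", pairs);
      p = q + 1;
    }
  }
  return k;
}

/* Hypothetical helper: dot-bracket -> rooted HIT string; caller frees. */
char *hit_sketch(const char *db)
{
  assert(db != NULL);
  /* each node costs at most 4 characters per covered input position */
  size_t n = strlen(db), cap = 4 * n + 4, k = 0;
  int *pt = pair_table_sketch(db);
  char *out = (pt != NULL) ? malloc(cap) : NULL;

  if (out == NULL) { free(pt); return NULL; }
  out[k++] = '(';
  k = hit_region(pt, 0, (int)n - 1, out, k, cap);
  snprintf(out + k, cap - k, "R)");
  free(pt);
  return out;
}
```

The recursion descends once per stack, so its depth is bounded by the number of nested helices rather than by the number of base pairs.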
For the output of aligned structures from string editing, different representations are needed, where we put the label on both sides. The above examples for tree representations would then look like:
a) (UU)(P(P(P(P(UU)(UU)(P(P(P(UU)(UU)(UU)P)P)P)(UU)(UU)(P(P(UU)(U...
b) (UU)(P2(P2(U2U2)(P2(U3U3)P3)(U2U2)(P2(U2U2)P2)P2)(UU)P2)(UU)
c) (B(M(HH)(HH)M)B)
   (S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)
   (E(S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)E)
d) (R(E2(S2(B1(S2(M4(S3(H3)S3)((H2)S2)M4)S2)B1)S2)E2)R)
Aligned structures additionally contain the gap character '_'.
Several functions are provided for parsing structures and converting to different representations.
char *expand_Full (const char *structure)
Converts the full structure from bracket notation to the expanded notation including root.
char *b2HIT (const char *structure)
Converts the full structure from bracket notation to the HIT notation including root.
char *b2C (const char *structure)
Converts the full structure from bracket notation to a coarse grained notation using the 'H' 'B' 'I' 'M' and 'R' identifiers.
char *b2Shapiro (const char *structure)
Converts the full structure from bracket notation to the weighted coarse grained notation using the 'H' 'B' 'I' 'M' 'S' 'E' and 'R' identifiers.
char *expand_Shapiro (const char *coarse)
Inserts missing 'S' identifiers in unweighted coarse grained structures as obtained from b2C().
char *add_root (const char *structure)
Adds a root to an un-rooted tree in any notation except bracket notation.
char *unexpand_Full (const char *ffull)
Restores the bracket notation from an expanded full or HIT tree, that is any tree using only identifiers 'U' 'P' and 'R'.
char *unweight (const char *wcoarse)
Strip weights from any weighted tree.
void unexpand_aligned_F (char *align[2])
Converts two aligned structures in expanded notation back to bracket notation, with '_' as the gap character. The result overwrites the input.
void parse_structure (const char *structure)
Collects statistics on the structure elements of the full structure in bracket notation.
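Two of the conversions listed above are simple enough to sketch without the library. The code below illustrates the idea behind unweight() and, for unweighted expanded trees only, behind unexpand_Full(). The names unweight_sketch and unexpand_sketch are hypothetical, and weighted HIT trees are out of scope for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: drop all decimal digits (the weights) from a
 * weighted tree string. Returns a new string that the caller frees. */
char *unweight_sketch(const char *wtree)
{
  assert(wtree != NULL);
  char *out = malloc(strlen(wtree) + 1);
  size_t k = 0;

  if (out == NULL)
    return NULL;
  for (const char *p = wtree; *p != '\0'; p++)
    if (!isdigit((unsigned char)*p))
      out[k++] = *p;
  out[k] = '\0';
  return out;
}

/* Hypothetical helper: turn an unweighted expanded tree, with or without
 * the virtual root, back into dot-bracket notation. Returns NULL on
 * allocation failure or unexpected input. */
char *unexpand_sketch(const char *tree)
{
  assert(tree != NULL);
  size_t n = strlen(tree), k = 0, begin = 0, end = n;
  char *out = malloc(n + 1);

  if (out == NULL)
    return NULL;

  /* strip the enclosing "(" ... "R)" of a rooted tree */
  if (n >= 3 && tree[0] == '(' && strcmp(tree + n - 2, "R)") == 0) {
    begin = 1;
    end = n - 2;
  }

  for (size_t i = begin; i < end; ) {
    if (strncmp(tree + i, "(U)", 3) == 0) {
      out[k++] = '.';                 /* unpaired node */
      i += 3;
    } else if (strncmp(tree + i, "P)", 2) == 0) {
      out[k++] = ')';                 /* end of a paired node */
      i += 2;
    } else if (tree[i] == '(') {
      out[k++] = '(';                 /* start of a paired node */
      i += 1;
    } else {
      free(out);                      /* weights or unknown identifiers */
      return NULL;
    }
  }
  out[k] = '\0';
  return out;
}
```

Stripping the weights from ((((((H3)S3)((H2)S2)M4)S2)E2)R) gives ((((((H)S)((H)S)M)S)E)R), and unexpanding the expanded tree of the example structure restores .((..(((...)))..((..)))).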