RNA secondary structure prediction consists of predicting the 2D fo...
The Weeks laboratory works on RNA structure prediction and has a ph...
Several illustrative examples of pseudoknots: ![Imgur](https://i.im...
There are over 250 software packages for RNA structure prediction. ...
Math background on pseudoknot prediction: http://math.mit.edu/clas...
This simple, concise equation is packed with information. The key t...
Computational RNA secondary prediction is empirically improved sign...
Riboswitches are regulatory segments of messenger RNA molecules tha...
Accurate SHAPE-directed RNA secondary structure modeling, including pseudoknots

Christine E. Hajdin (a,1), Stanislav Bellaousov (b,1), Wayne Huggins (a), Christopher W. Leonard (a), David H. Mathews (b,2), and Kevin M. Weeks (a,2)

(a) Department of Chemistry, University of North Carolina, Chapel Hill, NC 27599-3290; and (b) Department of Biochemistry and Biophysics, and Center for RNA Biology, University of Rochester Medical Center, Rochester, NY 14642

Edited by Ignacio Tinoco, University of California, Berkeley, CA, and approved February 5, 2013 (received for review November 15, 2012)
A pseudoknot forms in an RNA when nucleotides in a loop pair with a region outside the helices that close the loop. Pseudoknots occur relatively rarely in RNA but are highly overrepresented in functionally critical motifs in large catalytic RNAs, in riboswitches, and in regulatory elements of viruses. Pseudoknots are usually excluded from RNA structure prediction algorithms. When included, these pairings are difficult to model accurately, especially in large RNAs, because allowing this structure dramatically increases the number of possible incorrect folds and because it is difficult to search the fold space for an optimal structure. We have developed a concise secondary structure modeling approach that combines SHAPE (selective 2′-hydroxyl acylation analyzed by primer extension) experimental chemical probing information and a simple, but robust, energy model for the entropic cost of single pseudoknot formation. Structures are predicted with iterative refinement, using a dynamic programming algorithm. This melded experimental and thermodynamic energy function predicted the secondary structures and the pseudoknots for a set of 21 challenging RNAs of known structure ranging in size from 34 to 530 nt. On average, 93% of known base pairs were predicted, and all pseudoknots in well-folded RNAs were identified.

thermodynamics | nearest neighbor parameters | circle plot | polymer model | 1M7

RNA constitutes the central information conduit in biology (1). Information is encoded in an RNA molecule at two levels: in its primary sequence and in its ability to form higher-order secondary and tertiary structures. Nearly all RNAs can fold to form some secondary structure and, in many RNAs, highly structured regions encode important regulatory motifs. Such structured regulatory elements can be composed of canonical base pairs but may also feature specialized and distinctive RNA structures. Among the best characterized of these specialized structures are RNA pseudoknots. Pseudoknots are relatively rare but occur overwhelmingly in functionally important regions of RNA (2–4). For example, all of the large catalytic RNAs contain pseudoknots (5, 6); roughly two-thirds of the known classes of riboswitches contain pseudoknots that appear to be essential for ligand binding and gene regulatory functions (7); and pseudoknots occur prominently in the regulatory elements that viruses use to usurp cellular metabolism (3). Pseudoknots are thus harbingers of biological function. An important and challenging goal is to identify these structures reliably.

Pseudoknots are excluded from the most widely used algorithms that model RNA secondary structure (8). This exclusion reflects both the challenge of incorporating the pseudoknot structure into the efficient dynamic programming algorithm used in the most popular secondary structure prediction approaches and the additional computational effort required. The prediction of lowest free energy structures with pseudoknots is NP-complete (9), which means that no algorithm is known that finds the lowest free energy structure in time polynomial in the sequence length. In addition, allowing pseudoknots greatly increases the number of (incorrect) helices possible and tends to reduce secondary structure prediction accuracies, even for RNAs that include pseudoknots. Current algorithms also have high false-positive rates for pseudoknot prediction, necessitating extensive follow-up testing and analysis of proposed structures.

Pseudoknot prediction is challenging, in part, for the same reasons that RNA secondary structure prediction is difficult. First, energy models for loops are incomplete because they extrapolate from a limited set of experiments. Second, folding can be affected by kinetic, ligand-mediated, tertiary, and transient interactions that are difficult or impossible to glean from the sequence. Prediction is also difficult for a third reason unique to pseudoknots: Energy models for pseudoknot formation are generally incomplete because the factors governing their stability are not fully understood (10–12). The result is that current algorithms that model pseudoknots predict the base pairs in the simplest pseudoknots (termed H-type, formed when bases in a loop region bind to a single-stranded region), when the beginning and end of the pseudoknotted structure are known, with accuracies of only about 75% (10). Secondary structure prediction is much less accurate for full-length biological RNA sequences, with as few as 5% of known pseudoknotted pairs predicted correctly and with more false-positive than correct pseudoknot predictions in some benchmarks (13).

Author contributions: C.E.H., S.B., W.H., D.H.M., and K.M.W. designed research; C.E.H., S.B., W.H., and C.W.L. performed research; C.E.H., S.B., W.H., C.W.L., D.H.M., and K.M.W. analyzed data; and C.E.H., S.B., D.H.M., and K.M.W. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: Structure probing data have been deposited in the single nucleotide resolution nucleic acid structure mapping (SNRNASM) community structure probing database (snrnasm.bio.unc.edu).

(1) C.E.H. and S.B. contributed equally to this work.

(2) To whom correspondence may be addressed. E-mail: weeks@unc.edu or David_Mathews@urmc.rochester.edu.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1219988110/-/DCSupplemental.

5498–5503 | PNAS | April 2, 2013 | vol. 110 | no. 14 | www.pnas.org/cgi/doi/10.1073/pnas.1219988110

The accuracy of secondary structure prediction is improved dramatically by including experimental information as restraints (14, 15). Selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE) probing data have proved especially useful in yielding robust working models for RNA secondary structure (15, 16). In essence, inclusion of SHAPE information provides an experimental adjustment to the well-established, nearest-neighbor model parameters (17) for RNA folding. This adjustment is implemented as a simple pseudo-free energy change term, $\Delta G^{\circ}_{\mathrm{SHAPE}}$. SHAPE reactivities are approximately inversely proportional to the probability that a given nucleotide is base paired (high reactivities correspond to a low likelihood of being paired and vice versa) and the logarithm of a probability corresponds to an energy, in this case $\Delta G^{\circ}_{\mathrm{SHAPE}}$, which has the form

$$\Delta G^{\circ}_{\mathrm{SHAPE}} = m \ln[\mathrm{SHAPE} + 1] + b \tag{1}$$

The slope, m, corresponds to a penalty for base pairing that increases with the experimental SHAPE reactivity, and the intercept, b, reflects a favorable pseudo-free energy change term for base pairing at nucleotides with low SHAPE reactivities. These two parameters must be determined empirically. This pseudo-free energy change approach yields high-quality secondary structure models for both short RNAs and those that are kilobases long (15, 16).

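To make Eq. 1 concrete, the following minimal Python sketch evaluates the pseudo-free energy term for a few reactivities. The function name `shape_pseudo_energy` is hypothetical. The defaults m = 1.8 and b = −0.6 kcal/mol are the values this paper reports later, with b taken as negative because the intercept is described as a favorable term. Clamping negative reactivities to zero is an assumption made here so the logarithm stays defined; it is not stated in the text.

```python
import math

def shape_pseudo_energy(reactivity, m=1.8, b=-0.6):
    """Pseudo-free energy (kcal/mol) from Eq. 1: m * ln(reactivity + 1) + b."""
    # Assumption: negative reactivities (measurement noise) are clamped to zero.
    r = max(reactivity, 0.0)
    return m * math.log(r + 1.0) + b

# Low reactivity gives a favorable (negative) term, high reactivity a penalty.
for r in (0.0, 0.4, 0.85, 1.5):
    print(f"SHAPE={r:.2f}  dG_SHAPE={shape_pseudo_energy(r):+.2f} kcal/mol")
```

With these parameter values the term changes sign near a reactivity of 0.4, so nucleotides below that value are nudged toward pairing and those above it are penalized.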
Our original SHAPE-directed algorithm did not allow for pseudoknotted base pairs (15). Given the strong relationship between pseudoknots and functionally critical regions in RNA and the fact that it is impossible to know a priori whether an RNA contains a pseudoknot, this limitation severely restricts the accuracy and generality of experimentally directed RNA structure analysis. Here, we describe a concise approach for applying SHAPE-directed RNA secondary structure modeling to include pseudoknots, in an algorithm we call ShapeKnots, and we show that the algorithm yields high-quality structures for diverse RNA sequences.

Results
Challenging RNA Test Set. We developed the ShapeKnots algorithm, using a test set of 16 nonpseudoknotted and pseudoknot-containing RNAs that were selected for their complex, and generally difficult to predict, structures (Table 1, Top). These RNAs included (i) 5 RNAs with lengths >300 nt, both with and without pseudoknots; (ii) 5 riboswitch RNAs whose structures form only upon binding by specific ligands, for which thermodynamic rules are obligatorily incomplete; (iii) 4 RNAs with structures that are predicted especially poorly, with accuracies <60% using nearest-neighbor thermodynamic parameters; and (iv) 3 RNAs whose structures are probably modulated by protein binding. SHAPE experiments were performed on each of the RNAs in the presence of ligand if applicable but in the absence of any protein. Each of the training set RNAs had SHAPE probing patterns that suggested these RNAs folded in solution into structures generally consistent with accepted secondary structure models based on either X-ray crystallography or comparative sequence analyses. The structures of the 16 RNAs in the test set are predicted poorly by a conventional algorithm based on their sequences alone: The average sensitivity (sens, fraction of base pairs in the accepted structure predicted correctly), positive predictive value (ppv, the fraction of predicted pairs that occur in the accepted structure), and geometric average of these metrics are 72%, 78%, and 74%, respectively (Table 1).

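The three accuracy metrics defined above can be sketched in a few lines of Python. This is an illustration only: the function name `prediction_metrics` and the toy pair lists are hypothetical, and pairs are compared by exact match, which is an assumption (the paper's scoring rules are not given in this section).

```python
import math

def prediction_metrics(accepted, predicted):
    """Return (sens, ppv, geo) for two collections of base pairs (i, j)."""
    # Normalize each pair so that (i, j) and (j, i) compare equal.
    accepted = {tuple(sorted(p)) for p in accepted}
    predicted = {tuple(sorted(p)) for p in predicted}
    correct = len(accepted & predicted)
    sens = correct / len(accepted) if accepted else 0.0   # fraction of accepted pairs found
    ppv = correct / len(predicted) if predicted else 0.0  # fraction of predicted pairs that are accepted
    geo = math.sqrt(sens * ppv)                           # geometric average of the two
    return sens, ppv, geo

# Hypothetical toy structures: 3 of 4 accepted pairs recovered, 2 extra pairs predicted.
accepted = [(1, 20), (2, 19), (3, 18), (8, 30)]
predicted = [(1, 20), (2, 19), (3, 18), (5, 12), (6, 11)]
sens, ppv, geo = prediction_metrics(accepted, predicted)
print(f"sens={sens:.2f} ppv={ppv:.2f} geo={geo:.2f}")
```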
In the process of developing this training set, we also analyzed two RNAs, RNase P RNA and the human signal recognition particle RNA, whose in vitro SHAPE reactivities were incompatible with the accepted structures for these RNAs. We include prediction statistics for these RNAs (Table 1, Bottom) but did not use them to evaluate our SHAPE-directed modeling algorithm.

Table 1. Prediction accuracies as a function of algorithm and SHAPE information

Sensitivities (sens), positive predictive value (ppv), and their geometric average (geo) are shown for four test cases: no pseudoknots allowed and no SHAPE data, no pseudoknots allowed and with SHAPE data (both by free energy minimization), pseudoknots allowed and no SHAPE data, and pseudoknots allowed and with SHAPE data (both using ShapeKnots). Complicating features are ligand (L) binding and protein (P) binding that are not accounted for in nearest-neighbor thermodynamic parameters. Pseudoknot (PK) predictions are indicated with a checkmark (✓) or an X; a checkmark indicates that pseudoknots were predicted correctly and that there were no false-positive pseudoknot predictions. For the ribosomal RNAs, regions in which the SHAPE reactivities were clearly incompatible with the accepted structure, as described in ref. 15, were omitted from the sensitivity and ppv calculations; for the E. coli 16S rRNA, this included nucleotides 143–220. The HIV-1 5′ leader domain (§) was included as an example of pseudoknot prediction in a large RNA. Because the accepted structure for this RNA is based on SHAPE-directed prediction (24), we did not include sensitivity and ppv for this RNA in the overall average values; however, the pseudoknot was proved independently (23) and is included.

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

Simple, Robust Model for Pseudoknot Formation. The favorable energetic contributions for forming the helices that comprise a pseudoknot are likely to be predicted accurately by the Turner nearest-neighbor model (17, 18) when modified by the experimental $\Delta G^{\circ}_{\mathrm{SHAPE}}$ term (Eq. 1). In addition, pseudoknot formation must overcome an entropic penalty; these energetics are difficult to estimate. The most widely used models are complex and include a large number of constituent parameters (11, 12). We adopted a simple approach to estimate the entropies on the basis of three primary insights. First, any secondary structure prediction must ultimately be compatible with a specific, energetically favorable, fold in the RNA in which nucleotides that base pair in the pseudoknot are close in three-dimensional space. This fundamental close-in-space feature must also be recapitulated in secondary structure prediction.

We modeled RNA pseudoknots as the sum of simple distance features or beads. There are exactly three possibilities for the structures that compose a pseudoknot: single-stranded nucleotides, nested helices, and in-line helices (Fig. 1). Duplexes containing single-nucleotide bulges are counted as a single helix. This model emphasizes structures rather than topologies and appears to be compatible with the vast majority of known pseudoknots. In essence, energetically favorable pseudoknots feature a small number of the single-stranded, nested helix, and in-line helix beads. Second, to account for the number of constituent single-stranded (SS) nucleotides and nested (NE) helices (Fig. 1), we adopted a simple polymer physics-based model (19). The energetic penalty associated with each of these features is weighted by distances of e = 6.5 Å and f = 15 Å, the mean lengths of a single-stranded nucleotide and a nested helix element, respectively (19) (Fig. 1). Finally, we created a penalty for in-line (IL) helices (Fig. 1). The potential to form these structures is weighted by their end-to-end length (n) in the context of A-form helix geometry and the distribution of in-line helices in RNAs of known structure. The model for the entropic cost of pseudoknot formation, $\Delta G^{\circ}_{\mathrm{PK}}$, has two adjustable parameters, P1 and P2,

$$\Delta G^{\circ}_{\mathrm{PK}} = \mathrm{P1}\,\ln\left(e^{2}\,\mathrm{SS} + f^{2}\,\mathrm{NE}\right) + \mathrm{P2}\,\ln\left(\sum_{n} \mathrm{IL}(n)\,\lambda_{n}^{2}\right) \tag{2}$$

where $\lambda_{n}$ is the penalty constant for in-line helices of length n (Table S1). The first term penalizes formation of pseudoknots with long single-stranded regions and many nested helices, whereas the second term enforces an optimal geometry for in-line helices.

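As a rough illustration of how Eq. 2 combines its terms, the sketch below evaluates the penalty for one pseudoknot. Several things here are assumptions: the function name `pseudoknot_penalty` is hypothetical, IL(n) is read as the number of in-line helices of length n, and the values passed for `lam`, `p1`, and `p2` in the usage lines are placeholders, since the fitted P1 and P2 and the Table S1 constants are not reproduced in this section.

```python
import math

E_SS = 6.5   # Å, mean length of a single-stranded nucleotide (from the text)
F_NE = 15.0  # Å, mean length of a nested helix element (from the text)

def pseudoknot_penalty(n_ss, n_ne, inline_counts, lam, p1, p2):
    """Eq. 2 as read here: P1*ln(e^2*SS + f^2*NE) + P2*ln(sum_n IL(n)*lam_n^2)."""
    # First term: single-stranded nucleotides and nested helices.
    chain_term = E_SS ** 2 * n_ss + F_NE ** 2 * n_ne
    # Second term: in-line helices, weighted by the length-specific constant lam[n].
    inline_term = sum(count * lam[n] ** 2 for n, count in inline_counts.items())
    if chain_term <= 0 or inline_term <= 0:
        raise ValueError("each logarithm needs a positive argument")
    return p1 * math.log(chain_term) + p2 * math.log(inline_term)

# Placeholder inputs only: 4 single-stranded nt, 1 nested helix, one in-line helix of length 5.
placeholder_lam = {5: 2.0}
print(pseudoknot_penalty(4, 1, {5: 1}, placeholder_lam, p1=1.0, p2=1.0))
```

Because both terms are logarithmic, adding more single-stranded nucleotides or nested helices raises the penalty, but with diminishing increments.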
RNA Structure Interrogation by SHAPE. Most RNAs were transcribed in vitro and contained short hairpin-containing structure cassettes at their 5′ and 3′ ends (20). The 16S and 23S ribosomal RNAs were isolated from total Escherichia coli or Haloferax volcanii RNA (15). The transcribed RNAs were folded in a standard buffer with physiologically relevant ion concentrations (and saturating ligand concentrations for riboswitches) and treated with 1-methyl-7-nitroisatoic anhydride (1M7) (21). Sites of 2′-O-adduct formation were detected by primer extension, using a previously described high-throughput SHAPE approach (20). SHAPE reactivities were normalized to place them on a scale from zero (unreactive) to 1.5 (highly reactive). In this work, we illustrate modeling results in the form of circle plots, which provide an unbiased way to visualize correct and incorrect base pairs (Fig. 2). The nucleotide sequence is arrayed on the outer circle: Unreactive nucleotides (SHAPE reactivities <0.4) are colored black, moderately reactive nucleotides (0.4–0.85) are yellow, and highly reactive nucleotides (>0.85) are red. Base pairs are shown as arcs, colored by whether they are predicted correctly or not (Fig. 2, Left). Pseudoknots correspond to helices whose arcs cross in the circle plot. In general, there was a strong correspondence between SHAPE reactivities and the pattern of base pairing in the accepted structures. Nucleotides that participate in canonical base pairs were generally unreactive, whereas nucleotides in loops, bulges, and other connecting regions were reactive (Fig. 2, Center and Right).

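Two ideas from the circle-plot description lend themselves to a short Python sketch: the reactivity color bins and the rule that a pseudoknot shows up as crossing arcs. The function names `reactivity_color` and `has_pseudoknot` are hypothetical, and treating the boundary values 0.4 and 0.85 as "yellow" follows the ranges quoted above.

```python
def reactivity_color(r):
    """Map a normalized SHAPE reactivity to the circle-plot color bins."""
    if r < 0.4:
        return "black"   # unreactive
    if r <= 0.85:
        return "yellow"  # moderately reactive
    return "red"         # highly reactive

def has_pseudoknot(pairs):
    """True if any two base pairs (i, j) and (k, l) cross: i < k < j < l."""
    norm = sorted(tuple(sorted(p)) for p in pairs)
    for a, (i, j) in enumerate(norm):
        for k, l in norm[a + 1:]:
            if i < k < j < l:
                return True
    return False

print([reactivity_color(r) for r in (0.1, 0.4, 0.85, 1.2)])
print(has_pseudoknot([(1, 10), (2, 9)]))    # nested arcs
print(has_pseudoknot([(1, 10), (5, 15)]))   # crossing arcs
```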
Algorithm and Parameter Determination. Our ShapeKnots algorithm has four underlying parameters: m and b used in calculation of $\Delta G^{\circ}_{\mathrm{SHAPE}}$ and P1 and P2 used to calculate $\Delta G^{\circ}_{\mathrm{PK}}$ from Eqs. 1 and 2, respectively. The $\Delta G^{\circ}_{\mathrm{SHAPE}}$ parameters, m and b, penalize or favor base pairs with high and low SHAPE reactivities, respectively, are universal to all RNAs, and do not directly contribute to the entropic penalty for pseudoknot formation. These parameters can thus be fit independently of the $\Delta G^{\circ}_{\mathrm{PK}}$ terms, P1 and P2. m and b were optimized using the seven RNAs in our training dataset that do not contain pseudoknots. To reduce overoptimization of these parameters, we used a leave-one-out jackknife approach (22) to assess prediction sensitivities, ppv, and the geometric mean of these two metrics at each grid point for seven quasi-independent data sets, each containing six of the seven RNAs.

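The leave-one-out grid search described above can be outlined as follows. This is a sketch under stated assumptions, not the authors' fitting code: `jackknife_grid_search` and `toy_score` are hypothetical names, the RNA labels are placeholders, and `toy_score` is an artificial stand-in for the real step of folding each RNA with a given (m, b) and scoring the result against its accepted structure.

```python
from itertools import product

def jackknife_grid_search(rnas, score, m_grid, b_grid):
    """For each leave-one-out subset, return the (m, b) grid point with the best mean score."""
    best_per_subset = {}
    for left_out in rnas:
        subset = [r for r in rnas if r != left_out]   # six of the seven RNAs
        best_per_subset[left_out] = max(
            product(m_grid, b_grid),
            key=lambda mb: sum(score(r, *mb) for r in subset) / len(subset),
        )
    return best_per_subset

def toy_score(rna, m, b):
    # Stand-in only: a real scorer would fold `rna` with (m, b) and return, e.g.,
    # the geometric mean of sens and ppv. This toy surface peaks at (1.8, -0.6).
    return -((m - 1.8) ** 2 + (b + 0.6) ** 2)

rnas = [f"rna{i}" for i in range(1, 8)]               # placeholder labels
result = jackknife_grid_search(rnas, toy_score, [1.0, 1.8, 2.6], [-0.8, -0.6, -0.4])
print(result)
```

If the best grid point is similar across all seven subsets, no single RNA is driving the fit, which is the point of the jackknife.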
[Figure 1 image: schematic of a pseudoknot with 5′ and 3′ ends marked, labeling a single-stranded nucleotide (e = 6.5 Å), a nested helix (f = 15 Å), and an in-line helix (λ).]

Fig. 1. Overview of pseudoknot structure model and entropic penalty terms. Length features are incorporated into $\Delta G^{\circ}_{\mathrm{PK}}$ as described in Eq. 2. Energy penalties for single-stranded nucleotides and nested helices are based on a previously developed model (19); the penalty for in-line helices was developed in this work.

Fig. 2. Representative ShapeKnots structure prediction for the SAM I riboswitch. Base pair predictions are illustrated with colored lines: green, correctly predicted; red, missed base pair relative to the accepted (29) structure; and purple, prediction of a pair not in the accepted structure. (Left) Predictions without SHAPE data. (Center and Right) Predictions made when SHAPE data were included, using circle plot and conventional representations, respectively. Sensitivity (sens) and ppv are listed for each structure. SHAPE data are shown as colored nucleotide letters on a black, yellow, and red scale for low, medium, and high SHAPE reactivities, respectively. Plots were generated using the CircleCompare program in the RNAstructure package.

Our algorithm for identification of pseudoknots follows the approach implemented in HotKnots (10). A two-stage refinement first finds stable helices, using a dynamic programming algorithm that does not allow pseudoknots. The second stage uses the same dynamic programming algorithm to predict structures for each stable helix found in stage one. In stage two, structures are predicted such that nucleotides in the stable helix are forced to not pair. These pairs are subsequently added back to the structure, and these helices can therefore be pseudoknotted. This allows the prediction of up to one pseudoknot per run. Run times for the final ShapeKnots algorithm were less than 1 min for RNAs of fewer than 150 nt and 90 min for the longest (530 nt) RNA (Table S2).

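The two-stage idea can be illustrated with a toy Python sketch. It is deliberately simplified and is not the ShapeKnots or HotKnots implementation: the folding step maximizes the number of Watson-Crick pairs (a Nussinov-style stand-in for free energy minimization), candidate helices are enumerated by brute force rather than taken from a dynamic programming pass, and no SHAPE or pseudoknot energy terms are used. All function names are hypothetical.

```python
from functools import lru_cache

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}  # Watson-Crick only (simplification)

def fold_nested(seq, blocked=frozenset(), min_loop=3):
    """Pseudoknot-free fold maximizing pair count; blocked positions stay unpaired."""
    @lru_cache(maxsize=None)
    def best(i, j):
        if j - i <= min_loop:
            return ()
        options = [best(i + 1, j)]                       # position i unpaired
        if i not in blocked:
            for k in range(i + min_loop + 1, j + 1):     # position i paired with k
                if k not in blocked and (seq[i], seq[k]) in PAIRS:
                    options.append(((i, k),) + best(i + 1, k - 1) + best(k + 1, j))
        return max(options, key=len)
    return sorted(best(0, len(seq) - 1))

def candidate_helices(seq, min_len=3, min_loop=3):
    """Enumerate stacks of complementary pairs (stand-in for 'stable helices')."""
    n, helices = len(seq), []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if i > 0 and j < n - 1 and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue                                  # not the outermost pair of its stack
            helix, a, b = [], i, j
            while b - a > min_loop and (seq[a], seq[b]) in PAIRS:
                helix.append((a, b))
                a, b = a + 1, b - 1
            if len(helix) >= min_len:
                helices.append(helix)
    return helices

def two_stage_fold(seq):
    """Fold with each candidate helix forced unpaired, then add the helix back."""
    best = fold_nested(seq)
    for helix in candidate_helices(seq):
        blocked = frozenset(x for pair in helix for x in pair)
        combined = sorted(set(fold_nested(seq, blocked)) | set(helix))
        if len(combined) > len(best):                     # the added helix may cross the rest
            best = combined
    return best

seq = "CCCCAAAAGGGGUUUU"   # toy sequence whose two stems can only coexist as a pseudoknot
print(len(fold_nested(seq)), "pairs without pseudoknots")
print(len(two_stage_fold(seq)), "pairs after adding one helix back")
```

Because only one helix is added back on top of a nested fold, this scheme can introduce at most one pseudoknot per run, matching the limitation stated above.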
The pseudoknot-specific parameters, P1 and P2, were fit using a jackknife approach incorporating data from all 16 RNAs in the training set. Parameters were optimized in three stages (Methods). In this analysis, m = 1.8 and b = −0.6 kcal/mol yielded the most accurate secondary structure predictions (Fig. S1). These parameters differ slightly from the values (m = 2.6 and b = −0.8 kcal/mol) determined previously using only E. coli 23S rRNA (15). We recommend use of these new values for RNA structure prediction both with and without pseudoknots. Applying ShapeKnots using these $\Delta G^{\circ}_{\mathrm{SHAPE}}$ and $\Delta G^{\circ}_{\mathrm{PK}}$ parameters yielded an average sensitivity for secondary structure prediction of 93% for the 16 RNAs in the test set (Table 1, Top).

Extension to Additional RNAs. We used ShapeKnots to model secondary structures for six RNAs that were not used to optimize the final algorithm. Three RNAs (the adenine riboswitch, tRNA(Phe), and E. coli 5S rRNA) were chosen because prior approaches using nonstandard data analysis suggested that they folded poorly with SHAPE data (16). The other three RNAs (the fluoride riboswitch pseudoknot, the 5′ domain of the H. volcanii 16S rRNA, and the 5′ pseudoknot leader of the HIV-1 RNA genome) adopt structures that are predicted poorly by conventional approaches. Overall prediction sensitivities for these six RNAs were 95% (Table 1, Middle), and the pseudoknots in the HIV-1 and fluoride riboswitch RNAs (23–25) were identified correctly.

Discussion
Pseudoknots are relatively rare in large RNAs but are highly overrepresented in important functional regions (2, 3, 6, 7). Despite their importance, the most commonly used RNA structure prediction algorithms do not permit pseudoknots because allowing pseudoknots increases both algorithmic complexity and the number of possible structures. Current algorithms that allow pseudoknots recover only 70% of the total accepted base pairs. The prediction sensitivity for base pairs that specifically form pseudoknots varies by algorithm and benchmark RNAs but averages only 5–40%, with many false-positive predictions (ref. 13 and Tables S3 and S4). Thus, the current generation of pseudoknot prediction algorithms is poorly suited for designing testable biological hypotheses.

ShapeKnots combines an iterative pseudoknot discovery algorithm with experimental SHAPE information and a simple energy model for the entropic cost of pseudoknot formation. The pseudoknot penalty in ShapeKnots has only two adjustable parameters (Fig. 1 and Eq. 2) that limit formation of pseudoknots with long single-stranded regions and many nested helices and that enforce an optimal geometry for in-line helices. ShapeKnots also allows incorporation of an experimental correction to standard free energy terms. Including SHAPE data both limits the number of possible structures and provides information that accounts for hidden features that stabilize RNA folding, including the significant effects of metal ion and ligand binding.

Our set of training structures was composed of 16 RNAs of known structure that ranged in length from 34 to 530 nt; pseudoknots occur in 9 of the 16 RNAs. Prediction accuracies were consistently high (Table 1 and Dataset S1). ShapeKnots significantly outperformed currently available pseudoknot prediction algorithms and is the only algorithm to achieve >90% overall and pseudoknot-specific sensitivities with this test set (Tables S3 and S4; see Methods for additional discussion). Both the specific pseudoknot energy penalty and use of SHAPE data contribute to the accuracy of the ShapeKnots approach. It is likely that inclusion of SHAPE data will generally improve accuracies for pseudoknot prediction algorithms.

We summarize our modeling results by emphasizing four classes of RNA: (i) short pseudoknotted RNAs with structures that ShapeKnots predicts very accurately; (ii) large, challenging RNAs that ShapeKnots predicts with good accuracy; (iii) RNAs with high likelihood of being mischaracterized with false-positive or missed pseudoknots that ShapeKnots predicts accurately; and (iv) RNAs that interact with other molecules such as ligands, proteins, and metal ions that pose unique challenges. For most RNAs analyzed here, differences between models generated by ShapeKnots and currently accepted structures were minor and typically involved short-range interactions or base pairs at the ends of helices. In some cases, differences likely reflect thermodynamically accessible states at equilibrium in solution.

Short Pseudoknotted RNAs. The first class includes small RNAs that contain H-type pseudoknots: the pre-Q1 riboswitch, human telomerase, the fluoride riboswitch, and a severe acute respiratory syndrome (SARS) coronavirus domain. Because the most commonly used dynamic programming algorithms cannot predict base pairs in an H-type pseudoknot, prediction sensitivities using a conventional algorithm (14) were quite poor; in contrast, ShapeKnots yielded perfect or near-perfect predictions in each case (Fig. 3, compare Left and Right columns). The only ShapeKnots-predicted base pairs that do not occur in the accepted structures involve sets of 2 or fewer bp located at the ends of individual helices in the fluoride riboswitch and the SARS domain. These results suggest that ShapeKnots prediction of H-type pseudoknots in short RNAs is robust.

Fig. 3. Summary of predictions for four H-type pseudoknots. Base pair predictions are illustrated as outlined in Fig. 2; sensitivity (sens) and ppv are listed for each structure. Left and Right columns show predictions for a conventional mfold-class algorithm vs. ShapeKnots (with experimental SHAPE restraints).

Large, Complex RNAs. The second class includes large RNAs that do not require ligands or protein cofactors for correct folding. Large RNAs pose a challenge to modeling algorithms due to the vast number of possible structures and due to the large number of structures with similar folding free energy changes. For example, in the absence of experimental structure probing data, two representative RNAs, the Azoarcus group I intron and the hepatitis C virus internal ribosome entry sequence (IRES) domain, are predicted with sensitivities of 73% and 39%, respectively. Mispredictions occur primarily in two hairpin motifs in the Azoarcus RNA but span essentially the entire hepatitis C virus (HCV) IRES RNA (Fig. 4). Inclusion of SHAPE data yielded near-perfect predictions in each case, including correct identification of the pseudoknot in each RNA (Fig. 4, compare Left and Right columns).

RNAs with Difficult to Predict Pseudoknots. Within a given RNA sequence, several physically reasonable pseudoknots are often possible, for example, in the SARS virus domain (Fig. 5, Upper Left, arrow linking purple and red helices). Conversely, as exemplified by the SAM I riboswitch, pseudoknots can be missed because the energy function does not distinguish small differences in stabilities of a pseudoknot-forming vs. a more local helix (Fig. 5, Lower Left, arrow). The experimental SHAPE-based correction correctly reranked the stabilities for the two possible helices located close to one another in topological space in the SARS and riboswitch RNAs, ultimately avoiding both false-positive and false-negative pseudoknot predictions (Fig. 5, Right column).

RNAs That Do Not Adopt Their Accepted Structures. During our analysis of experimentally directed structure modeling, we examined two RNAs for which the in vitro SHAPE data were clearly incompatible with the accepted structure. These RNAs were the signal recognition particle RNA and RNase P. In each case, the SHAPE-directed model using ShapeKnots provided a significant improvement relative to the pseudoknot-free lowest free energy predicted structure (Table 1, Bottom). Nonetheless, a large part of each structure was mispredicted relative to the accepted structure. In each case, nucleotides in some helices in the accepted structural model were reactive by SHAPE, suggesting that these helices do not form under the solution conditions used here for in vitro structure probing (Fig. S2). There are several possible explanations for the observed discrepancies. First, the conditions under which these RNAs were crystallized are different from the roughly physiological ion conditions used in SHAPE probing experiments. The differences in conditions could cause the crystallographic structure to be different from that in solution or there may be structural inhomogeneity in solution. Second, both the RNase P and signal recognition particle RNAs function as RNA–protein complexes. These proteins were not present during in vitro SHAPE experiments.

[Figure 4 image omitted: circle plots of the Azoarcus group I intron (214 nts) and the hepatitis C virus IRES domain (336 nts), comparing the conventional, no-data prediction with ShapeKnots. Values printed on the panels: sens 73%, ppv 75 and sens 39%, ppv 38; sens 92%, ppv 95 and sens 92%, ppv 96.]

Fig. 4. Prediction summaries for two large, pseudoknot-containing RNAs.
Structural annotations are as described in Fig. 2.
Fig. 5. Representative examples in which ShapeKnots avoids false-positive (Upper) or false-negative (Lower) pseudoknot predictions. Left and Right columns show the results of ShapeKnots predictions without and with SHAPE data, respectively. Arrows in the Left column emphasize the replacement of an accepted (red) helix with an incorrect (purple) helix in the absence of data. Other structural annotations are as described in Fig. 2.

Perspective. It is difficult to account for many factors that impact
RNA secondary structure, including effects of metal ions, ligands,
and protein binding, using a system based on thermodynamic or
structural parameters. For example, the M-Box and fluoride
riboswitch RNAs undergo large conformational changes upon
binding by Mg²⁺ or F⁻ ions, respectively (25, 26), and binding of
ligands to the pre-Q1, TPP, cyclic-di-GMP, SAM, and adenine
riboswitches provides a large fraction of the total interactions that
ultimately stabilize the accepted structure (7). In addition, many of
the RNAs in our dataset contain base triple interactions, which are
common in pseudoknots (27). With the inclusion of SHAPE data,
the ShapeKnots approach does a good job of modeling these
interactions (Table 1).
Other challenges to structure prediction are that some base
pairs may be stable only in the presence of bound proteins and
some RNAs, especially as exemplified by riboswitches (7), sample
multiple conformations. Finally, in vitro refolding and probing
protocols may not fully recapitulate the functional or in vivo
structure. Our analyses of the signal recognition particle RNA
and RNase P illustrate these challenges: Neither of these RNAs
appears to fold stably to the accepted structure under solution
conditions used in this work (Fig. S2). These two RNAs are widely
used to benchmark folding algorithms, even though they may fold
robustly to their accepted structures only in the context of their
native RNA-protein complexes. In this case, for the specific
solution environment used here, the SHAPE-directed structures
appear to be roughly correct but just not the expected ones.
In the context of the diverse RNAs examined in this work, the
ShapeKnots algorithm recovered 93% of accepted base pairs in
well-folded RNAs (Table 1), significantly outperforming current
algorithms. Nonetheless, evaluation of ShapeKnots is currently restricted
by challenges that impact the entire RNA structure modeling field
(16). Relatively few RNAs with nontrivial structures exist that are
known at a high level of confidence. The ShapeKnots energy
penalty and search algorithm may require adjustment as new
pseudoknot topologies are discovered. RNAs that have been
solved by crystallography have features that make them simultaneously
both more and less difficult to predict than more typical structures:
They tend to contain a relatively high level of noncanonical and
complex tertiary interactions (difficult to predict features), and
they fold into structures with many stable base-paired regions
(more readily predicted using thermodynamics-based
algorithms). In addition, the structures inferred from high-resolution
data may not represent the solution conformation of the purified
RNAs. For RNAs in which the accepted structure is based on
phylogenetic and in-solution evidence, as exemplified by the
SARS virus and HCV IRES domains, ShapeKnots predictions
may identify correct features missed in current accepted
structures. The approaches outlined in this work (use of simple
models for base pairing and pseudoknot formation, including
experimental corrections to thermodynamic parameters, and
nuanced interpretation of differences between current accepted and
modeled structures) represent a critical departure point for
future accurate RNA secondary structure modeling.
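The accuracy figures quoted in this section (the fraction of accepted base pairs recovered, and the sens/ppv labels on the figures) can be illustrated with a minimal sketch. It assumes sensitivity is the fraction of accepted pairs that appear in the prediction and ppv is the fraction of predicted pairs that appear in the accepted structure, with exact matching of pairs. The function name `sens_ppv` and the example pairs are hypothetical, and published benchmarks often tolerate small positional slippage that this sketch does not model.

```python
def sens_ppv(accepted, predicted):
    """Return (sensitivity, ppv) for two collections of base pairs.

    Each pair is an (i, j) tuple of nucleotide positions. Pairs are
    compared exactly, ignoring order within a pair (an assumption of
    this sketch, not necessarily the paper's scoring rule).
    """
    accepted_set = {tuple(sorted(p)) for p in accepted}
    predicted_set = {tuple(sorted(p)) for p in predicted}
    true_pos = len(accepted_set & predicted_set)
    sens = true_pos / len(accepted_set) if accepted_set else 0.0
    ppv = true_pos / len(predicted_set) if predicted_set else 0.0
    return sens, ppv


# Hypothetical toy structures, not data from the paper.
accepted = [(1, 10), (2, 9), (3, 8), (20, 30)]
predicted = [(1, 10), (2, 9), (8, 3), (15, 25), (16, 24)]
sens, ppv = sens_ppv(accepted, predicted)
print(f"sens: {sens:.0%}, ppv: {ppv:.0%}")
```

Three of the four accepted pairs are recovered, and three of the five predicted pairs are correct, so a prediction can score well on one measure and poorly on the other.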
Methods
Detailed descriptions of the ShapeKnots algorithm, parameterization of
ΔG°_SHAPE and ΔG°_PK, and SHAPE probing experiments are provided in SI
Methods. For the general user community, the current best parameters for
SHAPE-directed structure modeling (for algorithms that both do and do not
allow pseudoknots) are m = 1.8, b = −0.6, P1 = 0.35, and P2 = 0.65 kcal/mol
(Eqs. 1 and 2). It is critical that SHAPE experiments be processed accurately to
obtain highest-quality structure models (16). We recommend normalizing
SHAPE data by a model-free box-plot (15) approach and defining the borders
for low, medium, and high SHAPE reactivities (Fig. 2, black, yellow, and red) at
0.40 and 0.85 (see SI Methods for additional details). All SHAPE data used in
this work are available at www.chem.unc.edu/rna and at the SNRNASM
community structure probing database (28). ShapeKnots is freely available as
part of the RNAstructure software package at http://rna.urmc.rochester.edu.
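The normalization and binning recommended above can be sketched roughly as follows. This is only one plausible reading of a box-plot normalization, not the authors' exact procedure (which is in SI Methods): values more than 1.5 interquartile ranges above the third quartile are treated as outliers, and the remaining data are scaled by the mean of their top 10%. Both of those rules, the handling of values that fall exactly on a border, and the names `percentile`, `boxplot_normalize`, and `reactivity_class` are assumptions made for illustration. Only the 0.40 and 0.85 borders come from the text.

```python
from statistics import mean


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 1]) of an ascending list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def boxplot_normalize(raw):
    """Scale raw reactivities so that typical high values sit near 1.0.

    Assumed rule: drop values above Q3 + 1.5 * IQR as outliers, then
    divide everything by the mean of the top 10% of the values kept.
    """
    vals = sorted(raw)
    q1, q3 = percentile(vals, 0.25), percentile(vals, 0.75)
    cutoff = q3 + 1.5 * (q3 - q1)
    kept = [v for v in vals if v <= cutoff]
    top_n = max(1, len(kept) // 10)
    scale = mean(kept[-top_n:])
    if scale == 0:
        raise ValueError("cannot normalize: scale factor is zero")
    return [v / scale for v in raw]


def reactivity_class(value, low_border=0.40, high_border=0.85):
    """Bin a normalized reactivity as low, medium, or high."""
    if value < low_border:
        return "low"
    if value < high_border:
        return "medium"
    return "high"


# Hypothetical raw reactivities with one extreme outlier.
raw = [float(i) for i in range(1, 21)] + [1000.0]
normalized = boxplot_normalize(raw)
print([reactivity_class(v) for v in normalized[:5]])
```

The outlier is excluded only from the scale factor. It is still rescaled and returned, so downstream code sees one value per nucleotide.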
ACKNOWLEDGMENTS. We thank Steve Busan and Ge Zhang for performing
SHAPE experiments and Gregg Rice for insightful discussions. This work was
supported by Grants AI068462 (to K.M.W.) and GM076485 (to D.H.M.) from
the National Institutes of Health.
1. Sharp PA (2009) The centrality of RNA. Cell 136(4):577–580.
2. Staple DW, Butcher SE (2005) Pseudoknots: RNA structures with diverse functions. PLoS Biol 3(6):e213.
3. Brierley I, Pennell S, Gilbert RJ (2007) Viral RNA pseudoknots: Versatile motifs in gene expression and replication. Nat Rev Microbiol 5(8):598–610.
4. Pleij CW (1990) Pseudoknots: A new motif in the RNA game. Trends Biochem Sci 15(4):143–147.
5. Powers T, Noller HF (1991) A functional pseudoknot in 16S ribosomal RNA. EMBO J 10(8):2203–2214.
6. Reiter NJ, Chan CW, Mondragón A (2011) Emerging structural themes in large RNA molecules. Curr Opin Struct Biol 21(3):319–326.
7. Roth A, Breaker RR (2009) The structural and functional diversity of metabolite-binding riboswitches. Annu Rev Biochem 78:305–334.
8. Liu B, Mathews DH, Turner DH (2010) RNA pseudoknots: Folding and finding. F1000 Biol Rep 2:8.
9. Lyngsø RB, Pedersen CN (2000) RNA pseudoknot prediction in energy-based models. J Comput Biol 7(3–4):409–427.
10. Ren J, Rastegari B, Condon A, Hoos HH (2005) HotKnots: Heuristic prediction of RNA secondary structures including pseudoknots. RNA 11(10):1494–1504.
11. Dirks RM, Pierce NA (2004) An algorithm for computing nucleic acid base-pairing probabilities including pseudoknots. J Comput Chem 25(10):1295–1304.
12. Andronescu MS, Pop C, Condon AE (2010) Improved free energy parameters for RNA pseudoknotted secondary structure prediction. RNA 16(1):26–42.
13. Bellaousov S, Mathews DH (2010) ProbKnot: Fast prediction of RNA secondary structure including pseudoknots. RNA 16(10):1870–1880.
14. Mathews DH, et al. (2004) Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA 101(19):7287–7292.
15. Deigan KE, Li TW, Mathews DH, Weeks KM (2009) Accurate SHAPE-directed RNA structure determination. Proc Natl Acad Sci USA 106(1):97–102.
16. Leonard CW, et al. (2013) Principles for understanding the accuracy of SHAPE-directed RNA structure modeling. Biochemistry 52(4):588–595.
17. Turner DH, Mathews DH (2010) NNDB: The nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res 38(Database issue):D280–D282.
18. Xia T, et al. (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37(42):14719–14735.
19. Aalberts DP, Nandagopal N (2010) A two-length-scale polymer theory for RNA loop free energies and helix stacking. RNA 16(7):1350–1355.
20. Wilkinson KA, Merino EJ, Weeks KM (2006) Selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE): Quantitative RNA structure analysis at single nucleotide resolution. Nat Protoc 1(3):1610–1616.
21. Mortimer SA, Weeks KM (2007) A fast-acting reagent for accurate analysis of RNA secondary and tertiary structure by SHAPE chemistry. J Am Chem Soc 129(14):4144–4145.
22. Tukey JW (1958) Bias and confidence in not quite large samples. Ann Math Stat 29:614.
23. Paillart JC, Skripkin E, Ehresmann B, Ehresmann C, Marquet R (2002) In vitro evidence for a long range pseudoknot in the 5′-untranslated and matrix coding regions of HIV-1 genomic RNA. J Biol Chem 277(8):5995–6004.
24. Wilkinson KA, et al. (2008) High-throughput SHAPE analysis reveals structures in HIV-1 genomic RNA strongly conserved across distinct biological states. PLoS Biol 6(4):e96.
25. Ren A, Rajashankar KR, Patel DJ (2012) Fluoride ion encapsulation by Mg2+ ions and phosphates in a fluoride riboswitch. Nature 486(7401):85–89.
26. Dann CE, 3rd, et al. (2007) Structure and mechanism of a metal-sensing regulatory RNA. Cell 130(5):878–892.
27. Cao S, Giedroc DP, Chen SJ (2010) Predicting loop-helix tertiary structural contacts in RNA pseudoknots. RNA 16(3):538–552.
28. Rocca-Serra P, et al. (2011) Sharing and archiving nucleic acid structure mapping data. RNA 17(7):1204–1212.
29. Montange RK, Batey RT (2006) Structure of the S-adenosylmethionine riboswitch regulatory mRNA element. Nature 441(7097):1172–1175.
Hajdin et al. | PNAS | April 2, 2013 | vol. 110 | no. 14 | 5503 | Biophysics and Computational Biology

Discussion

Math background on pseudoknot prediction: http://math.mit.edu/classes/18.417/Slides/RNA-pseudoknots.pdf

RNA secondary structure prediction consists of predicting the 2D folds of RNA. These 2D folds are determined by the base pairing of RNA nucleotides (hydrogen bonds). RNA secondary structure is important because 3D structure and function are largely shaped by secondary structure. For a good primer on RNA structure: https://www.youtube.com/watch?v=WCrlm18KQ48

Several illustrative examples of pseudoknots: ![Imgur](https://i.imgur.com/UOpPEQ0.png)

There are over 250 software packages for RNA structure prediction. Most of the methods combine dynamic programming with energy minimization (assuming that the minimum-energy state or states are the ones occupied). Dynamic programming is widely used in bioinformatics for sequence alignment, protein folding, RNA structure prediction, and protein-DNA binding. The basic idea is to break a complex problem into smaller, simpler subproblems, store the subproblem solutions, and reuse them to find the optimal overall solution.

Energy minimization function: ΔG_folding = G_folded − G_unfolded, so a more negative ΔG_folding means a more stable fold.

Riboswitches are regulatory segments of messenger RNA molecules that, when they bind certain small molecules, can change the production of the proteins encoded by the messenger RNA they are part of. https://en.wikipedia.org/wiki/Riboswitch

Computational RNA secondary structure prediction is significantly improved, empirically, by constraining predictions with experimental data from structure-sensitive enzymatic cleavage and chemical probing reagents. Selective 2′-Hydroxyl Acylation analyzed by Primer Extension (SHAPE) chemical probing yields quantitative reactivity information for almost every nucleotide in an RNA, regardless of RNA size. Combining this SHAPE structural information with a thermodynamics-based dynamic programming algorithm gives a much more accurate secondary structure model.
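The dynamic programming idea described above can be illustrated with a small sketch. To keep it short, it swaps in a much simpler objective than the energy models used by real packages: a Nussinov-style recursion that maximizes the number of nested base pairs. The function names, the allowed pair set (Watson-Crick plus G-U), and the minimum hairpin loop of three nucleotides are assumptions made for illustration. This recursion only considers nested pairs, so it cannot produce a pseudoknot, which is part of why pseudoknot prediction needs extra machinery.

```python
# Pairs allowed in this toy model: Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_nested_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP).

    best[i][j] stores the answer for the subsequence seq[i..j]; each
    entry is built from already-solved shorter subsequences.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            # case 2: j pairs with some k, leaving at least min_loop
            # unpaired nucleotides between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_nested_pairs("GGGAAACCC"))
```

For the toy hairpin `GGGAAACCC` the three G-C pairs close an AAA loop. Replacing the "+ 1" per pair with stacking and loop energies is, loosely, what turns this counting scheme into a free-energy minimization.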
The Weeks laboratory works on RNA structure prediction and has a philosophy and track record of designing concise, elegant models for it. Lab website: http://www.chem.unc.edu/rna/. Article about philosophy: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4446244/

This simple, concise equation (the SHAPE pseudo-free-energy term) is packed with information. The key takeaways are that it can be interpreted as an energy function because it contains the logarithm of a probability, and that the higher the SHAPE reactivity of a nucleotide, the lower the stability of that nucleotide (the less likely it is to be base-paired through hydrogen bonds) and the larger its destabilizing contribution to the overall folding free energy.
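To make the second takeaway concrete, here is a minimal sketch that assumes the equation has the form ΔG°_SHAPE(i) = m · ln(SHAPE(i) + 1) + b, as in the earlier SHAPE-directed work cited by the paper (ref. 15). The slope m = 1.8 kcal/mol is the value quoted in Methods. The intercept is taken here as b = −0.6 kcal/mol, so that unreactive nucleotides receive a small stabilizing bonus; treat that sign as an assumption to check against the paper's Eq. 1. The function name `shape_pseudo_energy` is hypothetical.

```python
import math


def shape_pseudo_energy(reactivity, m=1.8, b=-0.6):
    """Pseudo-free-energy term (kcal/mol) for one nucleotide.

    Assumed form: m * ln(reactivity + 1) + b. The sign of b is an
    assumption of this sketch (see the note above the code).
    """
    if reactivity <= -1.0:
        raise ValueError("reactivity must be greater than -1")
    return m * math.log(reactivity + 1.0) + b


# Hypothetical normalized reactivities, from unreactive to very reactive.
for r in (0.0, 0.2, 0.4, 0.85, 2.0):
    print(f"reactivity {r:4.2f} -> {shape_pseudo_energy(r):+.2f} kcal/mol")
```

With these assumed values the term is −0.6 kcal/mol at zero reactivity and rises monotonically with reactivity, becoming a penalty for reactive nucleotides, which matches the interpretation in the paragraph above.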