Structure Prediction

Structure Prediction dmitra

Methods • Ab initio • Heuristics • Machine learning • Homology modeling • Threading

RNA Structure Prediction: Ab initio • Sequence over {A, C, G, U} • Complementary bases attract and form base pairs, which lowers the energy • We are not interested in the absolute energy of the sequence, only in its minimization • Reference state: the linear sequence with zero base pairs has energy = 0 • The physics is embedded in the “free-energy” parameter/function • The objective is to minimize the energy

RNA Structure Prediction: Knot-free • Knot-free assumption • Knot: base pairs (i, j) and (k, l) where i<k<j<l • Knot-free structures form a planar graph, which makes a DP algorithm feasible • Base pairs are either disjoint or nested in each other
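The knot-free condition can be checked directly on a list of base pairs. The sketch below is only an illustration: it takes a knot to be two pairs that cross (i < k < j < l), and the function name `is_knot_free` is made up for this example.

```python
from itertools import combinations


def is_knot_free(pairs):
    """Return True if no two base pairs (i, j) and (k, l) cross (i < k < j < l)."""
    # Store every pair as (smaller index, larger index) and sort by first index,
    # so that in each combination the first pair starts no later than the second.
    normalized = sorted(tuple(sorted(p)) for p in pairs)
    for (i, j), (k, l) in combinations(normalized, 2):
        if i < k < j < l:
            return False  # the two pairs interleave: a knot
    return True
```

Nested pairs such as (1, 10) and (2, 9), or disjoint pairs such as (1, 5) and (6, 9), pass the check; interleaved pairs such as (1, 5) and (3, 8) do not.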

RNA Structure Prediction: Principle of optimality • Assumption 1: Base pairs do not affect each other’s energy • One can then add up the energies of all base pairs in each configuration and check which configuration has the lowest energy • The number of configurations is exponential • Need a further assumption

RNA Structure Prediction: DP Algorithm • Assume the energy of each component can be calculated independently • a(r,k): free energy for base pair (r,k), where r, k are from {A, C, G, U} • a is zero for self-pairing (impossible)

RNA Structure Prediction: DP Algorithm • $E(S_{i,j}) = \min\{\, E(S_{i+1,j-1}) + a(r_i, r_j),\ \min_{i<k\le j}\{E(S_{i,k-1}) + E(S_{k,j})\} \,\}$ • First term: i pairs with j • Second term: j pairs with k, i<k≤j • Compute the (n x n) matrix over i and j bottom up, for j-i=0, j-i=1, j-i=2, … • Complexity: O(n^3)
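A minimal sketch of this bottom-up computation, assuming the split term is E(S[i,k-1]) + E(S[k,j]) for i < k <= j and that empty or single-base substrings have energy 0. The energy values in `ALPHA` are hypothetical placeholders, not measured parameters, and `min_energy` is a name chosen for this example.

```python
# Hypothetical pairing energies a(r, k); any pair not listed is treated as 0.
ALPHA = {("G", "C"): -3.0, ("C", "G"): -3.0,
         ("A", "U"): -2.0, ("U", "A"): -2.0,
         ("G", "U"): -1.0, ("U", "G"): -1.0}


def min_energy(seq, alpha=ALPHA):
    """Minimum energy of seq under the independent base-pair model."""
    n = len(seq)
    if n == 0:
        return 0.0
    # E[i][j] is the minimum energy of seq[i..j] (0-based, inclusive).
    E = [[0.0] * n for _ in range(n)]

    def e(i, j):
        # Empty substrings (i > j) contribute no energy.
        return E[i][j] if i <= j else 0.0

    for span in range(1, n):                 # span = j - i, shortest first
        for i in range(n - span):
            j = i + span
            # Case 1: i pairs with j.
            best = e(i + 1, j - 1) + alpha.get((seq[i], seq[j]), 0.0)
            # Case 2: split at k with i < k <= j.
            for k in range(i + 1, j + 1):
                best = min(best, e(i, k - 1) + e(k, j))
            E[i][j] = best
    return E[0][n - 1]
```

The three nested loops over span, i and k are where the O(n^3) running time comes from. A realistic model would also forbid hairpins that are too short, which this sketch does not do.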

RNA Structure Prediction: relax assumptions • Consider some special energy functions, other than just the base pairing ones a(r,k) • This means: different “types” of base pairings • Some more practical topology

RNA Structure Prediction: Loops • Say there is a base pair at (i, j) and i<u<v<w<j • v is accessible from base pair (i, j) if there is no base pair (u, w) • The loop is the set of bases accessible from base pair (i, j) • Note: still no knots • Some loops: p249
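The accessibility rule can be sketched as a small function. This is an illustration under the reading that v is accessible from (i, j) when no base pair (u, w) with i < u < v < w < j encloses it; `loop_bases` is a hypothetical name.

```python
def loop_bases(pairs, i, j):
    """Positions v with i < v < j that are accessible from the base pair (i, j)."""
    # Keep only the pairs that lie strictly inside (i, j).
    inner = [tuple(sorted(p)) for p in pairs]
    inner = [(u, w) for (u, w) in inner if i < u and w < j]
    # v is accessible when no inner pair (u, w) strictly encloses it.
    return [v for v in range(i + 1, j)
            if not any(u < v < w for (u, w) in inner)]
```

For the structure {(1, 10), (2, 9), (4, 7)}, the pair (1, 10) sees only bases 2 and 9 (two adjacent base pairs), the pair (2, 9) sees 3, 4, 7 and 8 (an interior loop), and the pair (4, 7) sees 5 and 6 (a hairpin).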

RNA Structure Prediction: Energy over loops • Say the base pair (i, j) closes a loop • $S_{i+1,j-1}$ may then not be in its minimum-energy configuration • Because the energy of some other configuration of $S_{i+1,j-1}$ plus the free energy $a(r_i, r_j)$ may be lower than what the minimum-energy configuration of the substring (i+1 to j-1), found without the base pair at (i, j), gives • This interaction was ignored at the previous assumption level • Dynamic programming can still be done if we explicitly specify the energy parameters

RNA Structure Prediction: Energy over loops • $E(S_{i,j})$ is the minimum of: • $E(S_{i+1,j})$, when i is not paired • $E(S_{i,j-1})$, when j is not paired • $\min_{i<k<j}\{E(S_{i,k}) + E(S_{k+1,j})\}$, when i and j are both paired but not with each other, so the string splits at k • $E(L_{i,j})$, when (i, j) base pairs and all special structures may appear within (this case embeds the first formula of the previous assumption)

RNA Structure Prediction: More assumptions • Disregard free energies that do not belong to any loop • The final energy of the string is the sum of the energies of its components: no interaction between components • Only the 4 types of loops on p249 are used for $E(L_{i,j})$ (more can be added if their energy parameterization is known)

RNA Structure Prediction: free energies for 4 loops • Hairpin loop of size k: $\xi(k)$ • Additional stabilizing energy for two adjacent base pairs (in addition to a(r,k)): $\eta$, a constant • Destabilizing energy for a bulge of size k: $\beta(k)$ • Destabilizing energy for an interior loop of size k: $\gamma(k)$

RNA Structure Prediction: $E(L_{i,j})$ • Hairpin: $a(r_i,r_j) + \xi(j-i+1)$ • Stacked pair: $a(r_i,r_j) + \eta + E(S_{i+1,j-1})$ • Bulge on i: $\min_{k\ge 1}\{a(r_i,r_j) + \beta(k) + E(S_{i+k+1,j-1})\}$ • Bulge on j: $\min_{k\ge 1}\{a(r_i,r_j) + \beta(k) + E(S_{i+1,j-k-1})\}$ • Interior loop: $\min_{k_1,k_2\ge 1}\{a(r_i,r_j) + \gamma(k_1+k_2) + E(S_{i+k_1+1,\,j-k_2-1})\}$
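The five cases can be collected into one function that returns the smallest candidate. This is a sketch, not a reference implementation: the function name and all parameters are hypothetical, the energy functions are passed in by the caller, and it assumes that the substring left inside a stacked pair, bulge or interior loop must contain at least two bases.

```python
def loop_energy(i, j, a_ij, E, xi, eta, beta, gamma):
    """Minimum over the loop types closed by the base pair (i, j).

    a_ij  : pairing energy a(r_i, r_j)
    E     : callable E(p, q), minimum energy of the substring p..q
    xi    : callable, hairpin energy by size
    eta   : constant stacking energy
    beta  : callable, bulge energy by size
    gamma : callable, interior-loop energy by size
    """
    # Hairpin, with the size argument written as on the slide (j - i + 1).
    candidates = [a_ij + xi(j - i + 1)]
    # Stacked pair: the inner substring i+1..j-1 needs at least two bases.
    if j - i >= 3:
        candidates.append(a_ij + eta + E(i + 1, j - 1))
    # Bulges of size k on either side (assumption: inner substring keeps >= 2 bases).
    for k in range(1, j - i - 2):
        candidates.append(a_ij + beta(k) + E(i + k + 1, j - 1))   # bulge on i
        candidates.append(a_ij + beta(k) + E(i + 1, j - k - 1))   # bulge on j
    # Interior loops with k1 unpaired bases after i and k2 before j.
    for k1 in range(1, j - i - 2):
        for k2 in range(1, j - i - 2 - k1):
            candidates.append(a_ij + gamma(k1 + k2)
                              + E(i + k1 + 1, j - k2 - 1))
    return min(candidates)
```

The double loop over k1 and k2 is the O(n^2) work per table entry that dominates the overall running time.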

RNA Structure Prediction: complexity • O(n^2) table entries • Cost per entry :: total over the table: • First 2 formulae: O(1) :: O(n^2) • Third formula: O(n) :: O(n^3) • 4.1 (E(L), hairpin): O(1) :: O(n^2) • 4.2 (stacked pair): O(1) :: O(n^2) • 4.3 (bulge on i): O(n), loop over k :: O(n^3) • 4.4 (bulge on j): O(n), loop over k :: O(n^3) • 4.5 (interior loop): O(n^2), loop over k1, k2 :: O(n^4) • Final complexity, from 4.5: O(n^4)

Protein Threading • Interactions in proteins are between 20x20 residue types, as opposed to at most 4x4 nucleic acid bases in RNA • Residue interactions are quite non-local, causing much more structural complexity • Proteins have frequent loops (helices are loops) • So ab initio prediction is extremely difficult

Protein Threading • The number of protein folds is small (~1,000 for 20,000+ proteins) • Threading: map the target sequence onto a template fold • Threading is an alignment problem (Torda, Fig. 1) • Find the fold to which the target “aligns” optimally (minimum of an “energy” function) • Needs basic scoring functions, as in sequence alignment

Protein Threading: number of folds • The more folds in the database, the more time it takes to find the correct template • The scoring function for threading is quite imperfect, so more available templates are needed (contradictory requirements)

Protein Threading: Scoring functions • A full force field is not necessarily ideal: • It involves dynamics between molecules, stretch, torsion, etc. • These are unimportant for a static alignment

Protein Threading: Scoring functions • A scoring function can be defined between residues of the same sequence, scoring how close they come to each other in the alignment • Torda, Fig 5 • Example scoring function (free energy): • For a pair of residues A and B to be at distance r (Torda, p7): $G^{AB}(r) = -kT \ln\left(\rho_r^{AB} / \rho_{0,r}^{AB}\right)$ • $\rho_r^{AB}$ is the probability of A and B being at distance r, $\rho_{0,r}^{AB}$ is the probability of that occurring at random • k is the Boltzmann constant and T the temperature
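A minimal sketch of such a probability-derived pair score, using the common convention with a leading minus sign so that distances observed more often than chance get a negative (favorable) energy. The function name is hypothetical, and `kT` defaults to 1 so that the result is expressed in units of kT.

```python
import math


def pair_energy(p_observed, p_reference, kT=1.0):
    """Score for a residue pair at some distance: -kT * ln(p_observed / p_reference).

    p_observed  : probability of seeing the pair at this distance in known structures
    p_reference : probability of that distance occurring at random
    """
    return -kT * math.log(p_observed / p_reference)
```

A pair seen twice as often as expected at some distance scores -kT ln 2, about -0.69 kT; a pair seen half as often as expected scores +0.69 kT.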

Protein Threading: Scoring functions • Probabilities are collected from PDB proteins with known structure • Different threading schemes use different scoring functions, but most of them are derived from the PDB

Protein Threading: Scoring functions • Example (Setubal-Meidanis, p257): • $G_1(i, t_i)$: score for placing the i-th residue of the sequence at position $t_i$ in the fold • $G_2(i, j, t_i, t_j)$: score for the simultaneous placement of residues i and j, for i<j • Each placement is constrained to a range, say $b_i < t_i < e_i$
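Given the singleton and pairwise terms, the score of one complete placement is their sum. The sketch below assumes the terms are simply added and that a placement violating its range gets infinite energy; the function name and the way `g1`, `g2` and `bounds` are passed in are choices made for this example.

```python
def threading_energy(placement, g1, g2, bounds):
    """Total score of a placement t_0..t_{n-1} of residues onto fold positions.

    placement : list where placement[i] is the fold position t_i of residue i
    g1        : callable g1(i, t_i)
    g2        : callable g2(i, j, t_i, t_j), used for i < j
    bounds    : list of (b_i, e_i); residue i must satisfy b_i < t_i < e_i
    """
    n = len(placement)
    for i, t in enumerate(placement):
        b, e = bounds[i]
        if not (b < t < e):
            return float("inf")  # placement outside its allowed range
    total = sum(g1(i, placement[i]) for i in range(n))
    total += sum(g2(i, j, placement[i], placement[j])
                 for i in range(n) for j in range(i + 1, n))
    return total
```

The pairwise sum visits every i < j, so scoring one placement costs O(n^2) evaluations of `g2`.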

Protein Threading • Optimization is not only on placement, but also on multiple folds in database • Accuracy is very sensitive to alignment errors

Protein Threading: Dynamic programming • Advantage/disadvantage of DP is that it is deterministic • Problem: “adjacency” is hard to define in 3D

Protein Threading: Dynamic programming • DP: try out different combinations of “adjacent” residues on different parts of a template (Torda, Fig 5c: adjacency comes from the template sequence) • Start with a small number of elements and build up to the full sequence • Alternative approach: start by placing each residue at one of its “possible” positions and see where the next residue should go, continuing residue by residue

Protein Threading: Probabilistic algorithm • Monte Carlo simulation: randomly throw residues at positions on the fold and check the aggregate scoring function • Simulated annealing: gradually move residues to optimize, stochastically making random shifts to avoid local optima • Time-consuming, and the result is non-deterministic
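A generic simulated-annealing loop over placements might look like the sketch below. It is not tied to any particular threading program: the move set (re-placing one random residue), the geometric cooling schedule and all parameter names are assumptions made for illustration, and the energy function is supplied by the caller.

```python
import math
import random


def anneal(initial, energy, positions, steps=2000,
           t_start=2.0, t_end=0.01, seed=0):
    """Return (best placement, best energy) found by simulated annealing."""
    rng = random.Random(seed)        # a fixed seed makes a run repeatable
    current = list(initial)
    cur_e = energy(current)
    best, best_e = list(current), cur_e
    for step in range(steps):
        # Temperature decreases geometrically from t_start to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        # Random shift: move one residue to a randomly chosen position.
        cand = list(current)
        cand[rng.randrange(len(cand))] = rng.choice(positions)
        cand_e = energy(cand)
        delta = cand_e - cur_e
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cur_e = cand, cand_e
            if cur_e < best_e:
                best, best_e = list(current), cur_e
    return best, best_e
```

Accepting some uphill moves while the temperature is high is what lets the search leave a local optimum. Different seeds can give different answers, which is the non-determinism noted on the slide.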

Protein Threading: Branch and bound • In the worst case try all possible alignments, but prune the search space for non-useful branches using some bounding function
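One way to sketch the pruning idea: place residues one at a time and abandon a partial placement as soon as a lower bound on its final score cannot beat the best complete placement found so far. Everything here is an illustrative assumption, including the function name, the scoring form (singleton terms `g1` plus pairwise terms `g2`) and the bounding function, which relies on a caller-supplied value `g2_min` that must not exceed any value `g2` can return.

```python
def branch_and_bound(n, positions, g1, g2, g2_min):
    """Exhaustive search over placements with lower-bound pruning."""
    best = {"energy": float("inf"), "placement": None}
    # suffix[m] = sum over residues m..n-1 of their smallest possible g1 term.
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + min(g1(i, t) for t in positions)

    def unscored_pairs(m):
        # Pairs (i, j), i < j, that involve at least one unplaced residue.
        return n * (n - 1) // 2 - m * (m - 1) // 2

    def extend(placement, energy):
        m = len(placement)
        if m == n:
            if energy < best["energy"]:
                best["energy"], best["placement"] = energy, list(placement)
            return
        # Lower bound on any completion of this partial placement.
        bound = energy + suffix[m] + unscored_pairs(m) * g2_min
        if bound >= best["energy"]:
            return  # prune: this branch cannot improve on the best so far
        for t in positions:
            added = g1(m, t) + sum(g2(i, m, placement[i], t) for i in range(m))
            placement.append(t)
            extend(placement, energy + added)
            placement.pop()

    extend([], 0.0)
    return best["placement"], best["energy"]
```

With a weak bound the search still visits all len(positions)^n placements in the worst case; a tighter bounding function prunes more.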

Protein Threading: Search on folds • Divide and conquer over the space of folds • Assumption: folds can be ordered for their “goodness” for the target protein • Example: Setubal-Meidanis, p258

Protein Threading: Future • Slow • Subsumed by ab initio methods of IBM Blue Gene™ type projects • De novo technique using linear programming (Xu and Li, 2003) • Threading techniques are useful not only for structure prediction but also for the fold recognition problem: no alignment, just find the template (fold suggests function)
