Seminar in structural bioinformatics. Multiple structural alignment of proteins. By Elad Kaspani. Multiple structural alignment. Outline. Introduction What is Multiple structural alignment? Why do we need Multiple structural alignment? Pairwise Vs. Multiple structural alignment
Seminar in structural bioinformatics Multiple structural alignment of proteins By Elad Kaspani
Outline • Introduction • What is Multiple structural alignment? • Why do we need Multiple structural alignment? • Pairwise Vs. Multiple structural alignment • MASS - Multiple structural alignment by secondary structures • Problem definition • General strategy • Algorithm description
Outline Cont. • MASS - Multiple structural alignment by secondary structures • Algorithm outline • Complexity • Results Discussion • Summary & Conclusions
Introduction • Proteins sharing a common substructure may have a similar function. • What is Multiple structural alignment ? • Discussion – we already have pairwise alignment, isn’t that enough?
Pairwise Vs. Multiple structural alignment • We have many algorithms for the pairwise structural alignment task • Only a few methods are available for aligning multiple structures • Most of them are based on a series of pairwise comparisons • SSAPm (Taylor et al., 1994) • Prism (Yang and Honig, 2000b) • STAMP (Russell and Barton, 1992)
What do we want? • Classification of existing and newly discovered proteins • Gaining insights into evolutionary relations between proteins • Detecting motifs common to a group of proteins that share a certain function • Structure prediction algorithms
What’s wrong with methods based on series of pairwise comparisons ???
Multiple structural alignment • These methods are limited! • In each pairwise comparison, the only available information is about the two molecules being compared • Alignments that are optimal for the whole set can be disregarded • Dynamic programming disadvantage: it depends on the sequence order of the polypeptide chain • We can’t see the wood for the trees
What do we do then? • MASS: Multiple structural alignment by secondary structures
MASS • Considers all the given structures at the same time • Exploiting the secondary structure representation - reduced time complexity • Does not require that all the input molecules be aligned • Capable of detecting structural motifs shared only by a subset of the molecules
MASS • Can find non-sequential and even non-topological structural motifs • Suitable for a broad range of applications • filter noisy results • highly efficient and robust • Other multiple-based methods • (Escalier et al., 1988) • MUSTA (Leibowitz et al., 2001) • MultiProt (Shatsky et al., 2002)
Basic terms • Rigid transformation • Q - a subset of points • T(Q) = R(Q) + t, where R is a 3x3 rotation matrix and t is a translation vector • ε-congruent • For ε > 0, find the largest subsets P and Q of the two input sets, and a rigid transformation T, so that distance(P, T(Q)) < ε • How do we measure distance? • RMSD
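The terms above can be illustrated with a minimal sketch in plain Python. The helper names (`rotation_z`, `apply_rigid`, `rmsd`, `is_eps_congruent`) are hypothetical and not part of MASS; the sketch also assumes that the two point lists are already paired one to one, which a real alignment method has to work out for itself.

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix for a rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rigid(points, R, t):
    """Apply T(q) = R q + t to every 3D point q."""
    return [
        tuple(sum(R[i][j] * q[j] for j in range(3)) + t[i] for i in range(3))
        for q in points
    ]

def rmsd(P, Q):
    """Root mean square deviation between two paired point lists."""
    if len(P) != len(Q) or not P:
        raise ValueError("point lists must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(P, Q))
    return math.sqrt(total / len(P))

def is_eps_congruent(P, Q, R, t, eps):
    """True if distance(P, T(Q)) < eps, with RMSD as the distance (assumed pairing)."""
    return rmsd(P, apply_rigid(Q, R, t)) < eps

# Hypothetical toy data: P is an exact rigid copy of Q.
Q = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
R = rotation_z(math.pi / 2)
t = (1.0, 1.0, 1.0)
P = apply_rigid(Q, R, t)
```

With this toy data, `is_eps_congruent(P, Q, R, t, 0.5)` holds, while the identity transformation does not bring Q within 0.5 of P.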
Problem Definition • The pairwise case: • given two proteins, each represented by a set of points in 3D space • each point is associated with an atom’s position • find the largest set of points that is ε-congruent to a subset of points from each protein • In computational geometry - largest common point set (LCP) problem
Problem Definition • The multiple case: • given a collection of m point sets, • find the largest set of points, of which an ε-congruent copy appears in each of the input sets • Unfortunately, it’s NP-hard..... • We want not only the largest set of points, but also smaller common substructures
Problem Definition • The multiple subset case: • find solutions where only a subset of the input proteins is well aligned • this complicates the problem! (why?) • number of subsets is exponential • trade-off between the size of the subset and the size of its core (match list) • scoring function (core size - L, number of proteins - k): $f(L, k) = k \cdot \binom{L}{2}$
Method • Input : • a set of m proteins P1, P2, . . . , Pm. • For each protein • the sequence of the 3D coordinates of atoms • assignment of SSE types to each residue • Output : • The multiple alignments with the largest cores, according to the scoring function.
General strategy • We want multiple alignments with at least two SSEs • Bases – ordered pairs of SSEs whose ε-congruent copies appear in several proteins • We look for a set of ε-congruent bases {b1, b2, . . . , bk}, from proteins Pi1, Pi2, . . . , Pik, respectively. • The first base (b1) is our pivot
General strategy – cont. • Compute all the k − 1 rigid transformations between this base and the others • Result - (T12, T13, . . . , T1k) defines a multiple alignment between Pi1, Pi2, . . . , Pik • The core may contain more than one base • we will get several alignments with almost the same transformations • (one alignment per base in the core)
General strategy – cont. • Cluster the initial multiple base alignments • Merge the alignments in each cluster: the core of the new alignment is the union of the cores of the original alignments • We get a smaller set of multiple alignments • Extend the clustered alignments • Find additional matching residues • Give a score to each alignment • Report the highest scoring alignments
Algorithm outline - stage 1 • Representation of secondary structure elements: • Axis representation for SSEs • The least squares line from all the Cα atoms • Direction & length determined by protein structure
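One way to obtain such an axis is sketched below in plain Python. It assumes the least-squares line is the orthogonal-regression line, which passes through the centroid of the Cα atoms along the principal direction of their covariance matrix; that direction is approximated here by power iteration. The function name `sse_axis` and the choice to orient the axis from the first towards the last Cα atom are assumptions of this sketch, not details taken from MASS.

```python
import math

def sse_axis(ca_coords, iterations=100):
    """Return (start, end) of the least-squares axis through the C-alpha atoms."""
    if len(ca_coords) < 2:
        raise ValueError("need at least two C-alpha atoms")
    n = len(ca_coords)
    centroid = [sum(p[i] for p in ca_coords) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in ca_coords]
    # Scatter (unnormalised covariance) matrix of the centred coordinates.
    cov = [[sum(c[i] * c[j] for c in centered) for j in range(3)] for i in range(3)]
    # Initial guess: the chain direction, which is rarely orthogonal to the axis.
    chain = [ca_coords[-1][i] - ca_coords[0][i] for i in range(3)]
    norm = math.sqrt(sum(x * x for x in chain))
    if norm == 0.0:
        raise ValueError("first and last C-alpha atoms coincide")
    v = [x / norm for x in chain]
    # Power iteration converges towards the dominant eigenvector of cov.
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Keep the axis pointing along the chain (first atom towards last atom).
    if sum(a * b for a, b in zip(v, chain)) < 0:
        v = [-x for x in v]
    # Axis length: extent of the atoms' projections onto the direction.
    proj = [sum(c[i] * v[i] for i in range(3)) for c in centered]
    start = tuple(centroid[i] + min(proj) * v[i] for i in range(3))
    end = tuple(centroid[i] + max(proj) * v[i] for i in range(3))
    return start, end

# Hypothetical zig-zag of five C-alpha atoms running along the x axis.
axis_start, axis_end = sse_axis(
    [(0, 0.1, 0), (1, -0.1, 0), (2, 0.1, 0), (3, -0.1, 0), (4, 0.1, 0)]
)
```

The returned segment gives both the direction and the length of the SSE, which is all the later stages need from it.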
Algorithm outline – stage 2 • Detection of multiple base alignments: • Use Geometric Hashing to detect bases whose ε-congruent copies appear in several proteins • Each base has a fingerprint • invariant to a 3D rigid transformation • the types of the two SSEs • the angle between their axial vectors • the midpoint-to-midpoint distance • their line distance
Algorithm outline – stage 2 • Almost-congruent bases have similar fingerprints • the types of their SSEs are the same • the difference between their midpoint-to-midpoint and line distances is up to 1.5 Å • difference between their angles is up to 0.3 radians • reside close to each other in the grid
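The fingerprint and the tolerance test can be sketched as follows. Each SSE is represented here as a hypothetical tuple `(type, axis_start, axis_end)`, and "line distance" is taken to mean the shortest distance between the two infinite axis lines; both are assumptions of this sketch. The 1.5 Å and 0.3 radian tolerances are the ones quoted above.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def fingerprint(sse_a, sse_b):
    """Rigid-motion-invariant fingerprint of a base (an ordered pair of SSEs)."""
    type_a, a0, a1 = sse_a
    type_b, b0, b1 = sse_b
    u, v = _sub(a1, a0), _sub(b1, b0)
    cos_angle = _dot(u, v) / (_norm(u) * _norm(v))
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    mid_a = tuple((p + q) / 2 for p, q in zip(a0, a1))
    mid_b = tuple((p + q) / 2 for p, q in zip(b0, b1))
    mid_dist = math.dist(mid_a, mid_b)
    n = _cross(u, v)
    w = _sub(b0, a0)
    if _norm(n) > 1e-9:
        line_dist = abs(_dot(w, n)) / _norm(n)      # skew or intersecting lines
    else:
        line_dist = _norm(_cross(w, u)) / _norm(u)  # parallel lines
    return (type_a, type_b, angle, mid_dist, line_dist)

def almost_congruent(f, g, dist_tol=1.5, angle_tol=0.3):
    """Same SSE types, distances within dist_tol, angle within angle_tol."""
    return (f[0], f[1]) == (g[0], g[1]) \
        and abs(f[2] - g[2]) <= angle_tol \
        and abs(f[3] - g[3]) <= dist_tol \
        and abs(f[4] - g[4]) <= dist_tol

# Hypothetical base: a helix along x and a strand along y, 4 Å apart.
helix = ("H", (0, 0, 0), (10, 0, 0))
strand = ("E", (5, -5, 4), (5, 5, 4))
fp = fingerprint(helix, strand)
# The same base translated by (1, 2, 3) should yield the same fingerprint.
fp_moved = fingerprint(("H", (1, 2, 3), (11, 2, 3)), ("E", (6, -3, 7), (6, 7, 7)))
```

Because every component depends only on relative geometry, a rigidly moved copy of a base keeps its fingerprint, which is what allows lookups by fingerprint alone.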
Algorithm outline – stage 2 • For each grid bin, extract all the bases of the bin and of adjacent bins • Group them together in the same base bucket • Base bucket - stores bases in columns according to the protein they belong to • Bases derived from the same protein are stored in the same column
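A rough sketch of how base buckets might be built from a fingerprint grid is given below. The grid cell sizes (0.3 radians, 1.5 Å), the tuple layout of a fingerprint and the names `bin_key` and `build_base_buckets` are assumptions made for illustration; the slides do not specify these details.

```python
from collections import defaultdict
from itertools import product

def bin_key(fp, dist_cell=1.5, angle_cell=0.3):
    """Map a fingerprint (type_a, type_b, angle, mid_dist, line_dist) to a grid bin."""
    type_a, type_b, angle, mid_dist, line_dist = fp
    cell = (int(angle // angle_cell),
            int(mid_dist // dist_cell),
            int(line_dist // dist_cell))
    return (type_a, type_b), cell

def build_base_buckets(bases):
    """bases: iterable of (protein_id, base_id, fingerprint).

    Returns {bin: {protein_id: [base_id, ...]}}; each inner dict is one base
    bucket whose columns are keyed by the protein the bases come from.
    """
    grid = defaultdict(list)
    for protein_id, base_id, fp in bases:
        grid[bin_key(fp)].append((protein_id, base_id))
    buckets = {}
    for types, cell in list(grid):
        columns = defaultdict(list)
        # Gather the bin itself and its 26 adjacent bins (same SSE types only).
        for offset in product((-1, 0, 1), repeat=3):
            neighbour = (types, tuple(c + o for c, o in zip(cell, offset)))
            for protein_id, base_id in grid.get(neighbour, ()):
                columns[protein_id].append(base_id)
        buckets[(types, cell)] = dict(columns)
    return buckets

# Hypothetical input: two similar helix-strand bases and one helix-helix base.
buckets = build_base_buckets([
    ("P1", "b1", ("H", "E", 1.57, 4.0, 4.0)),
    ("P2", "b7", ("H", "E", 1.45, 4.4, 3.9)),
    ("P3", "b2", ("H", "H", 1.57, 4.0, 4.0)),
])
```

Looking at adjacent bins as well as the bin itself means that two almost-congruent bases that fall on opposite sides of a cell boundary still end up in a common bucket.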
Base bucket Almost-congruent bases are stored in the same base bucket
Stage 2 cont. • A collection of almost-congruent bases, each belonging to a different column, induces a local multiple alignment between the respective proteins • The core consists of at least two SSEs • One base is selected as a pivot • The rest of the bases are superimposed on it • Selection of the pivot may influence the alignment • Optional – try each base as pivot
Stage 2 cont. • Multiple alignment is defined by an underlying set of pairwise alignments • For each base bucket we compute all the alignments between two bases taken from two different columns • find the transformation between two bases that aligns the maximal number of atoms with minimal RMSD
Stage 3 - Clustering • For a pair of proteins that share more than one base • We get several alignments with almost the same transformation, but a different local SSE core • Cluster all the local base alignments to find the ones with similar transformations • merge them into a new global alignment • The match list (core) of the new global alignment • union of the original local match lists • its transformation is the one that aligns the SSEs with minimal RMSD
Stage 4 - Global extension • Now the core of each pairwise alignment is a set of SSEs • Then we extend these alignments by finding additional matching residues • These residues do not necessarily belong to SSEs • We want to extend the cores of these alignments by detecting corresponding Cα atoms • We want to transform the second protein, so that it is fully superimposed onto the pivot protein
Stage 4 - Global extension • Detect in linear time close pairs of Cα atoms, one atom from each protein • These atom pairs are added to the alignment’s match list • The transformation of the alignment is refined by employing the Least-Squares Fitting method
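The close-pair detection can be sketched with a spatial grid, a standard way to get expected linear time when the atoms are spread roughly evenly in space. The 3.0 Å cutoff, the greedy one-to-one matching and the name `close_pairs` are assumptions of this sketch; the second protein is assumed to be already superimposed onto the pivot, and the least-squares refinement step is not shown.

```python
import math
from collections import defaultdict
from itertools import product

def close_pairs(pivot_atoms, other_atoms, cutoff=3.0):
    """Pair each atom of the (already transformed) second protein with the
    nearest unused pivot atom within `cutoff`. Returns (pivot_idx, other_idx)."""
    def cell(p):
        return tuple(int(math.floor(c / cutoff)) for c in p)

    # Hash the pivot atoms into cubic cells whose side equals the cutoff.
    grid = defaultdict(list)
    for i, p in enumerate(pivot_atoms):
        grid[cell(p)].append(i)

    used, pairs = set(), []
    for j, q in enumerate(other_atoms):
        cx, cy, cz = cell(q)
        best, best_d = None, cutoff
        # Any atom within the cutoff must lie in one of the 27 surrounding cells.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for i in grid.get((cx + dx, cy + dy, cz + dz), ()):
                if i in used:
                    continue
                d = math.dist(pivot_atoms[i], q)
                if d <= best_d:
                    best, best_d = i, d
        if best is not None:
            used.add(best)
            pairs.append((best, j))
    return pairs

# Hypothetical coordinates in Å: the third atom of `other` has no close partner.
pivot = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
other = [(0.5, 0.0, 0.0), (4.0, 0.5, 0.0), (20.0, 0.0, 0.0)]
matched = close_pairs(pivot, other)
```

Each query inspects a constant number of cells, so the total work grows with the number of atoms rather than with the number of atom pairs.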
Stage 5 – Filtering & Scoring • Computing the best global multiple alignments • What are the best global multiple alignments? • Number of aligned molecules Vs. core size • core size Vs. size of the smallest molecule • number of possible multiple alignments defined by the base buckets is exponential • We do not compute all of them
Stage 5 – Filtering & Scoring • Heuristic solution: • For each BB, compute the set of best multiple alignments recursively over the columns • For a multiple base alignment (b1, . . . , bk) obtained in the previous step • Check if there is a base, bk+1, from the current column that improves the alignment’s score • Core(b1, . . . , bk+1) = Core(b1, . . . , bk) ∩ Core(b1, bk+1)
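The column-by-column step can be sketched as a greedy loop. This is a simplification: it keeps only one alignment per base bucket instead of a set of best alignments, and the data layout and the names `pair_score` and `grow_alignment` are hypothetical. The scoring function is passed in as a parameter; the default, k times the number of pairs in the core, is only an assumed reading of the scoring formula on the slides.

```python
import math

def pair_score(core_size, k):
    """Assumed scoring function: k * C(core_size, 2)."""
    return k * math.comb(core_size, 2)

def grow_alignment(pivot_elements, columns, score=pair_score):
    """Greedy sketch of the heuristic over the columns of one base bucket.

    pivot_elements: elements of the pivot that could be part of the core.
    columns: one dict per remaining protein, mapping base_id to the core
             (set of pivot elements) of the pairwise alignment (b1, base).
    Returns (chosen base ids, final core).
    """
    core = set(pivot_elements)
    chosen = []
    k = 1  # the pivot protein itself
    for column in columns:
        best = None
        best_score = score(len(core), k)
        for base_id, pairwise_core in column.items():
            # Core(b1..bk+1) = Core(b1..bk) ∩ Core(b1, bk+1)
            candidate = core & pairwise_core
            s = score(len(candidate), k + 1)
            if s > best_score:
                best, best_score = (base_id, candidate), s
        if best is not None:
            chosen.append(best[0])
            core = best[1]
            k += 1
    return chosen, core

# Hypothetical bucket with three further columns.
chosen, core = grow_alignment(
    {1, 2, 3, 4, 5},
    [{"a": {1, 2, 3, 4}, "b": {1, 2}}, {"c": {1, 2, 3, 4, 5}}, {"d": {5}}],
)
```

In this toy run the third column is skipped because adding its only base would empty the core, which illustrates the trade-off between the number of aligned proteins and the core size.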
Stage 5 – Filtering & Scoring • Our scoring function • Core size – L • Number of proteins – k • $f(L, k) = k \cdot \binom{L}{2}$ • Report the highest scoring alignments • Finish!
Complexity • Worst case complexity: • (i) m is the number of proteins • (ii) k is the number of residues in an SSE • (iii) s and n are the number of SSEs and the number of residues found in each protein, respectively • n ~ 300, k ~ 10, s ~ 15 • The number of bases for each protein is $O(s^2)$
Complexity • For each pair of proteins we construct, cluster and extend $O(s^4)$ pairwise alignments. • This results in $O(m^2(s^4 k^3 + s^8 \log s + s^4 n))$ time, where $O(m^2)$ is the number of ways of pairing two proteins • In practice, the complexity is much smaller • we only construct the pairwise alignments defined by the BBs, and the clustering reduces their number even more
Complexity • The number of evaluated multiple alignments is linear in the number of bases • Each base can be a pivot for only one multiple alignment • We have $O(ms^2)$ bases • It takes $O(ms^2 n)$ time to construct a single multiple alignment and $O(m^2 s^4 n)$ time to construct all of them • Running time for the entire algorithm is bounded by $O(m^2 s^4 (k^3 + s^2 \log s + n))$, but experiments show that the actual running time is significantly lower
Experiment 1 • Example 1 - Detection of subset alignments and their use for structural classification • We have used MASS to align a set of 12 structures from two families: • Cofilin-like (CL) • Gelsolin-like (GL) • The two families are related structurally but not sequentially
Experiment 1 • The 12-molecule ensemble contains: • four CL structures • eight GL • The running time of MASS on this ensemble was 36 sec. • (Pentium 4 1800 MHz processor)
Experiment 1: Results (A) The structural alignment of all 12 proteins of the ensemble. (B) A subset alignment between only the eight GL proteins.
Experiment 1: Results (C) A subset alignment between only the four CL structures. (D) A subset alignment between only three out of the four CL structures.
Results Discussion • As expected, the maximal core size decreases as the number of aligned molecules increases • The dependence is not linear: • Large decrease between three and four molecules • Between four and five molecules • Between eight and nine molecules
Experiment 2 • Non-topological motif detection • The ensembles share a common SSE motif, but different topology. • In topological motifs, the order and the direction of the corresponding SSEs along the polypeptide chain are conserved while in non-topological they are not.