EECS 730 Introduction to Bioinformatics Structure Comparison

EECS 730 Introduction to Bioinformatics: Structure Comparison. Luke Huan, Electrical Engineering and Computer Science, http://people.eecs.ku.edu/~jhuan/. Protein Structure Similarity. Secondary Structure Elements: α helices, β strands/sheets, and loops. NMR spectrometry.


Presentation Transcript


  1. EECS 730 Introduction to Bioinformatics: Structure Comparison. Luke Huan, Electrical Engineering and Computer Science, http://people.eecs.ku.edu/~jhuan/

  2. Protein Structure Similarity

  3. Secondary Structure Elements: α helices, β strands/sheets, and loops

  4. Structure Prediction/Determination • Computational tools: homology, threading, molecular dynamics • Experimental tools: NMR spectroscopy, X-ray crystallography

  5. The State of the Structure Space • Structures have been determined for only about 10% of known protein sequences → Protein Structure Initiative (PSI) • 1990: 250 new structures • 1999: 2,500 new structures • 2000: >20,000 structures total • 2004: ~30,000 structures total

  6. Structure Similarity • Refers to how well (or poorly) the 3D folded structures of proteins can be aligned • Is expected to reflect functional similarities (interaction with other molecules) • Figure: proteins in the TIM barrel fold family

  7. Alignment of 1xis and 1nar (TIM barrels) • Alignment computed by DALI • Figure panels: ribbon format, backbone format, α-helix axes • Visualization: Sayle, R. RasMol. A protein visualization tool. http://www.umass.edu/microbio/rasmol/index2.htm

  8. Structure Similarity • Refers to how well (or poorly) the 3D folded structures of proteins can be aligned • Is expected to reflect functional similarities (interaction with other molecules) • 2007: ~34,000 structures in PDB, ~1,000 different folds (1:34 ratio)

  11. Structure Similarity • Refers to how well (or poorly) the 3D folded structures of proteins can be aligned • Is expected to reflect functional similarities (interaction with other molecules) • 2000: ~20,000 structures in PDB, ~4,000 different folds (1:5 ratio) • Three possible reasons: evolution; physical constraints (e.g., few ways to maximize hydrophobic interactions); limits in the techniques used for structure determination • Given a new structure, the probability is high that it is similar to an existing one

  12. Why Compute Structure Similarity? • Low sequence similarity may yield very similar structures • Sometimes high sequence similarity yields different structures • Diagram: Sequence → Structure → Function, annotated with sequence similarity

  13. Alignment of 1xis and 1nar (TIM barrels) • 1xis and 1nar have only 7% sequence identity, but approximately 70% of the residues are structurally similar

  14. Why Compute Structure Similarity? • Low sequence similarity may yield very similar structures • Sometimes high sequence similarity yields different structures • Structure comparison is expected to provide more pertinent information about functional (dis-)similarity among proteins, especially when evolutionary relationships are absent or undetectable • Diagram: Sequence → Structure → Function, annotated with sequence similarity and structure similarity

  15. Ill-Posed Problem → Multiple Terminology • (Dis-)similarity analysis • Structure comparison • Alignment, superposition, matching • Classification • Outline: Definitions, Applications, Methods, Issues

  16. A Few Web Sites • Protein Data Bank (PDB): http://www.rcsb.org/pdb/ • Protein classification: SCOP: http://scop.berkeley.edu/ ; CATH: http://www.biochem.ucl.ac.uk/bsm/cath/ • Protein alignment: DALI: http://www.ebi.ac.uk/dali/ ; LOCK: http://motif.stanford.edu/lock2/

  17. 3D Molecular Structure • Collection of (possibly typed) atoms or groups of atoms in some given 3D relative placement • The placement of a group of atoms is defined by the position of a reference point (e.g., the center of an atom) and the orientation of a reference direction • The type can be the atom ID, the amino-acid ID, etc.

  18. Matching of Structures • Two structures A and B match if: • Correspondence: there is a one-to-one map between their elements • Alignment: there exists a rigid-body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold ε
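The two matching conditions can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: it assumes the correspondence is given index-wise (element i of A maps to element i of B) and that the transform T is supplied as a 3x3 rotation matrix plus a translation vector. The names `transform`, `rmsd`, and `structures_match` are hypothetical.

```python
import math

def transform(point, rotation, translation):
    """Apply a rigid-body transform T(x) = R x + t to one 3D point."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def rmsd(a_points, b_points, rotation, translation):
    """RMSD between A and T(B), assuming the correspondence a_i <-> b_i."""
    if len(a_points) != len(b_points) or not a_points:
        raise ValueError("need a non-empty one-to-one correspondence")
    total = 0.0
    for a, b in zip(a_points, b_points):
        tb = transform(b, rotation, translation)
        total += sum((x - y) ** 2 for x, y in zip(a, tb))
    return math.sqrt(total / len(a_points))

def structures_match(a_points, b_points, rotation, translation, epsilon):
    """True if the RMSD between A and T(B) is below the threshold epsilon."""
    return rmsd(a_points, b_points, rotation, translation) < epsilon

# Usage: B is A shifted by (-1, -1, -1); T shifts it back.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
B = [(x - 1.0, y - 1.0, z - 1.0) for x, y, z in A]
print(structures_match(A, B, IDENTITY, (1.0, 1.0, 1.0), epsilon=0.5))  # True
print(structures_match(A, B, IDENTITY, (0.0, 0.0, 0.0), epsilon=0.5))  # False
```

Finding the correspondence and the transform is the hard part of the problem; the sketch only checks a candidate pair.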

  19. Complete Match

  20. But a complete match is rarely possible: • The molecules have different sizes • Their shapes are only locally similar • Figure: alignment of 3adk and 1gky, showing both matching and non-matching secondary structure elements

  21. Partial Match • Notion of support σ of the match: the match is between σ(A) and σ(B) • Dual problem: What is the support? What is the transform? • Often several (many) possible supports • Small supports → motifs

  22. Mathematical Relative • Two functions f and g compared over a support σ: $\|f - g\|^2$ • Over which support?

  24. Application #1: Find Global Similarities Among Protein Structures • Given two protein structures, find the largest similar substructures • For example, a substructure is a subset of Cα atoms or a subset of secondary structure elements in each molecule • Several possible similarity measures • Variants: 1-to-1, 1-to-many, many-to-many (PDB) • Must be automatic (and fast)

  25. Application #2: Classify Proteins • Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997] • Hierarchical classification • Insight into functions and structure stabilization • Basis for homology and threading • Manual classification → SCOP [Murzin et al., 1995]

  26. Application #2: Classify Proteins • Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997] • Hierarchical classification • Insight into functions and structure stabilization • Basis for homology and threading • Manual classification → SCOP [Murzin et al., 1995] • Increasing size of PDB → automatic classifiers: CATH [Orengo et al., 1997]; Pclass [Singh et al.]; FSSP [Holm and Sander] • Hierarchy levels: Class: similar secondary structure content; Fold: SSEs in similar arrangement; Family: clear evolutionary relationship

  27. Manual vs. Automatic Classification

  28. Application #3: Find Motif in Protein Structure • Given a protein structure and a motif (e.g., a small collection of atoms corresponding to a binding site) • Find whether the motif matches a substructure of the protein • Variant: one motif against many proteins • Figure: active sites of 1PIP and 5PAD; only 3 amino acids participate in the motif

  29. Application #4: Find Pharmacophore • Given: a small collection (5-10) of small flexible ligands with similar activity (hence assumed to bind at the same protein site), and low-energy conformations (several dozen to a few hundred) for each ligand • Find a substructure (pharmacophore) that occurs in at least one conformation of each ligand • Key problem in drug design when the binding site is unknown

  30. Application #4: Find Pharmacophore • Inhibitors of thermolysin: 1TLP, 4TMN, 5TMN, 6TMN • Figures: clusters of low-energy conformations of 1TLP; the 4 ligands overlapped with their pharmacophore matched

  31. Application #5: Search for Ligands Containing a Pharmacophore • Given: a database containing several hundred thousand or more small ligands, and a pharmacophore P • Find all ligands that have a low-energy conformation containing P • Data mining of pharmaceutical databases (lead generation) • Reference: S.M. LaValle, P.W. Finn, L.E. Kavraki, and J.C. Latombe. A Randomized Kinematics-Based Approach to Pharmacophore-Constrained Conformational Search and Database Screening. J. of Computational Chemistry, 21(9):731-747, July 2000

  32. Outline: Definitions • Applications • Methods • Issues

  33. Multiple Partial Matches

  34. Distributed Support • Figure labels: structures A and B, supports σ(A) and σ(B), gap

  35. What Is Best? • Should gaps be penalized? • Figure labels: structures A and B

  36. What About This? • The sequence along the backbone is not preserved • Figure labels: structures A and B

  37. A similarity measure for partial matches is unlikely to satisfy the triangle inequality

  38. Compute Structure Similarity • Structure representation • Similarity measurement • Computational solution

  39. Structure Representation • Element-based representation: a structure is broken down into a list of structure elements • We represent a protein structure by its geometry, topology, and attributes: • Geometry: the coordinates of the elements • Topology: the physical and chemical interactions between elements • Attributes: the physical and chemical attributes of the elements

  40. Structure Representation • There are three major groups of structure representations • Point list: treat the protein as a list of points in 3D space • Point set: treat the protein as a set of points in 3D space • Graph: treat the protein as a graph

  41. Comparing two point sets • Similarity measure: given two point sets P = {p1, p2, …, pn} and Q = {q1, q2, …, qm} (n ≤ m), find a Euclidean transformation T (rotation + translation) and a 1-1 mapping f from P to Q such that $S(P, Q) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d^2\big(p_i, T(f(p_i))\big)}$ is minimized • S is called the RMSD (root-mean-square distance) between the two structures

  42. Comparing two point sets • If m = n, there is a closed-form solution that solves the problem of comparing the two point sets exactly • If m ≠ n, the problem is much harder
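To make the closed-form case concrete, here is a minimal 2D sketch. It is an illustration under stated assumptions, not the lecture's method: the mapping f is assumed known and index-wise (p_i corresponds to q_i), and the points are planar so that the optimal rotation reduces to a single angle. For 3D coordinates the analogous closed-form solution is usually computed with a singular value decomposition (the Kabsch algorithm). The function names are hypothetical.

```python
import math

def centroid(points):
    """Mean position of a list of 2D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def superpose_2d(P, Q):
    """Rigid transform T minimizing the RMSD between P and T(Q).

    Assumes the correspondence p_i <-> q_i is known and len(P) == len(Q).
    Returns (theta, (tx, ty), rmsd), with T(q) = R(theta) q + (tx, ty).
    """
    if len(P) != len(Q) or not P:
        raise ValueError("P and Q must be non-empty and of equal size")
    pcx, pcy = centroid(P)
    qcx, qcy = centroid(Q)
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(P, Q):
        px, py, qx, qy = px - pcx, py - pcy, qx - qcx, qy - qcy
        dot += px * qx + py * qy      # coefficient of cos(theta)
        cross += py * qx - px * qy    # coefficient of sin(theta)
    theta = math.atan2(cross, dot)    # angle maximizing the overlap
    c, s = math.cos(theta), math.sin(theta)
    # The optimal translation maps the rotated centroid of Q onto that of P.
    tx = pcx - (c * qcx - s * qcy)
    ty = pcy - (s * qcx + c * qcy)
    total = 0.0
    for (px, py), (qx, qy) in zip(P, Q):
        rx, ry = c * qx - s * qy + tx, s * qx + c * qy + ty
        total += (px - rx) ** 2 + (py - ry) ** 2
    return theta, (tx, ty), math.sqrt(total / len(P))

# Usage: P is Q rotated by 30 degrees and shifted by (3, -1).
Q = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
a = math.radians(30)
P = [(math.cos(a) * x - math.sin(a) * y + 3.0,
      math.sin(a) * x + math.cos(a) * y - 1.0) for x, y in Q]
theta, shift, err = superpose_2d(P, Q)
print(round(math.degrees(theta), 6), round(err, 6))
```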

  43. Common Point Subset Problem • Find the largest common point subset • Given two point sets P = {p1, p2, …, pn} and Q = {q1, q2, …, qm} (n ≤ m), find a Euclidean transformation T (rotation + translation) and a 1-1 partial mapping f of maximal cardinality from P to Q such that d(pi, T(f(pi))) < t for all i on which f is defined • Also a harder problem (but not an NP-hard problem)
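One piece of this problem is easy to sketch: once a candidate transform T has been applied to Q, the largest 1-1 partial mapping with all distances below t is a maximum bipartite matching. The sketch below assumes Q is already transformed and uses a simple augmenting-path matching; the search over T, which is the hard part, is deliberately left out. The name `max_partial_mapping` is hypothetical.

```python
import math

def max_partial_mapping(P, Q, t):
    """Largest 1-1 partial mapping f: P -> Q with d(p, f(p)) < t.

    Q is assumed to be already transformed by the candidate T.
    Returns a dict {index in P: index in Q}.
    """
    # For each p, the q's that are close enough to be its image.
    candidates = [
        [j for j, q in enumerate(Q) if math.dist(p, q) < t] for p in P
    ]
    match_of_q = [None] * len(Q)  # match_of_q[j] = index in P mapped to q_j

    def try_assign(i, visited):
        """Try to give p_i a partner, re-routing earlier assignments if needed."""
        for j in candidates[i]:
            if j in visited:
                continue
            visited.add(j)
            if match_of_q[j] is None or try_assign(match_of_q[j], visited):
                match_of_q[j] = i
                return True
        return False

    for i in range(len(P)):
        try_assign(i, set())
    return {i: j for j, i in enumerate(match_of_q) if i is not None}

# Usage: p0 could take either q, but p1 only fits q0, so p0 must take q1.
P = [(0.0, 0.0), (0.4, 0.0)]
Q = [(0.3, 0.0), (-0.3, 0.0)]
mapping = max_partial_mapping(P, Q, t=0.35)
print(mapping)
```

A greedy nearest-neighbor assignment can get stuck with fewer pairs, which is why the sketch re-routes earlier assignments.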

  44. Geometric Hashing • Originally used for automatic visual recognition of geometric figures • The principle: we have two geometric figures, a model A with m points (there can be several models) and a query B with n points • Discover similar subfigures in A and B, invariant under translation, rotation (and often scale) • Let the figures be described by points • Try to find the largest set of point pairs from (A, B) that coincide

  45. Coinciding points • Example in two dimensions • Find six overlapping pairs: (1,a), (2,d), (3,c), (4,e), (6,f), (7,g) • The coinciding pairs are independent of the labeling • Note that the figures can be translated and rotated

  46. Reference frames • The points of the figures are specified in coordinate systems, or reference frames • In 2D, a reference frame can be defined by two points • Choose two points from A (ai, ak) and two from B (bj, bl), called bases, and define the reference frames (RF) from the bases • Example: origin at ai and the x-axis along the line ai, ak, or origin at the midpoint of ai, ak • Find the positions in the RF of all the other points; the result is called a reference frame system, RFS • "Overlap" (the x,y-axes of) RFSA and RFSB, and count the number of coinciding points
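The first convention above (origin at ai, x-axis along ai to ak) can be sketched as follows. This is an illustrative helper with hypothetical names, and the example figure is made up rather than taken from the slides. The point of the usage example is that a figure and a rotated, translated copy of it produce the same reference frame system when the same basis is chosen.

```python
import math

def to_reference_frame(points, i, k):
    """Coordinates of all points in the frame with origin points[i]
    and x-axis pointing from points[i] toward points[k]."""
    ox, oy = points[i]
    dx, dy = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(dx, dy)
    if length == 0.0:
        raise ValueError("basis points must be distinct")
    ux, uy = dx / length, dy / length  # unit vector of the new x-axis
    frame = []
    for x, y in points:
        rx, ry = x - ox, y - oy
        # Project onto the new x-axis (ux, uy) and the new y-axis (-uy, ux).
        frame.append((rx * ux + ry * uy, -rx * uy + ry * ux))
    return frame

def rigid_move(points, angle, shift):
    """Rotate points by angle (radians) about the origin, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1])
            for x, y in points]

# Usage: the frame coordinates do not change under a rigid motion.
figure = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (1.0, 5.0)]
moved = rigid_move(figure, math.radians(37), (10.0, -2.0))
frame_a = to_reference_frame(figure, 0, 2)
frame_b = to_reference_frame(moved, 0, 2)
for a, b in zip(frame_a, frame_b):
    print(tuple(round(v, 6) for v in a), tuple(round(v, 6) for v in b))
```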

  47. Reference frame system, example • Model (1,3): [(0,0), (6,2), (8,0), (9,4), (6,10), (3,8), (-1,6)] • Query (a,c): [(0,0), (3,-2), (8,0), (6,2), (10,4), (3,8), (0,6)] • Model (3,5): [(0,0), (1,8), (2,2), (4,-2), (10,0), (8,3), (8,7)] • Model (1,3) vs. query (a,c): four coinciding points • Model (3,5) vs. query (a,c): only the origins coincide
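The counts in this example can be checked directly from the listed coordinates. The sketch below treats each reference frame system as a set of exact integer coordinates and counts the shared positions. Exact equality is an assumption that only works because the slide's coordinates are integers; real data would need a tolerance or a grid.

```python
# Reference frame systems as listed in the example.
model_13 = [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)]
query_ac = [(0, 0), (3, -2), (8, 0), (6, 2), (10, 4), (3, 8), (0, 6)]
model_35 = [(0, 0), (1, 8), (2, 2), (4, -2), (10, 0), (8, 3), (8, 7)]

def count_coinciding(frame_a, frame_b):
    """Number of positions occupied in both reference frame systems."""
    return len(set(frame_a) & set(frame_b))

print(count_coinciding(model_13, query_ac))  # 4
print(count_coinciding(model_35, query_ac))  # 1 (only the origin)
```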

  48. Comparison of (Reference) Frame Systems • The number of coinciding points depends on the bases • We should therefore try all possible pairs as bases • This would result in m(m-1)n(n-1) comparisons of reference frame systems, but many of those comparisons are redundant • Geometric hashing is used to perform many comparisons efficiently and "simultaneously"
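For reference, the brute-force baseline that geometric hashing improves on can be sketched as below. It is an assumption-laden illustration: bases are ordered pairs of distinct points, the frame puts the origin at the first basis point, and coordinates are rounded to six decimals so that equal positions compare equal. The helper names are hypothetical.

```python
import math
from itertools import permutations

def frame(points, i, k):
    """Rounded coordinates of points in the frame with origin points[i]
    and x-axis toward points[k]."""
    ox, oy = points[i]
    dx, dy = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length
    return {
        (round((x - ox) * ux + (y - oy) * uy, 6),
         round(-(x - ox) * uy + (y - oy) * ux, 6))
        for x, y in points
    }

def best_overlap(A, B):
    """Try every ordered basis pair in A against every one in B.

    Returns (best number of coinciding points, number of comparisons)."""
    best = comparisons = 0
    for i, k in permutations(range(len(A)), 2):
        frame_a = frame(A, i, k)
        for j, l in permutations(range(len(B)), 2):
            comparisons += 1
            best = max(best, len(frame_a & frame(B, j, l)))
    return best, comparisons

# Usage: B is A rotated by 90 degrees and shifted, plus one unrelated point.
A = [(0, 0), (4, 0), (4, 3), (1, 5)]
B = [(-y + 10, x - 2) for x, y in A] + [(100, 100)]
best, comparisons = best_overlap(A, B)
print(best, comparisons)  # 4 240, i.e. m(m-1)n(n-1) with m = 4, n = 5
```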

  49. Hashing • Compare a query frame system to all model frame systems simultaneously • Assume a 2D hash table H with a very simple hash function • One bucket for each square of the frame system, identified by (p,q) • Let (u,v) ∈ H(p,q) mean that the frame system with basis (u,v) has a point in the square (p,q) • H is filled in a preprocessing step on the model
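A minimal sketch of the table H and of a voting query, under assumptions that go beyond the slide: squares are unit cells centered on integer grid points, every ordered pair of distinct model points is used as a basis, and a query frame system votes once per point for each model basis stored in that point's bucket. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

def frame(points, i, k):
    """Coordinates of points in the frame with origin points[i]
    and x-axis toward points[k]."""
    ox, oy = points[i]
    dx, dy = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length
    return [((x - ox) * ux + (y - oy) * uy, -(x - ox) * uy + (y - oy) * ux)
            for x, y in points]

def square(coord, cell=1.0):
    """Bucket key (p, q): the grid square containing the coordinate."""
    return (round(coord[0] / cell), round(coord[1] / cell))

def build_table(model, cell=1.0):
    """Preprocessing: H[(p, q)] = set of model bases (u, v) whose frame
    system has a point in square (p, q)."""
    H = defaultdict(set)
    for u, v in permutations(range(len(model)), 2):
        for coord in frame(model, u, v):
            H[square(coord, cell)].add((u, v))
    return H

def vote(H, query, a, c, cell=1.0):
    """Count, for each model basis, how many points of the query frame
    system with basis (a, c) land in one of its occupied squares."""
    votes = Counter()
    for coord in frame(query, a, c):
        for basis in H.get(square(coord, cell), ()):
            votes[basis] += 1
    return votes

# Usage: the query is the model rotated by 90 degrees and shifted.
model = [(0, 0), (4, 0), (4, 3), (1, 5)]
query = [(-y + 10, x - 2) for x, y in model]
H = build_table(model)
votes = vote(H, query, 0, 1)
print(votes[(0, 1)], max(votes.values()))
```

A single pass over the query frame system scores all model bases at once, which is the sense in which the comparisons are done "simultaneously".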

  50. Hashing preprocessing example
