EECS 730 Introduction to Bioinformatics: Structure Comparison. Luke Huan, Electrical Engineering and Computer Science, http://people.eecs.ku.edu/~jhuan/. Protein Structure Similarity. Secondary Structure Elements: α-helices, β-strands/sheets, and loops. NMR spectrometry.
EECS 730 Introduction to Bioinformatics: Structure Comparison • Luke Huan, Electrical Engineering and Computer Science • http://people.eecs.ku.edu/~jhuan/
Secondary Structure Elements: α-helices, β-strands/sheets, and loops
Structure Prediction/Determination • Computational tools: homology, threading, molecular dynamics • Experimental tools: NMR spectrometry, X-ray crystallography
The State of the Structure Space • Structures have been determined for only about 10% of known protein sequences • Protein Structure Initiative (PSI) • 1990: 250 new structures • 1999: 2,500 new structures • 2000: >20,000 structures total • 2004: ~30,000 structures total
Structure Similarity • Refers to how well (or poorly) the 3D folded structures of proteins can be aligned • Is expected to reflect functional similarities (interaction with other molecules) • Figure: proteins in the TIM barrel fold family
Alignment of 1xis and 1nar (TIM barrels) • Figure panels for 1xis and 1nar: ribbon format, backbone format, α-helix axes • Alignment computed by DALI • Visualization: Sayle, R. RasMol. A protein visualization tool. http://www.umass.edu/microbio/rasmol/index2.htm
Structure Similarity • Refers to how well (or poorly) the 3D folded structures of proteins can be aligned • Is expected to reflect functional similarities (interaction with other molecules) • 2007: ~34,000 structures in PDB, ~1,000 different folds (1:34 ratio)
Structure Similarity • Refers to how well (or poorly) the 3D folded structures of proteins can be aligned • Is expected to reflect functional similarities (interaction with other molecules) • 2000: ~20,000 structures in PDB, ~4,000 different folds (1:5 ratio) • Three possible reasons for the small number of folds: evolution; physical constraints (e.g., few ways to maximize hydrophobic interactions); limits in the techniques used for structure determination • Given a new structure, the probability is high that it is similar to an existing one
Why Compute Structure Similarity? • Low sequence similarity may yield very similar structures • Sometimes high sequence similarity yields different structures • Diagram: Sequence, Structure, Function (label: sequence similarity)
Alignment of 1xis and 1nar (TIM barrels) • 1xis and 1nar have only 7% sequence identity, but approximately 70% of the residues are structurally similar
Why Compute Structure Similarity? • Low sequence similarity may yield very similar structures • Sometimes high sequence similarity yields different structures • Structure comparison is expected to provide more pertinent information about functional (dis-)similarity among proteins, especially when the relationships are non-evolutionary or the evolutionary relationships are not detectable • Diagram: Sequence, Structure, Function (labels: sequence similarity, structure similarity)
Ill-Posed Problem: Multiple Terminology • (Dis-)similarity analysis • Structure comparison • Alignment, superposition, matching • Classification • Outline: Definitions • Applications • Methods • Issues
A Few Web Sites • Protein Data Bank (PDB): http://www.rcsb.org/pdb/ • Protein classification: • SCOP: http://scop.berkeley.edu/ • CATH: http://www.biochem.ucl.ac.uk/bsm/cath/ • Protein alignment: • DALI: http://www.ebi.ac.uk/dali/ • LOCK: http://motif.stanford.edu/lock2/
3D Molecular Structure • Collection of (possibly typed) atoms or groups of atoms in some given 3D relative placement • The placement of a group of atoms is defined by the position of a reference point (e.g., the center of an atom) and the orientation of a reference direction • The type can be the atom ID, the amino-acid ID, etc.
Matching of Structures • Two structures A and B match if: • Correspondence: there is a one-to-one map between their elements • Alignment: there exists a rigid-body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold ε
Complete Match
But a complete match is rarely possible: • The molecules have different sizes • Their shapes are only locally similar • Figure: alignment of 3adk and 1gky, showing both matching and non-matching secondary structure elements
Partial Match • Notion of the support σ of the match: the match is between σ(A) and σ(B) • Dual problem: What is the support? What is the transform? • Often several (many) possible supports • Small supports correspond to motifs
Mathematical Relative • Functions f and g, support σ • $\|f - g\|^2$ • Over which support?
Application #1: Find Global Similarities Among Protein Structures • Given two protein structures, find the largest similar substructures • For example, a substructure is a subset of Cα atoms or a subset of secondary structure elements in each molecule • Several possible similarity measures • Variants: 1-to-1, 1-to-many, many-to-many (PDB) • Must be automatic (and fast)
Application #2: Classify Proteins • Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997] • Hierarchical classification • Insight into functions and structure stabilization • Basis for homology and threading • Manual classification: SCOP [Murzin et al., 1995]
Application #2: Classify Proteins • Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997] • Hierarchical classification: • Class: similar secondary structure content • Fold: SSEs in similar arrangement • Family: clear evolutionary relationship • Insight into functions and structure stabilization • Basis for homology and threading • Manual classification: SCOP [Murzin et al., 1995] • The increasing size of the PDB calls for automatic classifiers: CATH [Orengo et al., 1997]; Pclass [Singh et al.]; FSSP [Holm and Sander]
Manual vs. Automatic Classification
Application #3: Find Motif in Protein Structure • Given a protein structure and a motif (e.g., a small collection of atoms corresponding to a binding site) • Find whether the motif matches a substructure of the protein • Variant: one motif against many proteins • Figure: active sites of 1PIP and 5PAD; only 3 amino acids participate in the motif
Application #4: Find Pharmacophore • Given: • A small collection (5-10) of small flexible ligands with similar activity (hence assumed to bind at the same protein site) • Low-energy conformations (several dozen to a few hundred) for each ligand • Find a substructure (pharmacophore) that occurs in at least one conformation of each ligand • Key problem in drug design when the binding site is unknown
Application #4: Find Pharmacophore • Example: inhibitors of thermolysin (1TLP, 4TMN, 5TMN, 6TMN) • Figure: clusters of low-energy conformations of 1TLP • Figure: the 4 ligands overlapped with their pharmacophore matched
Application #5: Search for Ligands Containing a Pharmacophore • Given: • A database containing several hundred thousand, or more, small ligands • A pharmacophore P • Find all ligands that have a low-energy conformation containing P • Data mining of pharmaceutical databases (lead generation) • Reference: S.M. LaValle, P.W. Finn, L.E. Kavraki, and J.C. Latombe. A Randomized Kinematics-Based Approach to Pharmacophore-Constrained Conformational Search and Database Screening. J. of Computational Chemistry, 21(9):731-747, July 2000
Definitions • Applications • Methods • Issues
Multiple Partial Matches
Distributed Support • Figure: structures A and B with supports σ(A) and σ(B), separated by a gap
What is Best? • Should gaps be penalized? • Figure: structures A and B
What About This? • Sequence along the backbone is not preserved • Figure: structures A and B
A similarity measure is unlikely to satisfy the triangle inequality for partial matches
Compute Structure Similarity • Structure representation • Similarity measurement • Computational solution
Structure representation • Element-based representation • A structure is broken down into a list of structure elements • We represent a protein structure by its geometry, topology, and attributes: • Geometry: the coordinates of the elements • Topology: the physical and chemical interactions of the elements • Attributes: the physical and chemical attributes of the elements
Structure Representation • There are three major groups of structure representation • Point list: treat the protein as a list of points in 3D space • Point set: treat the protein as a set of points in 3D space • Graphs: treat the protein as a graph
Comparing two point sets • Similarity measure: Given two point sets P = {p1, p2, …, pn} and Q = {q1, q2, …, qm} (n ≤ m), find a Euclidean transformation T (rotation + translation) and a 1-1 mapping f from P to Q such that $S(P, Q) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d^2(p_i, T(f(p_i)))}$ is minimized • S is called the RMSD (root-mean-square distance) between the two structures
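The similarity measure above can be evaluated directly once a candidate transformation and mapping are chosen. The following is a minimal sketch, not the course's implementation: the function names, the list-of-rows rotation matrix, and the `mapping` list (where `mapping[i]` is the index in Q of f(p_i)) are all assumptions made for illustration, and the mean is taken over the n mapped pairs.

```python
import math

def apply_transform(point, rotation, translation):
    """Apply T(q) = R q + t to a 3D point; rotation is a 3x3 list of rows."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def rmsd(P, Q, mapping, rotation, translation):
    """RMSD between each p_i and T(f(p_i)), where f(p_i) = Q[mapping[i]]."""
    total = 0.0
    for i, p in enumerate(P):
        q = apply_transform(Q[mapping[i]], rotation, translation)
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / len(P))

# Hypothetical toy data: two points of P mapped into a three-point Q.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
P = [(0, 0, 0), (1, 0, 0)]
Q = [(5, 5, 5), (1, 0, 0), (0, 0, 0)]
print(rmsd(P, Q, [2, 1], IDENTITY, (0, 0, 0)))
```

The hard part of the problem is the search over T and f; this sketch only scores one candidate.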
Comparing two point sets • If m = n, there is a closed-form solution that solves the problem of comparing the two point sets exactly • If m ≠ n, the problem is much harder
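To illustrate what a closed-form superposition looks like, here is a hedged sketch restricted to 2D and to the case where the correspondence is already known (P[i] pairs with Q[i]); both restrictions are assumptions made for brevity. In 3D the analogous closed-form solution is usually computed with an SVD (the Kabsch method). The function names are hypothetical.

```python
import math

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def superpose_2d(P, Q):
    """Return (theta, tx, ty) of the rigid transform T that minimizes the
    RMSD between P[i] and T(Q[i]), for equal-length 2D point lists."""
    (pcx, pcy), (qcx, qcy) = centroid(P), centroid(Q)
    a = b = 0.0
    for (x, y), (u, v) in zip(P, Q):
        x, y, u, v = x - pcx, y - pcy, u - qcx, v - qcy
        a += x * u + y * v   # sum of dot products of centered pairs
        b += y * u - x * v   # sum of cross products of centered pairs
    theta = math.atan2(b, a)  # angle maximizing a*cos(theta) + b*sin(theta)
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated centroid of Q onto the centroid of P.
    return theta, pcx - (c * qcx - s * qcy), pcy - (s * qcx + c * qcy)

def transform(points, theta, tx, ty):
    """Rotate by theta about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]
```

The rotation angle comes from maximizing the sum of dot products between the centered points of P and the rotated centered points of Q, which is equivalent to minimizing the RMSD.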
Common Point Subset Problem • Find the largest common point subset • Given two point sets P = {p1, p2, …, pn} and Q = {q1, q2, …, qm} (n ≤ m), find a Euclidean transformation T (rotation + translation) and a 1-1 partial mapping f from P to Q with maximal cardinality such that d(pi, T(f(pi))) < t for all i on which f is defined • Also a harder problem (but not an NP-hard problem)
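As a rough illustration of the thresholded partial mapping, the sketch below builds a 1-1 partial mapping for a single, already chosen transformation (Q is assumed to be transformed beforehand). It is a greedy heuristic, so it is not guaranteed to reach maximal cardinality, and it does not search over T; the function name and data are hypothetical.

```python
import math

def greedy_partial_mapping(P, Q_transformed, t):
    """Greedily pair points of P with points of T(Q) closer than t.

    Returns a dict {index in P: index in Q}. Closest pairs are taken first;
    each point is used at most once, so the mapping is 1-1 and partial.
    """
    candidates = sorted(
        (math.dist(p, q), i, j)
        for i, p in enumerate(P)
        for j, q in enumerate(Q_transformed)
    )
    used_p, used_q, mapping = set(), set(), {}
    for d, i, j in candidates:
        if d >= t:
            break  # every remaining candidate pair is at least t apart
        if i not in used_p and j not in used_q:
            mapping[i] = j
            used_p.add(i)
            used_q.add(j)
    return mapping

# Hypothetical toy data: two of the three points have a partner within t.
P = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]
Q = [(1.1, 0, 0), (0, 0.1, 0), (9, 9, 9)]
print(greedy_partial_mapping(P, Q, t=0.5))
```

An exact maximum for a fixed T could instead be obtained with bipartite matching; the greedy version is shown only because it is short.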
Geometric Hashing • Originally used for automatic visual recognition of geometric figures • The principle: • We have two geometric figures • model A with m points (there can be several models) • query B with n points • Discover similar subfigures in A and B, invariant under placement, rotation (and often size) • Let the figures be described by points • Try to find the largest set of points from (A, B) with coinciding points
Coinciding points • Example in two dimensions • Find six overlapping pairs: (1,a) (2,d) (3,c) (4,e) (6,f) (7,g) • The coinciding pairs are independent of the labeling • Note that the figures can be translated and rotated
Reference frames • The points of the figures are specified in coordinate systems, or reference frames • In 2D, a reference frame can be defined by two points • Choose two points from A (ai, ak) and two from B (bj, bl), called bases, and define the reference frames (RF) from the bases • Example: origin at ai and the x-axis along the line ai,ak, or origin at the midpoint of ai,ak • Find the positions in the RF of all the other points; the result is called the reference frame system (RFS) • "Overlap" (the x,y-axes of) RFSA and RFSB, and count the number of coinciding points
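A reference frame system can be computed with a few lines of vector arithmetic. This sketch assumes the first convention mentioned on the slide (origin at the first basis point, x-axis toward the second) and applies no scaling; the function name and the sample figure are hypothetical.

```python
import math

def to_reference_frame(points, i, k):
    """Express all 2D points in the frame with origin points[i] and
    x-axis pointing from points[i] toward points[k]."""
    ox, oy = points[i]
    bx, by = points[k][0] - ox, points[k][1] - oy
    norm = math.hypot(bx, by)
    if norm == 0:
        raise ValueError("the two basis points must be distinct")
    ex, ey = bx / norm, by / norm  # unit vector along the new x-axis
    frame = []
    for x, y in points:
        dx, dy = x - ox, y - oy
        # Project onto the new x-axis (ex, ey) and the new y-axis (-ey, ex).
        frame.append((dx * ex + dy * ey, -dx * ey + dy * ex))
    return frame

# Hypothetical figure with basis (points[0], points[1]).
figure = [(2, 1), (2, 4), (5, 1)]
print(to_reference_frame(figure, 0, 1))
```

Because the frame is built from the figure's own points, translating or rotating the whole figure leaves the resulting coordinates unchanged (up to rounding), which is what makes the later comparison placement-independent.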
Reference frame system, example • Model (1,3): [(0,0) (6,2) (8,0) (9,4) (6,10) (3,8) (-1,6)] • Query (a,c): [(0,0) (3,-2) (8,0) (6,2) (10,4) (3,8) (0,6)] • Model (3,5): [(0,0) (1,8) (2,2) (4,-2) (10,0) (8,3) (8,7)] • Model (1,3) vs. query (a,c): four coinciding points • Model (3,5) vs. query (a,c): only the origins coincide
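The counts in this example can be checked mechanically. The sketch below uses the coordinates from the slide; the function name, the one-to-one greedy pairing, and the tolerance of 0.5 are assumptions chosen for illustration.

```python
import math

def count_coinciding(frame_a, frame_b, tol=0.5):
    """Count points of frame_a that coincide (within tol) with a distinct,
    not yet used point of frame_b."""
    unused = list(frame_b)
    count = 0
    for a in frame_a:
        for b in unused:
            if math.dist(a, b) <= tol:
                unused.remove(b)  # each point of frame_b is used at most once
                count += 1
                break
    return count

model_13 = [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)]
query_ac = [(0, 0), (3, -2), (8, 0), (6, 2), (10, 4), (3, 8), (0, 6)]
model_35 = [(0, 0), (1, 8), (2, 2), (4, -2), (10, 0), (8, 3), (8, 7)]

print(count_coinciding(model_13, query_ac))  # 4
print(count_coinciding(model_35, query_ac))  # 1 (only the origins)
```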
Comparison of (Reference) Frame Systems • The number of coinciding points depends on the bases • We should therefore try all possible pairs as bases • This would result in m(m-1)n(n-1) comparisons of reference frame systems, but many of those comparisons are redundant • Geometric hashing is used to perform many comparisons "simultaneously" and efficiently
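For a sense of scale, take two seven-point figures (m = n = 7, an assumed size chosen only for illustration). Counting ordered basis pairs in each figure gives:

```latex
\[
m(m-1)\,n(n-1) = 7 \cdot 6 \cdot 7 \cdot 6 = 42 \cdot 42 = 1764
\]
```

Each of these comparisons would itself inspect the remaining points, so the brute-force cost grows quickly with the number of points.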
Hashing • Simultaneously compare a query frame system to all model frame systems • Assume a 2D hash table H with a very simple hash function: one bucket for each square of the frame system, identified by (p,q) • Let (u,v) ∈ H(p,q) mean that the frame system with basis (u,v) has a point in the square (p,q) • H is filled in a preprocessing pass over the model
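The preprocessing and lookup steps can be sketched as follows. This is an illustrative toy version under several assumptions: 2D points, frames with origin at the first basis point and x-axis toward the second, no scaling, unit squares obtained by rounding, and hypothetical function names. A practical implementation would also need to handle points that fall near square boundaries.

```python
import math
from collections import defaultdict
from itertools import permutations

def frame_coords(points, i, k):
    """Coordinates of the non-basis points in the frame of basis (i, k)."""
    ox, oy = points[i]
    bx, by = points[k][0] - ox, points[k][1] - oy
    norm = math.sqrt(bx * bx + by * by)
    coords = []
    for j, (x, y) in enumerate(points):
        if j in (i, k):
            continue
        dx, dy = x - ox, y - oy
        coords.append(((dx * bx + dy * by) / norm, (dy * bx - dx * by) / norm))
    return coords

def bucket(xy, cell=1.0):
    """Very simple hash function: the square (p, q) containing the point."""
    return (round(xy[0] / cell), round(xy[1] / cell))

def build_table(model, cell=1.0):
    """Preprocessing: H[(p, q)] is the set of model bases (u, v) whose frame
    system has a point in square (p, q)."""
    table = defaultdict(set)
    for u, v in permutations(range(len(model)), 2):
        for xy in frame_coords(model, u, v):
            table[bucket(xy, cell)].add((u, v))
    return table

def vote(table, query, i, k, cell=1.0):
    """Recognition: each query point votes for every model basis stored in
    the square it falls into, for the query basis (i, k)."""
    votes = defaultdict(int)
    for xy in frame_coords(query, i, k):
        for basis in table.get(bucket(xy, cell), ()):
            votes[basis] += 1
    return dict(votes)

# Hypothetical data: the query is the model rotated by 90 degrees and shifted.
model = [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)]
query = [(5 - y, 3 + x) for x, y in model]
votes = vote(build_table(model), query, 0, 2)
print(max(votes, key=votes.get), max(votes.values()))
```

One pass over the query frame system touches every model basis that shares a square with it, which is the sense in which the comparisons happen "simultaneously".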
Hashing preprocessing example