EECS 730 Introduction to Bioinformatics Structure Comparison

Luke Huan • Electrical Engineering and Computer Science • http://people.eecs.ku.edu/~jhuan/

Protein Structure Similarity

Secondary Structure Elements: α helices, β strands/sheets, and loops

Structure Prediction/Determination • Computational tools: homology, threading, molecular dynamics • Experimental tools: X-ray crystallography, NMR spectroscopy

The State of the Structure Space • Only about 10% of structures have been determined for known protein sequences → Protein Structure Initiative (PSI) • 1990: 250 new structures • 1999: 2,500 new structures • 2000: >20,000 structures total • 2004: ~30,000 structures total

Structure Similarity • Refers to how well (or poorly) 3D folded structures of proteins can be aligned • Expected to reflect functional similarities (interaction with other molecules) • Figure: proteins in the TIM barrel fold family

Alignment of 1xis and 1nar (TIM barrels) • Figure panels: ribbon format, backbone format, α-helix axes (1xis and 1nar) • Alignment computed by DALI • Visualization: Sayle, R. RasMol: a protein visualization tool. http://www.umass.edu/microbio/rasmol/index2.htm

Structure Similarity • Refers to how well (or poorly) 3D folded structures of proteins can be aligned • Is expected to reflect functional similarities (interaction with other molecules) • 2007: ~34,000 structures in PDB, ~1,000 different folds (1:34 ratio)


Structure Similarity • Refers to how well (or poorly) 3D folded structures of proteins can be aligned • Is expected to reflect functional similarities (interaction with other molecules) • 2000: ~20,000 structures in PDB, ~4,000 different folds (1:5 ratio) • Three possible reasons: - evolution, - physical constraints (e.g., few ways to maximize hydrophobic interactions), - limits in techniques used for structure determination • Given a new structure, the probability is high that it is similar to an existing one

Why Compute Structure Similarity? • Low sequence similarity may yield very similar structures • Sometimes high sequence similarity yields different structures • [Diagram labels: Sequence, Structure, Function; sequence similarity]

Alignment of 1xis and 1nar (TIM barrels) • 1xis and 1nar have only 7% sequence identity, but approximately 70% of the residues are structurally similar

Why Compute Structure Similarity? • Low sequence similarity may yield very similar structures • Sometimes high sequence similarity yields different structures • Structure comparison is expected to provide more pertinent information about functional (dis-)similarity among proteins, especially with non-evolutionary relationships or non-detectable evolutionary relationships • [Diagram labels: Sequence, Structure, Function; sequence similarity, structure similarity]

Ill-Posed Problem, Multiple Terminology • (Dis-)similarity analysis • Structure comparison • Alignment, superposition, matching • Classification • Outline: Definitions, Applications, Methods, Issues

A Few Web Sites • Protein Data Bank (PDB): http://www.rcsb.org/pdb/ • Protein classification: • SCOP: http://scop.berkeley.edu/ • CATH: http://www.biochem.ucl.ac.uk/bsm/cath/ • Protein alignment: • DALI: http://www.ebi.ac.uk/dali/ • LOCK: http://motif.stanford.edu/lock2/

3D Molecular Structure • Collection of (possibly typed) atoms or groups of atoms in some given 3D relative placement • The placement of a group of atoms is defined by the position of a reference point (e.g., the center of an atom) and the orientation of a reference direction • The type can be the atom ID, the amino-acid ID, etc…

Matching of Structures • Two structures A and B match if: • Correspondence: there is a one-to-one map between their elements • Alignment: there exists a rigid-body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold ε

Complete Match

But a complete match is rarely possible: • The molecules have different sizes • Their shapes are only locally similar • Figure: alignment of 3adk and 1gky, showing both matching and non-matching secondary structure elements

Partial Match • Notion of support σ of the match: the match is between σ(A) and σ(B) • → Dual problem: - What is the support? - What is the transform? • Often several (many) possible supports • Small supports → motifs

Mathematical Relative • Compare two functions f and g over a support σ: ‖f − g‖₂ • Over which support?

Application #1: Find Global Similarities Among Protein Structures • Given two protein structures, find the largest similar substructures • For example, a substructure is a subset of Cα atoms or a subset of secondary structure elements in each molecule • Several possible similarity measures • Variants: 1-to-1, 1-to-many, many-to-many (PDB) • Must be automatic (and fast)

Application #2: Classify Proteins • Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997] • Hierarchical classification • Insight into functions and structure stabilization • Basis for homology and threading • Manual classification → SCOP [Murzin et al., 1995]

Application #2: Classify Proteins • Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997] • Hierarchical classification • Insight into functions and structure stabilization • Basis for homology and threading • Manual classification → SCOP [Murzin et al., 1995] • Increasing size of PDB → Automatic classifiers: CATH [Orengo et al., 1997]; Pclass [Singh et al.]; FSSP [Holm and Sander] • Hierarchy levels (figure labels): Class: similar secondary structure content; Fold: SSEs in similar arrangement; Family: clear evolutionary relationship

Manual vs. Automatic Classification

Application #3: Find Motif in Protein Structure • Given a protein structure and a motif (e.g., a small collection of atoms corresponding to a binding site) • Find whether the motif matches a substructure of the protein • Variant: one motif against many proteins • Figure: active sites of 1PIP and 5PAD; only 3 amino acids participate in the motif

Application #4: Find Pharmacophore • Given: • Small collection (5-10) of small flexible ligands with similar activity (hence, assumed to bind at the same protein site) • Low-energy conformations (several dozen to a few hundred) for each ligand • Find a substructure (pharmacophore) that occurs in at least one conformation of each ligand • Key problem in drug design when the binding site is unknown

Application #4: Find Pharmacophore • Example: inhibitors of thermolysin (1TLP, 4TMN, 5TMN, 6TMN) • Figure: clusters of low-energy conformations of 1TLP • Figure: the 4 ligands overlapped with their pharmacophore matched

Application #5: Search for Ligands Containing a Pharmacophore • Given: • Database containing several hundred thousand, or more, small ligands • A pharmacophore P • Find all ligands that have a low-energy conformation containing P • Data mining of pharmaceutical databases (lead generation) • Reference: S.M. LaValle, P.W. Finn, L.E. Kavraki, and J.C. Latombe. A Randomized Kinematics-Based Approach to Pharmacophore-Constrained Conformational Search and Database Screening. J. of Computational Chemistry, 21(9):731-747, July 2000

Definitions • Applications • Methods • Issues

Multiple Partial Matches

Distributed Support • [Figure labels: A, B, σ(A), σ(B), gap]

What is Best? • Should gaps be penalized? • [Figure labels: A, B]

What About This? • Sequence along backbone is not preserved • [Figure labels: A, B]

→ A similarity measure for partial matches is unlikely to satisfy the triangle inequality

Compute Structure Similarity • Structure representation • Similarity measurement • Computational solution

Structure Representation • Element-based representation • A structure is broken down into a list of structure elements • We represent a protein structure by its geometry, topology, and attributes: • Geometry: the coordinates of the elements • Topology: the physical and chemical interactions of elements • Attributes: the physical and chemical attributes of the elements

Structure Representation • There are three major groups of structure representations • Point list: treat the protein as a list of points in 3D space • Point set: treat the protein as a set of points in 3D space • Graphs: treat the protein as a graph

Comparing two point sets • Similarity measure: given two point sets P = {p1, p2, …, pn} and Q = {q1, q2, …, qm} (n ≤ m), find a Euclidean transformation T (rotation + translation) and a 1-1 mapping f from P to Q such that $S(P, Q) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d^2(p_i, T(f(p_i)))}$ is minimized • S is called the RMSD (root-mean-square deviation) between the two structures
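
To make the similarity measure concrete, here is a minimal sketch that evaluates the RMSD once a transformation T and a mapping f have already been chosen. The function name `rmsd` and the convention that the i-th entry of the second list is T(f(p_i)) are assumptions made for illustration; the slides do not prescribe an implementation.

```python
import math

def rmsd(P, Q_mapped):
    """Root-mean-square deviation between matched point lists.

    P[i] is p_i and Q_mapped[i] is assumed to be T(f(p_i)), i.e. the
    partner of p_i after the transformation has been applied.
    """
    if not P or len(P) != len(Q_mapped):
        raise ValueError("need two non-empty lists of equal length")
    # Mean of the squared Euclidean distances over the n matched pairs.
    mean_sq = sum(math.dist(p, q) ** 2 for p, q in zip(P, Q_mapped)) / len(P)
    return math.sqrt(mean_sq)

# Two points, each displaced by exactly 1 along z.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
```

The hard part of the problem is choosing T and f; this function only scores a candidate choice.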

Comparing two point sets • If m = n (and the correspondence between the points is given), there is a closed-form solution that solves the problem exactly • If m ≠ n, the problem is much harder
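
The slides do not say which closed-form solution is meant. In 3D the optimal rotation is usually obtained from a singular value decomposition (Kabsch) or from quaternions (Horn). As a dependency-free illustration, the sketch below solves the 2D version of the same least-squares problem, where the optimal angle has a direct formula. It assumes equal-sized point lists whose correspondence is given by index; `superpose_2d` is a hypothetical name.

```python
import math

def superpose_2d(P, Q):
    """Rotation angle and translation that best map Q onto P (2D, paired by index)."""
    n = len(P)
    pcx, pcy = sum(x for x, _ in P) / n, sum(y for _, y in P) / n
    qcx, qcy = sum(x for x, _ in Q) / n, sum(y for _, y in Q) / n
    a = b = 0.0
    for (px, py), (qx, qy) in zip(P, Q):
        px, py, qx, qy = px - pcx, py - pcy, qx - qcx, qy - qcy
        a += px * qx + py * qy   # multiplies cos(theta) in the objective
        b += py * qx - px * qy   # multiplies sin(theta) in the objective
    theta = math.atan2(b, a)     # maximizes a*cos(theta) + b*sin(theta)
    c, s = math.cos(theta), math.sin(theta)
    # The translation moves the rotated centroid of Q onto the centroid of P.
    tx, ty = pcx - (c * qcx - s * qcy), pcy - (s * qcx + c * qcy)
    moved = [(c * x - s * y + tx, s * x + c * y + ty) for x, y in Q]
    err = math.sqrt(sum(math.dist(p, m) ** 2 for p, m in zip(P, moved)) / n)
    return theta, (tx, ty), err

# P is Q rotated by 90 degrees and shifted by (3, 4).
Q = [(0, 0), (1, 0), (0, 2)]
P = [(3, 4), (3, 5), (1, 4)]
theta, shift, err = superpose_2d(P, Q)
print(math.degrees(theta), shift, err)
```

Centering both sets first removes the translation from the problem, which is the same first step used by the 3D methods.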

Common Point Subset Problem • Find the largest common point subset • Given two point sets P = {p1, p2, …, pn} and Q = {q1, q2, …, qm} (n ≤ m), find a Euclidean transformation T (rotation + translation) and a 1-1 partial mapping f with maximal cardinality from P to Q such that d(pi, T(f(pi))) < t for all i defined in f • Also a harder problem (but not an NP-hard problem)
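
One piece of this problem is easy to isolate: once a candidate transformation T is fixed, the largest 1-1 partial mapping with all distances below t is a maximum bipartite matching between the points of P and the transformed points of Q. The sketch below uses a simple augmenting-path search for that sub-step. It is an illustration under that assumption, not the method of the slides, and it does not search over transformations; `largest_mapping` is a hypothetical name.

```python
import math

def largest_mapping(P, Q_transformed, t):
    """Largest 1-1 partial mapping {i: j} with d(P[i], Q_transformed[j]) < t."""
    # For every p_i, list the transformed q_j that lie within the threshold.
    near = [[j for j, q in enumerate(Q_transformed) if math.dist(p, q) < t]
            for p in P]
    owner = {}  # index j in Q -> index i in P currently matched to it

    def assign(i, seen):
        for j in near[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take q_j if it is free, or if its current owner can move elsewhere.
            if j not in owner or assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    for i in range(len(P)):
        assign(i, set())
    return {i: j for j, i in owner.items()}

P = [(0, 0), (0.4, 0), (5, 5)]
Q = [(0.3, 0), (-0.3, 0)]
print(largest_mapping(P, Q, 0.35))
```

In this example the first point of P has to give up its initial partner so that the second point can be matched as well, which a purely nearest-first assignment could miss.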

Geometric Hashing • Originally used for automatic visual recognition of geometric figures • The principle • We have two geometric figures • model A with m points (there can be several models) • query B with n points • Discover similar subfigures in A and B, invariant under placement, rotation (and often size) • Let the figures be described by points • Try to find the largest set of point pairs from A and B that coincide

Coinciding points • Example in two dimensions • Find six overlapping pairs • (1,a)(2,d)(3,c)(4,e)(6,f)(7,g) • The coinciding pairs are independent of the labeling • Note that the figures can be translated and rotated

Reference frames • The points of the figures are specified in coordinate systems, or reference frames • In 2D, a reference frame can be defined by two points • Choose two points from A (ai, ak) and two from B (bj, bl), called bases, and define the reference frames (RF) from the bases • Example: origin at ai and the x-axis along the line ai, ak, or origin at the midpoint of ai, ak • Find the positions in the RF of all the other points, called the reference frame system (RFS) • "Overlap" (the x,y-axes of) RFSA and RFSB, and count the number of coinciding points
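
The first convention mentioned above (origin at ai, x-axis along the line from ai to ak) can be sketched as a change of coordinates. The function name `to_reference_frame` and the use of list indices for the basis points are assumptions for illustration; no rescaling is applied, so the frame is invariant under translation and rotation only.

```python
import math

def to_reference_frame(points, i, k):
    """Coordinates of all 2D points in the frame with origin points[i]
    and x-axis pointing from points[i] to points[k]."""
    ox, oy = points[i]
    ax, ay = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(ax, ay)
    if length == 0:
        raise ValueError("basis points must be distinct")
    ux, uy = ax / length, ay / length   # unit vector of the new x-axis
    # Project each shifted point onto the new x-axis and onto its perpendicular.
    return [((x - ox) * ux + (y - oy) * uy,
             -(x - ox) * uy + (y - oy) * ux) for x, y in points]

# Basis (0, 1): the origin moves to (1, 1) and the x-axis points "up".
print(to_reference_frame([(1, 1), (1, 3), (0, 2)], 0, 1))
```

The basis points themselves always land at the origin and on the positive x-axis, which is why they trivially coincide whenever two frame systems are overlapped.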

Reference frame system, example • Model (1,3): [(0,0)(6,2)(8,0)(9,4)(6,10)(3,8)(-1,6)] • Query (a,c): [(0,0)(3,-2)(8,0)(6,2)(10,4)(3,8)(0,6)] • Model (3,5): [(0,0)(1,8)(2,2)(4,-2)(10,0)(8,3)(8,7)] • Model (1,3) vs. query (a,c): four coinciding points • Model (3,5) vs. query (a,c): only the origins coincide
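
The counts in this example can be checked directly from the coordinate lists: with exact integer coordinates, counting coinciding points is a set intersection. The variable and function names below are illustrative only.

```python
# Frame-system coordinates copied from the example.
model_13 = [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)]
query_ac = [(0, 0), (3, -2), (8, 0), (6, 2), (10, 4), (3, 8), (0, 6)]
model_35 = [(0, 0), (1, 8), (2, 2), (4, -2), (10, 0), (8, 3), (8, 7)]

def coinciding(frame_a, frame_b):
    """Points that occupy the same position in both frame systems."""
    return set(frame_a) & set(frame_b)

print(sorted(coinciding(model_13, query_ac)))  # [(0, 0), (3, 8), (6, 2), (8, 0)]
print(sorted(coinciding(model_35, query_ac)))  # [(0, 0)]
```

With real coordinates an exact comparison is too strict; points are then compared up to a tolerance, for instance by snapping them to grid squares.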

Comparison of (Reference) Frame Systems • The number of coinciding points depends on the bases • We should therefore try all possible pairs as bases • This would result in m(m-1)n(n-1) comparisons of reference frame systems, but many of those comparisons are redundant • Geometric hashing is used to perform many comparisons efficiently and "simultaneously"
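
As a quick check of the count m(m-1)n(n-1), the sketch below enumerates ordered basis pairs explicitly. The sizes m = n = 7 are an assumption taken from the seven-point figures in the running example; `basis_pair_comparisons` is a hypothetical name.

```python
from itertools import permutations

def basis_pair_comparisons(m, n):
    """Number of (model basis, query basis) combinations, bases being ordered pairs."""
    model_bases = sum(1 for _ in permutations(range(m), 2))  # m * (m - 1)
    query_bases = sum(1 for _ in permutations(range(n), 2))  # n * (n - 1)
    return model_bases * query_bases

print(basis_pair_comparisons(7, 7))  # 42 * 42 = 1764
```

Each of these comparisons also touches every point of both figures, which is what motivates precomputing the model side.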

Hashing • Simultaneously compare a query frame system to all model frame systems • Assume a 2D hash table H with a very simple hash function • One bucket for each square of the frame system, identified by (p,q) • Let (u,v) ∈ H(p,q) mean that the frame system with basis (u,v) has a point in the square (p,q) • H is filled in a preprocessing step on the model
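
A minimal sketch of this table, assuming unit squares obtained by rounding coordinates and the "origin at the first basis point, x-axis toward the second" frame convention. The helper names (`frame_coords`, `build_table`, `vote`) are hypothetical, and the point lists reuse the coordinates of Model (1,3) and Query (a,c) from the example as raw input.

```python
import math
from collections import defaultdict
from itertools import permutations

def frame_coords(points, i, k):
    """Coordinates in the frame with origin points[i], x-axis toward points[k]."""
    ox, oy = points[i]
    ax, ay = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(ax, ay)
    ux, uy = ax / length, ay / length
    return [((x - ox) * ux + (y - oy) * uy,
             -(x - ox) * uy + (y - oy) * ux) for x, y in points]

def square(pt, size=1.0):
    """Hash function: the grid square (p, q) that contains the point."""
    return (round(pt[0] / size), round(pt[1] / size))

def build_table(model, size=1.0):
    """Preprocessing: H[(p, q)] holds every model basis with a point in that square."""
    H = defaultdict(set)
    for u, v in permutations(range(len(model)), 2):
        for pt in frame_coords(model, u, v):
            H[square(pt, size)].add((u, v))
    return H

def vote(H, query, i, k, size=1.0):
    """One query frame system is compared against all model frame systems at once."""
    votes = defaultdict(int)
    for pt in frame_coords(query, i, k):
        for basis in H.get(square(pt, size), ()):
            votes[basis] += 1
    return votes

model = [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)]
query = [(0, 0), (3, -2), (8, 0), (6, 2), (10, 4), (3, 8), (0, 6)]
H = build_table(model)
votes = vote(H, query, 0, 2)   # query basis (a, c)
print(votes[(0, 2)])           # votes for model basis (1, 3)
```

Model basis (0, 2), i.e. points 1 and 3, collects four votes for query basis (a, c), in line with the four coinciding points of the example. The choice of square size trades tolerance to noise against accidental collisions.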

Hashing preprocessing example
