Reductionism and Classification Require Detailed Comparison: Consider 3D Comparison
PHAR 201/Bioinformatics I
Philip E. Bourne, Department of Pharmacology, UCSD
Reading: Chapter 16, Structural Bioinformatics
PHAR 201 Lecture 06, 2012
This course is intended to be a workflow, starting with understanding and representing complex biological data, through analysis of that data, to discoveries made through that analysis
• Understand the scope and complexity of the data
• Understand the experiment and the subsequent curation to understand the errors
• Understand how to best represent the data
• Identify relationships in the data
• Recognize redundancy in the data
• Classify the data
• Explore algorithms that analyze the data
• Make new biological discoveries from the data
Next Lecture
• We will establish the complex relationship between:
  • Sequence and Structure
  • Structure and Structure
  • Structure and Function
• Today we analyze how the relationships between structure and structure are established as a prerequisite

Agenda
• Understand why structure comparison is important
• Understand why it is not a solved problem
• Understand the basics of the methods used to address the problem
• Understand one method (CE) in more detail
• Review an example where structure comparison has revealed new biological insights

Why Structure Comparison is Important
• Reductionism: needed to classify protein structures
• Functional assignment and hopefully new biology
• Alignment of predicted structure against structural templates
• Establish improved sequence relationships not possible from sequence alone
• Protein engineering
Distinctions: Structure Superposition is Different from Structure Comparison and Alignment
• Structure superposition assumes you already know which atoms to superimpose; it merely optimizes the fit for the atoms chosen (relatively simple)
• Structure alignment must first determine which atoms to align (difficult). We are concerned with alignment
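To make the "relatively simple" half of this distinction concrete, here is a minimal sketch of superposition when the atom correspondence is already given. It uses the standard Kabsch (SVD-based) least-squares fit; the function name `superpose_rmsd` and the use of NumPy are assumptions made for illustration, not something taken from the lecture.

```python
import numpy as np

def superpose_rmsd(P, Q):
    """RMSD between two equally sized coordinate sets after optimal
    rigid-body superposition (Kabsch algorithm).
    Row i of P is assumed to correspond to row i of Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P onto Q and measure the residual.
    diff = (R @ Pc.T).T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

Nothing in this sketch decides which rows of `P` pair with which rows of `Q`; finding that pairing is the alignment problem the rest of the lecture addresses.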
Distinction: There are many structure alignment methods. How to compare?
• http://en.wikipedia.org/wiki/Structural_alignment_software
Distinctions: Pair-wise Alignments are Different from Multiple Structure Alignments
• Multiple structure alignment algorithms are rarer and of questionable quality (see for example Nucleic Acids Research (2004), 32 W100-W103)
• Multiple structure alignments should not be confused with multiple pair-wise alignments
• Here we focus on single pair-wise comparison and alignment
Why is it Not a Solved Problem?

Current State of the Art
• There are many papers published on this, but relatively few have code to download or Web sites from which to perform or look up comparisons
• All methods can identify obvious similarities between two structures
• Remote similarities are detected by a subset of methods; different remote similarities are recognized by different methods
• Good alignments are much harder to come by
• Speed is a serious issue with some algorithms

Desirables
• Biologically meaningful alignments, not just geometrically meaningful ones
• Complete database of all alignments
• Ability to apply to structures not in the PDB
Biological vs Geometric Alignments
Plastocyanin versus Azurin (from Godzik 1996)
• Maintain 9 of 10 interactions, RMSD 1.5 Å
• Maintain 5 of 10 interactions, RMSD 0.5 Å

Literature Alignments - Flavodoxin vs Che Y Protein
From Godzik (1996) Protein Science, 5, 1325-1338.
Understand the basics of the methods used to address the problem

How to Compare Structures?
1. Feature extraction: Structure 1 and Structure 2 become Structure description 1 and Structure description 2
2. Comparison algorithm: produces scores
3. Statistical significance: leads to similarity, classification
Components of Structure Alignment
1. Structure description
• Local geometry
• Side chain contacts
• Geometric hashing
• Distance matrix (e.g. Dali, 1993)
• Properties (secondary structure, hydrophobic clusters) (e.g. Comparer, 1990)
• Secondary structure elements (e.g. VAST, 1996)
• Distances of inter and intra aligned fragment pairs (e.g. CE, 1998)
• Contact map (e.g. Celera, 2004)
• Geometry invariants (e.g. Jia et al, 2004)
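As an illustration of one structure description from the list above, the distance matrix, here is a minimal sketch. It assumes each residue is represented by a single 3D point (for example its C-alpha atom); that choice and the function name `distance_matrix` are assumptions for illustration only.

```python
import math

def distance_matrix(coords):
    """Symmetric matrix of pairwise distances (in Å) between residues.
    coords: list of (x, y, z) tuples, one point per residue (assumed C-alpha)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]
```

The matrix is unchanged when the whole structure is rotated or translated, which is why distance-based descriptions can be compared without first superimposing the two structures.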
Components of Structure Alignment
2. Alignment algorithms
• Monte Carlo (e.g. Dali, VAST)
• Heuristics (e.g. CE)
• Dynamic programming (e.g. CE)
• Probabilistic
3. Statistical significance

Components of Global Structure Alignment
2. Alignment algorithms
• Input and output of the alignment algorithm
  • Input: two proteins
  • Output: an alignment and scores
  • Constraints: min RMSD, max L, min gaps
• Dynamic programming, integer programming, Monte Carlo…
3. Statistical significance
• Levitt and Gerstein, PNAS, 1998 (EVD)
• Random model and CE scoring function (Jia et al, 2004) (ROC)

From Levitt & Gerstein 1998
• “Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure whereas many pairs have significant similarity in terms of structure but not sequence.”
Understand one method (CE) in more detail
I.N. Shindyalov and P.E. Bourne, Protein Engineering 1998, 11(9) 739-747. Protein Structure Alignment by Incremental Combinatorial Extension of the Optimum Path. 1590 citations (GS)
Basic Approach
• Compare octameric fragments: an aligned fragment pair (AFP) (local alignments)
• Stitch together AFPs
• Find the optimal path through the AFPs
• Optimize the alignment through dynamic programming
• Measure the statistical significance of the alignment
Why This Approach? Alignment Space is Very Large and Must be Constrained Without Losing Meaningful Alignments
Similarity matrix S, where S = (nA − m)·(nB − m)
• m = length of AFP
• nA = length of protein A
• nB = length of protein B
This is very large to compute, so constraints are needed
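A quick worked example of the size formula on this slide. The chain lengths used (300 and 250 residues) are made-up illustrative values, and the function name is hypothetical.

```python
def similarity_matrix_size(n_a, n_b, m=8):
    """Number of entries in the AFP similarity matrix, S = (nA - m) * (nB - m),
    following the formula on the slide. m is the AFP length."""
    return (n_a - m) * (n_b - m)

# Hypothetical pair of chains with 300 and 250 residues:
print(similarity_matrix_size(300, 250))  # 292 * 242 = 70664
```

Each entry is only a candidate AFP; the cost comes from the number of ways those candidates can be chained into a path, which is what the constraints on the following slides limit.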
Definition of the Alignment Path
• pAi = starting residue position of the AFP in protein A at the ith position of the alignment path
• m = longest continuous path, set to 8
• One of the conditions (1)-(3) should be satisfied for 2 consecutive AFPs i and i+1 in the path:
  • (1) = two consecutive AFPs aligned without gaps
  • (2) = two consecutive AFPs with a gap in protein A
  • (3) = two consecutive AFPs with a gap in protein B

Extension of the Alignment Path
• Gap sizes are limited to G, heuristically set to 30 residues
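The path conditions and the gap limit can be sketched as a small check on AFP start positions. This is an illustrative reading of the slides, not the CE source code: it assumes that "a gap in protein A" means the next AFP starts later than expected in A while B continues without a break, and the function name `extension_type` is hypothetical.

```python
M = 8   # AFP length, from the slides
G = 30  # maximum gap size, from the slides

def extension_type(pa, pb, pa_next, pb_next, m=M, g=G):
    """Classify how the AFP starting at (pa_next, pb_next) follows the AFP
    starting at (pa, pb). Returns 'no_gap', 'gap_in_A', 'gap_in_B',
    or None when the step is not an allowed extension."""
    gap_a = pa_next - (pa + m)  # residues skipped in protein A (assumed meaning)
    gap_b = pb_next - (pb + m)  # residues skipped in protein B (assumed meaning)
    if gap_a == 0 and gap_b == 0:
        return "no_gap"        # condition (1)
    if 0 < gap_a <= g and gap_b == 0:
        return "gap_in_A"      # condition (2), gap limited to G
    if gap_a == 0 and 0 < gap_b <= g:
        return "gap_in_B"      # condition (3), gap limited to G
    return None                # gaps in both proteins, overlap, or gap > G
```

Under this reading a step may open a gap in one protein only, which is one of the ways the search space is kept manageable.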
Evaluation Based Upon the Following Three Distance Similarity Measures
1. Distance calculated from an independent set of inter-residue distances, where each distance is used only once: used for combinations of 2 AFPs
2. Full set of inter-residue distances: used for a single AFP
3. RMSD from least squares superposition: used to select a few best fragments
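A minimal sketch of the second measure (full set of inter-residue distances). It averages the absolute differences between corresponding distances in the two proteins over all m × m residue pairs drawn from two fragments. The exact normalization used by CE is given in the cited paper; the averaging here, the one-point-per-residue representation, and the function name are assumptions for illustration.

```python
import math

def full_set_distance(ca_a, ca_b, pa_i, pb_i, pa_j, pb_j, m=8):
    """Mean absolute difference between inter-residue distances in protein A
    and the corresponding distances in protein B, over all m*m residue pairs
    from AFP i (starting at pa_i / pb_i) and AFP j (starting at pa_j / pb_j).
    ca_a, ca_b: lists of (x, y, z) points, one per residue (assumed C-alpha).
    Pass the same start positions for i and j to score a single AFP."""
    total = 0.0
    for k in range(m):
        for l in range(m):
            d_a = math.dist(ca_a[pa_i + k], ca_a[pa_j + l])
            d_b = math.dist(ca_b[pb_i + k], ca_b[pb_j + l])
            total += abs(d_a - d_b)
    return total / (m * m)
```

Because only internal distances are compared, no superposition is needed at this stage; the RMSD measure (3) is reserved for the few best candidates.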
How to Extend the Path?
1. Consider all possible AFPs that extend the path: computationally expensive
2. Consider only the best AFP: works well with the right heuristics
3. Use some intermediate strategy
What Heuristics?
• Candidate AFPs are based upon (9), with D0 = 3 Å
• The best AFP is based upon (10), with D1 = 4 Å
• The decision to extend or terminate the path is based upon (11)
Z-Score
• Evaluate the probability of finding, in a random set of non-redundant structures, an alignment path of the same length with equal or smaller gaps and distance
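For orientation, the generic definition of a Z-score can be sketched as follows. This is the textbook statistic, not CE's exact procedure, which the slide describes in terms of probabilities over path length, gaps and distance; the function name and the sample numbers are hypothetical.

```python
import statistics

def z_score(observed, random_values):
    """How many standard deviations an observed score lies above the mean
    of scores obtained from a random background set."""
    mu = statistics.mean(random_values)
    sigma = statistics.stdev(random_values)  # sample standard deviation
    return (observed - mu) / sigma

# Hypothetical background scores 2, 4, 6 (mean 4, stdev 2) and observed score 10:
print(z_score(10, [2, 4, 6]))  # 3.0
```

For a measure where smaller is better, such as a distance, the sign would be flipped so that better alignments still receive higher Z-scores.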
Optimization of the Final Path
• The 20 best alignments with a Z-score above 3.5 are assessed based on RMSD and the best is kept. This produces approx. one error in 1000 structures
• Each gap in this alignment is assessed for relocation up to m/2
• Iterative optimization using dynamic programming is performed using residues for the superimposed structures
Test Case: Phycocyanin versus Colicin A

Cyclin-dependent kinases
Open (purple), Closed (blue). Pavletich et al. (1997)
Limitations
• Will not find non-topological alignments (outside the bounds of the dotted lines)
• What are the correct “units” to be comparing?
• CE initially worked on chains. As we shall see in future weeks, domains are the correct units, but definition of the domains is not straightforward
Original Computation of All x All
• Took 11,748 chains in the PDB (1/98)
• Computed for 1868 representatives
• 24,000 Cray T3E processor hours
• Loaded pairwise alignments into database
Later
• 40,000 proteins ~ 70,000 chains
• 70,000²/2 pairs × 30 seconds ≈ 2330 years
• Options:
  • Use a non-redundant set of chains
  • Use parallel architectures
D. Pekurovsky, I.N. Shindyalov, P.E. Bourne 2004 High Throughput Biological Data Processing on Massively Parallel Computers. A Case Study of Pairwise Structure Comparison and Alignment Using the Combinatorial Extension (CE) Algorithm. Bioinformatics, 20(12) 1940-1947.
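The estimate on this slide can be checked with a few lines of arithmetic. The function name is hypothetical; the inputs (70,000 chains, 30 seconds per pair, n²/2 pairs) are the figures from the slide, and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year

def all_by_all_years(n_chains, seconds_per_pair):
    """Serial compute time, in years, for an all-by-all comparison,
    approximating the number of pairs as n^2 / 2 as on the slide."""
    pairs = n_chains ** 2 / 2
    return pairs * seconds_per_pair / SECONDS_PER_YEAR

print(round(all_by_all_years(70_000, 30)))  # about 2331 years, matching the slide
```

Because the cost grows with the square of the number of chains, halving the set through redundancy removal cuts the work roughly fourfold, which is why both options on the slide matter.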
Now
• Used Open Science Grid to compute all by all for Rigid FatCat and CE based on domains

All by All Structural Alignment
Objective: Identify novel architectures or unexpected structural similarities in the absence of sequence similarity.
Representative chains from 40% sequence identity clusters are aligned with jFATCAT
• Example: Green Fluorescent Protein (GFP)
• Nidogen-1 (NID-1): similar 11-stranded beta-barrel and internal helices
• 3 Å RMSD, only 9% sequence identity
• NID-1 is a component of basement membrane, no chromophore
• GFP and NID-1 may share a common ancestor
One Set of Criteria for Redundancy
Remove highly homologous chains, where:
• The RMSD between two chains is less than 2 Å;
• The length difference between two chains is less than 10%;
• The number of gap positions in the alignment between two chains is less than 20% of aligned residue positions;
• At least 2/3 of the residue positions in the represented chain are aligned with the representing chain.
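The four conditions can be combined into a single predicate, sketched below. The slide does not say what the 10% length difference is measured against; this sketch assumes the longer chain. The function and parameter names are hypothetical.

```python
def is_redundant(rmsd, len_represented, len_representing,
                 n_gap_positions, n_aligned):
    """True if the represented chain can be dropped in favour of the
    representing chain, using the thresholds from the slide."""
    # Assumption: length difference is taken relative to the longer chain.
    longer = max(len_represented, len_representing)
    length_diff = abs(len_represented - len_representing) / longer
    return (
        rmsd < 2.0                                   # RMSD below 2 Å
        and length_diff < 0.10                       # lengths within 10%
        and n_gap_positions < 0.20 * n_aligned       # gaps under 20% of aligned positions
        and n_aligned >= (2 / 3) * len_represented   # at least 2/3 of the chain aligned
    )

# Hypothetical pair: 100 vs 105 residues, 90 aligned, 5 gap positions, RMSD 1.2 Å
print(is_redundant(1.2, 100, 105, 5, 90))  # True
```

All four conditions must hold together, so a pair with a low RMSD over only a short aligned region is not treated as redundant.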
Review example where structure comparison has revealed new biological insights

Example: Structural similarity between acetylcholinesterase and calmodulin found using CE (Tsigelny et al, Prot Sci, 2000, 9:180)
• CE revealed a putative Ca++ binding domain in acetylcholinesterase
• Sequence similarity to neuroligins predicts Ca++ binding too, confirmed experimentally
• Members of the α/β hydrolase family bind Ca++, which may be important for heterologous cell associations
The Future (also a general rule)
• Gold standards are important
• For structure comparison a human generated alignment standard is important
• Algorithms are then challenged to meet the standard
• Eventually those algorithms highlight problems with the standard
• The cycle continues