Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs Brian Y. Chen, Viacheslav Y. Fofanov, David M. Kristensen, Marek Kimmel, Olivier Lichtarge, Lydia E. Kavraki
Motivation
• Understanding the function of proteins is a fundamental goal of biology
• Experimental determination of protein function is expensive and time-consuming
• Algorithms for computational function prediction could guide and accelerate the protein function discovery process
A Computational Approach
• Comparative Analysis
  • Focus: Algorithms for Comparative Analysis
• What is similar about proteins with similar function?
  • Sequence – same components?
  • Geometry – same structure?
  • Dynamics – same motion?
(Same Chemistry)
What do we need?
• A motif for comparison
  • Representative of Biological function
• An algorithm for comparison
  • Search for Geometric and Chemical similarity
• Statistical analysis
  • Classification of results
Outline
• Evolutionary Trace (ET)
  • A source of biologically relevant motifs using evolutionary data
• Match Augmentation (MA)
  • An algorithm for identifying geometric similarity
• Statistical analysis
  • Statistically determined geometric thresholds
The Evolutionary Trace (ET)
[Figure: a seven-position multiple sequence alignment with its evolutionary tree, per-group consensus sequences and residue ranks, and a protein structure with the functional site highlighted]
Lichtarge et al, JMB 1996; Lichtarge et al, JMB 1997; Lichtarge et al, PNAS 1996; Sowa et al, NSB 2001
ET Clusters Functionally Relevant
[Figure: ligand binding sites and ET clusters compared for three proteins: Galectin CRD, Dihydropteroate Synthase, and Trp1 domain of Hop]
• Structural Epitope: Yellow = ligand, Blue = residues within 5 Å of the ligand
• ET Clusters: Yellow = ligand, Red = largest cluster, Other colors = trace residues
http://imgen.bcm.tmc.edu/molgenlabs/lichtarge/trace_of_the_week/traces.html
Geometric Motifs
• Trace Clusters are functionally relevant
  • A source for geometric motifs
• Geometric Motifs → Function
  • Given a protein structure with:
    • Same Amino Acids
    • Same Geometry and Chemistry
  • Does the protein have the same function?
Geometric Comparison Algorithms
• Geometric Hashing: Wolfson H.J. et al. IEEE Comp. Sci. Eng., 4(4):10–21, 1997.
• JESS: Barker J.A. et al. Bioinformatics, 19(13):1644–9, 2003.
• PINTS: Stark A. et al. Journal of Molecular Biology, 326:1307–16, 2003.
• Many Others
  • webFEATURE, DALI, CE, SSAP…
• Our method: Match Augmentation
  • Integrate Structural and Evolutionary data
  • Efficient application of hashing and depth first search
B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki, Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs, PSB 2005
Geometric Comparison Strategy
• Biological Input:
  • A structure of a functional site (Motif)
  • A protein structure with unknown function (Target)
• Geometric Search:
  • Find target atoms that are geometrically similar to the motif atoms and have compatible atom and amino acid labels (Match)
• Output:
  • Match of atoms with greatest geometric similarity
  • May identify a similar functional site in the target
Motifs: Structure & Evolution Data
• Structure of a Functional Site
  • Points in three dimensions (3D) taken from atom coordinates (motif points)
  • Labeled by residue and atom identity
• Alternate residues from mutation
  • Support for complex active sites
• Priority-ranked motif points
  • Functional relevance
[Figure: motif points ranked 1 to 4; one point is labeled with alternate residues {G,C,T} and atom Cα]
Input Data: Targets
• Targets
  • Points in 3D taken from atom coordinates of whole protein structures (target points)
  • Labeled by residue and atom identity
  • No alternate residues
  • No ranking
[Figure: a target point labeled with residue {Y} and atom Cβ]
Search: Matching Criteria
• Geometric Similarity
  • Matched points are within ε of each other when optimally superimposed
• Label Compatibility
  • Target residue label is a member of the motif point's alternate residues
  • Atom labels are identical
[Figure: a motif point labeled {S,L,T}, Cα matched to a target point labeled {S}, Cα at distance < ε]
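The two criteria above can be sketched as simple predicates. This is a minimal illustration, not the authors' implementation: the class names `MotifPoint` and `TargetPoint`, their fields, and the function names are hypothetical, and the distance check assumes the coordinates passed in have already been superimposed.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class MotifPoint:            # hypothetical container for one motif point
    xyz: tuple               # (x, y, z) atom coordinates
    atom: str                # atom label, e.g. "CA"
    residues: frozenset      # alternate residue labels, e.g. {"S", "L", "T"}
    rank: int                # priority rank, 1 = highest


@dataclass(frozen=True)
class TargetPoint:           # hypothetical container for one target point
    xyz: tuple
    atom: str
    residue: str             # single residue label, no alternates, no rank


def labels_compatible(m, t):
    """Atom labels identical and target residue among the motif's alternates."""
    return m.atom == t.atom and t.residue in m.residues


def within_epsilon(m_xyz, t_xyz, eps):
    """Geometric criterion for one pair, assuming prior superposition."""
    return dist(m_xyz, t_xyz) < eps


m = MotifPoint((0.0, 0.0, 0.0), "CA", frozenset({"S", "L", "T"}), 1)
t = TargetPoint((0.3, 0.0, 0.0), "CA", "S")
print(labels_compatible(m, t), within_epsilon(m.xyz, t.xyz, 0.5))
```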
Matches
• Matches correlate motif points to target points
  • Bijection between motif points and a subset of target points
  • Fulfill Geometric and Label Criteria
• Geometric Similarity measured by Least Root Mean Squared Distance (LRMSD)
• The match we seek:
  • Bijection of all motif points
  • Smallest LRMSD of all matches considered
[Figure: a motif, a target, and the match between them]
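As an illustration of what LRMSD measures, the sketch below computes the RMSD of two equally sized, index-paired point lists after optimal rigid superposition. The slides do not say how LRMSD is computed; this version assumes Horn's quaternion formulation and finds the largest eigenvalue by shifted power iteration so that it needs only the standard library. The function names are hypothetical.

```python
from math import sqrt


def centered(points):
    """Translate points so that their centroid is at the origin."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    return [[p[k] - mean[k] for k in range(3)] for p in points]


def lrmsd(a_pts, b_pts, iterations=500):
    """RMSD between paired points after optimal rotation and translation."""
    a, b = centered(a_pts), centered(b_pts)
    n = len(a)
    # Cross-covariance: S[j][k] = sum_i a_i[j] * b_i[k]
    S = [[sum(p[j] * q[k] for p, q in zip(a, b)) for k in range(3)]
         for j in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S
    # Horn's symmetric 4x4 matrix; its largest eigenvalue is the maximum
    # of sum_i (R a_i) . b_i over all rotations R.
    N = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Shifting by the Frobenius norm makes all eigenvalues non-negative,
    # so power iteration converges toward the largest eigenvalue of N.
    shift = sqrt(sum(v * v for row in N for v in row))
    q = [1.0, 0.3, 0.2, 0.1]
    for _ in range(iterations):
        w = [sum(N[r][c] * q[c] for c in range(4)) + shift * q[r]
             for r in range(4)]
        norm = sqrt(sum(v * v for v in w))
        if norm == 0.0:
            break
        q = [v / norm for v in w]
    lam = sum(q[r] * sum(N[r][c] * q[c] for c in range(4)) for r in range(4))
    ga = sum(v * v for p in a for v in p)
    gb = sum(v * v for p in b for v in p)
    return sqrt(max(0.0, (ga + gb - 2.0 * lam) / n))


motif = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
# The same points rotated 90 degrees about z and translated by (5, -2, 3).
moved = [(5, -2, 3), (5, -1, 3), (4, -2, 3), (5, -2, 4)]
print(lrmsd(motif, moved))
```

A rotated and translated copy of the motif should score close to zero, which is why LRMSD is a natural measure of geometric similarity between a motif and a candidate match.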
Match Augmentation at a Glance
• Design Principle:
  • Correlate high ranking points first
  • Exhaustively test potential matches
  • Filter for the match with lowest LRMSD
• Two Phases:
  • Seed Matching: match the high-ranked points
  • Augmentation: expand matches to the rest of the motif
[Figure: pipeline Input → Seed Matching → Augmentation → Output; seed matching covers motif points 1 to 3, and augmentation then adds points 4 and 5]
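The idea of extending a match one motif point at a time, highest priority first, can be sketched as a recursive depth-first search. This is a simplified illustration under stated assumptions, not the published algorithm: the hashing-based seed matching is not reproduced, partial matches are pruned by pairwise-distance consistency rather than by superposition, and the tuple layouts and the name `augment` are hypothetical.

```python
from math import dist


def augment(motif, target, eps, partial=()):
    """Yield complete matches as tuples of target indices.

    motif:   list of (xyz, atom, residues), sorted by priority (best first)
    target:  list of (xyz, atom, residue)
    partial: target indices already matched to motif[0:len(partial)]
    """
    k = len(partial)
    if k == len(motif):          # no more motif points to match
        yield partial
        return
    m_xyz, m_atom, m_res = motif[k]
    for j, (t_xyz, t_atom, t_res) in enumerate(target):
        # Label criteria, and each target point is used at most once.
        if j in partial or t_atom != m_atom or t_res not in m_res:
            continue
        # Simplified geometric pruning (an assumption of this sketch):
        # distances to already matched points must agree within eps.
        consistent = all(
            abs(dist(m_xyz, motif[i][0]) - dist(t_xyz, target[partial[i]][0])) < eps
            for i in range(k)
        )
        if consistent:
            yield from augment(motif, target, eps, partial + (j,))


motif = [
    ((0, 0, 0), "CA", {"G", "A"}),
    ((3, 0, 0), "CA", {"S"}),
    ((0, 4, 0), "CB", {"H"}),
]
target = [
    ((10, 10, 10), "CA", "L"),   # incompatible residue
    ((1, 1, 1), "CA", "G"),
    ((4, 1, 1), "CA", "S"),
    ((1, 5, 1), "CB", "H"),
    ((9, 1, 1), "CA", "S"),      # compatible label, wrong distance
]
matches = list(augment(motif, target, eps=0.5))
print(matches)
```

Each complete match produced this way would then be scored by LRMSD, and the lowest-scoring match kept.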
Filtering Completed Matches
• Augmentation implements a depth first search:
  • Data is stored in a heap of matches
  • Final output: match with smallest LRMSD
[Figure: completed matches (no more points to match) sorted by LRMSD, with example values 2.41 and 0.87; the match with LRMSD 0.87 is the final output]
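Keeping completed matches in a heap keyed by LRMSD makes the final filtering step a single pop. The sketch below uses Python's `heapq` and the two example LRMSD values from the figure; the function name `best_match` and the match tuples are hypothetical placeholders.

```python
import heapq


def best_match(scored_matches):
    """Return the (lrmsd, match) pair with the smallest LRMSD, or None."""
    heap = []
    for lrmsd_value, match in scored_matches:
        heapq.heappush(heap, (lrmsd_value, match))
    return heapq.heappop(heap) if heap else None


# Two completed matches, each a tuple of target indices (placeholders).
completed = [(2.41, (7, 12, 30)), (0.87, (4, 9, 21))]
print(best_match(completed))
```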
Match Augmentation Summary
• Hybrid Algorithm
  • Seed Matching: Hashing
  • Augmentation: Depth First Search
• Finds matches to motifs within target structures
• Final output: match with smallest LRMSD
Testing MA on Biological Data
• Data Set
  • 12 motifs selected from residues surrounding enzymatic active sites
  • 73 targets, each evolutionarily related to one of the motifs
  • Details: www.cs.rice.edu/~brianyc/papers/PSB2005/
• Experimental Protocol
  • Search for each motif within every target
  • Matches of evolutionarily related motif-target pairs are “HPs” (BLUE)
  • Matches of unrelated motif-target pairs are “NHPs” (RED)
Match Augmentation Conclusions
• Match Augmentation is accurate
  • Identifies cognate active sites in 95.4% of evolutionarily related proteins
• Match Augmentation is very efficient
  • Matches can be found in a fraction of a second
Evaluating Statistical Significance
• Hypothesis Testing Framework:
  • H0: Motif and Target are functionally unrelated
  • HA: Motif and Target are functionally related
• Reject H0 for a given match only if the match is unusual under H0
• Problem: how do we evaluate how unusual a given match is under H0?
The “Usual” H0 distribution
• The set of matches between the motif and all functionally unrelated targets
• Previous methods approximate this distribution:
  • JESS
    • Matches are compared to a reference population of motifs that is partially ordered by degree of occurrence
  • PINTS
    • Approximates the distribution of matches with an artificial curve parameterized by motif size and residue content
• MA can calculate this distribution explicitly by computing matches to the entire PDB
• JESS: Barker J.A. et al. Bioinformatics, 19(13):1644–9, 2003.
• PINTS: Stark A. et al. Journal of Molecular Biology, 326:1307–16, 2003.
A Distribution of Match LRMSDs
• LRMSD distribution of matches with entire PDB
  • Almost all known protein structures
  • Almost no functional relation to our motifs
  • Reasonable H0 Distribution
[Figure: unsmoothed and smoothed histograms of match frequency versus LRMSD]
How unusual is our match?
• We want: the probability of observing a match with lower LRMSD than a given match
  • A: area under the curve to the left of the match LRMSD (matches with lower LRMSD)
  • B: total area under the curve (all matches)
  • p-value: p = A / B
• Apply the p-value to reject H0
[Figure: frequency versus LRMSD curve with a vertical line at the match LRMSD]
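The ratio p = A / B can be estimated directly from a list of H0 match LRMSDs. The sketch below counts raw matches rather than integrating a smoothed curve, which is a simplification; the function name and the example values in `null_lrmsds` are made up for illustration.

```python
def empirical_p_value(match_lrmsd, null_lrmsds):
    """Fraction of H0 matches with lower LRMSD than the given match."""
    if not null_lrmsds:
        raise ValueError("the H0 sample must not be empty")
    lower = sum(1 for value in null_lrmsds if value < match_lrmsd)  # A
    return lower / len(null_lrmsds)                                 # A / B


# Made-up H0 sample: LRMSDs of matches to functionally unrelated targets.
null_lrmsds = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
print(empirical_p_value(0.87, null_lrmsds))  # 1 of 8 values is lower: 0.125
```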
Statistical Significance
• Result: Data-driven statistical significance value (p-value)
  • No dependence on approximations like previous work
• The p-value of a match is the probability of observing a match with lower LRMSD to a functionally unrelated target
• Apply p-value to reject H0
• Do matches identifying cognate active sites (HPs) have low p-values? (i.e. Can we reject H0 for HPs?)
Testing our Statistical Analysis
• Distributions of matches over the PDB can be calculated efficiently
  • 12:48 on a single machine, on average
• It is not necessary to scan the entire PDB to accurately determine the H0 distribution
  • A 5% random sample is accurate enough
  • Reduces sample time to 0:38, on average
• Matches of cognate active sites (HPs) are statistically significant (low p-values)
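Drawing the 5% sample is a one-liner with the standard library. The sketch below is an assumption about how such a sample could be drawn, not a description of the authors' procedure; `sample_targets`, its parameters, and the placeholder structure identifiers are hypothetical.

```python
import random


def sample_targets(target_ids, fraction=0.05, seed=0):
    """Draw a random subset of targets, without replacement."""
    rng = random.Random(seed)              # fixed seed for repeatability
    k = max(1, round(len(target_ids) * fraction))
    return rng.sample(target_ids, k)


all_targets = [f"structure_{i}" for i in range(1000)]  # placeholder IDs
subset = sample_targets(all_targets)
print(len(subset))
```

Matches would then be computed only against `subset`, and the resulting LRMSDs used as the H0 sample.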
Conclusions
• Match Augmentation is accurate and extremely efficient
  • Correctly identifies cognate active sites (HPs)
  • Identifies matches in fractions of a second
• Algorithmic efficiency enables detailed Statistical Analysis
  • Explicitly calculate H0 distribution without dependence on approximated H0 distributions
  • Matches of cognate active sites (HPs) are statistically significant
  • Significance threshold translates into useful motif-specific LRMSD thresholds
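One way to read the last point: given a significance level, the H0 sample for a motif determines the largest LRMSD a match may have and still be significant. The sketch below assumes the raw-count p-value (fraction of H0 matches with strictly lower LRMSD); the function name, the level `alpha`, and the example values are hypothetical, and the slides do not state which level was used.

```python
def lrmsd_threshold(null_lrmsds, alpha):
    """Largest LRMSD whose empirical p-value is still at most alpha."""
    ordered = sorted(null_lrmsds)
    allowed = int(alpha * len(ordered))    # H0 matches allowed to be lower
    if allowed >= len(ordered):
        return float("inf")                # every match would be significant
    # At most `allowed` values lie strictly below ordered[allowed].
    return ordered[allowed]


# Made-up H0 sample for one motif.
null_lrmsds = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
print(lrmsd_threshold(null_lrmsds, alpha=0.25))  # 1.5
```

Because each motif has its own H0 distribution, the same significance level yields a different LRMSD cutoff for each motif.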
Special Thanks
• Lichtarge Lab
  • David Kristensen
  • Dan Morgan
  • Ivana Mihalek
  • Hui Yao
• Kavraki Group
  • David Schwarz
  • Amarda Shehu
  • Allison Heath
  • Hernan Stamati
  • Anne Christian
  • Drew Bryant
  • Amanda Cruess
  • Brad Dodson
  • Jessica Wu
• Kimmel Group
  • Viacheslav Fofanov
• Funding
  • NSF
  • NLM 5T15LM07093
  • March of Dimes
  • Whitaker Foundation
  • Sloan Foundation
  • VIGRE
  • AMD