
Mining Patterns from Protein Structures

Wei Wang, University of North Carolina at Chapel Hill


Presentation Transcript


  1. Mining Patterns from Protein Structures • Wei Wang, University of North Carolina at Chapel Hill

  2. Outline • Introduction • Motivation • Challenges • Graph-based Pattern Discovery in Protein Structures • Applications • Conclusions • Future Directions

  3. Introduction • Protein • A sequence from 20 amino acids • Adopts a stable 3D structure that can be measured experimentally [Figure: example structure with residues Lys, Lys, Gly, Gly, Leu, Val, Ala, His; legend: Oxygen, Nitrogen, Carbon, Sulfur, Ribbon]

  4. Introduction • Structure patterns are geometric arrangements of amino acids that are common to a group of different proteins. [Figure: serine protease active center in three proteins with the same function: 1HJ9, 1R64, 1SSX]

  5. Motivations • Structure patterns are useful in: • Protein structure alignment • Protein design • Prediction of protein-protein interactions • Understanding protein folding • Drug design

  6. Goal • Develop techniques to discover structure patterns that are • Efficient • Effective

  7. Challenges • Define mathematical models to represent protein structures • Point set • Labeled graph • Define computational components • Define structure pattern • Specify a matching condition • Design a search procedure • Evaluate the results • Computational efficiency and effectiveness [Figure: Growth of Known Structures in Protein Data Bank; x-axis: Year (1988 to 2005); y-axis: # of structures (scale to 35,000); series: the total number of known protein structures, newly characterized proteins in that year]

  8. The Nature of Protein Structure Data • The ball-stick model is an element-based structure representation • A structure is decomposed into a set of amino acids • Protein geometry, topology, and attributes are defined with respect to the amino acid set

  9. Components of Pattern Discovery • The definition of patterns • Geometry vs. topology • The matching condition • Measures the fitness of a pattern to a set of protein structures • The search procedure

  10. Related Work • Protein Local Structure Comparison Problem • Pattern Matching • ASSAM, Artymiuk et al., JMB’94 • TESS, Wallace et al., Prot. Sci.’97 • Pattern Discovery • Sequence-dependent • TRILOGY, Bradley et al., RECOMB’01 • Sequence-independent • Pair-wise comparison • PINTS, Russell, JMB’98 • Geometric Hashing, Fischer et al., Prot. Sci.’94 • Graph Matching, Schmitt et al., JMB’02 • Multi-way comparison • Evolutionary Trace, Lichtarge et al., JMB’96 • FFSM & its variants, Huan et al., ICDM’03, RECOMB’04, CSB’06 • Reference: Huan et al., Advances in Computers

  11. Our Approach • Start from a group of protein structures • Represent each structure as a labeled graph • Discover frequently occurring subgraphs • Map subgraphs to protein structures and obtain structure patterns • Use the patterns to: • Predict protein function • Identify functional sites in proteins • Discover patterns in structure evolution

  12. Outline • Introduction • Graph-based Pattern Discovery in Protein Structures • Labeled graphs and representing structures as labeled graphs • Frequent subgraph mining • Applications • Conclusions • Future Directions

  13. Labeled Graphs • A labeled graph is a graph where each node and each edge has a label. [Figure: three example labeled graphs, G1 (nodes p1 to p5), G2 (nodes q1 to q3), and G3 (nodes s1 to s4), with node labels a, b, c, d and edge labels x, y]

  14. Protein Contact Map • Use a labeled graph to represent a protein structure • Nodes represent amino acids, labeled by the identity of the amino acids • Edges connect two amino acids if their Euclidean distance is less than a certain threshold [Figure labels: Contact; A protein]
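The contact-map construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `contact_graph`, the input format (one label and one 3D point per residue), and the example coordinates are hypothetical, and the 6.5 angstrom default is borrowed from the threshold quoted later in the scalability slide.

```python
import math
from itertools import combinations


def contact_graph(residues, threshold=6.5):
    """Build a contact graph from (amino_acid, (x, y, z)) pairs.

    Returns (nodes, edges): nodes maps a residue index to its amino-acid
    label, and edges is a set of index pairs (i, j) with i < j.
    """
    nodes = {i: name for i, (name, _) in enumerate(residues)}
    edges = set()
    for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2):
        # Two residues are in contact if they are closer than the threshold.
        if math.dist(p, q) < threshold:
            edges.add((i, j))
    return nodes, edges


# Hypothetical coordinates in angstroms, for illustration only.
residues = [
    ("Lys", (0.0, 0.0, 0.0)),
    ("Gly", (3.0, 4.0, 0.0)),
    ("His", (20.0, 0.0, 0.0)),
]
nodes, edges = contact_graph(residues)
print(nodes)
print(edges)
```

The pairwise loop is quadratic in the number of residues, which is acceptable for a sketch; a spatial index would be the usual choice for large structures.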

  15. Pattern Matching • A graph G is subgraph isomorphic to a graph G’, denoted by G ⊆ G’, if there exists a 1-1 mapping from nodes in G to nodes in G’ such that node labels, edges, and edge labels are preserved by the mapping. • A pattern is a graph. Pattern G matches G’ if G ⊆ G’ • G occurs in G’ if G ⊆ G’. • With a label set, a graph space is a collection of graphs whose labels are from the set. [Figure: graphs G1 (nodes p1 to p5), G2 (nodes q1 to q3), G3 (nodes s1 to s4), and a pattern G (nodes g1 to g3)]
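A minimal backtracking sketch of the subgraph isomorphism test defined on this slide follows. It is not the incremental test used by FFSM; the graph encoding (a dict of node labels plus a dict of edge labels keyed by `frozenset` node pairs), the helper `make_graph`, and the example graph are assumptions made for illustration.

```python
def make_graph(node_labels, labeled_edges):
    """Hypothetical helper: pack a graph as (node labels, edge labels)."""
    edges = {frozenset((u, v)): lab for u, v, lab in labeled_edges}
    return dict(node_labels), edges


def subgraph_isomorphic(pattern, target):
    """True if pattern maps 1-1 into target, preserving labels and edges."""
    p_nodes, p_edges = pattern
    t_nodes, t_edges = target
    order = list(p_nodes)

    def extend(mapping):
        if len(mapping) == len(order):
            return True  # every pattern node has an image
        u = order[len(mapping)]
        for v in t_nodes:
            if v in mapping.values() or p_nodes[u] != t_nodes[v]:
                continue  # image already used, or node labels differ
            # Every pattern edge from u to an already-mapped node must
            # exist in the target with the same edge label.
            consistent = all(
                t_edges.get(frozenset((v, mapping[w]))) == p_edges[frozenset((u, w))]
                for w in mapping
                if frozenset((u, w)) in p_edges
            )
            if consistent:
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]  # backtrack
        return False

    return extend({})


# Example graph (labels chosen for illustration).
target = make_graph(
    {"p1": "a", "p2": "b", "p3": "b", "p4": "c"},
    [("p1", "p2", "x"), ("p1", "p3", "x"), ("p2", "p3", "y"), ("p2", "p4", "x")],
)
pattern = make_graph({"g1": "b", "g2": "b", "g3": "c"},
                     [("g1", "g2", "y"), ("g2", "g3", "x")])
print(subgraph_isomorphic(pattern, target))
```

Extra edges in the target between matched nodes are allowed here; the induced variant mentioned on a later slide would additionally reject them.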

  16. Subgraph Mining: Notations Cont. • The support value of a pattern P in a collection of graphs G is the fraction of graphs in G where P occurs. • Given a collection of graphs G and a threshold 0 < σ ≤ 1, the frequent subgraph mining problem is the identification of all patterns that have support at least σ.
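The support definition can be made concrete with a small sketch. The threshold is written `sigma` here, the occurrence test is a brute-force search over node assignments (fine for tiny examples, not how a real miner would do it), and the graph encoding and example graphs are assumptions introduced only for this illustration.

```python
from fractions import Fraction
from itertools import permutations


def occurs_in(pattern, graph):
    """Brute-force check that pattern is subgraph isomorphic to graph.

    A graph is (node_labels, edge_labels), with edge_labels keyed by a
    frozenset of the two endpoint names.
    """
    p_nodes, p_edges = pattern
    g_nodes, g_edges = graph
    names = list(p_nodes)
    for image in permutations(g_nodes, len(names)):
        m = dict(zip(names, image))
        labels_ok = all(p_nodes[u] == g_nodes[m[u]] for u in names)
        edges_ok = all(
            g_edges.get(frozenset(m[u] for u in e)) == lab
            for e, lab in p_edges.items()
        )
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, graphs):
    """Fraction of graphs in which the pattern occurs."""
    return Fraction(sum(occurs_in(pattern, g) for g in graphs), len(graphs))


def is_frequent(pattern, graphs, sigma):
    return support(pattern, graphs) >= sigma


# Three small example graphs (illustrative only).
g1 = ({1: "a", 2: "b", 3: "b"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "y"})
g2 = ({1: "a", 2: "b"}, {frozenset((1, 2)): "x"})
g3 = ({1: "b", 2: "b", 3: "c"}, {frozenset((1, 2)): "y"})
a_x_b = ({"u": "a", "v": "b"}, {frozenset(("u", "v")): "x"})
print(support(a_x_b, [g1, g2, g3]))
```

Using `Fraction` keeps comparisons such as "support at least 2/3" exact rather than subject to floating-point rounding.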

  17. Examples • The induced subgraph isomorphism penalizes any unmatched edges • σ = 2/3 • +: induced frequent subgraphs [Figure: graphs G1, G2, G3 and candidate patterns P1 to P6, each annotated with its support f]

  18. Examples • Maximal frequent subgraphs are frequent subgraphs none of whose supergraphs are frequent • Other criteria for selecting subgraphs may be incorporated • σ = 2/3 • !: maximal frequent subgraphs [Figure: graphs G1, G2, G3 and candidate patterns P1 to P6, each annotated with its support f]

  19. Search DAG • Task: identify all frequently occurring subgraphs from a group of graphs, or a graph database • Support anti-monotonicity • Any supergraph of an infrequent subgraph is infrequent • Known as the Apriori property • Level-wise search • Keep all patterns with the same size in memory (poor memory utilization) • Depth-first search • Better memory utilization • May repeatedly search patterns in the DAG (redundant candidates)
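To show how the Apriori property prunes a level-wise search, here is a deliberately simplified sketch. As an assumption made only to keep the example short, each graph is reduced to the set of its labeled-edge types (strings such as "a-x-b"), so a pattern is a set of edge types rather than a connected subgraph; real frequent subgraph miners extend graphs and need isomorphism tests. The function name and data are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def frequent_edge_sets(graphs, sigma):
    """Level-wise (Apriori-style) search over sets of labeled-edge types."""
    n = len(graphs)

    def support(pattern):
        return Fraction(sum(pattern <= g for g in graphs), n)

    items = sorted({item for g in graphs for item in g})
    # Level 1: frequent single edge types.
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= sigma]
    frequent = {}
    while level:
        frequent.update((p, support(p)) for p in level)
        size = len(level[0]) + 1
        known = set(level)
        # Join step: merge same-size frequent patterns into larger candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == size}

        def all_subsets_frequent(c):
            # Apriori pruning: a candidate with an infrequent subset
            # cannot be frequent, so its support is never counted.
            return all(frozenset(s) in known for s in combinations(c, size - 1))

        level = [c for c in candidates
                 if all_subsets_frequent(c) and support(c) >= sigma]
    return frequent


graphs = [
    {"a-x-b", "b-y-b", "b-x-c"},
    {"a-x-b", "b-y-b"},
    {"a-x-b", "b-x-c"},
]
for pattern, value in frequent_edge_sets(graphs, Fraction(2, 3)).items():
    print(sorted(pattern), value)
```

The whole current level is held in memory at once, which is the memory cost the slide attributes to level-wise search; a depth-first strategy would instead follow one chain of extensions at a time.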

  20. Related Work • Level-wise search • AGM: Inokuchi et al., PKDD’00 • FSG: Kuramochi & Karypis, ICDM’01 • Depth-first search • gSpan, Yan & Han, ICDM’02, KDD’03 • FFSM, Huan et al., ICDM’03 • Path-based search • Vanetik, et al., ICDM’02, ICDE’04 • GASTON: Nijssen & Kok, KDD’04 • Tree-based search • SPIN, Huan et al., SIGKDD’04 • Mining with constraints • CSM, Huan et al., CSB’06

  21. The Fast Frequent Subgraph Mining (FFSM) Overview • Graph normalization • Graph Canonical Adjacency Matrix Tree (CAM Tree) • Incremental subgraph isomorphism test • Reference: Huan et al., ICDM 2003

  22. Intuitions for Graph Normalization [Diagram: a graph space, with a partial order defined on the graph space; an arbitrary set Γ, with a partial order defined on Γ; a 1-1 mapping between them]

  23. Graph Normalization • With a partially ordered set (Γ, ≤), φ: G* → Γ that maps a graph space G* to Γ is a graph normalization function if φ is a 1-1 mapping. • (Mapping partial order ≤φ) Given a graph normalization φ and its codomain (Γ, ≤), we define a binary relation ≤φ ⊆ G* × G* such that P ≤φ Q if φ(P) ≤ φ(Q) • Claim: ≤φ is a partial order

  24. Ideal Normalization • Given a partially ordered codomain (Γ, ≤), a normalization function φ: G* → Γ is an ideal normalization if • ≤φ induces a search tree (no redundant candidates) • ≤φ is a subset of the subgraph relation, i.e. for all graphs P and Q, P ≤φ Q implies P ⊆ Q (anti-monotonicity of support)

  25. Graph Canonical Code • The canonical code (θ) maps a graph G to a string. • Claim: θ: G* → (Σ*, ≤) is a graph normalization • Claim: θ: G* → (Σ*, ≤) is an ideal graph normalization • Code(M1): (1, 1, a)(2, 1, x)(2, 2, b)(3, 1, x)(3, 2, y)(3, 3, b)(4, 2, x)(4, 4, c) < Code(M2): (1, 1, a)(2, 1, x)(2, 2, b)(3, 2, x)(3, 3, c)(4, 1, x)(4, 2, y)(4, 4, b) < Code(M3): (1, 1, a)(2, 2, c)(3, 1, x)(3, 2, x)(3, 3, b)(4, 1, x)(4, 3, y)(4, 4, b) • θ(P) = (1, 1, a)(2, 1, x)(2, 2, b)(3, 1, x)(3, 2, y)(3, 3, b)(4, 2, x)(4, 4, c) • (i, j, Mi,j) ≤ (k, l, Mk,l) if • i < k, or • i = k, j < l, or • i = k, j = l, Mi,j ≤ Mk,l [Figure: graph P (nodes p1 to p4) with three of its adjacency matrices M1, M2, M3, and a second graph P’ (nodes p’1 to p’4)]
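The code construction on this slide can be reproduced with a brute-force sketch. It assumes, as the slide's example suggests, that the canonical code is the smallest code over all node orderings, and that only nonzero lower-triangular entries are listed in row-major order. Enumerating every permutation is exponential and is not how FFSM computes canonical forms; the function names and the dictionary encoding are hypothetical.

```python
from itertools import permutations


def code_of(order, node_label, edge_label):
    """Code of the adjacency matrix induced by one node ordering.

    Lists the nonzero lower-triangular entries row by row as
    (row, column, label); diagonal entries carry node labels.
    """
    code = []
    for i, u in enumerate(order, start=1):
        for j, v in enumerate(order[:i], start=1):
            if i == j:
                code.append((i, i, node_label[u]))
            else:
                lab = edge_label.get(frozenset((u, v)))
                if lab is not None:
                    code.append((i, j, lab))
    return code


def canonical_code(node_label, edge_label):
    """Smallest code over all node orderings (brute force)."""
    # Python compares lists of tuples lexicographically, which matches the
    # slide's rule: row first, then column, then the label.
    return min(code_of(p, node_label, edge_label)
               for p in permutations(node_label))


# Graph P as read off Code(M1): nodes a, b, b, c with four labeled edges.
nodes = {"p1": "a", "p2": "b", "p3": "b", "p4": "c"}
edges = {
    frozenset(("p1", "p2")): "x",
    frozenset(("p1", "p3")): "x",
    frozenset(("p2", "p3")): "y",
    frozenset(("p2", "p4")): "x",
}
print(canonical_code(nodes, edges))
```

Because the minimum is taken over every ordering, two isomorphic graphs receive the same code regardless of how their nodes are named, which is what makes the code usable as a normalization.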

  26. FFSM Search • Task: identify all frequently occurring subgraphs from a family of graphs • Depth-first search • Better memory utilization • Apriori property • Eliminate unnecessary isomorphism checks • Graph normalization • Avoid redundant examination • Subgraph isomorphism test is NP-complete • Incremental isomorphism check • Applies to frequent induced subgraph mining with minor modifications

  27. Performance of FFSM • PTE (Predictive Toxicology Evaluation) data set • Contains 340 chemicals • Running times were collected from the literature, where experiments were performed on different hardware configurations (400 MHz PIII to 2 GHz PIV) • Software downloadable from http://www.cs.unc.edu/~huan • AGM: Inokuchi et al. PKDD’00 • FSG: Kuramochi & Karypis, ICDM’01 • gSpan: Yan & Han, ICDM’02 • FFSM: Huan et al. ICDM’03 • Gaston: Nijssen & Kok, KDD’04 [Figure: running time (s) on the PTE data set, shown with an example chemical structure]

  28. FFSM Scalability • Serine protease: • Contains 40 proteins • A contact is defined between every pair of distinct residues if the distance between their Cα atoms is less than a certain upper bound (e.g. 6.5 angstroms) • Performance was measured on a single 2 GHz PIV CPU with 2 GB main memory • gSpan handles graphs with no more than 254 edges • Gaston runs out of memory [Figure: running time (s)]

  29. Outline • Introduction • Graph-based Pattern Discovery in Protein Structures • Applications • MotifSpace Architecture • Identify functional sites in proteins • Predict protein function • Conclusions • Future Directions

  30. Effectiveness • Serine proteases have three subclasses • Subtilisins • Eukaryotic serine proteases • Prokaryotic serine proteases [Figure: structures 1HJ9, 1R64, 1SSX]

  31. Frequent Patterns • 20 highly specific patterns mined from serine proteases • # of patterns is the total number of fingerprints a protein has • The coverage of a protein is the fraction of residues covered by at least one fingerprint (%) • Length (of the protein) is displayed in units of 200 residues
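The coverage measure defined on this slide amounts to a one-line computation. In the sketch below each fingerprint is assumed to be given as the set of residue indices it touches; that representation, the function name, and the numbers are hypothetical.

```python
def coverage(protein_length, fingerprints):
    """Percentage of residues covered by at least one fingerprint.

    protein_length: number of residues in the protein.
    fingerprints: iterable of sets of residue indices.
    """
    covered = set().union(*fingerprints)  # residues hit by any fingerprint
    return 100 * len(covered) / protein_length


# A 10-residue toy protein with two overlapping fingerprints.
print(coverage(10, [{1, 2, 3}, {3, 4}]))
```

Residue 3 appears in both fingerprints but is counted once, which is what "covered by at least one fingerprint" requires.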

  32. Patterns’ Biological Relevance [Figure: structures 1HJ9, 1MD8, 1OP0, 1OS8, 1PQ7, 1P57, 1SSX, 1S83]

  33. More Case Studies • Papain-like cysteine proteases • Nuclear receptor ligand binding domains • NADP/FAD binding proteins [Figure panels: Papain-like cysteine protease; Nuclear Binding domains; NADP binding proteins]

  34. Predict Protein Function • How does a protein function in a biological system? • Functional motifs carry out protein function [Diagram labels: 3D structure of a protein; Function]

  35. Distinguishing Families with Different Function • The TIM barrel fold contains many proteins with similar structures but different functions [Table columns: Abr., Name, #M, #P; #M: number of members in a family; #P: number of patterns obtained from the family] • Reference: Bandyopadhyay, Huan et al. Prot. Sci. ‘06

  36. Functional Inference for 1TWU • 1ecs: SCOP 54598, Antibiotic resistance protein, Glyoxalase / bleomycin resistance / dioxygenase superfamily; 4 members (SCOP 1.65), 62 family-specific spatial motifs • 1twu (Yyce): unknown function, not in SCOP 1.67, DALI z < 10 in Nov 2004; 46 motifs found, structurally similar to the three new non-redundant AR proteins added in SCOP 1.67

  37. MotifSpace Architecture [Diagram labels: GO, CATH, SCOP; Protein Data Bank; Biological Experiments; testable hypotheses; Experimental validation; protein structures; protein family; Pattern Filter; Pattern Miner; Protein Classifier; Pattern Validation; Subgraph mining; Visualization; Classification; Feature selection; structure patterns; family-specific patterns; Structure Pattern Database; Functional Motifs Knowledgebase; Indexing & Search; Knowledge management] • Reference: Huan et al. ISMB’05 demo, http://escience2-cs.cs.unc.edu/Default.aspx

  38. Summary Goal: pattern discovery in protein structures • Develop labeled graph representations for protein structures • Design algorithms to identify recurring subgraphs in a collection of graphs • Frequent, constrained, maximal, or coherent subgraph mining • Performance evaluation on various data sets • Collaborate with domain experts to evaluate the utility of the algorithms • Predict function for protein structures • Identify structure patterns in protein fold families

  39. Future Work • Pattern discovery in protein structures • Approximate pattern discovery • More applications: • Protein-protein interaction • Protein subcellular localization

  40. Complex Data in Biology Data Models Biological Data Volume

  41. Data Analysis in Biological Systems • Challenges: • What is the nature of the data from biological systems? • What are the computational tasks? • How can the tasks be divided into a group of computational components? • How should the results be evaluated? [Figure: biological systems at the molecular level. Source: http://bioinformatics.ca/workshop_pages/bioinformatics/]

  42. Acknowledgements • Collaborators: Charlie Carter (UNC School of Medicine), Nikolay Dokholyan (UNC School of Medicine), Leonard McMillan, Jan Prins, Jack Snoeyink, Alexander Tropsha (UNC School of Pharmacy) • Students: Deepak Bandyopadhyay, Yetian Chen (UNC School of Pharmacy), Jun Huan, Jinze Liu, Ruchir Shah (UNC School of Pharmacy), Kiran Sidhu, Xueyi Wang, David Williams, Tao Xie, Jingdan Zhang
