MotifSpace
Mining Patterns in Protein Structures: Algorithms and Applications
Wei Wang, UNC Chapel Hill, weiwang@cs.unc.edu
Proteins Are the Machinery of Life
[Figure labels: Protein Structure Initiative; Protein Data Bank; Function; Spatial motifs. Example structures: Serine protease; Papain-like Cysteine protease; GTP binding protein]
MotifSpace
[Architecture diagram; labels recovered from the figure]
• Data sources: Protein Data Bank; Digital Library; protein classification (EC, GO, CATH, SCOP)
• User input: protein structures, articles, protein family
• Components: Motif Miner, Motif Filter, Protein Classifier, Knowledge Retriever, Motif Navigator
• Techniques: Subgraph mining, Feature selection, Association discovery, Classification, Info retrieval, Text mining, Visualization, Indexing & Search, Knowledge management
• Intermediate results: spatial motifs, family-specific motifs, experimental knowledge
• Stores: Spatial Motif Database, Spatial Motif Knowledgebase
Modeling a Protein by a Set of Points
• Amino acids can be represented by points in 3D space.
ATOM  156  C    GLY  A  38   43.696  71.361  61.773  1.00  25.96  C
ATOM  157  O    GLY  A  38   43.916  70.461  62.583  1.00  27.40  O
ATOM  158  N    HIS  A  39   43.506  72.626  62.145  1.00  25.72  N
ATOM  159  CA   HIS  A  39   43.583  73.021  63.550  1.00  22.52  C
ATOM  160  C    HIS  A  39   42.367  73.829  63.983  1.00  19.35  C
ATOM  161  O    HIS  A  39   41.790  74.562  63.187  1.00  20.24  O
ATOM  162  CB   HIS  A  39   44.821  73.890  63.798  1.00  26.08  C
ATOM  163  CG   HIS  A  39   46.117  73.173  63.590  1.00  32.47  C
ATOM  164  ND1  HIS  A  39   46.786  72.533  64.612  1.00  34.50  N
ATOM  165  CD2  HIS  A  39   46.850  72.967  62.471  1.00  31.79  C
ATOM  166  CE1  HIS  A  39   47.875  71.961  64.129  1.00  36.40  C
ATOM  167  NE2  HIS  A  39   47.937  72.209  62.832  1.00  31.42  N
ATOM  168  N    LEU  A  40   41.986  73.701  65.248  1.00  22.27  N
ATOM  169  CA   LEU  A  40   40.851  74.468  65.724  1.00  21.68  C
ATOM  170  C    LEU  A  40   41.226  75.942  65.709  1.00  23.21  C
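One way to turn ATOM records like those above into one point per residue is to keep a single representative atom for each residue. The sketch below is only an illustration: the choice of the alpha carbon (CA) as the representative point, the function names, and the whitespace-based parsing are all assumptions, not something the slides specify. Real PDB files are fixed-column, so a production parser should slice by column rather than split on whitespace.

```python
# Minimal sketch: one 3D point per residue, taken from its CA atom.
# Assumption: records are whitespace-separated with exactly 12 fields,
# as in the excerpt on the slide (real PDB files are fixed-column).

EXCERPT = """\
ATOM 158 N  HIS A 39 43.506 72.626 62.145 1.00 25.72 N
ATOM 159 CA HIS A 39 43.583 73.021 63.550 1.00 22.52 C
ATOM 168 N  LEU A 40 41.986 73.701 65.248 1.00 22.27 N
ATOM 169 CA LEU A 40 40.851 74.468 65.724 1.00 21.68 C
"""


def parse_atom_line(line):
    """Return a dict for an ATOM record, or None for any other line."""
    fields = line.split()
    if len(fields) != 12 or fields[0] != "ATOM":
        return None
    return {
        "atom": fields[2],
        "residue": fields[3],
        "chain": fields[4],
        "resseq": int(fields[5]),
        "xyz": (float(fields[6]), float(fields[7]), float(fields[8])),
    }


def residue_points(lines, atom_name="CA"):
    """Map (chain, residue name, residue number) to the chosen atom's coordinates."""
    points = {}
    for line in lines:
        record = parse_atom_line(line)
        if record is not None and record["atom"] == atom_name:
            key = (record["chain"], record["residue"], record["resseq"])
            points[key] = record["xyz"]
    return points


POINTS = residue_points(EXCERPT.splitlines())
```

Here `POINTS` holds one entry for HIS 39 and one for LEU 40; other choices of representative point (for example a side-chain centroid) would fit the same interface.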
Protein structures are chains of amino acid residues with certain spatial arrangements
[Figure: residue packing example with labels ASP102, HIS57, ALA55, SER195, ASP194, GLY43, GLY42, SER190, GLY40]
• Graph representation: node ↔ amino acid residue; edge ↔ potential physical interaction
[Figure axis labels: Graph complexity; Information]
• Frequent subgraph mining: given a group of proteins G, each represented by a graph, and a support threshold σ with 0 ≤ σ ≤ 1, find all maximal subgraphs that occur in at least a σ fraction of the graphs in G
• Challenge: subgraph isomorphism (NP-complete)
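The support definition above can be made concrete with a small brute-force check. The following is a sketch under stated assumptions, not the mining algorithm from the talk: graphs are stored as a node-label dictionary plus a set of unlabeled edges, the residue labels and toy graphs are invented for illustration, and the backtracking search is exponential in the worst case, which is exactly the subgraph isomorphism cost the slide warns about.

```python
# Minimal sketch: support of a pattern = fraction of graphs containing it.
# A graph is (labels, edges): labels maps node id -> residue label,
# edges is a set of frozenset({u, v}) pairs. All data here is hypothetical.


def edge(u, v):
    return frozenset((u, v))


def occurs_in(pattern, graph):
    """Backtracking test: does `pattern` embed in `graph` with matching labels?"""
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    p_nodes = list(p_labels)

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            return True
        u = p_nodes[len(mapping)]
        for v in g_labels:
            if v in mapping.values() or g_labels[v] != p_labels[u]:
                continue
            # Every pattern edge from u to an already-mapped node must exist in graph.
            consistent = all(
                edge(v, mapping[w]) in g_edges
                for w in mapping
                if edge(u, w) in p_edges
            )
            if consistent:
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]
        return False

    return extend({})


def support(pattern, graphs):
    """Fraction of graphs in which the pattern occurs at least once."""
    return sum(occurs_in(pattern, g) for g in graphs) / len(graphs)


labels = {1: "ASP", 2: "HIS", 3: "SER", 4: "GLY"}
GRAPHS = [
    (labels, {edge(1, 2), edge(2, 3), edge(3, 4)}),
    (labels, {edge(1, 2), edge(2, 3)}),
    (labels, {edge(1, 4), edge(2, 3)}),  # no ASP-HIS contact here
]
PATTERN = ({"a": "ASP", "h": "HIS", "s": "SER"}, {edge("a", "h"), edge("h", "s")})
PATTERN_SUPPORT = support(PATTERN, GRAPHS)
```

With these toy graphs the ASP-HIS-SER path occurs in two of the three graphs, so it is frequent for any σ up to 2/3.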
Almost-Delaunay (AD)
• A 4-tuple of points is almost-Delaunay with parameter ε if, by perturbing all points in the set by at most ε, the circumscribing sphere can become empty.
• A 4-tuple of points is AD(ε) if ε is the minimal perturbation.
[Figure: points R1 to R5. A vertex can move within a sphere of radius ε; a new tetrahedron may be formed due to the perturbation. Blue: Delaunay, which is AD(0). Red: AD(ε).]
(Bandyopadhyay and Snoeyink, SODA, 2004)
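The AD(0) case is the ordinary Delaunay condition: the circumscribing sphere of the 4-tuple contains no other point. A minimal sketch of that test follows; it covers only the unperturbed case, and the function names and sample coordinates are assumptions made for illustration. Finding the minimal ε for a general AD(ε) tuple requires the algorithm of Bandyopadhyay and Snoeyink and is not attempted here.

```python
# Minimal sketch: empty-circumsphere test for a tetrahedron (the AD(0) case).
# The circumcenter c satisfies 2 (p_i - p_0) . c = |p_i|^2 - |p_0|^2 for i = 1..3,
# solved here with Cramer's rule.


def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def circumsphere(p0, p1, p2, p3):
    """Return (center, squared radius) of the sphere through four points."""
    rows, rhs = [], []
    for p in (p1, p2, p3):
        rows.append([2.0 * (p[k] - p0[k]) for k in range(3)])
        rhs.append(sum(p[k] ** 2 - p0[k] ** 2 for k in range(3)))
    d = det3(rows)
    if abs(d) < 1e-12:
        raise ValueError("points are (nearly) coplanar")
    center = []
    for col in range(3):
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / d)
    radius_sq = sum((center[k] - p0[k]) ** 2 for k in range(3))
    return tuple(center), radius_sq


def is_delaunay(tetra, others):
    """True if no point in `others` lies strictly inside the circumsphere."""
    center, radius_sq = circumsphere(*tetra)
    return all(
        sum((q[k] - center[k]) ** 2 for k in range(3)) >= radius_sq
        for q in others
    )


TETRA = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
CENTER, RADIUS_SQ = circumsphere(*TETRA)
```

For the unit tetrahedron above, a far-away point leaves the sphere empty, while a point at the circumcenter violates the condition.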
Graph Representations
[Figure panels: CD, DT, AD(0.5), with edge sets E(CD), E(DT), E(AD)]
Recurring patterns from Graph Databases
• Input: a database of labeled undirected graphs
[Figure: three example labeled graphs (P), (Q), (S)]
• Output: all (connected) frequent subgraphs from the graph database
[Figure: the frequent subgraphs, each annotated with its support, 3/3 or 2/3]
Canonical Adjacency Matrix
• The Canonical Adjacency Matrix (CAM) of a graph G is the maximal adjacency matrix for G under a total ordering defined on adjacency matrices.
[Figure: graph (P) with nodes p1 to p5, an isomorphic copy (P') with nodes p'1 to p'5, and three adjacency matrices M1, M2, M3 obtained from different node orderings]
• Codes of the three matrices, compared under the total ordering:
dxcxyc0x0b00x0a > dxcxyc00xa0x00b > cycx0a0x0bxx00d
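The three codes on the slide are consistent with reading each adjacency matrix row by row over its lower triangle (diagonal entries are node labels, off-diagonal entries are edge labels, "0" means no edge) and comparing the resulting strings lexicographically with "0" smaller than any letter. Under those assumptions, which are inferred from the codes rather than stated on the slide, the CAM code can be sketched by brute force over all node orderings. The node names n1 to n5 and the edge list are decoded from the first code and are hypothetical; real miners avoid this factorial enumeration.

```python
# Minimal sketch: canonical code = maximum lower-triangle code over all
# node orderings. Brute force (n! orderings), for illustration only.
from itertools import permutations

# Graph decoded from the slide's first code "dxcxyc0x0b00x0a" (assumption).
LABELS = {"n1": "d", "n2": "c", "n3": "c", "n4": "b", "n5": "a"}
EDGES = {
    frozenset(("n1", "n2")): "x",
    frozenset(("n1", "n3")): "x",
    frozenset(("n2", "n3")): "y",
    frozenset(("n2", "n4")): "x",
    frozenset(("n3", "n5")): "x",
}


def matrix_code(labels, edges, order):
    """Concatenate the lower triangle, row by row, for one node ordering."""
    parts = []
    for i, u in enumerate(order):
        for v in order[:i]:
            parts.append(edges.get(frozenset((u, v)), "0"))
        parts.append(labels[u])  # diagonal entry: the node label
    return "".join(parts)


def canonical_code(labels, edges):
    """Maximum code over every ordering; Python compares '0' below letters."""
    return max(matrix_code(labels, edges, order) for order in permutations(labels))


CAM_CODE = canonical_code(LABELS, EDGES)
OTHER_CODE = matrix_code(LABELS, EDGES, ("n1", "n2", "n3", "n5", "n4"))
```

Because isomorphic graphs share the same maximal code, comparing canonical codes is enough to detect that two candidate subgraphs are the same pattern.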
CAM Tree: Frequent Subgraphs
[Figure: example graphs (P), (Q), (S) and the tree of CAMs of their frequent subgraphs, for σ = 2/3]
Fast Frequent Subgraph Mining
• Spatial locality
  • Subgraphs with bounded degree and size
• Apriori property
  • Any supergraph of an infrequent subgraph is infrequent
  • Eliminates unnecessary isomorphism checks
• Canonical form
  • Avoids redundant examination
• Depth-first
  • Incremental isomorphism check
  • Better memory utilization
• The state-of-the-art algorithm that can handle large and complex protein graphs
• Open issues
  • Substitution
  • Dynamics and geometric constraints
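The interplay of the Apriori property, a canonical enumeration order, and depth-first growth can be illustrated in a deliberately simplified setting. The sketch below assumes every node label is unique within a graph, so a graph reduces to a set of labeled edges and subgraph containment reduces to a subset test; it also ignores connectivity. This is a swapped-in simplification for illustration and not the algorithm from the talk, which has to handle repeated labels through canonical adjacency matrices.

```python
# Minimal sketch: depth-first frequent edge-set mining with Apriori pruning.
# Assumption: node labels are unique per graph, so each graph is just a set
# of edges and "pattern occurs in graph" is a subset test.


def mine_frequent_edge_sets(graphs, sigma):
    """Return {edge set: count} for every edge set with support >= sigma."""
    all_edges = sorted(set().union(*graphs))
    frequent = {}

    def count(pattern):
        return sum(1 for g in graphs if pattern <= g)

    def grow(pattern, start):
        # Only add edges that come later in sorted order: each edge set is
        # generated exactly once (the role a canonical form plays).
        for i in range(start, len(all_edges)):
            candidate = pattern | {all_edges[i]}
            occurrences = count(candidate)
            if occurrences / len(graphs) >= sigma:
                frequent[candidate] = occurrences
                grow(candidate, i + 1)  # depth-first extension
            # Otherwise prune: no superset of an infrequent set is frequent.

    grow(frozenset(), 0)
    return frequent


GRAPHS = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("B", "C"), ("C", "D")},
    {("A", "B"), ("C", "D")},
]
FREQUENT = mine_frequent_edge_sets(GRAPHS, 2 / 3)
```

In this toy database five edge sets reach the 2/3 threshold; the set containing both ("B", "C") and ("C", "D") is rejected once, and nothing built on top of it is ever counted.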
Proof of Concept: Serine Proteases
[Table: packing motifs identified in the Eukaryotic Serine Protease family]
• N: total number of structures included in the data set
• σ: support threshold used to obtain recurring spatial motifs
• T: processing time (in seconds)
• M: motif number
• C: sequence of one-letter residue codes giving the residue composition of the motif
• κ: actual number of occurrences of the motif in the family
• λ: background frequency of the motif
• S = -log(P), where the P-value is defined by a hypergeometric distribution
The packing motifs were sorted first by their support values in descending order, and then by their background frequencies in ascending order. The -log(P) values are highlighted.
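The slide states only that P comes from a hypergeometric distribution, so the following is one plausible reading rather than the authors' exact definition: P is taken as the probability that a randomly chosen set of proteins, the same size as the family, contains the motif at least as often as the family does. The parameter names, the use of an upper tail, and the base-10 logarithm are all assumptions.

```python
# Minimal sketch: hypergeometric tail probability and S = -log10(P).
# Assumed setup: `population` proteins in total, `with_motif` of them contain
# the motif, the family has `family_size` members and `family_hits` of those
# contain the motif.
from math import comb, log10


def hypergeom_tail(population, with_motif, family_size, family_hits):
    """P(X >= family_hits) when drawing family_size proteins without replacement."""
    total = comb(population, family_size)
    upper = min(with_motif, family_size)
    tail = sum(
        comb(with_motif, k) * comb(population - with_motif, family_size - k)
        for k in range(family_hits, upper + 1)
    )
    return tail / total


def significance(population, with_motif, family_size, family_hits):
    """S = -log10(P); larger values mean a more family-specific motif."""
    return -log10(hypergeom_tail(population, with_motif, family_size, family_hits))


# Toy numbers: 10 proteins, 3 contain the motif, and all 3 are in a family of 3.
P_VALUE = hypergeom_tail(10, 3, 3, 3)
SCORE = significance(10, 3, 3, 3)
```

In the toy case only one of the 120 possible three-protein subsets captures all motif occurrences, so P = 1/120. For very small P the floating-point division can underflow to zero, in which case the logarithm should be computed from the integer numerator and denominator instead.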
Proof of Concept: Serine Proteases
38 highly specific motifs mined from serine proteases classified by SCOP v1.65 (Dec 2003)
[Figure panels: 1HJ9, 1MD8, 1OP0, 1OS8, 1PQ7, 1P57, 1SSX, 1S83]
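Proof of Concept: Papain-like Cysteine Protease
[Table: patterns mined from the papain-like cysteine protease (PCP) family]
All the patterns have -log(P) > 49. The table columns give each pattern's support in the PCP family and its number of occurrences outside the family. Patterns that contain the active dyad (His and Cys) of the proteins are highlighted.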
Proof of Concept: Papain-like Cysteine Protease
[Figure: the active site in 1cqd]
Choi, K. H., Laursen, R. A. & Allen, K. N. (1999). The 2.1 angstrom structure of a cysteine protease with proline specificity from ginger rhizome, Zingiber officinale. Biochemistry, 38(36), 11624–33.
Proof of Concept: Function Inference of Orphan Structure
• Known family (structure shown: 1nfg): Metallo-dependent hydrolase (MDH), SCOP 51556
  • 8-stranded βα (TIM) barrel fold
  • 17 members, 49 family-specific spatial motifs
• Orphan structure (1m65): CASP5 T0147
  • Unknown function
  • No good sequence and global structure alignment to known proteins
  • 7-stranded barrel fold, 30 motifs found
Proof of Concept: Function Inference II
• Known family (structure shown: 1ecs): Antibiotic resistance (AR) protein, SCOP 54598
  • Glyoxalase / bleomycin resistance / dioxygenase superfamily
  • 4 members (SCOP 1.65), 62 family-specific spatial motifs
• Query structure (1twu): Yyce
  • Unknown function, not in SCOP 1.67, DALI z < 10 in Nov 2004
  • 46 motifs found; structurally similar to the three new non-redundant AR proteins added in SCOP 1.67
References and Acknowledgement
• Collaborators
  • Catherine Blake (information retrieval)
  • Charlie Carter (biochemistry)
  • Nikolay Dokholyan (biophysics)
  • Leonard McMillan (computer graphics)
  • Jan Prins (high performance computing)
  • Jack Snoeyink (computational geometry)
  • Alexander Tropsha (pharmacy)
• Partially supported by
  • Microsoft eScience Applications Award
  • Microsoft New Faculty Fellowship
  • NSF CAREER Award IIS-0448392
  • NSF CCF-0523875
  • NSF DMS-0406381
• Prototype deployed at
• References
  • Comparing graph representations of protein structure for mining family-specific residue-based packing motifs, Journal of Computational Biology (JCB), 2005.
  • SPIN: Mining maximal frequent subgraphs from graph databases, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 581-586, 2004.
  • Mining spatial motifs from protein structure graphs, Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pp. 308-315, 2004.
  • Accurate classification of protein structural families using coherent subgraph analysis, Proceedings of the Pacific Symposium on Biocomputing (PSB), pp. 411-422, 2004.
  • Efficient mining of frequent subgraphs in the presence of isomorphism, Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pp. 549-552, 2003.
  • Another 45 papers on general methodology development directly related to this project