Uğur Sezerman

Automatic Function Identification Using the Network Properties Obtained from Graph Representation of Proteins Uğur Sezerman

MOTIVATION
• Common biological function = similar 3D structures
• Comparison of graphs to find similar subgraphs
• Discovering native folds and differentiation from artificially generated proteins
• Finding functional domains
• Finding structural motifs for function

Background Graph Matching Algorithms
One isomorphism between them is f(a)=1, f(b)=6, f(c)=2, f(d)=4, f(e)=5, f(f)=3.
* J. R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery, vol. 23, pp. 31-42, 1976.
** D. C. Schmidt, L. E. Druffel, A Fast Backtracking Algorithm to Test Directed Graphs for Isomorphism Using Distance Matrices, Journal of the Association for Computing Machinery, vol. 23, pp. 433-445, 1976.

INEXACT SUBGRAPH MATCHING
Allow for:
• Mismatching attribute values (mutations)
• Missing nodes (amino acid deletions and/or insertions)
• Missing links (contact changes due to conformational rearrangements)
Also called error-correcting subgraph isomorphism. The problem is NP-complete.

Representation Methods of Graphs
• Delaunay tessellated graphs
• Contact maps

A Delaunay simplex is defined by points whose Voronoi polyhedra have a common vertex. A Delaunay simplex is always a triangle in 2D space and a tetrahedron in 3D space. (Voronoi polyhedra may have different numbers of faces and edges.)
[Figure: Voronoi/Delaunay tessellation in 2D. Panels: Voronoi tessellation, Delaunay tessellation]

Delaunay Simplices*
* Taylor T., Vaisman I.I.: Graph theoretic properties of networks formed by the Delaunay tessellation of protein structures. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 73 (2006) 041925

Contact Maps [1,2]
• Modelling protein structure as a graph
• N×N matrix S
• S_ij = 1 if the distance between the Cα atoms of residues i and j is < 6.8 Å [3]; otherwise S_ij = 0
1. Vendruscolo, M., E. Kussel, and E. Domany: Recovery of Protein Structure from Contact Maps. Structure Fold. Des. 2 (1997) 295-306.
2. Fariselli, P. and R. Casadio: A Neural Network Based Predictor of Residue Contacts in Proteins. Protein Eng. 9 (1996) 941-948.
3. A. R. Atilgan, P. Akan, C. Baysal: Small-World Communication of Residues and Significance for Protein Dynamics. Biophys. J. 86 (2004) 85-91.
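The contact map definition above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `contact_map`, the toy coordinates, and the choice to leave the diagonal at 0 (no self-contacts) are assumptions; only the 6.8 Å Cα cutoff comes from the slide.

```python
import math

def contact_map(ca_coords, cutoff=6.8):
    """Build the N x N 0/1 contact matrix S from C-alpha (x, y, z) coordinates.

    S[i][j] = 1 when the distance between residues i and j is below the cutoff.
    Assumption: the diagonal is left at 0, i.e. a residue is not its own contact.
    """
    n = len(ca_coords)
    s = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                s[i][j] = s[j][i] = 1  # contacts are symmetric
    return s

# Made-up coordinates in angstroms, for illustration only.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
for row in toy_map:
    print(row)
```

In practice the coordinates would be read from the Cα records of a PDB file; that parsing step is omitted here.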

Graph Theoretical Attributes
• Connectivity (k) = number of neighbours
• Cliquishness (C) = number of contacts between neighbours (d) / all possible contacts between them
• Second connectivity S(k) = sum of the connectivity values of all neighbours of a node
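As a rough sketch, the three attributes can be computed directly from a 0/1 contact matrix. The function names and the small example graph are hypothetical, and returning a cliquishness of 0 for nodes with fewer than two neighbours is an assumption (the slide does not say how that case is handled).

```python
def neighbours(S, i):
    """Indices of the nodes in contact with node i (self excluded)."""
    return [j for j, c in enumerate(S[i]) if c and j != i]

def connectivity(S, i):
    """k: number of neighbours of node i."""
    return len(neighbours(S, i))

def cliquishness(S, i):
    """C: contacts among the neighbours of i divided by all possible contacts."""
    nb = neighbours(S, i)
    k = len(nb)
    if k < 2:
        return 0.0  # assumption: undefined case mapped to 0
    links = sum(S[a][b] for x, a in enumerate(nb) for b in nb[x + 1:])
    return links / (k * (k - 1) / 2)

def second_connectivity(S, i):
    """S(k): sum of the connectivity values of all neighbours of node i."""
    return sum(connectivity(S, j) for j in neighbours(S, i))

# Hypothetical 4-node graph with edges 0-1, 0-2, 1-2, 2-3.
example = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
for node in range(4):
    print(node, connectivity(example, node),
          cliquishness(example, node), second_connectivity(example, node))
```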

Centrality Measures
d: degree matrix
σ: shortest path matrix

Establishing Bases of Applications
• Potential use of graph theoretical properties of protein structures in structural alignment

Network Properties in Structural Alignment
• We calculated the difference between the network property values of the CE-aligned residues of two protein structures.
• We then checked whether such a difference could be obtained randomly.

CE Alignment Table
Structure Alignment Calculator, version 1.02, last modified: Jun 15, 2001.
CE Algorithm, version 1.00, 1998.
Chain 1: pdbdir/12AS.pdb:A (Size=330)
Chain 2: pdbdir/1PYS.pdb:A (Size=350)
Alignment length = 211 Rmsd = 3.45A Z-Score = 5.3 Gaps = 125(59.2%) CPU = 15s Sequence identities = 14.2%
Chain 1:   9 QRQISFVKSHFSRQLEERLGLIEVQAPILSR
Chain 2: 100 LHPITLMERELVEIFRAL-GYQAVEGPEVES
Table: Part of a CE alignment result between chain A of 12AS and chain A of 1PYS. The calculated values of each graph theoretical property for the bold part are given in Table 1 as an example.
[Table: Calculated parameter values]

Randomness Check
• Shuffling method: the network values of the first protein are preserved and the existing network values of the second protein are randomly shuffled.
• Shifting method: the network values of the second protein are shifted randomly while the values of the first protein are kept fixed.
• Each procedure is repeated 1000 times.
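The two randomization schemes can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the tables: the statistic (mean absolute difference over aligned positions), the reading of "shifting" as a circular shift by a random offset, the Z-score summary, and all names and values are hypothetical. Only the shuffle/shift idea and the 1000 repetitions come from the slide.

```python
import random
import statistics

def mean_abs_diff(a, b):
    """Mean absolute difference of network values over aligned positions."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def shuffled(values, rng):
    """Shuffling method: random permutation of the second protein's values."""
    out = list(values)
    rng.shuffle(out)
    return out

def shifted(values, rng):
    """Shifting method, assumed here to be a circular shift by a random offset."""
    k = rng.randrange(1, len(values))
    return list(values[k:]) + list(values[:k])

def z_score(first, second, perturb, trials=1000, seed=0):
    """Compare the observed difference with the perturbed (null) differences."""
    rng = random.Random(seed)
    observed = mean_abs_diff(first, second)
    null = [mean_abs_diff(first, perturb(second, rng)) for _ in range(trials)]
    sd = statistics.pstdev(null)
    if sd == 0:
        return 0.0  # no spread in the null distribution
    return (observed - statistics.mean(null)) / sd

# Made-up connectivity values for two aligned chains.
protein_a = [1, 5, 2, 8, 3, 9, 4, 7, 6, 10]
protein_b = [1, 4, 2, 8, 3, 9, 5, 7, 6, 10]
print("shuffle Z:", z_score(protein_a, protein_b, shuffled))
print("shift Z:  ", z_score(protein_a, protein_b, shifted))
```

A strongly negative Z-score means the aligned residues have more similar network values than the randomized controls.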

Data Sets
• Capriotti* data set: contains structurally similar proteins which have very low sequence similarity.
• Astral 40 data set: 3064 pairs were randomly chosen from a database of structurally similar proteins with low sequence identity.
* Capriotti, E., Fariselli, P., Rossi, I. and Casadio, R. (2004) A Shannon entropy-based filter detects high-quality profile-profile alignments in searches for remote homologues. Proteins, 54, 351–360.

TABLE II The Results From Randomly Shuffled Method (Capriotti Dataset: 158 Pairs)
TABLE III The Results From Shifted Method (Capriotti Dataset: 158 Pairs)

TABLE IV The Results From Randomly Shuffled Method (Astral 40 Dataset: 3064 Pairs)
TABLE V The Results From Shifted Method (Astral 40 Dataset: 3064 Pairs)

TABLE VI Z-Scores For Some Example Pairs From Randomly Shuffled Method (Astral 40 Dataset)

TABLE VII Z-Scores For Some Example Pairs From Shifted Method (Astral 40 Dataset)

Conclusion
• 67 of the 3064 protein pairs cannot be explained, because their structural similarities are also too low.
TABLE IX The best combination of the properties; the last column shows the number of non-explained pairs.

Application I: Structural Alignment
• Global and local alignment of protein structures using graph theoretical properties.
• We used nine different properties (Table 1: Graph Theoretical Properties).
• An affine gap penalty is used for alignment.
• Distance function:
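A global alignment with an affine gap penalty over per-residue property vectors could look roughly like the sketch below. It is a generic Gotoh-style dynamic program that returns only the minimal alignment cost; it is not the author's method. The slide's distance function is not preserved in this transcript, so a Euclidean distance is used as a stand-in, and the gap parameters, names, and example vectors are all assumptions.

```python
import math

INF = float("inf")

def distance(u, v):
    """Stand-in distance between two property vectors (assumption: Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def affine_global_cost(xs, ys, gap_open=2.0, gap_extend=0.5):
    """Minimal cost of globally aligning xs to ys with affine gaps.

    The first gapped residue of a run costs gap_open, each further one gap_extend.
    M: last column aligns xs[i-1] with ys[j-1]
    X: last column aligns xs[i-1] with a gap
    Y: last column aligns ys[j-1] with a gap
    """
    n, m = len(xs), len(ys)
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = distance(xs[i - 1], ys[j - 1])
            M[i][j] = d + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = min(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = min(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])

# Hypothetical one-dimensional property vectors (e.g. connectivity only).
chain_a = [(1.0,), (2.0,), (3.0,)]
chain_b = [(1.0,), (3.0,)]
print(affine_global_cost(chain_a, chain_b))
```

A local variant would additionally allow the alignment to start and end anywhere, which requires a similarity score rather than a pure distance; that variant is not shown.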

Comparison of Global Alignment Results with CE

Comparison of Local Alignment Results with CE

Application II
• Finding functional domains
• Functional similarity does not imply sequence similarity.
• Two proteins with very low sequence similarity can have the same function, which shows the importance of structural similarity.

Selected Attributes
• Degree
• Clustering Coefficient
• Secondary Structure Similarity
• Sequence Similarity (BLOSUM62)

Data Set
• Data set created by Capriotti et al. (2004)*
• This data set contains structurally similar proteins which have very low sequence similarity.
• The globin family was chosen to extend the results.
* Capriotti, E., Fariselli, P., Rossi, I. and Casadio, R. (2004) A Shannon entropy-based filter detects high-quality profile-profile alignments in searches for remote homologues. Proteins, 54, 351–360.

Our Approach
• Contact map graphs are built for the proteins.
• Our approach uses four dimensions: cliquishness, connectivity, sequence similarity and secondary structure.
• The PAM250 matrix is used for sequence similarity.
• The secondary structure similarity score is calculated with the similarity matrix proposed by Wallqvist et al.*
• If the cliquishness, connectivity and second connectivity values are close according to the intervals we specified, the match is rewarded; otherwise the match is penalized.
* Wallqvist A, Fukunishi Y, Murphy LR, Fadel A, Levy RM. Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases. Bioinformatics. 2000 Nov;16(11):988-1002.
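The reward/penalty rule for the network properties can be illustrated with a small sketch. The interval widths, the reward and penalty values, and every name below are hypothetical; the slides do not give the intervals that were actually used, and the sequence and secondary structure terms are left out.

```python
def property_match_score(node_a, node_b, tolerances, reward=1.0, penalty=-1.0):
    """Reward a node pair when every listed property differs by at most its tolerance.

    node_a, node_b: dicts mapping property name to value.
    tolerances: dict mapping property name to the allowed absolute difference
    (hypothetical values; the original intervals are not in the slides).
    """
    close = all(abs(node_a[name] - node_b[name]) <= tol
                for name, tol in tolerances.items())
    return reward if close else penalty

# Hypothetical tolerances and residue properties.
tolerances = {"cliquishness": 0.1, "connectivity": 1, "second_connectivity": 5}
residue_x = {"cliquishness": 0.50, "connectivity": 8, "second_connectivity": 60}
residue_y = {"cliquishness": 0.55, "connectivity": 9, "second_connectivity": 63}
residue_z = {"cliquishness": 0.20, "connectivity": 4, "second_connectivity": 30}

print(property_match_score(residue_x, residue_y, tolerances))
print(property_match_score(residue_x, residue_z, tolerances))
```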

Our Approach
• PDB files are parsed, and the correlation coefficient and degree values are calculated for each residue.
• These values, together with the binding information, are put into a matrix called the "binding residue matrix".
• The initial nodes are chosen among the most heavily connected nodes.
• The binding residue matrix and an initial node are sent to each processor to begin its operation.

Results-Globins- Self Match I

Results-Globins- Self Match II

Self Matching 24 Pairs of Domains

Questions • Thank you • ugur@sabanciuniv.edu

Results-Globins- Self Match IV

Results-Globins-Sub Cross Match

Results (Globins Gen. I) * Different parameters were used to extend the results.

Results (Globins Gen. II) * Different parameters were used to extend the results.

Dataset* I
* Dataset was created by Capriotti et al. (2004)

Dataset* II
* Dataset was created by Capriotti et al. (2004)

Dataset* III
* Dataset was created by Capriotti et al. (2004)

Dataset* IV
* Dataset was created by Capriotti et al. (2004)

Dataset* V
* Dataset was created by Capriotti et al. (2004)
