Tolga Can and Yuan-Fang Wang

CTSS: A Robust and Efficient Method for Protein Structure Alignment Based on Local Geometrical and Biological Features Tolga Can and Yuan-Fang Wang

Introduction
• Importance of discovering structural relationships between proteins
• Structural alignment is NP-hard
• Protein structure representation: no standard, unlike sequence alignment
• Many algorithms
  • Inter-atomic distances (CE, DALI)
  • SSE vectors (VAST, 3D-Lookup)
• Different similarity measures
  • RMSD, p-value, etc.

Problem Definition
• Given a query protein structure, find similar protein structures in a database of protein structures.
[Figure: example protein chains 1l3l:C, 2spc:A, 1jig:A, 1fse:A, 1jek:B, 1alu:_, 1kzu:B, 1k61:D, 1wdc:A, 1nkd:_, 1fmh:A, 1gl2:A, 1et1:A]

Protein Structure?
We use Cα coordinates to represent the protein structure.

PDB File:
HEADER PHEROMONE 20-DEC-95 2ERL
..................................
SEQRES 1 40 ASP ALA CYS GLU GLN ALA
..................................
ATOM  1 N  ASP 1 -1.115 8.537 7.075
ATOM  2 CA ASP 1 -1.925 7.470 6.547
ATOM  3 C  ASP 1 -2.009 6.333 7.522
ATOM  4 O  ASP 1 -1.467 6.394 8.624
ATOM  5 CB ASP 1 -1.526 6.993 5.163
ATOM  6 N  ALA 2 -2.745 5.280 7.165
ATOM  7 CA ALA 2 -2.945 4.152 7.987
ATOM  8 C  ALA 2 -1.606 3.448 8.305
ATOM  9 O  ALA 2 -1.440 3.010 9.454
ATOM 10 CB ALA 2 -3.966 3.256 7.436
ATOM 11 N  CYS 3 -0.777 3.267 7.329
ATOM 12 CA CYS 3  0.570 2.624 7.511
ATOM 13 C  CYS 3  1.328 3.308 8.626
ATOM 14 O  CYS 3  1.802 2.679 9.562
ATOM 15 CB CYS 3  1.351 2.667 6.209
ATOM 16 SG CYS 3  2.981 1.901 6.318
..................................
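As a minimal sketch of how the Cα trace could be pulled out of such a file, the Python below scans whitespace-separated ATOM records in the simplified form shown on the slide. The function name `parse_ca_coordinates` and the `EXCERPT` constant are illustrative assumptions, not part of CTSS. Real PDB files use fixed-width columns and carry extra fields, so a production parser would slice by column instead.

```python
def parse_ca_coordinates(pdb_text):
    """Return a list of (x, y, z) tuples, one per C-alpha atom.

    Assumes the simplified, whitespace-separated record layout shown on
    the slide: ATOM serial atom_name residue_name residue_number x y z.
    """
    coords = []
    for line in pdb_text.splitlines():
        fields = line.split()
        # Keep only ATOM records whose atom name is CA (C-alpha).
        if len(fields) >= 8 and fields[0] == "ATOM" and fields[2] == "CA":
            x, y, z = (float(value) for value in fields[5:8])
            coords.append((x, y, z))
    return coords


# A few records copied from the 2ERL excerpt on the slide.
EXCERPT = """\
ATOM 1 N ASP 1 -1.115 8.537 7.075
ATOM 2 CA ASP 1 -1.925 7.470 6.547
ATOM 3 C ASP 1 -2.009 6.333 7.522
ATOM 6 N ALA 2 -2.745 5.280 7.165
ATOM 7 CA ALA 2 -2.945 4.152 7.987
ATOM 12 CA CYS 3 0.570 2.624 7.511
"""

ca_trace = parse_ca_coordinates(EXCERPT)
for point in ca_trace:
    print(point)
```

The resulting list of points, in residue order, is the 3D curve that the later slides smooth and describe.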

Protein Structure
The Cα coordinates of a protein define a curve in 3D space.

Spline Approximation
We smooth the Cα curve based on secondary structure information.

Spline Approximation
We smooth the Cα curve based on secondary structure information.
[Figure labels: Turn, Helix]
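The slides do not say which spline is used, so the following is only a sketch of one common choice: a uniform cubic B-spline that treats consecutive Cα positions as control points and therefore approximates (rather than interpolates) them. It ignores the secondary structure information mentioned above, and the names `bspline_point`, `smooth_curve`, and `samples_per_segment` are assumptions made for illustration.

```python
def bspline_point(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6.0
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6.0
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0
    b3 = t**3 / 6.0
    # The four basis weights sum to 1, so each sample is a weighted
    # average of the four neighbouring control points.
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))


def smooth_curve(points, samples_per_segment=4):
    """Sample a uniform cubic B-spline that uses `points` as control points."""
    curve = []
    for i in range(len(points) - 3):
        for k in range(samples_per_segment):
            t = k / samples_per_segment
            curve.append(bspline_point(*points[i:i + 4], t))
    return curve


# Noisy zig-zag control points are pulled toward a smoother path.
noisy = [(0.0, 0.0, 0.0), (1.0, 0.4, 0.0), (2.0, -0.4, 0.0),
         (3.0, 0.4, 0.0), (4.0, -0.4, 0.0)]
for point in smooth_curve(noisy, samples_per_segment=2):
    print(point)
```

Because each sample averages four neighbours, small coordinate errors are damped, which is the kind of robustness the method relies on.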

Matching Two Curves Are they similar?

Curvature and Torsion
• Curvature: measure of how far the curve deviates from being linear
• Torsion: measure of how far the curve deviates from being planar
• Fundamental Theorem of Space Curves: If two single-valued continuous functions κ(s) and τ(s) are given for s > 0, then there exists exactly one space curve, determined except for orientation and position in space (i.e., up to a Euclidean motion), where s is the intrinsic arc length, κ is the curvature, and τ is the torsion.
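For a curve r(t) with derivatives r', r'', r''', the standard formulas are κ = |r' × r''| / |r'|³ and τ = (r' × r'') · r''' / |r' × r''|². The sketch below applies them to derivative vectors supplied by the caller; the function names and the helix test curve are illustrative assumptions, and in practice the derivatives would come from the fitted spline.

```python
import math


def cross(u, v):
    """Cross product of two 3D vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    """Dot product of two 3D vectors."""
    return sum(a * b for a, b in zip(u, v))


def curvature_torsion(d1, d2, d3):
    """Curvature and torsion from the first three derivatives of a curve."""
    c = cross(d1, d2)
    c_norm_sq = dot(c, c)
    speed = math.sqrt(dot(d1, d1))
    kappa = math.sqrt(c_norm_sq) / speed**3
    # Torsion is undefined where the curvature vanishes; return 0.0 there.
    tau = dot(c, d3) / c_norm_sq if c_norm_sq > 0 else 0.0
    return kappa, tau


def helix_derivatives(a, b, t):
    """Derivatives of the helix r(t) = (a cos t, a sin t, b t)."""
    d1 = (-a * math.sin(t), a * math.cos(t), b)
    d2 = (-a * math.cos(t), -a * math.sin(t), 0.0)
    d3 = (a * math.sin(t), -a * math.cos(t), 0.0)
    return d1, d2, d3


# A helix has constant curvature a/(a^2+b^2) and torsion b/(a^2+b^2).
print(curvature_torsion(*helix_derivatives(2.0, 1.0, 0.3)))
```

The helix is a convenient check because both quantities are constant along it, whichever parameter value is sampled.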

Curvature and Torsion
• They are invariant to rotation and translation.
• They are localized.
[Figure panels: Curvature, Torsion]

Feature Extraction
• For each amino acid, a (curvature, torsion) tuple is computed, and the secondary structure assignment is gathered from the PDB web site.
• Together these form a sequence of n three-dimensional feature vectors, where n is the number of amino acids in the protein.
[Figure: torsion plotted against curvature; the third dimension, secondary structure information, is not shown]

Indexing the Features
• Why is indexing necessary?
• Hash table (shown in 2D; the third dimension is the SSE type)
[Figure: 2D grid over the curvature and torsion axes, with one cell labeled as a hash bin]
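One way to picture the hash table is as a grid over curvature and torsion with the SSE type as a third key component. The sketch below is an assumption-laden illustration: the `bin_width` of 0.1, the single-letter SSE codes, and the function names `bin_key` and `build_index` do not come from the slides.

```python
import math
from collections import defaultdict


def bin_key(curvature, torsion, sse, bin_width=0.1):
    """Quantize a (curvature, torsion, SSE type) feature into a hash bin key."""
    return (math.floor(curvature / bin_width),
            math.floor(torsion / bin_width),
            sse)


def build_index(database, bin_width=0.1):
    """Map each hash bin to the (protein, residue position) pairs that fall in it.

    `database` maps a protein id to its per-residue feature list of
    (curvature, torsion, sse) tuples.
    """
    index = defaultdict(list)
    for protein_id, features in database.items():
        for position, (kappa, tau, sse) in enumerate(features):
            index[bin_key(kappa, tau, sse, bin_width)].append((protein_id, position))
    return index


# Hypothetical feature values, for illustration only.
database = {
    "protA": [(0.25, 0.12, "H"), (0.27, 0.15, "H")],
    "protB": [(0.25, 0.12, "E")],
}
index = build_index(database)
for key, entries in sorted(index.items()):
    print(key, entries)
```

Because the features are invariant to rotation and translation, nearby values land in the same bin regardless of how each structure is posed, which is what makes a lookup table usable here.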

Query Execution
• Hierarchical approach:
  • Pruning before detailed pairwise alignment (hash table)
  • Accumulate vote
    • vote_protein++
  • Normalize vote
    • vote_protein / length_protein
  • Threshold
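The voting steps above can be sketched as follows, assuming the index maps a hash bin key to a list of (protein id, residue position) entries. The function name `screen`, the example keys, and the threshold value are hypothetical; the slides give only the accumulate, normalize, and threshold outline.

```python
from collections import Counter


def screen(query_keys, index, lengths, threshold):
    """Return the ids of proteins whose normalized vote reaches `threshold`."""
    votes = Counter()
    for key in query_keys:
        # Every database residue sharing a bin with a query residue casts
        # one vote for its protein (vote_protein++).
        for protein_id, _position in index.get(key, ()):
            votes[protein_id] += 1
    # Normalize by protein length so long chains are not favoured
    # (vote_protein / length_protein).
    scores = {pid: votes[pid] / lengths[pid] for pid in votes}
    return sorted(pid for pid, score in scores.items() if score >= threshold)


# Hypothetical index and chain lengths, for illustration only.
index = {
    (2, 1, "H"): [("protA", 0), ("protA", 1)],
    (3, 0, "E"): [("protB", 4)],
}
lengths = {"protA": 2, "protB": 10}
query_keys = [(2, 1, "H"), (3, 0, "E"), (9, 9, "T")]
print(screen(query_keys, index, lengths, threshold=0.5))
```

Only the proteins that survive this step go on to the more expensive pairwise alignment.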

Query Execution
• Pairwise alignment by the Smith-Waterman (SW) dynamic programming technique, performed after the screening process
[Figure: distance matrix and SW alignment with a gap marked; 1l3l:C aligned to 1fse:A, length: 63, RMSD: 1.61 Å]
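A minimal sketch of Smith-Waterman local alignment over per-residue features follows. The scoring function `feature_score` (a bonus minus the Euclidean distance between two (curvature, torsion) pairs) and the linear gap penalty are assumptions chosen for illustration; the slides do not specify the scoring scheme.

```python
import math


def feature_score(f, g, bonus=1.0):
    """Hypothetical similarity: positive when two feature pairs are close."""
    return bonus - math.hypot(f[0] - g[0], f[1] - g[1])


def smith_waterman(a, b, score=feature_score, gap=1.0):
    """Local alignment of feature sequences a and b.

    Returns the best local score and the aligned index pairs (i, j).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the local score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Two hypothetical feature sequences sharing a three-residue motif.
seq_a = [(5.0, 5.0), (0.1, 0.2), (0.3, 0.1), (0.2, 0.2)]
seq_b = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.2), (9.0, 9.0)]
print(smith_waterman(seq_a, seq_b))
```

Local alignment suits motif finding because unrelated flanking residues are simply left out of the alignment instead of being penalized.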

SW Alignment Result
[Figure: aligned structures 1fse:A and 1l3l:C]

Sample Query Results
• Query: 1faz:A, database: 1938 protein chains
• Screening time: 18 seconds
• Pairwise alignment time: 29 seconds
• 1faz:A & 1ytf:D: length: 38, RMSD: 3.68 Å
• 1faz:A & 1dj7:A: length: 42, RMSD: 2.8 Å

Sample Query Results
• Query: 1b16:A, database: 1938 protein chains
• Screening time: 25 seconds
• Pairwise alignment time: 68 seconds
• 1b16:A & 1h05:A: length: 35, RMSD: 3.26 Å
• 1b16:A & 1qp8:A: length: 35, RMSD: 1.58 Å

Current and Future Work
• Evaluation of
  • Accuracy
    • Comparison with SCOP classification
  • Efficiency
    • Comparison with other techniques such as CE or DALI
• Better index structures
  • Faster and more accurate screening of candidates
• Incorporating biological and chemical properties of amino acids into the structure signatures of proteins

Conclusions
• A new method for protein structure alignment is presented.
• The extracted structural features are:
  • Compact: O(n)
  • Localized: computed for each amino acid
  • Robust: error handling by spline approximation
  • Invariant: suitable for indexing
  • Meaningful: biological and chemical properties can be incorporated easily
• An indexing technique is deployed to avoid an exhaustive scan of the structure database.
• Experimental results show that this method is suitable for finding structural motifs.

Thank you for your attention!
For more information:
Tolga Can
Department of Computer Science
University of California at Santa Barbara
Santa Barbara, CA 93106, U.S.
Email: tcan@cs.ucsb.edu
URL: http://www.cs.ucsb.edu/~tcan/CTSS/
