
Data Mining for Protein Structure Prediction


Presentation Transcript


  1. Data Mining for Protein Structure Prediction Mohammed J. Zaki SPIDER Data Mining Project: Scalable, Parallel and Interactive Data Mining and Exploration at RPI http://www.cs.rpi.edu/~zaki

  2. Outline of the Talk
  • How do proteins form?
  • Protein folding problem
  • Contact map mining
    • Using HMMs based on local motifs
    • Mining “physical” dense frequent patterns (non-local motifs)
  • Future directions
    • Heuristic rules
    • Folding pathways

  3. How do Proteins Form?

  4. How do Proteins Form?
  • Building blocks of biological systems
    • DNA (nucleotides, 4 types): information carrier/encoder
    • RNA: bridge from DNA to protein
    • Protein (amino acids, 20 types): action molecules
  • Processes
    • Replication of DNA
    • Transcription of gene (DNA) to messenger RNA (mRNA)
    • Splicing of non-coding regions of the genes (introns)
    • Translation of mRNA into proteins
    • Folding of proteins into 3D structure
    • Biochemical or structural functions of proteins

  5. Protein Folding Problem

  6. Protein Structures
  • Primary structure
    • Un-branched polymer
    • 20 side chains (residues or amino acids)
  • Higher-order structures
    • Secondary: local (consecutive) in sequence
    • Tertiary: 3D fold of one polypeptide chain
    • Quaternary: chains packing together

  7. Amino Acid

  8. Polypeptide Chain

  9. Torsion Angles

  10. The Protein Folding Problem

  11. Contact Map Mining

  12. Contact Map
  • Amino acids Ai and Aj are in contact if their 3D distance is less than a threshold (7 Å)
  • Sequence separation is given as |i-j|
  • Contact map C is an N x N matrix, where
    • C(i,j) = 1 if Ai and Aj are in contact
    • C(i,j) = 0 otherwise
  • Consider all pairs with |i-j| >= 4 (a sketch of this computation follows)
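The definition above translates directly into a small matrix computation. The sketch below is illustrative only: the slide does not say which atom represents each residue, so taking one representative coordinate per residue (e.g., the C-alpha atom) is an assumption, as are the function and argument names.

```python
import numpy as np

def contact_map(coords, threshold=7.0, min_separation=4):
    """Binary contact map from per-residue 3D coordinates.

    coords: (N, 3) array with one representative atom per residue
    (which atom is used is not stated on the slide; C-alpha is an
    assumed, common choice).
    """
    n = len(coords)
    # Pairwise Euclidean distances between residues.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    cmap = (dist < threshold).astype(int)
    # Keep only pairs with sequence separation |i - j| >= 4.
    i, j = np.indices((n, n))
    cmap[np.abs(i - j) < min_separation] = 0
    return cmap
```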

  13. Protein 2igd: 3D Structure [figure: 3D structure with the anti-parallel beta sheets, the alpha helix, and the parallel beta sheets labeled]

  14. Contact Map (2igd PDB) [figure: contact map with axes amino acid Ai vs. amino acid Aj; regions labeled parallel beta sheets, anti-parallel beta sheets, and alpha helix]

  15. How Much Information Is in the Amino Acids Alone? A Classification Problem
  • A pair of amino acids (Ai, Aj) is an instance
  • The class: C (1) or NC (0), i.e., contact or non-contact
  • Highly skewed class distribution
    • 1.7% C and 98.3% NC; 300K C vs. 17.3M NC
  • Features for each instance (see the sketch below)
    • Ai and Aj
    • Class: C or NC
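For concreteness, this is how such instances might be generated from a sequence and its contact map; the function name and the representation (one labeled amino-acid pair per eligible (i, j)) are illustrative assumptions, not the exact data format used in the talk.

```python
def make_instances(sequence, cmap, min_separation=4):
    """Turn one protein into (amino-acid pair, class) instances.

    sequence: string of one-letter amino-acid codes
    cmap: N x N binary contact map (e.g., from contact_map above)
    Returns ((Ai, Aj), label) tuples with label 1 = contact (C) and
    0 = non-contact (NC); as the slide notes, NC dominates heavily.
    """
    n = len(sequence)
    instances = []
    for i in range(n):
        for j in range(i + min_separation, n):
            instances.append(((sequence[i], sequence[j]), int(cmap[i, j])))
    return instances
```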

  16. Predicting Protein Contacts • Predict contacts for a new sequence

  17. Classification via Association Mining
  • Association mining is well suited to skewed data
  • Mining: mine frequent itemsets X in the contact data (Dc)
    • P(X | Dc) = Frequency(X | Dc) / |Dc|
  • Counting: find P(X | Dnc)
  • Pruning
    • Likelihood of a contact: r = P(X | Dc) / P(X | Dnc)
    • Prune pattern X if the ratio r of contact to non-contact probability is less than some threshold
    • i.e., keep only the patterns that are highly predictive of contacts (a sketch of this step follows)
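A minimal sketch of the pruning step, assuming each instance is represented as a frozenset of feature items. A real frequent-itemset miner would also enumerate sub-itemsets of each instance; that enumeration is omitted here, and the smoothing applied to unseen non-contact patterns is an assumption.

```python
from collections import Counter

def prune_patterns(dc, dnc, min_ratio=2.0):
    """Keep patterns whose contact likelihood ratio
    r = P(X | Dc) / P(X | Dnc) is at least min_ratio.

    dc, dnc: iterables of frozensets of feature items drawn from the
    contact and non-contact data, respectively.
    """
    dc, dnc = list(dc), list(dnc)
    freq_c, freq_nc = Counter(dc), Counter(dnc)
    kept = {}
    for x, count_c in freq_c.items():
        p_c = count_c / len(dc)
        # Add-one smoothing so patterns unseen in Dnc do not divide by zero.
        p_nc = (freq_nc.get(x, 0) + 1) / (len(dnc) + 1)
        if p_c / p_nc >= min_ratio:
            kept[x] = (p_c, p_nc)
    return kept
```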

  18. Testing Phase
  • 90-10 split into training and testing
    • 2.4 million pairs, with 36K contacts (1.5%)
  • Evidence calculation (see the sketch below)
    • Find the matching patterns P for each instance
    • Compute the cumulative frequency in C and NC
      • Sc = sum of Frequency(X | Dc) over X in P
      • Snc = sum of Frequency(X | Dnc) over X in P
    • Compute the evidence: the ratio Sc / Snc
  • Prediction: sort instances on evidence
    • Predict the top PR fraction as contacts
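The evidence-ranking step could look roughly like this; the pattern dictionary is assumed to map each kept pattern to its (P(X | Dc), P(X | Dnc)) pair, as produced by the pruning sketch above, and all names are illustrative.

```python
def predict_contacts(test_instances, patterns, pr_fraction):
    """Rank test pairs by evidence and call the top fraction contacts.

    test_instances: list of (pair_id, feature_set) for the test data
    patterns: dict mapping pattern -> (P(X | Dc), P(X | Dnc))
    pr_fraction: fraction of the ranked list to predict as contacts
    """
    scored = []
    for pair_id, features in test_instances:
        matching = [x for x in patterns if x <= features]  # subset match
        sc = sum(patterns[x][0] for x in matching)
        snc = sum(patterns[x][1] for x in matching)
        evidence = sc / snc if snc > 0 else 0.0
        scored.append((evidence, pair_id))
    scored.sort(key=lambda t: t[0], reverse=True)
    k = int(pr_fraction * len(scored))
    return {pair_id for _, pair_id in scored[:k]}
```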

  19. Experiments
  • 794 proteins from the Protein Data Bank
    • Distinct structures (< 25% similarity)
    • Longest: 907 amino acids, shortest: 35
  • 90-10 split for training-testing
  • Total pairs: 20 million (> 2.5 GB)
  • Contacts: 330 thousand (1.6%)
    • Highly uneven class distribution

  20. Evaluation Metrics (computed as in the sketch below)
  • Na: set of all pairs
  • Na*: all pairs with positive evidence
  • Ntc: true contacts in the test data
  • Ntc*: true contacts with positive evidence
  • Npc: predicted contacts
  • Ntpc: correctly predicted contacts
  • Accuracy = Ntpc / Npc
  • Coverage = Ntpc / Ntc
  • Prediction Ratio (PR) = Ntc* / Na*
  • Random predictor accuracy = Ntc / Na
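Over sets of residue pairs, the accuracy and coverage definitions above reduce to a few set operations; the function below simply restates them (names are illustrative).

```python
def evaluate(predicted, true_contacts, all_pairs):
    """Accuracy, coverage, and the random-predictor baseline.

    predicted: set of predicted contact pairs (Npc)
    true_contacts: set of true contact pairs in the test data (Ntc)
    all_pairs: set of all pairs considered (Na)
    """
    ntpc = len(predicted & true_contacts)                   # correctly predicted
    accuracy = ntpc / len(predicted)                        # Ntpc / Npc
    coverage = ntpc / len(true_contacts)                    # Ntpc / Ntc
    random_accuracy = len(true_contacts) / len(all_pairs)   # Ntc / Na
    return accuracy, coverage, random_accuracy
```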

  21. Results (Amino Acids; All Lengths)
  • Crossover: 7% accuracy and 7% coverage; 2 times over random

  22. Results (Amino Acids; by Length)
  • 1-100 residues: 12% accuracy (A) and coverage (C)
  • 100-170: 6% A and C
  • 170-300: 4.5% A and C
  • 300+: 2% A and C

  23. Using HMMs based on Local Motifs to Improve Classification

  24. An HMM for Local Predictions
  • HMMSTR (Chris Bystroff, Biology, RPI)
  • Build a library of short sequences that tend to fold uniquely across protein families: the I-Sites library
  • Treat each motif as a Markov chain
  • Merge the motifs into a global HMM for local structure prediction

  25. Training the HMM
  • Build the I-Sites library
    • Short sequence motifs (3 to 19 residues long)
    • Exhaustive clustering of sequences
    • Non-redundant PDB dataset (< 25% similarity)
  • Build an HMM
    • Each of the 262 motifs is a chain of Markov states
    • Each state carries sequence and structure information for one position
    • Merge the I-Sites motifs hierarchically to get one global HMM for all the motifs

  26. HMM Output
  • Total of 282 states in the HMM
  • Each state produces or “emits”:
    • An amino acid profile (20 probability values)
    • Secondary structure (D): helix, strand, or loop
    • Backbone angles (R): 11 dihedral-angle symbols
    • Finer structural context (C): 10 context symbols

  27. I-Sites Motifs (Initiation Sites) [figure: example motifs, including a beta hairpin, a beta-to-alpha-helix transition, and a C-cap]

  28. Data Format and Preparation
  • Take the 794 PDB proteins
  • Compute the optimal alignment to the HMM
    • Find the best state sequence for the observed amino acids
    • Output the probability distribution of each residue over all 282 HMM states
  • Integrate the 3 datasets
    • Alignment probability distribution (N x 282)
    • Amino acid and context information (D, R, C)
    • Contact map (N x N)

  29. HMMSTR Output (per Protein)

  30. Adding Features from HMMSTR
  • The class: C (1) or NC (0)
  • Highly skewed class distribution
    • Approximately 1.5% C and 98.5% NC
  • Features for each instance (assembled as in the sketch below)
    • Ai Aj Di Dj Ri Rj Ci Cj
    • Profile: pi1 pi2 … pi20 pj1 pj2 … pj20
    • HMM states: qi1 qi2 … qi282 qj1 qj2 … qj282
    • Class: C or NC
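One way to picture the extended instance layout is the sketch below. The array names and shapes are assumptions for illustration (the slide only lists the feature groups): per-residue symbols plus the HMMSTR profile and state distributions, concatenated for the two residues of a pair.

```python
def pair_features(i, j, seq, d, r, c, profile, states):
    """Assemble the extended feature vector for one residue pair (i, j).

    seq, d, r, c: per-residue symbols (amino acid, secondary structure,
    backbone-angle symbol R, structural context C)
    profile: N x 20 amino-acid profile emitted by HMMSTR
    states: N x 282 probability distribution over HMM states
    """
    symbolic = [seq[i], seq[j], d[i], d[j], r[i], r[j], c[i], c[j]]
    numeric = (list(profile[i]) + list(profile[j]) +
               list(states[i]) + list(states[j]))
    return symbolic, numeric
```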

  31. HMM and AA + (R,D,C); All Lengths
  • Left crossover: 19% accuracy and coverage; 5.3 times over random
  • Right crossover (+ R,D,C): 17% accuracy and coverage; 5 times over random

  32. HMM + AA + R,D,C (by Length)
  • 1-100 residues: 30% accuracy (A) and coverage (C)
  • 100-170: 17% A and C
  • 170-300: 10% A and C
  • 300+: 6% A and C

  33. Predicted Contact Map (2igd)

  34. Summary of Classification Results
  • Challenging prediction problem
    • In essence, we have to predict a contact matrix for a new protein
  • Hybrid HMM/association approach
    • Best results to date: 19% overall accuracy/coverage, 30% for short proteins
  • Previously reported results
    • 14.4% accuracy (Fariselli & Casadio ’99; neural network)
    • 13% accuracy (Thomas et al. ’96)
    • Short proteins: 26% (Olmea & Valencia ’97)

  35. Mining “Physical” Dense Frequent Patterns (non-local motifs)

  36. Characterizing Physical, Protein-like Contact Maps
  • Only a very small subset of all possible contact maps codes for physically realizable proteins (self-avoiding, globular chains)
  • A contact map must:
    • Satisfy geometric constraints
    • Represent a low-energy structure
  • What are the typical non-local interactions?
    • Frequent dense 0/1 submatrices in contact maps
  • 3-step approach: 1) data generation, 2) dense pattern mining, and 3) mapping back to structure space

  37. Dense Pattern Mining
  • Data generation: 12,524 protein-like, 60-residue structures
    • Use HMMSTR to generate protein-like sequences
    • Use ROSETTA to generate their structures
      • Monte Carlo fragment insertion (from the I-Sites library)
      • Up to 5 possible low-energy structures retained per sequence
  • Frequent 2D pattern mining (a sketch of the window scan follows)
    • Use a W x W sliding window, where W is the window size
    • Measure the density of 1s under each window
    • (N-W)^2 / 2 possible windows per length-N protein
    • Look for a minimum density; scale away from the diagonal
    • Try different window sizes
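A minimal sketch of the window scan, assuming the density of a window is simply its fraction of 1 entries. The density threshold, the minimum diagonal offset, and the decision to scan only the upper triangle are illustrative choices; the diagonal-dependent density scaling mentioned on the slide is not reproduced here.

```python
import numpy as np

def dense_windows(cmap, w=5, min_density=0.3, min_offset=4):
    """Return the positions and contents of dense W x W windows."""
    n = cmap.shape[0]
    hits = []
    for i in range(n - w + 1):
        # Scan the upper triangle only, away from the diagonal.
        for j in range(i + min_offset, n - w + 1):
            window = cmap[i:i + w, j:j + w]
            if window.sum() / (w * w) >= min_density:
                hits.append((i, j, window.copy()))
    return hits
```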

  38. Counting Dense Patterns
  • Naïve approach: for W=5 and N=60 there are 1485 windows per protein, for a total of 15 million possible windows over the 12,524 proteins
  • Testing whether two submatrices are equal
    • Linear search: O(P x W^2), with P current dense patterns
    • Hash based: O(W^2)
  • Our approach: 2-level hashing
    • O(W) time

  39. Pattern (W x W Submatrix) Encoding
  • Encode the submatrix as a string of W integers, one per row, reading each row as a binary number (a sketch follows below):

      Submatrix row   Integer value
      00000           0
      01100           12
      01000           8
      01000           8
      00000           0

  • Concatenated string: 0.12.8.8.0
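The encoding is straightforward to implement; this sketch reproduces the example above (the helper name is illustrative).

```python
def encode_submatrix(window):
    """Dot-separated string ID of a W x W 0/1 submatrix.

    Each row is read as a binary number, so the example on the slide
    (rows 00000, 01100, 01000, 01000, 00000) encodes to '0.12.8.8.0'.
    """
    row_values = []
    for row in window:
        value = 0
        for bit in row:
            value = (value << 1) | int(bit)
        row_values.append(value)
    return ".".join(str(v) for v in row_values)
```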

  40. Two-level Hashing
  • String ID(M): the concatenated row-value string (e.g., 0.12.8.8.0)
  • Level 1 (approximate): a hash h1(M) computed from the W row values
  • Level 2 (exact): h2(M) = StringID(M)
  • (A sketch of this scheme follows below.)
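The slide does not give the level-1 hash function, so the counter below is only a sketch of the general scheme: a cheap O(W) hash of the row values selects a bucket, and the exact string ID resolves collisions within that bucket. The polynomial hash, the bucket count, and the class interface are all assumptions.

```python
class TwoLevelPatternCounter:
    """Approximate-then-exact counting of W x W window patterns."""

    def __init__(self, num_buckets=1 << 16):
        self.num_buckets = num_buckets
        self.buckets = [dict() for _ in range(num_buckets)]

    def add(self, row_values):
        # Level 1 (approximate): O(W) polynomial hash of the row values.
        h = 0
        for v in row_values:
            h = (h * 31 + v) % self.num_buckets
        # Level 2 (exact): compare full string IDs within the bucket.
        string_id = ".".join(str(v) for v in row_values)
        bucket = self.buckets[h]
        bucket[string_id] = bucket.get(string_id, 0) + 1
        return bucket[string_id]   # current support of this pattern
```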

  41. Binding Patterns to Protein Sequence and Structure
  • Using window size W=5
  • String ID: 0.12.8.8.0, support = 170

      00000
      01100
      01000
      01000
      00000

  • Occurrences:

      pdb-name   (X,Y)    X_sequence   Y_sequence   Interaction
      1070.0     52,30    ILLKN        TFVRI        alpha::beta
      1145.0     51,13    VFALH        GFHIA        alpha::strand
      1251.2     42,6     EVCLR        GSKFG        alpha::strand
      1312.0     54,11    HGYDE        ATFAK        alpha::beta
      1732.0     49,6     HRFAK        KELAG        alpha::beta
      2895.0     49,7     SRCLD        DTIYY        alpha::beta
      ...

  42. Frequent Dense Local Patterns

  43. Frequent Dense Non-Local Patterns [figure panels: alpha – alpha and alpha – beta sheet interactions]

  44. Frequent Dense Non-Local Patterns [figure panels: alpha – beta turn and beta sheet – beta turn interactions]

  45. Future Directions
