
Application of Protein Structure Prediction Software for EST Data

Adrian Laurenzi, Departments of Computer Science and Biology. Undergraduate Research Symposium, May 20th, 2011.


Presentation Transcript


  1. Application of Protein Structure Prediction Software for EST Data
     Adrian Laurenzi, Departments of Computer Science and Biology
     Undergraduate Research Symposium, May 20th, 2011

  2. What are ESTs?
     Sequencing of expressed genes
     • Expressed sequence tags (ESTs) are useful for:
       • discovering new genes
       • understanding gene expression and regulation
     • Over half of GenBank entries are ESTs (nearly 70 million from over 2000 different organisms)
     • Typically annotated using sequence-based techniques
     • Accurate annotation relies on finding a homologous sequence of known function
     Annotation pipeline: assembly → translation → PSI-BLAST → match to homologous proteins with known function
     Assembled EST: ACTACAAAGTAAGAAGAACAAAGTAGTAAAACAAATAATTAGTTA…
     Protein sequence: DQTNRKVTKRGPLTNMEGNEKKGGGLPPTQQRHLQSSKQSSKK…
     Image credits: compbio.dfci.harvard.edu; atgc.org
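The translation step in the pipeline above (nucleotide contig to protein sequence, before PSI-BLAST) can be sketched as follows. This is a minimal illustration using the standard genetic code; the function names `translate` and `three_forward_frames` are hypothetical, and the slides do not say which tool or reading frame the author actually used.

```python
# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate a DNA string in one forward frame (0, 1, or 2).

    Stop codons become '*'; codons with ambiguous bases become 'X'.
    A trailing partial codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for pos in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
    return "".join(protein)


def three_forward_frames(dna):
    """Return the translations of all three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]
```

Because an EST contig carries no guarantee about reading frame or strand, a real pipeline would typically also translate the reverse complement and choose the frame with the best database hit; that choice is left out of this sketch.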

  3. Can we reliably predict the partial structures of proteins encoded by EST sequences?
     EST contig (shorter than full-length protein) → structure prediction → accurate partial structure?
     ACTACAAAGTAAGAAGAACAAAGTAGTAAAACAAATAATTAGTT
     DQTNRKVTKRGPLTNMEGNEKKGGGLPPTQQRHLQSSKQSSKK

  4. Structure-based functional annotation
     Protein structure determines function!
     • Many structure-based functional annotation tools are available
     Example: human carbonic anhydrase (PDB ID: 1CA2)
     Amino acid sequence: SHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKPLSVSYDQATSLRILNNGHAFNVEFDDSQDKAVLKGGPLDGTYRLIQFHFHWGSLDGQGSEHTVDKKKYAAELHLVHWNTKYGDFGKAVQQPDGLAVLGIFLKVGSAKPGLQKVVDVLDSIKTKGKSADFTNFDPRGLLPESLDYWTYPGSLTTPPLLECVTWIVLKEPISVSSEQVLKFRKLNFNGEGEPEELMVDNWRPAQPLKNRQIKASFK
     Image credits: Protein Data Bank (PDB); UCSF Chimera

  5. Structure prediction software
     Two general categories of structure prediction software:
     Comparative modeling (ProtinfoCM)
     • relies upon identification of a good modeling template (known structure)
     • without a good template the modeling is doomed to fail
     • more accurate than ab initio when a homologous template is available
     Ab initio (Rosetta 3.2)
     • generally more computationally intensive
     • pure ab initio typically only used when no templates (from PDB) are found to be homologous to the target sequence
     • predictions for sequences > 100 amino acids are not very reliable

  6. Benchmarking dataset
     Full-length sequence:
     SYIKPLPSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGHLPKRNEVVELGEFRFRVLNADSRRVHLLRLSPLQN
     Slices:
     • SYIKPLPSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGH
     • LTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGHLPKRN
     • PSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGHLPKRNEVVELGEFR
     • VGGLVMSAFGHLPKRNEVVELGEFRFRVLNADSRRVHLLRLSPLQN
     • SYIKPLPSGDFIVKALTPVDAFND
     Image credits: Protein Data Bank (PDB); UCSF Chimera
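One way to produce slices like those on this slide is to sample contiguous subsequences of a full-length sequence with random start positions and lengths. The sketch below is an assumption about the procedure, not the author's documented method: the function name `make_slices`, the minimum length, and the uniform sampling are all hypothetical.

```python
import random

# Full-length benchmark sequence as shown on the slide.
FULL = ("SYIKPLPSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGHLPKRN"
        "EVVELGEFRFRVLNADSRRVHLLRLSPLQN")


def make_slices(sequence, n_slices, min_len, seed=0):
    """Return (start, slice) pairs of contiguous subsequences.

    Lengths are drawn uniformly from [min_len, len(sequence)] and start
    positions uniformly from the range that keeps the slice in bounds.
    Both choices are assumptions made for illustration.
    """
    if not 0 < min_len <= len(sequence):
        raise ValueError("min_len must be between 1 and len(sequence)")
    rng = random.Random(seed)  # fixed seed so the benchmark is reproducible
    slices = []
    for _ in range(n_slices):
        length = rng.randint(min_len, len(sequence))
        start = rng.randint(0, len(sequence) - length)
        slices.append((start, sequence[start:start + length]))
    return slices


if __name__ == "__main__":
    for start, piece in make_slices(FULL, n_slices=5, min_len=20):
        print(start, piece)
```

Keeping the start offset alongside each slice makes it possible to compare a predicted partial structure against the matching residues of the known full-length structure.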

  7. Accuracy on all “slices”: ab initio (Rosetta 3.2)
     Figure annotation: BAD

  8. Accuracy on all “slices”: comparative modeling (ProtinfoCM)

  9. “Best” model selection optimization
     • Assign a score to all models (lower score = better model), then cluster the top 10%
     • Optimization:
       • K-means vs. Rosetta hierarchical clustering
       • RAPDF vs. Rosetta scoring
     Figure labels: low energy score (good model); high energy score (bad model)
     Image credits: UCSF Chimera; Rosetta 3.2
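The score-then-cluster idea can be sketched in a few lines. Note the swap: instead of the k-means or Rosetta hierarchical clustering compared on the slide, this sketch uses simple neighbor counting within a distance cutoff, which is only a stand-in. The function name, the `distance` callback (for example, pairwise RMSD between models), and the cutoff value are assumptions.

```python
def select_best_model(scores, distance, top_fraction=0.10, cutoff=2.0):
    """Pick a representative model from the best-scoring fraction.

    scores   : list of floats, one per model (lower = better).
    distance : callable (i, j) -> float giving structural distance
               between models i and j (hypothetical; e.g. RMSD).
    Returns the index of the top-fraction model with the most neighbors
    within `cutoff`; ties are broken in favor of the lower score.
    """
    if not scores:
        raise ValueError("no models to select from")
    n_keep = max(1, int(len(scores) * top_fraction))
    # Indices of the n_keep lowest-scoring models.
    top = sorted(range(len(scores)), key=lambda i: scores[i])[:n_keep]

    def neighbors(i):
        return sum(1 for j in top if j != i and distance(i, j) <= cutoff)

    return max(top, key=lambda i: (neighbors(i), -scores[i]))
```

The rationale for clustering after scoring is that a densely populated region of conformational space among low-scoring models is taken as evidence of a good model, rather than trusting the single lowest score alone.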

  10. Acknowledgements
     Thanks to the Baker Lab for making Rosetta available!
     • Ram Samudrala, PI, Computational Biology Group, Department of Microbiology
     • The Group:
       • Ling-Hong Hung
       • Mike Shannon
       • Mike Zhou
       • Stewart Moughon
       • George White
       • Brian Buttrick
       • Jeremy Horst
       • Thomas Wood
       • Raymond Zhang
     Research supported by:
     • Levinson Emerging Scholars Program
     • Mary Gates Research Endowment
     • NSF REU grant
     Special thanks to:
     • Art and Rita Levinson
     • All the staff at the Undergraduate Research Program
     • Thank you for all your incredible support!

  11. Model quality assessment
     • Not a trivial problem; no widely accepted best method
     • Two similar but independent popular algorithms used: TMscore & MaxSub
     • Basic method:
       • Optimal alignment of Cα atoms
       • TMscore: considers all residues but weights better-aligning residues higher
       • MaxSub: finds the largest subset of model residues that superimpose well upon the corresponding residues
     Figure examples: TMscore = 0.28; TMscore = 0.60
     Image credits: TMscore; RasMol
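For reference, the TM-score weighting described above is commonly written as TM = (1/L) Σ 1 / (1 + (d_i/d0)²), where L is the target length, d_i is the distance between the i-th pair of aligned Cα atoms, and d0 = 1.24·(L − 15)^(1/3) − 1.8 Å. The sketch below evaluates that sum for one fixed superposition; the actual TMscore program also searches over superpositions to maximize the value, which is omitted here. The function names and the 0.5 Å floor on d0 for short chains are assumptions of this sketch.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms).

    The 0.5 floor for short targets is an assumption of this sketch.
    """
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_fixed(distances, target_length):
    """TM-score-style sum for a single, already-chosen superposition.

    distances     : C-alpha distances (angstroms) for aligned residue pairs.
    target_length : number of residues in the target (native) structure.
    Unaligned target residues contribute zero, so partial models are
    penalized for missing coverage.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length
```

Each residue contributes between 0 and 1, so closely superimposed residues dominate the total while badly placed ones add little; this is the sense in which better-aligning residues are weighted higher.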

  12. Software overview: ProtinfoCM & MODELLER (comparative modeling)
     Have animated slide in “tool tips” about specifics of Protinfo & MODELLER (Eswar et al. 2006)

  13. Software overview: RosettaAB (ab initio modeling)
     • Build fragment library: find set of sequence segments (< 10 residues) using input sequence
     • Model generation: assemble fragments; after each insertion do an energy minimization step
     • Selection of “best” model: score & cluster 1,000-10,000 generated models
     Image credits: depts.washington.edu/yeastrc/pages/rosetta.html; UCSF Chimera; Rosetta 3.2

  14. Running predictions
     • Must be careful not to cheat!
     • Comparative modeling: templates < 85% identity
     • RosettaAB: no fragments homologous to the target were used
     • Ab initio predictions run in parallel across a computer cluster
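The 85% identity rule for templates can be illustrated with a small filter. This sketch assumes each template has already been aligned to the target (equal-length strings with `-` for gaps) and computes identity over columns where both sequences have a residue; that denominator is one of several conventions and is an assumption here, as are the function names.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither side is a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def allowed_templates(alignments, max_identity=85.0):
    """Keep only templates whose identity to the target is below the limit.

    alignments : dict mapping template name -> (aligned_target,
                 aligned_template); the names are hypothetical labels.
    """
    return [name for name, (target, template) in alignments.items()
            if percent_identity(target, template) < max_identity]
```

Excluding near-identical templates keeps the benchmark honest: a template that is essentially the target itself would make comparative modeling look far better than it would on a genuinely novel EST.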
