Application of Protein Structure Prediction Software for EST Data. Adrian Laurenzi, Departments of Computer Science and Biology. Undergraduate Research Symposium, May 20th, 2011.
What are ESTs?
Sequencing of expressed genes
• Expressed sequence tags (ESTs) are useful for:
  • discovering new genes
  • understanding gene expression and regulation
• Over half of GenBank entries are ESTs (nearly 70 million from over 2000 different organisms)
• Typically annotated using sequence-based techniques
• Accurate annotation relies on finding a homologous sequence of known function
Figure: Assembly (nucleotide sequence ACTACAAAGTAAGAAGAACAAAGTAGTAAAACAAATAATTAGTTA…) -> protein sequence (DQTNRKVTKRGPLTNMEGNEKKGGGLPPTQQRHLQSSKQSSKK…) -> PSI-BLAST -> Match to homologous proteins with known function
Image credits: compbio.dfci.harvard.edu; atgc.org
Can we reliably predict the partial structures of proteins encoded by EST sequences?
Figure: EST contig (shorter than the full-length protein) -> Structure prediction -> Accurate partial structure?
Example sequences: ACTACAAAGTAAGAAGAACAAAGTAGTAAAACAAATAATTAGTT (nucleotide); DQTNRKVTKRGPLTNMEGNEKKGGGLPPTQQRHLQSSKQSSKK (protein)
Structure-based functional annotation
Protein structure determines function!
• Example: human carbonic anhydrase (PDB ID: 1CA2)
• Many structure-based functional annotation tools available
Amino acid sequence: SHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKPLSVSYDQATSLRILNNGHAFNVEFDDSQDKAVLKGGPLDGTYRLIQFHFHWGSLDGQGSEHTVDKKKYAAELHLVHWNTKYGDFGKAVQQPDGLAVLGIFLKVGSAKPGLQKVVDVLDSIKTKGKSADFTNFDPRGLLPESLDYWTYPGSLTTPPLLECVTWIVLKEPISVSSEQVLKFRKLNFNGEGEPEELMVDNWRPAQPLKNRQIKASFK
Image credits: Protein Data Bank (PDB); UCSF Chimera
Structure prediction software
Two general categories of structure prediction software:
Comparative modeling (ProtinfoCM)
• relies upon identification of a good modeling template (known structure)
• without a good template the modeling is doomed to fail
• more accurate than ab initio when a homologous template is available
Ab initio (Rosetta 3.2)
• generally more computationally intensive
• pure ab initio is typically used only when no templates (from PDB) are found to be homologous to the target sequence
• predictions for sequences > 100 amino acids are not very reliable
Benchmarking dataset
Full-length sequence: SYIKPLPSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGHLPKRNEVVELGEFRFRVLNADSRRVHLLRLSPLQN
Slices:
• SYIKPLPSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGH
• LTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGHLPKRN
• PSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGHLPKRNEVVELGEFR
• VGGLVMSAFGHLPKRNEVVELGEFRFRVLNADSRRVHLLRLSPLQN
• SYIKPLPSGDFIVKALTPVDAFND
Image credits: Protein Data Bank (PDB); UCSF Chimera
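The slices on this slide are contiguous pieces of one full-length sequence, standing in for EST contigs. The slide does not say how the slice boundaries were chosen, so the following is only a minimal sketch that assumes uniformly random start positions and lengths; the function name `random_slices` and the `min_length` parameter are hypothetical.

```python
import random


def random_slices(sequence, count, min_length, rng):
    """Return `count` random contiguous slices of `sequence`.

    Assumption: slice length and start are drawn uniformly at random.
    The original benchmark may have used a different scheme.
    """
    if not 0 < min_length <= len(sequence):
        raise ValueError("min_length must be between 1 and len(sequence)")
    slices = []
    for _ in range(count):
        length = rng.randint(min_length, len(sequence))
        start = rng.randint(0, len(sequence) - length)
        slices.append(sequence[start:start + length])
    return slices


full_length = (
    "SYIKPLPSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGH"
    "LPKRNEVVELGEFRFRVLNADSRRVHLLRLSPLQN"
)
example_slices = random_slices(full_length, 5, 20, random.Random(2011))
for piece in example_slices:
    print(len(piece), piece)
```

Passing an explicit `random.Random` instance keeps the benchmark reproducible, which matters when predictions for the same slices are compared across tools.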
“Best” model selection optimization
• Assign a score to all models (lower score = better model), then cluster the top 10%
• Optimization:
  • K-means vs. Rosetta hierarchical clustering
  • RAPDF vs. Rosetta scoring
Figure: low energy score (good model) vs. high energy score (bad model)
Image credits: UCSF Chimera; Rosetta 3.2
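The "score, then cluster the top 10%" idea can be sketched as below. This is not K-means or Rosetta's clustering: it swaps in a simple neighbour count within a fixed radius, and it uses a toy feature vector per model where a real pipeline would use pairwise structural distances such as RMSD. The names `select_best_model`, `top_fraction`, and `radius` are hypothetical.

```python
import math


def select_best_model(models, top_fraction=0.10, radius=1.0):
    """Pick a representative from the best-scoring fraction of models.

    models: list of (name, score, features) tuples, lower score = better.
    Assumption: `features` is a toy coordinate tuple standing in for a
    structure; distance between features stands in for structural distance.
    """
    ranked = sorted(models, key=lambda m: m[1])
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    top = ranked[:keep]

    def neighbour_count(model):
        # Count top-scoring models (including itself) within `radius`.
        return sum(
            1 for other in top if math.dist(model[2], other[2]) <= radius
        )

    # max() returns the first maximal element, so ties go to the lower score.
    return max(top, key=neighbour_count)


toy_models = [
    ("a", 1.0, (0.0, 0.0)),
    ("b", 2.0, (5.0, 5.0)),
    ("c", 3.0, (5.2, 5.1)),
    ("d", 4.0, (5.1, 4.9)),
    ("e", 50.0, (0.0, 0.1)),
    ("f", 51.0, (9.0, 9.0)),
    ("g", 52.0, (9.1, 9.0)),
    ("h", 53.0, (3.0, 3.0)),
]
chosen = select_best_model(toy_models, top_fraction=0.5, radius=1.0)
print(chosen[0])
```

In the toy data, model "a" has the lowest score but sits alone, while "b", "c", and "d" form a tight group, so the representative comes from that group. This illustrates why clustering is combined with scoring instead of simply taking the single lowest-scoring model.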
Acknowledgements
Thanks to the Baker Lab for making Rosetta available!
• Ram Samudrala, PI, Computational Biology Group, Department of Microbiology
• The Group:
  • Ling-Hong Hung
  • Mike Shannon
  • Mike Zhou
  • Stewart Moughon
  • George White
  • Brian Buttrick
  • Jeremy Horst
  • Thomas Wood
  • Raymond Zhang
Research supported by:
• Levinson Emerging Scholars Program
• Mary Gates Research Endowment
• NSF REU grant
Special thanks to:
• Art and Rita Levinson
• All the staff at the Undergraduate Research Program
• Thank you for all your incredible support!
Model quality assessment
• Not a trivial problem; no widely accepted best method
• Two similar but independent popular algorithms used: TMscore & MaxSub
• Basic method:
  • Optimal alignment of Cα atoms
  • TMscore: considers all residues but weights better-aligning residues higher
  • MaxSub: finds the largest subset of model residues that superimpose well upon corresponding residues
Figure: example models with TMscore = 0.28 and TMscore = 0.60
Image credits: TMscore; RasMol
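To make "weights better-aligning residues higher" concrete, here is a minimal sketch of the TM-score sum for a single, already-chosen superposition. It uses the published form, with each aligned residue contributing 1 / (1 + (d / d0)^2) and d0 = 1.24 * (L - 15)^(1/3) - 1.8 for a target of length L. The search for the optimal superposition, which the TMscore program performs, is omitted here. The 0.5 Å floor on d0 for very short targets is an assumption of this sketch, and the function names are hypothetical.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms)."""
    if target_length <= 15:
        return 0.5  # assumed floor; the formula is not meaningful here
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)  # assumed floor for very short targets


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: C-alpha distances (angstroms) for the aligned residue pairs.
    Residues of the target that are not aligned contribute nothing, but
    the sum is still divided by the full target length.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


print(tm_score([0.0] * 100, 100))  # perfect model over the full target
print(tm_score([0.0] * 50, 100))   # perfect model covering half the target
```

Because the sum is normalized by the full target length, a partial model such as an EST-derived fragment is penalized for the residues it does not cover, even when the covered part superimposes perfectly.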
Software overview
ProtinfoCM & MODELLER (comparative modeling)
Have animated slide in “tool tips” about specifics of Protinfo & MODELLER (Eswar et al. 2006)
Software overview
RosettaAB (ab initio modeling)
• Build fragment library: find a set of sequence segments (< 10 residues) using the input sequence
• Model generation: assemble fragments; after each insertion, do an energy minimization step
• Selection of “best” model: score & cluster 1,000-10,000 generated models
Image credits: depts.washington.edu/yeastrc/pages/rosetta.html; UCSF Chimera; Rosetta 3.2
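The model generation step can be illustrated with a toy fragment-insertion loop. This is not Rosetta's code or API. It assumes a model is just a list of backbone angles, uses a made-up energy function, ignores the fact that real fragment libraries are position-specific, and replaces the energy minimization step with a Metropolis accept/reject test. All names (`toy_energy`, `assemble_fragments`, `temperature`) are hypothetical.

```python
import math
import random


def toy_energy(angles):
    """Made-up energy: mean squared deviation from a helix-like -60 degrees."""
    return sum((a + 60.0) ** 2 for a in angles) / len(angles)


def assemble_fragments(n_residues, fragments, steps, rng, temperature=100.0):
    """Insert random fragments and keep the lowest-energy model seen.

    Assumption: every fragment is no longer than n_residues.
    """
    current = [180.0] * n_residues  # start from an extended chain
    current_e = toy_energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        frag = rng.choice(fragments)
        start = rng.randrange(n_residues - len(frag) + 1)
        trial = current[:start] + list(frag) + current[start + len(frag):]
        trial_e = toy_energy(trial)
        delta = trial_e - current_e
        # Metropolis test: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


# Toy library with 3-residue and 9-residue fragments (both < 10 residues).
toy_fragments = [(-60.0,) * 3, (-120.0,) * 9]
model, energy = assemble_fragments(20, toy_fragments, 200, random.Random(0))
print(round(energy, 2))
```

Running the loop many times with different seeds would give the large pool of candidate models that the selection step then scores and clusters.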
Running predictions
• Must be careful not to cheat!
• Comparative modeling: templates with < 85% sequence identity
• RosettaAB: no fragments homologous to the target were used
• Ab initio predictions run in parallel across a computer cluster
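The template cutoff can be sketched as a simple filter. The slide does not say how identity was computed, so this assumes each template has already been aligned to the target (equal-length strings with "-" for gaps) and that identity is measured over columns where neither sequence has a gap. Other definitions use a different denominator. The function names and the toy alignments are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def allowed_templates(alignments, cutoff=85.0):
    """Keep template names whose identity to the target is below `cutoff`.

    alignments: dict mapping template name -> (aligned_target, aligned_template)
    """
    return [
        name
        for name, (target, template) in alignments.items()
        if percent_identity(target, template) < cutoff
    ]


toy_alignments = {
    "near_identical": ("SYIKPLPSGD", "SYIKPLPSGD"),
    "distant": ("SYIKPLPSGD", "SYLRPAP-GE"),
}
usable = allowed_templates(toy_alignments)
print(usable)
```

Excluding near-identical templates keeps the benchmark honest: a template that is essentially the answer would make any comparative modeling result look better than it would be on a genuinely novel EST.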