Application of Protein structure prediction software for EST Data

Adrian Laurenzi
Departments of Computer Science and Biology
Undergraduate Research Symposium, May 20th, 2011

What are ESTs?
• Expressed sequence tags (ESTs) come from sequencing of expressed genes and are useful for:
  • discovering new genes
  • understanding gene expression and regulation
• Over half of GenBank entries are ESTs (nearly 70 million from over 2000 different organisms)
• Typically annotated using sequence-based techniques
• Accurate annotation relies on finding a homologous sequence of known function
Annotation pipeline (figure): Sequencing of expressed genes → Assembly → ACTACAAAGTAAGAAGAACAAAGTAGTAAAACAAATAATTAGTTA… → DQTNRKVTKRGPLTNMEGNEKKGGGLPPTQQRHLQSSKQSSKK… → PSI-BLAST → Match to homologous proteins with known function
Image credits: compbio.dfci.harvard.edu; atgc.org
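PSI-BLAST takes a protein query, so an assembled nucleotide contig is translated before the search. The following is a minimal sketch of six-frame translation with the standard genetic code. The function names (`reverse_complement`, `translate`, `six_frame`) are illustrative and not taken from the talk, and the translation step actually used in this work may differ.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1); codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame=0):
    """Translate one reading frame; '*' marks a stop codon, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame(dna):
    """Return translations of the three forward and three reverse frames."""
    rc = reverse_complement(dna)
    return [translate(strand, frame) for strand in (dna, rc) for frame in range(3)]
```

Each of the six candidate translations could then be searched separately, keeping the frame that gives the longest open stretch or the best database hit.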

Can we reliably predict the partial structures of proteins encoded by EST sequences?
EST contig (shorter-than-full-length protein) → Structure prediction → Accurate partial structure?
ACTACAAAGTAAGAAGAACAAAGTAGTAAAACAAATAATTAGTT
DQTNRKVTKRGPLTNMEGNEKKGGGLPPTQQRHLQSSKQSSKK

Structure-based functional annotation
• Protein structure determines function!
• Many structure-based functional annotation tools available
Example: Human carbonic anhydrase (PDB ID: 1CA2)
Amino acid sequence: SHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKPLSVSYDQATSLRILNNGHAFNVEFDDSQDKAVLKGGPLDGTYRLIQFHFHWGSLDGQGSEHTVDKKKYAAELHLVHWNTKYGDFGKAVQQPDGLAVLGIFLKVGSAKPGLQKVVDVLDSIKTKGKSADFTNFDPRGLLPESLDYWTYPGSLTTPPLLECVTWIVLKEPISVSSEQVLKFRKLNFNGEGEPEELMVDNWRPAQPLKNRQIKASFK
Image credits: Protein Data Bank (PDB); UCSF Chimera

Structure prediction software
Two general categories of structure prediction software:
Ab initio (Rosetta 3.2)
• generally more computationally intensive
• pure ab initio typically only used when no templates (from PDB) are found to be homologous to target sequence
• predictions for sequences > 100 amino acids are not very reliable
Comparative modeling (ProtinfoCM)
• relies upon identification of a good modeling template (known structure)
• without a good template the modeling is doomed to fail
• more accurate than ab initio when a homologous template is available

Benchmarking dataset
Full-length sequence:
SYIKPLPSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGHLPKRNEVVELGEFRFRVLNADSRRVHLLRLSPLQN
Slices:
• SYIKPLPSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGH
• LTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGHLPKRN
• PSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGHLPKRNEVVELGEFR
• VGGLVMSAFGHLPKRNEVVELGEFRFRVLNADSRRVHLLRLSPLQN
• SYIKPLPSGDFIVKALTPVDAFND
Image credits: Protein Data Bank (PDB); UCSF Chimera
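The slices stand in for EST contigs that cover only part of a protein. The talk does not say how slice boundaries were chosen, so the sketch below assumes uniformly random start positions and lengths. `make_slices`, `n_slices`, `min_len`, and `seed` are hypothetical names and defaults used only for illustration.

```python
import random


def make_slices(sequence, n_slices=5, min_len=20, seed=0):
    """Return n_slices random contiguous sub-sequences of `sequence`.

    The uniform sampling and the min_len default are assumptions,
    not the procedure used to build the benchmark in the talk.
    """
    if min_len > len(sequence):
        raise ValueError("min_len exceeds sequence length")
    rng = random.Random(seed)  # fixed seed keeps the benchmark reproducible
    slices = []
    for _ in range(n_slices):
        length = rng.randint(min_len, len(sequence))
        start = rng.randint(0, len(sequence) - length)
        slices.append(sequence[start:start + length])
    return slices


FULL = ("SYIKPLPSGDFIVKALTPVDAFNDFFGSEFSDEEFDTVGGLVMSAFGH"
        "LPKRNEVVELGEFRFRVLNADSRRVHLLRLSPLQN")
benchmark_slices = make_slices(FULL)
```

Because each slice is a contiguous window of a protein with a known structure, a prediction for the slice can be scored against the matching region of that structure.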

Accuracy on all “Slices”: Ab initio (Rosetta 3.2)
Figure annotation: BAD

Accuracy on all “Slices”: Comparative modeling (ProtinfoCM)

“Best” model selection optimization
• Assign a score to all models (lower score = better model), then cluster the top 10%
• Optimization:
  • K-means vs. Rosetta hierarchical clustering
  • RAPDF vs. Rosetta scoring
Figure labels: Low energy score (good model); High energy score (bad model)
Image credits: UCSF Chimera; Rosetta 3.2
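The score-then-cluster idea can be sketched as follows. This is not k-means, Rosetta's clustering, or RAPDF: it is a simplified stand-in that keeps the best-scoring 10% and returns the model with the most neighbors within a distance cutoff. The `distance` callback, the `cutoff` value, and the function name are assumptions made for illustration.

```python
def select_best_model(scores, distance, top_fraction=0.10, cutoff=3.0):
    """Pick a representative model index from the best-scoring subset.

    scores   : list of per-model scores, lower is better
    distance : function (i, j) -> structural distance between models i and j
    cutoff   : neighbor radius (hypothetical value, same units as distance)
    """
    n_top = max(1, int(len(scores) * top_fraction))
    top = sorted(range(len(scores)), key=lambda i: scores[i])[:n_top]

    def neighbor_count(i):
        return sum(1 for j in top if j != i and distance(i, j) <= cutoff)

    # The model with the most neighbors wins; ties go to the lower score.
    return max(top, key=lambda i: (neighbor_count(i), -scores[i]))


# Toy usage: 1-D "coordinates" stand in for pairwise structural distances.
toy_scores = [float(i) for i in range(30)]
toy_positions = [100.0, 0.0, 1.0] + [50.0] * 27
best = select_best_model(toy_scores, lambda i, j: abs(toy_positions[i] - toy_positions[j]))
```

In the toy data the lowest-scoring model is a structural outlier, so the sketch returns a slightly worse-scoring model that sits in the densest cluster instead.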

Acknowledgements
Thanks to the Baker Lab for making Rosetta available!
• Ram Samudrala, PI, Computational Biology Group, Department of Microbiology
• The Group:
  • Ling-Hong Hung
  • Mike Shannon
  • Mike Zhou
  • Stewart Moughon
  • George White
  • Brian Buttrick
  • Jeremy Horst
  • Thomas Wood
  • Raymond Zhang
Research supported by:
• Levinson Emerging Scholars Program
• Mary Gates Research Endowment
• NSF REU grant
Special thanks to:
• Art and Rita Levinson
• All the staff at the Undergraduate Research Program
• Thank you for all your incredible support!

Model quality assessment
• Not a trivial problem; no widely accepted best method
• Two similar but independent popular algorithms used: TMscore & MaxSub
• Basic method:
  • Optimal alignment of Cα atoms
  • TMscore: considers all residues but weights better-aligning residues higher
  • MaxSub: finds the largest subset of model residues that superimpose well upon the corresponding residues
Figure examples: TMscore = 0.28; TMscore = 0.60
Image credits: TMscore; RasMol
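The weighting behind the TM-score can be illustrated with its published formula: each aligned residue pair contributes 1 / (1 + (d_i / d0)^2), where d0 = 1.24 (L − 15)^(1/3) − 1.8 and L is the target length. The sketch below evaluates that sum for one fixed superposition only; the real TMscore program also searches for the superposition that maximizes it. The function names and the 0.5 Å floor on d0 are assumptions added so that short targets are handled.

```python
def tm_score_d0(l_target):
    """Length-dependent distance scale d0 in angstroms.

    The 0.5 floor is a practical guard for short targets (an assumption here).
    """
    if l_target <= 15:
        return 0.5
    return max(0.5, 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_fixed(distances, l_target):
    """TM-score sum for one fixed superposition.

    distances : C-alpha distances (angstroms) for the aligned residue pairs
    l_target  : number of residues in the target (native) structure
    """
    d0 = tm_score_d0(l_target)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target
```

Dividing by the target length rather than the number of aligned pairs means a partial model that covers half the target perfectly scores 0.5, which matters when judging predictions for shorter-than-full-length slices.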

Software overview: ProtinfoCM & MODELLER (comparative modeling)
Speaker note: have animated slide in “tool tips” about specifics of Protinfo & MODELLER (Eswar et al. 2006)

Software overview: RosettaAB (ab initio modeling)
• Build fragment library: find set of sequence segments (< 10 residues) using input sequence
• Model generation: assemble fragments; after each insertion do energy minimization step
• Selection of “best” model: score & cluster 1,000-10,000 generated models
Image credits: depts.washington.edu/yeastrc/pages/rosetta.html; UCSF Chimera; Rosetta 3.2

Running predictions
• Must be careful not to cheat!
• Comparative modeling: templates < 85% identity
• RosettaAB: no fragments homologous to target were used
• Ab initio predictions run in parallel across computer cluster
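The template restriction can be sketched as a percent-identity filter over pairwise alignments. Definitions of percent identity differ in their denominator; this sketch assumes identical residues divided by aligned (non-gap) columns, which may not match the definition used by ProtinfoCM. The function names and the template identifiers in the usage lines are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def allowed_templates(alignments, max_identity=85.0):
    """Keep template IDs whose identity to the target is below max_identity.

    alignments : {template_id: (aligned_target, aligned_template)}
    """
    return [tid for tid, (target, template) in alignments.items()
            if percent_identity(target, template) < max_identity]


# Toy usage with made-up template IDs and short sequences.
toy_alignments = {
    "tmplA": ("ACDEFGHIKL", "ACDEFGHIKL"),  # 100% identical: excluded
    "tmplB": ("ACDEFGHIKL", "ACDEFGHAAA"),  # 70% identical: kept
}
kept = allowed_templates(toy_alignments)
```

Excluding near-identical templates keeps the benchmark from simply copying the known structure of the target itself.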
