
What is computational biology?



Presentation Transcript


  1. What is computational biology?

  2. Genome • The entire hereditary information content of an organism

  3. DNA • String over the 4-letter alphabet A, T, G, C • An organism’s genome is distributed over chromosomes (e.g., 46 chromosomes in human: 22 pairs of autosomes plus the sex chromosomes, XX or XY) • Genome size: number of base pairs in an organism’s genome
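The "string over a 4-letter alphabet" view can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the two-"chromosome" toy genome are hypothetical, and the toy genome is simply the example string from slide 6 split in two.

```python
# Sketch: DNA as a string over the alphabet {A, T, G, C}.
DNA_ALPHABET = set("ATGC")
COMPLEMENT = str.maketrans("ATGC", "TACG")


def is_dna(seq):
    """True if seq is non-empty and uses only A, T, G, C (case-insensitive)."""
    return len(seq) > 0 and set(seq.upper()) <= DNA_ALPHABET


def reverse_complement(seq):
    """Sequence of the opposite strand: complement each base, then reverse."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def genome_size(chromosomes):
    """Genome size in base pairs: total length over all chromosomes."""
    return sum(len(seq) for seq in chromosomes.values())


# Hypothetical toy genome (names and split are made up for illustration).
toy_genome = {"chr1": "ATGCCTTACG", "chr2": "TACCCTGCGGCAGCACT"}

print(is_dna("ATGCCTTACG"))        # True
print(reverse_complement("ATGC"))  # GCAT
print(genome_size(toy_genome))     # 27
```

Each position on one strand is paired with its complement on the other, which is why genome size is counted in base pairs rather than single bases.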

  4. Genome Sizes • ~400 genomes sequenced

  5. How are genomes sequenced? • Can only sequence a few hundred base pairs at a time • Make many copies of the DNA and cut into smaller (overlapping) pieces • Assemble pieces: certain substrings occur in multiple fragments
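The assembly step can be sketched as a greedy merge of the fragments that overlap the most. This is a simplified illustration under strong assumptions (error-free reads, one strand, no repeats); it is not how production assemblers work, and the function names and toy reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; return the contigs we have
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Toy overlapping reads cut from the 27-base example string on slide 6.
reads = ["ATGCCTTACG", "TTACGTACCCTG", "CCCTGCGGCAGCACT"]
print(greedy_assemble(reads))  # ['ATGCCTTACGTACCCTGCGGCAGCACT']
```

The greedy choice is easy to follow but can be misled by repeated substrings, which is one reason real genomes are harder to assemble than this toy case suggests.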

  6. Genomes to Life? • Genome: ATGCCTTACGTACCCTGCGGCAGCACT

  7. Portions of DNA code for genes, which carry the information for making proteins • Proteins play key roles in most biological processes (e.g., signaling, catalysis, immune response, etc.)

  8. Gene Finding gucgcuaccauuaccaguuggucuggugucaaaaauaauaauaaccgggcaggccaugucugcccguauuucgcguaaggaaauccauuauguacuauuuaaaaaacacaaacuuuuggauguucgguuuauucuuuuucuuuuacuuuuuuaucaugggagccuacuucccguuuuucccgauuuggcuacaugacaucaaccauaucagcaaaagugauacggguauuauuuuugccgcuauuucucuguucucgcuauuauuccaaccgcuguuuggucugcuuucugacaaacucgggcugcgcaaauaccugcuguggauuauuaccggcauguuagugauguuugcgccguucuuuauuuuuaucuucgggccacuguuacaauacaacauuuuaguaggaucgauuguuggugguauuuaucuaggcuuuuguuuuaacgccggugcgccagcaguagaggcauuuauugagaaagucagccgucgcaguaauuucgaauuuggucgcgcgcggauguuuggcuguguuggcugggcgcugugugccucgauugucggcaucauguucaccaucaauaaucaguuuguuuucuggcugggcucuggcugugcacucauccucgccguuuuacucuuuuucgccaaaacggaugcgcccucuucugccacgguugccaaugcgguaggugccaaccauucggcauuuagccuuaagcuggcacuggaacuguucagacagccaaaacugugguuuuugucacuguauguuauuggcguuuccugcaccuacgauGuuuuugaccaacaguuugcuaauuucuuuacuucguucugucaggugaa...gcaaucaaugucggaugcggcgcgacgcu

  9. Gene Finding • mRNA: same sequence as on slide 8 • Encoded protein: MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDTGIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNILVGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFTINNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKLWFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPLIINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYIT

  10. The Genetic Code • AUG = Methionine / Start • UUA = Leucine • UUG = Leucine • UAA = Stop • UAG = Stop • UGA = Stop • ... (Stryer, Biochemistry)

  11. Gene Finding • mRNA: same sequence as on slide 8

  12. Gene Finding • Reading off from 1st start triplet: aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa ... • Translating (3 letter amino acid code): Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys ... • (1 letter code): M S A R I S R K E I H Y V L F K ...

  13. Gene Finding • Reading off from 1st start triplet: aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa ... • Translating (3 letter amino acid code): Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys ... • (1 letter code): M S A R I S R K E I H Y V L F K ... • Actual protein sequence: M Y Y L K N T N F W M F G L F F ...
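The mismatch between the naive translation and the actual protein can be reproduced with a short sketch. The code below builds the standard genetic code table and translates a fragment of the slide's mRNA from each AUG it finds. The fragment boundaries, the helper names, and the choice to stop at the end of the fragment are assumptions made for illustration.

```python
# Standard genetic code, laid out in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def start_positions(mrna):
    """0-based positions of every AUG triplet, in any reading frame."""
    mrna = mrna.upper()
    return [i for i in range(len(mrna) - 2) if mrna[i:i + 3] == "AUG"]


def translate(mrna, start):
    """Translate codon by codon from `start` until a stop codon or the end."""
    mrna = mrna.upper()
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # UAA, UAG, UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


# Fragment of the slide's mRNA around the first two AUG triplets.
FRAGMENT = "ccaugucugcccguauuucgcguaaggaaauccauuauguacuauuuaaaaaacacaaacuuuuggauguuc"

first, second = start_positions(FRAGMENT)[:2]
print(translate(FRAGMENT, first)[:16])  # MSARISRKEIHYVLFK (naive reading)
print(translate(FRAGMENT, second))      # MYYLKNTNFWMF (matches the actual protein)
```

In this fragment the second AUG sits one base out of frame relative to the first, so the two readings share no codons even though they cover the same bases.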

  14. Computational Gene Finding Methods • Statistical bias: protein coding regions “look different” - compare coding vs. non-coding regions (Hidden Markov Models, Neural Nets) • Sequence similarity - similar to known protein?
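The "statistical bias" idea can be illustrated with a log-odds score between two sequence models. The sketch below swaps in a plain first-order Markov chain, which is much simpler than the Hidden Markov Models and neural nets named on the slide. The training strings are made-up toy data, not real coding or non-coding sequence, and all names are hypothetical.

```python
import math
from collections import Counter


def train_markov(seqs, alphabet="ACGU", pseudo=1.0):
    """First-order Markov model: P(next base | current base), with pseudocounts."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        counts.update(zip(seq, seq[1:]))
    model = {}
    for a in alphabet:
        total = sum(counts[(a, b)] for b in alphabet) + pseudo * len(alphabet)
        for b in alphabet:
            model[(a, b)] = (counts[(a, b)] + pseudo) / total
    return model


def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over adjacent base pairs.

    Positive scores mean the sequence looks more like the coding training set.
    """
    seq = seq.upper()
    return sum(math.log(coding[p] / noncoding[p]) for p in zip(seq, seq[1:]))


# Made-up toy training data, chosen only so the two models differ.
coding_model = train_markov(["GCGCGGCCGCGG"])
noncoding_model = train_markov(["AUAUUAAUAUUA"])

print(log_odds("GCGC", coding_model, noncoding_model) > 0)  # True
print(log_odds("AUAU", coding_model, noncoding_model) < 0)  # True
```

A real gene finder would train on annotated genes and typically model codon position as well; the point here is only that "looks different" can be turned into a number.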

  15. Gene finding is hard • In some genomes, only a small portion of the genome codes for protein (needle in a haystack) • Some genes contain introns and exons; exons are the parts that actually encode the protein, and they can be short • Have to get the precise boundaries to get the correct protein
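For the simplest case, a gene with no introns, candidate boundaries can be found by scanning each reading frame for a start codon followed by an in-frame stop codon. This is a rough sketch only: it ignores introns and the reverse strand, and the function name, the `min_codons` parameter, and the toy sequence are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(mrna, min_codons=2):
    """Find AUG...stop stretches in the three forward reading frames.

    Returns (start, end, frame) tuples with 0-based start and exclusive end;
    the end includes the stop codon. min_codons counts codons before the stop.
    """
    mrna = mrna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(mrna) - 2, 3):
            codon = mrna[pos:pos + 3]
            if start is None and codon == "AUG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


# Toy sequence: AUG GCU UAA sits in frame 2, flanked by extra bases.
print(find_orfs("CCAUGGCUUAAGG"))  # [(2, 11, 2)]
```

Moving a boundary by a single base changes the frame, so every downstream codon is read differently; this is why the boundaries have to be exact.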

  16. Number of genes

  17. Predicting Protein Function MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDTGIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNILVGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFTINNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKLWFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPLIINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYIT DNA binding protein

  18. Functions of Human Proteins (Science, 2001)

  19. Sequence similarity • Ex: cystic fibrosis gene (CF) and bacterial nickel transport gene (NT)
      CF: EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL-----
      NT: QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR

      CF: NTEGEIQIDGVSWDSITL---------QQWRKAFGVIPQKVFIFSG
      NT: QTAGEILADGKPVSPCALRGIKIATIMQNPRSAFNPL---------

      CF: TFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFP-GKLDFVLVDGG
      NT: ---HTMHTHARETCLALGKPADDATLTAAIEAVGLENAARVLKLYP

      CF: CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV
      NT: FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV
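One simple way to put a number on an alignment like the one above is percent identity over the columns where neither sequence has a gap. The sketch below applies it to the first CF/NT block. The function name and the decision to skip gap columns are assumptions, and identity is only one of several similarity measures.

```python
def percent_identity(a, b, gap="-"):
    """Percent of aligned (gap-free) columns where the two residues are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    aligned = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not aligned:
        return 0.0
    matches = sum(x == y for x, y in aligned)
    return 100.0 * matches / len(aligned)


# First block of the CF / NT alignment from the slide.
cf_block = "EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL-----"
nt_block = "QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR"

print(f"{percent_identity(cf_block, nt_block):.1f}% identical over aligned columns")
```

The shared stretch GSGKS in the middle of the block contributes most of the matches, which is typical: similarity between distant relatives tends to concentrate in short conserved regions.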

  20. Database Searches http://www.ncbi.nlm.nih.gov

  21. Database Searches • Sequences producing significant alignments (E-value in the last column):
      gi|5523990|gb|AAD44047.1|AF108138_1 (AF108138) DNA helicase  4e-84
      gi|7511524|pir||T37310 PIF1 protein - Caenorhabditis elegans helicase  1e-77
      gi|7493349|pir||T40739 rrm3-pif1 helicase homolog - fission...  3e-59
      gi|11282390|pir||T47241 RRM3/PIF1 helicase homolog - fission yeast  3e-59
      gi|6321820|ref|NP_011896.1| DNA helicase; Rrm3p [Saccharomyces  4e-43
      gi|6323579|ref|NP_013650.1| 5' to 3' DNA helicase; Pif1p [Saccharo  1e-41
      gi|558414|emb|CAA86260.1| (Z38114) len: 750, CAI: 0.14, inc...  1e-41
      gi|7687929|emb|CAB89609.1| (AL354532) possible DNA helicase...  4e-41
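A hit list like this is usually filtered by E-value, where smaller values mean the match is less likely to be due to chance. The sketch below parses three of the lines shown above and keeps those under a cutoff. The parsing assumes the E-value is the last whitespace-separated field, and the `1e-50` cutoff is an arbitrary value picked for illustration, not a recommended threshold.

```python
def parse_hits(lines):
    """Split each hit line into (description, e_value).

    Assumes the E-value is the last space-separated field on the line.
    """
    hits = []
    for line in lines:
        description, _, e_value = line.rstrip().rpartition(" ")
        hits.append((description.strip(), float(e_value)))
    return hits


def significant(hits, threshold=1e-50):
    """Keep hits whose E-value is at or below the (arbitrary) threshold."""
    return [hit for hit in hits if hit[1] <= threshold]


hit_lines = [
    "gi|5523990|gb|AAD44047.1|AF108138_1 (AF108138) DNA helicase 4e-84",
    "gi|7511524|pir||T37310 PIF1 protein - Caenorhabditis elegans helicase 1e-77",
    "gi|6321820|ref|NP_011896.1| DNA helicase; Rrm3p [Saccharomyces 4e-43",
]

for description, e_value in significant(parse_hits(hit_lines)):
    print(f"{e_value:.0e}  {description}")
```

With this cutoff the first two hits are kept and the third is dropped.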

  22. Protein Structure Sequence: KETAAAKFERQHMDSSTSAASSSN… Structure:

  23. Proteins • Levels of structure: Primary (amino acids) • Secondary (α-helix) • Tertiary (polypeptide chain) • Quaternary (assembled subunits) (Lehninger, Principles of Biochemistry)

  24. Protein Structure Prediction • Physics-based methods • Statistics-based methods

  25. Statistics & Protein Structure Prediction Given a new sequence and a library of folds, figure out which (if any) is a good fit to the sequence.

  26. Secondary structure prediction • Given a protein sequence, can you tell its secondary structure? • E.g., sequence LKVVAKRELVQNNQ with predicted states aaaa bbbb aaaaaaa (a = alpha, b = beta) • ~70% accuracy (neural nets or other learning techniques)
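An accuracy figure like "~70%" is commonly a per-residue measure: the fraction of positions where the predicted state equals the observed one. A minimal sketch follows; the state strings are invented toy data, and `c` (coil, meaning neither alpha nor beta) is an added assumption not shown on the slide.

```python
def per_residue_accuracy(predicted, observed):
    """Fraction of positions where the predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must be non-empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Invented example: a = alpha, b = beta, c = coil (assumed third state).
observed = "aaaacbbbbc"
predicted = "aaacccbbbc"

print(per_residue_accuracy(predicted, observed))  # 0.8
```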

  27. Genome annotation • Many other important features of DNA • E.g., proteins bind DNA regulatory elements, which determines which genes are “on” and when • Statistical & comparative approaches for finding them • Motif finding
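A simpler relative of motif finding is scanning a sequence for approximate matches to a motif that is already known. The sketch below does that with a mismatch budget. It does not discover new motifs, which is the harder problem the slide refers to. The DNA string is a toy example, and TATAAT is used only as a familiar consensus sequence.

```python
def find_motif(dna, motif, max_mismatches=1):
    """Report every window of dna within max_mismatches of motif.

    Returns (position, window, mismatches) tuples with 0-based positions.
    """
    dna, motif = dna.upper(), motif.upper()
    hits = []
    for i in range(len(dna) - len(motif) + 1):
        window = dna[i:i + len(motif)]
        mismatches = sum(x != y for x, y in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# Toy sequence containing one exact and one near match.
print(find_motif("GGCTATAATGCGTACAATCC", "TATAAT"))
# [(3, 'TATAAT', 0), (12, 'TACAAT', 1)]
```

Statistical approaches generalize this by replacing the fixed motif with per-position base frequencies, so that some mismatches cost more than others.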

  28. Universal phylogenetic tree • Prokaryotes • Eukaryotes (Woese et al.)

  29. Building phylogenetic trees • Use DNA (or protein) sequences from various organisms, e.g.: • human ATCGAGGC • mouse ATCCAGCC • yeast ATTAAGTA

  30. Building phylogenetic trees • E.g., distance matrix and resulting tree for Human, Mouse, Yeast (figure)
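A distance matrix for the three example sequences (human, mouse, yeast) can be sketched by counting mismatched positions. Using the Hamming distance is an assumption made here; the slide does not say which distance it uses, and real methods usually correct for multiple substitutions at the same site.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


sequences = {"human": "ATCGAGGC", "mouse": "ATCCAGCC", "yeast": "ATTAAGTA"}

# Pairwise distances, keyed by (name, name).
distances = {
    (a, b): hamming(sequences[a], sequences[b])
    for a, b in combinations(sequences, 2)
}
closest = min(distances, key=distances.get)

for pair, d in distances.items():
    print(pair, d)
print("closest pair:", closest)  # ('human', 'mouse')
```

Human and mouse differ at 2 positions while each differs from yeast at 4, so a distance-based method would typically join human and mouse first and attach yeast as the outgroup.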

  31. Intracellular networks

  32. Network of cells

  33. fn

  34. Lecture Notes • www.cs.princeton.edu/~mona/computational_biology_notes.html
