What is computational biology?
Genome • The entire hereditary information content of an organism
DNA • String over the 4-letter alphabet A, T, G, C • An organism’s genome is distributed over chromosomes (e.g., 46 chromosomes in human: 22 pairs plus the sex chromosomes, XX or XY) • Genome size: number of base pairs in an organism’s genome
Genome Sizes ~ 400 genomes sequenced
How are genomes sequenced? • Can only sequence a few hundred base pairs at a time • Make many copies of the DNA and cut into smaller (overlapping) pieces • Assemble pieces: certain substrings occur in multiple fragments
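The assembly step can be sketched as a greedy merge of overlapping fragments. This is only a toy illustration, not how production assemblers work: the function names, the `min_len` threshold and the three example fragments are assumptions made for the sketch, and real reads contain errors and repeats that this ignores.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; keep the contigs separate
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical overlapping fragments cut from one short sequence.
fragments = ["ATGCCTTACGTA", "TACGTACCCTGC", "CCCTGCGGCAGCACT"]
contigs = greedy_assemble(fragments)
```

With these fragments the merge recovers a single contig; with repeated substrings a greedy strategy can merge the wrong pair, which is one reason assembly is hard in practice.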
Genomes to Life ? ATGCCTTACGTACCCTGCGGCAGCACT Genome
Portions of DNA code for genes, which carry the information for making proteins • Proteins play key roles in most biological processes (e.g., signaling, catalysis, immune response, etc.)
Gene Finding gucgcuaccauuaccaguuggucuggugucaaaaauaauaauaaccgggcaggccaugucugcccguauuucgcguaaggaaauccauuauguacuauuuaaaaaacacaaacuuuuggauguucgguuuauucuuuuucuuuuacuuuuuuaucaugggagccuacuucccguuuuucccgauuuggcuacaugacaucaaccauaucagcaaaagugauacggguauuauuuuugccgcuauuucucuguucucgcuauuauuccaaccgcuguuuggucugcuuucugacaaacucgggcugcgcaaauaccugcuguggauuauuaccggcauguuagugauguuugcgccguucuuuauuuuuaucuucgggccacuguuacaauacaacauuuuaguaggaucgauuguuggugguauuuaucuaggcuuuuguuuuaacgccggugcgccagcaguagaggcauuuauugagaaagucagccgucgcaguaauuucgaauuuggucgcgcgcggauguuuggcuguguuggcugggcgcugugugccucgauugucggcaucauguucaccaucaauaaucaguuuguuuucuggcugggcucuggcugugcacucauccucgccguuuuacucuuuuucgccaaaacggaugcgcccucuucugccacgguugccaaugcgguaggugccaaccauucggcauuuagccuuaagcuggcacuggaacuguucagacagccaaaacugugguuuuugucacuguauguuauuggcguuuccugcaccuacgauGuuuuugaccaacaguuugcuaauuucuuuacuucguucugucaggugaa...gcaaucaaugucggaugcggcgcgacgcu
Gene Finding gucgcuaccauuaccaguuggucuggugucaaaaauaauaauaaccgggcaggccaugucugcccguauuucgcguaaggaaauccauuauguacuauuuaaaaaacacaaacuuuuggauguucgguuuauucuuuuucuuuuacuuuuuuaucaugggagccuacuucccguuuuucccgauuuggcuacaugacaucaaccauaucagcaaaagugauacggguauuauuuuugccgcuauuucucuguucucgcuauuauuccaaccgcuguuuggucugcuuucugacaaacucgggcugcgcaaauaccugcuguggauuauuaccggcauguuagugauguuugcgccguucuuuauuuuuaucuucgggccacuguuacaauacaacauuuuaguaggaucgauuguuggugguauuuaucuaggcuuuuguuuuaacgccggugcgccagcaguagaggcauuuauugagaaagucagccgucgcaguaauuucgaauuuggucgcgcgcggauguuuggcuguguuggcugggcgcugugugccucgauugucggcaucauguucaccaucaauaaucaguuuguuuucuggcugggcucuggcugugcacucauccucgccguuuuacucuuuuucgccaaaacggaugcgcccucuucugccacgguugccaaugcgguaggugccaaccauucggcauuuagccuuaagcuggcacuggaacuguucagacagccaaaacugugguuuuugucacuguauguuauuggcguuuccugcaccuacgauGuuuuugaccaacaguuugcuaauuucuuuacuucguucugucaggugaa...gcaaucaaugucggaugcggcgcgacgcu MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDTGIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNILVGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFTINNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKLWFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPLIINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYIT
The Genetic Code AUG = methionine/start UUA = Leucine UUG = Leucine UAA = Stop UAG = Stop UGA = Stop . . . Stryer, Biochemistry
Gene Finding Reading off from 1st start triplet aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa ... Translating (3 letter amino acid code) Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys ... (1 letter code) M S A R I S R K E I H Y V L F K ...
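Reading off triplets from the first start codon can be sketched in a few lines. The codon table below is the standard genetic code written compactly (bases ordered U, C, A, G; `*` marks a stop); the function name is an assumption for this sketch, and the input is assumed to contain only A, C, G, U (or T).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated as UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_from_first_start(rna):
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    rna = rna.upper().replace("T", "U")
    start = rna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break  # stop codon ends the protein
        protein.append(aa)
    return "".join(protein)


# The triplets read off on the slide.
slide_rna = "augucugcccguauuucgcguaaggaaauccauuauguacuauuuaaa"
peptide = translate_from_first_start(slide_rna)
```

Here `peptide` reproduces the one-letter translation shown on the slide. Taking the first AUG is only a naive rule; it does not by itself identify the true start of a gene.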
Gene Finding • Reading off from the 1st start triplet gives M S A R I S R K E I H Y V L F K ... • Actual protein sequence: M Y Y L K N T N F W M F G L F F ... • The two disagree after the first residue: the gene actually starts at a later aug, in a different reading frame
Computational Gene Finding Methods • Statistical bias: protein coding regions “look different” - compare coding vs. non-coding regions (Hidden Markov Models, Neural Nets) • Sequence similarity - similar to known protein?
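The "coding regions look different" idea can be illustrated with a log-odds score over 3-letter substrings. This is a deliberately simple stand-in, not the Hidden Markov Models or neural nets named above; the function names, the pseudocount and the two tiny training sets are made up for the sketch and carry no biological meaning.

```python
import math
from collections import Counter


def kmer_model(seqs, k=3, pseudo=1.0):
    """Return a function giving the smoothed frequency of a k-mer in seqs."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values()) + pseudo * 4 ** k  # pseudocount for every possible k-mer

    def freq(kmer):
        return (counts[kmer] + pseudo) / total

    return freq


def log_odds(seq, coding, noncoding, k=3):
    """Sum of log(P_coding / P_noncoding) over all k-mers in seq."""
    return sum(
        math.log(coding(seq[i:i + k]) / noncoding(seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )


# Toy training data (hypothetical): one "coding" and one "non-coding" example.
coding_model = kmer_model(["GCGGCGGCC"])
noncoding_model = kmer_model(["ATATATTAA"])
```

A positive score means the sequence resembles the "coding" training set more than the "non-coding" one. Real gene finders train on large annotated sets and model reading frame, start and stop signals, and splice sites as well.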
Gene finding is hard • In some genomes, only a small portion of the genome codes for protein (needle in a haystack) • Some genes contain introns and exons; only the exons actually encode the protein, and exons can be short • Have to get the precise boundaries to get the correct protein
Predicting Protein Function MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDTGIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNILVGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFTINNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKLWFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPLIINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYIT DNA binding protein
Functions of Human Proteins Science, 2001
Sequence similarity
Ex: cystic fibrosis gene (CF) and bacterial nickel transport gene (NT)

CF: EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL-----
NT: QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR

CF: NTEGEIQIDGVSWDSITL---------QQWRKAFGVIPQKVFIFSG
NT: QTAGEILADGKPVSPCALRGIKIATIMQNPRSAFNPL---------

CF: TFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFP-GKLDFVLVDGG
NT: ---HTMHTHARETCLALGKPADDATLTAAIEAVGLENAARVLKLYP

CF: CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV
NT: FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV
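Alignments like this are typically computed by dynamic programming. Below is a minimal sketch of a global alignment score (Needleman-Wunsch style). The scoring values (+1 match, -1 mismatch, -1 per gap position) are assumptions chosen for simplicity; protein alignments in practice use substitution matrices and more refined gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b under a simple scoring scheme."""
    # prev[j] holds the best score for aligning the current prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, which is enough for the score; recovering the alignment itself would require keeping the full table (or traceback pointers).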
Database Searches http://www.ncbi.nlm.nih.gov
Database Searches
Sequences producing significant alignments: E-Value
gi|5523990|gb|AAD44047.1|AF108138_1 (AF108138) DNA helicase 4e-84
gi|7511524|pir||T37310 PIF1 protein - Caenorhabditis elegans helicase 1e-77
gi|7493349|pir||T40739 rrm3-pif1 helicase homolog - fission... 3e-59
gi|11282390|pir||T47241 RRM3/PIF1 helicase homolog - fission yeast 3e-59
gi|6321820|ref|NP_011896.1| DNA helicase; Rrm3p [Saccharomyces 4e-43
gi|6323579|ref|NP_013650.1| 5' to 3' DNA helicase; Pif1p [Saccharo 1e-41
gi|558414|emb|CAA86260.1| (Z38114) len: 750, CAI: 0.14, inc... 1e-41
gi|7687929|emb|CAB89609.1| (AL354532) possible DNA helicase... 4e-41
Protein Structure Sequence: KETAAAKFERQHMDSSTSAASSSN… Structure:
Proteins • Primary structure: amino acids • Secondary structure: α-helix • Tertiary structure: polypeptide chain • Quaternary structure: assembled subunits (Lehninger, Principles of Biochemistry)
Protein Structure Prediction • Physics-based methods • Statistics-based methods
Statistics & Protein Structure Prediction Given a new sequence and a library of folds, figure out which (if any) is a good fit to the sequence.
Secondary structure prediction • Given a protein sequence, can you tell its secondary structure? • E.g., LKVVAKRELVQNNQ labeled per residue as aaaa bbbb aaaaaaa (a = alpha, b = beta) • ~70% accuracy (neural nets or other learning techniques)
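An accuracy figure like this is usually a per-residue measure: the fraction of positions where the predicted label matches the known one. A small sketch follows; the function name and the example label strings (with `-` standing for "neither") are hypothetical.

```python
def per_residue_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("label strings must have the same length")
    if not predicted:
        raise ValueError("label strings must not be empty")
    matches = sum(p == t for p, t in zip(predicted, actual))
    return matches / len(predicted)


# Hypothetical labels: a = alpha, b = beta, - = neither.
accuracy = per_residue_accuracy("aaabbb--", "aaabb---")
```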
Genome annotation • Many other important features of DNA • E.g., proteins bind DNA regulatory elements, which determines which genes are “on” and when • Statistical & comparative approaches for finding them • Motif finding
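One common statistical representation of a binding motif is a position weight matrix built from known sites and then scanned along a sequence. The sketch below assumes a uniform background (0.25 per base), a pseudocount of 1, and a made-up set of example sites; the names `build_pwm` and `best_hit` are hypothetical. Note that this only scans for a motif that is already known, which is simpler than discovering a motif from unaligned sequences.

```python
import math


def build_pwm(sites, pseudo=1.0):
    """Log2-odds weight per base and position, relative to a uniform background."""
    pwm = []
    total = len(sites) + 4 * pseudo
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudo) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm


def best_hit(seq, pwm):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    scores = [
        (sum(pwm[i][seq[start + i]] for i in range(width)), start)
        for start in range(len(seq) - width + 1)
    ]
    return max(scores)


# Made-up aligned binding sites and a sequence to scan.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
score, start = best_hit("GGCGTATAATGCGC", pwm)
```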
Universal phylogenetic tree Prokaryotes Eukaryotes Woese et al.
Building phylogenetic trees Use DNA (or protein) sequences from various organisms e.g., human ATCGAGGC mouse ATCCAGCC yeast ATTAAGTA
Building phylogenetic trees E.g., Distance Matrix: 1 2 Tree: 1 1 Human Mouse Yeast
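Going from sequences to a distance matrix to a tree can be sketched as follows. The distance used here is the number of mismatched positions (Hamming distance) and the tree is built by average-linkage clustering (UPGMA); both are assumptions for this sketch and are not necessarily the measure or method behind the slide's figure. The three sequences are the ones given above.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def upgma(seqs):
    """Build a nested-tuple tree by repeatedly joining the two closest clusters."""
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def dist(c1, c2):
        # Average distance over all cross-cluster pairs of members.
        pairs = [(m, n) for m in c1[1] for n in c2[1]]
        return sum(hamming(seqs[m], seqs[n]) for m, n in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: dist(*p))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


seqs = {"human": "ATCGAGGC", "mouse": "ATCCAGCC", "yeast": "ATTAAGTA"}
distances = {(a, b): hamming(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)}
tree = upgma(seqs)
```

Under this distance, human and mouse differ at 2 positions and each differs from yeast at 4, so human and mouse are joined first and yeast attaches last.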
Lecture Notes • www.cs.princeton.edu/~mona/computational_biology_notes.html