1 / 51

Gene Recognition

Gene Recognition. Using Comparative Information. Using Comparative Information. Hox cluster is an example where everything is conserved. Genes. Intergenic. Separation. Mutations Gaps Frameshifts. 30% 1.3% 0.14%. 58% 14% 10.2%.   . 2-fold 10-fold 75-fold.

sheena
Download Presentation

Gene Recognition

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Gene Recognition

  2. Using Comparative Information

  3. Using Comparative Information • Hox cluster is an example where everything is conserved

  4. Genes Intergenic Separation Mutations Gaps Frameshifts 30% 1.3% 0.14% 58% 14% 10.2%    2-fold 10-fold 75-fold Patterns of Conservation

  5. Comparison-based Gene Finders • Rosetta, 2000 • CEM, 2000 • First methods to apply comparative genomics (human-mouse) to improve gene prediction • Twinscan, 2001 • First HMM for comparative gene prediction in two genomes • SLAM, 2002 • Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes • NSCAN, 2006 • Best method to-date based on a phylo-HMM for multiple genome gene prediction

  6. Twinscan • Align the two sequences (eg. from human and mouse) • Mark each human base as gap ( - ), mismatch ( : ), match ( | ) New “alphabet”: 4 x 3 = 12 letters • = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| } • Run Viterbi using emissions ek(b) where b  { A-, A:, A|, …, T| } Emission distributions ek(b) estimated from real genes from human/mouse eI(x|) < eE(x|): matches favored in exons eI(x-) > eE(x-): gaps (and mismatches) favored in introns Example Human: ACGGCGACGUGCACGU Mouse: ACUGUGACGUGCACUU Alignment: ||:|:|||||||||:|

  7. SLAM – Generalized Pair HMM Exon GPHMM 1.Choose exon lengths (d,e). 2.Generate alignment of length d+e. e d

  8. NSCAN—Multiple Species Gene Prediction • GENSCAN • TWINSCAN • N-SCAN Target GGTGAGGTGACCAAGAACGTGTTGACAGTA Target GGTGAGGTGACCAAGAACGTGTTGACAGTA Conservation |||:||:||:|||||:||||||||...... sequence Target GGTGAGGTGACCAAGAACGTGTTGACAGTA Informant1 GGTCAGC___CCAAGAACGTGTAG...... Informant2 GATCAGC___CCAAGAACGTGTAG...... Informant3 GGTGAGCTGACCAAGATCGTGTTGACACAA ... Target sequence: Informant sequences (vector): Joint prediction (use phylo-HMM):

  9. NSCAN—Multiple Species Gene Prediction X H C Y Y Z H X Z M R C M R

  10. Performance Comparison NSCAN human/mouse > Human/multiple informants GENSCAN Generalized HMM Models human sequence TWINSCAN Generalized HMM Models human/mouse alignments N-SCAN Phylo-HMM Models multiple sequence evolution

  11. CONTRAST • 2-level architecture • No Phylo-HMM that models alignments Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta Armadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Tenrectc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg X SVM SVM CRF a b a b Y

  12. CONTRAST

  13. CONTRAST - Features • log P(y | x) ~ wTF(x, y) • F(x, y) = if(yi-1, yi, i, x) • f(yi-1, yi, i, x): • 1{yi-1 = INTRON, yi = EXON_FRAME_1} • 1{yi-1 = EXON_FRAME_1, xhuman,i-2,…, xhuman,i+3 = ACCGGT) • 1{yi-1 = EXON_FRAME_1, xhuman,i-1,…, xdog,i+1 = ACC, AGC) • (1-c)1{a<SVM_DONOR(i)<b} • (optional) 1{EXON_FRAME_1, EST_EVIDENCE}

  14. CONTRAST – SVM accuracies SN SP • Accuracy increases as we add informants • Diminishing returns after ~5 informants

  15. CONTRAST - Decoding Viterbi Decoding: maximize P(y | x) Maximum Expected Boundary Accuracy Decoding: maximize i,B 1{yi-1, yi is exon boundary B} Accuracy(yi-1, yi, B | x) Accuracy(yi-1, yi, B | x) = P(yi-1, yiis B | x) – (1 – P(yi-1, yi is B | x))

  16. CONTRAST - Training Maximum Conditional Likelihood Training: maximize L(w) = Pw(y | x) Maximum Expected Boundary Accuracy Training: ExpectedBoundaryAccuracy(w) = iAccuracyi Accuracyi = B1{(yi-1, yi is exon boundary B} Pw(yi-1, yi is B | x) - B’ ≠ B P(yi-1, yi is exon boundary B’ | x)

  17. Performance Comparison Human Macaque Mouse Rat Rabbit Dog Cow Armadillo Elephant Tenrec Opossum Chicken De Novo EST-assisted

  18. Performance Comparison

  19. Gene Regulation and Microarrays

  20. Overview • A. Gene Expression and Regulation • B. Measuring Gene Expression: Microarrays • C. Finding Regulatory Motifs

  21. Cells respond to environment Cell responds to environment—various external messages

  22. Genome is fixed – Cells are dynamic • A genome is static • Every cell in our body has a copy of same genome • A cell is dynamic • Responds to external conditions • Most cells follow a cell cycle of division • Cells differentiate during development • Gene expression varies according to: • Cell type • Cell cycle • External conditions • Location slide credits: M. Kellis

  23. Where gene regulation takes place • Opening of chromatin • Transcription • Translation • Protein stability • Protein modifications

  24. Transcriptional Regulation • Efficient place to regulate: No energy wasted making intermediate products • However, slowest response time After a receptor notices a change: • Cascade message to nucleus • Open chromatin & bind transcription factors • Recruit RNA polymerase and transcribe • Splice mRNA and send to cytoplasm • Translate into protein

  25. Transcription Factors Binding to DNA Transcription regulation: • Transcription factors bind DNA • Binding recognizes DNA substrings: • Regulatory motifs

  26. Promoter and Enhancers • Promoter necessary to start transcription • Enhancers can affect transcription from afar

  27. Gene Regulation with TFs Transcription Factor (Protein) RNA polymerase DNA Regulatory Element Gene

  28. Gene Regulation with TFs Transcription Factor (Protein) RNA polymerase DNA Regulatory Element Gene

  29. Gene Regulation with TFs New protein Transcription Factor (Protein) RNA polymerase DNA Regulatory Element Gene

  30. TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATTTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT

  31. TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATTTTTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATTT Exons Promoter motifs 3’ UTR motifs Introns

  32. Example: A Human heat shock protein --158 0 HSE CCAAT AP2 AP2 CCAAT SP1 SP1 TATA • TATA box: positioning transcription start • TATA, CCAAT: constitutive transcription • GRE: glucocorticoid response • MRE: metal response • HSE: heat shock element GENE promoter of heat shock hsp70

  33. DNA Microarrays Measuring gene transcription in a high-throughput fashion

  34. What is a microarray

  35. What is a microarray • A 2D array of DNA sequences from thousands of genes • Each spot has many copies of same gene • Measure number of hybridizations per spot Result: • Thousands of “experiments” – one per gene – in one go • Perform many microarrays for different conditions: • Time during cell cycle • Temperature • Nutrient level

  36. Goal of Microarray Experiments • Measure level of gene expression across many different conditions: • Expression Matrix M: {genes}{conditions}: Mij = |genei| in conditionj • Group genes into coregulated sets • Observe cells under different conditions • Find genes with similar expression profiles • Potentially regulated by same TF slide credits: M. Kellis

  37. Clustering vs. Classification • Clustering • Idea: Groups of genes that share similar function have similar expression patterns • Hierarchical clustering • k-means • Bayesian approaches • Projection techniques • Principal Component Analysis • Independent Component Analysis • Classification • Idea: A cell can be in one of several states • (Diseased vs. Healthy, Cancer X vs. Cancer Y vs. Normal) • Can we train an algorithm to use the gene expression patterns to determine which state a cell is in? • Support Vector Machines • Decision Trees • Neural Networks • K-Nearest Neighbors

  38. c a b h e d f a b d e f c g h g c2 c3 c1 d e f a b c g h Clustering Algorithms • Hierarchical c • K-means a b h e d f g slide credits: M. Kellis

  39. c a b h e d f g Hierarchical clustering • Bottom-up algorithm: • Initialization: each point in a separate cluster • At each step: • Choose the pair of closest clusters • Merge • The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y • Avoids the problem of specifying the number of clusters slide credits: M. Kellis

  40. Results of Clustering Gene Expression • CLUSTER is simple and easy to use • De facto standard for microarray analysis Time: O(N2M) N: #genes M: #conditions

  41. K-Means Clustering Algorithm • Each cluster Xi has a center ci • Define the clustering cost criterion • COST(X1,…Xk) = ∑Xi∑x Xi |x – ci|2 • Algorithm tries to find clusters X1…Xk and centers c1…ck that minimize COST • K-means algorithm: • Initialize centers • Repeat: • Compute best clusters for given centers • → Attach each point to the closest center • Compute best centers for given clusters • → Choose the centroid of points in cluster • Until the changes in COST are “small” c c2 a b c3 h e d c1 f g slide credits: M. Kellis

  42. K-Means Algorithm • Randomly Initialize Clusters

  43. K-Means Algorithm • Assign data points to nearest clusters

  44. K-Means Algorithm • Recalculate Clusters

  45. K-Means Algorithm • Recalculate Clusters

  46. K-Means Algorithm • Repeat

  47. K-Means Algorithm • Repeat

  48. K-Means Algorithm • Repeat … until convergence Time: O(KNM) per iteration N: #genes M: #conditions

  49. Mixture of Gaussians – Probabilistic K-means • Data is modeled as mixture of K Gaussians • N(1, 2I), …, N(K, 2I) • Prior probabilities 1, …, K • Different i for every Gaussian i, or even different covariance matrices are possible, but learning becomes harder • P(x) = ∑i P(x | N(1, 2I))  i • Use EM to learn parameters

  50. Analysis of Clustering Data • Statistical Significance of Clusters • Gene Ontology http://www.geneontology.org/ • KEGG http://www.genome.jp/kegg/ • Regulatory motifs responsible for common expression • Regulatory Networks • Experimental Verification

More Related