Eukaryotic Comparative Genomics June 2018 GEP Alumni Workshop Barak Cohen
Charles Darwin Motoo Kimura Detecting Conserved Sequences
Evolution of Neutral DNA A A T C T A A T T G C T G T G A T T C A G A G T A G C A G T G A T A A G T C T T T G A T G T T G T T G C A G G A G T A G T C G T A * * * * * * * * * * * * * * * * * * * * * * * * *
Evolution of Non-Neutral DNA A T C T A G T C C G A T G T G C G T A C C G A C C A T A A G G A T G C A C A C G T A T A C C A T G T G G T A T C C G A T C C A T A A G C A T A T C * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Multi-Species Alignment
ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA
ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA
GTACATCGATAGCTTAGAATGCTGGACGATCTC
GTACGTCGATAGCATAGAATGCTGGACGATCTC
 *    *     *   *   ***********
How to do Comparative Genomics • Choose species to analyze • Align sequences • Identify stretches of highly conserved nucleotides
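The last step above can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's own tooling: it takes the four sequences from the Multi-Species Alignment slide as already aligned, marks columns where every species agrees, and reports runs of conserved columns. The function names and the `min_len=5` threshold are assumptions made for the example.

```python
def conservation_line(aligned):
    """Return a string with '*' under every column where all sequences agree."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


def conserved_runs(marks, min_len=5):
    """Return (start, end) half-open intervals of '*' runs at least min_len long.

    min_len is an arbitrary illustrative threshold, not a value from the slides.
    """
    runs, start = [], None
    for i, mark in enumerate(marks + " "):  # trailing space closes a final run
        if mark == "*" and start is None:
            start = i
        elif mark != "*" and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Sequences copied from the Multi-Species Alignment slide (already aligned).
alignment = [
    "ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA",
    "ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA",
    "GTACATCGATAGCTTAGAATGCTGGACGATCTC",
    "GTACGTCGATAGCATAGAATGCTGGACGATCTC",
]

marks = conservation_line(alignment)
runs = conserved_runs(marks)

for seq in alignment:
    print(seq)
print(marks)
for start, end in runs:
    print(f"conserved stretch at columns {start}-{end - 1}: {alignment[0][start:end]}")
```

Counting identical columns is the simplest possible conservation measure; it says nothing about whether a run is longer than expected by chance, which is the question the binomial and phylogenetic methods later in the talk address.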
Choose species • Closely Related Species: align well; not many changes • Distantly Related Species: hard to align; lots of changes
Figure: yeast phylogeny with divergence-time labels (~10 Mya, ~20 Mya, ~150 Mya, >350 Mya). Species shown: S. cerevisiae, S. cariocanus, S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. pastorianus, S. servazzii, S. unisporus, S. exiguus, S. dairenensis, S. castellii, S. kluyveri, Kluyveromyces lactis, Schizosaccharomyces pombe
Case Study: Coding vs. Non-Coding ATG… ORF …TAA • Coding DNA: codes for protein; triplet code; open reading frame (ORF); tends to be long (50-500 bp); highly constrained • Non-Coding DNA: regulatory functions; short (5-15 bp); degenerate; variable spacing
CASE 1: Non-Coding ATG… …TAA GAL4
Closely-related sequences are uninformative ATG… GAL4

paradoxus  TCTTCTGAGACAGCATCACTTCTTCTTNTTTTTTACATAACTTATTCTTCTATAATTTTC
cerevisiae TCCTTTGAGACAGCATTCGCCCAGTATTTTTTTTATTCTACA-AACCTTCTATAATTT-C
           ** * ***********     *    * *******    **  *  ************ *

paradoxus  AACGTATTTACATAGTTCTGTATCAGTTTAATCACCATAATATTGTTTTCCCTCAACTAA
cerevisiae AAAGTATTTACATAATTCTGTATCAGTTTAATCACCATAATATCGTTTTCT-----TTGT
           ** *********** **************************** ******       *

paradoxus  TGAATGCAATTAGATTTTCTTATTGTTCCCTCGCGGCTTTTTTTTGTTTTATAATCTATT
cerevisiae TTAGTGCAATTAATTTTTCCTATTGTTACTTCG-GGCCTTTTTCTGTTTTATGAGCTATT
           * * ********  ***** ******* * *** *** ***** ******** * *****

paradoxus  TTTTCCGTCATTTCTTCCCCAGATTTCCAACTTCATCTCCAGATTGTGTCTATGTAATGC
cerevisiae TTTTCCGTCATC-CTTCCCCAGATTTTCAGCTTCATCTCCAGATTGTGTCTACGTAATGC
           ***********  ************* ** ********************** *******

paradoxus  ATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGCTACTGTCT
cerevisiae ACGCCATCATTTTAAGAGAGGACAGAGAAGCAAGCCTCCTGAAAGATGAAGCTACTGTCT
           * ** ***** ** *** * ** ****** *** ********** ***************
Distantly-related sequences do not align ATG… GAL4 Noncoding (Promoter)

cerevisiae ACTTACCAT-CAAC-CATAGATGGGTAAAC---GGTTAGTAACTAGGAACACGAT
castellii  AGA-GTCAAACTTTTCGT-ATA--TATATATAATATGTCTGATTGCTGGTT---T
           *     **  *    * *         *       *   * * *          *
Multiple sequence alignments reveal conserved elements ATG… GAL4 (annotated elements: UAS1, UAS2, UES, MIG1, MIG1)

cerevisiae   TGAGACAGCAT-CACTTCTT-CTTNTTTTTTACATAACTTATTCTTCTATAATTTTCAAC
mikatae      TGAGACAGCATTCACTTCTTTCTTTTTTTTTACATATCTTATTCTTCTATAATTTTCAAC
bayanus      TGAGACAGCATTCGCCCAGT--ATTTTTTTTAT-TCTACAAACCTTCTATAATTT-CAAA
kudriavzevii TGAGACTGCACTCCC--------TCTTCCTTTC------------TCCATAACTT---AC
             ****** ***  * *        * **  **              ** **** **   *

paradoxus    GTATTTACATAGTTCTGTATCAGTTTAATCACCATAAT------ATTGTTTTCCCTCAAC
kluyveri     GTATTTACATAGTTCTGTATCAGTTTAATCACCATAAT------ATTGTTTTCCCTCAAC
cerevisiae   GTATTTACATAATTCTGTATCAGTTTAATCACCATAAT------ATCGTTTTCTTTGT--
bayanus      TTATTTACATAGTTTTGTATCAGTTTAATCACCATAATCGTAACACCGTTTTACCTCACC
              ********** ** ***********************      *  *****   *

paradoxus    TAATGAATGCAATTAGATTTTC-TTATTGTTCCC-TCGCGGCTTTTTTTTGTTTTATAAT
kluyveri     TAATGAATGCAATTAGATTTTCCTTATTGTTCCCCTCGCGGCTTTTTTTTGTTTTATAAT
cerevisiae   ---TTAGTGCAATTAATTTTTC-CTATTGTTACT-TCG-GGCCTTTTTCTGTTTTATGAG
bayanus      TGATGCGGG--A---ATCCTTC-AGACCGTTCTC-TCGCGC-------------------
                *    *  *       ***   *  ***    *** *

paradoxus    -CTATTTTTTCCGTCATTTCTTCCCC-AGATTTCCAACTTCAT-CTCCAGATTGTGTCTA
kluyveri     ACTATTTTTTCCGTCATTTCTTCCCCCAGATTTCCAACTTCATACTCCAGATTGTGTCTA
cerevisiae   -CTATTTTTTCCGTCATC-CTTCCCC-AGATTTTCAGCTTCAT-CTCCAGATTGTGTCTA
bayanus      -CTTTTTTTTTCGTCATTTCTTCCCC-AGATCTACAACTTTAA-CTCCAGACGGTGTATA
              ** ****** ******  ******* **** * ** *** *  *******  **** **

paradoxus    TGTAATGCATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGC
kluyveri     TGTAATGCATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGC
cerevisiae   CGTAATGCACGCCATCATTTTAAGAGAGGACAGAGAAGCAAGCCTCCTGAAAGATGAAGC
bayanus      GGCAGTACAAGCAGTGCTTTTGGGAAGAGGCAAAGCTGCAGACCTCGAGAACAATGAAGC
              * * * ** **  *  * **  **   *  * **   **  ****  ***  *******
CASE 2: Coding ATG… …TAA CLN3
Identification of Multi-Species Conserved Regions (MCS) Human cccattcttttccaagtgtctccg--cctgcagcgattaggttagaaagcatttctctct Chimp cccattcttttccaagtgtctccg--cctgcagcgattaggttagaaagcatttctctct Mouse ttcagtcgtttcccagtgtctctga-cattcagagactactttagtaagcattt-tctct Rat tcagtccttccctggcatctccag-cactcaa-gactactttagtaagcattt-tctctg Dog tcaatgactttcccagtctcttctactgggaagagattaggttgcaaatcatttttctct * * * * * * ** How can we decide if this region is "conserved"? Margulies et al. (2003) Genome Res. 13:2507-18
Binomial-Based Method for Detecting Conserved Sequences
Human:  AATGG
Mouse:  AATCG
Status: CCCDC
p = probability that a site is the same between human and mouse by chance alone (Kimura); q = 1 - p.
For an alignment N base pairs long with n identities, calculate the cumulative binomial probability as:
\[ P(X \ge n) = \sum_{k=n}^{N} \binom{N}{k} \, p^{k} \, q^{N-k} \]
Margulies et al. (2003) Genome Res. 13:2507-18
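As a rough sketch of the calculation on this slide, the snippet below scores the five-base human/mouse example by summing the binomial probabilities of seeing at least n identities in N aligned sites. Reading "cumulative" as the upper tail is an assumption here, and so is the value `p=0.5`: it is only a placeholder, since in practice p would come from a neutral substitution model such as Kimura's. The function name `binomial_tail` is made up for the example.

```python
from math import comb


def binomial_tail(n_identical, length, p):
    """P(at least n_identical matches in `length` sites), each matching with prob p."""
    q = 1.0 - p
    return sum(
        comb(length, k) * p**k * q ** (length - k)
        for k in range(n_identical, length + 1)
    )


human = "AATGG"
mouse = "AATCG"

# 'C' where the two bases agree, 'D' where they differ, as on the slide.
status = "".join("C" if h == m else "D" for h, m in zip(human, mouse))
n = status.count("C")
N = len(status)

# p = 0.5 is a placeholder, NOT a value from the slides or from Margulies et al.
p_value = binomial_tail(n, N, p=0.5)
print(status, n, N, p_value)
```

A window this short can never look significant; the same function applied to longer windows gives small tail probabilities only when identities are well above the neutral expectation.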
Tree Topology Influences Power Star Phylogeny Actual Phylogeny species A species F species B species E species C species D
Challenges in larger genomes • Deciding on the neutral rate of substitution • Local differences in the neutral rate of substitution • Multiple hypothesis testing • Repeat sequences and uneven base composition
PhastCons and the UCSC Browser Olig2 100 Kb upstream of Olig2
Motif Searching Across Several Multiple Alignments Gene 2 Gene N Gene 1 Gene 3 Species 1 … Species 2 Species 3
Information Content

EcoRI     Random    Rap1
GAATTC    GCCTAC    TGTATGGGTG
GAATTC    ACATTC    TGTTCGGATT
GAATTC    TCATTC    TGCATGGGTG
GAATTC    CGACTC    TGTACAGGTG
GAATTC    GAATTC    TGTATGGATG
GAATTC    ATATCG    TGTTCGGGTT
GAATTC    GAAATG    TGTATGGGTG
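One common way to put a number on the contrast between these three site lists is per-column information content in bits: 2 minus the Shannon entropy of the bases observed in that column. The sketch below makes two simplifying assumptions that the slide does not state: a uniform background (0.25 per base) and no small-sample correction. The function and variable names are illustrative.

```python
from collections import Counter
from math import log2


def column_information(column):
    """Information content (bits) of one alignment column, uniform background assumed."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return 2.0 - entropy


def total_information(sites):
    """Sum of per-column information over a set of equal-length aligned sites."""
    return sum(column_information(col) for col in zip(*sites))


# Site lists copied from the Information Content slide.
ecori_sites = ["GAATTC"] * 7
random_sites = ["GCCTAC", "ACATTC", "TCATTC", "CGACTC", "GAATTC", "ATATCG", "GAAATG"]
rap1_sites = [
    "TGTATGGGTG", "TGTTCGGATT", "TGCATGGGTG", "TGTACAGGTG",
    "TGTATGGATG", "TGTTCGGGTT", "TGTATGGGTG",
]

for name, sites in [("EcoRI", ecori_sites), ("Random", random_sites), ("Rap1", rap1_sites)]:
    per_column = [round(column_information(col), 2) for col in zip(*sites)]
    print(name, round(total_information(sites), 2), per_column)
```

An invariant site such as the restriction site scores the maximum 2 bits at every position, while a degenerate binding site scores high only at its constrained positions.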
Weight Matrix Model of TATA Box G. Stormo
Weight Matrix Model of TATA Box Score = -24 …A C T A T A A T G T… G. Stormo
Weight Matrix Model of TATA Box Score = 43 …A C T A T A A T G T… G. Stormo
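Weight Matrix Model of TATA Box • N(b,i): number of times base b occurs at position i of the aligned sites • F(b,i): frequency of base b at position i • S(b,i) = log[F(b,i)/P(b)], where P(b) is the background frequency of base b G. Stormo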
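The N, F, S construction can be sketched as follows. The TATA-box matrix itself is not reproduced in this transcript, so the example builds a matrix from the Rap1 sites listed on the Information Content slide instead. Three choices are assumptions of this sketch rather than anything stated on the slides: a uniform background P(b) = 0.25, log base 2, and a pseudocount of 1 (added so that unseen bases do not give log of zero). All function names are illustrative, and the code expects only A, C, G, T in its input.

```python
from math import log2

BASES = "ACGT"


def count_matrix(sites):
    """N(b,i): how many times base b occurs at position i of the aligned sites."""
    width = len(sites[0])
    counts = {b: [0] * width for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            counts[base][i] += 1
    return counts


def score_matrix(sites, background=None, pseudocount=1.0):
    """S(b,i) = log2(F(b,i) / P(b)), with F smoothed by a pseudocount (an assumption)."""
    background = background or {b: 0.25 for b in BASES}
    counts = count_matrix(sites)
    n_sites = len(sites)
    scores = {}
    for b in BASES:
        scores[b] = [
            log2(((c + pseudocount * background[b]) / (n_sites + pseudocount)) / background[b])
            for c in counts[b]
        ]
    return scores


def score_window(scores, window):
    """Score one window by summing the matrix entry for each base at each position."""
    return sum(scores[base][i] for i, base in enumerate(window))


def best_hit(scores, sequence):
    """Slide the matrix along the sequence; return (best score, start position)."""
    width = len(scores["A"])
    return max(
        (score_window(scores, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Rap1 sites copied from the Information Content slide.
rap1_sites = [
    "TGTATGGGTG", "TGTTCGGATT", "TGCATGGGTG", "TGTACAGGTG",
    "TGTATGGATG", "TGTTCGGGTT", "TGTATGGGTG",
]

counts = count_matrix(rap1_sites)
scores = score_matrix(rap1_sites)
hit_score, hit_pos = best_hit(scores, "AAAATGTATGGGTGCCCC")
print(round(hit_score, 2), hit_pos)
```

Sliding the matrix one base at a time is what produces very different scores for neighboring windows of the same sequence: only the window in register with the motif scores well.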
Now we can compare motifs to each other A A C C G G T T
MAGMA: unaligned motif finding in multispecies conserved regions Gene 2 Gene N Gene 1 Gene 3 Species 1 … Species 2 Species 3 *Ihuegbu, Stormo, & Buhler, JCB 19:139, 2012