Comparative Sequence Analysis. www.dcode.org. Ivan Ovcharenko, Lawrence Livermore National Laboratory. BioQUEST Workshop, Beloit, June 2004.
Comparative genomics • Evolution of noncoding elements • Aligning vertebrate genomes • Function of the human gene deserts • Redefining comparative sequence analysis • Phylogenetic shadowing • Transcriptional gene regulation
The Genome Sequence: The Ultimate Code of Life ~ 50% is junk(repetitive elements) only 3% is coding for proteins the function of the rest ~47% (noncoding, nonrepetitive DNA) is unknown >hg16_dna range=chr11:31781924-31785923 TCAGGAACTTTGAAATGTTTTAAAACCCCAACTTTCTCCCCCATTTAAAC AGGCGGATTCATCGGCACTGGCCACCATATGGGCCCTTGGAGATCTATTG AGATGACCACCAACACTTGAATAGCGAGGGGCTGCTTTTCAGCGCTGCAC AATGCCCCGCGAGTAAGGGAAACTATTAAACTCCTGGGGCAGGAGCGTTG GCAAACTTTCGTGGGCAGAATTTTGAGGCTACAATGAGCGCGGACAACAA AAGGATTCTCTTGAGGCGTGCAGCGGGCCACATTGTGTTACAAGAAGCCC AGTCAACAGACTTTTCAGTGAAGTGTGTTAACCCCTCTGCTCTGCTATCA TTAATCACTGTCCGAAGAGCGGGCGCCTCCGTGCTATTTAGGGCGCTTGG CTGGGGGGATGGAGGGTGGATGGGGGGGCCAGGGCCCAGCATGGGGGGAG GCAGGGAGAGTGGACGGGGACCAGGGCTGGGTTCCTACATAGAGGAGATG GAGGGGAGGCAGGATGGAAACCAGCGGTGGGGGTGGAAGCAAGGGGGAAG GATTGGGGGGCCTGGGTTAGGGGAAAGACAGAGGGCGATGGAGGGAAAAA GAGGGCGATGGAGGGGAAAAGAAGGCTCAAAAAACATAGAGGCTAGAAAG GTATTTTTAAAAAAGGACAGAAAAGAATGCTGAGAGGAAAAAGAGACACG AGGGCCGAACAAGAGTGGGAGAGAGAGGAAAAGGAGGATGAGGGCCAGAG AATATTAGTAACTGAGCCCCATCTGGACTCTGGGTCTTTGCACTCCATCA GAAAGGTGGGGGTCGAGGAGGGCTACTTAGCTGAGGGAGACGCGCTCCGC TCACGTGTGCGGGCACAAGCGTCTGTGCTAATTTACTGCCCCAAGTTTCC GGGGACTTTTCAAAGCGTTTTTCAAGGGAAGAAATGAAGCGACCACCCCC ACCCCTCGCTTTATTTTCGGGTTTGGTGAAGAAGGAAGACTGGAAATAGC TCCTTTTGGCCAACTAGAAAGGCCGGAGGGTTATTGCTTTTGGAAAACAG ACAAAAATCTGTGCACATCTGGTATGGGGTGGGGGACACTGAGGAGAACA CAATGCCCATCTCCCCATGGCCACTCATGCCCATGCCTTCCTAGGGGCCC CATCTCGGTCCCTTTTCTGGCACATTCGATCTCGCCAATTAAACAAAGTT GCCCGAATCTGCCTCCGAAGAACCCCGCCGATAGCATGCTCTGCTCTCAT TTGCCTCTTTGACATTTTCTTAATTTTAAAACATGGAGATTCACATTCTT ATCCATGTTCTGTCTCACACAAACATACACACGGGTTTACACAGGCAGCA CGCGATCGCCGCCAGGCCCTGTGCTGCCTCCAGAACTGACACTTAAGAGA GAAAAGTCAGCAGGGACAGTAGAGCTCAATTTTAAATCTGGAAAAAAAAA AAAAAAAAAAAAAGATGGGAAGCGGGGATTGGAATTCCACAGCAAAAAGA AACCTGTCGCTGCAGGATCCCTTCTCTACCCCGCGGGGAGAGCGGCACGG AGACAGTTCATTACTTTAGAAGTGGCAACTGTTTGCAGCCAGGCGGTGAC CTAGCGGCTGCTCTTACATAAAATGGGTACATTTCCCCCCACTTTAGTGG ATTTGCCTTCCACTCTTAAAGCTTTTAACAAAATAAAACTAGAAGTTGGA 
TCTCGACTCCCCCACCCCCACGATAAACCTAAGTGGTGGACAATTAAGAT ATCTTCTTCAAAAGGCGCCCCCTCGGAGCCGCGCAAAGCAGGGGCCTTCA GTGGGTGCCGTTCACCTTCCAGCCTAATCCGTGAGAAAGCGAGTGAAAGC GCCTCCCATTATCCCAGCCCCAGGACCATCTGACGATGGGAATAGGATTT GTTTCCTGGAAGGAGGTGAGAGAGAGAGAGAGAGAGAGAGACAGAGAGAG
Comparative Sequence Analysis
Biologically functional regions of the genome tend to stay conserved through evolution. Therefore, by aligning homologous sequences from different but related species, we can identify Evolutionary Conserved Regions (ECRs) with putative functional importance.
[Figure: timeline labeled 1880s, 1920s, 1950s, 2000s]
Evolution of the genomic code
Genomic modifications empowered evolution:
• mutations
• insertions / deletions
• duplications
• rearrangements
• …
Functional regions of the genome accumulated fewer mutations: natural selection eliminated species with mutations altering the critical function of important elements.

A functional element (uppercase), aligned across millions of years of evolution:
actgactgactgATATTGACAgtttgttgttgttaa
agggacaaactgATATTGACAgt---ttgttgttaa
aggg--aaactgATATTGACAgt---ttgaaattaa
tggg--aaaccaATATTGACAgt-actcgaaattaa
tggg--aaaccaATATTGACAgt-actcgaaatgta

Functionally important elements in the DNA stayed conserved through evolution.
How to find evolutionary conserved elements?
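One simple way to make "stayed conserved" concrete is to mark the alignment columns in which every sequence carries the same base. The sketch below does this for the five aligned rows shown on this slide; the function names are hypothetical, and treating any gap as "not conserved" is an assumption of this example rather than something the slide specifies.

```python
def conserved_columns(alignment):
    """Return one boolean per column: True when all rows share the same base.

    Comparison is case-insensitive, and a column containing a gap ('-')
    is never counted as conserved (an assumption of this sketch).
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    flags = []
    for column in zip(*alignment):
        bases = {base.upper() for base in column}
        flags.append(len(bases) == 1 and "-" not in bases)
    return flags


def longest_conserved_run(alignment):
    """Return (start, end) of the longest run of conserved columns, end exclusive."""
    best = (0, 0)
    run_start = None
    # A trailing False closes a run that reaches the end of the alignment.
    for i, flag in enumerate(conserved_columns(alignment) + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


rows = [
    "actgactgactgATATTGACAgtttgttgttgttaa",
    "agggacaaactgATATTGACAgt---ttgttgttaa",
    "aggg--aaactgATATTGACAgt---ttgaaattaa",
    "tggg--aaaccaATATTGACAgt-actcgaaattaa",
    "tggg--aaaccaATATTGACAgt-actcgaaatgta",
]
start, end = longest_conserved_run(rows)
print(start, end, rows[0][start:end])
```

On these rows the longest run covers the uppercase element plus the two bases that follow it, since those two columns are also identical in all five sequences.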
Sequence Alignment

Human ACTTTACGGGATCTATCTATACCGGTAACGTAATCCGATACCAGT
Mouse ACTTTACGGGATCTCTCTATACCGGTAAAAAAAATTTAGT

Step 1 - find matches
Human ACTTTACGGGATCTATCTATACCGGTAACGTAATCCGATACCAGT
      |||||||||||||| ||||||||||||
Mouse ACTTTACGGGATCTCTCTATACCGGTAAAAAAAATTTAGT

Step 2 - find mismatches
Human ACTTTACGGGATCTATCTATACCGGTAACGTAATCCGATACCAGT
      ||||||||||||||:||||||||||||
Mouse ACTTTACGGGATCTCTCTATACCGGTAAAAAAAATTTAGT

Step 3 - insert gaps to linearize the alignment
Human ACTTTACGGGATCTATCTATACCGGTA----ACGT--AATCCGATACCAGT
      ||||||||||||||:||||||||||||    |::|             |||
Mouse ACTTTACGGGATCTCTCTATACCGGTAAAAAAAATTT-----------AGT
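In practice the three steps are usually solved together by dynamic programming. The following is a minimal sketch of textbook global alignment (Needleman-Wunsch), not necessarily the method used by any tool named in this talk. The scoring values (match +1, mismatch -1, gap -1) and the function names are illustrative assumptions; real aligners use more elaborate scoring, so the output need not reproduce the slide's alignment.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def match_line(x, y):
    """'|' for identical bases, ':' for mismatches, ' ' where either row has a gap."""
    symbols = []
    for p, q in zip(x, y):
        if p == "-" or q == "-":
            symbols.append(" ")
        else:
            symbols.append("|" if p == q else ":")
    return "".join(symbols)


human = "ACTTTACGGGATCTATCTATACCGGTAACGTAATCCGATACCAGT"
mouse = "ACTTTACGGGATCTCTCTATACCGGTAAAAAAAATTTAGT"
total, top, bottom = global_align(human, mouse)
print("Human", top)
print("     ", match_line(top, bottom))
print("Mouse", bottom)
```

A linear gap penalty keeps the sketch short; affine gap penalties (a separate cost for opening and extending a gap) are the more common choice for genomic DNA.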
Conserved Elements

Human ACTTTACGGGATCTATCTATACCGGTA----ACGT--AATCCGATACCAGT
      ||||||||||||||:||||||||||||    |::|             |||
Mouse ACTTTACGGGATCTCTCTATACCGGTAAAAAAAATTT-----------AGT
      CONSERVED                  DIVERGED

Numeric criteria of conservation: minimal percent identity over minimal length
• Current case: 95% / 30 bps
• Common criteria: 70% / 100 bps
• General: ????
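The "minimal percent identity over minimal length" criterion can be sketched as a sliding window over the gapped alignment: every window of the minimal length that reaches the identity threshold is kept, and overlapping windows are merged into one region. The function name, the half-open coordinates, and the choice to count gap columns as non-matches are assumptions of this sketch. The toy call uses a 27-column window because that is the width of the conserved block in the slide's example.

```python
def find_ecrs(aligned_a, aligned_b, min_identity_pct, min_length):
    """Find conserved regions in a pairwise alignment.

    A window of `min_length` alignment columns qualifies when at least
    `min_identity_pct` percent of its columns are identical bases.
    Gap columns count as non-matches. Qualifying windows that overlap or
    touch are merged. Returns a list of (start, end) column ranges,
    end exclusive.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    n = len(aligned_a)
    is_match = [x == y and x != "-"
                for x, y in zip(aligned_a.upper(), aligned_b.upper())]
    regions = []
    if min_length <= 0 or n < min_length:
        return regions

    matches = sum(is_match[:min_length])
    for start in range(n - min_length + 1):
        if start > 0:
            # Slide the window one column to the right.
            matches += is_match[start + min_length - 1] - is_match[start - 1]
        # Integer comparison avoids floating-point edge cases at the threshold.
        if matches * 100 >= min_identity_pct * min_length:
            end = start + min_length
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


human = "ACTTTACGGGATCTATCTATACCGGTA----ACGT--AATCCGATACCAGT"
mouse = "ACTTTACGGGATCTCTCTATACCGGTAAAAAAAATTT-----------AGT"
print(find_ecrs(human, mouse, min_identity_pct=95, min_length=27))
```

With the common 70% / 100 bps criteria the call would be `find_ecrs(a, b, 70, 100)` on a real, much longer alignment.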
Huge alignments
[Slide background: two screen-filling human/mouse pairwise alignment printouts with coordinate rulers]
How to use them efficiently?
Graphical conservation profiles

1. Percent identity plots
PipMaker: http://bio.cse.psu.edu/pipmaker/
Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W. Genome Research, 2000

2. Smooth graphs
Vista: http://www.gsd.lbl.gov/vista
Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS, Dubchak I. Bioinformatics, 2000

Figure annotations:
• 80%, 100 bps alignment block
• The vertical coordinate gives the average percent identity in a 100 bps window centered at a given nucleotide
• Colored regions correspond to areas of evolutionary conservation
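The smooth-graph idea (average percent identity in a window centered at each nucleotide) can be sketched in a few lines. This is only an illustration of the windowed average, not the code of either tool; the function name is hypothetical, windows are simply truncated at the alignment ends, and gap columns are counted as non-matches, all of which are assumptions of this example.

```python
def conservation_profile(aligned_a, aligned_b, window=100):
    """Percent identity in a window centered on each alignment column.

    Returns one value (0.0 to 100.0) per column. Windows that would run
    past either end of the alignment are truncated.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    n = len(aligned_a)
    is_match = [x == y and x != "-"
                for x, y in zip(aligned_a.upper(), aligned_b.upper())]

    # prefix[i] = number of matching columns among the first i columns
    prefix = [0]
    for flag in is_match:
        prefix.append(prefix[-1] + flag)

    profile = []
    for i in range(n):
        lo = i - window // 2
        hi = lo + window
        lo, hi = max(0, lo), min(n, hi)
        profile.append(100.0 * (prefix[hi] - prefix[lo]) / (hi - lo))
    return profile


# Tiny illustration with a 3-column window; real plots use window=100.
for column, value in enumerate(conservation_profile("AAAA", "AATA", window=3)):
    print(column, round(value, 1))
```

The prefix sums make each window an O(1) lookup, which matters when the profile is computed for every base of a long genomic interval.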
From Comparative Genomics to Genome Biology
Experimental assessment of the biological function of evolutionary conserved regions
5q31 region: 245 conserved elements (>70%, >100 bp)
• 155 exons
• 90 noncoding
ECR-1: 401 bp, 84%
[Figure: conservation plot (50% to 100%) of the 5q31 region; labels: Cyclin I-homolog, KIF3, IL-4, IL-13, RAD50; legend: Exons, Conserved Non-Transcribed Sequences]
Removal of the ECR-1 from the mouse genome
[Figure: locus diagram; labels: IL 4, IL 13, ECR-1, LoxP, LoxP; panels: ECR-1 wild type, ECR-1 knockout]
Expression of 3 cytokines reduced in ECR-1 knockout
[Figure: locus map; labels: ECR-1, IL4, IL13, Rad50, IL5; distances: 10 kb, 6 kb, 120 kb]
[Figure: bar charts of IL4, IL13 and IL5 levels (pg/ml), WT vs -ECR1/-ECR1]
Loots et al., Science (2000)
Human genome
GENOME is a huge book of life: a sequence of 3 billion letters from a short, 4-letter alphabet: A, C, T, and G
24 CHROMOSOMES form chapters: the average chromosome is ~100 million letters long
GENES are short stories: every chromosome has ~1,000 genes; every gene has a function in the human body
[Slide background: a screen-filling block of raw DNA sequence]
Sequenced vertebrate genomes human x4 ~ 80 MY mouse x4 rat x4 ~ 400 MY fugu x0.5 zebrafish x2 tetraodon x0.5
Comparing Genomes 1. Mask out repetitive elements (RepeatMasker) 2. Map syntenic regions in two genomes (BLAT) 3. Align syntenic regions (BLASTZ) 4. Visualize alignments (ECR Browser)
Times to align human and mouse genomes (3 Gb vs 3 Gb)

Mapping/Aligning   Location               Time
Blastz/Blastz      UCSC Genome Browser    1000 days (3 years)
Blat/Avid          Vista Genome Browser   1 month
Blat/Blastz        ECR Browser            <5 days

http://ecrbrowser.dcode.org/
Why do we want to align genomes faster?
Human vs mouse and human vs fugu genome comparisons
• Human vs mouse: 10% of the human genome is conserved; over 1,000,000 ECRs (Evolutionary Conserved Regions)
• Human vs fugu: 0.2% of the human genome is conserved; 41,067 ECRs
Why do we observe so many conserved elements? How many of them are functional?
Is fugu an ideal organism for finding regulatory elements in the human genome?
Clean dataset of regulatory elements
14,680 non-exonic ECRs
  filter: gene predictions (Ensembl, Genscan, FGENESH++, Sanger22, Acembly, Twinscan); human / nonhuman mRNA
4,110 ECRs
  filter: pseudogenes; each human/fugu ECR is required to have a corresponding human/mouse ECR
1,885 ECRs
• 146 promoter ECRs
• 1,739 distant regulatory ECRs
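The last filtering requirement on this slide (a human/fugu ECR must have a corresponding human/mouse ECR) amounts to an interval-overlap test in human coordinates. A minimal sketch follows, assuming both ECR sets are given as half-open (start, end) intervals on the same human chromosome. The function names are hypothetical, and "corresponding" is interpreted here as "overlapping by at least one base", which is an assumption.

```python
def overlaps(a, b):
    """True when half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def keep_supported(fugu_ecrs, mouse_ecrs):
    """Keep human/fugu ECRs that overlap at least one human/mouse ECR.

    Both arguments are lists of (start, end) intervals in human coordinates
    on one chromosome. This is a simple quadratic scan; sorted intervals or
    an interval tree would be preferable at genome scale.
    """
    return [ecr for ecr in fugu_ecrs
            if any(overlaps(ecr, other) for other in mouse_ecrs)]


# Made-up coordinates, for illustration only.
fugu = [(100, 250), (1000, 1200), (5000, 5100)]
mouse = [(50, 400), (1150, 1600), (9000, 9500)]
print(keep_supported(fugu, mouse))
```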
Gene Deserts in the Human Genome
Identification of Gene Deserts
Intergenic interval:
- no RefSeq or Ensembl genes (20k genes, 195k exons)
- no sequence gaps
Gene deserts: the longest 3% of intergenic intervals
• 25% of the human genome sequence
• Gene deserts range from 0.5 Mb to 4 Mb
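The definition on this slide can be sketched as two steps: derive intergenic intervals from gene coordinates, then keep the longest 3%. The sketch below assumes genes on a single chromosome given as half-open (start, end) pairs, ignores sequence gaps and the chromosome ends, and uses hypothetical function names; it illustrates the definition and is not the pipeline used for the talk.

```python
def intergenic_intervals(genes):
    """Return the intervals between consecutive genes on one chromosome.

    `genes` is a list of (start, end) half-open intervals. Overlapping or
    nested genes are merged implicitly. Regions before the first gene and
    after the last gene are not reported.
    """
    intervals = []
    covered_until = None
    for start, end in sorted(genes):
        if covered_until is not None and start > covered_until:
            intervals.append((covered_until, start))
        covered_until = end if covered_until is None else max(covered_until, end)
    return intervals


def gene_deserts(intervals, top_percent=3):
    """Return the longest `top_percent` percent of intervals (at least one)."""
    if not intervals:
        return []
    ranked = sorted(intervals, key=lambda iv: iv[1] - iv[0], reverse=True)
    count = max(1, len(ranked) * top_percent // 100)
    return ranked[:count]


# Made-up gene coordinates, for illustration only.
genes = [(0, 10), (5, 20), (100, 110), (120, 130), (1000, 1010)]
gaps = intergenic_intervals(genes)
print(gaps)
print(gene_deserts(gaps))
```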
Relative size of the human gene deserts: 0.5 Mb … 4 Mb
• Mycoplasmas: 0.6 Mb, ~600 genes. Smallest living organisms (0.1 µm)
• Cyanobacteria (blue-green algae): 3.5 Mb, ~4,000 genes. Converts CO2 into O2
• E. coli: 4.6 Mb, ~4,500 genes. Common inhabitant of the human intestine
• Yersinia pestis: 4.8 Mb, ~4,000 genes. Causes plague
gene desert gene desert gene desert present in mouse genome
GC content and SNP density

SNP density (N/Mb)
Gene deserts        459.0
Regular intergenic  316.5

Low GC content and increased SNP density suggest a decreased amount of functional elements in gene deserts.
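Both quantities on this slide are simple ratios. A minimal sketch of how they could be computed is given below; the function names are hypothetical, and ignoring N and other ambiguous symbols when computing GC content is an assumption of this example.

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T).

    N and any other symbols are ignored. Returns 0.0 for a sequence
    with no unambiguous bases.
    """
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    return gc / (gc + at) if gc + at else 0.0


def snp_density_per_mb(snp_count, region_length_bp):
    """Number of SNPs per megabase (the N/Mb unit used on the slide)."""
    if region_length_bp <= 0:
        raise ValueError("region length must be positive")
    return snp_count / (region_length_bp / 1_000_000)


print(gc_content("GGCCAATTNN"))
print(snp_density_per_mb(918, 2_000_000))
```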
DACH gene desert on human chr13 DACH 1,330 kb 876 kb 430 kb Over 1,000 human/mouse ECRs! Could any of them be functional?
Identification of 13 distal human/fugu ECRs around Dachshund
[Figure: two conservation plots (50%, 75%, 100%), one spanning 800 Kb with labels Dachshund and FLJ, one spanning 30 Kb within Dachshund]
DACH gene pattern of expression Brain / CNS Eyes Limbs
Testing the function of ECRs: LacZ transient transgenics
[Figure: reporter construct; labels: Hsp68, LacZ]
Nobrega et al., Science 2003
How to find core enhancers without comparisons to distant species
SOM: a highly conserved developmental transcription factor
• 65 human/mouse ECRs
• No human/fugu ECRs
• No human/chicken ECRs
Genes ‘flanked’ by human/fugu ECRs
[Figure: two alternative arrangements of ECRs around Gene A, Gene B, Gene C]
Only 5.6% of the human genes are flanked by human/fugu noncoding ECRs
Fugu ECRs Mouse ECRs Core ECRs Human/mouse counterparts of human/fugu ECRs
Core ECRs: 350 bps / 77%
1. Recapitulate ~90% of the human/fugu ECRs
2. Reduce 10-fold the number of putative enhancers (human/mouse ECRs)
[Figure: plot with axes Length and Percent identity]
Ovcharenko et al., Genomics, in press
98% sequence identity How to find functional elements? Comparative sequence analysis of closely related organisms Human vs Chimp
Different species accumulated differences independently Allen CTCGTCCAGTCTGGAGTGCAGTGGCGCGATCGCAGCTCACCGCAATGTCCGCCTCCCGGG 147 Green CTCGTCCAGTCTGGAGTGCAGTGGCGCGATCGCAGCTCACCGCAACGTCCGCCTCCCAGG 76 Human CTTGCTCAGGCTGGAGTGCAGTGGCATGATCTTGGCACACTGCAACCTCCACCTCCCGTG 281 Chimp CTTGCTCAGGCTGGAGTGCAGTGGCATGATCTTGGCTCACCGCAACCTCCACCTCCCGTG 270 Orangutan CTCGCTCAGGCTGGAGTGCAGTGGCGTGATCTTGGCTCACCGCAACCTCCACCCCCCGGG 193 Colobus TTCGTCCAGTCGGGAGTGCTGTGGCGCGATTGCAGCTCACGGCAACGTCCGCCTCCCGGG 214 Douc TTCATCCAGTCTGGAGTGCAGTGGCGCAATCGCAGCTCACCACAATGTCCGCTTCCCGGG 211 Francois TTCGTCCAGTCTGGAGTGCAGTGGCGCGATTGCAGCTCACCGCAACGTCCGCCTCCCGGG 77 Drill CTTGTCCAGTCTGGAGTGCAGTGGTGCGATCGCAGCTCACCGCAACGTCCGCCTCCCGGG 186 Mangabey CTCGTCCAGTCTGGAGTGCAGTGGTGCGATCGCAGCTCACCACAACGTCCGCCTCCCGGG 75 Owl TTCACCCAGGCTGGAGTACAGGGGCATGATCTCAGCTCACTGCAACCTCCACCTCCAAGG 191 Squirrel TTCACCCAGGCTAGAGTACAGTGGCATGATCTCAGCTCACTGCAACCTCCACCTCCAAGA 76 Tamarin TACCCCCCGGGTGGAATACCGGGGCATGATCTCAGCTCACTGCAACCTCCACCTCCCAGG 212 Titi TTCACCCAGGCTGGAGTACAGTGGCATGATCTCAGCTGACTGCAACCTTCACCTCCAAGG 202 * * ** * * * ** ** ** ** *** * * * **
2-state trainable HMM to identify conserved elements using the sequence of complete matches (the columns marked * in the alignment)
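To illustrate how a 2-state HMM can segment the sequence of complete matches, the sketch below decodes the most likely path of "conserved" and "fast-evolving" states with the Viterbi algorithm. It is not the model from the talk: the slide describes a trainable HMM, whereas this example does no training, and all probabilities (0.9, 0.5, 0.9) as well as the function names are made-up assumptions chosen only to show the mechanics.

```python
import math


def complete_match_track(alignment):
    """1 where every row of the alignment has the same base, else 0."""
    track = []
    for column in zip(*alignment):
        bases = {base.upper() for base in column}
        track.append(1 if len(bases) == 1 and "-" not in bases else 0)
    return track


def viterbi_two_state(matches, p_match_slow=0.9, p_match_fast=0.5, p_stay=0.9):
    """Most likely state path for a 0/1 match track.

    State 1 = slowly evolving (conserved), state 0 = fast evolving.
    All parameter values are illustrative assumptions, not trained values.
    """
    observations = [int(bool(x)) for x in matches]
    if not observations:
        return []
    # emit[state][observation] in log space
    emit = {
        0: (math.log(1 - p_match_fast), math.log(p_match_fast)),
        1: (math.log(1 - p_match_slow), math.log(p_match_slow)),
    }
    stay, switch = math.log(p_stay), math.log(1 - p_stay)

    score = [math.log(0.5) + emit[s][observations[0]] for s in (0, 1)]
    back = []  # back[t][s] = best previous state for state s at step t + 1
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            candidates = [score[prev] + (stay if prev == s else switch)
                          for prev in (0, 1)]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + emit[s][obs])
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


track = [1] * 20 + [0] * 20
print("".join(str(s) for s in viterbi_two_state(track)))
```

Working in log space avoids numerical underflow on long alignments; training the emission and transition probabilities (for example with Baum-Welch) is deliberately left out of this sketch.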
Minimum number of primates?
[Figure panels: mouse; 14 primates; 1 primate (Alouatta seniculus)]
I. Ovcharenko et al., Genome Research, 14(6), 2004
HB1 Human/baboon alignments identify primate-specific regulatory elements
Noncoding ECRs & Transcriptional Gene Regulation
Transcriptional gene regulation
[Figure: Gene X shown twice, once with Limbs regulatory module A and once with CNS regulatory module B]
Regulatory module structure
[Figure: transcription factors bound to a gene regulatory module]
actgactgactgatattgacagtttgttgttgttaa
Footprints, or binding sites, are known for many transcription factors, and they are extremely short (~6-10 bp).
Computational predictions of transcription factor binding sites are overwhelmed with false positives.
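A back-of-the-envelope calculation shows why such short sites produce so many false positives. The sketch below estimates how often one exact k-mer is expected to occur by chance in a random sequence, assuming equal base frequencies and independent positions (simplifying assumptions; real genomes are not uniform, and real binding sites are degenerate, which makes chance matches even more frequent). The function name is hypothetical.

```python
def expected_random_hits(sequence_length, motif_length, both_strands=True):
    """Expected number of chance occurrences of one exact motif.

    Assumes each base is A, C, G or T with probability 1/4, independently.
    Counting both strands doubles the number of positions searched (this
    over-counts for reverse-palindromic motifs).
    """
    positions = max(0, sequence_length - motif_length + 1)
    per_strand = positions * 0.25 ** motif_length
    return 2 * per_strand if both_strands else per_strand


# Chance matches in a 3-billion-letter genome for site lengths of 6 to 10 bp.
for k in (6, 8, 10):
    print(k, round(expected_random_hits(3_000_000_000, k)))
```

For a 6 bp site this gives roughly 1.5 million expected chance matches across both strands, far more than the number of genuinely bound sites one would expect for a single transcription factor.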
Human ACTTTGATACATCTATCTATA
      ||||||||||||||:||||||
Mouse ACTTTGATACATCTCTCTATA

Human ACTTTGATACATCTATCTATA
      |||||
Mouse ACTTT----------------

Human ACTTTCCTACATCTATCTATA
      |||||::|||||||:||||||
Mouse ACTTTGATACATCTCTCTATA

Human -----GATACATCTATCTATA
           |||||
Mouse ACTTTGATAC-----------