
gene prediction



Presentation Transcript


  1. gene prediction • Roderic Guigó i Serra, IMIM/UPF/CRG

  2. number of genes in chromosome 22
     • initial annotation: 545 (Dunham et al., 1999)
     • genscan+RT-PCR: 590 (Das et al., 2001)
     • genscan+microarrays: 730 (Shoemaker et al., 2001)
     • reviewed annotation: 726 (chr22 team, sanger, 2001)
     • mouse shotgun data: +20 (our data)
     • geneid predictions: 794
     • genscan predictions: 1128

  3. number of genes in human genome
     • Consortium: 30,000-40,000 (2001)
     • Celera: 27,000-38,000 (2001)
     • Consortium+Celera: 50,000 (Hogenesch et al., 2001)
     • database searches: 65,000-75,000 (Wright et al., 2001)
     • Human Genome Sciences: 90,000-120,000 (Haseltine, 2001)

  4. the human genome sequence decodificació del genoma ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATGAGCTCAGGGGCCTCTAGAAAGATGTAGCTGGGACCTCGGGAAGCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTACTCAGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCCAGCAGCAGGGGACTGGACCTGGGAAGGGCTGGGCAGCAGAGACGACCCGACCCGCTAGAAGGTGGGGTGGGGAGAGCATGTGGACTAGGAGCTAAGCCACAGCAGGACCCCCACGAGTTGTCACTGTCATTTATCGAGCACCTACTGGGTGTCCCCAGTGTCCTCAGATCTCCATAACTGGGAAGCCAGGGGCAGCGACACGGTAGCTAGCCGTCGATTGGAGAACTTTAAAATGAGGACTGAATTAGCTCATAAATGGAAAACGGCGCTTAAATGTGAGGTTAGAGCTTAGAATGTGAAGGGAGAATGAGGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGAGGGGGTGTGGAATTTGAACCCCGGGAGAGAAAGATGGAATTTTGGCTATGGAGGCCGACCTGGGGATGGGGAAATAAGAGAAGACCAGGAGGGAGTTAAATAGGGAATGGGTTGGGGGCGGCTTGGTAACTGTTTGTGCTGGGATTAGGCTGTTGCAGATAATGGAGCAAGGCTTGGAAGGCTAACCTGGGGTGGGGCCGGGTTGGGGTCGGGCTGGGGGCGGGAGGAGTCCTCACTGGCGGTTGATTGACAGTTTCTCCTTCCCCAGACTGGCCAATCACAGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTTGCTCGGTTTTCCCCGCTTCTCCCCCTCTCATCCTCACCTCAACCTCCTGGCCCCATTCAAGCACACCCTGGGCCCCCTCTTCTTCTGCTGGTCTGTCCCCTGAGGGGAAAGCCCAGGTCTGAGGCTTCTATGCTGCTTTCTGGCTCAGAACAGCGATTTGACGCTCTGTGAGCCTCGGTTCCTCCCCCGCTTTTTTTTTTTCAGCCAGAGTCTCACTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCTATTCTCCCGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCATGCCCGGCTAATTTTTTGTACTTTGAGTAGGGAAGGGGTTTCACTGTATTATCCAGGATGGTCTCTATCTCCTGACCTCGTGATCTGCCCGCCTGGCCTCCCAAAGTGCTGGAATTACAGGCGTGAGCCTCCGCGCCCGGCCTCCCCATCCTTAATATAGGAGTTAGAAGTTTTTGTTTGTTTGTTTTGTTTTGTTTTTGTTTTGTTTTGAGATGAAGTCCCTCTGTCGCCCAGGCTGGAGTGCAGTGGCTCCCAGGCTGGAGTTCAGTGGCTGGATCTCGGCTCACTGCAAGCTCCGCCTCCCAGGTTCACGCCATTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGAACATGCCACCACACCCGACTAACTTTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCAGGCTGGTCTGGAACTCCTGACCTCAGGTGATCTGCCTGCTTCAACCTCCCAAAGTGCTGGGATTACAGACGTGGGCCACCGCGCCCGGCTGGGAGTTAAGAGGTTTCTAATGCATTGCATTAGAATACCAGACACGGGACAGCTGTGATCTTTATTCTCCATCACCCCACACAGCCCTGCCTGGGGCACACAAGGACACTCAATACACGCTTTTCGGGCGCGGTGGCTCAAGCTGTAATCCCAGCACTTTGGGAGGCTGAGGCGGGTGGTACATGAGGTCAGGA
GATCGAGACCATCCTGGCTAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAAACTAGCCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGTGACACAGCGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATACACGCTTTTCCGCTAGGCACGGTGGCTCACCCCTGTAATCCCAGCATTTTGGGAGGCCAAGGTGGGAGGATCACTTGAGCCCAGGAGTTCAACACCAGACTCAGCAACATAGTGAGACTCTCTCTACTAAAAATACAAAAATTAGCCAGGCCTGGTGCCACACACCTGTGGTCCCAGCTACTCAGAAGGCTAAGGCAGGAGGATCGCTTAAGCCCAGAAGGTCAAGGTTGCAGTGAACCACGTTCAGGCCACTGCAGTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTGTAAATAAATAACGCTTTTCAAGTGATTAAACAGACTCCCCCCTCACCCTGCCCACCATGGCTCCAAAGCAGCATTTGTGGAGCACCTTCTGTGTGCCCCTAGGTACTAGCTGCCTGGACGGGGTCAGAAGGAACCTGAACCACCTTCAACTTGTTCCACACAGGATGCCAGGCCAAGGTGGAGCAACCGGTGGAGCCAGAGACAGAACCCGACGTTCGCCAGCAGGCTGAGTGGCAGAGCGGCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTGGGATTACCTGCGCTGGGTGCAGACACTGTCTGAGCAGGTGCAGGAGGAGCTGCTCAGCCCCCAGGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGGCCCTTGACCCTCCTGGTGGGCGGCTATACCTCCCCAGGTCCAGGTTTCATTCTGCCCCTGCCACTAAGTCTTGGGGGCCTGGGTCTCTGCTGGTTCTAGCTTCCTCTTCCCATTTCTGACTCCTGGCTTTAGCTCTCTGGAATTCTCTCTCTCAGTTCTGTTTCTCCCTCTTCCCTTCTGACTCAGCCTGTCACACTCGTCCTGGCGCTGTCTCTGTCCTTCACTAGCTCTTTTATATAGAGACAGAGAGATGGGGTCTCACTGTGTTGCCCAGGCTGGTCTTGAACTTCTGGGCTCAAGCGATCCTCCCACCTCGCCTCCCAAAGTGCTGGGAATAGAGACATGAGCCACCTTGCTCGGCCTCCTAGCTCTTTCTTCGTCTCTGCCTCTGCTCTCTGCGTCTGTCTTTGTCTCCTCTCTGCCTCTGTCCCGTTCCTTCTCTCTTGGTTCACTGCCCTTCTGTCTCTCCCTGTTCTCCTTAGGAGACTCTCCTCTCTTCCTTCTCGAGTCTCTCTGGCTGATCCCCATCTCACCCACACCTATCC

  5. the amino acid sequence of the proteins QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP

  6. Gene structure • promoter • exons • introns • ‘upstream’ regulatory element • ‘downstream’ regulatory element

  7. From DNA to RNA

  8. From RNA to protein
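
The two steps named on slides 7 and 8 can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a coding-strand DNA input read in frame from its first base; the function names `transcribe` and `translate` are illustrative and are not taken from the presentation. The example input is a short fragment that appears in the slide 4 sequence.

```python
# Minimal sketch (assumptions: standard genetic code, coding-strand input,
# reading frame starts at the first base).

BASES = "TCAG"
# Amino acids of the standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    seq = mrna.upper().replace("U", "T")  # the lookup table uses the DNA alphabet
    protein = []
    for pos in range(0, len(seq) - 2, 3):
        residue = CODON_TABLE[seq[pos:pos + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


fragment = "ATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGGCA"
print(translate(transcribe(fragment)))  # MKVLWAALLVTFLA
```

In a real gene the coding sequence is interrupted by introns, so a translation like this only applies after the exons have been identified and joined.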

  9. Molecular mechanism

  10. Prediction of splice sites
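
The slide gives only a title, so the following is a hedged illustration rather than the method presented. Most introns begin with the dinucleotide GT (donor site) and end with AG (acceptor site); the sketch simply lists every position where those dinucleotides occur. Real predictors score the surrounding sequence context, which this sketch does not attempt. The function name `candidate_splice_sites` is hypothetical.

```python
# Minimal sketch: list candidate splice sites by canonical dinucleotide only.
# Assumption: canonical GT...AG introns; no scoring of flanking context.

def candidate_splice_sites(seq: str) -> tuple[list[int], list[int]]:
    """Return 0-based start positions of GT (donor) and AG (acceptor) dinucleotides."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


donors, acceptors = candidate_splice_sites("CAGGTATGG")
print(donors, acceptors)  # [3] [1]
```

Because GT and AG occur by chance throughout any sequence, a scan like this produces mostly false candidates, which is why context scoring is needed.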

  11. accuracy of gene prediction programs

  12. accuracy of gene prediction programs

  13. accuracy of gene prediction programs

  14. comparative gene prediction
      • rosetta (Batzoglou et al., 2000)
      • cem (Bafna and Huson, 2000)
      • sgp1 (Wiehe et al., 2000)
      • twinscan (Korf et al., 2001)
      • slam (Pachter et al., 2001)
      • sgp2 (Guigó et al., in preparation)

  15. syntenic gene prediction (sgp2) • figure tracks: tblastx HSPs, HSP projections, query sequence, geneid exons, SGP exons

  16. benchmarking sgp2 - accuracy scimog mit

  17. Predicting “novel” genes in the human genome
      • Golden Path Oct 7, 2000 freeze. RepeatMasked TraceDB, as of February 2001
      • figure tracks: golden path annotations, additional blastn matches to ENSEMBL + REFSEQ, tblastx, geneid exons, sgp genes

  18. “novel” genes?
      • 48,890 genic regions (known genes or similar)
      • 15,489 genes longer than 100 aa predicted by sgp
      • 13,302 non-redundant predictions
      • 8,416 supported by tblastx hits to mouse 1.5
      • 3,331 predicted genes with at least two exons supported by tblastx hits
      • + 719 predicted genes supported by tblastx hits covering at least 75% of the prediction
      • = 4,050 supported sgp predictions, 25% of them not overlapping genscan predictions

  19. validation of predictions

  20. Experimental validation

  21. human genome vs. Mouse traceDB • panels: chr22, chr21

  22. human genome vs. Mouse assemblies

                     SN    SP    CC    SNe   SPe   SNSP  ME    WE
      chr22.assem.   0.87  0.65  0.75  0.69  0.54  0.62  0.14  0.33
      chr22.shot.    0.82  0.66  0.72  0.63  0.54  0.58  0.20  0.31
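
The slide does not define its column headers. Assuming they follow the usual conventions for evaluating gene finders, SN, SP and CC are nucleotide-level sensitivity, specificity (the fraction of predicted coding bases that are truly coding) and correlation coefficient; SNe and SPe are the exon-level equivalents, SNSP is their average, and ME and WE are the fractions of missing and wrong exons. The Python sketch below computes the three nucleotide-level measures under that assumption; the function name and the toy data are illustrative only.

```python
# Minimal sketch of nucleotide-level accuracy measures.
# Assumption: SP is TP / (TP + FP), as is conventional in gene-finding evaluation.
from math import sqrt


def nucleotide_accuracy(annotated: list[int], predicted: list[int]) -> tuple[float, float, float]:
    """Return (SN, SP, CC) from per-base 0/1 coding labels of equal length.

    Assumes every count combination in the denominators is non-zero.
    """
    pairs = list(zip(annotated, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # coding, predicted coding
    tn = sum(1 for a, p in pairs if not a and not p)  # non-coding, predicted non-coding
    fp = sum(1 for a, p in pairs if not a and p)      # non-coding, predicted coding
    fn = sum(1 for a, p in pairs if a and not p)      # coding, predicted non-coding
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    cc = (tp * tn - fn * fp) / sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return sn, sp, cc


annotated = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
sn, sp, cc = nucleotide_accuracy(annotated, predicted)
print(sn, sp, round(cc, 3))  # 0.75 0.75 0.583
```

Under the same assumption, the SNSP column is consistent with the table: for chr22.assem., (0.69 + 0.54) / 2 = 0.615, reported as 0.62.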

  23. testing novel predictions experimentally • 81 predictions in total. For 40 of them, adjacent exon pairs were selected for RT-PCR

  24. preliminary results

  25. acknowledgments
      IMIM-UPF-CRG, Barcelona
      • Josep F. Abril • Genís Parra • Roderic Guigó
      GlaxoSmithKline, King of Prussia
      • Pankaj Agarwal
      Max Planck Institute for Chemical Ecology, Jena
      • Thomas Wiehe
      Whitehead Institute/MIT Center for Genome Research, Cambridge
      • Gwen Acton • Dan Brown • Kerstin
      Mouse Sequence Consortium
