Gene Prediction in ENCODE - Roderic Guigó i Serra, CRG-IMIM-UPF, Barcelona

This article discusses the gene prediction process in the ENCODE project, focusing on the criteria used for target selection and the methods employed for manual and random target predictions. It also highlights the importance of experimental verification and compares the gencode annotation pipeline with other gene sets.

Presentation Transcript


  1. gene prediction in ENCODE • Roderic Guigó i Serra, CRG-IMIM-UPF, Barcelona • Advanced Bioinformatics, CHSL, October 2005

  3. 1% of the genome, 44 regions • target selection: a committee selected the sequence targets • manual targets: regions with a lot of existing information • random targets: stratified by non-exonic conservation with mouse and by gene density

  4. [Figure: chromosomal locations of the 44 ENCODE regions on chromosomes 1 to 22, X and Y; 14 manual targets (m001 to m014) and 30 random targets (r111 to r334)]

  5. • DNase Hypersensitive Sites • DNA Replication • Epigenetic • Genes and Transcripts • Cis-regulatory elements (promoters, transcription factor binding sites) • Long-range regulatory elements (enhancers, repressors/silencers, insulators)

  6. gencode: encyclopedia of genes and gene variants • Roderic Guigó, IMIM-UPF-CRG • Stylianos Antonarakis, Alexandre Reymond, Geneva • Ewan Birney, EBI • Michael Brent, WashU • Lior Pachter, Berkeley • Manolis Dermitzakis, Sanger • Jennifer Ashurst, Tim Hubbard • goal: identify all protein coding genes in the ENCODE regions: • identify one complete mRNA sequence for at least one splice isoform of each protein coding gene • eventually, identify a number of additional alternative splice forms

  7. the gencode annotation pipeline • manual curation: HAVANA (Sanger) • experimental verification: Geneva • bioinformatics: IMIM

  8. comparison with other gene sets (panels: ALL EXONS, CODING EXONS)

  9. from the ENCODE Chromatin and Replication Group, John Stamatoyannopoulos

  10. one gene, many proteins • very complex transcription units

  11. chimerism: tandem transcription / intergenic splicing

  12. KUA and UEV, Thomson et al., Genome Research 2000

  13. systematic search for functional chimeras in ENCODE: • 165 tandem pairs in the same orientation • 126 chimeric predictions obtained • 96 tested, at least 4 positive • Parra et al., Genome Research, in press
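
The first step of such a search, listing tandem gene pairs in the same orientation, can be sketched as follows. This is a minimal illustration, not the procedure of Parra et al.: it assumes that a tandem pair means two genes that are neighbours on the same sequence, lie on the same strand and do not overlap. The `Gene` record, the gene names and the coordinates are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def tandem_pairs(genes: List[Gene]) -> List[Tuple[str, str]]:
    """Return neighbouring, non-overlapping gene pairs on the same strand.

    Assumed definition of a tandem pair; the slides do not spell one out.
    """
    ordered = sorted(genes, key=lambda g: (g.chrom, g.start))
    pairs = []
    # Compare each gene with its right-hand neighbour in genomic order.
    for left, right in zip(ordered, ordered[1:]):
        same_chrom = left.chrom == right.chrom
        same_strand = left.strand == right.strand
        non_overlapping = left.end < right.start
        if same_chrom and same_strand and non_overlapping:
            pairs.append((left.name, right.name))
    return pairs


# Hypothetical toy annotation.
genes = [
    Gene("geneA", "chr1", 100, 200, "+"),
    Gene("geneB", "chr1", 300, 400, "+"),
    Gene("geneC", "chr1", 500, 600, "-"),
    Gene("geneD", "chr1", 700, 800, "-"),
    Gene("geneE", "chr2", 100, 200, "-"),
]
pairs = tandem_pairs(genes)
print(pairs)
```

Each pair found this way would then be a candidate for a chimeric prediction spanning both genes, to be tested experimentally.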

  14. EGASP’05 • the complete annotation of 13 regions was released on January 30 • the annotation of the remaining 31 regions was still being obtained, and it was withheld • gene prediction groups were asked to submit predictions for the remaining 31 regions by April 15 • 18 groups participated, submitting 30 prediction sets • predictions were compared to the annotations in an NHGRI-sponsored workshop at the Wellcome Trust Sanger Institute on May 6 and 7

  17. EGASP’05 • two main goals: • to assess how well automatic methods are able to reproduce the (costly) manual/computational/experimental gencode annotation • to assess how complete the gencode annotation is: are there still genes consistently predicted by computational methods?

  19. accuracy measures
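
The transcript does not preserve how the accuracy measures were defined on this slide. As an illustration only, the sketch below uses the convention common in gene prediction evaluation: a predicted exon counts as correct only if both of its boundaries match an annotated exon exactly, sensitivity is the fraction of annotated exons that are predicted correctly, and specificity is the fraction of predicted exons that are correct. The exon tuples and coordinates are hypothetical, and the measures used in EGASP'05 may differ in detail.

```python
from typing import Iterable, Tuple

# An exon is identified by (sequence name, start, end, strand).
Exon = Tuple[str, int, int, str]


def exon_level_accuracy(
    annotated: Iterable[Exon], predicted: Iterable[Exon]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    Assumed convention: an exon is a true positive only when both
    boundaries and the strand match an annotated exon exactly.
    """
    annotated_set = set(annotated)
    predicted_set = set(predicted)
    true_positives = len(annotated_set & predicted_set)
    sensitivity = true_positives / len(annotated_set) if annotated_set else 0.0
    specificity = true_positives / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity


# Hypothetical toy data: one boundary error and one spurious exon.
annotated = [
    ("ENr131", 100, 200, "+"),
    ("ENr131", 300, 400, "+"),
    ("ENr131", 500, 650, "+"),
]
predicted = [
    ("ENr131", 100, 200, "+"),
    ("ENr131", 300, 410, "+"),  # wrong end coordinate, counts as incorrect
    ("ENr131", 500, 650, "+"),
    ("ENr131", 900, 950, "+"),  # not annotated
]
sn, sp = exon_level_accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Restricting both sets to coding exons, or keeping all exons including untranslated ones, gives the two views compared on the following slides.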

  20. accuracy at the exon level: coding exons • 18 groups participated, submitting 30 prediction sets (legend: evidence-based, dual genome, “ab initio”)

  21. accuracy at the exon level: all exons • 18 groups participated, submitting 30 prediction sets (legend: evidence-based, dual genome, “ab initio”)

  22. programs are quite good at calling the protein coding exons (accuracy at 80%; not as good at calling the transcribed exons), but • the best of the programs predict correctly only 40% of the complete CDS exonic structures, and • in about 30% of the cases, they predict none of the CDS exonic structures correctly

  23. programs are quite good at calling the protein coding exons (accuracy at 80%; not as good at calling the transcribed exons), but • the best of the programs predict correctly only 40% of the complete transcripts (considering only the coding fraction) • in about 30% of the cases, they predict none of the CDS exonic structures correctly

  24. the issue of completeness

  25. many novel exons predicted: • we will prioritize a few hundred for experimental verification using RACE + RT-PCR • although our experiment in the 13 regions suggests that only a few of them are likely to be real

  26. many computational predictions outside of the annotation • in 13 ENCODE regions, 1255 unique predicted introns (exon pairs) in one or more of the 9 UCSC gene prediction tracks are not annotated • 334 (27%) are outside annotations (could correspond to novel genes)

  27. many computational predictions outside of the annotation • in 13 ENCODE regions, 1255 unique predicted introns (exon pairs) in one or more of the 9 UCSC gene prediction tracks are not annotated • 334 (27%) are outside annotations (could correspond to novel genes) • all tested by RT-PCR on 24 tissues • 25 (2.0%) confirmed by RT-PCR in 24 tissues • 16 (1.2%) with correctly predicted intron junctions • 3 (0.2%) outside annotations (1% confirmation)
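
The percentages on this slide can be rechecked from the counts it reports. The short sketch below does only that arithmetic; the variable names are illustrative. Most percentages are taken relative to the 1255 unannotated predicted introns, while the "1% confirmation" figure appears to be relative to the 334 introns outside annotations.

```python
# Counts as reported on the slide.
unannotated_introns = 1255   # predicted introns absent from the annotation
outside_annotations = 334    # of those, lying outside any annotated gene
confirmed = 25               # confirmed by RT-PCR in 24 tissues
correct_junctions = 16       # confirmed with correctly predicted junctions
confirmed_outside = 3        # confirmed and outside annotations


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


rates = {
    "outside annotations": percent(outside_annotations, unannotated_introns),
    "confirmed": percent(confirmed, unannotated_introns),
    "correct junctions": percent(correct_junctions, unannotated_introns),
    "confirmed outside": percent(confirmed_outside, unannotated_introns),
    "confirmation among outside": percent(confirmed_outside, outside_annotations),
}
for label, value in rates.items():
    print(f"{label}: {value:.2f}%")
```

The recomputed values agree with the slide after rounding; 16 of 1255 is about 1.27%, which the slide gives as 1.2%.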

  28. Overview of the verification efforts II. AFFX-GenCode: novel regions • 40 intergenic transfrags from the HL60 cell line that overlap GenCode gene predictions: • 20 overlapping gene predictions with no verification attempted by GenCode • 20 overlapping gene predictions where verification by GenCode was negative • 40 intergenic GenCode gene predictions that do not overlap HL60 transfrags: • 20 where no verification was attempted by GenCode • 20 where verification by GenCode was negative (slide by Phil Kapranov, Affymetrix)

  29. Some preliminary stats on the 80 regions: 3’ RACE only • gene predictions overlapping transfrags: total 39 (1/40 is a duplicated transfrag); 27 (69%) are positive in HL60 and 31 (80%) in HepG2 in the 3’ RACE assays • gene predictions not overlapping transfrags: total 38 (2/40 are outside of the regions where we have probes on the ENCODE array); 18 (47%) are positive in HL60 and 25 (66%) in HepG2 in the 3’ RACE assays (slide by Phil Kapranov, Affymetrix)

  30. 3’ RACE based on a predicted exon, ENr131_egasp_224555_224677, identifies new major and minor exons (shown by arrows) of the gene BC042133 in the HepG2 cell line only. Good correspondence between RACE exons and GenScan exons. (tracks: GenScan; HepG2 3’ RACE, bottom strand; HepG2 3’ RACE, top strand)

  31. high-throughput, genome-wide, unbiased transcription interrogation techniques • the ENCODE genes and transcripts group: • transfrags: Tom Gingeras (Affymetrix) and Mike Snyder (Yale) • CAGE tags: Albin Sandelin, RIKEN • ditags: Yijun Ruan, Genome Institute of Singapore

  32. Proteasome (prosome, macropain) 26S subunit, non-ATPase, 4 (inhibits cholera-induced intestinal fluid secretion) Chrom 2

  33. protein coding genes are only a fraction of the transcription detected in ENCODE

  34. transcription (apparently) not associated with protein coding genes • TRANSCRIPTION MAP of HL-60 DEVELOPMENTAL TIME COURSE (data by Tom Gingeras, Affymetrix)

  35. inferring novel protein coding genes from transfrags • THREADING TRANSFRAGS into PROTEIN CODING GENES

  46. http://genome.imim.es/gencode • HAVANA (Sanger): Jennifer Ashurst, Tim Hubbard, Adam Frankish, David Swarbreck, James Gilbert • AFFYMETRIX: Tom Gingeras, Sujit Dike, Phil Kapranov • EGASP’05: Michael Ashburner, Vladimir Bajic, Suzanne Lewis, Martin Reese, Peter Good, Elise Feingold • ENCODE • France Denoeud (IMIM), Julien Lagarde, Josep F. Abril, Robert Castelo, Eduardo Eyras • Stylianos Antonarakis (Geneva), Alexandre Reymond, Catherine Ucla • Ewan Birney (EBI), Damian Keefe, Paul Flicek • Michael Brent (WashU) • Lior Pachter (Berkeley) • Manolis Dermitzakis (Sanger)
