
ENCODE: understanding our genome



Presentation Transcript


  1. ENCODE: understanding our genome • Ewan Birney • The ENCODE Project Consortium • Biosapiens Network of Excellence

  2. ENCODE experiments

  3. ENCODE Pilot • Considered too expensive and too risky to decide on winning technologies (started in 2004) • 1% of the genome (30 Mb) chosen; all experiments on the same 1% • Pilot phase ended • Analysis and publication • Scale-up to genome-wide now funded

  4. A lot of ChIP-chip

  5. Nowadays, a lot of ChIP-seq

  6. Transcription

  7. Transcription • Lots of it • And not all of it genes • And even when it is inside a gene, not all of it with open reading frames • And even when it has an open reading frame, not all of it making sense (evolutionarily or structurally)! • Not technical false positives

  8. Protein-coding loci are far more complex than we think • On average 5 transcripts per locus • Many do not encode proteins (as far as we can see) • Even among those that do encode proteins, many of the proteins look “weird”

  9. Implausible structures

  10. Many effects on potential function

  11. Signal peptides, TM helices • 1097 protein transcripts from 487 loci • 219 have signal peptides (107 loci) • 12 loci have an isoform without the signal peptide • 41 transcripts have a gain or loss of a transmembrane helix (sometimes up to 8!)

  12. The Clade B Serpins • Potential missing fragments [figure panels: (a) inactive, "stressed"; (b) active (beta inserted); (c) to (f) unlabelled]

  13. Transcription Start Sites

  14. Technologies on TSS • GenCode manual annotation • Unbiased TxFrag • DiTag data • CAGE data • Histone modifications • DNase I sensitivity • Sequence-specific factors (e.g. Myc)

  15. Integration Strategy • 16,051 unique TSS • Anchor on 5’ ends: GenCode 5’ and CAGE/DiTag • 8,587 TSS “tight clusters” • Categorise and assess using transcript-based evidence: exons, TxFrags, CpG islands • 5 different classes; first 4 have low P-values • Assess categories with histone and TF data • First 4 categories have biological signals: 4,491 TSS
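The "anchor on 5’ ends, then form tight clusters" step can be illustrated with a minimal sketch. This is not the consortium's actual pipeline: the function name `cluster_tss`, the 20 bp `max_gap`, the minimum of two supporting tags, and the coordinates are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

def cluster_tss(positions: List[int], max_gap: int = 20) -> List[Tuple[int, int, int]]:
    """Group 5' end positions into clusters, returned as (start, end, count).

    A position joins the current cluster when it lies within max_gap bases
    of that cluster's last member. max_gap is an assumed value, not one
    taken from the slides.
    """
    clusters: List[Tuple[int, int, int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][1] <= max_gap:
            start, _, count = clusters[-1]
            clusters[-1] = (start, pos, count + 1)
        else:
            clusters.append((pos, pos, 1))
    return clusters

# Hypothetical 5' end coordinates on one strand of one chromosome
ends = [100, 105, 112, 500, 900, 910]

# Keep only clusters supported by at least two independent 5' ends
tight = [c for c in cluster_tss(ends) if c[2] >= 2]
print(tight)
```

Single-linkage grouping on sorted coordinates is the simplest option; a real analysis would also need to handle strand and evidence type separately.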

  16. TSS Categories

  17. GenCode 5’ ends

  18. Unsupported tags

  19. Novel TSSs

  20. Conclusion • There are 4,418 TSS with multiple lines of evidence supporting them • This is ~10 fold more than the number of Genes • Only 38% would be traditionally classified as TSS (less if one took Ensembl or RefSeq)

  21. Implications of many more TSSs • Consistent with considerable diversity of transcripts • Independently integrating ChIP-chip data suggested ~1,000 “Regulatory Clusters” • 25% proximal considering Ensembl/RefSeq • 65% when this TSS catalog is considered
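The effect described above, where the same regulatory clusters look more "proximal" once a larger TSS catalog is used, can be sketched as follows. The 2,500 bp cutoff, the function names, and all coordinates are assumptions for illustration only; the slides do not state how proximity was defined.

```python
from bisect import bisect_left
from typing import List, Sequence

def nearest_tss_distance(site: int, sorted_tss: Sequence[int]) -> int:
    """Distance from site to the closest TSS; sorted_tss must be sorted and non-empty."""
    i = bisect_left(sorted_tss, site)
    # Only the TSS just before and just after the insertion point can be nearest
    neighbours = sorted_tss[max(i - 1, 0): i + 1]
    return min(abs(site - t) for t in neighbours)

def fraction_proximal(sites: List[int], tss: List[int], cutoff: int = 2500) -> float:
    """Fraction of sites within cutoff bases of any TSS (cutoff is an assumed value)."""
    sorted_tss = sorted(tss)
    near = sum(1 for s in sites if nearest_tss_distance(s, sorted_tss) <= cutoff)
    return near / len(sites)

# Hypothetical regulatory cluster positions and two hypothetical TSS catalogs
clusters = [1200, 10000, 49000, 80000]
reference_tss = [1000]               # stand-in for a conservative annotation
expanded_tss = [1000, 9500, 50000]   # stand-in for a larger TSS catalog

print(fraction_proximal(clusters, reference_tss))
print(fraction_proximal(clusters, expanded_tss))
```

The proximal fraction can only stay the same or rise as TSSs are added to the catalog, which is why the choice of annotation matters when calling a site "distal".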

  22. More subtle conclusions • Sequence-specific factors are distributed symmetrically around the TSS • Should we only be taking upstream regions for reporter genes? • Histone information is highly correlated with gene on/off status • Generalising many locus-specific studies

  23. Gene On/Off

  24. Gene status prediction

  25. Distal sites

  26. Finding distal sites • ChIP-chip not “great” • Most look close to one of these new TSSs • Factor bias? • DNaseI Hypersensitive Sites • All factors give a DHS signal • 55% of DHSs are distal to any TSS

  27. Distal DHS

  28. Most surveyed factors are proximal

  29. Replication

  30. H3K27me3 is correlated

  31. Evolutionary conservation and ENCODE

  32. Evolutionary conservation

  33. …but not everything is constrained

  34. Why is there a discrepancy? • False positives in the experiments? But experiments validate at >80% and cross-validate each other • False negatives in the constraint detection? But can detect up to 8bp elements, and within “neutral” zone of alignability • Neutral turnover model

  35. Neutral biochemical events [figure: time axis]

  36. Lineage-specific [figure: time axis]

  37. “Functional” conservation [figure: mouse vs. human]

  38. Special case: Transcription [figure labels: constrained sequence, gene regulatory information; constrained sequence, pre-miRNAs]

  39. What should we learn from ENCODE • “whacky” transcription is real (but god knows what it does) • Unconventional Transcript • Lots more TSSs than we understand • Many “distal” regions are actually close to promoters • Broad specificity marks are more useful • DNaseI sites, Histone marks

  40. Neutral model for biochemical events on the genome • That things happen reproducibly in multiple tissues does not imply selection • (this is not the same as experimental variance) • Could imply “functional” conservation outside of orthologous bases • Comparative genomic sequencing is not enough (but a great starting point!) • Comparative functional investigation

  41. Consortia work • ENCODE • Experimentally led consortium • Needs a lot of computational collaboration • Biosapiens • Computationally led consortium • Needs experimental collaboration (!) • DNA: ENCODE • Protein: Biosapiens

  42. What happens next?

  43. Ensembl Regulatory Build Chr 14, 5677077-567896 elements GM06990 Cells, Myc bound Status

  44. Initial Regulatory Build • DNaseI Hypersensitive sites, 6 histone modifications, CTCF binding • ~110,000 elements, ~2 Mb of DNA • 6,000 “promoter associated” by inherent pattern (DNaseI + H3K36me3) • Available now • This year: mouse, more classification

  45. Regulatory build

  46. Ensembl - at your service • Web browser www.ensembl.org • MySQL DB access • BioMart • “Geek for a week” • You send someone to us for a week • Xose for a day • We send someone to you for a day

  47. The ENCODE Project Consortium The Biosapiens Network of Excellence
