
Monica C. Sleumer ( 苏漠 ) 2012-09-19




  1. Monica C. Sleumer (苏漠) 2012-09-19

  2. Human Genome • 3,101,804,739 base pairs • 22 autosomes plus X and Y • 21,224 protein-coding genes • 15,952 ncRNA genes • 3–8% of bases are under selection (estimated from comparative genomic studies) • Question: What is the genome doing?

  3. Objectives • Find all functional elements • Bound by specific proteins • Transcribed • Histone modifications • DNA methylation • Use this information to annotate functional regions • Genes (coding and non-coding) • Promoters • Enhancers • Specific transcription factor binding sites • Silencers • Insulators • Chromatin states • Cross-reference data from other studies • Comparative genomics • 1000 Genomes Project • Genome-wide association studies (GWAS)

  4. ENCODE projects • ENCODE pilot project: 1% of the genome, 2003–2007 • modENCODE: Drosophila and C. elegans • ENCODE main project, 2007–2012 • 1,649 dataset-generating experiments • 147 cell types • 235 antibodies and assay protocols • 450 authors • 32 institutes • 31 publications on 2012-09-06 • 6 in Nature • 18 in Genome Research • 6 in Genome Biology • 1 in BMC Genetics www.nature.com/encode/category/research-papers

  5. Materials • 147 types of human cell lines, 3 priority levels • Tier 1 cell lines: top priority for all experiments • Tier 2 cell lines to be done after Tier 1 (next slide) • Tier 3: any other cell lines

  6. Tier 2 Cell Lines http://encodeproject.org/ENCODE/cellTypes.html

  7. Methods

  8. Results: RNA Sequencing • 62% of the genome is transcribed into sequences >200 bp long • 5.5% of this is exonic • 31% is intergenic (no annotated gene) • Remaining: intronic • CAGE-seq: 62,403 TSS • 44% within 100 bp of the 5’ end of a GENCODE gene • Others: in exons and 3’ UTRs, significance unknown • Lots of short ncRNAs: tRNA, miRNA, snRNA etc. • Further description: Wu Dingming, 9:30

  9. Results: Transcribed and protein-coding regions • GENCODE reference gene set • 20,687 protein-coding genes • 6.3 alternatively spliced transcripts on average • 3.9 protein isoforms on average • Protein-coding exons: 1.22% of the genome • Still more to come: unidentified peptides in mass-spec • 18,441 ncRNA genes • 8,801 short ncRNA • 9,640 long ncRNA • 11,224 pseudogenes • 863 transcribed
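The counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slide; the variable names are illustrative assumptions, not GENCODE terminology.

```python
# Gene counts as quoted on the slide (GENCODE reference gene set).
short_ncrna = 8801
long_ncrna = 9640
ncrna_total = 18441
pseudogenes = 11224
transcribed_pseudogenes = 863

# The short and long ncRNA counts should add up to the ncRNA total.
assert short_ncrna + long_ncrna == ncrna_total

# Share of pseudogenes that are reported as transcribed.
transcribed_share = transcribed_pseudogenes / pseudogenes
print(f"{transcribed_share:.1%} of pseudogenes are transcribed")
```

The two ncRNA classes sum exactly to the stated total, and only a small minority of the annotated pseudogenes are transcribed.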

  10. ChIP-Seq www.illumina.com/technology/chip_seq_assay.ilmn

  11. ChIP-Seq: Histone modifications

  12. Results: ChIP-Seq • 636,336 binding regions • 8.1% of the genome • Sequence-specific TF ChIP-seq: • 86% of the DNA segments occupied by sequence-specific transcription factors contained a strong DNA-binding motif • In 55% of cases, the segment contained the expected motif • Further description: Qin Zhiyi & Ma Xiaopeng, 13:30

  13. DNase I hypersensitivity • 2,890,000 unique hypersensitive sites (DHSs) • 4,800,000 sites across 25 cell types • Tier 1 and tier 2 cell types: 205,109 DHSs per cell type • 98.5% of ChIP-seq TFBS within DHSs • Further description: Guo Weilong, 12:30; He Chao, 14:30 https://www.nationaldiagnostics.com/electrophoresis/article/dnase-i-footprinting

  14. FAIRE-seq • Like the opposite of ChIP-seq • Cross-link the nucleosomes to the DNA • But not the sequence-specific TFs • Shear the DNA into small pieces • Remove the protein-bound DNA • Sequence the non-bound DNA Gaulton KJ et al, Nature Genetics 42, 255–259 (2010) doi:10.1038/ng.530

  15. DNA methylation • CpG methylation: regulates gene expression • In promoters: gene repression • In genes: gene transcription • 1,200,000 methylated CpGs in 82 cell lines and tissues • 96% differentially methylated, especially those in genes • Unmethylated genic CpG islands associated with binding of P300, an enhancer-related histone acetyltransferase • Allele-specific methylation: genomic imprinting • Aberrant methylation in cancer cell lines • Reproducible methylation outside CpG dinucleotides http://www.diagenode.com/en/applications/bisulfite-conversion.php

  16. Chromosome conformation capture Montavon and Duboule, Trends in Cell Biology (2012) 22:7, 347–354

  17. Results: Chromosome interactions • Chromosome conformation capture (3C): • 5C: 3C-carbon copy • ChIA-PET • Identified 127,417 promoter-centred chromatin interactions using ChIA-PET • 98% intra-chromosomal • 2,324 promoters involved in ‘single-gene’ enhancer–promoter interactions • 19,813 promoters involved in ‘multi-gene’ interaction complexes spanning up to several megabases • 50–60% of long-range interactions occurred in only one of the four cell lines • Further discussion: Li Yanjian, 10:40

  18. Primary Findings • 80.4% of the human genome is doing at least one of the following: • Bound by a transcription factor • Transcribed • Modified histone • 99% is within 1.7 kb of at least one of the biochemical events • 95% within 8 kb of a DNA–protein interaction or DNase I footprint • 7 chromatin states: • 399,124 enhancer-like regions • 70,292 promoter-like regions • Correlation between transcription, chromatin marks, and TF binding • Functional regions contain lots of SNPs • Disease-associated SNPs in non-coding regions tend to be in functional elements

  19. End of Introduction

  20. Summary of ENCODE elements • 80.4% of the human genome is covered by at least one ENCODE-identified element • 62% of the genome is transcribed • 56% of the genome associated with histone modifications • Excluding RNA elements and broad histone elements, 44.2% of the genome is covered • open chromatin (15.2%) • transcription factor binding (8.1%) • 19.4% DHS or transcription factor ChIP-seq peaks across all cell lines • 8.5% of bases are covered by either a transcription-factor-binding-site motif (4.6%) or a DHS footprint (5.7%) • 4.5x the amount of protein-coding exons (1.2%) • 2x the amount of conserved sequence between mammals • Estimate: 50% of DHS remain to be found • Based on saturation curves
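The motif and footprint figures on this slide overlap, which is why 4.6% and 5.7% combine to 8.5% rather than 10.3%. A minimal sketch of that inclusion-exclusion step, using only the percentages quoted on the slide (the variable names are assumptions made for illustration):

```python
# Genome coverage percentages as quoted on the slide.
motif_pct = 4.6       # bases covered by a TF-binding-site motif
footprint_pct = 5.7   # bases covered by a DHS footprint
union_pct = 8.5       # bases covered by either one

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
overlap_pct = motif_pct + footprint_pct - union_pct
print(f"Implied overlap: {overlap_pct:.1f}% of the genome")
```

Under these figures, roughly 1.8% of the genome would be covered by both a motif and a DHS footprint.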

  21. Diversity vs Conservation: Interactive Figure Diversity Conservation A high-resolution map of human evolutionary constraint using 29 mammals Nature 478, 476–482 (2011)

  22. Conservation in Bound Motifs vs Unbound Motifs Diversity Conservation http://www.nature.com/encode/interactive-figures/nature11247_F1

  23. Model of gene expression – histone marks

  24. Model of gene expression – TF binding

  25. Transcription factor co-associations

  26. Seven major classes of genome states

  27. Data integration and genome segmentation Enhancer Transcribed Repressed TSS

  28. Association between genome states and annotations RNA expression Transcription factors Genome segment Genome segment

  29. Enhancer validation in mouse and fish Enhancer from K562 cell (leukemia) drives basal promoter with reporter gene in embryonic mouse blood cells and medaka fish

  30. Genome segment clustering 6 cell types

  31. Genome cluster function Genome state is related to gene function

  32. Allele-specific expression Pol II Txn Rpn

  33. Correlation of allele-specific signal by gene by genomic segment

  34. Genome-wide association studies Annotated disease-causing SNPs Selected TFBS tracks Significant overlap Control SNPs Diseases No genes, but several TFBS near the disease-causing SNPs

  35. Conclusions • 80% of human genome annotated with at least one association • Protein-binding • Histone modification • Transcription • ENCODE data combination • Model gene expression • Genome segmented into 7 types • Different in each cell line • ENCODE data combined with other data • 1000 genomes: see influence of parental DNA • Genome-wide association studies

  36. Discussion • 147 types of cells, and the human body has a few thousand • 80% functional: controversial • 80% of the genome is being transcribed and/or has a protein bound to it some of the time • Heterochromatin: tightly packed repeat sequences • Most of that activity isn’t particularly specific or interesting and may not have impact • Important not to overstate the findings • Ewan Birney: “cumulative occupation of 8% of the genome by TFs” • Reproducibility • In exactly the same cell lines, same conditions, different time or place • Same cell lines, different conditions • Same cell type, different people • Cell lines vs tissue • Cancer vs normal http://blogs.nature.com/news/2012/09/fighting-about-encode-and-junk.html http://blogs.discovermagazine.com/notrocketscience/2012/09/05/encode-the-rough-guide-to-the-human-genome/

  37. Applications • Visible as genome tracks in UCSC • Mutation from • Cancer sequencing • GWAS • Find out what that part of the genome is doing • Compare with your cancer data (RNA-seq) • Comparative genome analysis • Gene or pathway of interest
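Finding out "what that part of the genome is doing" for a given mutation comes down to an interval-overlap lookup against annotated regions. The sketch below is one simple way to do that, not the method used by ENCODE or the UCSC browser: the `Region` type, the `annotations_at` helper, and the example coordinates are all hypothetical, and a linear scan like this would be replaced by an indexed lookup or a dedicated tool for genome-scale data.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A hypothetical annotated genomic interval."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    label: str


def annotations_at(chrom: str, pos: int, regions: list[Region]) -> list[str]:
    """Return the labels of all regions that contain the given position."""
    return [r.label for r in regions
            if r.chrom == chrom and r.start <= pos < r.end]


# Made-up example regions; real ones would come from ENCODE track files.
example_regions = [
    Region("chr1", 1000, 2000, "enhancer-like"),
    Region("chr1", 1500, 1800, "TF binding site"),
    Region("chr2", 1000, 2000, "promoter-like"),
]

# Look up a made-up mutation position on chr1.
hits = annotations_at("chr1", 1600, example_regions)
print(hits)
```

A position that falls in several overlapping regions returns every matching label, which mirrors how a mutation can sit inside both an enhancer-like segment and a specific binding site.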

  38. Online Resources • Interactive graphics in online version of paper • Interactive app on Nature ENCODE main page www.nature.com/encode/
