Genes and Regulatory Elements Zhiping Weng U Mass Medical School
ENCyclopedia Of DNA Elements (The ENCODE Project Consortium, Science 2004, Nature 2007)
[Figure: karyogram of chromosomes 1-22, X and Y marking the ENCODE pilot regions, labeled m001-m014 and r111-r334]
ENCODE Goal: Identify all functional elements in the human genome. Pilot phase: 1% of the genome is being annotated very extensively (30 Mb of sequence). Now genome-wide.
The ENCODE Project Consortium (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project Science, Vol 306, 636-640.
Gene RNA-seq
The human genome: 45% repetitive DNA; 2% genes (25,000); 53% unique and segmentally duplicated DNA. Where are the gene regulatory elements? (G. Crawford)
DNase hypersensitive (HS) sites identify active gene regulatory elements
DNase I HS sites
• Regions hypersensitive to DNase
• Promoters
• Enhancers
• Silencers
• Insulators
• Locus control regions
• Meiotic recombination hotspots
HS sites identify “open” regions of chromatin
Crawford et al., Nature Methods 2006
DNase-chip to identify DNase HS sites (hybridize to arrays, or sequence directly). Crawford et al., Nature Methods 2006
Arrays used for DNase-chip
• NimbleGen arrays
• 385,000 50-mer oligos
• Oligos spaced every 38 bases (12-base overlap)
• Non-repetitive, unique regions
• 1% of the genome (44 ENCODE regions)
Crawford et al., Nature Methods 2006
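The array numbers above fit together by simple arithmetic, sketched here. The probe length, spacing and probe count come from the slide. The helper names are hypothetical, and the span estimate assumes all probes form one contiguous tiling path, which real arrays covering many separate non-repetitive segments do not.

```python
# Tiling-array arithmetic for the DNase-chip design described on the slide.
PROBE_LENGTH = 50      # bases per oligo (from the slide)
PROBE_SPACING = 38     # start-to-start distance between oligos (from the slide)
NUM_PROBES = 385_000   # oligos per array (from the slide)


def probe_overlap(length: int, spacing: int) -> int:
    """Bases shared by two consecutive probes (0 if they do not overlap)."""
    return max(length - spacing, 0)


def tiled_span(num_probes: int, length: int, spacing: int) -> int:
    """Bases covered if all probes formed a single contiguous tiling path.

    This is a simplifying assumption for a rough estimate only.
    """
    if num_probes <= 0:
        return 0
    return (num_probes - 1) * spacing + length


overlap = probe_overlap(PROBE_LENGTH, PROBE_SPACING)
span = tiled_span(NUM_PROBES, PROBE_LENGTH, PROBE_SPACING)
print(overlap)                      # 12
print(f"{span / 1e6:.1f} Mb")       # 14.6 Mb
```

Under this simplification the array tiles roughly 14.6 Mb, about half of the 30 Mb pilot regions, which is in line with restricting probes to non-repetitive sequence.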
DNase-chip Quality Assessment. Xi H., Shulha H.P., Lin J.M., Vales T.R., Fu Y., Bodine D.M., McKay R.D.J, Chenoweth J.G., Tesar P.J., Furey T.S., Ren B., Weng Z.+, Crawford G.E.+ (2007) Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genetics, 8, 8-20. (+Co-corresponding authors)
Unique, common, and ubiquitous DNase HS sites (cell types: GM, CD4, HeLa, H9, K562, IMR90). Ubiquitous HS sites: 20%. Cell-type specific and common HS sites: 80%. Collectively, the DHS cover 8.3% of the ENCODE regions.
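The three site classes can be illustrated with a small sketch. It assumes that a ubiquitous site is present in all six cell types, a cell-type specific (unique) site in exactly one, and a common site in more than one but not all. The function name and the toy sites are hypothetical, and the talk's actual thresholds may differ.

```python
from collections import Counter

# Cell types listed on the slide.
CELL_TYPES = ("GM", "CD4", "HeLa", "H9", "K562", "IMR90")


def classify_site(present_in, cell_types=CELL_TYPES):
    """Label one DNase HS site by the number of cell types it appears in."""
    n = len(set(present_in) & set(cell_types))
    if n == 0:
        raise ValueError("site is not present in any known cell type")
    if n == len(cell_types):
        return "ubiquitous"
    if n == 1:
        return "cell-type specific"
    return "common"  # assumption: 2 to 5 of the 6 cell types


# Hypothetical toy data: site name -> cell types where the site was detected.
toy_sites = {
    "site_a": {"GM"},
    "site_b": {"GM", "CD4", "K562"},
    "site_c": set(CELL_TYPES),
}

labels = {name: classify_site(cells) for name, cells in toy_sites.items()}
counts = Counter(labels.values())
print(labels)
print(counts)
```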
Have we reached saturation in identifying most DNase HS sites?
Ubiquitous DNase HS sites are enriched for promoters (TSS). What about ubiquitous distal DNase HS sites? [Figure: locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)]
Most distal (non-TSS) ubiquitous DNase HS sites are insulators bound by CTCF (ChIP)
Chromatin immunoprecipitation (ChIP) with an antibody against CTCF, read out on a tiling array (ChIP-chip; Kim T.H. et al., Direct Isolation and Identification of Promoters in the Human Genome, Genome Research (2005)) or by direct sequencing (ChIP-seq)
Cell culture insulator assays demonstrate that DNase I HS sites that overlap CTCF display enhancer-blocking activity.
CTCF sites make up a greater % of ubiquitous distal DNase HS sites than enhancers
Antibody against histone modification • Tiling array • Sequencing
Enrichment between tissue-specific H3K4me2 and DNase HS sites
Cell type-specific DNase HS sites correlate with cell type-specific histone modifications. Similarly for H3K4me1, H3K4me3, H3ac and H4ac, for which we have experimental data.
Cell type-specific DNase HS sites correlate with cell type-specific enhancers
Cell type-specific DNase HS sites correlate with cell type-specific gene expression
Transcriptional Motifs Gene transcription is controlled by molecules (transcription factors, or TFs) binding to short DNA sequences (cis-elements, TF motifs) in promoters and distal elements
Finding enriched motifs in tissue-specific DNase HS sites: screen the DHS sequences (DHS #1, DHS #2, DHS #3, DHS #4, DHS #5) against a motif library, e.g., JASPAR or TRANSFAC (STAT, Myc/Max, YY1, etc.), using the Clover algorithm
Clover: Cis-eLement OVERrepresentation. Myc/Max motif scored against the DHS sequences: raw score 17.3
The Clover Algorithm: Clover raw score. Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z (2004). Detection of Functional DNA Motifs Via Statistical Overrepresentation. Nucleic Acids Res. 32:1372-1381.
Symbols:
• Lk: nucleotide at position k
• W: motif width
• S: a promoter sequence
• Ms: number of motif locations in a sequence
• A: all possibilities of choosing a subset of sequences
• N: the total number of promoter sequences
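The raw-score formula itself did not survive in this transcript, so the sketch below is only one reading of the symbol list, not the published definition (see Frith et al. 2004 for that). It assumes a per-location likelihood ratio over the W motif positions, an average over the Ms locations in each sequence, and an average over all subsets A of the N sequences. The function names, the toy motif and the uniform background are hypothetical. Only one strand is scanned, and no logarithm or other transformation is applied.

```python
from itertools import combinations
from math import prod


def location_lr(window, motif, background):
    """Likelihood ratio for one W-long window: product over positions k of
    P(L_k | motif column k) / P(L_k | background)."""
    return prod(col[base] / background[base] for col, base in zip(motif, window))


def sequence_lr(seq, motif, background):
    """Average the location likelihood ratio over all motif locations in seq."""
    w = len(motif)
    windows = [seq[i:i + w] for i in range(len(seq) - w + 1)]
    if not windows:
        raise ValueError("sequence is shorter than the motif")
    return sum(location_lr(x, motif, background) for x in windows) / len(windows)


def subset_average(seq_lrs):
    """Average, over every subset of sequences, the product of their ratios.

    Brute force (2**N subsets), so only suitable for a handful of sequences.
    The empty subset contributes a product of 1.
    """
    n = len(seq_lrs)
    total = sum(prod(sub) for size in range(n + 1)
                for sub in combinations(seq_lrs, size))
    return total / 2 ** n


# Hypothetical two-column motif that prefers "ag", with a uniform background.
toy_motif = [
    {"a": 0.7, "c": 0.1, "g": 0.1, "t": 0.1},
    {"a": 0.1, "c": 0.1, "g": 0.7, "t": 0.1},
]
uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

toy_sequences = ["acagt", "ttttt", "agagc"]
lrs = [sequence_lr(s, toy_motif, uniform) for s in toy_sequences]
score = subset_average(lrs)
print(lrs, score)
```

Under this reading the subset average equals the product of (1 + ratio) / 2 over the sequences, so a practical version would not need to enumerate subsets.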
Clover: Cis-eLement OVERrepresentation. Myc/Max motif: raw score 17.3 for the DHS sequences; raw scores 9.1, 18, 4.2 and 6.6 for control DNA sequences. P-value = 1/4
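The P-value on this slide can be reproduced as an empirical fraction. The sketch assumes that 17.3 is the raw score of the DHS sequences and that 9.1, 18, 4.2 and 6.6 are raw scores of four control sequence sets. Only one control (18) scores at least as high, giving 1/4. The function name is hypothetical.

```python
from fractions import Fraction


def empirical_p_value(observed, control_scores):
    """Fraction of control raw scores that are at least as large as the observed one."""
    if not control_scores:
        raise ValueError("need at least one control score")
    hits = sum(1 for score in control_scores if score >= observed)
    return Fraction(hits, len(control_scores))


# Values read off the slide (assignment to DHS vs. control is an assumption).
dhs_raw_score = 17.3
control_raw_scores = [9.1, 18, 4.2, 6.6]

p_value = empirical_p_value(dhs_raw_score, control_raw_scores)
print(p_value)  # 1/4
```

With only four control sets the smallest nonzero P-value is 1/4, so a real analysis would need many more controls to reach a useful resolution.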
Genome-wide DNase-chip and DNase-sequencing data • CD4 cells • 23 k proximal DNaseI HS sites • 72 k distal DNaseI HS sites
Enriched transcription factor binding motifs in distal DNaseI HS sites
• Hematopoietic system: TAL1, AML, PU.1, C/EBPα
• Immune system: STAT1, STAT3, STAT5, IRF1, IRF3 and IRF5
Identify motif clusters (modules): distal DHS sequences → enriched motifs → find motif clusters in the human genome. [Figure: example distal DHS sequences]
Finding motif clusters with a hidden Markov model (Cluster-Buster). [Figure: motif score versus location in DNA; red = motif type 1 (e.g. TAL1), blue = motif type 2 (e.g. ETS)] MC Frith, MC Li, Z Weng (2003). Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Research, 31(13):3666-8. http://zlab.bu.edu/cluster-buster/
Overlap between predicted motif clusters and distal DNase HS sites. Enrichment of the overlap = (Overlap × Sequence space) / (DHS × Motif clusters). [Figure labels: Predicted motif clusters, Cutoff, DNase HS sites]
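The enrichment ratio on this slide is the observed overlap divided by the overlap expected if the two sets were placed independently, which is DHS × Motif clusters / Sequence space. A minimal sketch follows. The function name, the choice of base pairs as the unit, and the toy numbers are all assumptions for illustration.

```python
def overlap_enrichment(overlap, dhs, motif_clusters, sequence_space):
    """Observed overlap relative to the overlap expected by chance.

    All four arguments must be in the same unit (for example, base pairs).
    """
    if min(dhs, motif_clusters, sequence_space) <= 0:
        raise ValueError("DHS, motif clusters and sequence space must be positive")
    return (overlap * sequence_space) / (dhs * motif_clusters)


# Hypothetical toy numbers, in base pairs.
enrichment = overlap_enrichment(
    overlap=50, dhs=100, motif_clusters=200, sequence_space=1000
)
print(enrichment)  # 2.5
```

A value of 1 means the overlap is what chance alone would give, and values above 1 mean the predicted clusters coincide with DNase HS sites more often than expected.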
Motif clusters can predict distal DNase HS sites genome-wide
Summary
• DNase HS sites identified from 6 cell types
  • Cell-type specific
  • Common
  • Ubiquitous (found in all cell types studied)
• Ubiquitous DNase HS sites are likely to function as…
  • Promoters (TSS)
  • Insulators (CTCF)
  • (no enhancers?)
• Ubiquitous sites indicative of housekeeping chromatin structure
• Cell-type specific DNase HS sites
  • Correlate with histone modifications in a cell type-specific manner
  • Correlate with gene expression in a cell type-specific manner
  • Correlate with enhancer elements in a cell type-specific manner
  • Contain cell type-specific motifs
• Motif clusters can predict DNase HS sites genome-wide
Outline • What is a sequence motif? • Weight matrix representation • Motif search • Motif discovery • Expectation-maximization • Gibbs sampling • Patterns-with-mismatches representation
What is a “Motif”? • Generally, a recurring pattern, e.g. • Sequence motif • Structure motif • Network motif • More specifically, a set of similar substrings, within a family of diverged sequences. • Protein sequence motifs • DNA sequence motifs