
Group Meeting

Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions. Published May 2013 in Genome Research. Presented by CAO Qin, Sep 11, 2014. Group Meeting.


Presentation Transcript


  1. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions. Published May 2013 in Genome Research. Presented by CAO Qin, Sep 11, 2014. Group Meeting.

  2. Outline • Biological Background • Contributions • Procedures and Results • Conclusions

  3. Biological Background

  4. Biological Background: DNase I hypersensitive sites (DHSs) • Regions of chromatin that are sensitive to cleavage by the DNase I enzyme. • In these regions, chromatin has lost its condensed structure, exposing the DNA and making it accessible, which is necessary for the binding of proteins such as transcription factors. Image credit: http://greatcourse.cnu.edu.cn/xbfzswx/wlkc/kcxx/14-04.htm

  5. Biological Background: Genome-wide DNase-seq experiments • Identify DHSs. • Capture a snapshot of regulatory element dynamics across the multidimensional landscape of cell types, environmental exposures, and developmental stages. • The ENCODE project has made substantial progress in defining regulatory elements by generating DNase-seq data from more than 100 human cell types.

  6. Biological Background: Chromatin immunoprecipitation (ChIP) • Uses an antibody to “pull down” target DNA, such as DNA bound by a certain protein. • Can be used to locate the binding site. Image credit: Mardis, Nature Methods 4:613-614 (2007). Slide credit: CSCI5050 Bioinformatics and Computational Biology, Kevin Yip, CSE-CUHK (2013)

  7. Biological Background: ChIP vs DNase-seq

  8. Biological Background: Motifs. Definition: patterns that • appear frequently • may not be exactly the same in different occurrences, but are highly similar • are unlikely to occur “by chance” (in other words, they are “over-represented”) • usually have known or predicted functional roles • are evolutionarily conserved. Example: transcription factor binding sites, which are short DNA regulatory sequences that frequently appear in specific genomic locations. Some of them are conserved across species. Slide credit: BMEG3102 Bioinformatics, Kevin Yip, CSE-CUHK (2014)

  9. Biological Background: Motif representation • Consensus sequence • Regular expression • Position weight matrix • Sequence logo • … In a sequence logo: • the nucleotide with the highest probability is drawn on top • h_i: total height of the nucleotides at the i-th position • p_{i,x}: probability of character x at position i • n: number of sequences • height of nucleotide x = p_{i,x} · h_i. Slide credit: BMEG3102 Bioinformatics, Kevin Yip, CSE-CUHK (2014)
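
The logo heights listed on this slide can be sketched in a few lines. The transcript does not preserve the formula for the total height h_i, so the sketch below assumes the commonly used definition h_i = log2(4) - (H_i + e_n), where H_i is the Shannon entropy of position i and e_n = 3 / (2 ln 2 · n) is a small-sample correction. The function name `logo_heights` and the clamping of negative heights to zero are illustrative choices, not taken from the presentation.

```python
import math


def logo_heights(sequences, alphabet="ACGT"):
    """Return, per position, a dict of letter heights in bits.

    Assumes aligned, equal-length sequences over `alphabet`.
    Dict order runs from the bottom of the stack to the top.
    """
    n = len(sequences)
    length = len(sequences[0])
    # Small-sample correction (assumed form): e_n = (s - 1) / (2 ln2 n)
    e_n = (len(alphabet) - 1) / (2 * math.log(2) * n)
    heights = []
    for i in range(length):
        column = [seq[i] for seq in sequences]
        p = {x: column.count(x) / n for x in alphabet}
        entropy = -sum(px * math.log2(px) for px in p.values() if px > 0)
        # Total height of position i; clamped at 0 for very small samples.
        h_i = max(0.0, math.log2(len(alphabet)) - (entropy + e_n))
        # Sort ascending so the most probable nucleotide ends up on top.
        heights.append({x: p[x] * h_i for x in sorted(alphabet, key=p.get)})
    return heights


sites = ["AAG", "AAG", "AAG", "AAC"]  # toy aligned binding sites
for position, stack in enumerate(logo_heights(sites)):
    print(position, {x: round(h, 3) for x, h in stack.items()})
```

Each letter's height is its probability times the position's total height, so a fully conserved position is drawn as one tall letter and a variable position as a short stack.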

  10. Contributions

  11. Contributions • Explore the similarities and differences among DHSs across multiple cell types. • Uncover distinct associations between DHSs and promoters, CpG islands, conserved elements, and transcription factor motif enrichment. • Predict cell-type lineage based only on DHSs. • Correlate DHSs with their target genes.

  12. Procedures and Results

  13. Procedures and Results: outline • DNase I hypersensitive sites cluster cell types by biological similarity. • SOM clusters capture variation in CpG-island, promoter, and conserved element overlap. • A logistic classifier predicts cell-type lineage with few DHS inputs. • DHS clusters are enriched for known and novel transcription factor motifs. • Motif discovery in similar hematopoietic clusters reveals subtle motif differences. • Motif discovery results are consistent with experimental ChIP data. • Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor. • Chromatin and expression signal correlation corresponds with known long-range interactions.

  14. Procedures and Results: DNase I hypersensitive sites cluster cell types by biological similarity. Cluster DHSs and cluster cell types • Genomic locations of 2.7 million DHSs from 125 samples. • Select a subset of 112 samples (those with both DNase-seq and expression data), representing 72 unique cell types and 15 unique tissue lineages. • Use a self-organizing map (SOM) to cluster DHSs by their profile of hypersensitivity across cell types. • Merge highly similar clusters (by hierarchical clustering) to obtain 1856 clusters of DHSs.
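
As a rough illustration of the SOM step, here is a minimal self-organizing map in plain Python. It is a toy sketch, not the pipeline used in the paper: the grid size, learning-rate schedule, neighborhood function, and the names `train_som` and `best_matching_unit` are all assumptions made for this example, and the later merging of similar nodes is not shown.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid node whose prototype profile is closest (squared Euclidean) to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(profiles, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM on DHS signal profiles (one value per sample)."""
    rng = random.Random(seed)
    dim = len(profiles[0])
    # One prototype profile per grid node, initialized from random data points.
    weights = {
        (r, c): list(rng.choice(profiles))
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    total = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        order = profiles[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighborhood
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-d2 / (2 * sigma * sigma))
                for k in range(dim):
                    w[k] += lr * influence * (x[k] - w[k])
            step += 1
    return weights


# Toy data: DHSs open only in sample 1 versus DHSs open only in sample 3.
toy = [[1.0, 0.0, 0.0]] * 10 + [[0.0, 0.0, 1.0]] * 10
som = train_som(toy, rows=2, cols=2)
print(best_matching_unit(som, [1.0, 0.0, 0.0]))
print(best_matching_unit(som, [0.0, 0.0, 1.0]))
```

Each DHS is assigned to the node returned by `best_matching_unit`, so DHSs with similar hypersensitivity profiles across samples end up in the same or neighboring nodes.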

  15. Procedures and Results: DNase I hypersensitive sites cluster cell types by biological similarity. Cluster DHSs and cluster cell types. Figure 1. SOM clustering of DHS profiles. (A) A 50×50 SOM. Each box represents a cluster of DHSs with similar DNase-seq signal profiles across samples, color-coded by tissue. Cluster color corresponds to the combination of cell types in which the associated DHSs have high signal in the detailed profile. Square size indicates the number of DHSs assigned. (B) Average DHS profiles across samples for four individual clusters. An overall cluster profile was defined by calculating the average hypersensitivity of DHSs across different cell types.

  16. Procedures and Results: DNase I hypersensitive sites cluster cell types by biological similarity. Cluster DHSs and cluster cell types. Figure 1. SOM clustering of DHS profiles. (B) Average DHS profiles across samples for four individual clusters. • “Cell-type group”: the group of cell types with increased signal in the averaged DNase I signal profile. • Clusters containing DHSs with high signal in highly related cell types (54 and 25): clusters generally involved cell types with known relationships. • Clusters containing DHSs with high signal in less related cell types (1091 and 1295), without obvious biological similarity, may indicate: 1. distant lineage relationships; 2. reuse of regulatory elements; 3. transformation related to cancer progression; 4. a limit in the resolution of the SOM.

  18. Procedures and Results: SOM clusters capture variation in CpG-island, promoter, and conserved element overlap. • Annotate each SOM cluster of regulatory elements with respect to overlap with promoters, CpG islands, and evolutionarily conserved elements. • Find clear associations between cluster assignment and all three features.

  19. Procedures and Results: SOM clusters capture variation in CpG-island, promoter, and conserved element overlap. Figure 2. Distribution of conservation, promoters, and CpG islands across clusters. Promoters were defined as 2 kb upstream of the TSS for the UCSC RefGene annotation. • Clusters enriched for promoters, CpG islands, and conserved elements have a stronger DNase I signal across many cell types.
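
The promoter definition above (2 kb upstream of the TSS) and the per-cluster promoter overlap can be sketched as follows. This is a minimal illustration under stated assumptions: coordinates are 0-based and half-open, all intervals lie on one chromosome, and the helper names `promoter_interval`, `overlaps`, and `promoter_fraction` are hypothetical.

```python
def promoter_interval(tss, strand, upstream=2000):
    """Half-open [start, end) window covering `upstream` bp 5' of the TSS.

    On the minus strand, "upstream" means higher coordinates.
    """
    if strand == "+":
        return max(0, tss - upstream), tss
    return tss + 1, tss + 1 + upstream


def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def promoter_fraction(dhs_intervals, promoters):
    """Fraction of a cluster's DHSs that overlap any promoter window."""
    hits = sum(any(overlaps(d, p) for p in promoters) for d in dhs_intervals)
    return hits / len(dhs_intervals)


promoters = [promoter_interval(5000, "+"), promoter_interval(20000, "-")]
cluster_dhss = [(2900, 3100), (6000, 6100), (21000, 21150)]
print(promoter_fraction(cluster_dhss, promoters))
```

The same overlap test applies to CpG islands and conserved elements by swapping in those interval lists. For genome-scale data, a sorted-interval or interval-tree lookup would replace the quadratic scan used here.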

  20. Procedures and Results: SOM clusters capture variation in CpG-island, promoter, and conserved element overlap. Figure 2. Distribution of conservation, promoters, and CpG islands across clusters. Promoters were defined as 2 kb upstream of the TSS for the UCSC RefGene annotation. • Among clusters with similar promoter overlap, the distribution of the distance from DHSs to the TSS varies: in some clusters the sites are further from the TSS; in others the sites are more commonly found just downstream from the TSS (near 5' introns).

  21. Procedures and Results: SOM clusters capture variation in CpG-island, promoter, and conserved element overlap. Figure 2. Distribution of conservation, promoters, and CpG islands across clusters. Promoters were defined as 2 kb upstream of the TSS for the UCSC RefGene annotation. • In general: high conservation and promoter. • Outlier: cluster 199, high conservation and non-promoter (contains ubiquitous distal DHSs that are enriched for CTCF motifs). • In general, this finding suggests that DHSs with similar patterns across cell types are likely to share relationships with sequence conservation and genomic location.

  23. Procedures and Results: A logistic classifier predicts cell-type lineage with few DHS inputs. Goal: use DHSs to predict cell-type lineages. • Each cell type was first assigned to one of 15 primary tissue types based on known biology. • All malignant cell types were removed and the model was restricted to the 7 tissue types containing at least 4 samples each (training set: 80 samples across 7 classes). • Initial feature set: 1856 DHSs, one from each cluster (the DHS most similar to the average SOM cluster profile). • Model: a multinomial logistic classifier assigned the highest probability to the correct tissue lineage with >80% accuracy in leave-one-out cross-validation.
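
The training and evaluation loop described on this slide can be sketched with a small softmax (multinomial logistic) classifier and leave-one-out cross-validation. This is a toy version under stated assumptions: it uses plain stochastic gradient descent with no sparsity penalty, so it does not reproduce the feature selection reported in the talk, and the function names, learning rate, epoch count, and toy data are all illustrative.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W, x):
    """Class probabilities; each row of W holds feature weights plus a bias."""
    return softmax([sum(w * xj for w, xj in zip(row, x)) + row[-1] for row in W])


def fit(X, y, n_classes, epochs=300, lr=0.5):
    """Fit softmax regression by stochastic gradient descent."""
    W = [[0.0] * (len(X[0]) + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = predict_proba(W, x)
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == label else 0.0)
                for j, xj in enumerate(x):
                    W[c][j] -= lr * err * xj
                W[c][-1] -= lr * err  # bias term
    return W


def leave_one_out_accuracy(X, y, n_classes):
    """Hold out each sample once, train on the rest, score the held-out call."""
    correct = 0
    for i in range(len(X)):
        W = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_classes)
        probs = predict_proba(W, X[i])
        correct += probs.index(max(probs)) == y[i]
    return correct / len(X)


# Toy data: 3 lineages, 3 DHS features, 3 samples per lineage.
X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.9, 0.0, 0.1],
     [0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.9, 0.1],
     [0.0, 0.0, 1.0], [0.1, 0.0, 0.9], [0.0, 0.1, 0.9]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(leave_one_out_accuracy(X, y, n_classes=3))
```

With 80 samples, leave-one-out means 80 separate fits, each tested on the single sample it did not see, which is why it suits small training sets like this one.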

  24. Procedures and Results: A logistic classifier predicts cell-type lineage with few DHS inputs. Goal: use DHSs to predict cell-type lineages. • The final model trained using all samples chose only 43 DHSs as informative features (43 DHSs with positive coefficients in the fitted classifier model). • The classifier trained on all 80 samples misclassified only two (2.5%) of them: aortic smooth muscle (AoSMC_SFM) and cardiac myocytes (HCM). • In these two cases, the model assigned 30% probability to the correct lineage (muscle), but a higher (albeit still weak) probability to the fibroblast class. Possible reasons: 1. This may reflect the biological similarity between fibroblast and muscle lineages. (Previous studies suggest that it is possible to convert fibroblasts into muscle cells in vitro.) 2. Regulatory element differences among the included smooth, cardiac, and skeletal muscle samples complicate the muscle lineage and may not be captured by the 43 DHSs used by the model.

  25. Procedures and Results: A logistic classifier predicts cell-type lineage with few DHS inputs. Goal: use DHSs to predict cell-type lineages. • Investigate the remaining data (test data): 32 samples (27 malignant samples and 5 primary cell types). • 18 were presumed to associate with none of the training lineages. • 14 were presumed to associate with 1 of the training lineages: 9 were correctly predicted and 5 were wrongly predicted (1 K562 leukemia cell line and 4 brain tumors). • Of the 4 brain tumors, 3 were specific brain-cell sub-lineages consisting solely of astrocytes, which were NOT present in the training model (model: generally not strongly assigned to any lineage; average maximum probability 34%). Some other brain samples consist of both neuronal and glial cells. • The remaining 1 was a brain cancer, glioblastoma (model: confidently (86%) classified as epithelial). Glioblastoma, like astrocytes, originates from glial cells. Possible reasons: • This misclassification may indicate differences between astrocytes and other glial cell types. • A substantial remodeling of glial cell chromatin structure may occur during cancer progression and result in an epithelial-like pattern. • Previous studies suggest there are glioblastoma cases with epithelial differentiation.

  26. Procedures and Results: A logistic classifier predicts cell-type lineage with few DHS inputs. Goal: use DHSs to predict cell-type lineages. • Investigate the remaining data (test data): 32 samples (27 malignant samples and 5 primary cell types). • 18 were presumed to associate with none of the training lineages. • 14 were presumed to associate with 1 of the training lineages: 9 were correctly predicted and 5 were wrongly predicted (1 K562 leukemia cell line and 4 brain tumors). • The K562 leukemia cell line was presumed to associate with the hematopoietic lineage (model: weakly classified as multiple lineages, none with probability >30%). Possible reason: K562 is similar to undifferentiated erythrocytes (red blood cells), while the hematopoietic lines used to build the model are leukocytes (white blood cells). • Of the 4 brain tumors, 3 were specific brain-cell sub-lineages consisting solely of astrocytes, which were NOT present in the training model (model: generally not strongly assigned to any lineage; average maximum probability 34%). Some other brain samples consist of both neuronal and glial cells. • The remaining 1 was a brain cancer, glioblastoma (model: confidently (86%) classified as epithelial). Glioblastoma, like astrocytes, originates from glial cells.

  28. Procedures and Results: DHS clusters are enriched for known and novel transcription factor motifs. One motivation for clustering DHSs was to find groups of sites with similar activity profiles, which may indicate commonly bound transcription factors (TFs). Goal: analyze the clusters for enrichment of TF motifs. • Used de novo motif discovery to identify enriched motifs. • Assigned motifs to specific factors based on the JASPAR motif database. • 1279 (69%) clusters had at least 1 significant motif. • 918 (49%) clusters had a motif that could be assigned a factor from a database. • 1807 significantly enriched motifs were found (some clusters have multiple motifs), of which 1099 (61%) could be assigned a factor.

  29. Procedures and Results: DHS clusters are enriched for known and novel transcription factor motifs. • Highly cell-type-specific clusters were enriched for motifs known to be important for those cell types. Figure 4. De novo motif discovery results. Figure annotations: pluripotency factor, stem-cell-specific; hematopoietic-specific; important in hematopoietic lineages and leukemia.

  30. Procedures and Results: DHS clusters are enriched for known and novel transcription factor motifs. Figure 4. De novo motif discovery results. • In 39% of the cases, de novo motifs did not convincingly match known motifs in JASPAR, representing possible new or poorly characterized regulators. • Other clusters had similarly high de novo P-values without known motif matches, which may reflect poorly characterized or unknown TFs not yet present in JASPAR, or a complex of TFs.

  31. Procedures and Results: Motif discovery in similar hematopoietic clusters reveals subtle motif differences. Question: if some clusters are similar (e.g., they are all hematopoietic clusters), do they have the same motifs? Approach: • Explore hematopoietic clusters. • Pick out several hematopoietic clusters with variation in DNase I signal intensity among 8 samples. • These hematopoietic clusters are enriched for certain motifs. • Compare these motifs (are they exactly the same or not?).

  32. Procedures and Results: Motif discovery in similar hematopoietic clusters reveals subtle motif differences. Hematopoietic-cluster enriched motifs (biological background). Interferon regulatory factors (IRFs): 1. DNA-binding proteins. 2. Regulate the entire immune response. 3. The DNA-binding domain is highly conserved (consensus 5'-AANNGAAA-3'). 4. 9 different IRFs bind slight variations of the core sequence. IRFs can bind to DNA “alone”, or “together” with SPI1 (a hematopoietic factor), forming a longer TFBS. Motifs for IRFs and SPI1 from JASPAR show both common and distinct features.
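
To make the IRF consensus concrete, the sketch below scans a DNA string for 5'-AANNGAAA-3' on both strands, treating N as any nucleotide. It is a plain consensus match rather than the de novo motif discovery used in the study, and the function names and example sequences are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def consensus_to_regex(consensus):
    """Turn a consensus such as AANNGAAA into a regex (N = any base)."""
    return "".join("[ACGT]" if ch == "N" else ch for ch in consensus)


def find_sites(sequence, consensus="AANNGAAA"):
    """Return sorted (start, strand) pairs for matches on either strand.

    Start positions are 0-based and refer to the forward strand.
    """
    # The lookahead lets finditer report overlapping matches as well.
    pattern = re.compile("(?=(%s))" % consensus_to_regex(consensus))
    hits = [(m.start(), "+") for m in pattern.finditer(sequence)]
    revcomp = sequence.translate(COMPLEMENT)[::-1]
    n, k = len(sequence), len(consensus)
    hits += [(n - m.start() - k, "-") for m in pattern.finditer(revcomp)]
    return sorted(hits)


print(find_sites("GGAACTGAAACC"))  # forward-strand site
print(find_sites("TTTCAGTT"))      # site on the reverse strand
```

A consensus match is all-or-nothing, which is one reason position weight matrices are preferred when different IRFs bind slight variations of the core sequence.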

  33. Procedures and Results: Motif discovery in similar hematopoietic clusters reveals subtle motif differences. Figure 5. Variations in IRF-like motifs in hematopoietic clusters. Figure annotations: more likely to be IRF binding “alone”; more likely to be IRF binding “together” with SPI1. Conclusions: 1. The variation may be due to differences between IRF and SPI1 binding, different cofactors that modulate an IRF's binding preference, or distinct IRFs in specific hematopoietic lineages. 2. Clusters commonly enriched in a specific cell type did not necessarily share similar motifs, indicating that the clusters could discern subtle differences in patterns.

  34. Procedures and Results: Motif discovery results are consistent with experimental ChIP data. Use ChIP data from the ENCODE project to validate the discovered motifs. • Use representative DHSs from each cluster with enriched motifs (the top 100 DHSs nearest to the cluster center). • Compare these DHSs with ChIP peaks from 43 experiments. There may be some limitations: • ChIP data come from only a subset of the cell types included in the motif analysis, so many instances of a positive motif result may not have a corresponding ChIP signal. • ChIP also reports signal at indirectly bound sites where a motif would not (some TFs may associate with DNA indirectly through protein partners).
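
The validation step above (pick the DHSs nearest the cluster center, then check them against ChIP peaks) can be sketched as below. This is an illustration under stated assumptions: Euclidean distance to the mean profile stands in for "nearest to the cluster center", intervals are half-open and on one chromosome, and the helper names are hypothetical.

```python
import math


def cluster_center(profiles):
    """Mean signal profile of a cluster (average over DHSs, per sample)."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def representatives(profiles, k=100):
    """Indices of the k DHS profiles closest to the cluster center."""
    center = cluster_center(profiles)
    ranked = sorted(
        range(len(profiles)),
        key=lambda i: math.dist(profiles[i], center),
    )
    return ranked[:k]


def fraction_in_peaks(dhs_intervals, peaks):
    """Share of DHS intervals overlapping at least one ChIP peak."""
    hits = sum(
        any(start < p_end and p_start < end for p_start, p_end in peaks)
        for start, end in dhs_intervals
    )
    return hits / len(dhs_intervals)


profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.1]]
print(representatives(profiles, k=2))
print(fraction_in_peaks([(100, 200), (500, 600)], [(150, 160)]))
```

Restricting the comparison to the most central DHSs keeps the test focused on sites that best reflect the cluster's profile rather than on borderline members.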

  35. Procedures and Results: Motif discovery results are consistent with experimental ChIP data. Figure 6. Motif specificity in SOM clusters. Figure annotations: GM12878, Gliobla (2 cell types); many more cell types. Despite these limitations: • There is good correspondence (p-value) between motif enrichment and ChIP results. • The correspondence is particularly high for CTCF (due to cross-cell-type consistency).

  36. Procedures and Results: Motif discovery results are consistent with experimental ChIP data. Figure 6. Motif specificity in SOM clusters. Despite these limitations: • There is high overlap among the IRF, SPI1, and RUNX1 ChIP and motif results, consistent with all three factors co-regulating hematopoietic lineages.

  37. Procedures and Results: Motif discovery results are consistent with experimental ChIP data. Figure 6. Motif specificity in SOM clusters. Despite these limitations: • The SP1-motif clusters overlap not only SP1 ChIP peaks, but also ChIP peaks for most of the other factors, consistent with the role of SP1 as a general, promoter-enriched factor with many interacting partners.

  38. Procedures and Results: Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor. Figure 6. Motif specificity in SOM clusters. Goal: explore whether individual TFs whose motifs are present in several clusters reveal biologically interesting properties about their function. • For each TF, the authors summarized motif results from all clusters and identified lineage trends. • TFs with roles in certain cell types were most often enriched in clusters of relevant tissue lineages. Examples: SPI1 is enriched in hematopoietic clusters; the myogenic factor (MYF) is enriched in muscle-specific clusters.

  39. Procedures and Results: Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor. Figure 6. Motif specificity in SOM clusters. • In contrast, the ubiquitously expressed transcription factors SP1, AP-1, and CTCF did not have a bias toward a single lineage.

  40. Procedures and Results: Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor. Figure 6. Motif specificity in SOM clusters. Each colored square represents a cluster with enrichment for the given motif. • Examined the CpG content, genomic location, and tissue specificity of the ubiquitous factors (SP1, AP-1, and CTCF). • SP1: enriched in clusters with CpG-island promoters that are present in many cell types. This may partly reflect the GC-rich SP1 motif.

  41. Procedures and Results: Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor. Figure 6. Motif specificity in SOM clusters. Ubiquitous factors: SP1, AP-1, and CTCF. AP-1 (activating protein 1): • Most commonly enriched motif, found in ~12% (220) of the clusters. • It has been implicated in a variety of cellular functions, including cell proliferation, immunity, apoptosis, and differentiation. • Its subunits (FOS and JUN) are ubiquitously expressed. • The motif is enriched exclusively in non-promoter, non-CpG-island clusters. • It is found in both tissue-specific and ubiquitous clusters; tissue specificity is probably conferred by other factors, so AP-1 may “work” together with other tissue-specific factors. Hypothesis: AP-1 may be a pioneer factor that opens DNA for other factors, or it may be an otherwise general and universal chromatin-accessibility factor. This is supported by several recent studies.

  43. Procedures and Results: Chromatin and expression signal correlation corresponds with known long-range interactions. Goal: identify the target genes for DHSs. Method: Pearson correlation. When the pattern of a DNase-seq signal across cell types matches the pattern of expression of a gene across cell types, this provides evidence that the gene is a regulatory target of the DHS. • About 530,000 of the 2.7 million sites (20%) correlated significantly with at least one gene within 100 kb. • 31,000 Ensembl genes (98%) correlated with at least one DHS, and the median number of DHSs associated with a gene was 19. • Protein-coding genes tended to have more associations than RNA genes.
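
The correlation step can be sketched as follows. The slide reports links that "correlated significantly", but the significance test is not given in the transcript, so this sketch substitutes a plain cutoff on Pearson's r (`min_r`, an assumed parameter) and measures distance to a single TSS coordinate per gene. The data layout and all names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length signal vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / (sx * sy)


def candidate_links(dhs_sites, genes, max_distance=100_000, min_r=0.7):
    """Pair each DHS with nearby genes whose expression tracks its signal.

    dhs_sites: {dhs_id: (chrom, position, signal_per_cell_type)}
    genes:     {gene_id: (chrom, tss, expression_per_cell_type)}
    """
    links = []
    for dhs_id, (chrom, pos, signal) in dhs_sites.items():
        for gene_id, (g_chrom, tss, expression) in genes.items():
            if chrom == g_chrom and abs(pos - tss) <= max_distance:
                r = pearson(signal, expression)
                if r >= min_r:
                    links.append((dhs_id, gene_id, r))
    return links


dhs_sites = {"dhs1": ("chr11", 2_000_000, [5.0, 1.0, 0.5, 4.0])}
genes = {
    "geneA": ("chr11", 2_050_000, [50.0, 12.0, 4.0, 41.0]),  # nearby
    "geneB": ("chr11", 5_000_000, [50.0, 12.0, 4.0, 41.0]),  # too far
}
print(candidate_links(dhs_sites, genes))
```

Both vectors must list the same cell types in the same order, since the correlation is taken across cell types rather than along the genome.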

  44. Procedures and Results: Chromatin and expression signal correlation corresponds with known long-range interactions. Examples • H19/IGF2 ICR (imprinting control region), maternal vs paternal. • Strong correlations were detected between the IGF2 gene and several DHSs located in the H19 enhancer region. Figure 7. Correlation between DHS and expression. Red marks below indicate DHSs. Blue bars above represent genes. Connecting lines represent significant correlations, where the width of the lines is proportional to the correlation strength. Image credit: How cohesin and CTCF cooperate in regulating gene expression. Chromosome Research (2009) 17:201–214

  45. Procedures and Results: Web resource: http://dnase.genome.duke.edu/

  46. Conclusions

  47. Conclusions • The global clustering of DHSs revealed novel open-chromatin pattern relationships among a diverse set of human cell types. • Many clusters grouped cell types of common lineage, enabling accurate lineage classifications based on only a few DHSs. • Characterizing hypersensitivity across cell types can yield convincing de novo motif discovery results, including identifying novel regulators and new roles for known regulators. This approach provides an unbiased (no a priori knowledge or antibodies required) complement to ChIP. • Motif analysis also supports a role for AP-1 in regulating open chromatin.

  48. Conclusions • The motif results highlighted the uniqueness and prominence of CTCF, which is non-promoter and highly conserved. • DNase I/expression correlation is a powerful additional source of information to inform models of transcriptional regulation.

  49. Q & A THANK YOU!
