
genome.cshlp/content/23/5/777

Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions. http://genome.cshlp.org/content/23/5/777. Extension. http://www.nature.com/nature/journal/v489/n7414/full/nature11232.html.

dori

Presentation Transcript


  1. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions http://genome.cshlp.org/content/23/5/777

  2. Extension http://www.nature.com/nature/journal/v489/n7414/full/nature11232.html

  3. DNase I hypersensitive site • DNase I hypersensitive sites (DHSs) are markers of regulatory DNA • Discovery of all classes of cis-regulatory elements including enhancers, promoters, insulators, silencers and locus control regions.

  4. Motivation • Understand transcriptional regulation • Full account of regulatory elements • Genomic locations • Cell-type specificity • Identity of the factors that bind them • Targeted genes

  5. Previous work • Target genes of regulatory elements (REs) • Chromatin conformation capture (3C) and its derivatives to detect long-range chromatin loops • 3D chromatin information is locus and cell-type specific, and resolution is poor • Heuristics • Assign each element to the nearest gene, bounded by gene boundaries. • Mapping methods • Correlations between expression and other genomic features to enable distal linking • This work • Explore the linking of REs with DNase I and matched gene expression data

  6. Overview • 2.7 million DNase I hypersensitive sites of 72 cell types • Gene expression data • Chromatin and expression signal correlation corresponds with known long-range interactions • Clustering using self-organizing map • 1856 clusters • JASPAR motif database • Classification using a logistic classifier to predict cell-type lineage with 43 DHS inputs • Relations with transcription factors • Motif discovery • Variation in CpG-island, promoter and conserved element overlap

  7. Part 1 DHSs cluster cell types by biological similarity

  8. DHSs cluster cell types by biological similarity • 2.7 million DHSs from 125 samples • 112 samples with DNase-seq and expression data • 72 unique cell types and 15 unique tissue lineages • 1856 unique clusters using a self-organizing map (SOM) on the DHS data • 50x50 grid • Merge similar clusters
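
The clustering step on slide 8 can be illustrated with a very small self-organizing map. This is a minimal sketch in plain Python, not the authors' implementation: the 3x3 grid (instead of 50x50), the learning-rate and neighborhood schedules, and the toy `profiles` (DNase I signal of six hypothetical DHSs across four hypothetical cell types) are all assumptions made for illustration. The cluster-merging step is not shown.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)))

def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a small SOM; returns a dict mapping (row, col) -> weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # Initialise every grid node with a randomly chosen data vector.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighbourhood radius shrinks
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))  # Gaussian neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights

# Hypothetical DHS profiles: rows are DHSs, columns are four cell types.
profiles = [
    [1.0, 0.9, 0.0, 0.1], [0.9, 1.0, 0.1, 0.0], [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.9], [0.1, 0.0, 0.9, 1.0], [0.0, 0.0, 1.0, 1.0],
]
som = train_som(profiles, rows=3, cols=3)
clusters = [best_matching_unit(som, p) for p in profiles]
```

Each DHS is assigned to the grid node returned by `best_matching_unit`, so DHSs with similar cross-cell-type profiles end up on the same or neighboring nodes.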

  9. Cluster color: combination of cell types in which the associated DHSs have high signal in the detailed profile. Square size: number of DHSs assigned

  10. Multi-cell-type clusters • Distant lineage relationships • Reuse of regulatory elements • Transformation related to cancer progression • A limit in the resolution of the SOM • Cluster color: combination of cell types in which the associated DHSs have high signal in the detailed profile. Square size: number of DHSs assigned

  11. Part 2 SOM clusters capture variation in CpG-island, promoter, and conserved element overlap

  12. SOM clusters capture variation in CpG-island, promoter, and conserved element overlap • Annotated each SOM cluster of REs w.r.t. overlap with • Promoters • CpG islands • Evolutionarily conserved elements

  13. Distribution of conservation, promoters, and CpG islands across clusters Top 100 DHSs in that cluster (ranked by nearness to the cluster center)

  14. Distribution of conservation, promoters, and CpG islands across clusters Top 100 DHSs in that cluster (ranked by nearness to the cluster center)

  15. Distribution of distance to the transcription start site (TSS) of the nearest gene • DNase I signal profiles of five example clusters, showing the distribution of distance to the transcription start site (TSS) of the nearest gene. • Cluster 99 is promoter rich. • Cluster 1259 is preferentially located in an early intron. • Cluster 199 is highly conserved, but not associated with promoters or CpG islands. • Cluster 881 is primarily distal, with no regions within 500 bp of a TSS.
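
The distance-to-TSS summaries on slide 15 reduce to a nearest-neighbor lookup per DHS. The sketch below assumes a single chromosome, a pre-sorted list of TSS coordinates, and DHS midpoints as the reference position; the coordinates and the helper names `distance_to_nearest_tss` and `fraction_within` are hypothetical.

```python
import bisect

def distance_to_nearest_tss(dhs_midpoint, sorted_tss):
    """Absolute distance in bp from a DHS midpoint to the closest TSS."""
    i = bisect.bisect_left(sorted_tss, dhs_midpoint)
    candidates = []
    if i < len(sorted_tss):            # nearest TSS at or to the right
        candidates.append(sorted_tss[i] - dhs_midpoint)
    if i > 0:                          # nearest TSS to the left
        candidates.append(dhs_midpoint - sorted_tss[i - 1])
    return min(candidates)             # raises ValueError if sorted_tss is empty

def fraction_within(distances, window=500):
    """Fraction of DHSs lying within `window` bp of a TSS."""
    return sum(d <= window for d in distances) / len(distances)

# Hypothetical coordinates on one chromosome.
tss_positions = [1000, 5000, 20000]
dhs_midpoints = [1200, 12000, 900, 25000]
distances = [distance_to_nearest_tss(m, tss_positions) for m in dhs_midpoints]
promoter_fraction = fraction_within(distances)
```

A promoter-rich cluster such as cluster 99 would give a fraction near 1, while a primarily distal cluster such as cluster 881 would give 0 at the 500 bp window.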

  16. Distribution of the distance from DHSs to TSS varies Top 100 DHSs in that cluster (ranked by nearness to the cluster center)

  17. Part 3 A logistic classifier predicts cell-type lineage with few DHS inputs

  18. A logistic classifier predicts cell-type lineage with few DHS inputs • Some REs are highly specific to certain cell types, so a subset of elements could be used as molecular markers. • Build a multinomial logistic classifier that assigns a probability among multiple classes (tissue lineages) • Each cell type is first assigned to one of the 15 primary tissue types based on biological knowledge • Remove all malignant cell types • Restrict the model to the seven tissue types containing at least four samples each, resulting in a training set of 80 samples across 7 classes.

  19. Feature Selection • Assuming that SOM cluster patterns would be good candidates for differentiating lineages, • Used an initial feature set consisting of 1856 DHSs • One from each cluster: the DHS most similar to the cluster's average profile • Result: the trained classifier assigns the correct tissue lineage with the highest probability (>80% accuracy) in leave-one-out cross-validation. • Only 43 DHSs with high tissue specificity are used as features (a minimal set) to predict tissue identity
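
The classifier and its evaluation on slides 18 and 19 can be sketched as a softmax (multinomial logistic) model scored with leave-one-out cross-validation. The slides do not state the optimizer, regularization, or how the 43 features were selected, so the plain batch gradient descent below is an assumption, as are the toy data (three hypothetical lineages, three hypothetical DHS features) and all function names.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, x):
    """Class probabilities for one sample; the last weight in each row is a bias."""
    xb = x + [1.0]
    return softmax([sum(w * v for w, v in zip(row, xb)) for row in W])

def train_softmax(X, y, n_classes, epochs=300, lr=0.5):
    """Fit multinomial logistic regression by batch gradient descent."""
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = predict_proba(W, x)
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j, v in enumerate(x + [1.0]):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_features + 1):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W

def leave_one_out_accuracy(X, y, n_classes):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(X)):
        W = train_softmax(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_classes)
        probs = predict_proba(W, X[i])
        correct += probs.index(max(probs)) == y[i]
    return correct / len(X)

# Hypothetical DNase I signal at three marker DHSs for nine samples.
X = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1], [0.8, 0.2, 0.1],
     [0.1, 0.9, 0.0], [0.0, 1.0, 0.2], [0.2, 0.8, 0.1],
     [0.0, 0.1, 1.0], [0.1, 0.0, 0.9], [0.1, 0.2, 0.8]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # lineage index per sample
accuracy = leave_one_out_accuracy(X, y, n_classes=3)
```

The predicted lineage is the class with the highest probability, which matches how the slides describe a correct assignment.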

  20. Classification Results • Training data • Presumed origin • Without presumed origin • Sex classifier

  21. Classification Result - Training data • Samples from blood and stem cells were never misclassified.

  22. Classification Result – Unseen data • Classifying the malignant samples as well as the five primary cell types left out of the training model Presumed tissue of origin

  23. Classification Result – Unseen data • Glioblastoma, like astrocytes, originates from glial cells. Cancer progression results in an epithelial-like pattern. Presumed tissue of origin

  24. Classification Result – Unseen data • K562 leukemia cell line is weakly associated with multiple lineages (Pr≤30%) • Possible reasons: K562 resembles undifferentiated red blood cells, while the model was built using white blood cells. Presumed tissue of origin

  25. Part 4 DHS clusters are enriched for known and novel transcription factor motifs

  26. DHS clusters and TF motif discovery • Goal: use the clusters, which group sites with similar activity profiles, to identify commonly bound transcription factors (TFs). • Used de novo motif discovery to identify enriched motifs and then assigned motifs to specific factors based on the JASPAR (Portales-Casamar et al. 2010) motif database. • 1279 (69%) clusters had at least one significant motif • 918 (49%) clusters had a motif that could be assigned a factor from a database • Counted by motif rather than by cluster, 1807 significantly enriched motifs were found (some clusters have multiple motifs), of which 1099 (61%) could be assigned a factor.

  27. Some highly cell-type-specific clusters were enriched for motifs known to be important for those cell types. • Clusters commonly enriched in a specific cell type did not necessarily share similar motifs, indicating that clusters could discern subtle differences in patterns. • Motifs labeled in the figure: TCCAC, CANNTG, ATW • Poorly characterized or unknown TFs not yet present in JASPAR, or a complex of TFs
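
Consensus strings such as CANNTG and ATW on slide 27 use IUPAC ambiguity codes (N is any base, W is A or T). As a small illustration of what such a consensus matches, the sketch below scans a sequence for a consensus on the forward strand only. It is not the de novo discovery procedure from the slides, and the example sequences and function names are hypothetical.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in consensus.upper())

def find_motif(consensus, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead lets overlapping occurrences be reported.
    pattern = "(?=" + consensus_to_regex(consensus) + ")"
    return [m.start() for m in re.finditer(pattern, sequence.upper())]

ebox_hits = find_motif("CANNTG", "GGCACGTGAA")
atw_hits = find_motif("ATW", "CATTATA")
```

A full scan would also check the reverse complement and would score matches against a position weight matrix rather than a hard consensus.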

  28. Part 5 Motif discovery in similar hematopoietic clusters reveals subtle motif differences

  29. Variations in IRF-like motifs in hematopoietic clusters • Detected IRF1/IRF2/SPI1-like motifs predominantly in clusters specific to hematopoietic cell lineages • Variation in DNase I signal intensity among • LCLs • B cell leukemia (CLL) • T cells (CD4, Jurkat, and Th) • megakaryocytes (CMK) • erythroleukemia (K562)

  30. Possible explanation • Slight variations in the motifs accompany differences in DNase I signal across hematopoietic cell types. • Differences between IRF and SPI1 binding • Different cofactors that modulate an IRF's binding preference • Distinct IRFs in specific hematopoietic lineages. • These motif variations likely represent biological differences in motif preference rather than statistical noise, because in other cases (e.g., CTCF) the authors see less variation among discovered motifs across clusters. • The authors also see similar patterns when looking at an independent set of regions from the same clusters.

  31. Part 6 Motif discovery results are consistent with experimental ChIP data

  32. Motif discovery results are consistent with experimental ChIP data • ChIP data from the ENCODE project to validate discovered motifs • Using representative DHSs from each cluster with enriched motifs, we compared overlap with ChIP peaks from 43 experiments. • Incongruence in overlap between motif and ChIP results • ChIP data come from only a subset of cell types included in the motif analysis. • For example, we compared ChIP results for a single IRF from just three cell types, while our motif analysis considered 14 hematopoietic lineages. Without ChIP data for all cell types, we expect to find many instances of a positive motif result without a corresponding ChIP signal. • Additionally, ChIP reports signal at indirectly bound sites where a motif would not.

  33. Probably due to its cross-cell-type consistency • IRFs, SPI1 and RUNX1 are coregulating hematopoietic lineages (Huang et al. 2008). • SP1 is a general, promoter-enriched factor with many interacting partners (Kaczynski et al. 2003). • There is good correspondence (Mann-Whitney P-values between 10^-5 and 10^-133) between motif enrichment and ChIP results.
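
The Mann-Whitney comparison mentioned on slide 33 is a rank-based test between two groups. The slides do not say exactly which quantities were compared, so the sketch below assumes one plausible setup: ChIP overlap fractions for DHSs from clusters with an enriched motif versus clusters without it. The values are invented, and the P-value uses a normal approximation without tie or continuity correction, which is only rough for samples this small.

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test; returns (U for sample a, approximate P-value)."""
    combined = sorted((v, g) for g, vals in enumerate((a, b)) for v in vals)
    rank_sum_a = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0   # tied values share the average of ranks i+1..j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u1 = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail
    return u1, p_value

# Hypothetical ChIP overlap fractions per cluster.
with_motif = [0.62, 0.71, 0.55, 0.80, 0.67]
without_motif = [0.10, 0.22, 0.05, 0.18, 0.30]
u_stat, p_val = mann_whitney_u(with_motif, without_motif)
```

With many more clusters per group, P-values can become extremely small, which is consistent with the range quoted on the slide.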

  34. Part 7 Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor

  35. Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor • Goal: determine whether individual TFs whose motifs are present in several clusters reveal biologically interesting properties about their function. • For each TF, we summarized motif results from all clusters and identified lineage trends. • Figure: the cell-type specificity for selected motifs (labels: motif, biologically relevant tissue)

  36. Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor • To characterize the regulatory elements that bind each factor. • Examined the CpG-content, genomic location, and tissue specificity of clusters where each TF motif was enriched

  37. Part 8 Chromatin and expression signal correlation corresponds with known long-range interactions

  38. Identifying target genes for DHSs • If the pattern of a DNase-seq signal across cell types matches the pattern of expression of a gene across cell types, this provides evidence that the gene is a regulatory target of the DHS. • Correlate DHS signal with gene expression data to infer the target genes (both protein-coding and RNA) for each of the ~2.7M DHSs. • Limitations: number of cell types, higher-order effects

  39. Findings • About 530k (20%) DHSs correlated significantly with at least one gene within 100kb (permutation P-value < 0.05) • A significant enrichment over the 5% expected by chance • 71% correlate with a single gene but some correlate with as many as 44 genes • 31k Ensembl genes (98%) correlated with at least one DHS • Median 19 • Protein-coding genes tended to have more associations than RNA genes
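
The linking procedure on slides 38 and 39 can be sketched as a correlation across cell types followed by a permutation test. The slides do not name the correlation measure or how the 100kb window is measured, so Pearson correlation and a TSS-to-DHS distance are assumptions here, and the signal values, gene names, and coordinates are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def permutation_p_value(dhs_signal, expression, n_perm=1000, seed=0):
    """Fraction of cell-type shuffles whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(dhs_signal, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(dhs_signal, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one keeps the estimate above zero

def candidate_genes(dhs_pos, gene_tss, window=100_000):
    """Genes whose TSS lies within `window` bp of the DHS position."""
    return [g for g, tss in gene_tss.items() if abs(tss - dhs_pos) <= window]

# Hypothetical signal across eight cell types.
dhs_signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
expression = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
r = pearson(dhs_signal, expression)
p = permutation_p_value(dhs_signal, expression)
nearby = candidate_genes(200_000, {"geneA": 150_000, "geneB": 400_000})
```

A DHS would be linked to a gene when the gene is in `nearby` and `p` falls below 0.05, mirroring the threshold quoted on slide 39.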

  40. Correlation between DHS and expression • Tie-plot showing the top 50 connections at the beta-globin locus • Tie-plot for the H19/IGF2 locus • Red marks below indicate DHSs. Blue bars above represent genes. Connecting lines represent significant correlations, where the width of the lines is proportional to the correlation strength. • Connections can be far away and cross multiple gene boundaries

  41. Web Resource • Query, display and extract data • Create a genome browser • http://dnase.genome.duke.edu

  42. Conclusion • 2.7 million DNase I hypersensitive sites of 72 cell types • Gene expression data • Chromatin and expression signal correlation corresponds with known long-range interactions • Clustering using self-organizing map • 1856 clusters • JASPAR motif database • Classification using a logistic classifier to predict cell-type lineage with 43 DHS inputs • Relations with transcription factors • Motif discovery • Variation in CpG-island, promoter and conserved element overlap

  43. Q&A

  44. Contribution • The authors integrated chromatin accessibility and expression data from many human cell types. • They used the ENCODE DNase-seq data and clustered more than 2 million DHSs from 112 diverse biological samples by tissue specificity into 1856 chromatin profiles and found each cluster to have a distinct bias relative to • Location • Evolutionary conservation • CpG islands • Promoter proximity

  45. Contribution • Gene expression profiling + regulatory information • Cell type classification • Assigned 112 samples into tissue groups and developed classifiers to assign tissue type based on DNase I hypersensitivity patterns across the cell-type groups. • Prediction accuracy > 80% in leave-one-out experiments • Similarly applied to the lineage of cancer cell types and to sex-specific DHSs

  46. DNase-seq assays identify > 100,000 active REs but do not reveal the identity of the bound TFs • De novo motif discovery

  47. Chromatin and expression signal correlation corresponds with known long-range interactions • Identifying target genes for DHSs • Cross-cell-type correlation among DHSs to identify blocks of similar regulatory elements and coexpressed genes • Correlating distal DHSs with promoter DHSs.
