Computational Analysis of Tissue Specificity: Decoding Promoters

Chris Stoeckert, Ph.D., Center for Bioinformatics & Dept. of Genetics, University of Pennsylvania. Nov. 17, 2004. Department of Physiology Seminar Series, University of Kentucky.

Presentation Transcript


  1. Computational Analysis of Tissue Specificity: Decoding Promoters. Chris Stoeckert, Ph.D., Center for Bioinformatics & Dept. of Genetics, University of Pennsylvania. Nov. 17, 2004. Department of Physiology Seminar Series, University of Kentucky.

  2. What is the code for determining where (and when) a gene is expressed? TFBS = transcription factor binding site. Diagram labels: TFBS1, TFBS2, TFBS3, TFBS4, Expression. Source link: http://molbio.info.nih.gov/molbio/gcode.html

  3. Goal is to Identify Combinations of TFBS (cis-Regulatory Modules or CRMs) that Specify Tissue Expression. (From Wasserman & Sandelin, NRG 2004.)

  4. A Genomics Unified Schema approach to understanding gene expression. Dave Barkan, Jonathan Crabtree, Shailesh Date, Steve Fischer, Bindu Gajria, Thomas Gan, Greg Grant, Hongxian He, John Iodice, Li Li, Junmin Liu, Matt Mailman, Elisabetta Manduchi, Joan Mazzarelli, Debbie Pinney, Angel Pizarro, Mike Saffitz, Jonathan Schug, Chris Stoeckert, Trish Whetzel. Computational Biology and Informatics Laboratory (CBIL), Penn Center for Bioinformatics.

  5. GUS: Plasmodium Genome Resource; Stem Cell Gene Anatomy Project; Beta Cell Biology Consortium; Allgenes (human and mouse DoTS).

  6. GUS is an open source project. Diagram labels: Java Servlets; Oracle RDBMS; Object Layer for Data Loading; DoTS, RAD, TESS, SRES, Core. Sites and projects: U. Penn; Sanger Institute; U. Georgia; U. Toronto; U. Chicago; Flora Centromere Database; Phytophthora sojae genome; Virginia Bioinformatics Institute.

  7. GUS (Genomics Unified Schema), http://www.gusdb.org
     Namespace | Domain | Features
     DoTS | Sequence and annotation | EST clusters and gene models
     RAD | Gene Expression | MIAME/MAGE-OM
     TESS | Gene Regulation | TFBS organization
     SRES | Shared Resources | Ontologies
     Core | Data Provenance | Documentation

  8. Diagram labels: Identify shared TF binding sites; Genomic alignment and comparative sequence analysis; SRES; BioMaterial annotation; RAD; EST clustering and assembly; DoTS; TESS.

  9. DoTS integrates sequence annotation including where expressed

  10. DoTS integrates sequence annotation including where expressed. Sorbs1: sorbin and SH3 domain containing 1. GO molecular function: actin binding and protein kinase binding. GO cellular component: actin cytoskeletal stress fibers. Tissue lists from the figure: (1) kidney, mammary gland, brain, liver, colon, lung, retina, spinal cord, rhabdomyosarcoma cell line; (2) brain, liver, kidney, lung, melanocyte; (3) embryo, fetus, kidney, limb, retina, salivary gland; (4) brain, rhabdomyosarcoma cell line, kidney.

  11. RAD Contains Detailed Expression Experiments Including Tissue Surveys

  12. TESS Allows You to Find Potential TFBS. But there are too many potential sites!

  13. Promoter Features Related to Tissue-Specificity as Measured by Shannon Entropy. Jonathan Schug (1), Winfried-Paul Schuller (2), Claudia Kappen (2), J. Michael Salbaum (2), Maja Bucan (3), Christian J. Stoeckert Jr. (1). (1) Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA; (2) Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska, 68198, USA; (3) Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA.

  14. What is a Liver-Specific Gene? http://expression.gnf.org/

  15. Assessing Tissue Specificity of Genes Using Shannon Entropy. Shannon entropy H is a measure of the uniformity of a discrete probability distribution. Given a set of T tissues, H ranges from 0 for a gene expressed in a single tissue to lg T for a gene expressed uniformly in all T tissues. It works well as a measure of overall tissue specificity. To measure specificity to a particular tissue, we combine the entropy H and the relative expression level in that tissue to get Q. Q = 0 for a tissue when the gene is expressed only in that tissue, and Q = 2 lg T for each tissue under uniform expression. Examples: (a) very specific liver expression: H = 1.6 and Qliver = 2.2, 98612_at cytochrome p450; (b) near-uniform expression: H = 4.3 and Qliver = 10.2, 104391_s_at Clcn7 chloride channel 7.
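
The two scores on this slide can be sketched in a few lines. The slide does not spell out the formula for Q, so the form used here, Q_t = H - lg(p_t) with p_t the gene's relative expression in tissue t, is an assumption; it was chosen because it reproduces the stated endpoints (Q = 0 for single-tissue expression, Q = 2 lg T for uniform expression). Function names and the toy expression levels are illustrative only.

```python
import math

def relative_expression(levels):
    """Normalize non-negative expression levels into a distribution over tissues."""
    total = sum(levels)
    if total <= 0:
        raise ValueError("expression levels must sum to a positive value")
    return [x / total for x in levels]

def shannon_entropy(p):
    """H = -sum_t p_t * lg(p_t), treating 0 * lg(0) as 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def q_score(p, tissue_index):
    """Assumed form Q_t = H - lg(p_t); infinite when the gene is absent from tissue t."""
    if p[tissue_index] == 0:
        return math.inf
    return shannon_entropy(p) - math.log2(p[tissue_index])

# Toy data (hypothetical): T = 4 tissues
single = relative_expression([0, 0, 5, 0])    # expressed in one tissue only
uniform = relative_expression([2, 2, 2, 2])   # same level in every tissue

h_single, q_single = shannon_entropy(single), q_score(single, 2)
h_uniform, q_uniform = shannon_entropy(uniform), q_score(uniform, 0)
```

Low Q therefore means both a peaked profile (low H) and a large share of expression in the tissue of interest, which is why genes are later ranked by increasing Q.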

  16. Agreement between Microarrays and ESTs on Tissue Specificity

  17. Specificity Characteristics of Tissues

  18. CpG Islands are Associated with the Start Sites of Genes with Wide-Spread Expression. CpG island = minimum 200 bp, C+G > 0.6, obs./expect. >= 0.5.
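
A minimal sketch of the CpG island test, using the thresholds quoted on the slide. The slide does not say how the observed/expected ratio is computed; the common definition (number of CpG dinucleotides divided by #C x #G / length) is assumed here, and the function names are hypothetical. It checks a single window rather than scanning a genome.

```python
def cpg_stats(seq):
    """Return (length, C+G fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")  # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently (assumed definition)
    expected = (c * g) / n if n else 0.0
    obs_exp = cpg / expected if expected > 0 else 0.0
    return n, gc_fraction, obs_exp

def is_cpg_island(seq, min_len=200, min_gc=0.6, min_obs_exp=0.5):
    """Apply the slide's criteria: >= 200 bp, C+G > 0.6, obs./expect. >= 0.5."""
    n, gc, oe = cpg_stats(seq)
    return n >= min_len and gc > min_gc and oe >= min_obs_exp

# Toy sequences (hypothetical)
rich = "CG" * 100   # 200 bp, all C/G, many CpGs
poor = "AT" * 100   # 200 bp, no C/G at all
short = "CG" * 50   # CpG-rich but only 100 bp
```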

  19. Tissue-Specific and Non-Specific Promoters Have Distinct Base Compositions. Figure panels: CpG-, CpG+. Multi-tissue: H >= 4.4. Tissue-specific: H <= 3.5.

  20. TATA Boxes are Associated with Tissue-Specific Genes

  21. Functional relationships of promoter classes based on over-represented GO terms (EASE)

  22. First Clues: TATA box indicates tissue-specific expression; CpG indicates widespread expression. Additional clues: CpG-/TATA+ indicates high expression and secreted proteins, while CpG+/TATA- indicates cellular and mitochondrial proteins.

  23. Pattern Analysis of Pancreas Gene Promoters. Guang (Gary) Chen, Jonathan Schug.

  24. Identifying TFBMs - Method Pipeline. Starting with a gene expression tissue survey, pancreas-specific genes with common TFBS and biological processes are identified. Pipeline diagram labels: GNF Gene Expression Atlas; Shannon Entropy; DBTSS; Gene Lists with Tissue Specificity; Sequences around Transcription Start Sites; Gene Ontology (GO); Gene Clusters; Teiresias; Represent Seqs with PWMs; GO Category Analysis; Patterns; Comparative Genome Analysis; Tissue Specific Regulatory Modules Associated with GO Biological Process; Pattern Clustering; Pattern Clusters (PWM).

  25. Methods & Resources (Cont.) • DBTSS: Database of Transcriptional Start Sites. Based on 400,225 human and 580,209 mouse full-length cDNA sequences, DBTSS contains the genomic positions of the transcriptional start sites and the adjacent promoters for 8,793 human and 6,875 mouse genes. http://dbtss.hgc.jp/ Yutaka Suzuki, Riu Yamashita, Kenta Nakai and Sumio Sugano (2002). DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res. 30: 328-331. • Pancreas genes are chosen based on efforts to understand pancreatic development and function (EPConDB). • 500 bp upstream for preliminary study. • 159 human (mouse) pancreas-specific genes (Qislet < 7, positive (p)) and 159 human (mouse) ubiquitous genes (Qislet > 10, negative (n)). • This approach can be applied to any tissue to study the tissue specificity of transcription factor binding motifs (TFBMs) and modules.

  26. Method - Pattern Discovery - Teiresias. Teiresias Patterns • A Teiresias pattern P is an <L,W> pattern (with L ≤ W) if P contains at least L residues and every subpattern of P containing L residues is at most W symbols in length. *Rigoutsos, I. and A. Floratos, Combinatorial Pattern Discovery in Biological Sequences: the TEIRESIAS Algorithm. Bioinformatics, 14(1), January 1998.
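
The <L,W> condition can be checked directly from the positions of the residues in a pattern. The sketch below is not the TEIRESIAS discovery algorithm itself, only a validity test for a candidate pattern; it assumes "." as the wildcard symbol and that a pattern starts and ends with a residue. The function name is hypothetical.

```python
def is_lw_pattern(pattern, L, W, wildcard="."):
    """Check the <L,W> density condition for a pattern with wildcards."""
    if L < 1 or L > W:
        raise ValueError("require 1 <= L <= W")
    # Positions of the literal residues (everything that is not a wildcard)
    residues = [i for i, ch in enumerate(pattern) if ch != wildcard]
    if len(residues) < L:
        return False
    # Assumption: a pattern begins and ends with a residue, not a wildcard
    if pattern[0] == wildcard or pattern[-1] == wildcard:
        return False
    # Every run of L consecutive residues must span at most W symbols
    return all(
        residues[i + L - 1] - residues[i] + 1 <= W
        for i in range(len(residues) - L + 1)
    )
```

For example, with L = 3 and W = 4, "A.CG" qualifies (three residues within four symbols), while "A..CG" does not (the three residues span five symbols).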

  27. Identifying TFBMs - Pattern Distribution. With 117 human pancreas-specific genes (Qpancreas < 6.5, positive (p)) and 117 human ubiquitous genes (Qpancreas > 10, negative (n)), roughly 90,000 patterns were discovered in the 1kb+/200bp- promoter region. Patterns with ∆p-n > 20 (in blue box) are more likely to be pancreas specific. Figure captions: (1) each point represents a pattern, plotted by its occurrence in the positive data set (y-axis) and in the negative data set (x-axis); (2) for each pattern (x-axis), the occurrence difference ∆p-n (y-axis) between the positive (Q < 6.5) and negative (Q > 10) data sets.
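
The ∆p-n filter can be sketched as follows. The slide does not define "occurrence" precisely; this sketch assumes it is the number of sequences in a set that contain the pattern at least once, and that "." in a pattern matches any single base. All names and the toy sequences are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Compile a wildcard pattern; '.' matches any single base, other symbols literally."""
    return re.compile("".join("." if ch == "." else re.escape(ch) for ch in pattern))

def occurrence(pattern, sequences):
    """Number of sequences containing the pattern at least once (assumed definition)."""
    rx = pattern_to_regex(pattern)
    return sum(1 for s in sequences if rx.search(s))

def delta_p_n(patterns, positive, negative):
    """Occurrence in the positive set minus occurrence in the negative set."""
    return {p: occurrence(p, positive) - occurrence(p, negative) for p in patterns}

def select_patterns(patterns, positive, negative, threshold=20):
    """Keep patterns whose occurrence difference exceeds the threshold."""
    deltas = delta_p_n(patterns, positive, negative)
    return [p for p in patterns if deltas[p] > threshold]

# Toy promoter sets (hypothetical); the slide's threshold of 20 needs far larger sets
positive = ["ACGTAC", "TTACGA", "GGGACG"]
negative = ["TTTTTT", "ACGTTT", "CCCCCC"]
patterns = ["ACG", "T.T", "A.G"]
deltas = delta_p_n(patterns, positive, negative)
kept = select_patterns(patterns, positive, negative, threshold=1)
```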

  28. Method - Pattern Clustering. Diagram labels: Patterns; Pattern Clustering; Smith-Waterman; Hierarchical; Distance of pattern pair; K-Median; Num of Cluster; Pattern Clusters (PWM).

  29. Results - Pattern Clustering. Clustering results (human, ∆p-n > 20, 72 patterns).

  30. Identifying TFBMs. Identified known binding sites associated with human pancreas genes. 72 patterns (human, ∆p-n > 20) were clustered into 18 pattern clusters, and 6 of them were identified as known sites by searching TRANSFAC: AP2ALPHA, SRY, MEF2, NKX62, CAP_01, HOXA3.

  31. Identifying TFBMs. Comparative genomic analysis shows that some discovered TFBMs are conserved between human and mouse pancreas orthologs: AP2ALPHA, NKX62, MEF2, HOXA3, CAP_01.

  32. Gene Clustering - Based on TFBMs. Pancreas-specific genes can be clustered according to the presence or absence of conserved promoter motifs. Upstream sequences can be characterized by pattern occurrences, which can then be used to calculate pairwise similarities between sequences. For simplicity, we used a boolean model that considers only the presence or absence of 7 conserved patterns. Centered Pearson correlation was used to calculate similarity, and the 117 pancreas-specific genes (Q < 6.5) were grouped into 10 clusters with hierarchical clustering.
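
The boolean model and the similarity measure can be illustrated as below. Each upstream sequence becomes a 0/1 vector over the conserved patterns, and pairs of vectors are compared with centered Pearson correlation. The motif labels (m1 to m7) and gene names are placeholders, not the actual patterns or genes from the study, and the hierarchical clustering step itself is not shown (a distance such as 1 - r would typically be fed to it).

```python
import math

MOTIFS = ["m1", "m2", "m3", "m4", "m5", "m6", "m7"]  # placeholder labels for 7 patterns

def presence_vector(motifs_found, motif_names=MOTIFS):
    """Boolean model: 1 if the pattern occurs in the upstream sequence, else 0."""
    return [1 if m in motifs_found else 0 for m in motif_names]

def centered_pearson(x, y):
    """Pearson correlation after subtracting each vector's mean."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant vector; treated as 0 here
    return cov / math.sqrt(vx * vy)

# Hypothetical genes and their motif content
gene_a = presence_vector({"m1", "m2", "m5"})
gene_b = presence_vector({"m1", "m2", "m5"})
gene_c = presence_vector({"m3", "m4", "m6", "m7"})

r_same = centered_pearson(gene_a, gene_b)      # identical motif content
r_opposite = centered_pearson(gene_a, gene_c)  # complementary motif content
```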

  33. Gene Clustering - GO Category. Assign gene clusters to GO categories. To interpret clustering results, we used EASE to find the significant biological features of a gene cluster of interest through the GO Biological Process.

  34. More Clues: Known and novel TFBS found associated with genes expressed in the pancreas. See conservation of sites between human and mouse. Associated with digestion, catabolism, and response to stimulus GO biological processes.

  35. Discovering regulatory modules by creating profiles for Gene Ontology Biological Processes based on tissue-specificity scores. Elisabetta Manduchi, Jonathan Schug.

  36. If we focus on biological processes that are predominantly taking place in a given tissue, can we identify regulatory modules common to genes involved in these processes? Diagram labels: Genes, Biological Process, Tissue.

  37. For a given tissue survey, we attach “tissue-specificity” profiles to gene sets defined by GO BPs, based on the ranked lists of genes in each tissue according to increasing Q. • To this end, we use an Enrichment Score (ES) in the spirit of that described in Mootha et al. (2003), as a measure of tissue-specificity for that gene set. • The ES turns out to be equivalent (i.e. equal up to a multiplicative constant) to a Kolmogorov-Smirnov statistic.
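
The stated equivalence between the ES and a Kolmogorov-Smirnov statistic can be checked numerically. The sketch below uses the running-sum weights as they are usually given for the Mootha et al. (2003) score: +sqrt((N-G)/G) for a gene in the set and -sqrt(G/(N-G)) otherwise, with N ranked genes and G set members. The slide only says "in the spirit of", so the exact score used in the talk may differ. With these weights, ES = sqrt(G(N-G)) times the one-sided KS statistic. Gene names are hypothetical.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Maximum of a running sum over the ranked list (Mootha-style weights, assumed)."""
    n = len(ranked_genes)
    g = sum(1 for x in ranked_genes if x in gene_set)
    if g == 0 or g == n:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    hit = math.sqrt((n - g) / g)
    miss = -math.sqrt(g / (n - g))
    running, best = 0.0, 0.0
    for x in ranked_genes:
        running += hit if x in gene_set else miss
        best = max(best, running)
    return best

def ks_statistic(ranked_genes, gene_set):
    """One-sided KS statistic: max gap between in-set and out-of-set empirical CDFs."""
    n = len(ranked_genes)
    g = sum(1 for x in ranked_genes if x in gene_set)
    hits = misses = 0
    best = 0.0
    for x in ranked_genes:
        if x in gene_set:
            hits += 1
        else:
            misses += 1
        best = max(best, hits / g - misses / (n - g))
    return best

# Genes ranked by increasing Q in one tissue (hypothetical), and a GO BP gene set
ranked = ["g%d" % i for i in range(1, 11)]
go_bp = {"g1", "g2", "g4"}
es = enrichment_score(ranked, go_bp)
ks = ks_statistic(ranked, go_bp)
scale = math.sqrt(3 * (10 - 3))  # sqrt(G * (N - G))
```

Repeating this over every tissue in the survey would give one ES per tissue, which is the "tissue-specificity profile" attached to a GO BP.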

  38. Application to a Human Tissue Survey • The following results refer to the application of the methods described above to the GeneNote tissue survey: • 12 tissues in duplicate on the HGU95 Affymetrix chip set (Av2, B-E). • We looked at the 2316 GO BPs that we could map to probe sets (using version 1.5.1 of the Bioconductor GO and hgu95XXX metadata R packages).

  39. GO BPs having significantly specific profiles for each tissue can be identified. Figure labels: significant in liver; significant in heart and skeletal muscle.

  40. Excerpt of cluster of GO BPs based on their tissue-specificity profiles (up in spinal cord/brain)

  41. Focusing on steroid metabolism • (A) After mapping probe sets to RefSeqs and retrieving their upstream sequences from DBTSS, we assembled a set of 63 promoter sequences, which was our positive set. • (B) We generated 5 negative sets, each consisting of 315 sequences, by randomly scrambling each of the positive set sequences. • (C) We ranked each of 666 Transcription Factor Binding Sites (TFBSs) from TRANSFAC, represented by position matrices, in terms of their ability (measured by average ROC area) to discriminate between the positive set and the negative sets.
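
A rough sketch of the negative-set construction and the ROC-area ranking criterion follows. Since 315 = 5 x 63, it assumes each positive sequence is scrambled five times per negative set; the slide does not confirm this. How each sequence is scored against a position matrix is not described, so scores are taken as given inputs, and all function names are hypothetical.

```python
import random

def scramble(seq, rng):
    """Shuffle the bases of a sequence, preserving its base composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def make_negative_sets(positive, n_sets=5, per_sequence=5, seed=0):
    """Build n_sets negative sets, each with per_sequence scrambles of every positive."""
    rng = random.Random(seed)
    return [
        [scramble(s, rng) for s in positive for _ in range(per_sequence)]
        for _ in range(n_sets)
    ]

def roc_area(pos_scores, neg_scores):
    """Area under the ROC curve: P(positive outscores negative), ties count one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def average_roc_area(pos_scores, neg_score_sets):
    """Mean ROC area of one TFBS matrix across all negative sets."""
    return sum(roc_area(pos_scores, ns) for ns in neg_score_sets) / len(neg_score_sets)
```

Ranking the TFBS matrices then amounts to sorting them by `average_roc_area`, with values near 0.5 indicating no discrimination and values near 1.0 indicating strong discrimination.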

  42. Focusing on steroid metabolism. We then selected high-ranking TFBSs from (C) and high-ranking TFBSs from an independent study focusing on liver specificity and formed all possible pairs between these two sets. These pairs were ranked according to their discriminative ability and on the basis of the distance between their components in the positive hits. Optimal parameters (distance and individual TFBS match scores) were selected for each pair scoring at the top. By assessing the performance over a test set composed of mouse promoter sequences, we found 2 candidate CRMs (involving 3 and 4 TFBSs, respectively) with an over-representation of steroid metabolism genes.

  43. Example of production hits to steroid metabolism mouse promoter sequences. Figure legend: TSS; green = forward strand, red = reverse strand; shading indicates strength. • Production TFBSs: {FOXD3_01, GKLF_01, HFH1_01, MADSA_Q2} • Parameters: max distance = 130; FOXD3_01 min score = 9.934705; GKLF_01 min score = 10.815614; HFH1_01 min score = 9.442617; MADSA_Q2 min score = 8.246301. No. mouse promoter sequences: 6875, of which 50 belong to genes mapping to steroid metabolism. No. production hits: 257, of which 8 belong to genes mapping to steroid metabolism.
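
The counts on this slide are enough to quantify the over-representation. The sketch below computes the fold enrichment (about 4.3) and a hypergeometric upper-tail probability. The slide reports no p-value and does not say which test was used, so the hypergeometric test is an illustrative choice, not the authors' stated method.

```python
from math import comb

def fold_enrichment(hits_in_set, hits, set_size, population):
    """Rate of set members among hits divided by their rate in the population."""
    return (hits_in_set / hits) / (set_size / population)

def hypergeom_upper_tail(k, population, set_size, draws):
    """P(X >= k) when `draws` items are sampled without replacement."""
    total = comb(population, draws)
    favorable = sum(
        comb(set_size, i) * comb(population - set_size, draws - i)
        for i in range(k, min(set_size, draws) + 1)
    )
    return favorable / total

# Counts from the slide
N_PROMOTERS = 6875   # mouse promoter sequences
N_STEROID = 50       # of these, genes mapping to steroid metabolism
N_HITS = 257         # production hits
N_HITS_STEROID = 8   # hits mapping to steroid metabolism

fold = fold_enrichment(N_HITS_STEROID, N_HITS, N_STEROID, N_PROMOTERS)
p_value = hypergeom_upper_tail(N_HITS_STEROID, N_PROMOTERS, N_STEROID, N_HITS)
```

Under random sampling, about 257 x 50 / 6875, or roughly 1.9, steroid metabolism promoters would be expected among the hits, compared with the 8 observed.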

  44. More Clues: We can identify candidate CRMs from top-ranking GO Biological Processes for tissues. Identified a candidate CRM for steroid metabolism.

  45. Summary • GUS is a functional genomics database system used by a growing number of sites for genome and expression projects. • Using expression data in GUS and entropy-based metrics, we can rank genes according to their tissue specificity, learn promoter properties, and associate functional roles. • In addition to general properties of tissue-specific promoters, we are beginning to identify combinations of motifs (i.e., regulatory modules) associated with expression in specific tissues.

  46. Future Directions • Refine analysis from genes to transcripts • Refine analysis from organs to cells • Apply approach to splicing • Apply approach to developmental stage and differentiation state Our goal is to make inferences of the form: "The gene set G shows specificity for tissue T and is regulated by module M in this context".

  47. http://www.cbil.upenn.edu
