TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars. Jonathan Schug and Christian J. Stoeckert, Jr., Center for Bioinformatics at the University of Pennsylvania. jschug,stoeckrt@pcbi.upenn.edu
Problem: Too many binding sites • The typical search with TESS yields about 1 binding site start per base and a good match every few bases! • Which ones are active?
Combining Signals and Sites • Want to consider: • Pairs or triples (or more) of binding sites, • Sites in particular contexts, e.g., CpG islands, introns, regions of homology, • Physico-chemical properties of DNA sequence
Why Grammars? • Grammars provide a means of describing complex reusable structures and combining them in novel ways. • Analysis of some promoters suggests a modular design. • Grammars are a way of moving beyond dimers to structured descriptions of promoters, enhancers, and entire genes.
Grammars Using the alphabet of DNA, we want to find a grammar for the language of promoters of genes expressed in liver. • Alphabet: a finite set of letter symbols. • String: a finite sequence of letters from an alphabet. • Language: a set of strings over an alphabet. • Grammar: a formal description of a language.
Example Grammar • A simple example for a small portion of English. • A context-free grammar has four components:
• Alphabet = {a,b,c,…,z,space}
• Symbols = {S,Vp,Av,V,Np,D,Aj,N}
• Start symbol = S
• Productions:
S -> Np Vp;
Np -> D Aj N;
Vp -> Av V Np | Av V;
D -> a | the | his | her;
Aj -> happy | silver | new | ;
N -> boy | girl | bicycle | guitar;
Av -> slowly | quietly | ;
V -> rides | strums;
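Derivation Trees • Record the productions used in generation/parsing of a string. • Yield a structured description of the string. [Figure: derivation tree for "The happy boy slowly rides a silver bicycle": S -> Np Vp; Np -> D Aj N (The happy boy); Vp -> Av V Np (slowly rides, followed by a second Np -> D Aj N for a silver bicycle).]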
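The derivation in the tree slide can be reproduced with a small backtracking parser. This is a minimal sketch, not the TESS-II parser: it assumes the input is split into words rather than letters, and it reads the trailing empty alternative in the Aj and Av productions as "this symbol may be omitted". The names GRAMMAR, parse, parse_seq, and derive are made up for illustration.

```python
# Productions from the example grammar; an empty tuple is the empty alternative
# (an assumption about what the trailing "| ;" means).
GRAMMAR = {
    "S": [("Np", "Vp")],
    "Np": [("D", "Aj", "N")],
    "Vp": [("Av", "V", "Np"), ("Av", "V")],
    "D": [("a",), ("the",), ("his",), ("her",)],
    "Aj": [("happy",), ("silver",), ("new",), ()],
    "N": [("boy",), ("girl",), ("bicycle",), ("guitar",)],
    "Av": [("slowly",), ("quietly",), ()],
    "V": [("rides",), ("strums",)],
}


def parse(symbol, words, pos):
    """Yield (tree, next_pos) for every way `symbol` derives words[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Terminal: it must equal the next word.
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        for children, end in parse_seq(rhs, words, pos):
            yield (symbol, children), end


def parse_seq(rhs, words, pos):
    """Yield (list_of_trees, next_pos) for a sequence of right-hand-side symbols."""
    if not rhs:
        yield [], pos
        return
    for tree, mid in parse(rhs[0], words, pos):
        for rest, end in parse_seq(rhs[1:], words, mid):
            yield [tree] + rest, end


def derive(sentence):
    """Return the first derivation tree covering the whole sentence, or None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None
```

Calling derive("The happy boy slowly rides a silver bicycle") returns a nested (symbol, children) structure rooted at S, which is the "structured description of the string" that the slide refers to.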
Parser Data Flow [Figure: parser data-flow diagram with numbered steps 1-5. Labels: Grammar Rules, Flatfile (FASTA), GUS, DAS, Main Stream, Other Streams, Processing Main Stream, Parser, Matches (XML).]
• Parser loads grammar from flat file. • Main sequence is extracted from a file, GUS database, or DAS server via a plugin. • Sequence is handed to secondary plugins to populate their streams. • Grammar rules are applied. • Matching instances are output in XML.
Data Streams • Patterns can be specified in terms of characters, real values, or structured annotation.
• Main character stream, e.g. DNA or AA: acgtagtccgcgcgagcgttagcgagataggcagaatatagca
• Real value stream, e.g. CG-content: 0.31 0.33 0.41 0.64 0.60 0.58 0.51 0.44 0.38
• Gene annotation stream: genes, repeats, homology, etc., e.g. Transcript, 3’UTR, 5’UTR, Exon, Intron
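A real-valued stream such as CG-content can be derived from the main character stream. The sketch below assumes a fixed-width sliding window that advances one base at a time; the slide does not state the window size or step that TESS-II uses, so the function name cg_content_stream and the default window of 10 are illustrative only.

```python
def cg_content_stream(seq, window=10):
    """Return the fraction of C/G bases in each sliding window of `seq`.

    The window width and the one-base step are assumptions for illustration.
    """
    seq = seq.lower()
    values = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        values.append(sum(base in "cg" for base in chunk) / window)
    return values


def positions_above(values, threshold):
    """Return the window start positions whose value exceeds `threshold`."""
    return [i for i, value in enumerate(values) if value > threshold]
```

For example, positions_above(cg_content_stream(seq), 0.6) lists the windows that a numeric comparison on such a stream would select.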
Annotation Objects • Attributes for stream annotation objects. • All can be used to select annotation in grammar. • General form: StreamName::NameValue[name rel value, …] • For example, Gene::Intron[index=1] selects the first intron.
Bounding the Size of Matches • Expansion of a production is bounded by some interval, either another production or some annotation.
A -(P.R)-> B C D; /* bounded subparse */
P ---> Q R S; /* defines context */
[Figure: intervals A, B, C, D and P, Q, R, S drawn over the example sequence acgtgcatgactagcatcagcatagcatcagcatcagcatgatcgagatc.]
What Genes Are Regulated by CREB? • CREB (cyclic AMP response element binding protein) • Binds as a dimer to the CRE site, with full consensus TGACGTCA or half site CGTCA. • Member of a family including CREB, CREM, and ATF-1. • CREB and CREM have several isoforms which may be (conditional) activators, inhibitors, or inducible repressors. • Inducible by a wide variety of signals including cAMP, Ca, growth factors, hypoxia, survival signals, and UV light. • Mutations affect a variety of cell types, with phenotypes including circadian rhythm and learning defects. • Thought to bind close to the TSS because it interacts with the transcription complex via CBP. Collaboration with M. Mackiewicz and A. Pack, Center for Sleep and Respiratory Neurobiology, UPenn.
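As a baseline for the consensus search, exact occurrences of the full CRE consensus and the half site can be located on both strands with plain string matching. This is a sketch under simple assumptions (exact matches only, 0-based start coordinates on the forward strand); the helper names reverse_complement and find_sites are illustrative and not part of TESS.

```python
import re

_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(seq, motif):
    """Return sorted (start, strand) pairs for exact matches of `motif`.

    Overlapping matches are reported. A palindromic motif is reported
    on the '+' strand only, since both strands read the same.
    """
    seq, motif = seq.lower(), motif.lower()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        if strand == "-" and pattern == motif:
            continue  # palindrome: the minus-strand hits would duplicate the plus-strand hits
        # A zero-width lookahead lets overlapping occurrences be found.
        for match in re.finditer(f"(?={re.escape(pattern)})", seq):
            hits.append((match.start(), strand))
    return sorted(hits)
```

Because TGACGTCA is its own reverse complement, find_sites(seq, "tgacgtca") needs only one strand, whereas the half site CGTCA must also be searched as TGACG.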
Activation of CREB; CREB Family Members [Figures from Nature Reviews: Molecular Cell Biology, August 2001.]
TESS Weight Matrix for CREB Information content of weight matrix. Accuracy curves are used to pick thresholds with estimated sensitivity and specificity.
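The information content of a weight matrix column is commonly computed as 2 + Σ p(b) log2 p(b) bits over the four bases, assuming a uniform background. The sketch below uses that standard formula on made-up counts; it does not reproduce the TESS scoring function or the actual CREB matrix, and the function names are illustrative.

```python
import math


def column_information(counts):
    """Information content (bits) of one column, assuming a uniform background.

    `counts` maps bases 'a', 'c', 'g', 't' to observed counts; missing bases count as 0.
    """
    total = sum(counts.get(base, 0) for base in "acgt")
    info = 2.0  # log2(4): maximum for a four-letter alphabet
    for base in "acgt":
        p = counts.get(base, 0) / total
        if p > 0:
            info += p * math.log2(p)
    return info


def matrix_information(columns):
    """Return the per-column information content of a count matrix."""
    return [column_information(column) for column in columns]
```

A fully conserved column scores 2 bits and a column with equal counts for all four bases scores 0 bits, so the profile shows which positions of the site carry the signal.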
How are CRE sites distributed? • Preliminary question to help identify constraints on active sites. • Use grammar to query RefSeq transcripts aligned to UCSC GoldenPath release 2 of mouse genome and look for consensus sites or very good weight matrix hits.
CREB Grammar
/* rules for various qualities of binding sites in an upstream region */
ConsensusCrebs -[Genes::RefGeneGroup]-> tgacgtca;
VeryGoodCrebs -[Genes::RefGeneGroup]-> TF::CREB[Score<2.1];
GoodCrebs -[Genes::RefGeneGroup]-> TF::CREB[Score<3.591];
MostCrebs -[Genes::RefGeneGroup]-> TF::CREB[Score<5.891];
/* Put RefSeq alignments from DAS server in Genes stream. */
Genes <--- DAS --Types='refGene' --Anchor='upstream' --UpstreamFlank=-2500 --DownstreamFlank=500;
/* Put weight matrix predictions on the fly into TF stream. */
TF <--- WMS --LdMax=6 --File='Data/TESS/WMS/all.wm';
Run parser with this command:
FlatPat --Grammar Creb.fp \
  --DasServer http://genome.cse.ucsc.edu/cgi-bin/das/hg11 \
  > creb.xml
Distribution of CREB Sites • Is not explained by first- or second-order Markov models or gaps in sequence. • Confirms and extends initial knowledge. • Similar results obtained in human. • Suggests positional cutoff and confidence in predicted sites. • Suggests total number of regulated genes at 1000-1500 (based on human data).
Why Are Genes Expressed in Muscle? • At least five factors are known to be relevant: Myf (MyoD), MEF2, Sp1, SRF, and TEF. • Data from Wasserman and Fickett (1998) indicate that not all factors are required simultaneously. • Consider patterns of subsets of sites. • First check overall distribution of sites in RefSeq upstream regions. • Examined predictive ability of two collections of factors: • SRF, Sp1, and Myf • TEF and Sp1
TEF and Sp1 • TEF is not restricted to muscle. • Binds to GGAATC consensus. • Used weight matrix built with Wasserman and Fickett data. • Often occurs near Sp1 in the training set.
MuscleF_Set -{50,GoldenPath::RefGeneGroup}-> TF::TEF, TF::SP1;
SRF, Sp1, and MyoD • Have been shown to interact in the human cardiac alpha actin gene, centered at about -50 bp and spanning 66 bp (Biesiada et al MolCelBio 19(4) 1999). [Figure: SRF, Sp1, and MyoD sites.]
/* expanded length, no spacing constraints, same orientation */
Hca_List -[150,GoldenPath::RefGeneGroup]-> TF::SRF[Sense = 1], TF::SP1[Sense = 1], TF::MYF[Sense = 1];
/* expanded length, no spacing constraints, orientation unconstrained */
Hca_List_Unoriented -[150,GoldenPath::RefGeneGroup]-> TF::SRF, TF::SP1, TF::MYF;
/* expanded length, order and orientation not constrained */
Hca_Set -{150,GoldenPath::RefGeneGroup}-> TF::SRF, TF::SP1, TF::MYF;
Production Terms
• Literal match: lowercase letters or quoted text, e.g. gata or "gata"
• Gap: a period with optional length bounds, e.g. ., .#5, .#{20,30}
• Annotation: Stream::Name[attr rel value, …], e.g. BindingSites::Sp1[score<2.1,sense=+1]
• Numeric comparison: Stream:: rel value, e.g. CG:: > 0.6
• Position: @position or @@position for relative or absolute position, e.g. @1 for start of bounding interval.
List Productions • Gaps between terms are implied. • Numeric or term bounds keep the match from expanding.
A -()-> B, C, D;
P -[]-> Q, R, S;
[Figure: matches for A (B, C, D) and P (Q, R, S) drawn over the example sequence acgtgcatgactagcatcagcatagcatcagcatcagcatgatcgagatc.]
Set Productions • Gaps between terms are implied. • Terms must appear at least once but order is not specified. • Numeric or term bounds keep the match from expanding.
P -[]-> Q, R, S;
P -{}-> Q, R, S;
[Figure: matches for P with Q, R, and S in differing orders drawn over the example sequence acgtgcatgactagcatcagcatagcatcagcatcagcatgatcgagatc.]
Bag Productions • Gaps between terms are implied. • Terms must appear at least as many times as specified but order is not specified. • Numeric or term bounds keep the match from expanding.
P -<>-> Q:2, R:2, S:1;
[Figure: a match for P containing R, Q, R, Q, S drawn over the example sequence acgtgcatgactagcatcagcatagcatcagcatcagcatgatcgagatc.]
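The difference between list, set, and bag productions can be illustrated on a list of annotated site hits. The sketch below is an assumption-laden approximation, not the FlatPat matching algorithm: hits are (name, start, end) tuples, a numeric bound is treated as a maximum span in bases, overlap and orientation rules are ignored, and the list matcher greedily takes the earliest hit for each term. All function names are illustrative.

```python
from collections import Counter


def list_matches(hits, names, max_len):
    """Ordered match: `names` must occur left to right within `max_len` bases."""
    ordered = sorted(hits, key=lambda hit: hit[1])
    found = []
    for first in ordered:
        if first[0] != names[0]:
            continue
        end, complete = first[2], True
        for name in names[1:]:
            # Greedy choice: earliest hit of this name starting at or after `end`.
            nxt = next((h for h in ordered if h[0] == name and h[1] >= end), None)
            if nxt is None:
                complete = False
                break
            end = nxt[2]
        if complete and end - first[1] <= max_len:
            found.append((first[1], end))
    return found


def bag_matches(hits, required, max_len):
    """Unordered match: each name appears at least `required[name]` times
    inside a window of `max_len` bases. Returns qualifying window starts."""
    starts = []
    for left in sorted({hit[1] for hit in hits}):
        inside = [h for h in hits if h[1] >= left and h[2] <= left + max_len]
        counts = Counter(name for name, _, _ in inside)
        if all(counts[name] >= k for name, k in required.items()):
            starts.append(left)
    return starts


def set_matches(hits, names, max_len):
    """A set production is treated here as a bag with every count equal to 1."""
    return bag_matches(hits, {name: 1 for name in names}, max_len)
```

With hits for SRF, SP1, and MYF in that order inside 150 bases, both list_matches and set_matches succeed; if the order on the sequence is changed, only set_matches still does.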
GUS.TESS Schema • Sites, weight matrices, grammars, training data, and parses will be stored in GUS30. • Will initialize with TRANSFAC and COMPEL.
• TESS.Moiety: Moiety, MoietyHeterodimer, MoietyMultimer, MoietyComplex
• DoTS.NaFeature: BindingSite, Promoter, . . .
• TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity
• TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar
• Other tables: TESS.FootprintInstance, TESS.TrainingSet, TESS.ParameterGroup, DoTS.NaSequence, TESS.Note
Integration with GUS • GUS contains schema and data for genomic sequence and RNA expression. • Goal is to store models and instances of known and predicted regulatory regions for specific tissues.
Future Work • Development and evaluation of more patterns. • Experimental validation of predictions. • Expansion of parser to recursive productions. • Inclusion of comparative species analysis.
Related Posters • 146A. The Genomics Unified Schema (GUS). • 114A. Web-Based Biological Discovery using an Integrated Database. • 148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?