Combining Location, Expression, Conservation in regulatory motif prediction

Combining Location, Expression, Conservation in regulatory motif prediction
Manolis Kamvysselis
May 2001

Motif Discovery: The problem

Motif form (example) | Occurrence by chance in a random sequence | Chance occurrences in a random 12 Mb genome | Actual occurrences in S. cerevisiae
6 bases long (ACCGAT) | 1 every 4 kb | 3,000 | 1,997
1 wildcard (ACCGNT) | 1 every 1 kb | 12,000 | 7,572
+1 ambiguity (AC[GC]GNT) | 1 in 500 b | 24,000 | 13,457
ambiguity or gap (AC[GC{}]GNT) | 1 in 150 b | 70,000 | 47,842

• Regulatory motifs are hard to recognize
  • Sequence is short: likely to encounter false candidates often
  • Sequence can have gaps: second-order models are needed
  • Sequence itself can vary: individual bases can have more than one alternative
  • Position of the sequence upstream of the transcription initiation site is not fixed
  • Signal to noise: motif-like sequences can occur randomly without function
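The chance-occurrence figures in the table above can be approximated with a short calculation. The sketch below assumes a uniformly random genome (each base with probability 1/4), counts one strand only, and ignores edge effects; the function names are illustrative, and the gapped pattern `AC[GC{}]GNT` is not handled. Under these assumptions the first three rows come out close to the slide's rounded values (about 1 in 4,096, 1 in 1,024, and 1 in 512).

```python
GENOME_SIZE = 12_000_000  # 12 Mb, the genome size quoted on the slide


def match_probability(pattern: str) -> float:
    """Probability that a uniformly random sequence matches `pattern`
    at one fixed position. Supports A/C/G/T, the wildcard N, and
    bracketed alternatives such as [GC]. Gaps are not supported."""
    prob = 1.0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            j = pattern.index("]", i)
            # Any one of the listed bases is accepted at this position.
            prob *= len(set(pattern[i + 1:j])) / 4
            i = j + 1
        elif ch == "N":
            # Wildcard: every base matches, so the probability is unchanged.
            i += 1
        else:
            prob *= 0.25
            i += 1
    return prob


def expected_occurrences(pattern: str, genome_size: int = GENOME_SIZE) -> float:
    """Expected number of chance matches on one strand of a random genome."""
    return genome_size * match_probability(pattern)


if __name__ == "__main__":
    for pattern in ["ACCGAT", "ACCGNT", "AC[GC]GNT"]:
        p = match_probability(pattern)
        print(pattern, "1 in", round(1 / p), "expected", round(expected_occurrences(pattern)))
```

Comparing these expectations with the actual S. cerevisiae counts shows why sequence alone is a weak signal: the observed counts are of the same order as chance.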

Feature Selection and Classification

• Features for motif discovery
  • Sequence - does my sequence look like known motifs? (ScanACE)
  • Homology - is it conserved across multiple species?
  • Expression - is the gene downstream expressed under the expected conditions?
  • Regulation - classical genetics knowledge about the gene’s regulatory factors
  • Location - what transcription factors are found to bind in the region?
  • Occurrence - co-occurrence patterns and distance upstream
  • Clustering - motifs shared with genes in the same pathway / functional group?
• Classification
  • Probabilistic framework: each feature has a P-value vs. the null hypothesis
  • Classify each sequence fragment as either part of a motif or not
  • Feature selection can be based on the predictive power of each feature
  • Supervised training is possible for motifs in well-studied regulatory systems
  • In other systems, make a prediction based on all but one feature, then evaluate the held-out feature
  • Independent features can be used to refine the error model for individual features
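The slide does not say how per-feature P-values would be combined. As one illustration only, the sketch below uses Fisher's method, a standard technique that is swapped in here and not taken from the slides. It assumes the features are independent under the null hypothesis, which the later slides note is not generally true; the function name is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent P-values with Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For an even number of
    degrees of freedom the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!, so no external library
    is needed.
    """
    if not pvalues or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("P-values must be a non-empty list in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals statistic / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Hypothetical P-values for sequence, conservation and location evidence.
    print(fisher_combined_pvalue([0.05, 0.05, 0.20]))
```

With a single P-value the function returns that value unchanged, which is a quick sanity check on the closed form.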

Motif Discovery: The methods

Expression clustering
• Cluster co-regulated genes according to expression patterns
• Scan upstream regions for common motifs

Location analysis
• Determine intergenic regions of transcription factor binding
• Scan identified regions for common motifs

Conservation across multiple species
• Align the genomes of related species
• Find conserved sequences in intergenic regions

Historically: Promoter mapping (not used)
• Knock out sections of the upstream region of the gene of interest
• Identify regions which disrupt regulation as binding sites

Cluster Patterns of Expression

• The clusters depend on
  • normalization method
  • clustering algorithm, distance metric
  • algorithm parameters (threshold, number of clusters, iterations)
• Error rates:
  • Sensitivity: 90% of co-expressed genes cluster together
  • Specificity: 20% of clustered genes are co-expressed

From Clusters to Motifs

• Motifs found depend on
  • distance upstream that one chooses to consider
  • expected motif length, threshold for joining motif instances
  • Gibbs sampling algorithm initialization and convergence
  • assumption that the same transcription factor binds in all sequences considered

Location Analysis

• Advantages for motif discovery
  • the sequences sampled are actually those bound by the transcription factor of interest
  • direct observation of binding (not expression)
• Limitations
  • The entire intergenic region is a candidate site
  • Binding affinity data are not quantitative
• Error rates:
  • Sensitivity: 80% of bound regions are observed as such
  • Specificity: 20% of observed regions are actually bound

Conservation analysis

• Disadvantages
  • Closely related species: sequences are conserved merely for lack of divergence time
  • Distantly related species may evolve new regulatory factors and motifs
• Motifs depend on
  • species chosen at the right distances
  • orthologous regions correctly detected and correctly aligned
  • separating signal from noise
• Species shown: S. cerevisiae, S. paradoxus, S. mikatae, K. yarrowii
• Alignment specificity
  • Blast hit without conservation: 1%
  • No hit despite conservation: 20%
• Conservation specificity
  • Conservation without function: 60%
  • Function without conservation: 20%

Hit and Conservation specificity

• Hit sensitivity
  • Coverage: 1X = 90%
  • Hits that can be trusted: 80%
• Evolution specificity
  • P(conserved|func) = 90%
  • P(conserved|nonfunc) = 60%
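The two conditional probabilities above translate into a posterior through Bayes' rule. The sketch below is an illustration under stated assumptions: the prior probability that a candidate site is functional is hypothetical (the slides do not give one), and observations in different species are treated as independent, which the later slides point out is a simplification.

```python
def posterior_functional(prior, n_conserved_species,
                         p_cons_func=0.9, p_cons_nonfunc=0.6):
    """P(functional | conserved in n species) via Bayes' rule.

    p_cons_func and p_cons_nonfunc are the slide's P(conserved|func)
    and P(conserved|nonfunc). `prior` is an assumed P(functional).
    Species are assumed independent, so likelihoods multiply.
    """
    like_func = p_cons_func ** n_conserved_species
    like_nonfunc = p_cons_nonfunc ** n_conserved_species
    numerator = prior * like_func
    return numerator / (numerator + (1.0 - prior) * like_nonfunc)


if __name__ == "__main__":
    hypothetical_prior = 0.01  # assumption, not from the slides
    for n in range(0, 4):
        print(n, "species:", posterior_functional(hypothetical_prior, n))
```

Each conserved species multiplies the odds by only 0.9 / 0.6 = 1.5, so conservation in a single additional species is weak evidence on its own.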

Reducing the noise (independence)

Motif form (example) | 1 species (S. cerevisiae) | P(conserved), 2 species 60% apart | P(conserved), 3 species at 60% | P(location), factor binding
6 bases long (ACCGAT) | 1 every 6,000 b; 1,997 in S.c. | 1 in 20; 100 in S.c. | 1 in 400; 5 in S.c. | 1 in 5 sites; 1 in S.c.
1 wildcard (ACCGNT) | 1 every 1,500 b; 7,572 in S.c. | 1 in 13; 582 in S.c. | 1 in 169; 45 in S.c. | 1 in 5; 9 in S.c.
+1 ambiguity (AC[GC]GNT) | 1 every 900 b; 13,457 in S.c. | 1 in 10; 897 in S.c. | 1 in 100; 135 in S.c. | 1 in 5; 27 in S.c.
ambiguity or gap (AC[GC{}]GNT) | 1 in 250 b; 47,842 in S.c. | 1 in 4; 11,960 in S.c. | 1 in 16; 2,990 in S.c. | 1 in 5; 598 in S.c.
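The first row of the table above can be reproduced by multiplying pass rates, which is what treating the filters as independent amounts to. In this sketch the function name is illustrative, and the reading of "1 in 400" for three species as two independent "1 in 20" comparisons is an interpretation of the slide, not something it states.

```python
def surviving_candidates(initial_count, pass_fractions):
    """Expected number of candidates left after each successive filter,
    assuming the filters act independently (pass rates multiply)."""
    remaining = float(initial_count)
    counts = []
    for fraction in pass_fractions:
        remaining *= fraction
        counts.append(remaining)
    return counts


if __name__ == "__main__":
    # ACCGAT row: 1,997 occurrences in S. cerevisiae, then a second species
    # (1 in 20), a third species (another 1 in 20), then location data (1 in 5).
    counts = surviving_candidates(1997, [1 / 20, 1 / 20, 1 / 5])
    print([round(c) for c in counts])
```

The rounded counts for this row agree with the slide (100, 5, and 1 candidates).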

Modeling the dependencies

• Binding and regulation
  • Regulation data depends on the presence of binding
  • Location data depends on binding but also on other factors
• Conservation data
  • Multiple species provide extra predictive power
  • However, species observations are not independent
  • Dependencies are modeled with a phylogenetic tree
• Binding and motif conservation
  • The conservation of a regulatory motif and the binding of the factor specific to that motif both depend on the functionality of the motif
• Environmental factors
  • Binding may occur only in some conditions, not in others

Bayesian network topology

Working with the network

• Forward network
  • Estimate parameters for single models from experience
  • See how the network behaves based on evidence collected
• Training based on experience
  • Fix conditionals for which best estimates are known
  • Train the model on sample data and estimate missing parameters
• Exploring alternate topologies
  • Evaluate optimal P(data|topology) by iterating over parameter space and maximizing P(data|topology,parameters)
  • Choose the topology that best fits the data within dimensionality
• Feature selection
  • Based on edge weights in optimal parameter settings, evaluate features according to cost and added information content
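Running a forward network means fixing the conditional probabilities by hand and querying it with evidence. The sketch below is a toy network, not the topology from the slides (the topology figure is not available in this transcript). It assumes four binary variables, functional → conserved, functional → bound, bound → observed by location analysis, and computes posteriors by brute-force enumeration. P(conserved|func), P(conserved|nonfunc), and the 80% location sensitivity come from the slides; every other number is a hypothetical placeholder.

```python
from itertools import product

# From the slides: P(conserved | functional) and P(conserved | non-functional).
P_CONSERVED_GIVEN_F = {True: 0.9, False: 0.6}
# From the slides: 80% of bound regions are observed as bound.
# The 0.1 false-observation rate is a hypothetical placeholder.
P_OBSERVED_GIVEN_B = {True: 0.8, False: 0.1}
# Hypothetical placeholders, not from the slides.
P_FUNCTIONAL = 0.05
P_BOUND_GIVEN_F = {True: 0.9, False: 0.05}


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(functional, bound, conserved, observed):
    """Joint probability factorized along the assumed network edges."""
    return (bernoulli(P_FUNCTIONAL, functional)
            * bernoulli(P_BOUND_GIVEN_F[functional], bound)
            * bernoulli(P_CONSERVED_GIVEN_F[functional], conserved)
            * bernoulli(P_OBSERVED_GIVEN_B[bound], observed))


def posterior_functional_given(conserved=None, observed=None):
    """P(functional | evidence); None means the variable is unobserved."""
    numerator = 0.0
    denominator = 0.0
    for f, b, c, o in product([True, False], repeat=4):
        if conserved is not None and c != conserved:
            continue
        if observed is not None and o != observed:
            continue
        p = joint(f, b, c, o)
        denominator += p
        if f:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print("no evidence:      ", posterior_functional_given())
    print("conserved:        ", posterior_functional_given(conserved=True))
    print("conserved + bound:", posterior_functional_given(conserved=True, observed=True))
```

Enumeration is only practical for a handful of variables, but it keeps every assumption visible, which suits the forward-network use described above.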

What have we learned?

• Multiple species are useful
  • Information content depends on phylogenetic tree topology
  • Multiple pairwise alignments can add or remove certainty
  • Select species' evolutionary distance based on added performance
• The power of Bayesian networks
  • Making our assumptions explicit: not everything is independent
• Predicting regulatory motifs
  • Insufficient training data for this project; only the forward network
  • Pursue training of the Bayes network as data becomes available
• Future work
  • Method generalizable to gene prediction, RNA, other features
  • Integrate more data sources as they become available
