140 likes | 309 Views
Combining Location, Expression, Conservation in regulatory motif prediction. Manolis Kamvysselis May 2001. Motif Discovery: The problem. Length of motif. Occurrence by chance in a random seq. Number of chance occurrences in random 12Mb genome. Actual number of occurrences in S.cerevisiae.
E N D
CombiningLocation, Expression, Conservationinregulatory motif prediction Manolis Kamvysselis May 2001
Motif Discovery: The problem Length of motif Occurrence by chance in a random seq Number of chance occurrences in random 12Mb genome Actual number of occurrences in S.cerevisiae 6 bases long ACCGAT 1 every 4Kb 3,000 1,997 1 wildcard ACCGNT 1 every 1Kb 12,000 7,572 +1 ambiguity AC[GC]GNT 1 in 500b 24,000 13,457 ambiguity or gap AC[GC{}]GNT 1 in 150b 70,000 47,842 • Regulatory motifs are hard to recognize • Sequence is short: Likely to encounter false candidates often • Sequence can have gaps: Second order models are needed • Sequence itself can vary: Individual bases can have more than one alternative • Position of the sequence upstream of the transcription initiation is not fixed • Signal to noise: Motif-like sequences can occur randomly without function
Feature Selection and Classification • Features for motif discovery • Sequence - does my sequence look like known motifs? (ScanACE) • Homology - is it conserved across multiple species? • Expression - is the gene downstream expressed under the expected conditions? • Regulation - classical genetics knowledge about gene’s regulatory factors • Location - what transcription factors are found to bind in the region? • Occurrence - co-occurrence patterns and distance upstream • Clustering - motifs shared with genes in same pathway / functional group? • Classification • Probabilistic framework: each feature has P-value vs. null hypothesis • Classify each sequence fragment as either part of a motif or not • Feature selection can be made based on predictive power for each • Supervised training is possible for motifs in well studied regulatory systems • In other systems, make prediction based on all but one feature, evaluate feature • Independent features can be used to refine error model for individual features
Motif Discovery: The methods Expression Clustering • Cluster co-regulated genes according to expression patterns • Scan upstream regions for common motifs Location analysis • Determine intergenic regions of transcription factor binding • Scan identified regions for common motifs Conservation across multiple species • Align the genomes of related species • Find conserved sequences in intergenic regions Historically: Promoter mapping (not used) • Knock out sections of the upstream region of the gene of interest • Identify regions which disrupt regulation as binding sites
Cluster Patterns of Expression • The clusters depend on • normalization method • clustering algorithm, distance metric • algorithm parameters (threshold, #clusters, iterations) • Error rates: • Sensitivity: 90% of co-expressed cluster • Specificity: 20% of clustered are co-expressed
From Clusters to Motifs • Motifs found depend on • distance upstream that one chooses to consider • expected motif length, threshold for joining motif instances • Gibbs sampling algorithm initialization and convergence • assumption that the same transcription factor binds in all sequences considered
Location Analysis • Advantages for motif discovery • the sequences sampled are actually those bound by the transcription factor of interest • direct observation of binding (not expression) • Limitations • The entire intergenic region is a candidate site • Binding affinity data not quantitative • Error Rates: • Sensitivity: 80% of bound are observed as such • Specificity: 20% of observed are actually bound
Conservation analysis • Disadvantages • Closely related sequences conserved for lack of divergence time • Distantly related species may evolve new regulatory factors and motifs • Motifs depend on • species chosen at right distances • orthologous regions correctly detected and correctly aligned • separating signal from noise S. cerevisiae S. paradoxus S. mikatii K.yarrowii • Alignment specificity • Blast hit without conservation: 1% • No hit despite conservation: 20% • Conservation specificity • Conservation without function: 60% • Function without conservation: 20%
Hit and Conservation specificity • Hit sensitivity • Coverage: 1X = 90% • Hits that can be trusted: 80% • Evolution specificity P(conserved|func) = 90% P(conserved|nonfunc)=60%
Reducing the noise (independence) Length of motif 1 species (S. cerevisiae) P(conserved) 2 species 60% apart P(conserved) 3 species at 60% P(location) Factor binding 6 bases long ACCGAT 1 every 6,000b 1,997 in S.c. 1 in 20 100 in S.c 1 in 400 5 in S.c 1 in 5 sites 1 in S.c 1 wildcard ACCGNT 1 every 1500b 7,572 in S.c. 1 in 13 582 in S.c 1 in 169 45 in S.c 1 in 5 9 in S.c +1 ambiguity AC[GC]GNT 1 every 900b 13,457 in S.c. 1 in 10 897 in S.c 1 in 100 135 in S.c 1 in 5 27 in S.c ambiguity or gap AC[GC{}]GNT 1 in 250 bases 47,842 in S.c. 1 in 4 11,960 in S.c 1 in 16 2,990 in S.c. 1 in 5 598 S.c.
Modeling the dependencies • Binding and Regulation • Regulation data depends on presence binding • Location data depends on binding but also other factors • Conservation data • Multiple species provide extra predictive power • However, species observations are not independent • Dependencies modeled with a phylogenetic tree • Binding and motif conservation • The conservation of a regulatory motif, and the binding of the factor specific to that motif are dependent on functionality of motif • Environmental factors • Binding may occur only in some conditions, not in others
Working with the network • Forward network • Estimate parameters for single models from experience • See how network behaves based on evidence collected • Training based on experience • Fix conditionals for which best estimates are known • Train model on sample data and estimate missing parameters • Exploring alternate topologies • Evaluate optimal P(data|topology) by iterating over parameter space and maximizing P(data|topology,parameters) • Choose topology that best fits data within dimensionality • Feature selection • Based on edge weights in optimal parameter settings, evaluate features according to cost and added information content
What have we learned? • Multiple species are useful • Information content depends on phylogenetic tree topology • Multiple pairwise alignments can add or retract certainty • Select species evolutionary distance based on added performance • The power of Bayesian Networks • Making our assumptions explicit, not everything is independent • Predicting regulatory motifs • Insufficient training data for this project. Only forward network • Pursue training of Bayes network as data becomes available • Future work • Method generalizable to gene prediction, RNA, other features • Integrate more data sources as they become available