
Combining Location, Expression, Conservation in regulatory motif prediction

Manolis Kamvysselis, May 2001


Presentation Transcript


  1. Combining Location, Expression, Conservation in regulatory motif prediction. Manolis Kamvysselis, May 2001

  2. Motif Discovery: The problem

     | Length of motif | Occurrence by chance in a random seq | Number of chance occurrences in random 12Mb genome | Actual number of occurrences in S. cerevisiae |
     |---|---|---|---|
     | 6 bases long: ACCGAT | 1 every 4Kb | 3,000 | 1,997 |
     | 1 wildcard: ACCGNT | 1 every 1Kb | 12,000 | 7,572 |
     | +1 ambiguity: AC[GC]GNT | 1 in 500b | 24,000 | 13,457 |
     | ambiguity or gap: AC[GC{}]GNT | 1 in 150b | 70,000 | 47,842 |

     • Regulatory motifs are hard to recognize
       • Sequence is short: Likely to encounter false candidates often
       • Sequence can have gaps: Second order models are needed
       • Sequence itself can vary: Individual bases can have more than one alternative
       • Position of the sequence upstream of the transcription initiation is not fixed
       • Signal to noise: Motif-like sequences can occur randomly without function
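The chance-occurrence figures on this slide can be approximated with a short calculation. The sketch below assumes independent, uniformly distributed bases and counts a single strand only; both are simplifying assumptions, and the function names are illustrative. It handles plain bases, the `N` wildcard, and bracketed alternatives such as `[GC]`, but not the gap notation `{}` from the last row.

```python
import re

GENOME_SIZE = 12_000_000  # the 12Mb genome size quoted on the slide


def match_probability(pattern: str) -> float:
    """Probability that one random position matches `pattern`.

    Assumes independent, equally likely bases (an assumption, not a
    statement about real yeast sequence composition).
    """
    prob = 1.0
    # Tokens are a bracketed set such as [GC], or a single base or N.
    for token in re.findall(r"\[[ACGT]+\]|[ACGTN]", pattern):
        if token == "N":
            choices = 4                      # wildcard: any base matches
        elif token.startswith("["):
            choices = len(set(token[1:-1]))  # ambiguity: listed bases match
        else:
            choices = 1                      # exact base
        prob *= choices / 4
    return prob


def expected_hits(pattern: str, genome_size: int = GENOME_SIZE) -> float:
    """Expected number of chance matches in a random genome of this size."""
    return genome_size * match_probability(pattern)


for motif in ("ACCGAT", "ACCGNT", "AC[GC]GNT"):
    spacing = 1 / match_probability(motif)
    print(f"{motif}: 1 every {spacing:.0f} bases, "
          f"about {expected_hits(motif):.0f} chance hits")
```

Under these assumptions the three patterns give spacings of 4,096, 1,024 and 512 bases, in line with the rounded "1 every 4Kb", "1 every 1Kb" and "1 in 500b" entries in the table.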

  3. Feature Selection and Classification
     • Features for motif discovery
       • Sequence - does my sequence look like known motifs? (ScanACE)
       • Homology - is it conserved across multiple species?
       • Expression - is the gene downstream expressed under the expected conditions?
       • Regulation - classical genetics knowledge about gene's regulatory factors
       • Location - what transcription factors are found to bind in the region?
       • Occurrence - co-occurrence patterns and distance upstream
       • Clustering - motifs shared with genes in same pathway / functional group?
     • Classification
       • Probabilistic framework: each feature has P-value vs. null hypothesis
       • Classify each sequence fragment as either part of a motif or not
       • Feature selection can be made based on predictive power for each
       • Supervised training is possible for motifs in well studied regulatory systems
       • In other systems, make prediction based on all but one feature, evaluate feature
       • Independent features can be used to refine error model for individual features

  4. Motif Discovery: The methods
     • Expression Clustering
       • Cluster co-regulated genes according to expression patterns
       • Scan upstream regions for common motifs
     • Location analysis
       • Determine intergenic regions of transcription factor binding
       • Scan identified regions for common motifs
     • Conservation across multiple species
       • Align the genomes of related species
       • Find conserved sequences in intergenic regions
     • Historically: Promoter mapping (not used)
       • Knock out sections of the upstream region of the gene of interest
       • Identify regions which disrupt regulation as binding sites
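As a toy illustration of "scan upstream regions for common motifs", the sketch below reports exact k-mers that occur in a chosen fraction of a set of sequences. This is a deliberately simplified stand-in, not the method used in the talk: real motif finders, such as the Gibbs sampling approach mentioned on slide 6, tolerate mismatches and gaps. The function name and example sequences are made up for illustration.

```python
from collections import Counter


def shared_kmers(sequences, k=6, min_fraction=1.0):
    """Return k-mers present in at least `min_fraction` of the sequences.

    Exact matching only; each k-mer is counted once per sequence.
    """
    counts = Counter()
    for seq in sequences:
        # A set keeps each k-mer to one count per sequence.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    needed = min_fraction * len(sequences)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)


# Hypothetical upstream regions, each containing ACCGAT at a different offset.
upstream = ["TTACCGATGG", "GGGACCGATA", "ACCGATCCCC"]
print(shared_kmers(upstream, k=6))
```

With the three made-up sequences above, the only 6-mer shared by all of them is `ACCGAT`.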

  5. Cluster Patterns of Expression
     • The clusters depend on
       • normalization method
       • clustering algorithm, distance metric
       • algorithm parameters (threshold, #clusters, iterations)
     • Error rates:
       • Sensitivity: 90% of co-expressed genes are clustered
       • Specificity: 20% of clustered genes are co-expressed

  6. From Clusters to Motifs
     • Motifs found depend on
       • distance upstream that one chooses to consider
       • expected motif length, threshold for joining motif instances
       • Gibbs sampling algorithm initialization and convergence
       • assumption that the same transcription factor binds in all sequences considered

  7. Location Analysis
     • Advantages for motif discovery
       • the sequences sampled are actually those bound by the transcription factor of interest
       • direct observation of binding (not expression)
     • Limitations
       • The entire intergenic region is a candidate site
       • Binding affinity data not quantitative
     • Error Rates:
       • Sensitivity: 80% of bound are observed as such
       • Specificity: 20% of observed are actually bound
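To make the two error rates concrete, the sketch below converts them into counts. It reads "sensitivity" as the fraction of truly bound regions that are observed, and the slide's "specificity" as the fraction of observed regions that are actually bound (a quantity often called precision). The figure of 100 truly bound regions is an arbitrary assumption for illustration, and the function name is hypothetical.

```python
def observed_breakdown(n_truly_bound, sensitivity, fraction_observed_true):
    """Split the observed regions into true and false observations.

    sensitivity:            share of truly bound regions that are observed
    fraction_observed_true: share of observed regions that are truly bound
    """
    true_observed = n_truly_bound * sensitivity
    total_observed = true_observed / fraction_observed_true
    false_observed = total_observed - true_observed
    return true_observed, false_observed


# Assumed example: 100 truly bound regions, with the rates from the slide.
true_obs, false_obs = observed_breakdown(100, 0.8, 0.2)
print(f"true observations: {true_obs:.0f}, false observations: {false_obs:.0f}")
```

With these rates, every correctly observed region comes with four incorrectly observed ones, which is why location data alone leaves many false candidates.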

  8. Conservation analysis
     • Disadvantages
       • Closely related sequences conserved for lack of divergence time
       • Distantly related species may evolve new regulatory factors and motifs
     • Motifs depend on
       • species chosen at right distances
       • orthologous regions correctly detected and correctly aligned
       • separating signal from noise
     • Species shown: S. cerevisiae, S. paradoxus, S. mikatae, K. yarrowii
     • Alignment specificity
       • Blast hit without conservation: 1%
       • No hit despite conservation: 20%
     • Conservation specificity
       • Conservation without function: 60%
       • Function without conservation: 20%

  9. Hit and Conservation specificity
     • Hit sensitivity
       • Coverage: 1X = 90%
       • Hits that can be trusted: 80%
     • Evolution specificity
       • P(conserved|func) = 90%
       • P(conserved|nonfunc) = 60%
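The two conditional probabilities on this slide can be turned into a posterior with Bayes' rule. The sketch below is a minimal illustration: the prior probability that a site is functional is not given in the slides and must be supplied, and treating several species as independent observations is a simplifying assumption (slide 11 points out that species observations are not independent).

```python
def posterior_given_conservation(prior, n_species=1,
                                 p_cons_func=0.9, p_cons_nonfunc=0.6):
    """P(functional | conserved in n_species other species).

    prior:     assumed prior probability that the site is functional (0 < prior < 1)
    n_species: number of species showing conservation, treated as independent
               (an assumption made only to keep the example short)
    """
    prior_odds = prior / (1 - prior)
    # Each conserved species multiplies the odds by P(cons|func) / P(cons|nonfunc).
    likelihood_ratio = (p_cons_func / p_cons_nonfunc) ** n_species
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# With an assumed 50% prior, one conserved species raises the posterior to 60%.
print(posterior_given_conservation(0.5))
print(posterior_given_conservation(0.5, n_species=2))
```

Because the likelihood ratio is only 0.9 / 0.6 = 1.5, conservation in a single species shifts the odds modestly, which is the motivation for combining it with other evidence.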

  10. Reducing the noise (independence)

      | Length of motif | 1 species (S. cerevisiae) | P(conserved), 2 species 60% apart | P(conserved), 3 species at 60% | P(location), factor binding |
      |---|---|---|---|---|
      | 6 bases long: ACCGAT | 1 every 6,000b; 1,997 in S.c. | 1 in 20; 100 in S.c. | 1 in 400; 5 in S.c. | 1 in 5 sites; 1 in S.c. |
      | 1 wildcard: ACCGNT | 1 every 1500b; 7,572 in S.c. | 1 in 13; 582 in S.c. | 1 in 169; 45 in S.c. | 1 in 5; 9 in S.c. |
      | +1 ambiguity: AC[GC]GNT | 1 every 900b; 13,457 in S.c. | 1 in 10; 897 in S.c. | 1 in 100; 135 in S.c. | 1 in 5; 27 in S.c. |
      | ambiguity or gap: AC[GC{}]GNT | 1 in 250 bases; 47,842 in S.c. | 1 in 4; 11,960 in S.c. | 1 in 16; 2,990 in S.c. | 1 in 5; 598 in S.c. |
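The first row of this table can be reproduced by multiplying the observed count by each filter's pass rate, which is exactly what the independence assumption in the slide title allows. The sketch below is illustrative only; the function and parameter names are assumptions.

```python
def surviving_sites(observed_hits, p_conserved_per_species, extra_species, p_bound):
    """Expected candidates left after conservation and location filters.

    Assumes conservation in each extra species is independent, so the
    per-species rate is raised to the number of extra species, and that
    factor binding is independent of conservation.
    """
    p_conserved = p_conserved_per_species ** extra_species
    after_conservation = observed_hits * p_conserved
    after_location = after_conservation * p_bound
    return after_conservation, after_location


# ACCGAT row: 1,997 hits in S. cerevisiae, 1 in 20 conserved per extra species,
# 1 in 5 conserved sites bound by the factor.
two_species, _ = surviving_sites(1997, 1 / 20, 1, 1 / 5)
three_species, bound = surviving_sites(1997, 1 / 20, 2, 1 / 5)
print(round(two_species), round(three_species), round(bound))
```

For the ACCGAT row this gives roughly 100 candidates after one extra species, 5 after two, and 1 after the binding filter.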

  11. Modeling the dependencies
      • Binding and Regulation
        • Regulation data depends on the presence of binding
        • Location data depends on binding but also other factors
      • Conservation data
        • Multiple species provide extra predictive power
        • However, species observations are not independent
        • Dependencies modeled with a phylogenetic tree
      • Binding and motif conservation
        • The conservation of a regulatory motif and the binding of the factor specific to that motif both depend on the functionality of the motif
      • Environmental factors
        • Binding may occur only in some conditions, not in others

  12. Bayesian network topology

  13. Working with the network
      • Forward network
        • Estimate parameters for single models from experience
        • See how network behaves based on evidence collected
      • Training based on experience
        • Fix conditionals for which best estimates are known
        • Train model on sample data and estimate missing parameters
      • Exploring alternate topologies
        • Evaluate optimal P(data|topology) by iterating over parameter space and maximizing P(data|topology,parameters)
        • Choose topology that best fits data within dimensionality
      • Feature selection
        • Based on edge weights in optimal parameter settings, evaluate features according to cost and added information content
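A forward network with hand-set parameters can be explored by brute-force enumeration when it is small. The sketch below is a hypothetical four-node network, not the topology from slide 12 (which is not reproduced in this transcript): functionality influences conservation and binding, and binding influences the location observation. Only P(conserved|func) and P(conserved|nonfunc) from slide 9 and the 80% sensitivity from slide 7 come from the slides; every other number is an assumed placeholder.

```python
from itertools import product

P_FUNC = 0.01                        # assumed prior P(functional)
P_CONS = {True: 0.9, False: 0.6}     # P(conserved | functional), from slide 9
P_BOUND = {True: 0.8, False: 0.05}   # assumed P(bound | functional)
P_OBS = {True: 0.8, False: 0.1}      # P(observed | bound): 0.8 from slide 7, 0.1 assumed

NAMES = ("func", "bound", "conserved", "observed")


def bernoulli(p, value):
    """Probability of a boolean outcome given P(True) = p."""
    return p if value else 1 - p


def joint(func, bound, conserved, observed):
    """Joint probability under the assumed network factorization."""
    return (bernoulli(P_FUNC, func)
            * bernoulli(P_BOUND[func], bound)
            * bernoulli(P_CONS[func], conserved)
            * bernoulli(P_OBS[bound], observed))


def posterior_functional(**evidence):
    """P(func = True | evidence), summing the joint over unobserved nodes."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        assignment = dict(zip(NAMES, values))
        if any(assignment[name] != value for name, value in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(**assignment)
        denominator += p
        if assignment["func"]:
            numerator += p
    return numerator / denominator


print(posterior_functional())                                # prior only
print(posterior_functional(conserved=True))                  # conservation evidence
print(posterior_functional(conserved=True, observed=True))   # both kinds of evidence
```

Enumeration is exponential in the number of nodes, so it only suits toy networks like this one. Its appeal here is that each dependency assumption appears as an explicit table that can be swapped out.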

  14. What have we learned?
      • Multiple species are useful
        • Information content depends on phylogenetic tree topology
        • Multiple pairwise alignments can add or retract certainty
        • Select species evolutionary distance based on added performance
      • The power of Bayesian Networks
        • Making our assumptions explicit, not everything is independent
      • Predicting regulatory motifs
        • Insufficient training data for this project. Only forward network
        • Pursue training of Bayes network as data becomes available
      • Future work
        • Method generalizable to gene prediction, RNA, other features
        • Integrate more data sources as they become available
