Finding regulatory modules: A statistical approach
Mikhail Velikanov
Linnaeus Centre for Bioinformatics
Introduction
• Regulatory modules (RMs): sets of regulatory sites that work cooperatively
  • TF binding sites and promoter elements
  • Splicing enhancers and suppressors
• “Site clusters”
  • site A AND (site B OR site C) AND (NOT site D)
• “Beads on a spring”
  • site B is 20 ± 3 bp downstream of site A
  • Distance distributions have a short range and a well-defined peak
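The two module types above can be written as simple predicates. This is a minimal sketch, not the author's code: the function names are hypothetical, and measuring the distance between site start positions is an assumption (the slide does not say which coordinates are compared).

```python
def site_cluster(sites):
    """Boolean rule from the slide: A AND (B OR C) AND (NOT D).

    `sites` is the set of site identities found in one sequence.
    """
    return "A" in sites and ("B" in sites or "C" in sites) and "D" not in sites


def beads_on_spring(pos_a, pos_b, expected=20, tolerance=3):
    """True if site B lies expected +/- tolerance bp downstream of site A.

    Positions are start coordinates (an assumption of this sketch).
    """
    return abs((pos_b - pos_a) - expected) <= tolerance


if __name__ == "__main__":
    print(site_cluster({"A", "C"}))      # cluster rule satisfied
    print(beads_on_spring(100, 122))     # 22 bp downstream, within 20 +/- 3
```

A site cluster only constrains which sites are present, while the spacing rule constrains where they are; the method in this talk targets the second kind.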
Searching for RMs: Setup of the Problem
[Figure: motifs and the sequence annotations they produce]
• Sequence length constant and small (~0.5 kb)
• Number of sites ~20
• No overlapping sites
• Sites characterized by:
  • Identity
  • P-values ($\le p_t$)
    • shown by width in the figure
Look for annotation patterns that occur consistently in all or some of the sequences.
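One possible in-memory form of such an annotation is sketched below. The `Site` class, its field names, and the half-open coordinate convention are assumptions made for illustration; the slides only state that each site has an identity and a p-value no larger than the threshold, and that sites do not overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Site:
    motif: str       # site identity (which motif matched)
    start: int       # 0-based start, inclusive (assumed convention)
    end: int         # end, exclusive (assumed convention)
    p_value: float   # p-value of the motif match


def is_valid_annotation(sites: List[Site], p_threshold: float) -> bool:
    """Check the setup constraints: p-values <= threshold, no overlapping sites."""
    ordered = sorted(sites, key=lambda s: s.start)
    if any(s.p_value > p_threshold for s in ordered):
        return False
    # After sorting by start, sites overlap only if one ends after the next starts.
    return all(a.end <= b.start for a, b in zip(ordered, ordered[1:]))
```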
RMs as Annotation Alignments
• Align sites by identity
• Find sequences of 2 or more sites shared across a number of annotations (common annotations)
Conditions:
1. Distances between sites are similar
2. P-values of aligned sites are similar
3. P-values of aligned sites are small
Need a function that measures how well conditions (1-3) are satisfied (strength of common annotation).
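As an illustration of aligning sites by identity, the sketch below finds pairs of sites that two annotations share at a similar distance. It is deliberately simplified: it handles only 2-site common annotations, ignores p-values, and uses a fixed distance tolerance `tol`, which is a hypothetical parameter rather than part of the method described here.

```python
from itertools import combinations


def site_pairs(annotation):
    """Yield (motif_a, motif_b, distance) for every ordered pair of sites.

    `annotation` is a list of (motif, position) tuples.
    """
    ordered = sorted(annotation, key=lambda site: site[1])
    for (m1, p1), (m2, p2) in combinations(ordered, 2):
        yield m1, m2, p2 - p1


def common_pairs(ann1, ann2, tol=3):
    """Return site pairs with the same identities and similar distances."""
    shared = []
    for m1, m2, d1 in site_pairs(ann1):
        for n1, n2, d2 in site_pairs(ann2):
            if (m1, m2) == (n1, n2) and abs(d1 - d2) <= tol:
                shared.append((m1, m2, d1, d2))
    return shared


if __name__ == "__main__":
    ann1 = [("m0", 100), ("m1", 122), ("m2", 300)]
    ann2 = [("m0", 40), ("m1", 61), ("m2", 400)]
    print(common_pairs(ann1, ann2))
```

In the example, only the pair m0, m1 survives, because the distances to m2 differ by far more than the tolerance.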
Strength of common annotation: site p-values
• Assume a common annotation of $S$ sites supported by $N$ sequences
• For the $i$-th site, let $p_i^{\min}$ and $p_i^{\max}$ be the smallest and the largest of the $N$ p-values
  • $p_i^{\max}$: measure of how small the p-values are
  • $R_i = p_i^{\max} / p_i^{\min}$: measure of similarity
• $\pi_i$: probability of observing p-values as similar and as small in $N$ random annotations
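The slide does not give a formula for $\pi_i$, so the following is only one plausible reading. Assume the $N$ random p-values are independent and uniform on (0, 1), ignoring the threshold, and read "as similar and as small" as the event that the largest value is at most $p_i^{\max}$ and the ratio of largest to smallest is at most $R_i$. Under those assumptions the probability has the closed form $(p_i^{\max})^N (1 - 1/R_i)^{N-1}$, which the sketch checks against a simulation.

```python
import random


def pi_closed_form(p_max, ratio, n):
    """P(max <= p_max and max/min <= ratio) for n iid Uniform(0,1) p-values.

    Assumes ratio >= 1. This null model is an assumption of the sketch,
    not a formula taken from the slides.
    """
    return p_max ** n * (1.0 - 1.0 / ratio) ** (n - 1)


def pi_monte_carlo(p_max, ratio, n, trials=100_000, seed=0):
    """Estimate the same probability by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ps = [rng.random() for _ in range(n)]
        hi, lo = max(ps), min(ps)
        if hi <= p_max and hi <= ratio * lo:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(pi_closed_form(0.5, 2.0, 2))   # 0.5**2 * 0.5**1 = 0.125
    print(pi_monte_carlo(0.5, 2.0, 2))   # should be close to the value above
```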
Strength of common annotation: distances between sites
• Account for no overlaps between sites
  • renormalization of $\pi_i$ for each site, giving $\tilde{\pi}_i$
  • $\tilde{\pi}_0 = 1 - \sum_{i=1}^{S} \tilde{\pi}_i$: positions between sites
• Approximately compute the probability of the common annotation, $P_{CA}$, as a function of $\tilde{\pi}_0, \tilde{\pi}_1, \ldots, \tilde{\pi}_S$
• Strength of common annotation: $Z = -\ln P_{CA}$
Searching for the strongest common annotations
• Given an input set of annotations, define groups of annotations such that
  • each group has at least one common annotation
  • the strongest common annotation of each group is distinct
• NB: Groups may fully or partially overlap!
Cannot use standard clustering algorithms.
Classification Algorithm
• Find pairs of annotations with at least one common annotation
• Each pair is a nucleus of a potential group
• Each group grows by adding annotations one at a time
  • the group retains its strongest common annotation at each step
  • each addition maximizes the group strength
  • an annotation added to one group remains available for addition to other groups
Where does the growth stop?
(strength = group strength, $Z_g$)
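The growth procedure can be sketched on a toy model. Everything specific here is an assumption made to keep the example short: an annotation is a mapping from site identity to p-value (positions are ignored), the common annotation of a group is the set of identities shared by all members, the `strength` function is a stand-in for $Z_g$, and growth stops when no addition increases that toy strength.

```python
import math
from itertools import combinations


def common_sites(group, annotations):
    """Site identities shared by every annotation in the group."""
    members = [set(annotations[name]) for name in group]
    return frozenset(set.intersection(*members))


def strength(group, sites, annotations):
    """Toy stand-in for Z_g: sum over shared sites of -N * ln(largest p-value)."""
    n = len(group)
    return sum(-n * math.log(max(annotations[name][s] for name in group))
               for s in sites)


def grow_groups(annotations, min_sites=2):
    """Grow one group per distinct common annotation, starting from pairs."""
    groups = {}
    for a, b in combinations(sorted(annotations), 2):
        core = common_sites((a, b), annotations)
        # Skip pairs without a common annotation, and prune nuclei whose
        # common annotation was already formed while growing another group.
        if len(core) < min_sites or core in groups:
            continue
        group = [a, b]
        current = strength(group, core, annotations)
        while True:
            # Candidates must contain the core so the group retains it.
            candidates = [c for c in annotations
                          if c not in group and core <= set(annotations[c])]
            if not candidates:
                break
            best = max(candidates,
                       key=lambda c: strength(group + [c], core, annotations))
            new = strength(group + [best], core, annotations)
            if new <= current:   # toy stopping rule (assumption)
                break
            group.append(best)
            current = new
        groups[core] = frozenset(group)
    return groups


example = {
    "seq1": {"m0": 1e-4, "m1": 2e-4},
    "seq2": {"m0": 2e-4, "m1": 1e-4},
    "seq3": {"m0": 3e-4, "m1": 2e-4, "m2": 1e-4, "m3": 2e-4},
    "seq4": {"m2": 2e-4, "m3": 1e-4},
    "seq5": {"m2": 1e-4, "m3": 3e-4},
}

if __name__ == "__main__":
    for core, members in grow_groups(example).items():
        print(sorted(core), sorted(members))
```

In this example `seq3` ends up in both groups, which illustrates why an annotation must stay available after being added to one group and why standard partitioning clustering does not fit.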
Stopping criterion
• No more annotations can be added
  • group contains all annotations in the input set
  • change in the strongest common annotation
• Formed during growth of another group
  • ignore current group (“pruning”)
• Group strength is too small
  • adding an unrelated annotation
  • group strength $Z_g$ is a score ($Z_g > 0$)
  • can be computed for groups of random annotations
  • by the extremal types theorem: $\lim_{n \to \infty} \mathrm{Prob}(Z_g^{\mathrm{rand}} > Z_g) = 1 - \exp[-(Z_g/b)^{-a}]$
  • threshold on $Z_g$
  • numerical calibration of $a$, $b$ for all possible $N$, $S$
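The limiting tail formula above can be turned into a threshold on $Z_g$ by inverting it for a chosen significance level. The sketch below does this; the parameter values used in the example are placeholders, since the calibrated $a$ and $b$ are not given in the slides, and `alpha` is a hypothetical significance level.

```python
import math


def tail_probability(z, a, b):
    """Prob(Z_rand > z) = 1 - exp(-(z/b)**(-a)), for z > 0."""
    # -expm1(-x) equals 1 - exp(-x) but is more accurate for small x.
    return -math.expm1(-((z / b) ** (-a)))


def strength_threshold(alpha, a, b):
    """Value of z whose tail probability equals alpha (0 < alpha < 1)."""
    # Solve 1 - exp(-(z/b)**(-a)) = alpha for z.
    return b * (-math.log1p(-alpha)) ** (-1.0 / a)


if __name__ == "__main__":
    a, b = 3.0, 10.0                      # placeholder values, not calibrated
    z_star = strength_threshold(0.01, a, b)
    print(z_star, tail_probability(z_star, a, b))
```

A group whose strength exceeds `z_star` would be unlikely, at level `alpha`, to arise from random annotations under this model.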
From annotation groups to RMs
• Need a way to:
  • account for optional sites
  • search for homologous RMs
RMs as generalized HMMs
• Generalized (duration) HMMs (gHMMs or dHMMs) consist of 2 types of states
  • motif states (PSSMs)
    • annotation sites
  • spacer states (distance distributions)
    • gaps between sites
• States are connected according to a certain topology
• Transition probabilities depend only on the present state
• Common annotations of groups are simple gHMMs
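A linear chain of motif and spacer states, the simplest gHMM of this kind, might be represented as below. The class names are hypothetical, the spacer distribution is a plain lookup table, and the score covers only the gap lengths; PSSM scores of the motif states are left out of this sketch.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SpacerState:
    length_probs: Dict[int, float]   # gap length in bp -> probability

    def log_prob(self, gap: int) -> float:
        p = self.length_probs.get(gap, 0.0)
        return math.log(p) if p > 0 else float("-inf")


@dataclass
class LinearModule:
    motifs: List[str]             # motif states, in order
    spacers: List[SpacerState]    # spacers[i] sits between motifs[i] and motifs[i+1]

    def gap_log_likelihood(self, sites: List[Tuple[str, int, int]]) -> float:
        """Log-probability of the gaps in `sites`, given as (motif, start, end)
        tuples sorted by start. Returns -inf if the motif order does not match."""
        if [motif for motif, _, _ in sites] != self.motifs:
            return float("-inf")
        total = 0.0
        for (_, _, end_a), (_, start_b, _), spacer in zip(sites, sites[1:], self.spacers):
            total += spacer.log_prob(start_b - end_a)
        return total


if __name__ == "__main__":
    spacer = SpacerState({gap: 1 / 6 for gap in range(20, 26)})  # uniform on 20..25 bp
    module = LinearModule(["m0", "m1"], [spacer])
    print(module.gap_log_likelihood([("m0", 100, 110), ("m1", 132, 140)]))
```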
RMs as generalized HMMs
[Figure: gHMM state diagrams with states S, S0, S1, S2, S3, E]
• Common annotations define gHMM states
• Overlaps define topology and provide estimates of transition probabilities
• Multiple matches to the model
Can make a single model because of the overlap!
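One simple way to obtain transition probability estimates from overlapping common annotations is to count state-to-state transitions along each supported path and normalize the counts. The sketch below assumes this counting estimator; the slides do not specify the estimator, and the example paths are invented, reusing only the state names from the figure.

```python
from collections import Counter, defaultdict


def estimate_transitions(paths):
    """Estimate P(next state | current state) from observed state paths."""
    counts = defaultdict(Counter)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(nexts.values()) for dst, n in nexts.items()}
        for src, nexts in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical paths through the states S, S0..S3, E.
    paths = [
        ["S", "S0", "S3", "E"],
        ["S", "S0", "S3", "E"],
        ["S", "S1", "S2", "E"],
    ]
    for state, nexts in estimate_transitions(paths).items():
        print(state, nexts)
```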
From annotations to RMs
[Figure: annotations and the resulting RMs, drawn as a gHMM with states S, S0, S1, S2, S3, E]
Testing the Method: Test 1
• 25 random DNA sequences, 20 are “seeded” with an RM
  • 2 sites with low p-value ($< 10^{-3}$) separated by 20–25 bp
• Scan sequences with an unrelated motif subject to a p-value threshold
  • 3rd site (random noise in annotations)
[Figure: motifs m0, m1]
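A test set of this shape could be generated as follows. This is a sketch under several assumptions: the two site strings are arbitrary placeholders rather than the motifs used in the talk, the sequence length of 500 bp is taken from the setup slide, the 20–25 bp separation is measured from the end of the first site to the start of the second, and sites are planted as exact strings instead of being drawn from a PSSM.

```python
import random

SITE_A = "TGTGAGCGGATA"   # placeholder site, not from the talk
SITE_B = "TTGACAGCTAGC"   # placeholder site, not from the talk


def random_dna(length, rng):
    """Uniform random DNA sequence of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def seed_module(seq, site_a, site_b, rng, gap_range=(20, 25)):
    """Overwrite part of `seq` with site_a, a random gap, then site_b."""
    gap = rng.randint(*gap_range)                      # inclusive bounds
    span = len(site_a) + gap + len(site_b)
    a_start = rng.randint(0, len(seq) - span)
    b_start = a_start + len(site_a) + gap
    seeded = (seq[:a_start] + site_a
              + seq[a_start + len(site_a):b_start] + site_b
              + seq[b_start + len(site_b):])
    return seeded, a_start, b_start


def make_test_set(n_total=25, n_seeded=20, length=500, seed=0):
    """Return a list of (sequence, placement); placement is None if unseeded."""
    rng = random.Random(seed)
    dataset = []
    for i in range(n_total):
        seq = random_dna(length, rng)
        placement = None
        if i < n_seeded:
            seq, a_start, b_start = seed_module(seq, SITE_A, SITE_B, rng)
            placement = (a_start, b_start)
        dataset.append((seq, placement))
    return dataset
```

Scanning these sequences with a motif scanner and a p-value threshold would then produce the annotations, including the spurious third site, that serve as input to the method.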
Testing the Method: Test 2
• 25 random DNA sequences, 2 non-overlapping groups of 10 and 11 sequences
  • each group is “seeded” with a distinct RM (2 sites)
  • distance between sites is 20–25 bp or 52–55 bp
• Extra site added as before
[Figure: motifs m0, m1, m3, m4]
Testing the Method: Test 3
• 25 random DNA sequences, 2 overlapping groups of 12 and 14 sequences
  • same RMs as in previous test
  • groups overlap by 5 sequences
• Extra site added as before
[Figure: motifs m0, m1, m3, m4]
Summary
• A method for discovery of regulatory modules given a set of annotated sequences
• Builds RMs from recurrent annotation patterns
• Treats site p-values and distances in a consistent statistical framework
• Can use prior information on RMs (Bayesian approach)
• RMs are output as gHMMs
  • flexibility of RM structure (topology)
  • searching for homologous RMs
Future developments
• Testing the method on real data
  • upstream regions of bacterial operons
  • bacterial Fe-regulons
  • other benchmark sets?
• Algorithm improvements
  • better stopping criterion (use properties of distance distributions)
  • more precise computation of common annotation strength
  • better similarity measure for site p-values (reduce compensation)
Acknowledgements
Thanks to David Ardell (LCB, Uppsala) and Georgiy Sofronov (Univ. of Queensland, Brisbane) for many fruitful discussions.