A Statistical Method for Finding Transcription Factor Binding Sites
Authors: Saurabh Sinha and Martin Tompa
Presenter: Christopher Schlosberg, CS598ss
Difficulties of Motif Finding
• Regulatory sequences do not share the orientation of the coding sequence, or of one another
• Each regulated gene may have multiple binding sites
• The binding sites of a single factor vary widely, and that variation is not well understood
Previous & Proposed Methods for Finding Motifs
• Previous methods:
  • Find longer, general motifs
  • Use local search algorithms (Gibbs sampling, expectation maximization, greedy algorithms)
• Proposed method:
  • A transcription factor binding site (TFBS) is short enough for enumerative methods
  • Enumerative statistical methods guarantee global optimality and remain affordable at these motif lengths
Proposed Method Highlights
• Allows variation among the binding site instances of a given transcription factor
• Allows motifs to include “spacers”
• Allows overlapping occurrences (in both orientations), which leads to complex dependencies
• Bases the statistical significance of a motif s on the frequencies of shorter (more frequent) oligonucleotides
• Uses a Markov chain to model the background genomic distribution
• Uses a z-score to measure statistical significance
• Allows multiple binding sites
Characteristics of a Motif
• Any single TFBS shows significant variation
• Many motifs have spacers of 1-11 bp
• Variation is more often a transition (e.g. purine → purine) than a transversion (e.g. pyrimidine → purine)
• Variation occurs less often between a pair of complementary bases
• Indels are uncommon
Proposed Motif Definition
• A motif is a string over the alphabet Σ = {A, C, G, T, R, Y, S, W, N}
  • A, C, G, T (DNA bases), R (purine: A or G), Y (pyrimidine: C or T), S (strong: C or G), W (weak: A or T), N (spacer: any base)
• The transcription factor database SCPD supports this model of variation
  • Of 50 binding site consensus sequences, 31 fit exactly (62%)
  • Another 10 fit if slight variations are allowed
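The motif alphabet can be illustrated with a minimal Python sketch. The helper names `expand_motif` and `matches` are hypothetical (not from the paper); the code table follows the standard meanings of R, Y, S, W, and N.

```python
from itertools import product

# Bases allowed by each character of the motif alphabet.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong
    "W": "AT",    # weak
    "N": "ACGT",  # spacer: any base
}

def expand_motif(motif):
    """Return every concrete DNA string the motif stands for (hypothetical helper)."""
    return ["".join(p) for p in product(*(CODES[ch] for ch in motif))]

def matches(motif, site):
    """True if `site` is one instantiation of `motif` (same length, each base allowed)."""
    return len(motif) == len(site) and all(
        base in CODES[ch] for ch, base in zip(motif, site)
    )
```

For example, `expand_motif("AY")` yields `["AC", "AT"]`, and a motif with one spacer N expands to four times as many strings as the same motif without it.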
Measure of Statistical Significance
• Given a set of coregulated S. cerevisiae genes, the input is the corresponding set of 800 bp upstream sequences, each with its 3’ end at the gene’s translation start site
• The model must measure, from the input sequences:
  • The absolute number of occurrences N_s of motif s
  • The background genomic distribution
• X is a set of random DNA sequences, equal in number and lengths to the input sequences
  • Generated by a Markov chain of order m
  • Transition probabilities are determined by the (m+1)-mer frequencies in the full complement of 6000+ upstream sequences (each 800 bp long)
  • The background model uses m = 3
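As a rough illustration of how transition probabilities can be derived from (m+1)-mer frequencies, here is a minimal Python sketch. The function name `train_markov` is hypothetical, and the sketch assumes plain maximum-likelihood estimates with no pseudocounts and input containing only A, C, G, T; the paper's exact estimation procedure may differ.

```python
from collections import Counter, defaultdict

def train_markov(sequences, m=3):
    """Estimate P(next base | previous m bases) from (m+1)-mer counts.

    Returns a dict keyed by (m+1)-mer: the first m characters are the
    context and the last character is the base being emitted.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - m):
            counts[seq[i:i + m + 1]] += 1

    # Total count of each m-base context, used to normalize.
    context_totals = defaultdict(int)
    for kmer, n in counts.items():
        context_totals[kmer[:m]] += n

    return {kmer: n / context_totals[kmer[:m]] for kmer, n in counts.items()}
```

With `m=3`, each probability is conditioned on the three preceding bases, so the table has at most 4^4 = 256 entries.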
z-score
• X_s: random variable giving the number of occurrences of motif s in X
• E(X_s): expectation of X_s; σ(X_s): its standard deviation
• z_s: number of standard deviations by which the observed value N_s exceeds the expectation, z_s = (N_s − E(X_s)) / σ(X_s)
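The z-score definition can be sketched as a small function. The name `z_score` and the numbers in the example are made up for illustration and are not results from the paper.

```python
def z_score(observed, expected, std_dev):
    """Number of standard deviations by which `observed` exceeds `expected`."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (observed - expected) / std_dev

# Illustrative values only: 12 observed occurrences, 4 expected, SD of 2.
print(z_score(12, 4.0, 2.0))  # 4.0
```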
Implications
• Possibility of overlap of a motif with itself (in either orientation)
• Previous study of pattern autocorrelation
• Generalized computation of the standard deviation, treating a motif as a finite set of strings
• Higher-order Markov chains
• Spacers handled at no extra computational cost
• Handles motifs in either orientation
Algorithm
• Enumerate over each input sequence
• Tabulate the number N_s of occurrences of each motif s in either orientation
• Compute the expectation and standard deviation for each motif such that N_s > 0
• Calculate the z-score
• Rank motifs by z-score
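The last two steps (score, then rank) can be sketched as follows, assuming the counts N_s and each motif's expectation and standard deviation have already been computed. The function `rank_motifs` and its dictionary inputs are hypothetical, chosen only for illustration.

```python
def rank_motifs(counts, stats):
    """Rank motifs by z-score, highest first.

    counts: dict mapping motif -> observed count N_s
    stats:  dict mapping motif -> (expectation, standard deviation)
    Motifs with N_s == 0 are skipped, as in the steps above.
    """
    scored = []
    for motif, observed in counts.items():
        if observed == 0:
            continue
        expected, std_dev = stats[motif]
        scored.append((motif, (observed - expected) / std_dev))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Skipping motifs with no occurrences keeps the expensive expectation and standard deviation step limited to motifs actually seen in the input.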
Algorithm Analysis
• For a single motif, the complexity is O(c²k²)
  • k: number of nonspacer characters in the motif
  • c: number of instantiations of R, Y, S, W in the motif
• k takes only modest values
• Linear dependence on genome size
• The variance calculation can be trimmed as an optimization
Number of Occurrences
• Convert motif s into a multiset W of strings
• Add the reverse complement of each string in W
• Motif s occurs at a position in X iff some string in W occurs at that position
• X_s: total number of occurrences (in X) of the members of W
• Handling palindromes
• W_i: a member of W
• |W| = T
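A minimal Python sketch of this counting step follows. The helper names are hypothetical, and one choice is an assumption rather than the paper's stated rule: a palindromic string (one equal to its own reverse complement) appears twice in the multiset W but is counted once per position.

```python
from itertools import product

CODES = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "ACGT"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a concrete DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def motif_strings(motif):
    """Multiset W as a list: all instantiations plus their reverse complements."""
    forward = ["".join(p) for p in product(*(CODES[ch] for ch in motif))]
    return forward + [reverse_complement(w) for w in forward]

def count_occurrences(motif, sequence):
    """Count positions where some member of W occurs (either orientation).

    Assumption: duplicates in W, such as palindromes, count once per position.
    """
    words = set(motif_strings(motif))
    k = len(motif)
    return sum(sequence[j:j + k] in words for j in range(len(sequence) - k + 1))
```

For the motif `AY`, W is `AC, AT, GT, AT`; the string `AT` is its own reverse complement, which is why W is a multiset rather than a set.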
Expectation
• Linearity of expectation
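Written out under an assumed notation (the indicator X_{ij} is introduced here for illustration): let X_{ij} be 1 if the string W_i occurs at position j of X and 0 otherwise, with each distinct string counted once. Linearity of expectation then gives the sum below, and it holds even though overlapping occurrences make the indicators dependent.

```latex
X_s = \sum_{i=1}^{T} \sum_{j} X_{ij},
\qquad
E(X_s) = \sum_{i=1}^{T} \sum_{j} E(X_{ij})
       = \sum_{i=1}^{T} \sum_{j} \Pr(W_i \text{ occurs at position } j)
```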
Variance
• B term
• C term
C Term
• A term
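For orientation, the standard starting point for a variance of this kind is shown below. It assumes indicator variables X_{ij} (1 if string W_i occurs at position j of X, 0 otherwise), a notation introduced here for illustration. How the pairs in the double sum are grouped into the A, B, and C terms named above is not specified here.

```latex
\sigma^2(X_s) = E(X_s^2) - E(X_s)^2,
\qquad
E(X_s^2) = \sum_{i_1, i_2} \sum_{j_1, j_2} E\!\left(X_{i_1 j_1} X_{i_2 j_2}\right)
```

Unlike the expectation, the product terms do not separate when the two occurrences overlap, so each overlapping pair has to be handled explicitly.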
Overlapping Concatenation
• CW (like W) is potentially a multiset
• One-to-one correspondence
S_i1 S_i2 Term & Approximation
• Kleffe and Borodovsky (1992) approximation
Higher Order Markov Models
• Variance calculations remain the same except for the S_i1 S_i2 term
• Experiments use m = 3
Experimental Results & Future Considerations
• 17 coregulated sets of genes
  • Each with a known TF and a known binding site consensus
• In 9 experiments, the known consensus was among the 3 highest-scoring motifs
• Future topics:
  • Non-centered spacers
  • Enumeration loop optimization
  • Filtering repeats
Question
• E(X_s) is more straightforward to calculate than σ(X_s). Under the assumptions given in the paper, name one reason for this complication.