1 / 30

Gene Regulation and Microarrays

Gene Regulation and Microarrays. Finding Regulatory Motifs. Given a collection of genes with common expression, Find the TF-binding motif in common. Characteristics of Regulatory Motifs. Tiny Highly Variable ~Constant Size Because a constant-size transcription factor binds

Download Presentation

Gene Regulation and Microarrays

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Gene Regulation and Microarrays

  2. Finding Regulatory Motifs Given a collection of genes with common expression, Find the TF-binding motif in common . . .

  3. Characteristics of Regulatory Motifs • Tiny • Highly Variable • ~Constant Size • Because a constant-size transcription factor binds • Often repeated • Low-complexity-ish

  4. Sequence Logos • Information at pos’n I, H(i) = – {letter x} freq(x, i) log2 freq(x, i) • Height of x at pos’n i, L(x, i) = freq(x, i) (2 – H(i)) • Examples: • freq(A, i) = 1; H(i) = 0; L(A, i) = 2 • A: ½; C: ¼; G: ¼; H(i) = 1.5; L(A, i) = ¼; L(not T, i) = ¼

  5. Probabilistic Motif: Mij; 1  i  W 1  j  4 Mij = Prob[ letter j, pos i ] Find best M, and positions p1,…, pN in sequences Combinatorial Motif M: m1…mW Some of the mi’s blank Find M that occurs in all si with  k differences Or, Find M with smallest total hamming dist Problem Definition Given a collection of promoter sequences s1,…, sN of genes with common expression

  6. Essentially a Multiple Local Alignment • Find “best” multiple local alignment • Alignment score defined differently in probabilistic/combinatorial cases . . .

  7. Algorithms • Combinatorial CONSENSUS, TEIRESIAS, SP-STAR, others • Probabilistic • Expectation Maximization: MEME • Gibbs Sampling: AlignACE, BioProspector

  8. Combinatorial Approaches to Motif Finding

  9. Discrete Formulations Given sequences S = {x1, …, xn} • A motif W is a consensus string w1…wK • Find motif W* with “best” match to x1, …, xn Definition of “best”: d(W, xi) = min hamming dist. between W and any word in xi d(W, S) = i d(W, xi)

  10. Approaches • Exhaustive Searches • CONSENSUS • MULTIPROFILER • TEIRESIAS, SP-STAR, WINNOWER

  11. Exhaustive Searches 1. Pattern-driven algorithm: For W = AA…A to TT…T (4K possibilities) Find d( W, S ) Report W* = argmin( d(W, S) ) Running time: O( K N 4K ) (where N = i |xi|) Advantage:Finds provably “best” motif W Disadvantage:Time

  12. Exhaustive Searches 2. Sample-driven algorithm: For W = any K-long word occurring in some xi Find d( W, S ) Report W* = argmin( d( W, S ) ) or, Report a local improvement of W* Running time: O( K N2 ) Advantage: Time Disadvantage: If the true motif is weak and does not occur in data then a random motif may score better than any instance of true motif

  13. CONSENSUS Algorithm: Cycle 1: For each word W in S (of fixed length!) For each word W’ in S Create alignment (gap free) of W, W’ Keep the C1 best alignments, A1, …, AC1 ACGGTTG , CGAACTT , GGGCTCT … ACGCCTG , AGAACTA , GGGGTGT …

  14. CONSENSUS Algorithm: Cycle t: For each word W in S For each alignment Aj from cycle t-1 Create alignment (gap free) of W, Aj Keep the Cl best alignments A1, …, ACt ACGGTTG , CGAACTT , GGGCTCT … ACGCCTG , AGAACTA , GGGGTGT … … … … ACGGCTC , AGATCTT , GGCGTCT …

  15. CONSENSUS • C1, …, Cn are user-defined heuristic constants • N is sum of sequence lengths • n is the number of sequences Running time: O(N2) + O(N C1) + O(N C2) + … + O(N Cn) = O( N2 + NCtotal) Where Ctotal = i Ci, typically O(nC), where C is a big constant

  16. MULTIPROFILER • Extended sample-driven approach Given a K-long word W, define: Nα(W) = words W’ in S s.t. d(W,W’)  α Idea: Assume W is occurrence of true motif W* Will use Nα(W) to correct “errors” in W

  17. MULTIPROFILER Assume W differs from true motif W* in at most L positions Define: A wordlet G of W is a L-long pattern with blanks, differing from W • L is smaller than the word length K Example: K = 7; L = 3 W = ACGTTGA G = --A--CG

  18. MULTIPROFILER Algorithm: For each W in S: For L = 1 to Lmax • Find the α-neighbors of W in S  Nα(W) • Find all “strong” L-long wordlets G in Na(W) • For each wordlet G, • Modify W by the wordlet G  W’ • Compute d(W’, S) Report W* = argmin d(W’, S) Step 1 above: Smaller motif-finding problem; Use exhaustive search

  19. Expectation Maximization in Motif Finding

  20. Expectation Maximization • The MM algorithm, part of MEME package uses Expectation Maximization Algorithm (sketch): • Given genomic sequences find all K-long words • Assume each word is motif or background • Find likeliest Motif Model Background Model classification of words into either Motif or Background

  21. Expectation Maximization • Given sequences x1, …, xN, • Find all k-long words X1,…, Xn • Define motif model: M = (M1,…, MK) Mi = (Mi1,…, Mi4) (assume {A, C, G, T}) where Mij = Prob[ letter j occurs in motif position i ] • Define background model: B = B1, …, B4 Bi = Prob[ letter j in background sequence ]

  22. Expectation Maximization • Define Zi1 = { 1, if Xi is motif; 0, otherwise } Zi2 = { 0, if Xi is motif; 1, otherwise } • Given a word Xi = x[1]…x[k], P[ Xi, Zi1=1 ] =  M1x[1]…Mkx[k] P[ Xi, Zi2=1 ] = (1 - ) Bx[1]…Bx[K] Let 1 = ; 2 = (1- )

  23. Expectation Maximization Define: Parameter space  = (M,B) 1: Motif; 2: Background Objective: Maximize log likelihood of model:

  24. Expectation Maximization • Maximize expected likelihood, in iteration of two steps: Expectation: Find expected value of log likelihood: Maximization: Maximize expected value over , 

  25. Expectation Maximization: E-step Expectation: Find expected value of log likelihood: where expected values of Z can be computed as follows:

  26. Expectation Maximization: M-step Maximization: Maximize expected value over  and  independently For , this is easy:

  27. Expectation Maximization: M-step • For  = (M, B), define cjk = E[ # times letter k appears in motif position j] c0k = E[ # times letter k appears in background] • cij values are calculated easily from E[Z] values It easily follows: to not allow any 0’s, add pseudocounts

  28. Initial Parameters Matter! Consider the following “artificial” example: x1, …, xN contain: • 212 patterns on {A, T}: A…A, A…AT,……, T… T • 212 patterns on {C, G}: C…C , C…CG,…… , G…G • D << 212 occurrences of 12-mer ACTGACTGACTG Some local maxima: •  ½; B = ½C, ½G; Mi = ½A, ½T, i = 1,…, 12 •  D/2k+1; B = ¼A,¼C,¼G,¼T; M1 = 100% A, M2= 100% C, M3 = 100% T, etc.

  29. Overview of EM Algorithm • Initialize parameters  = (M, B), : • Try different values of  from N-1/2 up to 1/(2K) • Repeat: • Expectation • Maximization • Until change in  = (M, B),  falls below  • Report results for several “good” 

  30. Overview of EM Algorithm • One iteration running time: O(NK) • Usually need < N iterations for convergence, and < N starting points. • Overall complexity: unclear – typically O(N2K) - O(N3K) • EM is a local optimization method • Initial parameters matter MEME: Bailey and Elkan, ISMB 1994.

More Related