
Comp. Genomics



Presentation Transcript


  1. Comp. Genomics Recitation 11 SCFG

  2. Exercise A two-state HMM: state W1 stays in W1 with probability p and moves to W2 with probability 1-p; state W2 stays in W2 with probability q and moves to W1 with probability 1-q. The two states have different emission probabilities (e.g. different DNA compositions). Convert this HMM to an SCFG.

  3. Solution • W1 → aW1 | cW1 | … | aW2 | cW2 … | tW2 • W2 → aW2 | cW2 | … | aW1 | cW1 … | tW1 • p(W1 → aW1) = eW1(a)·p, p(W1 → aW2) = eW1(a)·(1-p)

  4. Solution • The remaining rules are analogous • The resulting grammar is regular, a special case of context-free
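The conversion above can be sketched in code. This is a minimal illustration, not part of the recitation: the function name and the emission-table encoding are hypothetical.

```python
# Sketch of the HMM-to-SCFG conversion above, assuming a two-state HMM
# with self-transition probabilities p (for W1) and q (for W2) and
# per-state emission tables e1, e2 over the DNA alphabet.
def hmm_to_scfg(p, q, e1, e2, alphabet="acgt"):
    """Return rule probabilities p(W -> x W') = e_W(x) * transition(W, W')."""
    rules = {}
    for x in alphabet:
        rules[("W1", x, "W1")] = e1[x] * p        # stay in W1
        rules[("W1", x, "W2")] = e1[x] * (1 - p)  # switch to W2
        rules[("W2", x, "W2")] = e2[x] * q        # stay in W2
        rules[("W2", x, "W1")] = e2[x] * (1 - q)  # switch to W1
    return rules

# With uniform emissions, the rules leaving W1 sum to 1
# (termination rules are ignored in this sketch).
uniform = {x: 0.25 for x in "acgt"}
rules = hmm_to_scfg(0.9, 0.8, uniform, uniform)
assert abs(sum(v for (l, _, _), v in rules.items() if l == "W1") - 1.0) < 1e-9
```

Each rule probability is the product of an emission and a transition, exactly as on the solution slide.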

  5. Exercise • Convert the production rule W → aWbW to Chomsky normal form. If the probability of the original production is p, show the probabilities for the productions in your normal-form version.

  6. Solution Old rule: W → aWbW • Chomsky normal form requires that all production rules are of the form Wv → Wy Wz or Wv → a • We define four new non-terminals: W1, W2, Wa, Wb • The new rules are: W → W1W2, W1 → WaW, W2 → WbW, Wa → a, Wb → b

  7. Solution • For every non-terminal, the sum of the probabilities of all its production rules must be 1 • Since the new non-terminals have only one rule each, their rules are assigned probability 1 • The rule W → W1W2 will therefore have probability p, the same as the rule that we eliminated

  8. Question 4, second exam sitting (Moed B), 5772 • Let x = x1x2…xn be an RNA string over the alphabet ACGU. • A two-dimensional folding of the string is a nested collection of disjoint pairs of indices between 1 and n; that is, if positions a,b are paired and positions c,d are paired, with a<b, c<d and a<c, then c<b<d is impossible. • A base may be paired with another base in the sequence; if it is unpaired it is called free. Pairing is possible between the bases A and U and between C and G. For a pair (i,j), we define the pair (i+1,j-1) as adjacent to it.

  9. Question 4, Moed B, 5772 (continued) • We define a simple energy model of folding as follows: • A free base makes no energetic contribution. • A paired base contributes (negative) energy only if its adjacent pair is also paired. • Describe as efficient a dynamic-programming algorithm as possible that finds a folding maximizing the number of paired pairs whose adjacent pair is also paired (i.e., a minimum-energy folding).

  10. Solution to Question 4, Moed B, 5772 • Define A(i,j) = a minimum-energy folding of the substring between i and j. • W(i,j) = a minimum-energy folding in which i and j are not paired with each other. • V(i,j) = a minimum-energy folding in which i and j are paired.

  11. Solution to Question 4, Moed B, 5772 (continued) • A(i,j) = max(W(i,j), V(i,j)) if xi and xj can be paired, W(i,j) otherwise • W(i,j) = max{i≤k<j} (A(i,k)+A(k+1,j)) • V(i,j) = max(1+V(i+1,j-1), W(i+1,j-1)) if xi+1 and xj-1 can be paired, W(i+1,j-1) otherwise • Base cases: A(i,i) = 0, A(i,i+1) = 0
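The recurrences above translate directly into a memoized implementation. A sketch, assuming the A-U and C-G pairing rules from the question; the function names are hypothetical.

```python
from functools import lru_cache

# Memoized implementation of the A/W/V recurrences above: maximize the
# number of paired pairs whose adjacent pair is also paired.
def min_energy_fold(x):
    n = len(x)
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

    def can_pair(i, j):
        return 0 <= i <= j < n and (x[i], x[j]) in pairs

    @lru_cache(maxsize=None)
    def A(i, j):
        if j - i < 1:              # base cases: A(i,i) = A(i,i+1) = 0
            return 0
        best = W(i, j)
        if can_pair(i, j):
            best = max(best, V(i, j))
        return best

    @lru_cache(maxsize=None)
    def W(i, j):                   # i and j not paired with each other
        return max(A(i, k) + A(k + 1, j) for k in range(i, j))

    @lru_cache(maxsize=None)
    def V(i, j):                   # i and j paired
        if j - i < 3:              # no room for an adjacent pair
            return 0
        best = W(i + 1, j - 1)
        if can_pair(i + 1, j - 1):
            best = max(best, 1 + V(i + 1, j - 1))
        return best

    return A(0, n - 1)
```

For example, `min_energy_fold("GGGAAACCC")` scores 2: three stacked G-C pairs, of which the outer two have a paired adjacent pair. The memoized recursion runs in O(n³) time, dominated by the split over k in W.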

  12. EM algorithm for SCFG • Start from an initial estimate of the rule probabilities. • Calculate expectations: E(X→YZ), E(X) • Update rule: Pt+1(X→YZ) = E(X→YZ)/E(X) • Repeat until convergence.

  13. Probability calculations given x, Θ • The probability that state v is used as the root in the derivation of xi,…,xj (with inside values α and outside values β): P(v roots xi…xj | x, Θ) = α(i,j,v)·β(i,j,v) / P(x|Θ) • The probability that the rule v → yz is used in deriving xi…xj with v as the root: (1/P(x|Θ)) · β(i,j,v) · Σ{i≤k<j} p(v→yz)·α(i,k,y)·α(k+1,j,z)

  14. Expectation calculation • The expected number of times state v is used in a derivation (inside α, outside β): E(v) = (1/P(x|Θ)) Σ{i≤j} α(i,j,v)·β(i,j,v) • The expected number of times the rule v → yz is used: E(v→yz) = (1/P(x|Θ)) Σ{i<j} Σ{i≤k<j} β(i,j,v)·p(v→yz)·α(i,k,y)·α(k+1,j,z)

  15. EM for SCFG • How do we compute the new probability for v → yz? • What about v → a?
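Once the expected counts are available, the update step from slide 12 is just a per-non-terminal normalization. A toy sketch; the rule encoding (left-hand side paired with a right-hand-side tuple) is hypothetical.

```python
# Normalize expected rule counts into new rule probabilities:
# p_{t+1}(v -> rhs) = E(v -> rhs) / E(v), where E(v) is taken as the sum
# of the expected counts of all rules with v on the left-hand side.
def em_update(expected_rule_counts):
    """expected_rule_counts: {(lhs, rhs): E(lhs -> rhs)} from inside-outside."""
    totals = {}
    for (lhs, _), count in expected_rule_counts.items():
        totals[lhs] = totals.get(lhs, 0.0) + count
    return {rule: count / totals[rule[0]]
            for rule, count in expected_rule_counts.items()}
```

The same normalization handles both v → yz and v → a rules, since both appear as entries keyed by their left-hand side.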

  16. Example • Suppose that our data contains the sentence "He hangs pictures without frames", generated with the parse tree T1: [S [N He] [V [V hangs] [N [N pictures] [P [P without] [N frames]]]]]

  17. Example The sentence was generated using the following production rules: S → NV with probability p(S→NV), V → VN …, N → NP …, P → PN, N → He, V → hangs, N → pictures, P → without, N → frames

  18. Example • The likelihood of this sentence is the product of the probabilities of the rules used: p(S→NV)·p(V→VN)·p(N→NP)·p(P→PN)·p(N→He)·p(V→hangs)·p(N→pictures)·p(P→without)·p(N→frames) We believe in our sentence! We start with some initial probabilities and want to improve the likelihood of the sentence using the EM algorithm

  19. Example • To make it more interesting, let's add another production rule: V → VNP. The sentence then has a second parse tree T2: [S [N He] [V [V hangs] [N pictures] [P [P without] [N frames]]]]

  20. Example • But now the grammar is no longer in Chomsky normal form • We turn it into Chomsky normal form as follows: • V → V N-P with p(V → V N-P) = p(V → VNP) • N-P → N P with p(N-P → N P) = 1.0

  21. Example • Compute inside probabilities
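The inside computation walked through on the next slides can be sketched as follows. This is a minimal illustration for a CNF grammar; the dictionary encodings of binary and lexical rules are hypothetical.

```python
# Inside algorithm for a CNF SCFG: binary_rules maps (v, y, z) to
# p(v -> y z); lexical maps (v, word) to p(v -> word).
def inside(words, nonterminals, binary_rules, lexical):
    n = len(words)
    a = {}  # a[(i, j, v)] = P(v derives words[i..j])
    for i, w in enumerate(words):            # length-1 spans (the diagonal)
        for v in nonterminals:
            a[(i, i, v)] = lexical.get((v, w), 0.0)
    for span in range(1, n):                 # widening spans
        for i in range(n - span):
            j = i + span
            for v in nonterminals:
                a[(i, j, v)] = sum(
                    p * a[(i, k, y)] * a[(k + 1, j, z)]
                    for (vv, y, z), p in binary_rules.items() if vv == v
                    for k in range(i, j))
    return a
```

The sentence likelihood is then `a[(0, n - 1, "S")]` for start symbol S, mirroring how the chart cells on the following slides are filled from the diagonal outward.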

  22. Example [Inside chart, first step: the diagonal cells are filled from the lexical rules — N over "He", V over "hangs", N over "pictures", P over "without", N over "frames"]

  23. Example [Inside chart, next step: length-2 cells — S over "He hangs", V over "hangs pictures", P over "without frames"]

  24. Example [Inside chart, completed: Box(1,3) accounts for substring 1-3 and Box(3,5) accounts for substring 3-5; the longer spans add S entries and an N,N-P entry]

  25. Example • Compute outside probabilities

  26. Example [Diagram: the two cases of the outside recursion — a parent X spanning k…j with children Y over k…i-1 and Z over i…j, or a parent X spanning i…k with children Z over i…j and Y over j+1…k]

  27. Example [Parse-tree illustration: computing the outside value of the N cell over "frames" from the rest of the tree for "He hangs pictures without frames"]

  28. Example Let's improve p(V→VN). The expected number of times it is used: E(V→VN) = (1/P(x|Θ)) Σ{i<j} Σ{i≤k<j} β(i,j,V)·p(V→VN)·α(i,k,V)·α(k+1,j,N)

  29. Example The expected number of times that V is visited: E(V) = (1/P(x|Θ)) Σ{i≤j} α(i,j,V)·β(i,j,V) This is actually the same as: E(V→VN) + E(V→V N-P) + E(V→hangs)

  30. Example • In order to get the new p(V→VN), we divide and get: pt+1(V→VN) = E(V→VN)/E(V) • Similarly, for p(V→hangs), we get: pt+1(V→hangs) = E(V→hangs)/E(V)

  31. The CYK algorithm • Initialization: for i=1…L, v=1…M: γ(i,i,v) = log ev(xi) • Iteration: for i=1…L-1, j=i+1…L, v=1…M: γ(i,j,v) = max{y,z} max{i≤k<j} [γ(i,k,y) + γ(k+1,j,z) + log p(v→yz)] • Termination: the score of the optimal parse tree π* for sentence x is log P(x, π*) = γ(1,L,S)
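A log-space sketch of CYK for a CNF grammar, using a dictionary encoding of binary and lexical rules (the encoding and names are hypothetical, for illustration only):

```python
import math

# CYK: same chart traversal as the inside algorithm, but with max over
# log probabilities instead of a sum over probabilities.
def cyk(words, nonterminals, binary_rules, lexical):
    n = len(words)
    g = {}  # g[(i, j, v)] = log score of best parse of words[i..j] rooted at v
    for i, w in enumerate(words):
        for v in nonterminals:
            p = lexical.get((v, w), 0.0)
            g[(i, i, v)] = math.log(p) if p > 0 else float("-inf")
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            for v in nonterminals:
                g[(i, j, v)] = max(
                    (math.log(p) + g[(i, k, y)] + g[(k + 1, j, z)]
                     for (vv, y, z), p in binary_rules.items() if vv == v
                     for k in range(i, j)),
                    default=float("-inf"))
    return g
```

The score of the optimal parse is `g[(0, n - 1, "S")]`; adding back-pointers at each max would recover the tree π* itself.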

  32. The CYK algorithm • Looks similar to the inside algorithm, but we take the maximum instead of summing (consider the forward algorithm vs. Viterbi)

  33. Summary (time complexity; M: SCFG symbols, Q: HMM states, L: data length) • Forward / inside: O(|Q|²L) for an HMM vs. O(|M|³L³) for an SCFG • Backward / outside: O(|Q|²L) vs. O(|M|³L³) • Viterbi / CYK: O(|Q|²L) vs. O(|M|³L³)

  34. Summary (memory complexity) • Forward / inside: O(|Q|L) for an HMM vs. O(|M|L²) for an SCFG • Backward / outside: O(|Q|L) vs. O(|M|L²) • Viterbi / CYK: O(|Q|L) vs. O(|M|L²)
