Parallelizing HMM Decoding. Lab Meeting 10/19/04. Standard Iscan. CPoint algorithm. CPoint Advantages. Allow us to decode the trellis for an entire sequence of arbitrary length in a mathematically correct way.
Parallelizing HMM Decoding Lab Meeting 10/19/04
CPoint Advantages • Allows us to decode the trellis for an entire sequence of arbitrary length in a mathematically correct way. • Makes it possible to correctly identify annotated genes that cannot currently be predicted (genes which cross a fragment boundary).
CPoint Problems • Large sequences must be run on a single processor (1MB splits made running Twinscan on large sequences parallelizable). • Parameters have been hand-tuned to run on 1MB fragments and produce slightly worse results (as compared to RefSeqs).
Parallel HMM Decode Algorithm • Calculate the most probable path from each state at time Start to each state at time End. • If all paths pass through some trellis cell, we know the optimal path moves through this point. • If we search for and find multiple points like this throughout the sequence, then we can break the sequence at these points and find the best path between each pair of adjacent points on different processors.
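The last bullet can be sketched as follows. This is a minimal illustration, not the Twinscan code: the names `viterbi`, `one_hot_log` and `decode_in_segments` are hypothetical, scores are kept in log space, the split points are assumed to be already known cells on the optimal path, and a thread pool stands in for separate processors.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def viterbi(log_init, log_trans, log_emit, end_state=None):
    """Best state path over the rows of log_emit; optionally force the final state."""
    T, n = log_emit.shape
    scores = log_init + log_emit[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        # cand[a, b] = best score ending in a at t-1, then moving a -> b
        cand = scores[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        scores = cand.max(axis=0) + log_emit[t]
    state = int(scores.argmax()) if end_state is None else end_state
    path = [state]
    for t in range(T - 1, 0, -1):
        state = int(back[t, state])
        path.append(state)
    return path[::-1]

def one_hot_log(n, state):
    """Log-probability vector with probability 1 on a single state."""
    v = np.full(n, -np.inf)
    v[state] = 0.0
    return v

def decode_in_segments(log_init, log_trans, log_emit, points):
    """points: time-sorted (t, state) cells assumed to lie on the optimal path."""
    T, n = log_emit.shape
    bounds = [(0, None)] + list(points) + [(T - 1, None)]

    def segment(pair):
        (ta, sa), (tb, sb) = pair
        # A known left boundary replaces the start distribution with a one-hot one.
        init = log_init if sa is None else one_hot_log(n, sa)
        return viterbi(init, log_trans, log_emit[ta:tb + 1], end_state=sb)

    # Each segment depends only on its own slice, so the calls are independent.
    with ThreadPoolExecutor() as pool:
        pieces = list(pool.map(segment, zip(bounds, bounds[1:])))

    path = pieces[0]
    for piece in pieces[1:]:
        path += piece[1:]  # drop the boundary column shared with the previous piece
    return path
```

Because every split point lies on the optimal path, the best path between two adjacent points is the corresponding stretch of the global optimum (barring exact ties), so the stitched result should match a single full Viterbi pass.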
Naïve Implementation • Create n trellis structures (one for each state) • Initialize each trellis so that one state has probability 1 and all other states have probability 0. • Run Viterbi on each trellis simultaneously until all nodes at the current time in each trellis trace back to the same point.
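A minimal sketch of this naïve search, under stated assumptions: `find_coalescence` is a hypothetical name, scores are in log space, all transition probabilities are assumed positive, and the traceback is recomputed from scratch at every column (quadratic in sequence length, which is acceptable only for illustration).

```python
import numpy as np

def find_coalescence(log_trans, log_emit):
    """Run n one-hot trellises side by side.

    Returns (t, t_c, state): at column t, every traceback in every trellis
    passes through `state` at column t_c. Returns None if none is found.
    """
    n = log_trans.shape[0]
    T = log_emit.shape[0]
    # Row k is trellis k: probability 1 (log 0.0) on state k, 0 elsewhere.
    scores = np.full((n, n), -np.inf)
    np.fill_diagonal(scores, 0.0)
    scores = scores + log_emit[0]
    backs = [None]  # backs[t][k, b] = predecessor of state b at column t in trellis k
    rows = np.arange(n)[:, None]
    for t in range(1, T):
        # cand[k, a, b] = score of state a in trellis k, then moving a -> b
        cand = scores[:, :, None] + log_trans[None, :, :]
        backs.append(cand.argmax(axis=1))
        scores = cand.max(axis=1) + log_emit[t]
        # Trace back from every state of every trellis at column t.
        cur = np.tile(np.arange(n), (n, 1))
        # Stop before column 0: there trellis k always sits in state k.
        for u in range(t, 1, -1):
            cur = backs[u][rows, cur]  # states at column u - 1
            if (cur == cur.flat[0]).all():
                return t, u - 1, int(cur.flat[0])
    return None
```

The returned cell is the most recent column at which all n × n tracebacks agree; a production version would track the surviving ancestor set incrementally instead of re-walking the trellis at each step.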
Proof • Need to show that the tracebacks resulting from forcing probabilities to 1 at time Start cover all paths back from End regardless of Start probabilities. • If we introduce uncertainty at Start (change the probabilities) no new tracebacks are created (no tracebacks that aren’t in one of the n trellises). • Need to show that changing probabilities of one column to remove uncertainty won’t change the optimal path.
Step 2 Proof • Show that removing uncertainty in column 1 doesn’t change optimal path. • We will actually prove that for any set of initial probabilities any traceback which ends in state j will not change if we change the initial probabilities such that all states except for state j decrease or stay the same relative to state j.
Proof • Let p(i,j) be the probability at column i, state j in the trellis. • Let ∆(i,j) be the multiplicative change in probability at column i, state j in the trellis (the new probability is ∆(i,j)·p(i,j)). • Let t(i,j) be the transition probability from state i to state j. • First we increase (or do not change) the probability of state j in column i and decrease (or do not change) the probability of all other states: ∆(i,j) ≥ 1 and ∆(i,k) ≤ 1 for all k ≠ j.
Proof • Each state k in column i+1 which previously traced back to state j will still do so: ∆(i+1,k) = ∆(i,j) ≥ 1. • Any state k in column i+1 which traced back to some other state l has three options after the probability change in column i: • Continue to trace back to l • Change and trace back to j • Change and trace back to some new state m
Proof • Still trace back to l: ∆(i+1,k) = ∆(i,l) ≤ 1. • Change and trace back to j: originally p(i,j)t(j,k) ≤ p(i,l)t(l,k); after the change ∆(i,l)p(i,l)t(l,k) ≤ ∆(i,j)p(i,j)t(j,k) ≤ ∆(i,j)p(i,l)t(l,k); therefore ∆(i+1,k) ≤ ∆(i,j). • Change and trace back to some new state m: originally p(i,m)t(m,k) ≤ p(i,l)t(l,k); after the change ∆(i,l)p(i,l)t(l,k) ≤ ∆(i,m)p(i,m)t(m,k) ≤ p(i,m)t(m,k) ≤ p(i,l)t(l,k); therefore ∆(i+1,k) ≤ 1.
Proof • Every state k in column i+1 which traced back to state j in column i still traces back to j, and its probability is multiplied by ∆(i,j). • The probability of every other state has either decreased or increased by a factor of at most ∆(i,j). • If we divide ∆(i+1,*) by ∆(i,j) then for any state k in column i+1 on a path tracing back through (i,j): ∆(i+1,k) ≥ 1, and ∆(i+1,l) ≤ 1 for every other state l. • Column i+1 therefore satisfies the same condition as column i, so the argument repeats column by column.
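The claim can also be checked numerically. The sketch below is an illustration only, with hypothetical names (`all_tracebacks`, `lemma_holds`): it works in log space, so a multiplicative change ∆ becomes an additive `log_delta` that is ≥ 0 for state j and ≤ 0 for every other state, and it assumes no exact ties between competing paths.

```python
import numpy as np

def all_tracebacks(log_init, log_trans, log_emit):
    """Fill one trellis and return the traceback path from every End state."""
    T, n = log_emit.shape
    scores = log_init + log_emit[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        # cand[a, b] = best score ending in a at t-1, then moving a -> b
        cand = scores[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        scores = cand.max(axis=0) + log_emit[t]
    paths = []
    for state in range(n):
        path = [state]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        paths.append(path[::-1])  # path[0] is the state reached at Start
    return paths

def lemma_holds(log_init, log_delta, j, log_trans, log_emit):
    """True if every traceback that reaches state j at Start is unchanged
    after the Start column is shifted by log_delta."""
    before = all_tracebacks(log_init, log_trans, log_emit)
    after = all_tracebacks(log_init + log_delta, log_trans, log_emit)
    return all(b == a for b, a in zip(before, after) if b[0] == j)
```

Tracebacks that reached some other Start state are allowed to change; the lemma only protects the ones that already ended in j.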
Step 1 Proof • We need to show that if we trace paths back from each state at time End n times, each beginning with all probability in a single state at time Start, and all paths converge at some point p in the trellis, then any traceback from End will go through p regardless of the probability at Start. • All Viterbi paths from End to Start pass through p regardless of Start column probabilities.
Step 1 Proof • Set probabilities at time Start to anything and fill the trellis. • Traceback from any state j at time End will go to some state i at time Start. • The path starting at state i was covered when we started with p(1,i) = 1.
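A small numerical check of this coverage argument, again as a hedged sketch: `tracebacks` and `covered_by_one_hot` are hypothetical names, scores are in log space, all transitions are assumed positive, and exact ties are assumed not to occur.

```python
import numpy as np

def tracebacks(log_init, log_trans, log_emit):
    """Fill one trellis and return the traceback path from every End state."""
    T, n = log_emit.shape
    scores = log_init + log_emit[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        cand = scores[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        scores = cand.max(axis=0) + log_emit[t]
    paths = []
    for state in range(n):
        path = [state]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        paths.append(path[::-1])
    return paths

def covered_by_one_hot(log_init, log_trans, log_emit):
    """True if every traceback under log_init also appears in the trellis
    that starts with probability 1 on that traceback's Start state."""
    n = log_trans.shape[0]
    one_hot = []
    for i in range(n):
        init = np.full(n, -np.inf)
        init[i] = 0.0  # p(1, i) = 1, all other states 0
        one_hot.append(tracebacks(init, log_trans, log_emit))
    general = tracebacks(log_init, log_trans, log_emit)
    # general[j][0] is the Start state i reached from End state j.
    return all(path == one_hot[path[0]][j] for j, path in enumerate(general))
```

If this holds for every Start distribution, a cell shared by all tracebacks in all n one-hot trellises is also shared by every traceback of the real trellis.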
Coalescence Point Performance Issues • Run 1MB chunks and full chromosomes 1, 15, 20, 21, and 22 (Human NCBI34). • Full runs have runs of N longer than 1,000,000 removed