Online Viterbi Algorithm for Analysis of Long Biological Sequences. By Niloofar Hezarjaribi. Hidden Markov models (HMMs) are commonly used for the analysis of long genomic sequences. An HMM is a generative probabilistic model.
Online Viterbi Algorithm for Analysis of Long Biological Sequences By Niloofar Hezarjaribi
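Hidden Markov Model • Hidden Markov models (HMMs) are commonly used for the analysis of long genomic sequences • An HMM is a generative probabilistic model • The Viterbi algorithm, which runs in time linear in the sequence length, is the most commonly used decoding algorithm • Its space complexity is O(mn) for m states and a sequence of length n • This makes it unsuitable for long sequences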
Hidden Markov Model • An HMM is composed of states and transitions • It generates sequences over a given alphabet • Each state has emission probabilities and outgoing transition probabilities • An HMM defines a joint probability Pr(X,S) • X: the given sequence • S: a state path; decoding looks for the S that maximizes the joint probability
Hidden Markov Model • P(i,j): probability of the most probable path that generates the first i symbols and ends in state j • B(i,j): back pointer, the second-to-last state on that path • t_k(j): transition probability from state k to state j • e_j(X_i): emission probability of symbol X_i in state j • B(i,j) is the value of k that maximizes P(i-1,k) · t_k(j), and P(i,j) is that maximum multiplied by e_j(X_i) • After computing these values, recover the path by following the back pointers from right to left
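The tables P and B described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `viterbi` and the initial distribution `init` are assumptions (the slide does not say how the first column is initialised), and plain probabilities are used for readability.

```python
def viterbi(x, init, trans, emit):
    """Return the most probable state path for the observation sequence x.

    init[j]     : probability of starting in state j (assumed, not on the slide)
    trans[k][j] : t_k(j), transition probability from state k to state j
    emit[j][s]  : e_j(s), probability of emitting symbol s in state j
    """
    m, n = len(init), len(x)
    P = [[0.0] * m for _ in range(n)]  # P[i][j]: best path probability ending in j
    B = [[0] * m for _ in range(n)]    # B[i][j]: second-to-last state on that path
    for j in range(m):
        P[0][j] = init[j] * emit[j][x[0]]
    for i in range(1, n):
        for j in range(m):
            # Back pointer: the predecessor k that maximizes P(i-1,k) * t_k(j).
            k = max(range(m), key=lambda k: P[i - 1][k] * trans[k][j])
            B[i][j] = k
            P[i][j] = P[i - 1][k] * trans[k][j] * emit[j][x[i]]
    # Trace back from right to left along the back pointers.
    j = max(range(m), key=lambda j: P[n - 1][j])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = B[i][j]
        path.append(j)
    return path[::-1]
```

Both tables have n rows of m entries, which is the O(nm) space the following slides are concerned with. For long sequences a real implementation would work with log probabilities to avoid underflow.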
Hidden Markov Model • For an HMM with m states and a sequence X of length n: • Space complexity: O(nm) • Running time: O(nm²)
O(mn) space is impractical for long sequences
Hidden Markov Model • Example: 250 million symbols, 100 states • Memory: 25 GB • Completely impractical!
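The 25 GB figure can be reproduced with a one-line calculation. The sketch below assumes one byte per back pointer, which is enough for a state index below 256; the slide does not state the entry size.

```python
n = 250_000_000        # sequence length from the example
m = 100                # number of states from the example
bytes_per_pointer = 1  # assumption: one byte per back pointer B(i,j)

total_bytes = n * m * bytes_per_pointer
print(total_bytes / 1e9, "GB")  # prints 25.0 GB
```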
Split the sequence • Use of Checkpointing
Online Viterbi Algorithm • Space complexity: requires much less memory
Online Viterbi Algorithm • Represent the back pointer matrix B of the Viterbi algorithm as a tree • The parent of node (i,j) is (i-1, B(i,j)) • Eliminate the nodes that are not on any path ending in column i • The most probable path runs to the root from the leaf (n,j) that has the highest P(n,j)
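The pruning rule on this slide can be illustrated directly on a back pointer table. The function below is a naive sketch with a hypothetical name, `live_nodes`; it walks back from every leaf in column i and collects the nodes that survive, without any of the compression introduced later.

```python
def live_nodes(B, i):
    """Nodes (column, state) lying on some back-pointer path that ends in column i.

    B[c][j] is the back pointer of node (c, j), so its parent is (c - 1, B[c][j]).
    Column 0 has no parents, so B[0] is never read.
    """
    frontier = set(range(len(B[i])))   # every state is a leaf in column i
    live = {(i, j) for j in frontier}
    for c in range(i, 0, -1):
        frontier = {B[c][j] for j in frontier}   # parents of the current frontier
        live |= {(c - 1, k) for k in frontier}
    return live
```

Any node missing from the returned set cannot be part of the final path and can be discarded.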
Online Viterbi Algorithm • Paths are not necessarily edge-disjoint • Often all the paths share the same prefix up to some node, called the coalescence point • After processing D symbols, check whether a coalescence point has been reached • If not, one of the candidate paths has to be chosen heuristically
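A coalescence point can be found naively by tracing all paths back until they meet. The sketch below uses a hypothetical helper name and costs time proportional to the number of columns scanned; it is meant only to make the definition concrete, not to stand in for the compressed tree used by the algorithm.

```python
def coalescence_point(B, i):
    """Return (column, state) where all paths ending in column i merge, or None.

    B[c][j] is the back pointer of node (c, j).
    """
    frontier = set(range(len(B[i])))
    for c in range(i, 0, -1):
        frontier = {B[c][j] for j in frontier}
        if len(frontier) == 1:
            # All surviving paths pass through this single node.
            return (c - 1, next(iter(frontier)))
    return None
```

Everything up to the returned column is already decided, so the path prefix can be output and its back pointers freed.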
Online Viterbi Algorithm • How to find a coalescence point? • Maintain a compressed version of the back pointer tree • Each node stores the number of its children and a pointer to its parent node • Keep a linked list of all nodes of the compressed tree, ordered by sequence position • Keep a list of pointers to all of the leaves
Online Viterbi Algorithm • While processing the k-th sequence position: • First, create a new leaf for each state • Second, link it to its parent, one of the former leaves • Third, insert it into the linked list • Once the new leaves are created, eliminate all former leaves that have no children, and recursively all ancestors left without children • Finally, compress the tree
Online Viterbi Algorithm • How to compress the tree? • Examine all the nodes in order of decreasing sequence position • Delete every node that has zero or one child and is not a current leaf • For each leaf or node with at least two children, follow the parent links • Link the node to its first ancestor that has at least two children • The node that has no ancestor with at least two children is the coalescence point • Make it the new root • Output the path up to that point and remove it from memory
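The update and compression steps can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the names `Node` and `update` are hypothetical, a Python list stands in for the linked list, and a virtual start node at position -1 is assumed as the parent of all leaves in column 0 so that the structure is a single tree from the beginning.

```python
class Node:
    """Node (pos, state) of the compressed back pointer tree."""
    def __init__(self, pos, state, parent=None):
        self.pos, self.state, self.parent = pos, state, parent
        self.children = 0              # number of children in the current tree
        if parent is not None:
            parent.children += 1


def update(nodes, leaves, back, pos):
    """Process one sequence position.

    nodes  : all tree nodes, ordered by increasing position
    leaves : leaves[j] is the current leaf for state j
    back   : back[j] is the back pointer B(pos, j)
    Returns (nodes, leaves, root) for the updated, compressed tree.
    """
    # Create the new leaves, link each to its parent, append them to the list.
    new_leaves = [Node(pos, j, leaves[back[j]]) for j in range(len(back))]
    nodes = nodes + new_leaves
    # Remove former leaves without children, then ancestors left childless.
    dead = set()
    for node in leaves:
        while node is not None and node.children == 0:
            dead.add(node)
            if node.parent is not None:
                node.parent.children -= 1
            node = node.parent
    # Compress: keep only current leaves and nodes with at least two children.
    kept, root, current = [], None, set(new_leaves)
    for node in reversed(nodes):       # decreasing sequence position
        if node in dead or (node not in current and node.children < 2):
            continue
        anc = node.parent
        while anc is not None and anc.children < 2:
            anc = anc.parent           # skip ancestors that will be deleted
        node.parent = anc
        if anc is None:
            root = node                # no branching ancestor: coalescence point
        kept.append(node)
    kept.reverse()
    return kept, new_leaves, root


# Usage: three states, two update steps.
start = Node(-1, None)                           # assumed virtual start node
leaves = [Node(0, j, start) for j in range(3)]
nodes = [start] + leaves
nodes, leaves, root = update(nodes, leaves, [0, 1, 1], 1)
first_root = root                                # still the virtual start node
nodes, leaves, root = update(nodes, leaves, [1, 1, 2], 2)
```

After the second step all three paths pass through node (0, 1), which becomes the new root, and the tree holds five nodes: three leaves and two branching nodes. A full decoder would also keep the back pointer columns between the old and the new root long enough to output that segment of the path.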
Online Viterbi Algorithm • Running time of this update: O(m) per sequence position • Space for the compressed tree: O(m) • So the update does not increase the asymptotic running time • Measured overhead of the update is less than 5% • Worst-case space is still O(mn)
Online Viterbi Algorithm • Advantages of this algorithm: • Expected maximum space requirement of O(m log n), shown for two-state HMMs • The online Viterbi algorithm leads to a significant decrease in memory usage • It can output the initial segment of the most probable path before the whole sequence has been processed
Memory Requirements of Online Viterbi Algorithm • Symmetric two-state HMM: • Two symmetric states over a binary alphabet
Memory Requirements of Online Viterbi Algorithm • Assume t < ½ and e < ½ • The back pointers between two consecutive columns form one of four configurations, i to iv (figure not included)
Memory Requirements of Online Viterbi Algorithm • Configuration iv never occurs for t < ½ • A coalescence point occurs whenever configuration ii or iii occurs
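These two claims can be checked empirically with a short simulation. The sketch rests on assumptions that the slides do not spell out: t is taken to be the probability of switching state, e the probability that state j emits the symbol 1 - j, configurations ii and iii are taken to be the two cases where both back pointers point to the same state, and configuration iv is taken to be the crossing case. The function name and the descriptive labels are hypothetical.

```python
import random

def count_configurations(n, t, e, seed=0):
    """Count back-pointer configurations for a symmetric two-state HMM."""
    rng = random.Random(seed)
    trans = [[1 - t, t], [t, 1 - t]]
    emit = [[1 - e, e], [e, 1 - e]]
    # Generate a sequence of length n from the HMM itself.
    state, x = rng.randrange(2), []
    for _ in range(n):
        x.append(state if rng.random() >= e else 1 - state)
        if rng.random() < t:
            state = 1 - state
    names = {(0, 1): "parallel", (0, 0): "both_from_0",
             (1, 1): "both_from_1", (1, 0): "crossing"}
    counts = {name: 0 for name in names.values()}
    p = [0.5 * emit[j][x[0]] for j in range(2)]
    for i in range(1, n):
        back = [max(range(2), key=lambda k: p[k] * trans[k][j]) for j in range(2)]
        p = [p[back[j]] * trans[back[j]][j] * emit[j][x[i]] for j in range(2)]
        scale = max(p)
        p = [v / scale for v in p]     # rescale each column to avoid underflow
        counts[names[tuple(back)]] += 1
    return counts

counts = count_configurations(2000, t=0.1, e=0.2)
```

With t < ½ the crossing count stays at zero: a crossing would need both P(i-1,1)·t > P(i-1,0)·(1-t) and P(i-1,0)·t > P(i-1,1)·(1-t), and multiplying the two inequalities gives t² > (1-t)², which contradicts t < ½. Every step counted under `both_from_0` or `both_from_1` is a coalescence point.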
Memory Requirements of Online Viterbi Algorithm • For the symmetric two-state HMM, the expected maximum memory requirement is bounded by O(m log n)
Memory Requirements of Online Viterbi Algorithm • Multi-state HMM: • In a two-state HMM each new coalescence point clears the memory, but in a multi-state HMM a coalescence point can leave a tree of substantial size in memory • So the sizes of consecutive runs are not independent
Memory Requirements of Online Viterbi Algorithm • How to evaluate the memory requirements of a multi-state HMM: • Generalize the two-state model to multiple states • A symmetric HMM with m states emits symbols over an m-letter alphabet • Each state emits one symbol with higher probability than the others • All transitions are equiprobable except the self-transitions
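One way to build such a model is sketched below. The parametrisation is an assumption, since the slide gives no formulas: each state keeps its own state with probability 1 - t and spreads t evenly over the other m - 1 states, and state j emits symbol j with probability 1 - e and every other symbol with probability e / (m - 1). The function name is hypothetical.

```python
def symmetric_hmm(m, t, e):
    """Transition and emission matrices of a symmetric m-state HMM (assumed form)."""
    trans = [[1 - t if k == j else t / (m - 1) for j in range(m)]
             for k in range(m)]
    emit = [[1 - e if s == j else e / (m - 1) for s in range(m)]
            for j in range(m)]
    return trans, emit
```

For m = 2 this reduces to the two-state model with switching probability t and emission error e.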
Memory Requirements of Online Viterbi Algorithm • The algorithm has been tested for m ≤ 6 on sequences generated by the HMM • The data are consistent with logarithmic growth of the average maximum memory needed
Conclusion • The algorithm is based on efficient detection of coalescence points in trees • It requires a variable amount of space that depends on the HMM and on the local properties of the analyzed sequence • Experiments on both simulated and real data suggest that the asymptotic bound Θ(m log n) extends to multi-state HMMs, and that for most of the execution the algorithm uses much less memory • The algorithm can be used for online processing of streamed sequences
Use of Online Viterbi Algorithm in My Research • Using the Viterbi algorithm for dynamic voltage and frequency scaling (DVFS) • Assign a low frequency to less busy windows and a high frequency to busy windows
Use of Online Viterbi Algorithm in My Research
dynamicEnergy = transitionEnergy + 0.5*C*pow(V, 2)*t;
dynamicTime = activeCycles/VF[i][1] + switchLatency*abs(5 - i);
profileEnergy = 0.5*capacity*pow(VF[5][0], 2)*activeCycles;
profileTime = activeCycles/VF[5][1];
normalizedEnergy = dynamicEnergy / profileEnergy;
normalizedTime = dynamicTime / profileTime;
myCost = (1-alpha)*normalizedEnergy + (alpha)*normalizedTime;
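The cost expression above can be wrapped into a small function to see how alpha trades energy against time. Everything beyond the formulas on the slide is an assumption made for illustration: the `VF` table values, the default constants, the reading of `V` as `VF[i][0]`, of `C` as `capacity`, and of `t` as `activeCycles`, and the interpretation of level 5 as the profiling baseline.

```python
# Hypothetical (voltage [V], frequency [Hz]) pairs; level 5 is the assumed baseline.
VF = [(0.8, 0.6e9), (0.9, 0.8e9), (1.0, 1.0e9),
      (1.1, 1.2e9), (1.2, 1.4e9), (1.3, 1.6e9)]
BASE = 5

def window_cost(i, active_cycles, alpha, capacity=1e-9,
                transition_energy=0.0, switch_latency=1e-5):
    """Weighted, normalized energy/time cost of running one window at level i."""
    v, f = VF[i]
    dynamic_energy = transition_energy + 0.5 * capacity * v ** 2 * active_cycles
    dynamic_time = active_cycles / f + switch_latency * abs(BASE - i)
    v0, f0 = VF[BASE]
    profile_energy = 0.5 * capacity * v0 ** 2 * active_cycles
    profile_time = active_cycles / f0
    normalized_energy = dynamic_energy / profile_energy
    normalized_time = dynamic_time / profile_time
    return (1 - alpha) * normalized_energy + alpha * normalized_time
```

With these assumptions the baseline level costs exactly 1 for any alpha, a lower level costs less than 1 when only energy counts (alpha = 0) and more than 1 when only time counts (alpha = 1).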