Online Viterbi Algorithm for Analysis of Long Biological Sequences By Niloofar Hezarjaribi
Hidden Markov Model • Hidden Markov Models (HMMs) are commonly used for the analysis of long genomic sequences • Generative probabilistic model • The linear-time Viterbi algorithm is the most commonly used decoding algorithm • Its space complexity is O(mn) • This makes it unsuitable for long sequences
Hidden Markov Model • An HMM is composed of states and transitions • It generates sequences over a given alphabet • Each state has emission probabilities and transition probabilities • The HMM defines a joint probability Pr(X,S) • X: the given sequence • S: a state path; decoding finds the path that maximizes the joint probability
Hidden Markov Model • The probability of the best path for the first i symbols ending in state j is stored in a table P(i,j) • The second-to-last state of that path is stored in B(i,j) • tk(j): transition probability from state k to state j • ej(Xi): emission probability of Xi in state j • P(i,j) = maxk P(i-1,k) · tk(j) · ej(Xi), and the back pointer B(i,j) is the value of k that achieves this maximum • After computing these values, the most probable path is recovered by moving from right to left along the back pointers (see the sketch below)
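The recurrence above can be written as a short routine. Below is a minimal C++ sketch (not taken from the slides) of the tables P and B and of the right-to-left traceback; it works in log space to avoid numerical underflow, and the names init, trans, emit, and x (the integer-coded sequence) are placeholders assumed for this example.

#include <vector>
#include <cmath>
#include <algorithm>

// Minimal Viterbi sketch: fill P (best log-probabilities) and B (back
// pointers), then trace back from the best final state.
std::vector<int> viterbiPath(const std::vector<double>& init,
                             const std::vector<std::vector<double>>& trans,
                             const std::vector<std::vector<double>>& emit,
                             const std::vector<int>& x) {
    int m = init.size(), n = x.size();
    std::vector<std::vector<double>> P(n, std::vector<double>(m));
    std::vector<std::vector<int>>    B(n, std::vector<int>(m, -1));
    for (int j = 0; j < m; ++j)
        P[0][j] = std::log(init[j]) + std::log(emit[j][x[0]]);
    for (int i = 1; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            int best = 0;
            double bestScore = -HUGE_VAL;
            // P(i,j) = max_k P(i-1,k) + log t_k(j), then add log e_j(x_i)
            for (int k = 0; k < m; ++k) {
                double s = P[i - 1][k] + std::log(trans[k][j]);
                if (s > bestScore) { bestScore = s; best = k; }
            }
            P[i][j] = bestScore + std::log(emit[j][x[i]]);
            B[i][j] = best;   // back pointer: best previous state
        }
    // Traceback: start from the most probable final state, follow back pointers.
    std::vector<int> path(n);
    path[n - 1] = std::max_element(P[n - 1].begin(), P[n - 1].end()) - P[n - 1].begin();
    for (int i = n - 1; i > 0; --i)
        path[i - 1] = B[i][path[i]];
    return path;
}

Both tables have n·m entries, which is exactly the O(mn) space problem discussed next.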
Hidden Markov Model • For an HMM with m states and a sequence X of length n: space complexity O(nm), running time O(nm²)
Impractical for long sequences: O(mn) memory!
Hidden Markov Model • Example: 250 million symbols and 100 states • That is 250,000,000 × 100 = 25 billion back pointers, about 25 GB at one byte each • Completely impractical!
Existing remedies: • Split the sequence into shorter pieces • Use checkpointing: keep only some columns of the table and recompute the rest during traceback, trading running time for memory
Online Viterbi Algorithm • Space complexity: requires much less memory
Online Viterbi Algorithm • Represent the back-pointer matrix B of the Viterbi algorithm as a tree structure • The parent of node (i,j) is (i-1, B(i,j)) • Eliminate the nodes that do not lie on one of the paths ending in column i • The most probable path is the path from the leaf (n,j) with the highest P(n,j) to the root
Online Viterbi Algorithm • These paths are not necessarily edge-disjoint • Often all the paths share the same prefix up to some node, called a coalescence point; that prefix is guaranteed to be part of the most probable path and can be output immediately • After processing D symbols we check whether a coalescence point has been reached • If not, one of the potential paths must be chosen heuristically
Online Viterbi Algorithm • How do we find a coalescence point? • Maintain a compressed version of the back-pointer tree • Each node stores the number of its children and a pointer to its parent • Keep a linked list of all nodes of the compressed tree, ordered by sequence position • Keep a list of pointers to all current leaves
Online Viterbi Algorithm While processing the k-th sequence position: • First, create a new leaf for each state • Second, link it to its parent, the former leaf given by the back pointer • Third, insert it into the linked list • Once the new leaves are created, eliminate all former leaves that have no children, and recursively any ancestors whose child count drops to zero • Finally, compress the tree (see the sketch after the next slide)
Online Viterbi Algorithm How is the tree compressed? • Examine all nodes in decreasing order of sequence position • Delete nodes with zero or one child • If a node has at least two children, follow its parent links • Link the node to the first ancestor that has at least two children • A node with no ancestor having at least two children is a coalescence point • Make it the new root • Output the path up to that point and remove it from memory
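To make the last two slides concrete, here is a small C++ sketch of the compressed back-pointer tree and its maintenance; it is a reconstruction from the slide text, not the authors' implementation, and the names Node, CompressedTree, addLeaves, and compress are invented for this example. The back-pointer column B(k,·) is assumed to come from the Viterbi recurrence sketched earlier, and outputting the decided prefix from the stored back pointers is omitted.

#include <list>
#include <vector>

// Node of the compressed back-pointer tree.
struct Node {
    int pos;          // sequence position i
    int state;        // state j
    Node* parent;     // parent in the compressed tree
    int numChildren;  // number of children in the compressed tree
};

struct CompressedTree {
    std::list<Node*> nodes;      // all nodes, kept ordered by sequence position
    std::vector<Node*> leaves;   // current leaves, one per state

    // Step k: create one leaf per state, hang it under the former leaf given by
    // the back pointer, then prune former leaves (and ancestors) left childless.
    void addLeaves(int k, const std::vector<int>& backPointerColumn) {
        int m = backPointerColumn.size();
        std::vector<Node*> newLeaves(m);
        for (int j = 0; j < m; ++j) {
            Node* parent = leaves.empty() ? nullptr : leaves[backPointerColumn[j]];
            newLeaves[j] = new Node{k, j, parent, 0};
            if (parent) parent->numChildren++;
            nodes.push_back(newLeaves[j]);          // list stays ordered by position
        }
        for (Node* v : leaves)
            while (v != nullptr && v->numChildren == 0) {
                Node* parent = v->parent;
                nodes.remove(v);   // linear scan; a real version keeps a list iterator per node
                delete v;
                if (parent) parent->numChildren--;
                v = parent;
            }
        leaves = newLeaves;
    }

    // Compression (called after addLeaves): scan from the newest position
    // backwards, keep only the current leaves and branching nodes (>= 2
    // children), and relink each kept node to its nearest branching ancestor.
    // The kept node left without a parent is the root; whenever this root
    // advances to a later position, a new coalescence point has been reached.
    Node* compress() {
        int currentPos = leaves.front()->pos;
        Node* root = nullptr;
        std::list<Node*> kept;
        for (auto it = nodes.rbegin(); it != nodes.rend(); ++it) {
            Node* v = *it;
            if (v->pos == currentPos || v->numChildren >= 2) {
                Node* a = v->parent;
                while (a != nullptr && a->numChildren < 2)
                    a = a->parent;                  // skip non-branching ancestors
                v->parent = a;
                if (a == nullptr) root = v;         // candidate new root
                kept.push_front(v);                 // rebuilds the list in position order
            } else {
                delete v;                           // zero or one child: drop the node
            }
        }
        nodes.swap(kept);
        return root;
    }
};

In this sketch the compression rebuilds the node list for simplicity; the slides describe updating it in place, but the resulting tree is the same.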
Online Viterbi Algorithm • Running time of this update: O(m) per sequence position • Space for the compressed tree representation: O(m) • So the asymptotic running time does not increase • Measured overhead of the update is less than 5% • The worst-case space requirement is still O(mn)
Online Viterbi Algorithm Advantages of this algorithm: • Maximum space requirement: O(m log n) • The online Viterbi algorithm leads to a significant decrease in memory usage • It can output the initial segment of the most probable path before the whole sequence has been processed
Memory Requirements of Online Viterbi Algorithm • Symmetric two-state HMM over a binary alphabet
Memory Requirements of Online Viterbi Algorithm • Assume t < ½ and e < ½ • The possible configurations of back pointers at a single position, (i) through (iv), are shown in a figure on the original slide (not reproduced here)
Memory Requirements of Online Viterbi Algorithm • Configuration (iv) never occurs for t < ½ • A coalescence point occurs whenever configuration (ii) or (iii) occurs
Memory Requirements of Online Viterbi Algorithm • The upper bound on the memory requirement is O(m log n)
Memory Requirements of Online Viterbi Algorithm Multi-state HMM: • With two states, each new coalescence point clears the memory, but with more states a tree of substantial length can remain in memory • So the sizes of consecutive runs are not independent
Memory Requirements of Online Viterbi Algorithm How to evaluate the memory requirements of a multi-state HMM: • Generalize the two-state model to multiple states • Symmetric HMM with m states emitting symbols over an m-letter alphabet • Each state emits one symbol with higher probability than the others • Transition probabilities are equal except for the self-transitions (a construction sketch follows below)
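As an illustration of this test setup, the following C++ sketch constructs such a symmetric HMM; the parameter names selfProb and matchProb are assumptions made for the example, not values given on the slides.

#include <vector>

// Symmetric test HMM: m states over an m-letter alphabet.  Each state emits
// "its own" symbol with higher probability, and all outgoing transition
// probabilities are equal except for the self-transition.
struct SymmetricHMM {
    std::vector<std::vector<double>> trans, emit;

    SymmetricHMM(int m, double selfProb, double matchProb) {
        double otherTrans = (1.0 - selfProb) / (m - 1);   // remaining mass split evenly
        double otherEmit  = (1.0 - matchProb) / (m - 1);
        trans.assign(m, std::vector<double>(m, otherTrans));
        emit.assign(m, std::vector<double>(m, otherEmit));
        for (int j = 0; j < m; ++j) {
            trans[j][j] = selfProb;    // self-transition differs from the others
            emit[j][j]  = matchProb;   // symbol j is favoured in state j
        }
    }
};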
Memory Requirements of Online Viterbi Algorithm • The algorithm was tested for values of m up to 6, on sequences generated by the HMM • The data are consistent with logarithmic growth of the average maximum memory needed
Conclusion • The algorithm is based on efficient detection of coalescence points in the back-pointer tree • It requires a variable amount of space that depends on the HMM and on local properties of the analyzed sequence • Experiments on both simulated and real data suggest that the asymptotic bound Θ(m log n) extends to multi-state HMMs; in fact, for most of its execution the algorithm uses much less memory • The algorithm can be used for online processing of streamed sequences
Use of the Online Viterbi Algorithm in My Research • Using the Viterbi algorithm to drive DVFS (dynamic voltage and frequency scaling) • Assign a low frequency to less busy windows and a high frequency to busy windows
Use of the Online Viterbi Algorithm in My Research
// Cost of running a window at voltage/frequency level i (level 5 = highest, profiled level):
dynamicEnergy = transitionEnergy + 0.5*C*pow(V, 2)*t;                // switching energy plus 1/2·C·V²·t at level i
dynamicTime   = activeCycles/VF[i][1] + switchLatency*abs(5 - i);    // execution time plus latency of changing levels
profileEnergy = 0.5*capacity*pow(VF[5][0], 2)*activeCycles;          // baseline energy at the highest level
profileTime   = activeCycles/VF[5][1];                               // baseline time at the highest level
normalizedEnergy = dynamicEnergy / profileEnergy;
normalizedTime   = dynamicTime / profileTime;
myCost = (1 - alpha)*normalizedEnergy + (alpha)*normalizedTime;      // weighted energy/time trade-off
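A possible usage sketch, which is an assumption rather than part of the original slides: evaluate this cost for every voltage/frequency level of the window and pick the cheapest one. It interprets VF[i][0] as the voltage and VF[i][1] as the frequency of level i, with level 5 as the highest (profiled) level, and uses capacity and activeCycles in place of C and t from the snippet above.

#include <cmath>
#include <cstdlib>

// Return the voltage/frequency level (0..5) with the lowest weighted
// energy/time cost for the current window.
int pickLevel(const double VF[6][2], double capacity, double transitionEnergy,
              double activeCycles, double switchLatency, double alpha) {
    double profileEnergy = 0.5 * capacity * std::pow(VF[5][0], 2) * activeCycles;
    double profileTime   = activeCycles / VF[5][1];
    int bestLevel = 0;
    double bestCost = 0.0;
    for (int i = 0; i < 6; ++i) {
        double dynamicEnergy = transitionEnergy + 0.5 * capacity * std::pow(VF[i][0], 2) * activeCycles;
        double dynamicTime   = activeCycles / VF[i][1] + switchLatency * std::abs(5 - i);
        double cost = (1 - alpha) * (dynamicEnergy / profileEnergy)
                    + alpha * (dynamicTime / profileTime);
        if (i == 0 || cost < bestCost) { bestCost = cost; bestLevel = i; }
    }
    return bestLevel;
}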