This paper presents a method for sequence alignment and phylogenetic prediction using the MapReduce programming model on the Hadoop Distributed File System (DFS). It discusses the types of sequence alignment, namely pair-wise alignment and multiple sequence alignment, as well as the Needleman-Wunsch and Smith-Waterman algorithms. The proposed system aligns a query file against target files in DFS and outputs the alignment score for each. The paper also introduces multiple sequence alignment and discusses methods for producing it, including dynamic programming and progressive alignment construction.
Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS
Presented by: C. Geetha Jini (07MW03), D. Komagal Meenakshi (07MW05)
Guided by: Dr. G. Sudha Sadasivam, Asst. Professor, Dept. of CSE
What is Sequence Alignment? The procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.
Types of Sequence Alignment
• Pair-wise Alignment: alignment of two sequences
  • Global, using the Needleman-Wunsch algorithm:
      L G P S S K Q T G K G S _ S R A W D N
      |         |     | | |       |     |
      L N _ A T K S A G K G A I M R L G D A
  • Local, using the Smith-Waterman algorithm:
      _ _ _ _ _ _ _ _ _ T G K G _ _ _ _ _ _ _ _ _ _
                          | | |
      _ _ _ _ _ _ _ _ _ A G K G _ _ _ _ _ _ _ _ _ _
• Multiple Sequence Alignment: alignment of more than two sequences
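NEEDLEMAN WUNSCH ALGORITHM
• Initialization:
  $$F(0,0) = 0, \qquad F(i,0) = -i\,d, \qquad F(0,j) = -j\,d$$
• Main iteration: for each $i = 1 \dots M$ and $j = 1 \dots N$,
  $$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) & \text{case 1} \\ F(i-1,j) - d & \text{case 2} \\ F(i,j-1) - d & \text{case 3} \end{cases}$$
  $$\mathrm{Ptr}(i,j) = \begin{cases} \text{DIAG} & \text{if case 1} \\ \text{UP} & \text{if case 2} \\ \text{LEFT} & \text{if case 3} \end{cases}$$
• Case 1: $x_i$ aligns to $y_j$. Case 2: $x_i$ aligns to a gap. Case 3: $y_j$ aligns to a gap.
• Scoring: $s(x_i,y_j) = +1$ for a match and $-1$ for a mismatch; $d$ is the gap penalty.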
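The initialization and main iteration above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' Hadoop code: the function name `needleman_wunsch_matrix` and its parameter names are assumptions, and the matrix is indexed `F[i][j]` with `i` running over `x` and `j` over `y`.

```python
def needleman_wunsch_matrix(x, y, match=1, mismatch=-1, d=1):
    """Fill the global-alignment score matrix F (hypothetical helper)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    # Main iteration: best of the three cases for every cell.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # case 1: x_i aligns to y_j
                F[i - 1][j] - d,      # case 2: x_i aligns to a gap
                F[i][j - 1] - d,      # case 3: y_j aligns to a gap
            )
    return F


if __name__ == "__main__":
    F = needleman_wunsch_matrix("AGTA", "ATA")
    print(F[-1][-1])  # global alignment score is the bottom-right cell
```

The score of the optimal global alignment is the last cell, `F[M][N]`; recovering the alignment itself additionally needs the pointer matrix.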
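Needleman Wunsch Algorithm: worked example
• Sequences: x = AGTA (index i = 0…4, across) and y = ATA (index j = 0…3, down); s(xi, yj) = +1 for a match, -1 for a mismatch; d = 1.
• Case 1: xi aligns to yj. Case 2: xi aligns to a gap. Case 3: yj aligns to a gap.
• F(1,1) = max{ F(0,0) + s(x1,y1) = 1, F(0,1) - 1 = -2, F(1,0) - 1 = -2 } = 1 (case 1)
• F(1,2) = max{ F(0,1) + s(x1,y2) = -2, F(0,2) - 1 = -3, F(1,1) - 1 = 0 } = 0 (case 3)
• Completed matrix F(i,j):
             i=0   1(A)  2(G)  3(T)  4(A)
    j=0        0    -1    -2    -3    -4
    j=1 (A)   -1     1     0    -1    -2
    j=2 (T)   -2     0     0     1     0
    j=3 (A)   -3    -1    -1     0     2
• PTR = DIAG if case 1, UP if case 2, LEFT if case 3
• Optimal alignment:
    A G T A
    A _ T A
• Score: F(4,3) = 2 (three matches at +1 each and one gap at -1). The traceback path passes through the cells with values 1, 0, 1 and 2; the alignment score is the final cell, not the sum of the cells on the path.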
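The traceback that turns the pointer matrix into an alignment can be sketched as follows. This is an illustrative Python version under stated assumptions: `global_align` is a hypothetical name, ties between cases are broken in favour of DIAG, then UP, then LEFT (the slides do not specify a tie rule), and a gap is written as `_` as in the slides.

```python
def global_align(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch with pointers and traceback (illustrative sketch)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, n + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1
                (F[i - 1][j] - d, "UP"),        # case 2
                (F[i][j - 1] - d, "LEFT"),      # case 3
            ]
            # max() keeps the first maximal entry, so DIAG wins ties.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Walk the pointers back from the bottom-right cell to the origin.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(global_align("AGTA", "ATA"))  # (2, 'AGTA', 'A_TA')
```

A different tie-breaking order can return a different alignment with the same score, which is why the rule is stated explicitly here.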
Smith Waterman Algorithm
• Initialization:
  $$F(0,j) = F(i,0) = 0$$
• Iteration:
  $$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i,y_j) & \text{case 1} \\ F(i-1,j) - d & \text{case 2} \\ F(i,j-1) - d & \text{case 3} \end{cases}$$
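Smith Waterman Algorithm: worked example
• Sequences: x = AGTA (index i = 0…4, across) and y = ATA (index j = 0…3, down); s(xi, yj) = +1 for a match, -1 for a mismatch; d = 1.
• Case 1: xi aligns to yj. Case 2: xi aligns to a gap. Case 3: yj aligns to a gap.
• F(1,1) = max{ 0, F(0,0) + s(x1,y1) = 1, F(0,1) - 1 = -1, F(1,0) - 1 = -1 } = 1 (case 1)
• F(3,1) = max{ 0, F(2,0) + s(x3,y1) = -1, F(2,1) - 1 = -1, F(3,0) - 1 = -1 } = 0
• Completed matrix F(i,j):
             i=0   1(A)  2(G)  3(T)  4(A)
    j=0        0     0     0     0     0
    j=1 (A)    0     1     0     0     1
    j=2 (T)    0     0     0     1     0
    j=3 (A)    0     1     0     0     2
• PTR = DIAG if case 1, UP if case 2, LEFT if case 3
• Optimal local alignment (traceback starts at the highest-scoring cell and stops at the first zero):
    T A
    T A
• Score: F(4,3) = 2 (two matches at +1 each)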
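The local variant differs from the global one in three places: the borders are zero, every cell is floored at zero, and the traceback starts at the best cell and stops at the first zero. A minimal Python sketch under those rules is below; `local_align` is a hypothetical name, and when several cells share the best score this version simply keeps the first one found.

```python
def local_align(x, y, match=1, mismatch=-1, d=1):
    """Smith-Waterman local alignment (illustrative sketch)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]      # borders stay 0
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (0, None),                      # restart the alignment here
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1
                (F[i - 1][j] - d, "UP"),        # case 2
                (F[i][j - 1] - d, "LEFT"),      # case 3
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(local_align("AGTA", "ATA"))  # (2, 'TA', 'TA')
```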
Proposed system
• Put all files in DFS
• Input: one query file and a set of sequence files
• Map
  • Set the file name as the key
  • Pass the entire file contents as the value
  • Align the query file with each target file in DFS
  • Return (file name as key, score as value)
• Reduce
  • Combine all the (K, V) pairs
• Output: (Filename, Score)
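The data flow of this job can be imitated in plain Python without a cluster. The sketch below is only a simulation of the map and reduce steps, not Hadoop API code: `map_alignment`, `reduce_scores`, `run_job`, the file names, and the in-memory `dict` standing in for DFS are all assumptions made for illustration, and the score is the Needleman-Wunsch score with match +1, mismatch -1, d = 1.

```python
def nw_score(x, y, match=1, mismatch=-1, d=1):
    """Global alignment score, keeping only one previous row in memory."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i, xc in enumerate(x, 1):
        cur = [-i * d]
        for j, yc in enumerate(y, 1):
            s = match if xc == yc else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def map_alignment(filename, contents, query):
    """Map: key = file name, value = whole file; emit (file name, score)."""
    yield filename, nw_score(query, contents.strip())


def reduce_scores(pairs):
    """Reduce: combine all emitted (key, value) pairs into one result."""
    return dict(sorted(pairs))


def run_job(query, files):
    """Drive the two phases sequentially (a real job would distribute them)."""
    emitted = []
    for name, contents in files.items():
        emitted.extend(map_alignment(name, contents, query))
    return reduce_scores(emitted)


if __name__ == "__main__":
    # Hypothetical target files; in the proposed system these live in DFS.
    targets = {"t1.txt": "ATA\n", "t2.txt": "AGTA\n"}
    print(run_job("AGTA", targets))  # {'t1.txt': 2, 't2.txt': 4}
```

Because each target file is scored independently of the others, the map calls need no shared state, which is what makes the problem a natural fit for this programming model.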
Multiple Sequence Alignment • A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. • In general, the input is a set of query sequences that are assumed to have an evolutionary relationship: they share a lineage and are descended from a common ancestor. • From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.
Methods for producing MSA • Dynamic programming • Progressive alignment construction
Dynamic programming • The most direct method for producing an MSA; it identifies the globally optimal alignment. • Its drawback is computational complexity: for n individual sequences, the naive method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise sequence alignment. • The search space thus increases exponentially with n and also depends strongly on sequence length: n sequences of length L need on the order of L^n matrix cells.
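To make the growth concrete, the number of cells in the n-dimensional matrix is the product of (length + 1) over all sequences, since each axis also has a row for the empty prefix. The helper below is a hypothetical illustration (`dp_cells` is not from the slides), and the length of 100 is an arbitrary example value.

```python
def dp_cells(lengths):
    """Number of cells in the naive n-dimensional alignment matrix."""
    cells = 1
    for length in lengths:
        cells *= length + 1  # +1 for the empty-prefix row on each axis
    return cells


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, dp_cells([100] * n))
```

Each added sequence of length 100 multiplies the table size by 101, which is why exact dynamic programming is rarely used beyond a handful of sequences.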
Progressive alignment construction • Uses a heuristic search. • Builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related. • The most popular progressive alignment method has been ClustalW. • All progressive alignment methods require two stages: • a first stage in which the relationships between the sequences are represented as a tree, called a guide tree; • a second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.
Contd… • First step: computation of the guide tree from pairwise alignment scores by an efficient clustering method such as the neighbor-joining method. • Second step: the two most similar sequences are aligned first; additional sequences (or groups of sequences) are added later, following the guide tree. • This requires a method to optimally align a sequence with an alignment, or an alignment with an alignment. • Example: according to the guide tree, first align sequences 1 and 2, then align sequence 3 to the alignment of sequences 1 and 2, then align sequence 4 to the alignment of sequences 1, 2, and 3. [Figure: guide tree over sequence 1, sequence 2, sequence 3, sequence 4]
Determination of guide tree using Neighbor-joining method • Neighbor-joining is a bottom-up clustering method used for the construction of phylogenetic trees. • Neighbor-joining is an iterative algorithm. Each iteration consists of the following steps: • Based on the current distance matrix calculate the matrix Q . • For example, if we have four taxa (A, B, C, D) and the following distance matrix:
Contd… • We obtain the following values for the Q matrix: • Find the pair of taxa in Q with the lowest value. Create a node on the tree that joins these two taxa (i.e. join the closest neighbors, as the algorithm name implies).
• Calculate the distance of each of the taxa in the pair to this new node. • Calculate the distance of all taxa outside of this pair to the new node. • Start the algorithm again, considering the pair of joined neighbors as a single taxon and using the distances calculated in the previous step.
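One iteration of these steps can be sketched in Python. The slides do not reproduce the formulas, so the sketch uses the standard neighbor-joining definitions: Q(a,b) = (n-2)·d(a,b) - Σd(a,k) - Σd(b,k), the branch length from a to the new node u is d(a,b)/2 + (Σd(a,k) - Σd(b,k)) / (2(n-2)), and d(u,k) = (d(a,k) + d(b,k) - d(a,b)) / 2. The function name `nj_step` and the four-taxon distance matrix are hypothetical; the matrix is an invented example, not the one shown on the original slide.

```python
from itertools import combinations


def nj_step(taxa, dist):
    """One neighbor-joining iteration (needs at least three taxa).

    dist maps frozenset({a, b}) to the distance between taxa a and b.
    """
    n = len(taxa)

    def d(a, b):
        return dist[frozenset((a, b))]

    # Row sums of the distance matrix.
    total = {a: sum(d(a, k) for k in taxa if k != a) for a in taxa}
    # Step 1: the Q matrix.
    q = {(a, b): (n - 2) * d(a, b) - total[a] - total[b]
         for a, b in combinations(taxa, 2)}
    # Step 2: the pair with the lowest Q value is joined (first one on ties).
    f, g = min(q, key=q.get)
    # Step 3: distance from each member of the pair to the new node.
    df = d(f, g) / 2 + (total[f] - total[g]) / (2 * (n - 2))
    dg = d(f, g) - df
    # Step 4: distance from every other taxon to the new node.
    to_new = {k: (d(f, k) + d(g, k) - d(f, g)) / 2
              for k in taxa if k not in (f, g)}
    return q, (f, g), (df, dg), to_new


if __name__ == "__main__":
    # Hypothetical distances between four taxa.
    raw = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
           ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
    dist = {frozenset(pair): value for pair, value in raw.items()}
    q, pair, branches, to_new = nj_step(["A", "B", "C", "D"], dist)
    print(pair, branches, to_new)
```

Repeating the step with the joined pair replaced by the new node, and `to_new` supplying its distances, continues the algorithm until the tree is complete.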
Drawbacks • The primary problem is that when errors are made at any stage in growing the MSA, these errors are then propagated through to the final result. • Performance is also particularly bad when all of the sequences in the set are rather distantly related.
Phylogenetic Analysis • An investigation of the evolutionary relationships among a group of related sequences, carried out by producing a tree representation of those relationships. • Significant use: making predictions concerning the tree of life.
Structure • outer branches ->Sequences • Inner part -> Reflect the degree to which sequences are related • Alike sequences -> located at neighboring outside branches • Less related sequences -> more distant from each other
Proposed System • Implementation of sequence alignment and phylogenetic prediction using the map-reduce programming model in Hadoop • Algorithms used for alignment • Global: Needleman-Wunsch algorithm • Local: Smith-Waterman algorithm
Proposed system
• Put all files in DFS
• Input: set of sequence files
• Map
  • Set the file name as the key
  • Pass the entire file contents as the value
  • Align all the files with each other in all possible combinations and find the alignment scores
  • Return (file name as key, score as value)
• Reduce
  • Combine all the (K, V) pairs
• Output: (Filename, Score)
• Phylogenetic Analysis
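The "all possible combinations" step amounts to scoring every unordered pair of files. The following Python sketch imitates it in memory; it is not Hadoop code, the names `sw_score` and `all_pairs_scores` and the file names are hypothetical, and keying the output by a pair of file names is an assumption, since the slide only says the key is a file name. The local (Smith-Waterman) score is used here with match +1, mismatch -1, d = 1.

```python
from itertools import combinations


def sw_score(x, y, match=1, mismatch=-1, d=1):
    """Best local alignment score, keeping one previous row in memory."""
    best = 0
    prev = [0] * (len(y) + 1)
    for xc in x:
        cur = [0]
        for j, yc in enumerate(y, 1):
            s = match if xc == yc else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        best = max(best, max(cur))
        prev = cur
    return best


def all_pairs_scores(files):
    """Score every unordered pair of files; key = (file name, file name)."""
    scores = {}
    for a, b in combinations(sorted(files), 2):
        scores[(a, b)] = sw_score(files[a].strip(), files[b].strip())
    return scores


if __name__ == "__main__":
    # Hypothetical sequence files standing in for the DFS contents.
    files = {"a.txt": "AGTA\n", "b.txt": "ATA\n", "c.txt": "GGGG\n"}
    for pair, score in all_pairs_scores(files).items():
        print(pair, score)
```

For k files this produces k(k-1)/2 scores, and the resulting table of pairwise scores is the kind of input that a guide-tree or phylogenetic step would start from.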
Conclusion • MapReduce implementations of pairwise sequence alignment, both global and local, were completed in Hadoop using the Needleman-Wunsch and Smith-Waterman algorithms. • This work can be extended to perform multiple sequence alignment and phylogenetic analysis in Hadoop, for predicting possible evolutionary relationships among a group of related sequences.