Parallelized Multiple Sequence Alignment on the Public Cloud

Presented by: Dr. G. Sudha Sadasivam, Professor, Dept of CSE, PSG College of Technology, Coimbatore • Co-authors: Mr B. Vijayan, Mr S. Arul Prakash, Mr K.V. Hari Babu, Students, BE (CSE), Dept of CSE, PSG College of Technology, Coimbatore

Agenda • Sequence alignment • Introduction to Clouds • Approaches for MSA • Problem statement • System Architecture • Illustration of working of the system • Analysis • Experimental results • Conclusion

What is Sequence Alignment? • The procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. • Uses • For sequence similarity • Phylogenetic tree analysis • Factors – accuracy and speed

Cloud computing • Provides scalable, on-demand, RT computing services • Suitability of cloud for Sequence Alignment: • On-demand scalability of the cloud suits the dynamic nature of MSA • Low infrastructure maintenance cost for applications • Data and compute parallelism in clouds through the map-reduce paradigm facilitates energy-efficient and fast MSA.

Types of Sequence Alignment • Pair-wise Alignment: alignment of two sequences • Global, using the Needleman-Wunsch algorithm:

L G P S S K Q T G K G S _ S R A W D N
|         |     | | |       |     |
L N _ A T K S A G K G A I M R L G D A

• Local, using the Smith-Waterman algorithm:

_ _ _ _ _ _ _ _ _ T G K G _ _ _ _ _ _ _ _ _ _
                    | | |
_ _ _ _ _ _ _ _ _ A G K G _ _ _ _ _ _ _ _ _ _

• Multiple Sequence Alignment: alignment of more than two sequences
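The local case can be sketched in a few lines of Python. This is a minimal illustration of Smith-Waterman scoring only (no traceback); the scoring values (match +1, mismatch -1, gap penalty 1) are assumptions borrowed from the Needleman-Wunsch slides, not values the slides give for local alignment.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, gap=1):
    """Return the best local alignment score between x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1];
    # the first row and column stay 0, so an alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(
                0,                    # restart: drop a poorly scoring prefix
                H[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                H[i - 1][j] - gap,    # x[i-1] against a gap
                H[i][j - 1] - gap,    # y[j-1] against a gap
            )
            best = max(best, H[i][j])
    return best


# The fragment from the slide: only "GKG" matches, so the best local score is 3.
print(smith_waterman_score("TGKG", "AGKG"))
```

The only differences from the global recurrence are the extra 0 in the maximum and taking the best cell anywhere in the matrix rather than the bottom-right corner.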

MSA methods N- sequence length; n- number of sequences

MSA in cloud • CloudBurst – RMAP • Does not split sequences to load in cloud environment • Not for MSA • No automatic scale up/down of clusters • CLUE- proposal from Maryland University • VM cloning – Snowflock with MPIs

Problem statement • A time-efficient approach to sequence alignment that maintains quality (accuracy) in the cloud • Using the Hadoop framework • Dynamic approach → accuracy • Data and compute parallelism in Hadoop → speed • Blocking and scalability of Hadoop • Parallel transfer of sequence splits over the network to remote clusters • Automated scale up/down of clusters based on the computational needs of the environment.

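Needleman Wunsch Algorithm • Initialization: $F(0,0) = 0$, $F(i,0) = -i\,d$, $F(0,j) = -j\,d$ • Main iteration: for each $i = 1 \dots M$ and $j = 1 \dots N$

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{case 1} \\ F(i-1,j) - d & \text{case 2} \\ F(i,j-1) - d & \text{case 3} \end{cases}$$

$$\mathrm{Ptr}(i,j) = \begin{cases} \text{DIAG} & \text{if case 1} \\ \text{UP} & \text{if case 2} \\ \text{LEFT} & \text{if case 3} \end{cases}$$

• Case 1: $x_i$ aligns to $y_j$ • Case 2: $x_i$ aligns to a gap • Case 3: $y_j$ aligns to a gap • $s(x_i, y_j) = +1$ for a match, $-1$ for a mismatch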

Needleman Wunsch Algorithm: worked example • x = ATA (rows, index i), y = AGTA (columns, index j) • $s(x_i, y_j) = +1$ for a match, $-1$ for a mismatch; $d = 1$ • Initialization: $F(0,0) = 0$, $F(i,0) = -i\,d$, $F(0,j) = -j\,d$

F(i,j)    j=0     A     G     T     A
i=0         0    -1    -2    -3    -4
A          -1     1     0    -1    -2
T          -2     0     0     1     0
A          -3    -1    -1     0     2

• $F(1,1) = \max\{F(0,0) + s(x_1,y_1) = 1,\; F(0,1) - 1 = -2,\; F(1,0) - 1 = -2\} = 1$ (case 1) • $F(1,2) = \max\{F(0,1) + s(x_1,y_2) = -2,\; F(0,2) - 1 = -3,\; F(1,1) - 1 = 0\} = 0$ (case 3) • PTR = DIAG if case 1, UP if case 2, LEFT if case 3 • Case 1: $x_i$ aligns to $y_j$; Case 2: $x_i$ aligns to a gap; Case 3: $y_j$ aligns to a gap • Optimal alignment: A_TA / AGTA (score 2)
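The matrix and alignment in this example can be reproduced with a short Python sketch. It follows the recurrence and pointer rules from the slides (match +1, mismatch -1, d = 1); the function name and the tie-breaking order (DIAG, then UP, then LEFT) are choices made here for illustration, since the slides do not say how ties are resolved.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment of x and y; returns (F, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # Initialization: the first column and row are pure gap runs.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # Main iteration: pick the best of the three cases for every cell.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "UP"),        # case 2: x_i aligns to a gap
                (F[i][j - 1] - d, "LEFT"),      # case 3: y_j aligns to a gap
            ]
            # max() keeps the first maximal entry, so ties prefer DIAG.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from the bottom-right corner to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:  # LEFT
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, ax, ay = needleman_wunsch("ATA", "AGTA")
for row in F:
    print(row)
print(ax)  # A_TA
print(ay)  # AGTA
```

The bottom-right cell F[3][4] holds the optimal global score, 2 for this pair.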

Multiple Sequence Alignment • A multiple sequence alignment is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. • The input is a set of query sequences that are assumed to have an evolutionary relationship: they share a lineage and are descended from a common ancestor. • From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.

MSA Approaches • Dynamic programming • Progressive alignment • Iterative approach

Dynamic Programming • Direct method for MSA that identifies the globally optimal alignment. • Computational complexity • An n-dimensional equivalent of the pairwise alignment matrix is formed. • The search space grows exponentially with the number of sequences n and depends strongly on the sequence length N. • $O(N^n)$

Progressive Alignment • According to the guide tree (for seq 1, seq 2, seq 3, seq 4): • Align seq 1 and 2, • Align seq 3 with respect to seq 1 and 2, • Align seq 4 to the alignment of seq 1, 2, and 3. • Heuristic search. • Builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related. • Stages: • The relationships between the sequences are represented as a tree, called a guide tree, built from pairwise alignment scores. • The MSA is built by adding the sequences one at a time to the growing MSA according to the guide tree.
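The first stage, ranking sequences by pairwise score, can be sketched as follows. This is a simplified stand-in for a real guide tree: it scores every pair with Needleman-Wunsch (match +1, mismatch -1, d = 1, assumed here), starts from the most similar pair, and then greedily adds whichever remaining sequence scores best against any sequence already chosen. The function names and the greedy rule are illustrative assumptions, not the method used in the presented system.

```python
from itertools import combinations


def nw_score(x, y, match=1, mismatch=-1, d=1):
    """Global alignment score, keeping only one matrix row in memory."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def progressive_order(seqs):
    """seqs: dict of name -> sequence. Returns (pair_scores, join_order)."""
    pairs = list(combinations(seqs, 2))
    scores = {frozenset(p): nw_score(seqs[p[0]], seqs[p[1]]) for p in pairs}

    # Start with the most similar pair.
    order = list(max(pairs, key=lambda p: scores[frozenset(p)]))
    remaining = [name for name in seqs if name not in order]

    # Greedily add the sequence closest to anything already in the alignment.
    while remaining:
        nxt = max(
            remaining,
            key=lambda r: max(scores[frozenset((r, o))] for o in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return scores, order


scores, order = progressive_order({"S1": "AGTA", "S2": "ATA", "S3": "GAT"})
print(order)
```

With these three short sequences the closest pair is S1 and S2, so S3 is added last.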

Drawbacks • The primary problem is that errors made at any stage in growing the MSA are propagated through to the final result. → Random/iterative approaches are used • Performance is also particularly poor when all of the sequences in the set are rather distantly related.

System Architecture (figure) • Client side (virtual environment): 1. Create virtual environment; 2. Split the sequences into sequence fragments (AGT….CG) • Between client and server: 2. Parallel transmission over the Internet • Server side (Hadoop cluster with a head server VM): 3. Copy to HDFS; 4. Forking VMs / deleting VMs (new VMs); 5. Perform alignment; 6. Report the result
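The client-side steps (split the sequences, then transmit the fragments in parallel) can be sketched with Python's standard thread pool. Everything network-related here is a placeholder: send_fragment is a hypothetical stand-in that only returns the fragment length, and the block size and worker count are assumed values, not the system's actual settings.

```python
from concurrent.futures import ThreadPoolExecutor


def split_sequence(seq, block_size):
    """Cut a sequence into consecutive fragments of at most block_size characters."""
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]


def send_fragment(fragment):
    """Hypothetical upload stub: a real client would push the fragment to the server."""
    return len(fragment)


def transfer_in_parallel(seq, block_size, workers=4, send=send_fragment):
    """Split seq and hand each fragment to a worker thread; returns (fragments, sent)."""
    fragments = split_sequence(seq, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves fragment order even though uploads run concurrently.
        sent = list(pool.map(send, fragments))
    return fragments, sum(sent)


sequence = "ACGT" * 7500  # 30,000 characters
fragments, total = transfer_in_parallel(sequence, block_size=2000)
print(len(fragments), total)
```

Threads suit this step because an upload spends most of its time waiting on the network rather than computing.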

Map reduce Architecture (figure) • Map Tasks 1-3 read input blocks (D1,B1 … D3,B2) and emit key-value pairs (e.g. K1,C1) • A sort-and-group step per data set (D1, D2) collects the values for each key (e.g. K2,[C1,C4]) • Reduce Tasks 1 and 2 emit one result per key (K1,I … K6,I)
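The map, sort-and-group, and reduce flow can be imitated in plain Python. This is a toy, single-process illustration of the paradigm, not Hadoop code and not the alignment job itself: the map step emits (base, 1) for every character in a block, and the reduce step sums the counts per base. All function names are made up for this sketch.

```python
from collections import defaultdict


def map_task(block):
    """Map: emit one (key, value) pair per character in the block."""
    return [(base, 1) for base in block]


def sort_and_group(pairs):
    """Shuffle: collect all values emitted for the same key, keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_task(key, values):
    """Reduce: combine the grouped values for one key into a single result."""
    return key, sum(values)


def run_mapreduce(blocks):
    emitted = [pair for block in blocks for pair in map_task(block)]
    grouped = sort_and_group(emitted)
    return dict(reduce_task(k, v) for k, v in grouped.items())


print(run_mapreduce(["AGTA", "ATA", "GAT"]))  # {'A': 5, 'G': 2, 'T': 3}
```

In a real cluster each map_task call would run on the node holding that block, and the grouping step would move data across the network between map and reduce nodes.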

A single Combination – An illustration

S1 = “AGTA”; S2 = “ATA”; S3 = “GAT” • 1. Alignment of S1 & S2: score 4; A1S1: “AGTA”; A1S2: “A_TA” • 2. Alignment of A1S1 & S3: score -5; A2S1: “AG_TA”; A1S3: “_GAT_”

3. Alignment of A1S2 & A1S3: score -3; A2S2: “A__TA_”; A2S3: “_GAT__”

Analysis • ‘n’ – number of sequences • ‘N’ – average length of a sequence • ‘k’ – average number of blocks in a sequence • ‘K’ – size of one block

2. Parallelised data transfer • ‘T’ – time for serial sequence transfer; ‘k’ – number of blocks • T/k – time for sequence transfer in parallel 3. Dynamic cluster creation • Advantage: the computation power of the remote cluster is used optimally and not wasted • Disadvantage: time to set up the cluster

Experimental Setup • Core 2 Duo processors, 2.8 GHz, 160 GB HD, 2 GB RAM • LAN: 100 Mbps • OS: RHEL v5 • Client virtual environment: 4 VMs • Server cluster: 5 machines • Hadoop DFS in fully distributed mode • OpenVZ was used for virtualization

Effect of parallel file transfer • C1: Communication time from 3 client VMs to the server without multithreading. • C2: Communication time from 3 client VMs to the server with multithreading. • T1: Total time for file transfer from client to server without multithreading • T2: Total time for file transfer from client to server with multithreading

Time to start virtual machines Parallelised starting of VMs can be done to reduce time

Cluster performance wrt number of VMs (figure: 3 to 12 VMs; 30 KB sequences with 2 KB splits, up to 5 sequences) • When the number of sequences is less than 6, a five-node Hadoop cluster is sufficient.

Dynamic scaling up/down of clusters • VMs are instantiated based on the number of Map-Reduce tasks • The number of tasks was checked dynamically → new VMs were started and tasks were reallocated • Old VMs were destroyed if not used
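The scaling decision itself can be sketched as a small function that maps the number of pending Map-Reduce tasks to a target cluster size. The policy and every parameter here (tasks_per_vm, min_vms, max_vms) are hypothetical choices for illustration; the slides do not state the actual thresholds or how VMs are started and destroyed.

```python
import math


def plan_cluster_size(pending_tasks, current_vms, tasks_per_vm=2, min_vms=1, max_vms=12):
    """Return (target_vms, delta); delta > 0 means start VMs, delta < 0 means destroy VMs."""
    # Enough VMs to run every pending task at once, given the assumed slots per VM.
    needed = math.ceil(pending_tasks / tasks_per_vm)
    # Never shrink below the minimum or grow past the maximum.
    target = max(min_vms, min(max_vms, needed))
    return target, target - current_vms


print(plan_cluster_size(pending_tasks=10, current_vms=3))  # (5, 2): start two VMs
print(plan_cluster_size(pending_tasks=0, current_vms=5))   # (1, -4): destroy four idle VMs
```

A controller would call such a function periodically and then fork or delete VMs according to the returned delta.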

Conclusion 1) The proposed MSA improves computation time while maintaining accuracy. • Parallelism of sequence alignment at three levels. • Hadoop data grids: data and compute parallelism and scalability • Dynamic programming: accuracy. 2) Complexity is reduced from $O(N^n)$ to $O(K^2 \cdot n(n-1)/2)$ • Combining progressive and dynamic approaches. • Blocking in Hadoop 3) Enhancements (using clouds for MSA) • Automatic configuration of the cloud environment based on computational needs • Efficient upload of data into HDFS by parallel transfer of sequence fragments over the Internet.
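To get a feel for the two complexity expressions, the operation counts can be compared directly. The figures below plug in the experimental values from the slides (5 sequences, 30 KB sequences, 2 KB splits) under the assumption that 1 KB is roughly 1,000 characters; they are order-of-magnitude counts of matrix cells with constants ignored, not measured running times.

```python
def full_dp_cells(N, n):
    """Cells in the n-dimensional dynamic programming matrix: N^n."""
    return N ** n


def blocked_pairwise_cells(K, n):
    """K^2 cells per block pair, times the n(n-1)/2 sequence pairs."""
    return K ** 2 * (n * (n - 1) // 2)


N, K, n = 30_000, 2_000, 5  # assumed character counts for 30 KB and 2 KB
print(full_dp_cells(N, n))           # 24300000000000000000000 (about 2.4e22)
print(blocked_pairwise_cells(K, n))  # 40000000 (4e7)
```

The gap of roughly fifteen orders of magnitude is why the exact n-dimensional method is impractical beyond a handful of short sequences.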

Acknowledgements • The research was carried out as part of the PSG-Yahoo Research programme on Grid and Cloud computing. • Sincere thanks to 1) Dr R Rudramoorthy, Principal, PSG College of Technology, Coimbatore. 2) Mr K V Chidambaran, Director, Grid and Cloud Systems Group, Yahoo, Bangalore

THANK YOU QUESTIONS?

REFERENCES • Apache (2002), Hadoop Documentation, retrieved on September 20, 2009, from http://hadoop.apache.org/core/docs/r0.17.2/. • Tahir, N., Imitaz, S. and Shaftab, A., “Parallel Needleman-Wunsch Algorithm for Grid”, retrieved on January 19, 2009, from http://www.gridbus.org/~alchemi/files/Parallel%20Needleman%20Algo.pdf. • Michael, C. (2009), “CloudBurst: highly sensitive read mapping with MapReduce”, Bioinformatics, 25(11), 1363-1369. • Lee, T., “A genomic CluE for Cloud Computing”, retrieved on January 13, 2009, from http://www.eurekalert.org/pub_releases/2009-04/uom-agc042309.php. • Yongli, H. and Shen, J., “Sequence analysis scale up and acceleration using Grid and Cloud Computing yield efficient analyses of HIV-1 variants and other viruses”, retrieved on February 15, 2009, from www.iscb.org/uploaded/css/43/12056.pdf. • Philip, P., Andres, L., Eyal, L. and Michael, B., “Adding the easy button to the cloud with SnowFlock and MPI”, in Proceedings of 3rd ACM workshop in system level virtualization for HPC (2009), 122-127.
