Improving Multiple Sequence Alignment in Multi-client Environments

This presentation explores the use of inexpensive storage as a grid cache to improve the performance of multiple sequence alignment analysis in multi-client environments. It covers caching of intermediate results, deployment on SMP and distributed-memory machines, and experimental results.

Presentation Transcript


  1. Improving Performance of Multiple Sequence Alignment Analysis in Multi-client Environments: Use of Inexpensive Storage as Grid Cache. Umit Catalyurek, Mike Gray, Eric Stahlberg, Renato Ferreira, Tahsin Kurc, Joel Saltz. Department of Biomedical Informatics, The Ohio State University; Ohio Supercomputer Center. March 2, 2004, BMI 731 - Biomedical Data Management

  2. Outline • Multiple Sequence Alignment • CLUSTALW • Sequence Analysis in Multiple Client Environment • Caching Intermediate Results • Deployment on SMP Machine • Deployment on Distributed Memory Machine • Experimental Results • Conclusion

  3. Sequence Alignment • An alignment is a mutual arrangement of two sequences that shows where the two sequences are similar and where they differ • Examples (sequence s / sequence t / Hamming distance): AAT / TAA / 2; AGCAA / ACATA / 3; AGCACACA / ACACACTA / 6
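
The Hamming distances on this slide can be checked with a few lines of Python. This is a minimal sketch; the function name `hamming_distance` is chosen here for illustration and does not come from the presentation.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(s, t))


# The three sequence pairs shown on the slide.
pairs = [("AAT", "TAA"), ("AGCAA", "ACATA"), ("AGCACACA", "ACACACTA")]
distances = [hamming_distance(s, t) for s, t in pairs]
print(distances)  # [2, 3, 6]
```

Hamming distance compares position by position and allows no gaps, which is why the last pair scores 6 even though the two sequences are mostly a shifted copy of each other.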

  4. Edit Distance (unit cost) • Alignment 1: s: AGCACAC-A, t: A-CACACTA, cost 2 • Alignment 2: s: AG-CACACA, t: ACACACT-A, cost 4 • distance(s, t) = 2, the minimum cost over all alignments
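
As a rough illustration of the unit-cost edit distance on this slide, the sketch below scores the two alignments shown and then computes the minimum with the standard dynamic-programming recurrence. The function names are illustrative assumptions, and the slide does not say which algorithm the authors used.

```python
def alignment_cost(s_aligned: str, t_aligned: str) -> int:
    """Unit cost of a given gapped alignment: each mismatch or gap column costs 1."""
    return sum(a != b for a, b in zip(s_aligned, t_aligned))


def edit_distance(s: str, t: str) -> int:
    """Minimum unit-cost edit distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a from s
                            curr[j - 1] + 1,          # insert b into s
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]


cost_1 = alignment_cost("AGCACAC-A", "A-CACACTA")  # first alignment on the slide
cost_2 = alignment_cost("AG-CACACA", "ACACACT-A")  # second alignment on the slide
best = edit_distance("AGCACACA", "ACACACTA")
print(cost_1, cost_2, best)  # 2 4 2
```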

  5. Multiple Sequence Alignment • Two alternative alignments of the same eight sequences:
     VTISCTGSSSNIGAG-NHVKWYQQLPG
     VTISCTGTSSNIGS--ITVNWYQQLPG
     LRLSCSSSGFIFSS--YAMYWVRQAPG
     LSLTCTVSGTSFDD--YYSTWVRQPPG
     PEVTCVVVDVSHEDPQVKFNWYVDG--
     ATLVCLISDFYPGA--VTVAWKADS--
     AALGCLVKDYFPEP--VTVSWNSG---
     VSLTCLVKGFYPSD--IAVEWESNG--
     or
     VTISCTGSSSNIG-AGNHVKWYQQLPG
     VTISCTGTSSNIG--SITVNWYQQLPG
     LRLSCSSSGFIFS--SYAMYWVRQAPG
     LSLTCTVSGTSFD--DYYSTWVRQPPG
     PEVTCVVVDVSHEDPQVKFNW--YVDG
     ATLVCLISDFYPG--AVTVAW--KADS
     AALGCLVKDYFPE--PVTVSW--NS-G
     VSLTCLVKGFYPS--DIAVEW--ESNG
     • Optimal alignment: $O(2^n \prod_i |s_i|)$ • 6 sequences of length 100: if the constant is $10^{-9}$ seconds, the running time is $6.4 \times 10^{4}$ seconds • Add 2 sequences: the running time is $2.6 \times 10^{9}$ seconds
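
The running-time figures on this slide follow from multiplying 2^n by the product of the sequence lengths and by the per-step constant. The sketch below reproduces that arithmetic; the function name and the `seconds_per_step` parameter are assumptions made for illustration.

```python
import math


def optimal_msa_seconds(lengths, seconds_per_step=1e-9):
    """Estimate the running time of optimal MSA as 2^n * prod(|s_i|) * constant."""
    steps = 2 ** len(lengths) * math.prod(lengths)
    return steps * seconds_per_step


six = optimal_msa_seconds([100] * 6)    # about 6.4e4 seconds (roughly 18 hours)
eight = optimal_msa_seconds([100] * 8)  # about 2.56e9 seconds (roughly 81 years)
print(f"{six:.2e} s, {eight:.2e} s")
```

Adding two sequences of length 100 multiplies the estimate by 2^2 × 100^2 = 40,000, which is the motivation for heuristic methods such as progressive alignment.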

  6. CLUSTAL W • Based on Higgins & Sharp CLUSTAL [Gene88] • Progressive alignment-based strategy • Pairwise Alignment, $O(n^2 l^2)$: a distance matrix is computed using either an approximate method (fast) or dynamic programming (more accurate, slower) • Computation of Guide Tree, $O(n^3)$: a phylogenetic tree computed from the distance matrix by iteratively selecting aligned pairs and linking them • Progressive Alignment, $O(n l^2)$: a series of pairwise alignments computed using full dynamic programming to align larger and larger groups of sequences; the order in the guide tree determines the ordering of sequence alignments; at each step, either two sequences are aligned, a new sequence is aligned with a group, or two groups are aligned • n: number of sequences in the query • l: average sequence length

  7. Sequence Analysis in Multiple Client Environment • Many gene and protein databases can be accessed over the Internet • Multiple requests by multiple clients • Data caching: cache pairwise alignments, which are the most expensive phase and whose computations are independent

  8. Data Caching • Low-cost, high-performance, high-capacity commodity hardware • Disks are cheap: 100GB EIDE disks cost around $250 • A PC costs around $700-$1000 (no monitor, no high-end graphics card, moderate-size memory of 128MB-512MB) • Switched fast Ethernet; better performance with channel bonding • In 2001: 6 Pentium III PCs, 1 TB of disk storage < $10,000 • In 2002: 5 Pentium 4 PCs, 2.5TB of disk storage < $9,000 • BMI Storage Cluster: 7.2TB, 24 PCs = $50,000-$55,000 • UMD Storage Cluster: 9.5 TB, 50 PCs

  9. Caching Pairwise Alignment Scores • Sequence -> Unique ID (UID): use a hash (10 hash functions were tested, including MD5; 4 of them give results similar to MD5); resolve collisions and assign a UID to each sequence; for more than 1 million sequences from GenBank, the maximum number of collisions per hash value was 3, so collision resolution takes constant time • For each pairwise alignment, store two UIDs and a float score • B-Tree: used the GIST B-Tree implementation
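
The caching scheme on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class names `SequenceRegistry` and `ScoreCache` and the `toy_score` function are hypothetical, the hash is a truncated MD5, and a plain Python dictionary stands in for the GIST B-Tree.

```python
import hashlib


class SequenceRegistry:
    """Maps each distinct sequence to a small integer UID (hypothetical helper)."""

    def __init__(self, hash_bits: int = 32):
        self._hash_bits = hash_bits
        self._buckets = {}  # truncated hash -> list of (sequence, uid)
        self._next_uid = 0

    def _hash(self, seq: str) -> int:
        digest = hashlib.md5(seq.encode("ascii")).digest()
        return int.from_bytes(digest, "big") >> (128 - self._hash_bits)

    def uid(self, seq: str) -> int:
        bucket = self._buckets.setdefault(self._hash(seq), [])
        for stored, uid in bucket:  # resolve collisions by comparing full sequences
            if stored == seq:
                return uid
        uid = self._next_uid
        self._next_uid += 1
        bucket.append((seq, uid))
        return uid


class ScoreCache:
    """Stores one float score per unordered UID pair (dict stands in for a B-tree)."""

    def __init__(self):
        self._scores = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, uid1: int, uid2: int, compute) -> float:
        key = (min(uid1, uid2), max(uid1, uid2))  # pair order does not matter
        if key in self._scores:
            self.hits += 1
        else:
            self.misses += 1
            self._scores[key] = compute()
        return self._scores[key]


def toy_score(s: str, t: str) -> float:
    """Placeholder score (fraction of matching positions), not a real alignment score."""
    return sum(a == b for a, b in zip(s, t)) / max(len(s), len(t))


registry = SequenceRegistry()
cache = ScoreCache()
s, t = "AGCACACA", "ACACACTA"
first = cache.get_or_compute(registry.uid(s), registry.uid(t), lambda: toy_score(s, t))
second = cache.get_or_compute(registry.uid(t), registry.uid(s), lambda: toy_score(t, s))
print(first == second, cache.hits, cache.misses)
```

Storing integer UIDs rather than full sequences keeps each cache entry small and fixed-size, which is what makes a B-tree of (UID, UID, score) records practical.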

  10. Sequence -> Unique ID (UID)

  11. Deployment on SMP Machine • A hash table is used to associate a sequence with a unique integer ID (UID) • A partitioned B-tree stores pairwise alignment results • Cache partition chosen by min(UID1, UID2) % #Partitions • Multiple threads for pairwise alignment computation
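
The partitioning rule on this slide, min(UID1, UID2) % #Partitions, might look like the sketch below. The class name `PartitionedScoreCache`, the per-partition locks, and the use of dictionaries in place of B-tree partitions are assumptions made for illustration.

```python
import threading


class PartitionedScoreCache:
    """Score cache split into partitions, each guarded by its own lock."""

    def __init__(self, num_partitions: int):
        self._parts = [{} for _ in range(num_partitions)]
        self._locks = [threading.Lock() for _ in range(num_partitions)]

    def partition_of(self, uid1: int, uid2: int) -> int:
        # Rule from the slide: min(UID1, UID2) % #Partitions.
        return min(uid1, uid2) % len(self._parts)

    def put(self, uid1: int, uid2: int, score: float) -> None:
        p = self.partition_of(uid1, uid2)
        with self._locks[p]:
            self._parts[p][(min(uid1, uid2), max(uid1, uid2))] = score

    def get(self, uid1: int, uid2: int):
        p = self.partition_of(uid1, uid2)
        with self._locks[p]:
            return self._parts[p].get((min(uid1, uid2), max(uid1, uid2)))

    def partition_sizes(self):
        return [len(part) for part in self._parts]


cache = PartitionedScoreCache(num_partitions=4)
cache.put(7, 3, 0.5)
print(cache.partition_of(7, 3), cache.get(3, 7), cache.partition_sizes())
```

Because the rule depends only on the smaller UID, a pair lands in the same partition regardless of argument order, and threads working on different partitions do not contend for the same lock.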

  12. DataCutter • Component framework for combined task/data parallelism • Core services • Indexing Service: multilevel hierarchical indexes based on the R-tree indexing method • Filtering Service: distributed C++ component framework • User defines a sequence of pipelined components (filters and filter groups) • Pleasingly parallel • Generalized reduction • User directive tells the preprocessor/runtime system to generate and instantiate copies of filters • Stream-based communication • Multiple filter groups can be active simultaneously • Flow control between transparent filter copies • Replicated individual filters • Transparent: single-stream illusion • http://www.datacutter.org

  13. Deployment on Distributed Memory Machine: DataCutter version of ClustalW – v1 • Hash Filter: stores/computes the sequence-to-unique-ID mapping; partitioned (declustered) hash • Cache Filter: partitioned (declustered) cache; computes a pairwise alignment if it doesn’t exist in the cache; owner computes: computational imbalance • CLUSTALW Filter: computes guide tree generation and progressive alignment • Diagram labels: Hash (UniqueID), CLUSTALW, Cache & Compute

  14. Deployment on Distributed Memory Machine: DataCutter version of ClustalW – v2 • DC-ClustalW-v1 plus the following • Separate Pairwise Alignment Filter: cache misses are computed in Pairwise Align; balanced computation • Handles multiple queries: multiple copies of the CLUSTALW filter • Diagram labels: Hash (UniqueID), CLUSTALW, Cache, Pairwise Align
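
The difference between the "owner computes" policy of v1 and the separate Pairwise Alignment filter of v2 can be illustrated by counting how many cache misses each host would have to compute. The sketch assumes that misses are owned by partition min(UID1, UID2) % #hosts and that v2 hands misses to Pairwise Align copies round-robin. The slides do not state the actual scheduling policy, so the round-robin choice and both function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def owner_computes_load(misses, num_hosts):
    """v1-style: the cache partition that owns a pair also computes it."""
    load = Counter(min(a, b) % num_hosts for a, b in misses)
    return [load[h] for h in range(num_hosts)]


def round_robin_load(misses, num_workers):
    """v2-style (assumed policy): misses are dealt out evenly to alignment workers."""
    load = Counter(i % num_workers for i, _ in enumerate(misses))
    return [load[w] for w in range(num_workers)]


# All 28 pairs among 8 sequences with UIDs 0..7, all missing from the cache.
misses = list(combinations(range(8), 2))
owner = owner_computes_load(misses, 4)
balanced = round_robin_load(misses, 4)
print(owner, balanced)  # [10, 8, 6, 4] [7, 7, 7, 7]
```

Under owner-computes, the partition holding the smallest UIDs does more than twice the work of the last one in this example, which matches the imbalance that the v1 slide mentions.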

  15. Deployment on Distributed Memory Machine: DataCutter version of ClustalW – v2, Multiple Query Processing • Filters: QueryManager Filter (QM), ClustalW Filter (CW), Hash Filter (H), Cache Filter (C), Pairwise Alignment Filter (P) • Diagram labels: Host-0, Host-1, Host-n, Host-n+1, Host-2n

  16. Experimental Setup • Experiment 1: Pentium III 650 MHz, 768MB memory; 1000 random sequences from GPCR; average length 450 amino acids per sequence • Experiment 2: 24-processor Sun Fire 6800, 750MHz, 24GB memory; 350 MSA queries from GPCR, from 2 sequences per query to over 200 sequences per query • Experiment 3: 16 Pentium III 933MHz, 512MB memory, 3x100GB IDE disk; 64 queries, each consisting of 40 unique protein sequences from GPCR; average length 450 amino acids per sequence

  17. Experiment 1 – Execution Time of CLUSTAL W • Pentium III 650 MHz, 768MB Memory • 1000 random sequences from GPCR • Average length 450 amino acids per sequence

  18. Experiment 2 - SMP Results • 24-Processor Sun Fire 6800, 750MHz, 24GB Memory • 350 MSA queries from GPCR; from 2 sequences per query to over 200 sequences per query

  19. Experiment 3 – Distributed Memory: DataCutter version of ClustalW – v1 • 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk • 64 queries, each consisting of 40 unique protein sequences from GPCR • Average length 450 amino acids per sequence

  20. Experiment 3 – Distributed Memory: DataCutter version of ClustalW – v1 • 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk • 64 queries, each consisting of 40 unique protein sequences from GPCR • Average length 450 amino acids per sequence

  21. Experiment 3 – Distributed Memory: DataCutter version of ClustalW – v2, 1 ClustalW filter, intra-query parallelization • 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk • 64 queries, each consisting of 40 unique protein sequences from GPCR • Average length 450 amino acids per sequence

  22. Experiment 3 – Distributed Memory: DataCutter version of ClustalW – v2, multiple ClustalW filters, inter-query parallelization • 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk • 8 running a copy of Hash, Cache and PairAlign, 8 running ClustalW • 64 queries, each consisting of 40 unique protein sequences from GPCR • Average length 450 amino acids per sequence

  23. Conclusion • Caching intermediate results: computationally intensive application -> data-intensive application • SMP • Distributed Memory implementation with DataCutter
