Improving Multiple Sequence Alignment in Multi-client Environments

Improving Performance of Multiple Sequence Alignment Analysis in Multi-client Environments
Use of Inexpensive Storage as Grid Cache
Umit Catalyurek, Mike Gray, Eric Stahlberg, Renato Ferreira, Tahsin Kurc, Joel Saltz
Department of Biomedical Informatics, The Ohio State University
Ohio Supercomputer Center
March 2, 2004, BMI 731 - Biomedical Data Management

Outline
• Multiple Sequence Alignment
• CLUSTALW
• Sequence Analysis in Multiple Client Environment
• Caching Intermediate Results
• Deployment on SMP Machine
• Deployment on Distributed Memory Machine
• Experimental Results
• Conclusion

Sequence Alignment
• An alignment is a mutual arrangement of two sequences
• It shows where the two sequences are similar and where they differ

Sequence s:     AAT   AGCAA   AGCACACA
Sequence t:     TAA   ACATA   ACACACTA
Hamming dist.:  2     3       6
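The Hamming distances on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the function name `hamming_distance` is chosen here for illustration and does not come from the slides.

```python
def hamming_distance(s, t):
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(s, t))

# The three sequence pairs shown on the slide.
pairs = [("AAT", "TAA"), ("AGCAA", "ACATA"), ("AGCACACA", "ACACACTA")]
distances = [hamming_distance(s, t) for s, t in pairs]
print(distances)  # [2, 3, 6]
```

Hamming distance compares position by position without gaps, which is why the third pair scores 6 even though the two sequences are nearly a shifted copy of each other.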

Edit Distance
Unit cost:

s: AGCACAC-A        AG-CACACA
t: A-CACACTA   or   ACACACT-A
   cost 2           cost 4

distance(s, t) = 2
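A unit-cost edit distance can be computed with the standard dynamic-programming recurrence. The sketch below assumes that insertions, deletions, and substitutions each cost 1, which matches the "unit cost" label on the slide; the function name is illustrative.

```python
def edit_distance(s, t):
    """Unit-cost edit distance using a rolling row of the DP table."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]  # deleting the first i characters of s
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(edit_distance("AGCACACA", "ACACACTA"))  # 2
```

The result of 2 corresponds to the first alignment on the slide: one gap in each sequence and no mismatches.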

Multiple Sequence Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

or

VTISCTGSSSNIG-AGNHVKWYQQLPG
VTISCTGTSSNIG--SITVNWYQQLPG
LRLSCSSSGFIFS--SYAMYWVRQAPG
LSLTCTVSGTSFD--DYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNW--YVDG
ATLVCLISDFYPG--AVTVAW--KADS
AALGCLVKDYFPE--PVTVSW--NS-G
VSLTCLVKGFYPS--DIAVEW--ESNG

Optimal: $O(2^n \prod_i |s_i|)$
• 6 sequences of length 100, if the constant is $10^{-9}$ seconds: running time $6.4 \times 10^4$ seconds
• Add 2 sequences: running time $2.6 \times 10^9$ seconds
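The running-time figures on this slide follow directly from the cost bound, reading it as 2^n times the product of the sequence lengths. The short check below assumes exactly that reading and the slide's constant of 10^-9 seconds per step; the function name is illustrative.

```python
import math

SECONDS_PER_STEP = 1e-9  # constant factor quoted on the slide

def optimal_msa_seconds(lengths):
    """Estimated time for 2^n * prod(|s_i|) steps of optimal alignment."""
    steps = 2 ** len(lengths) * math.prod(lengths)
    return steps * SECONDS_PER_STEP

six = optimal_msa_seconds([100] * 6)    # 2^6 * 100^6 steps
eight = optimal_msa_seconds([100] * 8)  # 2^8 * 100^8 steps
print(f"{six:.1e} s, {eight:.1e} s")
```

Six sequences take roughly 18 hours under this estimate, while eight take roughly 80 years, which is the motivation for heuristic methods such as progressive alignment.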

CLUSTAL W
• Based on Higgins & Sharp CLUSTAL [Gene88]
• Progressive alignment-based strategy
• Pairwise Alignment (n^2 l^2)
  • A distance matrix is computed using either an approximate method (fast) or dynamic programming (more accurate, slower)
• Computation of Guide Tree (n^3): phylogenetic tree
  • Computed from the distance matrix
  • Iteratively selecting aligned pairs and linking them
• Progressive Alignment (n l^2)
  • A series of pairwise alignments computed using full dynamic programming to align larger and larger groups of sequences
  • The order in the Guide Tree determines the ordering of sequence alignments
  • At each step, either two sequences are aligned, or a new sequence is aligned with a group, or two groups are aligned
• n: number of sequences in the query
• l: average sequence length
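The guide-tree step ("iteratively selecting aligned pairs and linking them") can be illustrated with a small clustering sketch. Average-linkage (UPGMA-style) clustering is swapped in here as a simple stand-in; it is not claimed to be the method CLUSTAL W itself uses. The function names and the toy distance values are assumptions made for the example.

```python
from itertools import combinations

def average_distance(members_a, members_b, dist):
    """Mean pairwise distance between the leaves of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))

def guide_tree(names, dist):
    """Repeatedly link the two closest clusters; returns a nested tuple."""
    # Each cluster is (subtree, list of leaf names under it).
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(clusters[p[0]][1], clusters[p[1]][1], dist),
        )
        (tree_i, leaves_i), (tree_j, leaves_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), leaves_i + leaves_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Toy distance matrix (made-up values): s1 and s2 are the closest pair.
dist = {
    frozenset(("s1", "s2")): 0.1,
    frozenset(("s1", "s3")): 0.8,
    frozenset(("s2", "s3")): 0.9,
}
tree = guide_tree(["s1", "s2", "s3"], dist)
print(tree)
```

The nested tuple gives the order for progressive alignment: s1 and s2 would be aligned first, then s3 would be aligned with that group.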

Sequence Analysis in Multiple Client Environment
• Many gene and protein databases can be accessed over the Internet
• Multiple requests by multiple clients
• Data Caching
  • Cache pairwise alignments
  • Most expensive phase
  • Computations are independent

Data Caching
• Low-cost, high-performance, high-capacity commodity hardware
• Disks are cheap: 100GB EIDE disks around $250
• A PC costs around $700-$1000
  • no monitor
  • no high-end graphics card
  • moderate size memory (128MB-512MB)
• Switched fast Ethernet
  • Better performance with channel bonding
• In 2001: 6 Pentium III PCs, 1 TB of disk storage < $10,000
• In 2002: 5 Pentium 4 PCs, 2.5TB of disk storage < $9,000
• BMI Storage Cluster: 7.2TB, 24 PCs = $50,000-$55,000
• UMD Storage Cluster: 9.5 TB, 50 PCs

Caching Pairwise Alignment Scores
• Sequence -> Unique ID (UID):
  • Use a hash (tested 10 hash functions including MD5; 4 of them give results similar to MD5)
  • Resolve collisions and assign a UID to each sequence
  • For more than 1 million sequences from GenBank, the maximum number of collisions per hash value was 3: constant time
• For each pairwise alignment, store two UIDs and a float score
• B-Tree: used the GiST B-Tree implementation
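The scheme on this slide can be sketched as two small classes: one that maps sequences to UIDs through a hash with collision resolution, and one that stores a float score per UID pair and computes it only on a miss. This is a minimal sketch under stated assumptions: an in-memory dict stands in for the B-tree, MD5 from `hashlib` is used as the hash, and the class names and the `toy_score` function are hypothetical placeholders, not the authors' code.

```python
import hashlib

class SequenceRegistry:
    """Assigns a unique integer ID (UID) to each distinct sequence."""

    def __init__(self):
        self._buckets = {}  # digest -> list of (sequence, uid)
        self._next_uid = 0

    def uid(self, sequence):
        digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
        bucket = self._buckets.setdefault(digest, [])
        for known, uid in bucket:  # resolve collisions by full comparison
            if known == sequence:
                return uid
        uid = self._next_uid
        self._next_uid += 1
        bucket.append((sequence, uid))
        return uid

class PairwiseScoreCache:
    """Stores one float score per unordered UID pair (dict stands in for a B-tree)."""

    def __init__(self, registry, align):
        self.registry = registry
        self.align = align  # function (s, t) -> score, called only on a miss
        self.scores = {}
        self.misses = 0

    def score(self, s, t):
        a, b = self.registry.uid(s), self.registry.uid(t)
        key = (min(a, b), max(a, b))  # order-independent key
        if key not in self.scores:
            self.misses += 1
            self.scores[key] = float(self.align(s, t))
        return self.scores[key]

def toy_score(s, t):
    """Hypothetical placeholder for a real pairwise alignment score."""
    return sum(a == b for a, b in zip(s, t))

cache = PairwiseScoreCache(SequenceRegistry(), toy_score)
first = cache.score("ACGT", "ACGA")
second = cache.score("ACGA", "ACGT")  # same pair, served from the cache
print(first, second, cache.misses)
```

Keying on the ordered pair of UIDs means a query for (s, t) and a later query for (t, s) share one cache entry, so the alignment runs once.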

Sequence -> Unique ID (UID)

Deployment on SMP Machine
• A hash table is used to associate a sequence with a unique integer ID (UID)
• A partitioned B-tree stores pairwise alignment results
• Cache partition chosen by min(UID1, UID2) % #Partitions
• Multiple threads for pairwise alignment computation
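The partition rule on this slide is a one-line function. The sketch below also groups a batch of UID pairs by partition, as one might do before handing work to threads; the helper names and the sample UIDs are assumptions made for illustration.

```python
from collections import defaultdict

def cache_partition(uid1, uid2, num_partitions):
    """Partition index from the slide: min(UID1, UID2) % #Partitions."""
    return min(uid1, uid2) % num_partitions

def group_by_partition(pairs, num_partitions):
    """Bucket UID pairs by the cache partition that owns them."""
    groups = defaultdict(list)
    for uid1, uid2 in pairs:
        groups[cache_partition(uid1, uid2, num_partitions)].append((uid1, uid2))
    return dict(groups)

groups = group_by_partition([(7, 12), (12, 7), (4, 9), (5, 6)], 4)
print(groups)
```

Because the rule uses the smaller of the two UIDs, a pair maps to the same partition regardless of the order in which its sequences are presented.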

DataCutter
• Component Framework for Combined Task/Data Parallelism
• Core Services
  • Indexing Service: multilevel hierarchical indexes based on the R-tree indexing method
  • Filtering Service: distributed C++ component framework
• User defines a sequence of pipelined components (filters and filter groups)
• Pleasingly Parallel
• Generalized Reduction
• User directive tells the preprocessor/runtime system to generate and instantiate copies of filters
• Stream-based communication
• Multiple filter groups can be active simultaneously
• Flow control between transparent filter copies
• Replicated individual filters
• Transparent: single stream illusion
http://www.datacutter.org

Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v1
[Diagram labels: Hash (UniqueID), Cache & Compute, CLUSTALW]
• Hash Filter
  • Stores/computes the mapping from sequences to unique IDs
  • Partitioned (declustered) hash
• Cache Filter
  • Partitioned (declustered) cache
  • Computes a pairwise alignment if it doesn’t exist in the cache
  • Owner computes: computational imbalance
• CLUSTALW Filter
  • Performs guide tree generation and progressive alignment

Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v2
[Diagram labels: Hash (UniqueID), Cache, Pairwise Align, CLUSTALW]
DC-ClustalW-v1 +
• Separate Pairwise Alignment Filter
  • Cache misses computed in Pairwise Align
  • Balanced computation
• Handles multiple queries
  • Multiple copies of CLUSTALW filter

Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v2
Multiple Query Processing
[Diagram: layout of filter copies across Host-0 through Host-2n; labels QM, H, CW, C, P]
- QueryManager Filter (QM)
- ClustalW Filter (CW)
- Hash Filter (H)
- Cache Filter (C)
- Pairwise Alignment Filter (P)

Experimental Setup
• Pentium III 650 MHz, 768MB Memory
  • 1000 random sequences from GPCR
  • Average length 450 amino acids per sequence
• 24-Processor Sun Fire 6800, 750MHz, 24GB Memory
  • 350 MSA queries from GPCR; from 2 sequences per query to over 200 sequences per query
• 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
  • 64 queries, each consisting of 40 unique protein sequences from GPCR
  • Average length 450 amino acids per sequence

Experiment 1 – Execution Time of CLUSTAL W
• Pentium III 650 MHz, 768MB Memory
• 1000 random sequences from GPCR
• Average length 450 amino acids per sequence

Experiment 2 – SMP Results
• 24-Processor Sun Fire 6800, 750MHz, 24GB Memory
• 350 MSA queries from GPCR; from 2 sequences per query to over 200 sequences per query

Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v1
• 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
• 64 queries, each consisting of 40 unique protein sequences from GPCR
• Average length 450 amino acids per sequence

Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v2
1 ClustalW filter: intra-query parallelization
• 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
• 64 queries, each consisting of 40 unique protein sequences from GPCR
• Average length 450 amino acids per sequence

Experiment 3 – Distributed Memory
DataCutter version of ClustalW – v2
Multiple ClustalW filters: inter-query parallelization
• 16 Pentium III 933MHz, 512MB Memory, 3x100GB IDE disk
• 8 running a copy of Hash, Cache and PairAlign; 8 running ClustalW
• 64 queries, each consisting of 40 unique protein sequences from GPCR
• Average length 450 amino acids per sequence

Conclusion
• Caching intermediate results
  • Computationally intensive application → data-intensive application
• SMP
• Distributed Memory implementation with DataCutter
