Improving performance of Multiple Sequence Alignment in Multi-client Environments

Aaron Zollman, CMSC 838 Presentation

Overview
• Overview of talk
• CLUSTALW algorithm, speedup opportunities
• Problems with caching
• Parallelizing technique
• Weaknesses
• Applying technique to other bioinformatics problems
CMSC 838T – Presentation

Motivation
• Overlap among queries submitted to MSA tools
  • Single researcher: new sequences vs. database
  • Multiple researchers: similar subsets
• CLUSTALW: Progressive algorithm
  • Three steps
  • Progressive refinement
• Opportunities for speedup
  • Caching
  • Query ordering

CLUSTALW: Progressive global alignment
• Step 1: Pairwise alignment, distance matrix
  • Fast technique calculates distance between two sequences
  • Calculated for all sequence pairs
  • Cost: O(q²l²)
• Step 2: Guide tree
  • Group nearest first
  • Build tree sequentially
  • Cost: O(q³)
• Step 3: Progressive alignment
  • Align, starting at leaves of tree
  • Cost: O(ql²)
* q = number of sequences, l = mean sequence length
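Step 1 above can be sketched in a few lines. This is only an illustration: `kmer_distance` is a hypothetical stand-in (1 minus the Jaccard similarity of k-mer sets), not CLUSTALW's actual fast pairwise method, and the example sequences are made up. What it does show is the shape of the work: one independent distance per unordered pair, so q(q-1)/2 calls for q sequences.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Hypothetical stand-in distance: 1 - Jaccard similarity of k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def distance_matrix(seqs, distance=kmer_distance):
    """Step 1: fill a symmetric q x q matrix, one call per unordered pair."""
    q = len(seqs)
    matrix = [[0.0] * q for _ in range(q)]
    for i, j in combinations(range(q), 2):
        matrix[i][j] = matrix[j][i] = distance(seqs[i], seqs[j])
    return matrix


# Made-up example sequences: the first two differ in one residue.
seqs = ["MKTAYIAK", "MKTAYIAR", "GGSLPWQE"]
matrix = distance_matrix(seqs)
```

Because each matrix entry depends only on its own two sequences, entries can be computed in any order, reused, or handed to different workers.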

Optimization: Query caching
• Step 1: Pairwise alignment, building distance matrix
  • Many requests partially duplicated
  • Individual distance calculation not dependent on rest of query
  • Observation: Dominant step in execution time
• Steps 2, 3:
  • Output dependent on results of entire query
  • Results less reusable
• Technique: cache output of step 1
  • Individual distances
[Figure: Query 1, Query 2]
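A minimal sketch of the caching idea, assuming an in-memory dictionary in place of the on-disk cache and a hypothetical `toy_distance` in place of the real pairwise step. The class and function names are illustrative, not taken from the paper. Two queries that share a pair of sequences pay for that pair only once.

```python
from itertools import combinations


def toy_distance(a, b):
    """Hypothetical stand-in: fraction of mismatched positions."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


class DistanceCache:
    """Cache of individual pairwise distances, shared across queries."""

    def __init__(self, distance):
        self._distance = distance
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, a, b):
        # Order the pair so (a, b) and (b, a) share one cache entry.
        key = (a, b) if a <= b else (b, a)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._distance(*key)
        return self._store[key]


def run_query(seqs, cache):
    """Step 1 for one query, going through the shared cache."""
    return {(a, b): cache.get(a, b) for a, b in combinations(seqs, 2)}


cache = DistanceCache(toy_distance)
q1 = run_query(["AAAA", "AAAT", "GGGG"], cache)  # three new pairs
q2 = run_query(["AAAA", "AAAT", "CCCC"], cache)  # one pair seen before
```

Steps 2 and 3 are not cached here, matching the slide's point that their output depends on the whole query.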

Challenges to cache implementation
• I/O and filesystem overhead
  • Large cache vs. 2GB file size limit
  • High seek times within single file
• Search and insertion overhead
  • Sequence: lengthy key
  • Keyed on each pair of sequences

Technique: 2-level B-Tree cache
• Level 1: Map sequence text to sequence ID
  • Hash of sequence?
  • Sequentially assigned number
  • Cache size: O(ql)
• Level 2: Map ID pairs to calculated distance
  • Concatenate IDs from level 1
  • Lower Level 1 ID -> upper half of Level 2 key
  • Cache size: O(q²)
• Distribute level 2 cache across bins
  • Round robin or block allocated
  • Distribute bins across machines
* q = number of sequences, l = mean sequence length
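The key layout above can be sketched as follows. Plain dictionaries stand in for the B-trees, and several details are assumptions made for illustration: 32-bit sequence IDs, four bins, and round-robin bin selection by key modulo the bin count. The class name `TwoLevelCache` is hypothetical.

```python
ID_BITS = 32  # assumption: each Level 1 ID fits in 32 bits


class TwoLevelCache:
    """Dict-based stand-in for the 2-level B-tree cache."""

    def __init__(self, num_bins=4):
        self.seq_to_id = {}                         # Level 1: text -> ID
        self.bins = [{} for _ in range(num_bins)]   # Level 2: key -> distance

    def seq_id(self, seq):
        """Assign IDs sequentially the first time a sequence is seen."""
        return self.seq_to_id.setdefault(seq, len(self.seq_to_id))

    def pair_key(self, seq_a, seq_b):
        """Concatenate IDs: lower ID in the upper half of the key."""
        lo, hi = sorted((self.seq_id(seq_a), self.seq_id(seq_b)))
        return (lo << ID_BITS) | hi

    def _bin_for(self, key):
        # Round-robin allocation (assumed): key modulo number of bins.
        return self.bins[key % len(self.bins)]

    def put(self, seq_a, seq_b, distance):
        key = self.pair_key(seq_a, seq_b)
        self._bin_for(key)[key] = distance

    def get(self, seq_a, seq_b):
        """Return the cached distance, or None if either part is unknown."""
        if seq_a not in self.seq_to_id or seq_b not in self.seq_to_id:
            return None
        key = self.pair_key(seq_a, seq_b)
        return self._bin_for(key).get(key)


cache = TwoLevelCache(num_bins=4)
cache.put("MKTAYIAK", "GGSLPWQE", 0.8)   # made-up distance value
found = cache.get("GGSLPWQE", "MKTAYIAK")  # order of the pair does not matter
```

The long sequence text is stored once in Level 1, so Level 2 keys stay short and fixed-width regardless of sequence length.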

SMP
• Parallelizable:
  • Pairwise searches performed independently
  • Farmed out to query threads
[Diagram: Web server; four Query Threads; Level 1 maps (per-machine); Level 2 bins (distributed)]

SMP
• Challenge: Cache coherence
  • Read-only?
    • Requires advance knowledge of query details
  • Online update and serialization?
    • Locking, duplicate updates
  • Offline updates?
    • Per-thread list of cache changes
[Diagram: four Query Threads; Level 1 maps (per-machine); Level 2 bins (distributed)]
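One way to read the "offline updates" option is sketched below, under stated assumptions: query threads only read the shared cache, each keeps its own list of misses, and a single merge step applies those lists afterwards. The function names, the dictionary cache, and `toy_distance` are all hypothetical; the paper's actual mechanism may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_distance(a, b):
    """Hypothetical stand-in: fraction of mismatched positions."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


def query_worker(pairs, cache):
    """Read the shared cache without locking; record misses locally."""
    results, pending = {}, []
    for a, b in pairs:
        key = (a, b) if a <= b else (b, a)
        value = cache.get(key)
        if value is None:
            value = toy_distance(*key)
            pending.append((key, value))  # per-thread list of cache changes
        results[key] = value
    return results, pending


def merge_offline(cache, pending_lists):
    """Apply all per-thread change lists once no query thread is running."""
    added = 0
    for pending in pending_lists:
        for key, value in pending:
            if key not in cache:  # drop duplicate updates from other threads
                cache[key] = value
                added += 1
    return added


cache = {("AAAA", "AAAT"): 0.25}
work = [
    [("AAAA", "AAAT"), ("AAAA", "GGGG")],
    [("GGGG", "AAAA"), ("AAAT", "GGGG")],
]
with ThreadPoolExecutor(max_workers=2) as pool:
    outcomes = list(pool.map(lambda pairs: query_worker(pairs, cache), work))
added = merge_offline(cache, [pending for _, pending in outcomes])
```

The trade-off is visible in the example: both threads compute the same missing pair, so work is duplicated, but no locking is needed while queries run.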

Evaluation: Implementation
• Public B-Tree implementation: GIST library
• First evaluation on Intel PC
  • (Pentium III 650, 75GB disks)
  • q = 25-1000 sequences
  • l = 450 amino acids per sequence
• Second evaluation on Sun Fire
  • (Sun Fire 6800, 48 × 750MHz CPUs, 48GB main memory)
  • l = 417 amino acids per sequence
  • q = 2-200 sequences
• Seeded cache with dummy values
• Future work: architectural impact

Evaluation: Results

Observations
• Simple technique
  • Cheap and easy to implement
  • Cheap and easy to deploy
• Unsupported claim: Are queries really similar?
• Concern about distribution across processors
  • Paper mentions latency, workload balancing
  • Also reliability of distributed bins
  • Cache lifetimes?
• Proposed solution: “component-based system”
  • “Hand-wavy”; would like to see more.
