Improving Performance of Multiple Sequence Alignment in Multi-client Environments
Aaron Zollman
CMSC 838T Presentation
Overview
• Overview of talk
• CLUSTALW algorithm, speedup opportunities
• Problems with caching
• Parallelizing technique
• Weaknesses
• Applying technique to other bioinformatics problems
Motivation
• Queries submitted to MSA tools overlap
  • Single researcher: new sequences vs. database
  • Multiple researchers: similar subsets
• CLUSTALW: progressive algorithm
  • Three steps
  • Progressive refinement
• Opportunities for speedup
  • Caching
  • Query ordering
CLUSTALW: Progressive global alignment
• Step 1: Pairwise alignment, distance matrix (sketch below)
  • Fast technique calculates the distance between two sequences
  • Calculated for all sequence pairs
  • Cost: O(q²l²)
• Step 2: Guide tree
  • Group nearest sequences first
  • Build tree sequentially
  • Cost: O(q³)
• Step 3: Progressive alignment
  • Align, starting at the leaves of the tree
  • Cost: O(ql²)
(* q sequences of mean length l)
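A minimal Python sketch of step 1, the phase the rest of the talk focuses on. The distance function below is a simplified mismatch fraction standing in for CLUSTALW's k-tuple or alignment-based score, and the toy sequences are made up; it only illustrates why this phase costs O(q²l²) when each pair is scored by a real alignment.

from itertools import combinations

def pairwise_distance(a, b):
    # Simplified stand-in for CLUSTALW's fast pairwise score: fraction of
    # positions that differ, plus a penalty for any length difference.
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / max(len(a), len(b), 1)

def distance_matrix(seqs):
    # Step 1: one distance per sequence pair -> q*(q-1)/2 entries.
    # With a real alignment-based score each entry costs O(l^2),
    # giving O(q^2 l^2) total for this phase.
    return {(i, j): pairwise_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

seqs = ["MKTAYIAKQR", "MKTAYIGKQR", "MLSAYIAKQRL"]   # toy amino-acid strings
print(distance_matrix(seqs))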
Optimization: Query caching
• Step 1: Pairwise alignment, building the distance matrix
  • Many requests partially duplicated
  • Individual distance calculation not dependent on the rest of the query
  • Observation: dominant step in execution time
• Steps 2, 3:
  • Output depends on the results of the entire query
  • Results less reusable
• Technique: cache the output of step 1 (sketch below)
  • Individual distances
[Diagram: overlapping queries (Query 1, Query 2)]
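A hedged sketch of the caching idea, reusing pairwise_distance from the step-1 sketch above: because each distance depends only on its two sequences, it can be memoized across queries. A plain in-process dict stands in here for the paper's disk-backed cache described on the next slide.

from itertools import combinations

# Reuses pairwise_distance() from the step-1 sketch above.
distance_cache = {}          # (seq_a, seq_b) -> distance, shared across queries

def cached_distance(a, b):
    key = (a, b) if a <= b else (b, a)     # order-independent key
    if key not in distance_cache:
        distance_cache[key] = pairwise_distance(*key)
    return distance_cache[key]

def distance_matrix_cached(seqs):
    # Overlapping queries recompute only the pairs they have not seen before;
    # steps 2 and 3 (guide tree, progressive alignment) still run per query.
    return {(i, j): cached_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}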
Challenges to cache implementation
• I/O and filesystem overhead
  • Large cache vs. 2 GB file size limit
  • High seek times within a single file
• Search and insertion overhead
  • Sequence text makes a lengthy key
  • Keyed on each pair of sequences
Technique: 2-level B-Tree cache (keying sketch below)
• Level 1: map sequence text to sequence ID
  • Hash of sequence?
  • Sequentially assigned number
  • Cache size: O(ql)
• Level 2: map ID pairs to calculated distance
  • Concatenate IDs from Level 1
  • Lower Level 1 ID goes in the upper half of the Level 2 key
  • Cache size: O(q²)
• Distribute the Level 2 cache across bins
  • Round-robin or block allocated
  • Distribute bins across machines
(* q sequences of mean length l)
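An in-memory sketch of the two-level keying scheme, with plain dicts standing in for the on-disk B-Trees of the actual implementation; ID_BITS, NUM_BINS, and the modulo bin assignment are assumptions for illustration, not values from the paper.

ID_BITS = 32                  # assumed width reserved for one sequence ID
NUM_BINS = 4                  # assumed number of Level 2 bins

level1 = {}                                  # sequence text -> sequential ID
level2 = [dict() for _ in range(NUM_BINS)]   # bins of packed-key -> distance

def sequence_id(seq):
    # Level 1: map (possibly very long) sequence text to a compact integer ID.
    if seq not in level1:
        level1[seq] = len(level1)
    return level1[seq]

def pair_key(id_a, id_b):
    # Level 2 key: lower ID in the upper half, higher ID in the lower half,
    # so (a, b) and (b, a) hit the same entry.
    lo, hi = sorted((id_a, id_b))
    return (lo << ID_BITS) | hi

def bin_for(key):
    # Spread Level 2 entries over bins (which could live on separate machines);
    # simple modulo here, standing in for round-robin or block allocation.
    return key % NUM_BINS

def put_distance(seq_a, seq_b, dist):
    key = pair_key(sequence_id(seq_a), sequence_id(seq_b))
    level2[bin_for(key)][key] = dist

def get_distance(seq_a, seq_b):
    key = pair_key(sequence_id(seq_a), sequence_id(seq_b))
    return level2[bin_for(key)].get(key)    # None on a cache miss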
SMP
• Parallelizable:
  • Pairwise searches performed independently
  • Farmed out to query threads (sketch below)
[Diagram: web server dispatching to query threads; Level 1 maps per machine; Level 2 bins distributed]
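A sketch of the farming-out step using Python's concurrent.futures thread pool; cached_distance comes from the earlier caching sketch, and the thread count is an arbitrary assumption.

from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def distance_matrix_parallel(seqs, num_threads=4):
    # Each pairwise distance is independent of the rest of the query, so the
    # web server can hand the pairs of one query to a pool of query threads.
    pairs = list(combinations(range(len(seqs)), 2))
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        dists = pool.map(lambda p: cached_distance(seqs[p[0]], seqs[p[1]]), pairs)
        return dict(zip(pairs, dists))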
SMP
• Challenge: cache coherence
• Read-only cache?
  • Requires advance knowledge of query details
• Online update and serialization?
  • Locking, duplicate updates
• Offline updates? (sketch below)
  • Per-thread list of cache changes
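One possible reading of the "offline updates" option, sketched with Python's threading module: during a query each thread only appends new distances to a private change list, and the lists are merged into the shared cache between queries, so no locking of the cache is needed while a query runs. The names record_distance and merge_changes are illustrative, not from the paper.

import threading

_local = threading.local()        # holds each query thread's change list
pending_changes = []              # change lists handed back for offline merge
_registry_lock = threading.Lock()

def record_distance(key, dist):
    # During a query: nothing is written to the shared cache; the new value is
    # only appended to this thread's private change list.
    if not hasattr(_local, "changes"):
        _local.changes = []
        with _registry_lock:
            pending_changes.append(_local.changes)
    _local.changes.append((key, dist))

def merge_changes(shared_cache):
    # Offline, between queries: apply every thread's change list serially.
    # Duplicate updates of the same pair simply overwrite with the same value.
    with _registry_lock:
        for changes in pending_changes:
            for key, dist in changes:
                shared_cache[key] = dist
            changes.clear()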
Evaluation: Implementation
• Public B-Tree implementation: GiST library
• First evaluation on an Intel PC
  • Pentium III 650 MHz, 75 GB disks
  • q = 25-1000 sequences
  • l = 450 amino acids per sequence
• Second evaluation on a Sun Fire
  • Sun Fire 6800, 48 × 750 MHz CPUs, 48 GB main memory
  • q = 2-200 sequences
  • l = 417 amino acids per sequence
• Seeded cache with dummy values
• Future work: architectural impact
Evaluation: Results
Observations
• Simple technique
  • Cheap and easy to implement
  • Cheap and easy to deploy
• Unsupported claim: are queries really similar?
• Concern about distribution across processors
  • Paper mentions latency, workload balancing
  • Also reliability of distributed bins
• Cache lifetimes?
  • Proposed solution: a "component-based system"
  • "Hand-wavy"; would like to see more