Improving performance of Multiple Sequence Alignment in Multi-client Environments

Aaron Zollman, CMSC 838 Presentation

Presentation Transcript


  1. Improving performance of Multiple Sequence Alignment in Multi-client Environments
     Aaron Zollman
     CMSC 838 Presentation

  2. Overview
     • Overview of talk
       • CLUSTALW algorithm, speedup opportunities
       • Problems with caching
       • Parallelizing technique
       • Weaknesses
       • Applying technique to other bioinformatics problems
     CMSC 838T – Presentation

  3. Motivation
     • Query overlap in queries submitted to MSA tools
       • Single researcher: new sequences vs. database
       • Multiple researchers: similar subsets
     • CLUSTALW: Progressive algorithm
       • Three steps
       • Progressive refinement
     • Opportunities for speedup
       • Caching
       • Query ordering

  4. CLUSTALW: Progressive global alignment
     • Step 1: Pairwise alignment, distance matrix
       • Fast technique calculates distance between two sequences
       • Calculated for all sequence pairs
       • Cost: O(q^2 l^2)
     • Step 2: Guide tree
       • Group nearest first
       • Build tree sequentially
       • Cost: O(q^3)
     • Step 3: Progressive alignment
       • Align, starting at leaves of tree
       • Cost: O(q l^2)
     * q sequences, mean length l
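Step 1 above can be sketched as a loop over every unordered pair of sequences, which is where the q^2 factor in its cost comes from. This is only an illustration: `ktuple_distance` is a hypothetical toy stand-in (one minus the Jaccard similarity of k-tuple sets), not CLUSTALW's actual fast pairwise scoring, and `distance_matrix` is a name chosen here for the example.

```python
from itertools import combinations


def ktuple_distance(a: str, b: str, k: int = 2) -> float:
    """Toy stand-in distance: 1 - Jaccard similarity of the k-tuple sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def distance_matrix(seqs):
    """Return {(i, j): distance} for every pair i < j: q*(q-1)/2 entries."""
    return {
        (i, j): ktuple_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }
```

Each entry depends only on the two sequences involved, which is the property the caching slides that follow rely on.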

  5. Optimization: Query caching
     • Step 1: Pairwise alignment, building distance matrix
       • Many requests partially duplicated
       • Individual distance calculation not dependent on rest of query
       • Observation: Dominant step in execution time
     • Steps 2, 3:
       • Output dependent on results of entire query
       • Results less reusable
     • Technique: cache output of step 1
       • Individual distances
     [Diagram labels: Query 1, Query 2]
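A minimal sketch of the idea on this slide, assuming an in-memory dictionary in place of the on-disk cache: distances are stored per sequence pair, so a second query that overlaps the first only pays for the pairs it has not seen. `PairCache`, `run_query` and `toy_distance` are hypothetical names, and the distance function is a placeholder mismatch count rather than a real alignment score.

```python
from itertools import combinations


def toy_distance(a: str, b: str) -> int:
    """Placeholder distance: mismatches plus length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


class PairCache:
    """Caches step 1 output (individual pairwise distances) across queries."""

    def __init__(self, distance_fn=toy_distance):
        self._fn = distance_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def distance(self, a: str, b: str):
        # Order the pair so (a, b) and (b, a) share one cache entry.
        key = (a, b) if a <= b else (b, a)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._fn(*key)
        return self._store[key]


def run_query(cache: PairCache, seqs):
    """Step 1 for one query: look up or compute every pairwise distance."""
    return {(a, b): cache.distance(a, b) for a, b in combinations(seqs, 2)}
```

For example, running `["AAA", "AAC", "ACC"]` and then `["AAC", "ACC", "CCC"]` through the same cache computes five distances instead of six, because the pair shared by both queries is reused.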

  6. Challenges to cache implementation
     • I/O and filesystem overhead
       • Large cache vs. 2GB file size limit
       • High seek times within single file
     • Search and insertion overhead
       • Sequence: lengthy key
       • Keyed on each pair of sequences

  7. Technique: 2-level B-Tree cache
     • Level 1: Map sequence text to sequence ID
       • Hash of sequence?
       • Sequentially assigned number
       • Cache size: O(ql)
     • Level 2: Map ID pairs to calculated distance
       • Concatenate IDs from level 1
       • Lower Level 1 ID -> upper half of Level 2 key
       • Cache size: O(q^2)
     • Distribute level 2 cache across bins
       • Round robin or block allocated
       • Distribute bins across machines
     * q sequences, mean length l
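The two-level keying can be sketched as follows. Several details here are assumptions made for illustration and are not stated on the slide: plain dictionaries stand in for the B-Trees, IDs are assumed to fit in 32 bits (`ID_BITS`), and bins are chosen with a simple modulo on the key as a stand-in for round-robin allocation. `TwoLevelCache` and its method names are hypothetical.

```python
ID_BITS = 32  # assumed ID width; the slide does not give one


class TwoLevelCache:
    def __init__(self, num_bins: int = 4):
        self.seq_ids = {}                               # level 1: text -> ID
        self.bins = [dict() for _ in range(num_bins)]   # level 2: key -> distance

    def seq_id(self, seq: str) -> int:
        """Sequentially assigned ID; reuses the ID if the sequence is known."""
        return self.seq_ids.setdefault(seq, len(self.seq_ids))

    def pair_key(self, a: str, b: str) -> int:
        """Concatenate the two IDs, lower ID in the upper half of the key."""
        lo, hi = sorted((self.seq_id(a), self.seq_id(b)))
        return (lo << ID_BITS) | hi

    def _bin(self, key: int) -> dict:
        # Assumed allocation policy: spread keys over bins by modulo.
        return self.bins[key % len(self.bins)]

    def put(self, a: str, b: str, distance: float) -> None:
        key = self.pair_key(a, b)
        self._bin(key)[key] = distance

    def get(self, a: str, b: str):
        """Return the cached distance, or None on a miss."""
        key = self.pair_key(a, b)
        return self._bin(key).get(key)
```

Sorting the IDs before concatenation means a pair maps to the same level 2 key regardless of the order in which the sequences appear in a query. Note that in this sketch a lookup also assigns IDs to sequences it has not seen before.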

  8. SMP
     • Parallelizable:
       • Pairwise searches performed independently
       • Farmed out to query threads
     [Diagram: Web server; four Query Threads; Level 1 maps (per-machine); Level 2 bins (distributed)]
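Because each pairwise distance is independent of the others, the farm-out step can be sketched with a thread pool. This is an illustration only: `parallel_distances` and `toy_distance` are hypothetical names, the distance is a placeholder mismatch count, and in CPython a thread pool will not speed up CPU-bound work of this kind, so the sketch shows the structure rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_distance(a: str, b: str) -> int:
    """Placeholder distance: mismatches plus length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def parallel_distances(seqs, workers: int = 4):
    """Farm each independent pairwise comparison out to a worker thread."""
    pairs = list(combinations(range(len(seqs)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        dists = pool.map(lambda p: toy_distance(seqs[p[0]], seqs[p[1]]), pairs)
        # map() yields results in input order, so zip lines them up with pairs.
        return dict(zip(pairs, dists))
```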

  9. SMP
     • Challenge: Cache coherence
       • Read-only?
         • Requires advance knowledge of query details
       • Online update and serialization?
         • Locking, duplicate updates
       • Offline updates?
         • Per-thread list of cache changes
     [Diagram: four Query Threads; Level 1 maps (per-machine); Level 2 bins (distributed)]
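One way to read the "offline updates" option is sketched below, under assumptions not spelled out on the slide: query threads treat the cache as read-only while they run, each keeps its own list of newly computed distances, and a single merge step applies those lists afterwards, dropping duplicates. `answer_query`, `run_batch` and `toy_distance` are hypothetical names, and a dictionary stands in for the real cache.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_distance(a: str, b: str) -> int:
    """Placeholder distance: mismatches plus length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def answer_query(snapshot, pairs):
    """Serve one query from a read-only cache; record misses locally."""
    results, pending = {}, []
    for pair in pairs:
        if pair in snapshot:
            results[pair] = snapshot[pair]
        else:
            results[pair] = toy_distance(*pair)
            pending.append((pair, results[pair]))  # per-thread change list
    return results, pending


def run_batch(cache, queries, workers: int = 4):
    """Run queries in threads, then merge their change lists offline."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(lambda q: answer_query(cache, q), queries))
    added = 0
    for _, pending in outcomes:          # single-threaded merge, no locking
        for pair, dist in pending:
            if pair not in cache:        # skip duplicate updates
                cache[pair] = dist
                added += 1
    return [results for results, _ in outcomes], added
```

The trade-off in this sketch is that two concurrent queries may both compute the same missing distance, since neither sees the other's pending list until the merge.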

  10. Evaluation: Implementation
     • Public B-Tree implementation: GIST library
     • First evaluation on Intel PC
       • (Pentium III 650, 75GB disks)
       • q = 25-1000 sequences
       • l = 450 amino acids per sequence
     • Second evaluation on Sun Fire
       • (Sun Fire 6800, 48*750MHz CPUs, 48GB main memory)
       • l = 417 amino acids per sequence
       • q = 2-200 sequences
     • Seeded cache with dummy values
     • Future work: architectural impact

  11. Evaluation: Results

  12. Observations
     • Simple technique
       • Cheap and easy to implement
       • Cheap and easy to deploy
     • Unsupported claim: Are queries really similar?
     • Concern about distribution across processors
       • Paper mentions latency, workload balancing
       • Also reliability of distributed bins
     • Cache lifetimes?
     • Proposed solution: “component-based system”
       • “Hand-wavey”; would like to see more.