High-Performance Algorithm Engineering for Computational Phylogenetics [B Moret, D Bader] Kexue Liu CMSC 838 Presentation
Motivation • Phylogeny reconstruction from molecular data • Poses a complex optimization problem • NP-hard and thus computationally intractable • High-performance algorithm engineering • Reduces the running time of existing phylogenetic algorithms CMSC 838T – Presentation
Talk Overview • Overview of talk • Background • Breakpoint Phylogeny • Breakpoint Analysis • Re-Engineering Techniques • Impact in Computational Biology • Observations
Background • Algorithm Engineering • Transform a pencil-and-paper algorithm into an efficient, robust implementation. • Main focus is experimentation • High Performance Algorithm Engineering • Running time and quality of the solution as the paramount goal • Includes parallelism • Refining serial part of the code • Cache-aware programming is a key to performance
Background • Phylogeny • Reconstruction of the evolutionary history of a collection of organisms • Takes the form of an evolutionary tree • Computational Phylogenetics • Is extremely computation-intensive • Methods for sequence data (DNA, RNA, amino acid, protein) do not scale up to whole genomes • Genome-level data • At this level, evolution is slow • Enables us to recover deep evolutionary relationships • Much harder to analyze than sequence data • Optimization criteria • Heuristics • Parsimony criterion • Maximum likelihood
Breakpoint Phylogeny • Deals with simple genomic data • Organisms have a single chromosome or contain single-chromosome organelles • Each chromosome can be represented by an ordering of oriented genes. • The evolutionary process includes inversion, transposition, insertion, deletion and duplication. • Approaches • Construct the most parsimonious tree • Known or conjectured to be NP-hard • No automated tool to solve it • Neighbor-joining heuristics • Fast and valuable • Cannot recover the ancestral gene orders. • Breakpoint phylogeny by Blanchette and Sankoff.
Breakpoint Phylogeny • A more specialized case: • All the genomes have the same set of genes • Each gene appears once. • Is of interest to biologists • Inversions are the main evolutionary mechanism on such genomes • Works well for certain datasets. • Implementation developed by Sankoff and Blanchette • Breakpoint Analysis • Too slow to be used on anything other than small datasets with a few genes.
Breakpoint Analysis: Details • Breakpoint: • Given two genomes G and G’ with the same set of genes, each gene appearing exactly once in each genome • An ordered pair of genes (gi, gj) appears as an adjacency in G • Neither (gi, gj) nor (-gj, -gi) appears in G’ • Breakpoint Distance • Number of breakpoints between two genomes. • Median for three genomes • The genome that minimizes the sum of breakpoint distances to the three given genomes • Median Problem for Breakpoints (MPB) • Construct a median of the given genomes • NP-hard
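The breakpoint definition above can be sketched in a few lines of Python. This is a minimal illustration, not the BPAnalysis or GRAPPA code: genomes are assumed to be lists of nonzero signed integers (the sign encodes gene orientation), and the function names and the `circular` flag (on by default, since chloroplast chromosomes are circular) are hypothetical choices made for this example.

```python
def adjacencies(genome, circular=True):
    """Return the ordered pairs of consecutive signed genes in a genome."""
    pairs = list(zip(genome, genome[1:]))
    if circular and len(genome) > 1:
        # Assumption: a circular chromosome also has a wrap-around adjacency.
        pairs.append((genome[-1], genome[0]))
    return pairs


def breakpoint_distance(g, g_prime, circular=True):
    """Count adjacencies (gi, gj) of g for which neither (gi, gj)
    nor (-gj, -gi) appears in g_prime."""
    present = set()
    for a, b in adjacencies(g_prime, circular):
        present.add((a, b))      # same reading direction
        present.add((-b, -a))    # same adjacency read on the other strand
    return sum(1 for pair in adjacencies(g, circular) if pair not in present)


if __name__ == "__main__":
    g = [1, 2, 3, 4, 5]
    inverted = [1, -3, -2, 4, 5]  # genes 2..3 inverted
    print(breakpoint_distance(g, inverted))  # 2
```

An inversion of a segment creates at most two breakpoints, one at each end of the segment, which is what the example shows.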
Breakpoint Analysis • Method developed by Sankoff and Blanchette to solve breakpoint phylogeny • Uses a reduction from MPB to the Travelling Salesman Problem (TSP) • Directed MPB to undirected TSP • Each gene is represented by a pair of cities connected by an edge • Outer loop enumerates all (2n-5)!! trees on n leaves • Inner loop runs an unknown number of iterations • Computational complexity is exponential in both the number of genomes and the number of genes.
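To give a sense of how quickly the outer loop grows, the following sketch evaluates the double factorial (2n-5)!! = 1 · 3 · 5 · … · (2n-5). The function name is hypothetical and chosen for this example only.

```python
def num_unrooted_binary_trees(n_leaves):
    """Number of unrooted binary trees on n labelled leaves: (2n-5)!!."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n_leaves - 5 + 1, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


if __name__ == "__main__":
    for n in (5, 10, 13):
        print(n, num_unrooted_binary_trees(n))
    # 5 15
    # 10 2027025
    # 13 13749310575
```

Going from 10 to 13 leaves multiplies the number of trees by 17 · 19 · 21 = 6783, which is why every added genome is so costly for an exhaustive search.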
Breakpoint Analysis
Initially label all internal nodes with gene orders
Repeat
    For each internal node v, with neighbors A, B, C, do
        Solve the MPB on A, B, C to yield label m
        If relabelling v with m improves the score of T, then do it
until no internal node can be relabelled
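The iterative relabelling loop can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the tree is assumed to be given as a list of undirected edges, and the median solver and distance function are passed in as parameters (`solve_median` and `distance` are hypothetical names), because solving the MPB itself is the hard part and is outside the scope of this sketch. Since relabelling v only affects its three incident edges, comparing the local cost is equivalent to comparing the score of T.

```python
def tree_score(edges, labels, distance):
    """Sum of distances over all tree edges."""
    return sum(distance(labels[u], labels[v]) for u, v in edges)


def iterative_relabel(edges, labels, internal, solve_median, distance):
    """Repeatedly replace an internal label by the median of its three
    neighbors whenever that lowers the tree score; stop when no node improves."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)

    labels = dict(labels)  # work on a copy
    improved = True
    while improved:
        improved = False
        for v in internal:
            a, b, c = (labels[w] for w in neighbors[v])
            candidate = solve_median(a, b, c)
            old_cost = sum(distance(labels[v], x) for x in (a, b, c))
            new_cost = sum(distance(candidate, x) for x in (a, b, c))
            if new_cost < old_cost:  # strict improvement guarantees termination
                labels[v] = candidate
                improved = True
    return labels


if __name__ == "__main__":
    # Toy stand-in: labels are integers, distance is |a - b|, and the median
    # of three numbers is the middle one. Real labels would be gene orders.
    edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
    start = {"A": 1, "B": 2, "C": 10, "D": 12, "x": 0, "y": 0}
    dist = lambda p, q: abs(p - q)
    median = lambda a, b, c: sorted((a, b, c))[1]
    final = iterative_relabel(edges, start, ["x", "y"], median, dist)
    print(tree_score(edges, start, dist), tree_score(edges, final, dist))  # 25 11
```

The toy integer labels keep the example runnable; plugging in a breakpoint distance and an MPB solver would give the structure of the inner loop described on the slide.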
Re-Engineering Techniques • Profiling: • Identify bottlenecks to balance the implementation • Eliminate problems such as excessive resource consumption or poor results. • Examples: • Hand-unrolling loops cut the running time by a factor of at least six. • Refine distance computations • Refine lower bound computations • Speed-up by one order of magnitude on the Campanulaceae dataset
Re-Engineering Techniques • Cache Awareness • Memory footprint • BPAnalysis: 60 MB • GRAPPA: 1.8 MB • Memory locality • BPAnalysis: poor locality, working set size of about 12 MB • GRAPPA: good locality, working set size of about 600 KB • Minimizing pointer dereferencing • Reuses allocated storage • Studies indicate that the gain is likely to be a factor of anywhere from 2 to 40
Re-Engineering Techniques • Low-level Algorithmic Changes • Using all of the available information • Examples: • Using a lower bound to eliminate over 95% of the trees. • Take advantage of special structure: the TSP has only two nontrivial kinds of edges (cost 1 and cost 2) • Speed-up by a factor of 5-10.
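The slide does not spell out the lower bound, so the sketch below shows one bound that follows from the triangle inequality and is offered only as an assumption about the kind of test involved: if the leaves are listed in the order in which a walk around the tree meets them, that walk crosses every edge twice, so half the sum of the pairwise distances between consecutive leaves cannot exceed the tree score. The function names, the distance-matrix layout and the pruning rule are hypothetical.

```python
def circular_lower_bound(leaf_order, dist):
    """Half the sum of distances between consecutive leaves in a circular
    ordering, rounded up (tree scores are assumed to be integers)."""
    n = len(leaf_order)
    total = 0
    for i in range(n):
        a = leaf_order[i]
        b = leaf_order[(i + 1) % n]  # wrap around to close the tour
        total += dist[a][b]
    return (total + 1) // 2


def can_prune(leaf_order, dist, best_score):
    """A tree whose bound already reaches the best score found so far
    cannot improve on it, so it need not be scored at all."""
    return circular_lower_bound(leaf_order, dist) >= best_score


if __name__ == "__main__":
    # Hypothetical pairwise distances between four leaf genomes.
    dist = [
        [0, 2, 5, 5],
        [2, 0, 5, 5],
        [5, 5, 0, 3],
        [5, 5, 3, 0],
    ]
    print(circular_lower_bound([0, 1, 2, 3], dist))  # 8
    print(can_prune([0, 1, 2, 3], dist, best_score=8))  # True
```

Because the bound needs only precomputed leaf-to-leaf distances, it is far cheaper than running the median iteration on a tree, which is what makes discarding most trees this way worthwhile.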
Re-Engineering Techniques: Parallel Aspects • Efficient Tree Generation • Avoid unbounded-precision arithmetic • Allow generation from any count with a variable gap • Provides parallel generation and also sampling of the search space • Portable MPI implementation, each processor handles a fraction of the trees. • On the 512-processor Alliance cluster Los Lobos at UNM, obtained a 512-fold speedup. • Summary of speedups: • Profiling: one order of magnitude • Cache awareness: factors of anywhere from 2 to 40 • Low-level algorithmic changes: 5-10 • 512-processor parallelism: 512 • Overall, GRAPPA demonstrated a million-fold speedup over the original implementation
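One way to generate a tree "from any count" is to treat the tree number as a mixed-radix integer: when the k-th leaf is added to a tree on k-1 leaves there are 2k-5 edges to attach it to, so each digit selects one edge. The sketch below illustrates that idea together with a strided split of tree numbers across processors. It is an assumption about how such a scheme can work, not GRAPPA's actual generator, and the function names are hypothetical. Python integers have arbitrary precision, so the sketch does not demonstrate the bounded-precision arithmetic mentioned on the slide.

```python
def index_to_choices(index, n_leaves):
    """Decode a tree number into edge choices: leaf k (k = 4..n) is attached
    to one of the 2k-5 edges of the tree built so far."""
    if index < 0:
        raise ValueError("index must be non-negative")
    choices = []
    for k in range(4, n_leaves + 1):
        radix = 2 * k - 5
        index, digit = divmod(index, radix)
        choices.append(digit)
    if index != 0:
        raise ValueError("index exceeds the number of trees")
    return choices


def indices_for_rank(rank, n_procs, total_trees):
    """Tree numbers handled by one processor: start at `rank` and advance
    by a fixed gap of `n_procs`."""
    return range(rank, total_trees, n_procs)


if __name__ == "__main__":
    total = 15  # (2*5 - 5)!! trees on 5 leaves
    for rank in range(4):
        work = [index_to_choices(i, 5) for i in indices_for_rank(rank, 4, total)]
        print(rank, work)
```

Because every processor can decode its own tree numbers independently, no communication is needed during enumeration, and choosing a larger gap gives a sample of the search space instead of a full sweep.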
Evaluation: the Bluebell Family • Dataset: full gene sequences for the chloroplasts of 12 species of Campanulaceae (bluebells), plus tobacco. • Chloroplast • A semi-independent organelle that lives within plant cells and allows them to photosynthesize. • Has a single chromosome with about 120 genes. • Optimization target: reconstruct the phylogeny with the least total amount of genomic change. • Environment: 512-processor Los Lobos supercluster at UNM • Results: • Speedup of three to four orders of magnitude in the serial part • Total speedup of over one million
Phylogeny of Bluebell Family
Impact in Computational Biology • Much faster implementations • Alter the practice of research in biology and medicine • Reducing the time of an analysis from two years down to a day • Makes an enormous difference in the pace and cost of drug discovery and development • Fast and accurate analysis software • Enables researchers to pursue more leads and develop better intuition on small datasets • Form new conjectures about biological mechanisms
Observations • Algorithm re-engineering • Uncovers salient characteristics of the algorithm • Enables us to develop better algorithms • Example: a true linear-time algorithm for computing inversion distance was found during the development of GRAPPA. • Can be applied to any existing bioinformatics algorithm • Several have been engineered for performance, such as BLAST • Limited benefits in theoretical terms when applied to NP-hard optimization problems • Does not scale up to “industrial strength” • GRAPPA only makes it possible to move from 10 taxa to 13 taxa
Thank you