Zhaoming Yin Bader-Polo Joint Group Meeting, Nov 11, 2013

DCJUC: A Maximum Parsimony Simulator for Constructing Phylogenetic Tree of Genomes with Unequal Contents

Contribution
• Research Aspect
- A framework for solving the maximum parsimony tree problem on input genomes with unequal contents.
- Proved that adequate subgraph theory is applicable to unequal-content data, which reduces the search space.
- Provides a benchmark for the HPC community.
• Engineering Aspect
- Implemented software with many state-of-the-art features, such as the supertree method, the GAS initialization method, and spectral partition.
- The software produces a tree with not only the topology, but also the type and number of the different evolution events (visualization!).

Why Is the Phylogenetic Tree Problem Hard?
• For N genomes, there are (2N-5)!! possible unrooted binary tree topologies.
• For each topology, we need to compute at least one median; the number of possible median orders is (g-2)!!, where g is the number of genes.
• To validate each possible median, if the gene content has duplications, the problem is NP-hard.
• So the complexity of computing the MP tree for genomes with unequal contents is: NP-hard over NP-hard over NP-hard!
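To give a feel for how fast the topology count grows, here is a minimal sketch that evaluates the standard count of unrooted binary topologies on N labeled leaves, (2N-5)!!. The function names are illustrative and are not part of DCJUC.

```python
def double_factorial(k):
    """Return k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_topologies(n_leaves):
    """Count unrooted binary tree topologies on n_leaves labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n_leaves - 5)


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, num_unrooted_topologies(n))
```

Even for 12 genomes the count already exceeds 650 million, which is why exhaustive enumeration of topologies is not practical.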

Phylogenetic Tree This picture presents the phylogeny of the “12 Drosophila.” From http://insects.eugenes.org/species

Maximum Parsimony Concept
[Figure: candidate tree topologies over leaves 1 to 6]
Of all possible topologies, the maximum parsimony tree is the one that has the minimum total tree length.

Genome Rearrangement http://ai.stanford.edu/~serafim/CS374_2006/presentations/lecture17.ppt

Genome Rearrangement
In the 1980s, Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip. The genes were 99% similar, yet these surprisingly identical gene sequences differed in gene order. This study helped pave the way to analyzing genome rearrangements in molecular evolution.
Original: 1 2 3 4 5 6 7 8 9 10
Inversion: 1 2 -6 -5 -4 -3 7 8 9 10
Transposition: 1 2 7 8 3 4 5 6 9 10
Inverted Transposition: 1 2 7 8 -6 -5 -4 -3 9 10
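The three operations can be sketched on a genome stored as a list of signed integers. This is only an illustration: the function names and the half-open index convention (segment `genome[i:j]`, moved to just before position `k`) are assumptions, not taken from the talk.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] so that it sits just before position k (k >= j)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Move genome[i:j] before position k and invert it on the way."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


if __name__ == "__main__":
    original = list(range(1, 11))
    print(inversion(original, 2, 6))
    print(transposition(original, 2, 6, 8))
    print(inverted_transposition(original, 2, 6, 8))
```

With the segment 3 4 5 6 (indices 2 to 6) and target position 8, the three calls reproduce the three example gene orders on the slide.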

Genome Median Computation
[Figure: tree diagrams with nodes labeled 1 to 6]

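Genome Median Computation
[Figure: three genomes (1,2,3), (1,-3,-2), and (-2,-1,3) attached to a median node in a tree with nodes labeled 1 to 6]
Candidate medians: 1,2,3 = 2 moves; 2,-1,3 = 5 moves; …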
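For genomes this small, the median can be found by brute force. The sketch below assumes that a "move" is a single inversion on a linear signed permutation; the distance used on the slide may be different (for example DCJ), so only the three genomes are taken from the slide. All names are illustrative.

```python
from collections import deque


def one_inversion_neighbors(genome):
    """Yield every genome reachable by one inversion (reverse a segment, flip signs)."""
    n = len(genome)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield genome[:i] + tuple(-g for g in reversed(genome[i:j])) + genome[j:]


def inversion_distances(source):
    """Breadth-first search from source; maps each genome to its inversion distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for nxt in one_inversion_neighbors(cur):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def best_median(leaves):
    """Return (median, score) minimizing the summed distance to all leaves."""
    tables = [inversion_distances(leaf) for leaf in leaves]
    scores = {m: sum(t[m] for t in tables) for m in tables[0]}
    best = min(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    leaves = [(1, 2, 3), (1, -3, -2), (-2, -1, 3)]
    print(best_median(leaves))
```

The search visits all 48 signed orders of three genes, which is only feasible because the example is tiny; this is the blow-up that the branch-and-bound and heuristic steps later in the talk are meant to avoid.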

Step 1: Spectral Partition

Step 2: Compute MP Tree for Each Sub-Disk

Step 2-1: How to Compute Median (BNB)
[Figure: branch-and-bound (BNB) search illustrated on graphs over vertices 1 to 8]

Step 2-2: How to Compute Median (LK)
[Figure: sequence of search steps ending in "stop"]

Step 2-2: How to Evaluate Median
Genome 1: 1, 2, 3, 4, 3, 6, 5
Median m: 1, 2, 3, 3, 4, 6, 5
Genome 2: 1, 2, 3, 4, 6, 3, 5
Genome 3: 1, 2, 5, 4, 6, 3, 3
Score: Dis(m,1) + Dis(m,2) + Dis(m,3)

Step 2-2: How to Evaluate Median
1, 2, 3, 3, 4, 6, 5 and 1, 2, 3, 4, 3, 5
Find a mapping first (NP-hard): dis = 1
1, 2, 3, 3, 4, 6, 5 and -2, -1, 3, 3, 4, 5
Complete the loss (polynomial): dis = 2
1, 2, 3, 4, 6, 5 and -2, -1, 3, 4, 6, 5
Compute DCJ (polynomial): dis = 3
1, 2, 3, 4, 6, 5 and 1, 2, 3, 4, 6, 5

Step 3: Merge Disks
• Decomposition of the disks
• Construct a tree for each disk
• Merge the trees using a specific consensus method: strict, majority, etc.
• Disambiguation
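The two consensus rules named above can be sketched as follows. The slide does not say how trees are represented, so this example assumes each tree is given as a set of clades (frozensets of leaf names) over the same leaf set; merging disks that only partly overlap needs more machinery than shown here.

```python
from collections import Counter


def strict_consensus(trees):
    """Keep only the clades that appear in every input tree."""
    return set.intersection(*(set(tree) for tree in trees))


def majority_consensus(trees):
    """Keep the clades that appear in more than half of the input trees."""
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade for clade, k in counts.items() if k > len(trees) / 2}


if __name__ == "__main__":
    ab, abc, cd = frozenset("AB"), frozenset("ABC"), frozenset("CD")
    trees = [{ab, abc}, {ab, abc}, {ab, cd}]
    print(strict_consensus(trees))
    print(majority_consensus(trees))
```

Strict consensus is the more conservative rule: a clade survives only if no input tree contradicts it by omission, whereas majority consensus keeps any clade that more than half of the trees support.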

Step 4: Initialization
• Init by insertion, which is local.
• Init by prospection, which is global.
[Figure: tree diagrams with labels 1 to 6, b, c, d, e, and X]

Step 5: Iterative Refinement
[Figure: tree with nodes labeled 1 to 4, a, and b]

Review
• Step 1: Spectral partition
• Step 2: Subtree construction
• Step 3: Supertree merge
• Step 4: Initialization of the complete tree using the General Adequate Subgraph (GAS) method
• Step 5: Iterative refinement until the complete tree converges

Result – Simulated Data
[Figure: model tree grown from a seed with #Theta + #gamma + #phi operations]
• We grow our own tree.
• We know the total number of evolution events in the model tree.

Result – Accuracy
• % of duplication: 0.1
• % of loss: 0.1
• Theta is the % of inversion.
• There are 8 species, so the tree has 2*8-3 = 13 edges.
• The average accuracy is ~90%.

Result – Real Data
SCRaMbLE Matrix
• We can represent a SCRaMbLEd strain by its vector.
• The sign gives the orientation.
• The color encodes the position in the synthetic chromosome.

Result – Real Data #inversion:#insertion/deletion:#duplication

Parallel Method [Bader 05] Load Balancing Parallel search

Experimental Results (Parallel)

Why Many-core BnB?
• There are already many distributed-memory MIP BnB frameworks (PICO, PEBBL, ALPS, COIN-OR).
• Load balance in distributed BnB relies heavily on the ramp-up phase; run-time load balancing is not efficient.
• But nowadays petaflops machines are mostly hybrid systems (distributed + many-core or accelerators).

Experimental Results (Intel Phi knapsack)
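For readers unfamiliar with the benchmark problem, here is a minimal serial branch-and-bound sketch for the 0/1 knapsack problem. It is not the many-core implementation measured in the talk; the function name, the depth-first stack, and the fractional (greedy) upper bound are assumptions chosen for brevity, and item weights are assumed to be positive.

```python
def knapsack_bnb(values, weights, capacity):
    """Return the best total value for a 0/1 knapsack via branch and bound."""
    # Sort by value density so the greedy fractional bound is valid.
    items = sorted(zip(values, weights), key=lambda vw: vw[0] / vw[1], reverse=True)
    n = len(items)

    def upper_bound(i, value, room):
        """Optimistic value: fill greedily, taking a fraction of the first misfit."""
        for v, w in items[i:]:
            if w <= room:
                value += v
                room -= w
            else:
                return value + v * room / w
        return value

    best = 0
    stack = [(0, 0, capacity)]  # (next item index, value so far, remaining room)
    while stack:
        i, value, room = stack.pop()
        best = max(best, value)
        if i == n or upper_bound(i, value, room) <= best:
            continue  # leaf reached, or this subtree cannot beat the incumbent
        v, w = items[i]
        stack.append((i + 1, value, room))  # branch: skip item i
        if w <= room:
            stack.append((i + 1, value + v, room - w))  # branch: take item i
    return best


if __name__ == "__main__":
    print(knapsack_bnb([60, 100, 120], [10, 20, 30], 50))
```

The explicit stack is the pool of open subproblems; in a parallel BnB this pool is what has to be shared or balanced across workers.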

