DCJUC: A Maximum Parsimony Simulator for Constructing Phylogenetic Tree of Genomes with Unequal Contents Zhaoming Yin Bader-Polo Joint Group Meeting, Nov 11, 2013
Contribution
• Research Aspect
- A framework to solve the maximum parsimony tree problem for input genomes with unequal contents.
- Proved that adequate subgraph theory is applicable to unequal-content data, which reduces the search space.
- Provides a benchmark for the HPC community.
• Engineering Aspect
- Implemented software with many state-of-the-art features, such as the supertree method, the GAS initialization method, and spectral partitioning.
- The software produces a tree with not only the topology but also the type and number of the different evolutionary events (visualization!).
Why Is the Phylogenetic Tree Problem Hard?
• For N genomes, there are (2N-5)!! possible unrooted binary tree topologies.
• For each topology, we need to compute at least one new median; the number of possible median gene orders is (g-2)!!, where g is the number of genes.
• To validate each possible median, if the gene content has duplications, the problem is NP-hard.
• So the complexity of computing the MP tree for genomes with unequal contents is: NP-hard over NP-hard over NP-hard!
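To get a feel for how quickly the topology count grows, here is a minimal sketch. It assumes unrooted binary trees, for which the standard count on N labeled leaves is (2N-5)!!; the function names are illustrative and not part of DCJUC.

```python
def double_factorial(n):
    """Return n!! = n * (n-2) * (n-4) * ... down to 1 or 2 (1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def unrooted_topologies(num_leaves):
    """Number of unrooted binary tree topologies on num_leaves labeled leaves."""
    if num_leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * num_leaves - 5)


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, unrooted_topologies(n))
```

Even at 12 genomes the count is already in the hundreds of millions, which is why exhaustive enumeration of topologies is not an option.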
Phylogenetic Tree This picture presents the phylogeny of the “12 Drosophila.” From http://insects.eugenes.org/species
Maximum Parsimony Concept
[Figure: alternative tree topologies on leaves 1 to 6]
Of all possible topologies, the maximum parsimony tree is the one that has the minimum total tree length.
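The selection rule can be sketched in a few lines. This is only an illustration: the position-mismatch distance below is a stand-in chosen to keep the example short (it is not the rearrangement distance used by DCJUC), and the genomes, node names, and topology labels are made up.

```python
def mismatch_distance(g1, g2):
    """Stand-in edge cost: number of positions where two gene orders differ."""
    return sum(a != b for a, b in zip(g1, g2))


def tree_length(edges, genomes, distance):
    """Total tree length: sum of the distances along every edge."""
    return sum(distance(genomes[u], genomes[v]) for u, v in edges)


def most_parsimonious(candidates, distance):
    """Return the name of the candidate tree with minimum total length."""
    return min(
        candidates,
        key=lambda name: tree_length(*candidates[name], distance),
    )


# Hypothetical leaf genomes A..D.
leaves = {"A": (1, 2, 3, 4), "B": (1, 2, 3, 4), "C": (1, 2, 4, 3), "D": (1, 2, 4, 3)}

# Each candidate: (edge list, genomes at all nodes incl. internal nodes x, y).
candidates = {
    "AB|CD": (
        [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")],
        {**leaves, "x": (1, 2, 3, 4), "y": (1, 2, 4, 3)},
    ),
    "AC|BD": (
        [("A", "x"), ("C", "x"), ("x", "y"), ("B", "y"), ("D", "y")],
        {**leaves, "x": (1, 2, 3, 4), "y": (1, 2, 3, 4)},
    ),
}

best = most_parsimonious(candidates, mismatch_distance)
```

In a real solver the internal genomes are not given; they are the medians that have to be computed for each topology.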
Genome Rearrangement http://ai.stanford.edu/~serafim/CS374_2006/presentations/lecture17.ppt
Genome Rearrangement
In the 1980s, Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip. The genes were 99% similar, yet these nearly identical gene sequences differed in gene order. This study helped pave the way to analyzing genome rearrangements in molecular evolution.
Original order: 1 2 3 4 5 6 7 8 9 10
Inversion: 1 2 -6 -5 -4 -3 7 8 9 10
Transposition: 1 2 7 8 3 4 5 6 9 10
Inverted Transposition: 1 2 7 8 -6 -5 -4 -3 9 10
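The three operations on this slide can be reproduced with a short sketch. It assumes a genome is a Python list of signed integers and uses half-open slice indices; the function names and index conventions are illustrative choices, not taken from the talk.

```python
def inversion(genome, i, j):
    """Reverse the block genome[i:j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the block genome[i:j] so that it follows the block genome[j:k]."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like transposition, but the moved block is also inverted."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


genome = list(range(1, 11))

print(inversion(genome, 2, 6))
# [1, 2, -6, -5, -4, -3, 7, 8, 9, 10]
print(transposition(genome, 2, 6, 8))
# [1, 2, 7, 8, 3, 4, 5, 6, 9, 10]
print(inverted_transposition(genome, 2, 6, 8))
# [1, 2, 7, 8, -6, -5, -4, -3, 9, 10]
```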
Genome Median Computation
[Figure: tree diagrams on nodes labeled 1 to 6]
Genome Median Computation
[Figure: tree on nodes 1 to 6 with the genomes 1,2,3; 1,-3,-2; and -2,-1,3]
Candidate medians: 1,2,3 = 2 moves; 2,-1,3 = 5 moves; ...
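A brute-force sketch of median scoring on the three genomes from this slide follows. It assumes a plain signed-inversion distance found by breadth-first search, which is only feasible for toy sizes and is a stand-in for the distance model used in the talk, so scores for other candidates need not match the slide. All function names are illustrative, and genes are assumed to be labeled 1..n.

```python
from collections import deque
from itertools import permutations, product


def inversions(genome):
    """Yield every genome reachable from `genome` by one signed inversion."""
    n = len(genome)
    for i in range(n):
        for j in range(i + 1, n + 1):
            flipped = tuple(-g for g in reversed(genome[i:j]))
            yield genome[:i] + flipped + genome[j:]


def inversion_distance(source, target):
    """Minimum number of inversions from source to target (BFS, toy sizes only)."""
    source, target = tuple(source), tuple(target)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        genome, steps = queue.popleft()
        if genome == target:
            return steps
        for nxt in inversions(genome):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    raise ValueError("target not reachable: gene contents differ")


def median_score(candidate, genomes):
    """Sum of the distances from the candidate median to each input genome."""
    return sum(inversion_distance(candidate, g) for g in genomes)


def best_median(genomes):
    """Exhaustively try every signed gene order and keep the lowest score."""
    n = len(genomes[0])
    candidates = (
        tuple(s * g for s, g in zip(signs, order))
        for order in permutations(range(1, n + 1))
        for signs in product((1, -1), repeat=n)
    )
    return min(candidates, key=lambda c: median_score(c, genomes))


genomes = [(1, 2, 3), (1, -3, -2), (-2, -1, 3)]
score = median_score((1, 2, 3), genomes)  # 2, as on the slide
```

The exhaustive loop over candidates is exactly what branch-and-bound and heuristic searches are meant to avoid on realistic genome sizes.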
Step 2-1: How to Compute Median (BNB)
[Figure: branch-and-bound search drawn on graphs with vertices 1 to 8]
Step 2-2: How to Compute Median (LK)
[Figure: sequence of search steps ending at "stop"]
Step 2-2: How to Evaluate Median
Genome 1: 1, 2, 3, 4, 3, 6, 5
Genome 2: 1, 2, 3, 4, 6, 3, 5
Genome 3: 1, 2, 5, 4, 6, 3, 3
Median (med): 1, 2, 3, 3, 4, 6, 5
Score: Dis(m,1) + Dis(m,2) + Dis(m,3)
Step 2-2: How to Evaluate Median
1, 2, 3, 3, 4, 6, 5 | 1, 2, 3, 4, 3, 5
• Find a mapping first (NP-hard): dis = 1
1, 2, 3, 3, 4, 6, 5 | -2, -1, 3, 3, 4, 5
• Complete the loss (polynomial): dis = 2
1, 2, 3, 4, 6, 5 | -2, -1, 3, 4, 6, 5
• Compute DCJ (polynomial): dis = 3
1, 2, 3, 4, 6, 5 | 1, 2, 3, 4, 6, 5
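The final, polynomial step can be illustrated with a minimal sketch of the DCJ distance for the simplest case: two genomes with equal content, no duplicated genes, and a single linear chromosome each. It uses the standard adjacency-graph formula d = N - (C + I/2), where N is the number of genes, C the number of cycles, and I the number of odd-length paths. The function names are illustrative, and this is not the DCJUC implementation, which also has to handle duplications and losses.

```python
def extremities(gene):
    """Return (left, right) extremities of a signed gene; 't' = tail, 'h' = head."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def adjacency_partners(genome):
    """Map each extremity to its neighbor across an adjacency.

    Telomeres (the two chromosome ends) have no partner and are left out.
    """
    partner = {}
    for a, b in zip(genome, genome[1:]):
        right_of_a = extremities(a)[1]
        left_of_b = extremities(b)[0]
        partner[right_of_a] = left_of_b
        partner[left_of_b] = right_of_a
    return partner


def dcj_distance(genome_a, genome_b):
    """DCJ distance between two linear, duplication-free, equal-content genomes."""
    genes = sorted(abs(g) for g in genome_a)
    if genes != sorted(abs(g) for g in genome_b) or len(set(genes)) != len(genes):
        raise ValueError("genomes must share the same duplicate-free gene set")
    pa, pb = adjacency_partners(genome_a), adjacency_partners(genome_b)
    seen = set()
    cycles = odd_paths = 0
    for g in genes:
        for end in ("t", "h"):
            start = (g, end)
            if start in seen:
                continue
            # Collect the connected component that contains this extremity.
            seen.add(start)
            stack, component = [start], []
            while stack:
                x = stack.pop()
                component.append(x)
                for partners in (pa, pb):
                    y = partners.get(x)
                    if y is not None and y not in seen:
                        seen.add(y)
                        stack.append(y)
            if all(x in pa and x in pb for x in component):
                cycles += 1  # no telomere in the component: it is a cycle
            elif len(component) % 2 == 1:
                odd_paths += 1  # path with an odd number of edges
    return len(genes) - (cycles + odd_paths // 2)


print(dcj_distance([1, 2, 3, 4, 6, 5], [-2, -1, 3, 4, 6, 5]))  # 1
```

The pair in the usage line is the third pair on the slide; it differs by a single inversion, so the pairwise DCJ distance is 1.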
Step 3: Merge Disks
• Decomposition of the disks
• Construct a tree for each disk
• Merge the trees using a specific consensus method: strict, majority, etc.
• Disambiguation
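The two consensus rules named above can be sketched as follows. This is a simplification: it assumes the trees to be merged are already defined over the same leaf set and that each tree is represented as a set of clusters (frozensets of leaves), whereas merging overlapping disks into a supertree needs extra handling not shown here. The function names and the example trees are hypothetical.

```python
from collections import Counter


def strict_consensus(trees):
    """Keep only the clusters that appear in every input tree."""
    return set.intersection(*(set(tree) for tree in trees))


def majority_consensus(trees):
    """Keep the clusters that appear in more than half of the input trees."""
    counts = Counter(cluster for tree in trees for cluster in set(tree))
    return {cluster for cluster, c in counts.items() if 2 * c > len(trees)}


# Three hypothetical trees on leaves 1..5, each given by its non-trivial clusters.
trees = [
    {frozenset({1, 2}), frozenset({3, 4})},
    {frozenset({1, 2}), frozenset({3, 4})},
    {frozenset({1, 2}), frozenset({4, 5})},
]

strict = strict_consensus(trees)      # only {1, 2} is in all three trees
majority = majority_consensus(trees)  # {1, 2} and {3, 4} are in at least two
```

Strict consensus is the more conservative choice and leaves more unresolved nodes, which is where the disambiguation step comes in.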
Step 4: Initialization
• Init by insertion, which is local.
• Init by prospection, which is global.
[Figure: example trees illustrating the two initialization methods]
Step 5: Iterative Refinement
[Figure: tree with leaves 1 to 4 and internal nodes a and b]
Review
• Step 1: Spectral partition
• Step 2: Subtree construction
• Step 3: Supertree merge
• Step 4: Initialization of the complete tree using the General Adequate Subgraph (GAS) method
• Step 5: Iterative refinement until the complete tree converges
Result – Simulated Data
[Figure: seed genome evolved by #Theta + #gamma + #phi operations]
• We grow our own tree.
• We know the total number of evolutionary events in the model tree.
Result – Accuracy
• % of duplication: 0.1
• % of loss: 0.1
• Theta is the % of inversion.
• There are 8 species, so the tree has 2*8-3 = 13 edges.
• So the average accuracy is ~90%.
Result – Real Data
SCRaMbLE Matrix
• We can represent a SCRaMbLEd strain by its vector.
• The sign gives the orientation.
• The color encodes the position in the synthetic chromosome.
Result – Real Data
[Figure labels: #inversion : #insertion/deletion : #duplication]
Parallel Method [Bader 05]
• Load balancing
• Parallel search
Why Many-core BnB?
• There are already many distributed-memory MIP BnB frameworks (PICO, PEBBL, ALPS, COIN-OR).
• Load balance in distributed BnB relies heavily on ramp-up; run-time load balancing is not efficient.
• But nowadays petaflops machines are mostly hybrid systems (distributed + many-core or accelerators).