Efficient Enumeration of the Directed Binary Perfect Phylogenies from Incomplete Data

Kobe University
Toshiki Saitoh (ERATO)
Joint work with Masashi Kiyomi (JAIST) and Yoshio Okamoto (JAIST)
Yokohama City University, The University of Electro-Communications
11th International Symposium on Experimental Algorithms, Bordeaux, France, June 7-9, 2012

Directed Binary Perfect Phylogeny
Allowed state change: 0 → 1 (○). Forbidden state change: 1 → 0 (×).
• Input: a species-character matrix M
  • All characters are binary.
  • m_sc = 1 iff the species s has the character c.
• Output: a directed perfect phylogeny
  • An unordered rooted tree in which each leaf carries one species label.
  • Each character labels one node.
  • A species s has a character c if and only if the leaf labeled s is a descendant of the node labeled c.
[Figure: example tree with character nodes c1 to c6 and species leaves s1 to s5]

Directed Binary Perfect Phylogeny
Lemma [Jannson, 2008]: A matrix M admits a directed perfect phylogeny if and only if, for every pair of columns i and j, either Ci and Cj are disjoint or one contains the other.
Ci: the set of species with the character ci.
We can construct a phylogeny in polynomial time.
Example: C3 = {s1, s2, s4}, C4 = {s1, s4}, C6 = {s3}, so C4 ⊆ C3.
[Figure: example tree with character nodes c1 to c6 and species leaves s1 to s5]
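The condition in the lemma can be checked directly on a complete matrix. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the example matrix is an assumption that agrees with the slide only on C3, C4 and C6 (the other columns are made up for illustration).

```python
from itertools import combinations

def character_sets(matrix):
    """Return C_j, the set of row indices s with matrix[s][j] == 1, for each column j."""
    n_cols = len(matrix[0]) if matrix else 0
    return [{s for s, row in enumerate(matrix) if row[j] == 1}
            for j in range(n_cols)]

def admits_perfect_phylogeny(matrix):
    """Pairwise test from the lemma: disjoint, or one set contains the other."""
    for a, b in combinations(character_sets(matrix), 2):
        if not (a.isdisjoint(b) or a <= b or b <= a):
            return False
    return True

# Rows are s1..s5, columns are c1..c6 (illustrative values).
M = [
    [0, 0, 1, 1, 1, 0],  # s1
    [0, 0, 1, 0, 0, 0],  # s2
    [0, 1, 0, 0, 0, 1],  # s3
    [0, 0, 1, 1, 0, 0],  # s4
    [1, 0, 0, 0, 0, 0],  # s5
]
print(admits_perfect_phylogeny(M))

# Two columns that overlap without containment violate the lemma.
print(admits_perfect_phylogeny([[1, 1], [1, 0], [0, 1]]))
```

The check compares all pairs of columns, so it is polynomial in the size of M, in line with the slide's remark.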

Incomplete Directed Perfect Phylogenies
• Input: an incomplete species-character matrix
  • The states of some characters are unknown.
• Output: a directed perfect phylogeny
  • The unknown states are completed.
We can find one phylogeny in polynomial time [Pe’er et al., 2004].
[Figure: a phylogeny over characters C1 to C6 and species S1 to S5]

Incomplete Directed Perfect Phylogenies (continued)
• Problem studied here: enumeration of all perfect phylogenies from incomplete data
[Figure: two different phylogenies over characters C1 to C6 and species S1 to S5 obtained from the same incomplete matrix]
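To make the enumeration problem concrete, here is a naive sketch that tries every completion of the unknown entries and keeps those satisfying the pairwise condition of the lemma. It is a brute-force baseline that takes time exponential in the number of "?" entries, not the B&B or ZDD algorithm of the talk. Unknown entries are encoded as `None`, and all names are hypothetical.

```python
from itertools import combinations, product

def is_laminar(matrix):
    """Lemma check: every two character sets are disjoint or nested."""
    cols = [{s for s, row in enumerate(matrix) if row[j] == 1}
            for j in range(len(matrix[0]))]
    return all(a.isdisjoint(b) or a <= b or b <= a
               for a, b in combinations(cols, 2))

def enumerate_completions(matrix):
    """Yield every 0/1 completion of the None entries that passes the lemma check."""
    unknown = [(s, c) for s, row in enumerate(matrix)
               for c, v in enumerate(row) if v is None]
    for values in product((0, 1), repeat=len(unknown)):
        filled = [list(row) for row in matrix]
        for (s, c), v in zip(unknown, values):
            filled[s][c] = v
        if is_laminar(filled):
            yield filled

# Illustrative 3x2 instance with two unknown entries.
M = [[1, 1],
     [1, None],
     [None, 1]]
for completion in enumerate_completions(M):
    print(completion)
```

Of the four completions of this small instance, the one that sets both unknowns to 0 is rejected, because the two columns then overlap without one containing the other.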

Why Enumeration?
• Data mining
• Extraction of characters from all objects
• Indexing
• Counting
• Random sampling
• Searching
• Filtering
[Figure: several alternative phylogenies over characters C1 to C6 and species S1 to S5]

Our Contribution
• Two enumeration algorithms
  • Branch and bound (B&B)
    • Outputs all perfect phylogenies one by one
    • Runs in O(|M| kh) time, where k is the number of “?” entries in M and h is the number of perfect phylogenies
  • ZDD approach
    • Represents all perfect phylogenies compactly
    • Many applications: counting, random sampling, filtering
• Proof of #P-hardness of the counting problem
  • By reduction from counting the number of matchings in a bipartite graph

What is a ZDD?
• ZDD: Zero-suppressed Binary Decision Diagram
  • Proposed by Minato [Minato, 1993]
  • Compact representation of a Boolean function
  • A Boolean function corresponds to a family of sets.
Example: F = (x1 ∧ x2 ∧ ¬x3) ∨ (¬x2 ∧ x3) corresponds to the family {{x1, x2}, {x1, x3}, {x3}}.
Applying the reduction rules (uniqueness and zero-suppression) to the binary decision tree representing F yields the ZDD of F.
[Figure: binary decision tree representing F, and the ZDD of F]
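The correspondence between a Boolean function and a family of sets can be checked by enumerating the truth table. This sketch assumes the example function is F = (x1 ∧ x2 ∧ ¬x3) ∨ (¬x2 ∧ x3), which is the reading that matches the family listed on the slide. Each satisfying assignment is turned into the set of variables that are true.

```python
from itertools import product

def F(x1, x2, x3):
    """Assumed reading of the slide's example function."""
    return (x1 and x2 and not x3) or ((not x2) and x3)

names = ("x1", "x2", "x3")

# One set per satisfying assignment: the variables assigned True.
family = {
    frozenset(name for name, bit in zip(names, bits) if bit)
    for bits in product((False, True), repeat=len(names))
    if F(*bits)
}

for member in sorted(family, key=sorted):
    print(sorted(member))
```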

Reduction Rules
1. Uniqueness: merge duplicate nodes (isomorphic subgraphs).
2. Zero-suppression: eliminate redundant nodes (nodes whose 1-edge points to the 0-terminal).
[Figure: the two reduction rules applied to nodes labeled x]
A ZDD represents a family of sets in a compressed way. There are algebraic operations for families of sets over ZDDs.
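The two reduction rules can be sketched as a node constructor that applies them every time a node is created. This is a toy illustration under stated assumptions, not Minato's library: the class `ZDD` and its methods are hypothetical names, and the builder splits an explicit family of sets, which is only practical for very small inputs.

```python
class ZDD:
    """Tiny ZDD manager. Node ids 0 and 1 are the terminals."""

    def __init__(self):
        self.nodes = {}    # node id -> (var, lo, hi)
        self.unique = {}   # (var, lo, hi) -> node id

    def node(self, var, lo, hi):
        if hi == 0:                      # zero-suppression: drop the node
            return lo
        key = (var, lo, hi)
        if key not in self.unique:       # uniqueness: reuse an identical node
            nid = len(self.nodes) + 2    # ids 0 and 1 are reserved
            self.nodes[nid] = key
            self.unique[key] = nid
        return self.unique[key]

    def build(self, family, order):
        """Build the ZDD of a family of sets whose elements all occur in `order`."""
        family = {frozenset(s) for s in family}
        if not family:
            return 0                     # empty family
        if not order:
            return 1                     # family containing only the empty set
        var, rest = order[0], order[1:]
        lo = self.build([s for s in family if var not in s], rest)
        hi = self.build([s - {var} for s in family if var in s], rest)
        return self.node(var, lo, hi)

    def count(self, nid):
        """Number of sets represented by the node."""
        if nid < 2:
            return nid
        _, lo, hi = self.nodes[nid]
        return self.count(lo) + self.count(hi)

zdd = ZDD()
root = zdd.build([{"x1", "x2"}, {"x1", "x3"}, {"x3"}], ["x1", "x2", "x3"])
print(zdd.count(root), "sets stored in", len(zdd.nodes), "non-terminal nodes")
```

For the example family, the x3 node is shared between the two branches of x1, and the x2 node on the 0-branch of x1 disappears because its 1-edge would lead to the 0-terminal.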

Algebraic Operations on ZDDs
• Family algebras
  • Union, intersection, difference, join, quotient, remainder, etc.
• Filtering objects in ZDDs
• Counting (random sampling) and optimization
These operations can be performed in almost linear time.
Example (union): {{x1, x2}, {x1, x3}, {x3}} ∪ {{x1}, {x1, x2, x3}} = {{x1}, {x3}, {x1, x2}, {x1, x3}, {x1, x2, x3}}
[Figure: ZDDs of the two operands and of their union]
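The union example can be reproduced with a recursive union that works on the diagrams rather than on the explicit families. This is a sketch under assumptions: nodes are plain Python tuples `(var, lo, hi)` compared structurally instead of being hash-consed as in a real ZDD library, variables are the integers 1, 2, 3 standing for x1, x2, x3, and all function names are hypothetical.

```python
from functools import lru_cache

BOT, TOP = "0", "1"   # terminals: the empty family, and the family {∅}

def mk(var, lo, hi):
    """Create a node, applying zero-suppression."""
    return lo if hi == BOT else (var, lo, hi)

def build(family, n_vars, var=1):
    """ZDD of a family of sets over the variables var..n_vars."""
    family = frozenset(frozenset(s) for s in family)
    if not family:
        return BOT
    if var > n_vars:
        return TOP
    lo = [s for s in family if var not in s]
    hi = [s - {var} for s in family if var in s]
    return mk(var, build(lo, n_vars, var + 1), build(hi, n_vars, var + 1))

def top_var(f):
    """Variable at the root; terminals sort after every variable."""
    return f[0] if isinstance(f, tuple) else float("inf")

@lru_cache(maxsize=None)
def union(f, g):
    if f == BOT:
        return g
    if g == BOT or f == g:
        return f
    vf, vg = top_var(f), top_var(g)
    if vf < vg:
        return mk(vf, union(f[1], g), f[2])
    if vg < vf:
        return mk(vg, union(f, g[1]), g[2])
    return mk(vf, union(f[1], g[1]), union(f[2], g[2]))

def to_family(f):
    """Expand a ZDD back into an explicit family of sets."""
    if f == BOT:
        return set()
    if f == TOP:
        return {frozenset()}
    var, lo, hi = f
    return to_family(lo) | {s | {var} for s in to_family(hi)}

A = build([{1, 2}, {1, 3}, {3}], 3)
B = build([{1}, {1, 2, 3}], 3)
for member in sorted(to_family(union(A, B)), key=sorted):
    print(sorted(member))
```

The memoization on pairs of nodes is what keeps such operations proportional to the sizes of the diagrams rather than to the number of sets they represent.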

Perfect Phylogenies and ZDD
• Introduce a Boolean variable x_sc for each species s and character c.
  • x_sc = 1 if and only if the species s has the character c.
Lemma [Jannson, 2008]: A matrix M admits a directed perfect phylogeny if and only if, for every pair of columns i and j, either Ci and Cj are disjoint or one contains the other.
[Figure: ZDD over the variables x11, x21, x22, x32]

Perfect Phylogenies and ZDD (continued)
In terms of the Boolean variables, the lemma reads: for every pair of distinct characters ci and cj, at least one of the following three conditions is satisfied.
• For all species s, if x_{s,ci} = 1 then x_{s,cj} = 1 (Ci ⊆ Cj).
• For all species s, if x_{s,ci} = 0 then x_{s,cj} = 0 (Cj ⊆ Ci).
• For all species s, if x_{s,ci} = 1 then x_{s,cj} = 0 (Ci and Cj are disjoint).
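The translation of the lemma into implications on the Boolean variables can be sanity-checked by brute force. The sketch below compares the set-based condition with the three implication-based conditions for every pair of columns over three species. The function names and the choice of three species are assumptions made for illustration.

```python
from itertools import product

def ok_by_sets(col_i, col_j):
    """Lemma as stated on sets: disjoint, or one contains the other."""
    ci = {s for s, v in enumerate(col_i) if v == 1}
    cj = {s for s, v in enumerate(col_j) if v == 1}
    return ci.isdisjoint(cj) or ci <= cj or cj <= ci

def conditions(col_i, col_j):
    """Truth values of the three implication conditions for one pair of columns."""
    pairs = list(zip(col_i, col_j))
    subset = all(xj == 1 for xi, xj in pairs if xi == 1)    # Ci is contained in Cj
    superset = all(xj == 0 for xi, xj in pairs if xi == 0)  # Cj is contained in Ci
    disjoint = all(xj == 0 for xi, xj in pairs if xi == 1)  # Ci and Cj are disjoint
    return subset, superset, disjoint

def ok_by_implications(col_i, col_j):
    return any(conditions(col_i, col_j))

n_species = 3
columns = list(product((0, 1), repeat=n_species))
agree = all(ok_by_sets(a, b) == ok_by_implications(a, b)
            for a in columns for b in columns)
print("set form and implication form agree:", agree)
```

Several of the three conditions can hold at once, for instance the first two when both columns are equal, so the disjunction is the relevant test.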

Experiments
• Instances
  • Incomplete data constructed from complete data
    • Random data set [Hudson, 2002]
    • Each “1” or “0” is replaced by “?” with probability p ∈ {0.1, 0.2, 0.3, 0.4, 0.5}
  • Matrix size (n, m) ∈ {50, 100} × {50, 100}
  • 100 instances for each triple (n, m, p)
• The B&B algorithm is implemented in C.
• The ZDD approach is implemented in C++ (using the ZDD library developed by Minato).
• Machine spec
  • OS: SuSE Linux Enterprise Server 10
  • CPU: Quad-Core AMD Opteron Processor 8393
  • #CPUs 16, #Processors 32, Clock Freq. 3092 MHz
  • Memory: 512 GB
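The masking step used to derive incomplete instances can be sketched as follows. This is an illustration only: the complete matrix below is a made-up stand-in for the random data set of the experiments, `None` stands for "?", and the function name and seed are hypothetical.

```python
import random

def mask_entries(matrix, p, rng=None):
    """Replace each entry by None ("?") independently with probability p."""
    rng = rng or random.Random()
    return [[None if rng.random() < p else v for v in row] for row in matrix]

# Small complete matrix used only for illustration.
complete = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]

rng = random.Random(2012)  # fixed seed so that repeated runs give the same instance
for p in (0.1, 0.3, 0.5):
    incomplete = mask_entries(complete, p, rng)
    unknown = sum(v is None for row in incomplete for v in row)
    print(f"p={p}: {unknown} unknown entries")
```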

Experimental Results
The number of instances solved by B&B and by the ZDD approach (“solved” means that the algorithm halts successfully). Timeout: 2 minutes.
[Table not preserved in the transcript]

Experimental Results

Experimental Results
The size of the ZDD is 1017.77 times smaller than the number of perfect phylogenies.

Conclusion
• Our results
  • Two enumeration algorithms
    • Branch and bound algorithm (B&B)
    • ZDD approach
  • The ZDD approach solved more instances than B&B.
    • Its running time grows with the ZDD size.
    • The ZDD achieves a high compression rate on the random data.
  • Proof of #P-hardness of the counting problem
Thank you for your attention!
