Efficient Enumeration of the Directed Binary Perfect Phylogenies from Incomplete Data
Toshiki Saitoh (ERATO; Kobe University)
Joint work with Masashi Kiyomi (JAIST; Yokohama City University) and Yoshio Okamoto (JAIST; The University of Electro-Communications)
11th International Symposium on Experimental Algorithms, Bordeaux, France, June 7-9, 2012
Directed Binary Perfect Phylogeny
• Input: a species-character matrix M
  • All characters are binary.
  • m_sc = 1 if and only if species s has character c.
• Output: a directed perfect phylogeny
  • An unordered rooted tree in which each leaf carries exactly one species label.
  • Each character labels exactly one node.
  • A species s has a character c if and only if the leaf labeled s is a descendant of the node labeled c.
• Directed: a character may change from 0 to 1 (allowed, ○) but not from 1 to 0 (forbidden, ×).
[Figure: example phylogeny with character nodes c1-c6 and species leaves s1-s5]
Directed Binary Perfect Phylogeny
Lemma [Jansson, 2008]: A matrix M admits a directed perfect phylogeny if and only if, for every pair of columns i and j, either C_i and C_j are disjoint or one contains the other.
• C_i: the set of species with the character c_i
• When the condition holds, a phylogeny can be constructed in polynomial time.
• Example: C_3 = {s1, s2, s4}, C_4 = {s1, s4}, C_6 = {s3}; here C_4 ⊆ C_3, and C_6 is disjoint from both.
[Figure: example phylogeny with character nodes c1-c6 and species leaves s1-s5]
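The pairwise condition of the lemma can be checked directly on a complete 0/1 matrix. The following is a minimal sketch, not code from the talk: the function names and the small matrix are illustrative assumptions, with the matrix restricted to the three columns c3, c4, c6 whose sets are given in the example.

```python
from itertools import combinations


def character_sets(matrix):
    """Return C_j, the set of row indices (species) whose entry in column j is 1."""
    n_chars = len(matrix[0]) if matrix else 0
    return [{s for s, row in enumerate(matrix) if row[j] == 1}
            for j in range(n_chars)]


def admits_perfect_phylogeny(matrix):
    """Pairwise test from the lemma: every two character sets must be
    disjoint, or one must contain the other."""
    sets = character_sets(matrix)
    return all(a.isdisjoint(b) or a <= b or b <= a
               for a, b in combinations(sets, 2))


# Hypothetical 5x3 matrix: rows s1..s5, columns c3, c4, c6 of the example,
# i.e. C_3 = {s1, s2, s4}, C_4 = {s1, s4}, C_6 = {s3}.
example = [
    [1, 1, 0],  # s1
    [1, 0, 0],  # s2
    [0, 0, 1],  # s3
    [1, 1, 0],  # s4
    [0, 0, 0],  # s5
]
print(admits_perfect_phylogeny(example))
```

For this matrix the check succeeds, whereas a matrix containing the rows (1,1), (1,0) and (0,1) in two columns fails it. The test compares all pairs of columns, so it is quadratic in the number of characters.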
Incomplete Directed Perfect Phylogenies
• Input: an incomplete species-character matrix
  • The states of some characters are unknown.
• Output: a directed perfect phylogeny
  • The unknown states are completed.
• One such phylogeny can be found in polynomial time [Pe’er et al., 2004].
[Figure: a perfect phylogeny on characters C1-C6 and species S1-S5]
Incomplete Directed Perfect Phylogenies
• This work: enumeration of all perfect phylogenies from incomplete data
[Figure: two different perfect phylogenies on characters C1-C6 and species S1-S5]
Why Enumeration?
• Data mining
• Extraction of characters from all objects
• Indexing
• Counting
• Random sampling
• Searching
• Filtering
[Figure: a list of enumerated perfect phylogenies]
Our Contribution
• Two enumeration algorithms
  • Branch and bound (B&B)
    • Outputs all perfect phylogenies one by one
    • Runs in O(|M|kh) time, where k is the number of “?” entries in M and h is the number of perfect phylogenies
  • ZDD approach
    • Represents all perfect phylogenies compactly
    • Supports many applications: counting, random sampling, filtering
• Proof of #P-hardness of the counting problem
  • By reduction from counting the matchings in a bipartite graph
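To make the branch-and-bound idea concrete, here is a simplified sketch that fills the “?” entries one at a time and prunes a branch as soon as two columns contain the rows (1,1), (1,0) and (0,1) among known entries. It is an illustration under stated assumptions, not the authors' algorithm: the pruning test is only a necessary condition on partial matrices, so some branches may still end without output and the O(|M|kh) bound is not claimed for it. All names are hypothetical.

```python
def conflict(matrix, i, j):
    """True if columns i and j already show the rows (1,1), (1,0) and (0,1)
    among known entries; no completion can then make them disjoint or nested."""
    seen = {(row[i], row[j]) for row in matrix}
    return {(1, 1), (1, 0), (0, 1)} <= seen


def enumerate_completions(matrix):
    """Yield every completion of the '?' entries that satisfies the lemma."""
    m = [list(row) for row in matrix]
    n_chars = len(m[0]) if m else 0
    unknown = [(s, c) for s, row in enumerate(m)
               for c, v in enumerate(row) if v == '?']

    # Bound at the root: the known entries alone may rule out every completion.
    if any(conflict(m, i, j)
           for i in range(n_chars) for j in range(i + 1, n_chars)):
        return

    def branch(k):
        if k == len(unknown):
            yield [row[:] for row in m]  # copy, since m is mutated in place
            return
        s, c = unknown[k]
        for value in (0, 1):
            m[s][c] = value
            # Only pairs involving column c can have become conflicting.
            if not any(conflict(m, c, j) for j in range(n_chars) if j != c):
                yield from branch(k + 1)
        m[s][c] = '?'  # undo the assignment before backtracking

    yield from branch(0)


for completion in enumerate_completions([[1, 1], [1, '?'], ['?', 1]]):
    print(completion)
```

Once every entry is known, the absence of a conflicting pair is exactly the disjoint-or-nested condition, so each yielded matrix admits a perfect phylogeny.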
What is a ZDD?
• ZDD: Zero-suppressed Binary Decision Diagram
• Proposed by Minato [Minato, 1993]
• A compact representation of a Boolean function
• A Boolean function corresponds to a family of sets.
• Example: F = (x1 ∧ x2 ∧ ¬x3) ∨ (¬x2 ∧ x3) corresponds to the family {{x1, x2}, {x1, x3}, {x3}}.
• Applying the two reduction rules, uniqueness and zero-suppression, to the binary decision tree representing F yields the ZDD of F.
[Figure: binary decision tree representing F and the ZDD of F]
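The correspondence between a Boolean function and a family of sets can be verified by listing the satisfying assignments. This sketch assumes the example function reads F = (x1 ∧ x2 ∧ ¬x3) ∨ (¬x2 ∧ x3), which is the reading consistent with the family shown; each satisfying assignment is turned into the set of variables assigned 1.

```python
from itertools import product


def F(x1, x2, x3):
    """Assumed reading of the example function."""
    return bool((x1 and x2 and not x3) or ((not x2) and x3))


names = ("x1", "x2", "x3")

# One set per satisfying assignment: the variables that are set to 1.
family = {frozenset(n for n, bit in zip(names, bits) if bit)
          for bits in product((0, 1), repeat=3) if F(*bits)}

for member in sorted(family, key=sorted):
    print(sorted(member))
```

The three sets obtained are {x1, x2}, {x1, x3} and {x3}, matching the family on the slide.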
Reduction Rules
1. Uniqueness: merge duplicate nodes (isomorphic subgraphs).
2. Zero-suppression: eliminate redundant nodes, that is, nodes whose 1-edge points to the 0-terminal.
• A ZDD represents a family of sets in a compressed way.
• There are algebraic operations for families of sets over ZDDs.
[Figure: the two reduction rules applied to nodes labeled x]
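Both reduction rules fit in a few lines when nodes are created through a single function. The sketch below is a toy construction for illustration only (it is not Minato's library, and the class and method names are assumptions): it builds the ZDD of a family top-down, applies zero-suppression and uniqueness at node creation, and reports the number of non-terminal nodes.

```python
class ZDD:
    """Minimal ZDD node store; terminals are the integers 0 and 1."""

    def __init__(self):
        self.unique = {}  # (var, lo, hi) -> node id, enforces uniqueness

    def node(self, var, lo, hi):
        if hi == 0:                      # zero-suppression rule
            return lo
        key = (var, lo, hi)
        if key not in self.unique:       # uniqueness rule
            self.unique[key] = len(self.unique) + 2
        return self.unique[key]

    def build(self, family, order):
        """Return the root id for a family of sets over the variables in order."""
        def rec(fam, i):
            if not fam:
                return 0                 # empty family
            if i == len(order):
                return 1                 # only the empty set is left
            v = order[i]
            lo = rec(frozenset(s for s in fam if v not in s), i + 1)
            hi = rec(frozenset(s - {v} for s in fam if v in s), i + 1)
            return self.node(v, lo, hi)
        return rec(frozenset(frozenset(s) for s in family), 0)


zdd = ZDD()
root = zdd.build([{"x1", "x2"}, {"x1", "x3"}, {"x3"}], ["x1", "x2", "x3"])
print(len(zdd.unique))  # number of non-terminal nodes
```

For the example family this yields three non-terminal nodes, compared with seven internal nodes in the full binary decision tree over three variables. The recursive splitting of explicit families is exponential in general and is used here only to keep the example short.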
Algebraic Operations on ZDDs
• Family algebra
  • Union, intersection, difference, join, quotient, remainder, etc.
• Filtering objects in ZDDs
• Counting (random sampling) and optimization
• These operations run in time almost linear in the size of the ZDDs.
• Example (union): {{x1, x2}, {x1, x3}, {x3}} ∪ {{x1}, {x1, x2, x3}} = {{x1}, {x3}, {x1, x2}, {x1, x3}, {x1, x2, x3}}
[Figure: ZDDs of the two operand families and of their union]
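Union and counting can be written as short recursions over the diagram. The following sketch represents a ZDD node as a nested tuple (variable, 0-child, 1-child) with variables numbered 1, 2, 3, so Python's structural equality stands in for the uniqueness rule; this representation and all function names are assumptions made for illustration, not the interface of an existing ZDD library.

```python
from functools import lru_cache

ZERO, ONE = "0", "1"  # terminal nodes


def mk(var, lo, hi):
    """Create a node, applying zero-suppression."""
    return lo if hi == ZERO else (var, lo, hi)


def single(s):
    """ZDD of the family that contains only the set s."""
    node = ONE
    for v in sorted(s, reverse=True):
        node = mk(v, ZERO, node)
    return node


@lru_cache(maxsize=None)
def union(f, g):
    if f == ZERO:
        return g
    if g == ZERO or f == g:
        return f
    fv = f[0] if isinstance(f, tuple) else float("inf")
    gv = g[0] if isinstance(g, tuple) else float("inf")
    if fv < gv:
        return mk(fv, union(f[1], g), f[2])
    if gv < fv:
        return mk(gv, union(f, g[1]), g[2])
    return mk(fv, union(f[1], g[1]), union(f[2], g[2]))


@lru_cache(maxsize=None)
def count(f):
    """Number of sets in the family represented by f."""
    if f == ZERO:
        return 0
    if f == ONE:
        return 1
    return count(f[1]) + count(f[2])


def family(sets):
    f = ZERO
    for s in sets:
        f = union(f, single(s))
    return f


def to_sets(f):
    """Expand a ZDD back into an explicit family (for checking small cases)."""
    if f == ZERO:
        return set()
    if f == ONE:
        return {frozenset()}
    var, lo, hi = f
    return to_sets(lo) | {s | {var} for s in to_sets(hi)}


a = family([{1, 2}, {1, 3}, {3}])
b = family([{1}, {1, 2, 3}])
print(count(union(a, b)))
```

The memoized recursions visit each pair of nodes at most once, which is the reason such operations scale with the diagram size rather than with the number of sets. Applied to the two families of the union example, the result contains five sets.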
Perfect Phylogenies and ZDD
• Introduce a Boolean variable x_sc for each species s and character c.
  • x_sc = 1 if and only if species s has character c.
Lemma [Jansson, 2008]: A matrix M admits a directed perfect phylogeny if and only if, for every pair of columns i and j, either C_i and C_j are disjoint or one contains the other.
[Figure: ZDD over the variables x_11, x_21, x_22, x_32]
Perfect Phylogenies and ZDD
• In terms of the variables x_sc, the lemma's condition reads: for every pair of distinct characters c_i and c_j, at least one of the following three holds.
  • For all species s, if x_{s,c_i} = 1 then x_{s,c_j} = 1 (C_i ⊆ C_j).
  • For all species s, if x_{s,c_i} = 0 then x_{s,c_j} = 0 (C_j ⊆ C_i).
  • For all species s, if x_{s,c_i} = 1 then x_{s,c_j} = 0 (C_i and C_j are disjoint).
Experiments
• Instances
  • Incomplete data are constructed from complete data.
  • Complete data: random data set [Hudson, 2002]
  • Each “1” or “0” is replaced by “?” with probability p ∈ {0.1, 0.2, 0.3, 0.4, 0.5}.
  • Matrix size (n, m): n ∈ {50, 100}, m ∈ {50, 100}
  • 100 instances for each triple (n, m, p)
• The B&B algorithm is written in C.
• The ZDD approach is written in C++ (using the ZDD library developed by Minato).
• Machine specification
  • OS: SuSE Linux Enterprise Server 10
  • CPU: Quad-Core AMD Opteron Processor 8393
  • #CPUs 16, #Processors 32, Clock Freq. 3092MHz
  • Memory: 512GB
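The masking step that turns complete data into incomplete data can be sketched as follows. The complete matrix here is a small hypothetical stand-in (the generator for the random data set is not reproduced), and the function name and seed are assumptions.

```python
import random


def mask_entries(matrix, p, rng):
    """Replace each entry independently by '?' with probability p."""
    return [['?' if rng.random() < p else v for v in row] for row in matrix]


# Hypothetical complete 0/1 matrix standing in for one generated data set.
complete = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
rng = random.Random(0)  # fixed seed so the instance can be regenerated
incomplete = mask_entries(complete, 0.3, rng)
for row in incomplete:
    print(row)
```

Every entry that is not masked keeps its original value, so the complete matrix is always one valid completion of the generated instance.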
Experimental Results
• The number of instances solved by B&B and by the ZDD approach (“solved” means that the algorithm halts successfully)
• Timeout: 2 minutes
Experimental Results The size of ZDD is 1017.77 times smaller than the number of perfect phylogenies.
Conclusion
• Our results
  • Two enumeration algorithms
    • Branch and bound algorithm (B&B)
    • ZDD approach
  • The ZDD approach solved more instances than B&B.
    • Its running time grows with the ZDD size.
  • The ZDD shows a high compression rate on the random data.
  • Proof of #P-hardness of the counting problem
Thank you for your attention!