Phylogenetic trees dissimilarity measure based on strict frequent splits set and its application for clustering • Jakub Koperwas, Krzysztof Walczak • Institute of Computer Science, Warsaw University of Technology
Agenda • Background • Introduction to phylogenetic trees • Split representation and consensus methods • Frequent split set representation • Motivation • Definition • Interpretation • Frequent split set based dissimilarity measure • Clustering • Motivation • Results
Phylogenetic Tree [Figure: phylogenetic tree with leaves species1 to species6; nodes labeled "ancestor" and "ancestor?"]
Tree Representation [Figure: tree with leaves a to f] • Splits: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef
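The split list above can be derived mechanically from a tree: each edge separates the leaves below it from all the others. The sketch below assumes a hypothetical nested-tuple encoding of the tree and single-character leaf names; the helper names `leaves`, `splits` and `parse` are illustrative and not from the talk.

```python
def leaves(node):
    """Leaf labels below a node; a node is a leaf name (str) or a tuple of child nodes."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Collect one split per edge: the leaves below the edge versus all the others."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        below = leaves(node)
        if below != all_leaves:  # the root itself has no edge above it
            found.add(frozenset([below, all_leaves - below]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def parse(text):
    """Turn 'ab|cdef' into an unordered split (assumes single-character leaf names)."""
    left, right = text.split("|")
    return frozenset([frozenset(left), frozenset(right)])


# Assumed encoding of the six-leaf tree; the rooting is arbitrary.
tree = ((("a", "b"), "c"), "d", ("e", "f"))
expected = {parse(s) for s in
            "a|bcdef b|acdef c|abdef d|abcef e|abcdf f|abcde ab|cdef abc|def abcd|ef".split()}
print(splits(tree) == expected)  # True
```

Storing each split as an unordered pair of leaf sets means ab|cdef and cdef|ab compare equal, which matches how the splits are used on the following slides.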
Robinson-Foulds Distance [Figure: trees T1 and T2 on leaves a to f] • Splits for tree T1: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef • Splits for tree T2: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abce|fd • Uncommon splits: abcd|ef, abce|fd
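Counting the uncommon splits is a symmetric difference of the two split sets. A minimal sketch, assuming single-character leaf names and the unnormalized form of the distance (some conventions halve the count); `parse` and `rf_distance` are illustrative names.

```python
def parse(text):
    """Turn 'ab|cdef' into an unordered split (assumes single-character leaf names)."""
    left, right = text.split("|")
    return frozenset([frozenset(left), frozenset(right)])


def rf_distance(splits1, splits2):
    """Number of splits that occur in exactly one of the two trees."""
    return len(splits1 ^ splits2)


T1 = {parse(s) for s in
      "a|bcdef b|acdef c|abdef d|abcef e|abcdf f|abcde ab|cdef abc|def abcd|ef".split()}
T2 = {parse(s) for s in
      "a|bcdef b|acdef c|abdef d|abcef e|abcdf f|abcde ab|cdef abc|def abce|fd".split()}

print(rf_distance(T1, T2))  # 2
```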
Common Information Extraction • Consensus Methods • Strict Consensus Tree • Majority-rule Consensus Tree • Many others (Aho, Adams, …) • Maximum Agreement Subtree • Maximum Compatible Tree • Many others
Strict Consensus Tree [Figure: trees T1 and T2 and their strict consensus tree Tc(T1,T2)] • Splits for tree T1: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef • Splits for tree T2: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abce|fd • The common splits: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def
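In split terms, the strict consensus keeps the splits present in every tree, and the majority-rule consensus keeps those present in more than half of the trees. The following is a sketch under the assumption that all trees share one leafset and that leaf names are single characters; the function names are illustrative.

```python
from collections import Counter


def parse(text):
    """Turn 'ab|cdef' into an unordered split (assumes single-character leaf names)."""
    left, right = text.split("|")
    return frozenset([frozenset(left), frozenset(right)])


def strict_consensus(profile):
    """Splits that occur in every tree of the profile."""
    return profile[0].intersection(*profile[1:])


def majority_consensus(profile):
    """Splits that occur in more than half of the trees of the profile."""
    counts = Counter(split for tree in profile for split in tree)
    return {split for split, n in counts.items() if 2 * n > len(profile)}


T1 = {parse(s) for s in
      "a|bcdef b|acdef c|abdef d|abcef e|abcdf f|abcde ab|cdef abc|def abcd|ef".split()}
T2 = {parse(s) for s in
      "a|bcdef b|acdef c|abdef d|abcef e|abcdf f|abcde ab|cdef abc|def abce|fd".split()}

common = strict_consensus([T1, T2])
print(len(common))                   # 8
print(parse("abcd|ef") in common)    # False
```

With only two trees, "more than half" means "both", so the majority-rule result coincides with the strict one for this example.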
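Why is the consensus tree (CT) not good enough? abc|defgi ?=? abc|defgj • The two splits differ in a single leaf (i versus j), yet an exact comparison treats them as entirely different splits.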
Subsplit • abc|defg ⊆ abc|defgx (is a subsplit of), because abc ⊆ abc and defg ⊆ defgx • abc|defg ⊆ abc|defgy (is a subsplit of), because abc ⊆ abc and defg ⊆ defgy
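The subsplit relation reduces to two subset checks. The sketch below assumes that the two sides of a split are unordered, so both orientations are tried; that reading, like the names `parse` and `is_subsplit`, is an assumption rather than something stated on the slide.

```python
def parse(text):
    """Turn 'abc|defg' into a pair of leaf sets (assumes single-character leaf names)."""
    left, right = text.split("|")
    return frozenset(left), frozenset(right)


def is_subsplit(small, big):
    """True if each side of `small` is contained in a distinct side of `big`."""
    a, b = small
    c, d = big
    return (a <= c and b <= d) or (a <= d and b <= c)


print(is_subsplit(parse("abc|defg"), parse("abc|defgx")))  # True
print(is_subsplit(parse("abc|defg"), parse("abc|defgy")))  # True
print(is_subsplit(parse("abc|defgx"), parse("abc|defg")))  # False
```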
Frequent Subsplit • A frequent subsplit in a profile of trees is a split that is a subsplit of at least one split in at least minsup of the trees. • Minsup=100% • T1: cd|abefghi, bcd|aefghi, abcd|efghi, hi|abcdefg, ghi|abcdef, fghi|abcde • T2: bc|adefghj, abc|defghj, abcd|efghj, hj|abcdefg, ghj|abcdef, fghj|abcde • Frequent subsplits: bc|aefgh, but also bc|aefg, bc|aef, bc|ae, …
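The support of a candidate split can be checked directly against this definition: count the trees that contain at least one supersplit of the candidate. A minimal sketch, assuming single-character leaf names and unordered split sides; `support` and the other helper names are illustrative.

```python
def parse(text):
    """Turn 'bc|aefgh' into a pair of leaf sets (assumes single-character leaf names)."""
    left, right = text.split("|")
    return frozenset(left), frozenset(right)


def is_subsplit(small, big):
    """True if each side of `small` is contained in a distinct side of `big`."""
    a, b = small
    c, d = big
    return (a <= c and b <= d) or (a <= d and b <= c)


def support(candidate, profile):
    """Fraction of trees having at least one split that is a supersplit of `candidate`."""
    hits = sum(any(is_subsplit(candidate, split) for split in tree) for tree in profile)
    return hits / len(profile)


T1 = [parse(s) for s in
      "cd|abefghi bcd|aefghi abcd|efghi hi|abcdefg ghi|abcdef fghi|abcde".split()]
T2 = [parse(s) for s in
      "bc|adefghj abc|defghj abcd|efghj hj|abcdefg ghj|abcdef fghj|abcde".split()]

print(support(parse("bc|aefgh"), [T1, T2]))    # 1.0
print(support(parse("bc|ae"), [T1, T2]))       # 1.0
print(support(parse("cd|abefghi"), [T1, T2]))  # 0.5
```

The last call shows why exact splits fail here: cd|abefghi contains leaf i, which does not occur in T2, so only T1 supports it.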
Representative splitset • Representative splitset: a set that contains the maximal frequent subsplits s, i.e. those for which there is no other frequent subsplit s2 that is also a supersplit of s. • Minsup=100% (strict) • T1: cd|abefghi, bcd|aefghi, abcd|efghi, hi|abcdefg, ghi|abcdef, fghi|abcde • T2: bc|adefghj, abc|defghj, abcd|efghj, hj|abcdefg, ghj|abcdef, fghj|abcde • RS: abcd|efgh, gh|abcdef, fgh|abcde, bc|aefgh
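For two trees at Minsup=100%, the representative splitset can be reproduced by brute force: intersect every split of T1 with every split of T2 in both orientations, then keep the maximal results. This is only an illustrative sketch, not the generation algorithm used by the authors. It assumes single-character leaf names and, as a further assumption, discards candidates with fewer than two leaves on either side, which is what makes the output match the RS listed on the slide.

```python
from itertools import product


def parse(text):
    """Turn 'bc|aefgh' into an unordered split (assumes single-character leaf names)."""
    left, right = text.split("|")
    return frozenset([frozenset(left), frozenset(right)])


def is_subsplit(small, big):
    """True if each side of `small` is contained in a distinct side of `big`."""
    a, b = small
    c, d = big
    return (a <= c and b <= d) or (a <= d and b <= c)


def representative_splitset(tree1, tree2):
    """Maximal subsplits shared by both trees (strict case, two trees only)."""
    candidates = set()
    for s, t in product(tree1, tree2):
        a, b = s
        c, d = t
        # Any common subsplit of s and t fits inside one of these two intersections.
        for left, right in ((a & c, b & d), (a & d, b & c)):
            if len(left) >= 2 and len(right) >= 2:  # assumption: non-trivial splits only
                candidates.add(frozenset([left, right]))
    return {s for s in candidates
            if not any(s != other and is_subsplit(s, other) for other in candidates)}


T1 = {parse(s) for s in
      "cd|abefghi bcd|aefghi abcd|efghi hi|abcdefg ghi|abcdef fghi|abcde".split()}
T2 = {parse(s) for s in
      "bc|adefghj abc|defghj abcd|efghj hj|abcdefg ghj|abcdef fghj|abcde".split()}

RS = representative_splitset(T1, T2)
print(RS == {parse(s) for s in "abcd|efgh gh|abcdef fgh|abcde bc|aefgh".split()})  # True
```

The pairwise scan is quadratic in the number of splits and does not generalize directly to larger profiles or lower minsup values.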
Frequent split-set interpretation • Property 1. For each distinct leafset z from a frequent splitset (FS) with support greater than 50%, a tree can be built. The tree is built from those splits in FS whose leafset is a superset of z. Therefore the frequent splitset (minsup>50%) can be represented as a set of trees. In particular, this applies to the strict and majority-rule frequent splitsets. • Property 2. Each split from the frequent splitset discussed above will occur in at least one tree, in a restricted form.
Frequent split-set interpretation • Property 3. Properties 1 and 2 are also true for a tree based on the intersection of all the distinct leafsets from the frequent splitset. • Property 4. The set of trees resulting from the frequent splitset will also contain a consensus tree, provided that the input trees were all built on the same leafset.
Clustering motivation • Phylogenetic tree reconstruction methods may produce many candidate trees • It is hard to apply consensus methods to obtain one tree from a profile of hundreds of trees • Clustering helps to designate a small number of candidate trees from a large number of trees
Information in Tree [Figure: several trees on leaves a to e]
Information Loss • Cluster Information Loss – the amount of information that will be lost while replacing the cluster of trees with one representative tree • Clustering Information Loss – the amount of information that will be lost while replacing the input profile of trees with k representative trees
Agglomerative clustering • Typical Merging Strategies: • Single linkage • Complete linkage • Average Linkage • Our Merging Strategy: minimize information loss after merging • For SFS as Representative Tree:
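The merging strategy can be sketched as a greedy loop that repeatedly joins the two clusters whose merge increases the total loss the least. The information-loss formula from the talk is not reproduced here; `cluster_loss` below is a stand-in assumption that counts the splits of member trees missing from the cluster's strict consensus, and every function name is hypothetical.

```python
from itertools import combinations


def parse(text):
    """Turn 'ab|cdef' into an unordered split (assumes single-character leaf names)."""
    left, right = text.split("|")
    return frozenset([frozenset(left), frozenset(right)])


def cluster_loss(trees):
    """Stand-in loss (an assumption): splits absent from the cluster's strict consensus."""
    consensus = trees[0].intersection(*trees[1:])
    return sum(len(tree - consensus) for tree in trees)


def agglomerate(profile, k):
    """Merge clusters greedily until k remain, minimizing the increase in loss."""
    clusters = [[tree] for tree in profile]

    def merge_cost(pair):
        i, j = pair
        merged = clusters[i] + clusters[j]
        return cluster_loss(merged) - cluster_loss(clusters[i]) - cluster_loss(clusters[j])

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2), key=merge_cost)
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters


# Non-trivial splits of three six-leaf trees; A and B differ in a single split.
A = {parse(s) for s in "ab|cdef abc|def abcd|ef".split()}
B = {parse(s) for s in "ab|cdef abc|def abce|df".split()}
C = {parse(s) for s in "ae|bcdf ace|bdf acde|bf".split()}

result = agglomerate([A, C, B], 2)
print(sorted(len(cluster) for cluster in result))  # [1, 2]
```

Any other loss, such as one based on the strict frequent splitset as representative tree, could be swapped in by replacing `cluster_loss`.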
Future Work • More efficient FS generation algorithm • Frequency-based clustering algorithm