Jakub Koperwas Krzysztof Walczak Institute of Computer Science Warsaw University of Technology

Phylogenetic trees dissimilarity measure based on strict frequent splits set and its application for clustering

Agenda • Background • Introduction to phylogenetic trees • Split representation and consensus methods • Frequent split set representation • Motivation • Definition • Interpretation • Frequent split set based dissimilarity measure • Clustering • Motivation • Results

Phylogenetic Tree [Figure: tree drawings over the leaves species1 to species6, with internal nodes labeled "ancestor?" and "ancestor"]

Common Information Extraction • Consensus Methods • Strict Consensus Tree • Majority-rule Consensus Tree • Many others (Aho, Adams, …) • Maximum Agreement Subtree • Maximum Compatible Tree • Many others

Why is the consensus tree (CT) not so good? abc|defgi ?=? abc|defgj

Subsplit • abc|defg ⊆ abc|defgx (is a subsplit of), because abc ⊆ abc and defg ⊆ defgx • abc|defg ⊆ abc|defgy (is a subsplit of), because abc ⊆ abc and defg ⊆ defgy
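The subsplit relation above can be checked mechanically. The following is a minimal sketch, assuming a split is written as two groups of single-letter leaf names separated by "|"; the names parse_split and is_subsplit are illustrative and do not come from the slides. Treating a split as unordered (so that both orientations are tried) is also an assumption.

```python
def parse_split(text):
    """Parse a split such as 'abc|defg' into a pair of frozensets of leaves."""
    left, right = text.split("|")
    return frozenset(left), frozenset(right)


def is_subsplit(small, big):
    """Return True if every side of `small` is contained in a side of `big`.

    Assumption: a split is unordered, so the sides of `small` may match
    the sides of `big` in either orientation.
    """
    a, b = small
    c, d = big
    return (a <= c and b <= d) or (a <= d and b <= c)


base = parse_split("abc|defg")
with_x = parse_split("abc|defgx")
with_y = parse_split("abc|defgy")

print(is_subsplit(base, with_x))    # abc|defg is a subsplit of abc|defgx
print(is_subsplit(base, with_y))    # abc|defg is a subsplit of abc|defgy
print(is_subsplit(with_x, with_y))  # neither extended split contains the other
```

Under this reading, abc|defgx and abc|defgy are not comparable with each other, yet both contain the common subsplit abc|defg.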

Frequent split-set interpretation • Property 1. For each distinct leafset z from the frequent splitset (FS) with a support greater than 50%, a tree can be built. The tree is built on those splits from FS whose leafset is a superset of z. Therefore a frequent splitset (minsup > 50%) can be represented as a set of trees. In particular, this applies to the strict and majority-rule frequent sets. • Property 2. Each split from the frequent splitset discussed above occurs in at least one tree, in a restricted form.
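The selection step in Property 1 can be sketched as follows. This is only an illustration under stated assumptions: the frequent splitset is taken as already computed (support filtering is not shown), the leafset of a split is read as the union of its two sides, and splits that become trivial after restriction are dropped. The names parse, leafset and restricted_splits are hypothetical, and assembling the collected splits into a tree is not shown.

```python
def parse(text):
    """Parse 'abc|defg' into a pair of frozensets of single-letter leaves."""
    left, right = text.split("|")
    return frozenset(left), frozenset(right)


def leafset(split):
    """Leafset of a split, read here as the union of its two sides."""
    left, right = split
    return left | right


def restricted_splits(fs, z):
    """Collect the splits of `fs` whose leafset is a superset of `z`,
    each restricted to the leaves in `z`."""
    z = frozenset(z)
    result = set()
    for split in fs:
        if leafset(split) >= z:
            left, right = split
            left, right = left & z, right & z
            # Assumption: a restricted split with fewer than two leaves
            # on one side carries no grouping information and is skipped.
            if len(left) >= 2 and len(right) >= 2:
                result.add(frozenset((left, right)))
    return result


# Hypothetical frequent splitset with two distinct leafsets.
fs = [parse("abc|defg"), parse("abc|defgx"), parse("ab|cdefg")]
small = restricted_splits(fs, "abcdefg")   # splits usable for leafset abcdefg
large = restricted_splits(fs, "abcdefgx")  # splits usable for leafset abcdefgx
```

For the leafset abcdefg, the split abc|defgx contributes in its restricted form abc|defg, which is how Property 2 can be read: every frequent split appears in at least one of the resulting trees, possibly with some leaves removed.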

Frequent split-set interpretation • Property 3. Properties 1 and 2 are also true for a tree based on the intersection of all the distinct leafsets from the frequent splitset. • Property 4. The set of trees resulting from the frequent splitset will also contain a consensus tree, provided that the input trees were built on the same leafset.

Example

Clustering motivation • Phylogenetic tree reconstruction methods may produce many candidate trees • It is hard to apply consensus methods to obtain one tree from a profile of hundreds of trees • Clustering helps to designate a small number of candidate trees from a large number of trees

Dissimilarity Measure Example

Information in Tree [Figure: trees over the leaves a, b, c, d, e]

Information Loss • Cluster Information Loss – the amount of information that will be lost while replacing the cluster of trees with one representative tree • Clustering Information Loss – the amount of information that will be lost while replacing the input profile of trees with k representative trees

Agglomerative clustering • Typical Merging Strategies: • Single linkage • Complete linkage • Average linkage • Our Merging Strategy: minimize information loss after merging • For SFS as Representative Tree:
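As a rough illustration of the merging strategy (not the SFS formula from the slide), the sketch below greedily merges the pair of clusters whose union increases the information loss the least. Everything about the loss here is an assumption made for the example: each tree is modeled as a frozenset of split labels, the representative of a cluster is the set of splits shared by all of its trees, and the loss counts the splits of each tree that the representative omits. The function names are hypothetical.

```python
from itertools import combinations


def representative(cluster):
    """Stand-in representative: the splits shared by every tree in the cluster."""
    return frozenset.intersection(*cluster)


def cluster_loss(cluster):
    """Stand-in cluster information loss: splits of each tree that the
    representative does not contain (an assumed measure)."""
    rep = representative(cluster)
    return sum(len(tree - rep) for tree in cluster)


def clustering_loss(clusters):
    """Loss of replacing the whole profile by one representative per cluster."""
    return sum(cluster_loss(c) for c in clusters)


def merge_cost(a, b):
    """Increase in loss caused by merging clusters a and b."""
    return cluster_loss(a + b) - cluster_loss(a) - cluster_loss(b)


def agglomerate(trees, k):
    """Start from singleton clusters and merge the cheapest pair until k remain."""
    clusters = [[t] for t in trees]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical profile of four trees, each given by its non-trivial splits.
t1 = frozenset({"ab|cde", "abc|de"})
t2 = frozenset({"ab|cde", "abc|de"})
t3 = frozenset({"ac|bde", "ace|bd"})
t4 = frozenset({"ac|bde", "acd|be"})
result = agglomerate([t1, t2, t3, t4], k=2)
```

In contrast to single, complete or average linkage, no pairwise distance between trees is needed here: the merge criterion is computed directly from the clusters and their representatives.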

Results (camp)

Results (DT)

Agg-inf vs others (camp)

Future Work • More efficient FS generation algorithm • Frequency-based clustering algorithm

Thank You
