Imputing Supertrees and Supernetworks from Quartets

By B. Holland, G. Conner, K. Huber, and V. Moulton. Presented by Razieh Nokhbeh Zaeem.

This talk
• Basic problem: constructing an estimate of a species phylogeny (in this case, a network) from a given set of gene trees
• Input: a set of partial gene trees (not every tree contains all taxa)
• Output: a supernetwork, which can display conflicting signals
• Algorithm by Holland et al.: combines quartet imputation with consensus network construction
• Experiments comparing the new method with the earlier Z-closure method and with MRP (matrix representation with parsimony) with respect to false positives (FP) and false negatives (FN)
• Q-imputation provides a useful complementary tool

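Q-imputation
• Some definitions: L(T) (the leaf set of a tree T), T|Z (T restricted to a taxon subset Z), Q(T) (the set of quartets displayed by T) and …
• Let T_1, …, T_k be the collection of input trees, corresponding to a collection of gene trees.
• Put X = L(T_1) ∪ … ∪ L(T_k).
• For each tree T_i, we sequentially insert all of the taxa in X − L(T_i) into T_i to get a completed tree T_i'.
• Once we have all the T_i', we apply the consensus network method to them to obtain a network.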

Polynomial-time algorithm:
For each input tree T_i:
  For each new taxon y (a taxon missing from T_i):
    • Find a place p to attach a pendant edge labeled by y.
    • Choose p so that it maximizes the number of quartets on which the extended T_i agrees with all the other input trees T_j.
    • If more than one place achieves the best score, choose one of them at random.
    • If the maximum score is 0, we do not have enough information to place y.
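The insertion step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: trees are represented as sets of splits, and the helper names (`make_tree`, `insert_taxon`, `quartets`, `best_placements`) are hypothetical. The score counts only quartets that contain y and sums the agreement over the other trees, which is an assumption about how the score is defined.

```python
from itertools import combinations

def make_tree(taxa, internal_sides):
    """Hypothetical helper: a tree as its set of splits (trivial ones included)."""
    taxa = frozenset(taxa)
    sides = [frozenset({x}) for x in taxa] + [frozenset(s) for s in internal_sides]
    return {frozenset({s, taxa - s}) for s in sides}

def insert_taxon(splits, edge, y):
    """Attach a pendant edge labeled y to the edge whose split is `edge`."""
    A, B = tuple(edge)
    new = {frozenset({frozenset({y}), A | B})}       # the new pendant edge
    for s in splits:
        C, D = tuple(s)
        if s == edge:                                # the chosen edge is subdivided
            new.add(frozenset({A | {y}, B}))
            new.add(frozenset({A, B | {y}}))
        elif C & A and C & B:                        # the chosen edge lies on the C side
            new.add(frozenset({C | {y}, D}))
        else:                                        # the chosen edge lies on the D side
            new.add(frozenset({C, D | {y}}))
    return new

def quartets(splits):
    """All quartets ab|cd displayed by a set of splits."""
    qs = set()
    for s in splits:
        C, D = tuple(s)
        for p in combinations(sorted(C), 2):
            for q in combinations(sorted(D), 2):
                qs.add(frozenset({frozenset(p), frozenset(q)}))
    return qs

def best_placements(splits, y, other_quartet_sets):
    """Score every edge; return the best score and all edges achieving it."""
    scores = {}
    for edge in splits:
        cand = quartets(insert_taxon(splits, edge, y))
        with_y = {q for q in cand if any(y in pair for pair in q)}
        # Assumed score: shared quartets containing y, summed over the other trees.
        scores[edge] = sum(len(with_y & oq) for oq in other_quartet_sets)
    top = max(scores.values())
    return top, [e for e, sc in scores.items() if sc == top]

# Toy example: T1 = ab|cd lacks taxon e; T2 = ((a,b),c,(d,e)) contains it.
t1 = make_tree("abcd", ["ab"])
t2 = make_tree("abcde", ["ab", "de"])
top, places = best_placements(t1, "e", [quartets(t2)])
print(top, places)  # best score 4, reached only on the pendant edge of d
```

When `places` holds more than one edge, the slide's rule is to pick one at random (for example with `random.choice`); a `top` of 0 corresponds to the "not enough information" case.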

The consensus network
• The consensus network (the split network): those splits of X that are displayed by more than a certain proportion, t (in percent), of the trees computed by Q-imputation
• In case t = 0 we drop the subscript t: these are the splits that appear in at least one tree
• For example:
  • If t = 100 (splits displayed by every tree), the consensus network is the strict consensus tree
  • If t = 50, the consensus network is the majority-rule consensus tree
  • If t < 50, the consensus network may display conflicting splits
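The thresholding rule can be illustrated with a short Python sketch. Splits are written as plain strings for readability, the function name `consensus_splits` is hypothetical, and treating t = 100 as "displayed by every tree" is an assumption made so that the strict-consensus case works.

```python
from collections import Counter

def consensus_splits(trees, t):
    """Splits displayed by more than t percent of the trees.

    Assumption: t = 100 is read as "displayed by every tree".
    """
    n = len(trees)
    counts = Counter(s for tree in trees for s in set(tree))
    if t >= 100:
        return {s for s, c in counts.items() if c == n}
    return {s for s, c in counts.items() if c * 100 > t * n}

# Three completed trees on taxa a..e, each given by its non-trivial splits.
trees = [
    {"ab|cde", "abc|de"},
    {"ab|cde", "abd|ce"},
    {"ab|cde", "abc|de"},
]

all_splits = consensus_splits(trees, 0)    # every split seen at least once
majority = consensus_splits(trees, 50)     # majority-rule consensus
strict = consensus_splits(trees, 100)      # strict consensus
print(sorted(all_splits), sorted(majority), sorted(strict))
```

In this toy input, "abc|de" and "abd|ce" are incompatible, so only the t = 0 result contains conflicting splits and would need a network rather than a tree to display.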

Simulation
• Three different types of input (three types of simulation):
  • Evolution is tree-like. Gene trees are correct, but miss taxa.
  • Evolution is tree-like. Gene trees have errors and miss taxa.
  • Evolution is not tree-like. Random input trees.
• In each simulation, three parameters were varied:
  • The species tree, either
    • the completely balanced tree on 16 taxa, or
    • the completely unbalanced tree on 16 taxa
  • g (the number of gene trees), taking values 2, 4, 8, 16, and 32
  • m (the number of taxa missing), taking values 1, 2, 3, 4, 5, and 6; the missing taxa were deleted at random
• One hundred repetitions were carried out for each parameter combination.
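As a quick check on the size of this design, the parameter grid can be enumerated in Python. The variable names are illustrative only, and g is read here as the number of gene trees. The last lines use the standard fact that an unrooted binary tree on n taxa has n − 3 internal edges.

```python
from itertools import product

tree_shapes = ["balanced", "unbalanced"]   # species trees, both on 16 taxa
g_values = [2, 4, 8, 16, 32]               # g, read as the number of gene trees
m_values = [1, 2, 3, 4, 5, 6]              # m, number of taxa missing
repetitions = 100

grid = list(product(tree_shapes, g_values, m_values))
total_runs = len(grid) * repetitions

n_taxa = 16
internal_edges = n_taxa - 3                # unrooted binary tree: n - 3 internal edges

print(len(grid), total_runs, internal_edges)  # 60 6000 13
```

The 6000 runs per simulation type appear to correspond to the "6000 different settings" quoted with the FP results, and 13 is the number of internal edges stated there for a tree on 16 taxa.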

Simulation
• The split systems generated were:
  • MRP: … and …, the splits in the majority-rule consensus and the strict consensus from MRP
  • Q-imputation: …, … and …
  • Z-closure: the splits generated using Z-closure
• Measuring FP and FN:
  • FP: splits contained in the output split system that are not in the input
  • FN: splits in the input that are not in the output split system
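With split systems stored as sets, the two error counts are plain set differences. The sketch below follows the FP and FN definitions on this slide; the function name `fp_fn` and the string encoding of splits are illustrative assumptions.

```python
def fp_fn(output_splits, input_splits):
    """Return (FP, FN) as sets, following the definitions on the slide."""
    out, ref = set(output_splits), set(input_splits)
    false_positives = out - ref    # in the output but not in the input
    false_negatives = ref - out    # in the input but not in the output
    return false_positives, false_negatives

input_splits = {"ab|cde", "abc|de"}
output_splits = {"ab|cde", "abd|ce"}
fp, fn = fp_fn(output_splits, input_splits)
print(len(fp), len(fn))  # 1 1
```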

WIP
• Definition: weak induction property (WIP):
  • For input trees T_1, …, T_k, any split S in the output split system should restrict to a split in T_i for some i.
• The WIP holds for all splits in … when the input trees are all subtrees of a single phylogenetic tree.
• There are examples where the WIP does not hold, although Q-imputation generates very few of them.
• Z-closure satisfies the WIP.
• Any method with the WIP cannot generate FP: every split in the output comes from some tree in the input set, so there is no split that appears in the output but not in the input.
• Q-imputation with t = 0 cannot produce FN.
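A WIP check for a single split can be sketched in Python as below. The names `restrict` and `satisfies_wip` are hypothetical. Requiring the restricted split to be non-trivial (at least two taxa on each side, via `min_side`) is an assumption, since the slide does not say how trivial restrictions are treated.

```python
def restrict(split, leaves):
    """Restrict a split A|B to a leaf set; None if one side becomes empty."""
    A, B = tuple(split)
    a, b = A & leaves, B & leaves
    return frozenset({a, b}) if a and b else None

def satisfies_wip(split, trees, min_side=2):
    """True if the split restricts to a split of at least one input tree.

    `trees` is a list of (leaf set, set of splits) pairs.
    Assumption: only restrictions with >= min_side taxa per side count.
    """
    for leaves, splits in trees:
        r = restrict(split, leaves)
        if r is not None and min(len(side) for side in r) >= min_side and r in splits:
            return True
    return False

def sp(left, right):
    """Hypothetical shorthand for building a split from two strings of taxa."""
    return frozenset({frozenset(left), frozenset(right)})

# Two partial input trees: ab|cd on {a,b,c,d} and cd|ef on {c,d,e,f}.
trees = [
    (frozenset("abcd"), {sp("ab", "cd")}),
    (frozenset("cdef"), {sp("cd", "ef")}),
]

ok = satisfies_wip(sp("ab", "cdef"), trees)    # restricts to ab|cd in the first tree
bad = satisfies_wip(sp("ac", "bdef"), trees)   # restricts to ac|bd and c|def only
print(ok, bad)  # True False
```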

Simulation results: FP
• Z-closure cannot generate FP, so we look only at the splits produced by Q-imputation and MRP.
• 6000 different settings for each type of simulation.
• Normalized numbers are given in parentheses.
• Each tree on 16 taxa has 13 internal edges.

Simulation 1 results: FN, normalized (%), for Z-closure, Q-imputation20, and MRP50

Discussion on simulation results
• By increasing the number of gene trees:
  • FN produced by Z-closure decreases (good)
  • FN produced by Q-imputation increases (bad)
• As a supertree method (simulations 1 and 2), Q-imputation tended to return fewer FP (unsupported) splits, but also fewer supported splits (more FN (?)), than MRP
• As a supernetwork method, Q-imputation tended to give rise to FP but not FN (?), whereas Z-closure gave rise to FN but no FP
• Also, in simulations where there was an underlying species tree, as the number of gene trees increased:
  • For Z-closure, the number of FN increased (?)
  • For the split system derived from applying a threshold to the trees completed by Q-imputation, the number of FN had the desirable property of decreasing (?)
• For the output to be visually palatable, we need some FN to restrict the number of splits being displayed.
• Q-imputation: a natural means to filter out splits.
• Look at the case study.

Case study
• 7 genes, 45 taxa
• Networks compared: Z-closure and Q-imputation
