A heuristic approach to Maximum Weighted Quartet Compatibility. Rezwana Reaz. Department of Computer Science University of Texas at Austin. Gene trees and species tree. Species tree – pattern of branching of species lineages via speciation.
A heuristic approach to Maximum Weighted Quartet Compatibility
Rezwana Reaz
Department of Computer Science, University of Texas at Austin
Gene trees and species tree
• Species tree – the pattern of branching of species lineages via speciation.
• Gene tree – a phylogenetic tree that depicts how a single gene has evolved in a group of related species.
Discordance
• Gene trees don’t necessarily show the same branching pattern as their containing species tree.
[Figure: a species tree and a gene tree on taxa A, B, C, D]
Reasons for Discordance
• Duplication and loss
• Horizontal Gene Transfer
• Incomplete Lineage Sorting / Deep Coalescence
Deep Coalescence
• Gene copies fail to coalesce at the speciation point: the coalescence of gene copies at a single locus extends deeper than the speciation events.
• Coalescent theory visualizes the process as if it operated backwards in time.
[Figure: coalescence within a population; axes: population size, generation. Courtesy: Shamsuzzoha Bayzid]
Discordance by Deep Coalescence
[Figure: discordance by deep coalescence on taxa A, B, C, D. Courtesy: Shamsuzzoha Bayzid]
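Two competing approaches: Estimating Species Trees from Multiple Genes
• Concatenation: combine gene 1, gene 2, ..., gene k and estimate the species tree from the combined data.
• Summary method: analyze each gene separately, then combine the resulting gene trees into a species tree.
Courtesy: Tandy Warnow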
Estimating Species Trees from Multiple Genes
• Existing summary methods: MP-EST, MRP, Greedy, etc.
• In this project, we have developed a new technique to estimate a species tree from a set of estimated gene trees, using quartet decomposition of the gene trees.
Motivation
• Anomalous Gene Tree (Degnan and Rosenberg, 2009): the most likely gene tree topology is different from the species tree topology.
• Decompose gene trees into quartets? “when there are only four species, with one lineage sampled from each, the most likely unrooted gene tree topology has the same unrooted topology as the species tree”. [Allman et al., 2011]
Estimating Species Tree from True Gene Trees
• Input: N genes, with a true gene tree for each of gene 1, gene 2, …, gene N.
• Quartet decomposition: decompose each gene tree into quartets, giving the sets Q1, Q2, …, QN.
• Estimate a species tree for every 4 species: for every 4 species, take the most frequent gene tree topology as the species tree. This gives the quartet set Q.
• Combine the unrooted 4-taxon species trees in Q into the species tree ST.
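The "most frequent topology for every 4 species" step can be sketched as follows. This is only an illustration: it assumes each gene tree has already been decomposed into quartets, and the names `quartet` and `dominant_quartets` are hypothetical, not taken from the slides.

```python
from collections import Counter, defaultdict

def quartet(a, b, c, d):
    """Unrooted quartet ab|cd, stored so that taxon order does not matter."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])

def dominant_quartets(per_gene_quartets):
    """For every 4-taxon set, keep the topology seen in the most gene trees.

    per_gene_quartets: one iterable of quartets per gene (Q1, ..., QN).
    Ties are broken arbitrarily (first topology seen wins).
    """
    counts = defaultdict(Counter)
    for gene_quartets in per_gene_quartets:
        for q in gene_quartets:
            taxa = frozenset().union(*q)  # the four taxa of this quartet
            counts[taxa][q] += 1
    return {taxa: c.most_common(1)[0][0] for taxa, c in counts.items()}

# Toy input: two genes support AB|CD, one supports AC|BD.
genes = [
    [quartet("A", "B", "C", "D")],
    [quartet("A", "B", "C", "D")],
    [quartet("A", "C", "B", "D")],
]
Q = dominant_quartets(genes)
```

Here `Q` maps the taxon set {A, B, C, D} to the topology AB|CD, which is the one a supertree step would then try to satisfy.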
Estimating Species Tree from Estimated Gene Trees
Phase 1: Generate Weighted Quartets
• Input: N genes, with bootstrap gene trees for each of gene 1, gene 2, …, gene N.
• Quartet decomposition: decompose the bootstrap trees of each gene into quartets and compute bootstrap support values, giving Q1 = (q11, b11), (q12, b12), …; Q2 = (q21, b21), (q22, b22), …; …; QN = (qN1, bN1), (qN2, bN2), ….
• Combine into one set and calculate weights: (q1, w1), (q2, w2), (q3, w3), …, where weight = average bootstrap support value over the N genes.
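A minimal sketch of the weight calculation, assuming each gene's quartets arrive as a mapping from quartet to bootstrap support. The slides do not say how a quartet that is missing from some gene is treated; this sketch assumes it contributes a support of 0 for that gene. The function name `combine_weights` and the string labels are hypothetical.

```python
from collections import defaultdict

def combine_weights(per_gene_support):
    """Average each quartet's bootstrap support over all N genes.

    per_gene_support: list of dicts {quartet: support}, one per gene.
    Assumption: a quartet absent from a gene counts as support 0 there.
    """
    n_genes = len(per_gene_support)
    totals = defaultdict(float)
    for support in per_gene_support:
        for q, b in support.items():
            totals[q] += b
    return {q: total / n_genes for q, total in totals.items()}

# Toy input with two genes; quartets are written as plain labels.
gene_support = [
    {"AB|CD": 0.9, "AC|BD": 0.1},
    {"AB|CD": 0.7, "AD|BC": 0.3},
]
weights = combine_weights(gene_support)
```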
Estimating Species Tree from Estimated Gene Trees
Phase 2: Supertree Construction
• Problem: Maximum Weighted Quartet Compatibility (MQC)
• Input: a set Q of quartets q1, q2, …, qk on a set of taxa, with positive weights w1, w2, …, wk respectively.
• Output: a tree T on the set of taxa that maximizes the sum of the weights of the quartets satisfied by T.
• We proposed a method, WQFM (Weighted Quartet FM).
[Figure: input quartets on taxa 1 to 5 combined into the species tree ST]
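The objective can be made concrete with a short sketch that scores a candidate tree. It assumes the tree is given by its internal edges, each written as the set of taxa on one side, and uses the standard criterion that a tree satisfies ab|cd when some edge separates {a, b} from {c, d}. The tree, quartets and weights below are made up for illustration.

```python
def is_satisfied(quartet, side, taxa):
    """True if the edge with `side` on one end separates the quartet's two pairs."""
    (a, b), (c, d) = quartet
    other = taxa - side
    return ({a, b} <= side and {c, d} <= other) or \
           ({a, b} <= other and {c, d} <= side)

def satisfied_weight(weighted_quartets, splits, taxa):
    """Sum of the weights of the quartets satisfied by the tree (the MQC objective)."""
    return sum(w for q, w in weighted_quartets
               if any(is_satisfied(q, side, taxa) for side in splits))

taxa = frozenset({1, 2, 3, 4, 5, 6})
# Internal edges of the unrooted tree ((1,2),3,(4,(5,6))).
splits = [frozenset({1, 2}), frozenset({1, 2, 3}), frozenset({5, 6})]
weighted_quartets = [
    (((1, 2), (3, 4)), 1.0),  # satisfied by the edge {1,2} | rest
    (((1, 3), (2, 4)), 0.5),  # not satisfied by any edge
    (((3, 4), (5, 6)), 2.0),  # satisfied by the edge {5,6} | rest
]
total = satisfied_weight(weighted_quartets, splits, taxa)
```

MQC asks for the tree that makes `total` as large as possible; this sketch only evaluates one given tree.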
Experimental Dataset
• 37-taxon Mammal Dataset
• 200 genes and 500 bp
• For each gene, 200 bootstrap replicate trees
• Under moderate ILS
Phase 1: Generate Weighted Quartets
Phase 2: Supertree Construction using WQFM
Results
Other results are obtained from the “Statistical Binning” project.
Proposed Method
Input: a set Q of quartets q1, q2, …, qk on a set of taxa P, with positive weights w1, w2, …, wk respectively.
Output: a tree T on the set of taxa P that maximizes the sum of the weights of the quartets satisfied by T.
A divide and conquer approach:
• Set of taxa P: {1, 2, 3, 4, 5, 6}
• Input quartets Q:
q1: ((1, 2), (3, 4))
q2: ((1, 3), (2, 4))
q3: ((2, 3), (4, 5))
q4: ((1, 2), (5, 6))
q5: ((3, 4), (5, 6))
q6: ((1, 3), (5, 6))
• Recursively subdivide P and Q.
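Proposed Method
A divide and conquer approach:
• Q: ((1, 2), (3, 4)), ((1, 2), (5, 6)), ((1, 3), (2, 4)), ((3, 4), (5, 6)), ((2, 3), (4, 5)), ((1, 3), (5, 6))
• P: {1, 2, 3, 4, 5, 6}
• P is divided into P1 = {1, 2, 3, X} and P2 = {4, 5, 6, X}, where the extra taxon X stands in for the other part. The quartets passed down are ((1, 2), (3, X)) and ((1, 3), (2, X)) for P1, and ((X, 4), (5, 6)) for P2.
• P1 is divided into {1, 2, Y} and {3, X, Y}; P2 is divided into {4, X, Y} and {5, 6, Y}.
[Figure: the subtrees on these parts joined at the matching X and Y taxa into a tree on taxa 1 to 6]
We call this method WQFM (Weighted Quartet FM).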
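The way Q is split along with P can be sketched as below. The rule is inferred from the example on this slide and is an assumption: a quartet with at least three taxa in one part goes to that part with the outside taxon replaced by the placeholder X, and a quartet split two-and-two is not passed down. Weights are omitted for brevity, and `subdivide` and `relabel` are hypothetical names.

```python
def relabel(quartet, side, dummy):
    """Replace every taxon outside `side` with the placeholder taxon."""
    return tuple(tuple(t if t in side else dummy for t in pair) for pair in quartet)

def subdivide(quartets, pa, pb, dummy="X"):
    """Split the quartet set along the bipartition (pa, pb)."""
    qa, qb = [], []
    for q in quartets:
        in_a = sum(t in pa for pair in q for t in pair)
        if in_a >= 3:      # mostly in pa: pass down to pa
            qa.append(relabel(q, pa, dummy))
        elif in_a <= 1:    # mostly in pb: pass down to pb
            qb.append(relabel(q, pb, dummy))
        # in_a == 2: already resolved by this bipartition, not passed down
    return qa, qb

Q = [((1, 2), (3, 4)), ((1, 2), (5, 6)), ((1, 3), (2, 4)),
     ((3, 4), (5, 6)), ((2, 3), (4, 5)), ((1, 3), (5, 6))]
Q1, Q2 = subdivide(Q, {1, 2, 3}, {4, 5, 6})
```

With the six quartets of the slide, `Q1` holds ((1, 2), (3, X)) and ((1, 3), (2, X)), and `Q2` holds ((X, 4), (5, 6)), which agrees with the quartets shown for P1 and P2.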
Future Work
• To analyze the approach
• For various simulated and biological datasets
• Under different model conditions
- Varying amount of ILS
- Varying number of genes
- Varying sequence length
Acknowledgement
• Dr. Shel Swenson, for helping me generate weighted quartets from a set of bootstrap gene trees.
• Md. Shamsuzzoha Bayzid, for helpful discussions and suggestions, and for helping me set up the experimental pipeline.
Thanks! Any questions?
Partition Score
• Partition: Pa = {1, 2}, Pb = {3, 4, 5, 6}
• Each quartet q1, …, q6 is labeled satisfied, violated, or deferred with respect to the partition.
• Partition Score = Sum of weights of satisfied – Sum of weights of violated
[Figure: quartets q1 to q6 drawn against the partition, with their labels]
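A small sketch of the labeling and the score. The slides name the three labels but do not define them; the definitions used here are an assumption: a quartet ab|cd is satisfied when the partition puts a, b on one side and c, d on the other, violated when it is split two-and-two in any other way, and deferred when three or more of its taxa fall on the same side. Unit weights are also an assumption.

```python
def classify(quartet, pa):
    """Label a quartet against the bipartition whose first part is `pa`."""
    (a, b), (c, d) = quartet
    in_a = [t in pa for t in (a, b, c, d)]
    if sum(in_a) != 2:
        return "deferred"          # three or more taxa on one side
    return "satisfied" if in_a[0] == in_a[1] else "violated"

def partition_score(weighted_quartets, pa):
    """Sum of weights of satisfied quartets minus sum of weights of violated ones."""
    score = 0.0
    for q, w in weighted_quartets:
        label = classify(q, pa)
        if label == "satisfied":
            score += w
        elif label == "violated":
            score -= w
    return score

quartets = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((2, 3), (4, 5)),
            ((1, 2), (5, 6)), ((3, 4), (5, 6)), ((1, 3), (5, 6))]
weighted = [(q, 1.0) for q in quartets]   # unit weights, assumed
labels = [classify(q, {1, 2}) for q in quartets]
score = partition_score(weighted, {1, 2})
```

For Pa = {1, 2} this labels two quartets satisfied, one violated and three deferred, so the score under unit weights is 1.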
Gain
• Partition before the move: Pa = {1, 2}, Pb = {3, 4, 5, 6}
• Partition Score = Sum of weights of satisfied – Sum of weights of violated
• Gain(3) = Partition Score (after moving 3) – Partition Score (before moving 3)
[Figure: quartets q1 to q6 relabeled after taxon 3 is moved; three are satisfied and three are deferred]
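The gain of a single move follows directly from the score. The sketch below assumes unit weights and the same satisfied/violated convention as a two-and-two split test (both are assumptions, not stated on the slide); moving a taxon is modeled by toggling its membership in Pa.

```python
def partition_score(weighted_quartets, pa):
    """Satisfied weight minus violated weight; deferred quartets contribute 0."""
    score = 0.0
    for ((a, b), (c, d)), w in weighted_quartets:
        in_a = [t in pa for t in (a, b, c, d)]
        if sum(in_a) == 2:                       # split two-and-two
            score += w if in_a[0] == in_a[1] else -w
    return score

def gain(weighted_quartets, pa, taxon):
    """Score after moving `taxon` to the other part, minus the score before."""
    moved = set(pa) ^ {taxon}                    # toggle membership in pa
    return partition_score(weighted_quartets, moved) - partition_score(weighted_quartets, pa)

quartets = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((2, 3), (4, 5)),
            ((1, 2), (5, 6)), ((3, 4), (5, 6)), ((1, 3), (5, 6))]
weighted = [(q, 1.0) for q in quartets]          # unit weights, assumed
g3 = gain(weighted, {1, 2}, 3)
g5 = gain(weighted, {1, 2}, 5)
```

Under these assumptions `g3` is 2 and `g5` is -3, the same values listed for Gain(3) and Gain(5) on the Bipartition Method slide.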
Bipartition Method
MFM (Modified FM) Bipartition Algorithm
Gains of the remaining taxa at each step of one pass:
• Step 1: Gain(1) = -1, Gain(2) = -1, Gain(3) = 2, Gain(4) = -1, Gain(5) = -3, Gain(6) = -2
• Step 2: Gain(1) = -4, Gain(2) = -2, Gain(4) = 0, Gain(5) = -4, Gain(6) = -4
• Step 3: Gain(1) = -2, Gain(2) = -2, Gain(5) = -3, Gain(6) = -3
• Step 4: Gain(1) = -1, Gain(5) = -2, Gain(6) = -3
• Step 5: Gain(5) = -1, Gain(6) = -2
• Step 6: Gain(6) = 2
Max Cumulative Gain = 2
[Figure: the partition of taxa 1 to 6 after each move]
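Bipartition Method
• Rollback: the partition is rolled back, and the result is the initial partition for the next iteration.
• Iterations continue until the Maximum Cumulative Gain is zero.
[Figure: the partition of taxa 1 to 6 after rollback]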
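One way to read the pass, rollback and stopping rule is sketched below. This is a generic FM-style loop under stated assumptions, not necessarily the exact MFM procedure: every taxon is moved once per pass in order of highest gain, ties go to the smallest label, the rollback returns to the earliest point of maximum cumulative gain, and weights are 1. The names `fm_pass` and `bipartition` are hypothetical.

```python
def partition_score(weighted_quartets, pa):
    """Satisfied weight minus violated weight; deferred quartets contribute 0."""
    score = 0.0
    for ((a, b), (c, d)), w in weighted_quartets:
        in_a = [t in pa for t in (a, b, c, d)]
        if sum(in_a) == 2:
            score += w if in_a[0] == in_a[1] else -w
    return score

def fm_pass(weighted_quartets, pa, pb):
    """Move every taxon once, best gain first; return the best prefix of moves."""
    pa, pb = set(pa), set(pb)
    free = pa | pb                       # taxa not yet moved (unlocked)
    cumulative, history = 0.0, []
    while free:
        base = partition_score(weighted_quartets, pa)
        best_gain, best_taxon = None, None
        for t in sorted(free):
            g = partition_score(weighted_quartets, pa ^ {t}) - base
            if best_gain is None or g > best_gain:
                best_gain, best_taxon = g, t
        pa ^= {best_taxon}               # move the taxon to the other part
        pb ^= {best_taxon}
        free.discard(best_taxon)         # lock it for the rest of the pass
        cumulative += best_gain
        history.append((cumulative, frozenset(pa), frozenset(pb)))
    # Rollback: max() keeps the earliest entry among equal cumulative gains.
    return max(history, key=lambda h: h[0])

def bipartition(weighted_quartets, pa, pb):
    """Repeat passes until a pass yields no positive cumulative gain."""
    while True:
        best, new_pa, new_pb = fm_pass(weighted_quartets, pa, pb)
        if best <= 0:
            return set(pa), set(pb)
        pa, pb = new_pa, new_pb

quartets = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((2, 3), (4, 5)),
            ((1, 2), (5, 6)), ((3, 4), (5, 6)), ((1, 3), (5, 6))]
weighted = [(q, 1.0) for q in quartets]  # unit weights, assumed
first_pass = fm_pass(weighted, {1, 2}, {3, 4, 5, 6})
final_pa, final_pb = bipartition(weighted, {1, 2}, {3, 4, 5, 6})
```

Starting from Pa = {1, 2}, the first pass of this sketch reaches a maximum cumulative gain of 2 and rolls back to {1, 2, 3} and {4, 5, 6}; the second pass finds no positive cumulative gain, so the loop stops there.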