Learn about species vs. gene trees, consensus tree types, and their asymptotic behavior. Explore gene tree discordance and probabilistic approaches optimizing consensus tree construction.
21 December 2007 Coalescent Consequences for Consensus Cladograms J. H. Degnan (1), M. Degiorgio (2), D. Bryant (3), and N. A. Rosenberg (1, 2) • (1) Dept. of Human Genetics, U. of Michigan • (2) Bioinformatics Program, U. of Michigan • (3) Dept. of Mathematics, U. of Auckland
Outline • Species trees vs. gene trees • Consensus tree background • Asymptotic consensus trees • Finite sample consensus trees • Consistency results • Conclusions
Why? Incomplete lineage sorting, horizontal gene transfer, sampling, etc.
Gene tree discordance • From one true species tree, we expect there to be different gene trees at different loci as a result of lineage sorting, independently of problems due to estimation or sampling error. • Gene tree discordance depends especially on branch lengths in the species tree, measured by the number of generations scaled by effective population size, t / N.
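As a minimal illustration of how branch length drives discordance, the sketch below uses the standard coalescent result for three taxa: with species tree ((AB)C) and an internal branch of length T in coalescent units, the gene tree matches the species tree with probability 1 - (2/3)e^(-T), and each of the two discordant topologies has probability (1/3)e^(-T). The function name is hypothetical, and T is assumed to be already scaled; the constant relating T to t / N depends on the ploidy convention.

```python
import math


def three_taxon_gene_tree_probs(T):
    """Gene tree topology probabilities for species tree ((AB)C).

    T is the internal branch length in coalescent units
    (assumed to be already scaled by effective population size).
    """
    # Each discordant topology requires that A and B fail to coalesce
    # in the internal branch (probability e^-T), then one of three
    # equally likely pairs coalesces first in the ancestral population.
    discordant = math.exp(-T) / 3.0
    return {
        "((AB)C)": 1.0 - 2.0 * discordant,
        "((AC)B)": discordant,
        "((BC)A)": discordant,
    }


for T in (0.05, 0.1, 1.0, 5.0):
    probs = three_taxon_gene_tree_probs(T)
    print(f"T = {T:>4}: P(gene tree matches species tree) = {probs['((AB)C)']:.3f}")
```

Short internal branches push the matching probability toward 1/3, which is why discordance is most severe for species trees with short branches.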
Types of consensus trees • Strict: only clades that are included in all observed trees are in the consensus tree. In the coalescent model, all clades have probability > 0, so with many loci some gene tree eventually lacks any given clade and the strict consensus loses resolution. • Democratic vote: use the gene tree that occurs most frequently. • Majority rule: the consensus tree has all clades that were observed in > 50% of trees. • Greedy: sort clades by their observed proportions, then accept clades one at a time, from most to least frequent, skipping any clade that is incompatible with those already accepted. Continue until the tree is fully resolved. • R*: for each set of 3 taxa, find the most commonly occurring triple, e.g., (AB)C, (AC)B, or (BC)A. Build the tree from the most commonly occurring triples.
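The strict, democratic-vote, and majority-rule definitions above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names are hypothetical, trees are assumed to be written in the slides' parenthesized notation with single-letter taxa, and identical topologies are assumed to be written as identical strings. The sample of 10 gene trees is made up for the example.

```python
from collections import Counter


def clades(tree):
    """Return the nontrivial clades of a tree such as '(((AB)C)D)'."""
    stack, found = [], set()
    for ch in tree:
        if ch == "(":
            stack.append(set())
        elif ch == ")":
            done = frozenset(stack.pop())
            found.add(done)
            if stack:
                stack[-1] |= done      # a finished clade belongs to its parent
        else:
            stack[-1].add(ch)          # single-letter taxon label
    n_taxa = max(len(c) for c in found)
    return {c for c in found if 1 < len(c) < n_taxa}   # drop the root clade


def clade_counts(trees):
    return Counter(c for t in trees for c in clades(t))


def strict(trees):
    """Clades present in every observed tree."""
    return {c for c, k in clade_counts(trees).items() if k == len(trees)}


def majority_rule(trees):
    """Clades present in more than 50% of observed trees."""
    return {c for c, k in clade_counts(trees).items() if 2 * k > len(trees)}


def democratic_vote(trees):
    """The single most frequently observed gene tree."""
    return Counter(trees).most_common(1)[0][0]


# Hypothetical sample of 10 gene trees.
sample = ["(((AB)C)D)"] * 4 + ["((AB)(CD))"] * 3 + ["(((AC)B)D)"] * 3

print("strict:         ", sorted("".join(sorted(c)) for c in strict(sample)))
print("majority rule:  ", sorted("".join(sorted(c)) for c in majority_rule(sample)))
print("democratic vote:", democratic_vote(sample))
```

In this sample no clade appears in all 10 trees, so the strict consensus is unresolved, while {AB} and {ABC} each appear in 7 of 10 trees and survive majority rule.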
Asymptotic consensus trees • Consensus trees are usually statistics, functions of data like x-bar. • We consider replacing observed (estimated) gene trees with their theoretical probabilities under coalescence and determining the resulting consensus tree. • Motivation: if there are a large number of independent loci, observed gene tree and clade proportions should approximate their theoretical probabilities.
Tree/Clade Probability Examples

Gene tree or clade    Probability        x = y = 0.1    x = y = 0.05
((AB)(CD))            p1                 0.128          0.121
((AC)(BD))            p2                 0.099          0.105
((AD)(BC))            p3                 0.099          0.105
(((AB)C)D)            p4                 0.104          0.079
(((AB)D)C)            p5                 0.091          0.075
(((AC)B)D)            p6                 0.066          0.061
(((AC)D)B)            p7                 0.062          0.060
(((AD)B)C)            p8                 0.037          0.045
(((AD)C)B)            p9                 0.037          0.045
(((BC)A)D)            p10                0.066          0.061
(((BC)D)A)            p11                0.062          0.060
(((BD)A)C)            p12                0.037          0.045
(((BD)C)A)            p13                0.037          0.045
(((CD)A)B)            p14                0.037          0.045
(((CD)B)A)            p15                0.037          0.045
{AB}                  p1 + p4 + p5       0.332 (1)      0.275 (1)
{AC}                  p2 + p6 + p7       0.227 (2)      0.226 (2)
{AD}                  p3 + p8 + p9       0.173 (6)      0.189 (7)
{BC}                  p3 + p10 + p11     0.226 (3)      0.226 (2)
{BD}                  p2 + p12 + p13     0.173 (6)      0.195 (6)
{CD}                  p1 + p14 + p15     0.202 (5)      0.211 (4)
{ABC}                 p4 + p6 + p10      0.215 (4)      0.201 (5)
{ABD}                 p5 + p8 + p12      0.165 (8)      0.165 (8)
{ACD}                 p7 + p9 + p14      0.136 (9)      0.150 (9)
{BCD}                 p11 + p13 + p15    0.136 (9)      0.150 (9)
Greedy tree                              (((AB)C)D)     ((AB)(CD))

Numbers in parentheses give each clade's rank by probability.
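The greedy construction can be traced on the gene tree probabilities in this table. The sketch below is an illustration under stated assumptions: the function names are hypothetical, clade probabilities are recomputed by summing the rounded three-decimal tree probabilities (so they need not agree exactly with the clade values printed on the slide), and ties are broken alphabetically, a choice the slides do not specify.

```python
# Gene tree probabilities from the slide: (x = y = 0.1, x = y = 0.05).
TREE_PROBS = {
    "((AB)(CD))": (0.128, 0.121), "((AC)(BD))": (0.099, 0.105),
    "((AD)(BC))": (0.099, 0.105), "(((AB)C)D)": (0.104, 0.079),
    "(((AB)D)C)": (0.091, 0.075), "(((AC)B)D)": (0.066, 0.061),
    "(((AC)D)B)": (0.062, 0.060), "(((AD)B)C)": (0.037, 0.045),
    "(((AD)C)B)": (0.037, 0.045), "(((BC)A)D)": (0.066, 0.061),
    "(((BC)D)A)": (0.062, 0.060), "(((BD)A)C)": (0.037, 0.045),
    "(((BD)C)A)": (0.037, 0.045), "(((CD)A)B)": (0.037, 0.045),
    "(((CD)B)A)": (0.037, 0.045),
}
N_TAXA = 4


def clades(tree):
    """Return the nontrivial clades of a tree such as '(((AB)C)D)'."""
    stack, found = [], set()
    for ch in tree:
        if ch == "(":
            stack.append(set())
        elif ch == ")":
            done = frozenset(stack.pop())
            found.add(done)
            if stack:
                stack[-1] |= done
        else:
            stack[-1].add(ch)
    return {c for c in found if 1 < len(c) < N_TAXA}


def clade_probs(col):
    """Sum tree probabilities over the trees containing each clade."""
    totals = {}
    for tree, probs in TREE_PROBS.items():
        for c in clades(tree):
            totals[c] = totals.get(c, 0.0) + probs[col]
    return totals


def compatible(a, b):
    """Two clades can share a tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def greedy(col):
    probs = clade_probs(col)
    accepted = []
    # Most probable clade first; ties broken alphabetically (an assumption).
    for c in sorted(probs, key=lambda c: (-probs[c], sorted(c))):
        if all(compatible(c, a) for a in accepted):
            accepted.append(c)
        if len(accepted) == N_TAXA - 2:   # a rooted binary tree is fully resolved
            break
    return set(accepted)


for col, label in enumerate(("x = y = 0.1", "x = y = 0.05")):
    print(label, sorted("".join(sorted(c)) for c in greedy(col)))
```

Under these assumptions the accepted clades are {AB} and {ABC} in the first column and {AB} and {CD} in the second, matching the greedy trees in the last row of the table.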
Tree/Triple Probability Examples

Gene tree or triple   Probability                    x = y = 0.1    x = y = 0.05
((AB)(CD))            p1                             0.128          0.121
((AC)(BD))            p2                             0.099          0.105
((AD)(BC))            p3                             0.099          0.105
(((AB)C)D)            p4                             0.104          0.079
(((AB)D)C)            p5                             0.091          0.075
(((AC)B)D)            p6                             0.066          0.061
(((AC)D)B)            p7                             0.062          0.060
(((AD)B)C)            p8                             0.037          0.045
(((AD)C)B)            p9                             0.037          0.045
(((BC)A)D)            p10                            0.066          0.061
(((BC)D)A)            p11                            0.062          0.060
(((BD)A)C)            p12                            0.037          0.045
(((BD)C)A)            p13                            0.037          0.045
(((CD)A)B)            p14                            0.037          0.045
(((CD)B)A)            p15                            0.037          0.045
(AB)C*                p1 + p4 + p5 + p8 + p12        0.397          0.365
(AC)B                 p2 + p6 + p7 + p9 + p14        0.301          0.316
(AB)D*                p1 + p4 + p5 + p6 + p10        0.455          0.397
(AD)B                 p3 + p7 + p8 + p9 + p14        0.272          0.391
(AC)D*                p2 + p4 + p6 + p7 + p10        0.397          0.366
(AD)C                 p3 + p5 + p8 + p9 + p12        0.301          0.315
(BC)D*                p3 + p4 + p6 + p10 + p11       0.397          0.366
(BD)C                 p2 + p5 + p8 + p12 + p13       0.301          0.315
R* tree                                              (((AB)C)D)     (((AB)C)D)
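The triple probabilities and the resulting R* tree can be recomputed from the gene tree probabilities listed on this slide. The sketch below is one possible reading of "build the tree from the most commonly occurring triples": a set of taxa is kept as a clade when, for every pair inside it and every taxon outside it, the triple grouping that pair is the most probable one for those three taxa. The function names are hypothetical, and the sums use the rounded three-decimal values, so small differences from the printed triple values are possible.

```python
from itertools import combinations

# Gene tree probabilities from the slide: (x = y = 0.1, x = y = 0.05).
TREE_PROBS = {
    "((AB)(CD))": (0.128, 0.121), "((AC)(BD))": (0.099, 0.105),
    "((AD)(BC))": (0.099, 0.105), "(((AB)C)D)": (0.104, 0.079),
    "(((AB)D)C)": (0.091, 0.075), "(((AC)B)D)": (0.066, 0.061),
    "(((AC)D)B)": (0.062, 0.060), "(((AD)B)C)": (0.037, 0.045),
    "(((AD)C)B)": (0.037, 0.045), "(((BC)A)D)": (0.066, 0.061),
    "(((BC)D)A)": (0.062, 0.060), "(((BD)A)C)": (0.037, 0.045),
    "(((BD)C)A)": (0.037, 0.045), "(((CD)A)B)": (0.037, 0.045),
    "(((CD)B)A)": (0.037, 0.045),
}
TAXA = "ABCD"


def clades(tree):
    """Return all clades (root included) of a tree such as '(((AB)C)D)'."""
    stack, found = [], set()
    for ch in tree:
        if ch == "(":
            stack.append(set())
        elif ch == ")":
            done = frozenset(stack.pop())
            found.add(done)
            if stack:
                stack[-1] |= done
        else:
            stack[-1].add(ch)
    return found


def triple_probs(col):
    """Probability of each rooted triple, keyed as ((a, b), outgroup)."""
    totals = {}
    for tree, probs in TREE_PROBS.items():
        tree_clades = clades(tree)
        for trio in combinations(TAXA, 3):
            for pair in combinations(trio, 2):
                out = (set(trio) - set(pair)).pop()
                # The tree displays (ab)c if some clade holds a and b but not c.
                if any(set(pair) <= c and out not in c for c in tree_clades):
                    totals[(pair, out)] = totals.get((pair, out), 0.0) + probs[col]
    return totals


def r_star(col):
    probs = triple_probs(col)
    winners = set()
    for trio in combinations(TAXA, 3):
        options = [(pair, (set(trio) - set(pair)).pop())
                   for pair in combinations(trio, 2)]
        winners.add(max(options, key=lambda o: probs.get(o, 0.0)))
    result = set()
    for size in range(2, len(TAXA)):
        for group in combinations(TAXA, size):
            outside = [t for t in TAXA if t not in group]
            if all((pair, c) in winners
                   for pair in combinations(group, 2) for c in outside):
                result.add(frozenset(group))
    return result


for col, label in enumerate(("x = y = 0.1", "x = y = 0.05")):
    print(label, sorted("".join(sorted(c)) for c in r_star(col)))
```

Under this reading both columns give the clades {AB} and {ABC}, which is the tree (((AB)C)D) reported in the last row of the table.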
What about finite samples? • If you sample 10 loci, you could have: • All 10 match the species tree • 9 match the species tree, 1 disagrees • 8 match the species tree, 2 disagree, etc. • You can treat gene tree topologies as categories and use multinomial probabilities to compute the probability of your sample.
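A minimal sketch of the multinomial calculation follows. The function name and the three category probabilities (one matching topology and two discordant ones) are hypothetical values chosen for illustration, not numbers from the slides.

```python
from math import factorial


def multinomial_prob(counts, probs):
    """Probability of observing `counts` loci in each gene tree category.

    counts[i] is the number of sampled loci with topology i and
    probs[i] is the theoretical probability of that topology.
    """
    coeff = factorial(sum(counts))
    weight = 1.0
    for k, p in zip(counts, probs):
        coeff //= factorial(k)     # stays an integer at every step
        weight *= p ** k
    return coeff * weight


# Hypothetical topology probabilities: matching tree, then two discordant trees.
probs = (0.6, 0.2, 0.2)

print("all 10 match:      ", multinomial_prob((10, 0, 0), probs))
print("9 match, 1 differs:", multinomial_prob((9, 1, 0), probs)
      + multinomial_prob((9, 0, 1), probs))
```

Summing such terms over every sample composition that yields a given consensus tree gives the probability that a finite sample of loci produces that consensus tree.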
Are consensus trees inconsistent estimators of species trees? • Theorem 1. Majority-rule asymptotic consensus trees (MACTs) do not have any clades not on the species tree. • Theorem 2. Greedy asymptotic consensus trees (GACTs) can be misleading estimators of species trees for the 4-taxon asymmetric tree and for any species tree with n > 4 species. • Theorem 3. R* asymptotic consensus trees (RACTs) always match the species tree.
Conclusions • Coalescent gene tree probabilities are useful for understanding asymptotic behavior of consensus trees constructed from independent gene trees. • Greedy consensus trees can be misleading, but are typically quicker to approach the species tree than majority-rule or R* when outside of the greedy zone. • R* consensus trees are consistent and more resolved than majority-rule consensus trees.