Genome Rearrangement Phylogeny • Robert K. Jansen, School of Biology, University of Texas at Austin • Bernard M.E. Moret, Department of Computer Science, University of New Mexico • Li-San Wang and Tandy Warnow, Department of Computer Sciences, University of Texas at Austin
Outline • Introduction • Genome rearrangement phylogeny reconstruction • Application • Other methods • Future research
New Phylogenetic Signals • High-throughput sequencing efforts lead to larger datasets • Challenge: inferring deep evolutionary events • Biologists are turning to “rare genomic changes”, which are: • Rare • Drawn from a large state space • High in signal-to-noise ratio • Potentially able to clarify early evolution • Best studied: gene order evolution (genome rearrangement)
Genomes As Signed Permutations • The same circular genome can be written as 1 -5 3 4 -2 -6, or 5 -1 6 2 -4 -3, etc.
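The two orderings on this slide describe the same circular genome: one is the other read from a different starting gene on the opposite strand. A minimal sketch of that equivalence test follows. The helper names (`reverse_complement`, `canonical`, `same_circular_genome`) are illustrative and not taken from the talk.

```python
def reverse_complement(genome):
    """Read the genome on the opposite strand: reverse the order, flip every sign."""
    return [-g for g in reversed(genome)]

def rotations(genome):
    """All rotations of a circular genome stored as a list."""
    return [genome[i:] + genome[:i] for i in range(len(genome))]

def canonical(genome):
    """Pick the lexicographically smallest list among all rotations of both strands."""
    genome = list(genome)
    candidates = rotations(genome) + rotations(reverse_complement(genome))
    return tuple(min(candidates))

def same_circular_genome(g1, g2):
    """Two signed orderings are equivalent if they share a canonical form."""
    return canonical(g1) == canonical(g2)

a = [1, -5, 3, 4, -2, -6]
b = [5, -1, 6, 2, -4, -3]
print(same_circular_genome(a, b))  # True
```

Comparing canonical forms costs O(n^2) here, which is fine for illustration; it is not meant as an efficient implementation.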
Gene Order Data • Rare changes on the genomic scale • Large state space • DNA: 4 states/character • Protein (amino acid sequence): 20 states/character • Circular gene order with 120 genes: 2^119 × 119! states/character (the number of signed circular orderings of n genes is 2^(n-1) × (n-1)!) • High signal-to-noise ratio
Genomes Evolve by Rearrangements • Original: 1 2 3 4 5 6 7 8 9 10 • Inversion: 1 2 -6 -5 -4 -3 7 8 9 10 • Transposition: 1 2 7 8 3 4 5 6 9 10 • Inverted Transposition: 1 2 7 8 -6 -5 -4 -3 9 10
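The three operations can be sketched as list manipulations on a genome stored as a Python list of signed integers. The function names and the index convention (half-open slices `g[i:j]`, with `k >= j` as the reinsertion point) are assumptions made for this illustration.

```python
def inversion(g, i, j):
    """Reverse the segment g[i:j] and flip the sign of every gene in it."""
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]

def transposition(g, i, j, k):
    """Cut out g[i:j] and reinsert it just before original position k (k >= j)."""
    return g[:i] + g[j:k] + g[i:j] + g[k:]

def inverted_transposition(g, i, j, k):
    """Same as a transposition, but the moved segment is also inverted."""
    return g[:i] + g[j:k] + [-x for x in reversed(g[i:j])] + g[k:]

genome = list(range(1, 11))
print(inversion(genome, 2, 6))                  # [1, 2, -6, -5, -4, -3, 7, 8, 9, 10]
print(transposition(genome, 2, 6, 8))           # [1, 2, 7, 8, 3, 4, 5, 6, 9, 10]
print(inverted_transposition(genome, 2, 6, 8))  # [1, 2, 7, 8, -6, -5, -4, -3, 9, 10]
```

The three calls reproduce the three example genomes on the slide.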
Edit Distances Between Genomes • (INV) Inversion distance [Hannenhalli & Pevzner 1995] • Computable in linear time [Moret et al. 2001] • (BP) Breakpoint distance [Watterson et al. 1982] • Computable in linear time • NJ(BP): [Blanchette, Kunisawa, Sankoff 1999] • Example: A = 1 2 3 4 5 6 7 8 9 10, B = 1 2 3 -8 -7 -6 4 5 9 10, BP(A,B) = 3
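The breakpoint distance in the example can be checked with a short sketch. It assumes signed circular genomes over the same gene set and treats the adjacency (a, b) as identical to (-b, -a), its reading on the opposite strand. The function names are illustrative.

```python
def adjacencies(genome):
    """Strand-independent set of adjacencies of a signed circular genome."""
    n = len(genome)
    adj = set()
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]
        # (a, b) and (-b, -a) are the same adjacency read on opposite strands
        adj.add(min((a, b), (-b, -a)))
    return adj

def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    return len(adjacencies(g1) - adjacencies(g2))

A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [1, 2, 3, -8, -7, -6, 4, 5, 9, 10]
print(breakpoint_distance(A, B))  # 3
```

The three breakpoints are the adjacencies (3,4), (5,6) and (8,9) of A, none of which survives in B. With hashed sets the computation takes expected linear time in the number of genes.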
Our Model: the Generalized Nadeau-Taylor Model [STOC’01] • Three types of events: • Inversions (INV) • Transpositions (TRP) • Inverted Transpositions (ITP) • Events of the same type are equiprobable • Probabilities of the three types have fixed ratio • We focus on signed circular genomes in this talk.
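A rough simulation sketch of this model is given below. It is a simplification rather than the model's exact definition: cut points are drawn uniformly after a random rotation of the circular genome, which only approximates "events of the same type are equiprobable". The parameter names `p_inv` and `p_trp` are assumptions, and the remaining probability mass goes to inverted transpositions.

```python
import random

def apply_random_event(genome, rng, p_inv=1/3, p_trp=1/3):
    """Apply one random inversion, transposition, or inverted transposition."""
    n = len(genome)
    # Rotate to a random origin so that segments may span the list boundary.
    start = rng.randrange(n)
    g = genome[start:] + genome[:start]
    r = rng.random()
    if r < p_inv:
        i, j = sorted(rng.sample(range(n + 1), 2))
        return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]
    i, j, k = sorted(rng.sample(range(n + 1), 3))
    segment = g[i:j]
    if r >= p_inv + p_trp:
        # inverted transposition: the moved segment is also reversed and negated
        segment = [-x for x in reversed(segment)]
    return g[:i] + g[j:k] + segment + g[k:]

def evolve(genome, num_events, rng, p_inv=1/3, p_trp=1/3):
    """Apply num_events random events in sequence."""
    for _ in range(num_events):
        genome = apply_random_event(genome, rng, p_inv, p_trp)
    return genome

rng = random.Random(0)
descendant = evolve(list(range(1, 11)), 5, rng)
print(descendant)
```

Setting `p_inv=1.0` gives inversion-only evolution; `p_inv=0.0` gives transpositions and inverted transpositions only.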
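Simulation Study Protocol • A true tree (known in simulation) and an evolutionary process generate the synthetic input • A phylogenetic method turns the synthetic input into an inferred tree • The inferred tree is compared with the true tree • [Figure: true tree and inferred tree on leaves A to F]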
Quantifying Error • FN: false negative (an edge of the true tree that is missing from the inferred tree) • Example: 1 of 3 edges missing, 1/3 = 33.3% error rate
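One common way to compute this rate is to describe each internal edge by the bipartition of leaves it induces and count the true bipartitions absent from the inferred tree. The sketch below assumes that representation. The two six-leaf topologies are hypothetical examples, not the trees drawn on the slide.

```python
def normalize(split, leaves):
    """Represent a bipartition by the side that excludes the smallest leaf name."""
    side = frozenset(split)
    reference = min(leaves)
    return frozenset(leaves) - side if reference in side else side

def fn_rate(true_splits, inferred_splits, leaves):
    """Fraction of true internal edges that are missing from the inferred tree."""
    true_set = {normalize(s, leaves) for s in true_splits}
    inferred_set = {normalize(s, leaves) for s in inferred_splits}
    return len(true_set - inferred_set) / len(true_set)

leaves = set("ABCDEF")
# Hypothetical true tree: internal edges AB|CDEF, CD|ABEF, EF|ABCD
true_tree = [{"A", "B"}, {"C", "D"}, {"E", "F"}]
# Hypothetical inferred tree: internal edges AB|CDEF, EF|ABCD, ABC|DEF
inferred_tree = [{"C", "D", "E", "F"}, {"E", "F"}, {"D", "E", "F"}]
print(fn_rate(true_tree, inferred_tree, leaves))  # 1 of 3 true edges is missing
```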
Outline • Genome rearrangement evolution • Genome rearrangement phylogeny reconstruction • Application • Other methods • Future research
Breakpoint Phylogeny [Sankoff & Blanchette 1998] • “Maximum Parsimony”-style problem: • Find tree(s), leaf-labeled by genomes, with shortest breakpoint length • NP-hard on two levels: • Finding the shortest tree (the space of trees has exponential size) • Given a tree, finding its breakpoint length (NP-hard even for a tree with 3 leaves, although that case can be reduced to TSP) • BPAnalysis [Sankoff & Blanchette 1998] • Would take about 200 years to analyze our 13-taxon dataset on a Sun workstation
BPAnalysis • Tree length evaluation for EVERY tree • Given a fixed tree topology, evaluate the tree length: • Iteratively solve the median problem (tree length for a 3-leaf tree) • [Figure: example tree with nodes labeled A, B, C, X, Y, Z and updated labels X’, Y’]
GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms) http://www.cs.unm.edu/~moret/GRAPPA/ • Uses lowerbound techniques to speed up the search • Used on real datasets, producing thousand-fold speedups over BPAnalysis [ISMB’01] • Contributors (led by Bernard Moret at UNM): U. New Mexico, U. Texas at Austin, Università di Bologna, Italy
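The Circular Lowerbound of the Length of a Tree • Given a tree, we can lowerbound its length very quickly: for leaves 1, 2, 3, 4 taken in circular order around the tree, d(1,2) + d(2,3) + d(3,4) + d(4,1) ≤ 2 × (tree length) • [Figure: tree with leaves 1, 2, 3, 4]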
The Lowerbound Technique • Avoid any tree X without potential: • a tree X whose circular lowerbound sum lb(X) is higher than twice the length c(T) of the best tree T found so far • Finding a good starting tree quickly is of utmost importance • We turn to distance-based methods • Neighbor joining (NJ) [Saitou and Nei 1987] • Weighbor [Bruno et al. 2000]
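The pruning test can be sketched as follows, taking lb(X) to be the sum of pairwise distances between leaves that are consecutive in a circular ordering of X. Every edge of X is crossed at least twice by such a tour, so lb(X) is at most twice the length of X. The toy distance matrix and the function names are assumptions made for illustration, not GRAPPA's code.

```python
def circular_sum(leaf_order, dist):
    """Sum of distances between consecutive leaves in a circular ordering of the tree."""
    n = len(leaf_order)
    return sum(dist[leaf_order[i]][leaf_order[(i + 1) % n]] for i in range(n))

def can_prune(leaf_order, dist, best_length):
    """True if the tree with this circular leaf ordering cannot beat the best tree so far."""
    return circular_sum(leaf_order, dist) > 2 * best_length

# Toy distances consistent with the tree ((0,1),(2,3)): all five edges have length 1.
dist = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
best_length = 5  # length of ((0,1),(2,3))

print(circular_sum([0, 1, 2, 3], dist))            # 10, exactly twice the tree length
print(can_prune([0, 2, 1, 3], dist, best_length))  # True: topology ((0,2),(1,3)) is discarded
```

The leaf order passed in must be a circular ordering of the candidate tree itself (one obtained by walking around a planar drawing of it); an arbitrary order does not give a valid bound.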
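Additive Distance Matrix and True Evolutionary Distance (T.E.D.)
      S1  S2  S3  S4  S5
S1     0   9  15  14  17
S2         0  14  13  16
S3             0  13  16
S4                 0  13
S5                     0
[Figure: tree realizing the matrix, with leaf edges S1 = 5, S2 = 4, S3 = 7, S4 = 5, S5 = 8, an internal edge of length 3 separating S1 and S2 from the rest, and an internal edge of length 1 separating S4 and S5 from the rest]
Theorem [Waterman et al. 1977] Given an m×m additive distance matrix, we can reconstruct a tree realizing the distances in O(m^2) time.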
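Additivity of the 5×5 matrix on this slide can be checked with the four-point condition: for every four taxa, the two largest of the three pairwise sums must be equal. The sketch below is only a brute-force check over all quartets, not the O(m^2) reconstruction algorithm of the theorem.

```python
from itertools import combinations

# Distance matrix for S1..S5 from the slide, filled in symmetrically.
D = [
    [0, 9, 15, 14, 17],
    [9, 0, 14, 13, 16],
    [15, 14, 0, 13, 16],
    [14, 13, 13, 0, 13],
    [17, 16, 16, 13, 0],
]

def is_additive(d):
    """Four-point condition: in every quartet the two largest pairwise sums are equal."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if sums[1] != sums[2]:
            return False
    return True

print(is_additive(D))  # True
```

The equality test is exact, which suits integer matrices like this one; estimated distances would need a tolerance.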
Error Tolerance of Neighbor Joining • Theorem [Atteson 1999] Let {Dij} be the true evolutionary distances, and {dij} be the estimated distances for T. Let ε be the length of the shortest edge in T. If for all taxa i, j we have |Dij - dij| < ε/2, then neighbor joining returns T.
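The theorem's hypothesis is that every estimated distance lies within half the shortest edge length of the true distance. A small sketch of that check is shown below. The matrices are hypothetical toy values, and the function name is an assumption.

```python
def atteson_condition_holds(true_d, est_d, shortest_edge):
    """Check max |D_ij - d_ij| < shortest_edge / 2 over all pairs of taxa."""
    n = len(true_d)
    max_error = max(
        abs(true_d[i][j] - est_d[i][j]) for i in range(n) for j in range(i + 1, n)
    )
    return max_error < shortest_edge / 2

# Toy true distances for the tree ((0,1),(2,3)) with all edge lengths equal to 1.
true_d = [[0, 2, 3, 3], [2, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]
# Hypothetical estimates: one pair is off by 0.4, another by 0.6.
close = [[0, 2.4, 3, 3], [2.4, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]
far = [[0, 2.6, 3, 3], [2.6, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]

print(atteson_condition_holds(true_d, close, shortest_edge=1))  # True
print(atteson_condition_holds(true_d, far, shortest_edge=1))    # False
```

The condition is sufficient, not necessary: neighbor joining may still return T when the check fails.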
BP and INV • [Figure: two scatter plots for 120 genes under inversion-only evolution, BP/2 vs K and INV vs K, where K is the actual number of inversions]
NJ(BP) [Blanchette, Kunisawa, Sankoff 1999] and NJ(INV) • 120 genes, 160 leaves • Uniformly random tree • [Figure: two panels, inversions only and transpositions/inverted transpositions only]
Estimate True Evolutionary Distances Using BP • To use the scatter plot to estimate the actual number of events (K): • (1) Compute BP/2 • (2) From the curve, look up the corresponding value of K • [Figure: BP/2 vs K for 120 genes under inversion-only evolution, where K is the actual number of inversions]
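The lookup procedure can be imitated with a small simulation: build the BP/2 vs K curve empirically, then invert it. This is only a sketch in the spirit of the slide, not the IEBP or EDE estimator. The function names, the number of runs and the "first K whose mean reaches the observed value" rule are assumptions.

```python
import random

def adjacencies(genome):
    """Strand-independent adjacency set of a signed circular genome."""
    n = len(genome)
    return {
        min((genome[i], genome[(i + 1) % n]), (-genome[(i + 1) % n], -genome[i]))
        for i in range(n)
    }

def random_inversion(genome, rng):
    """Invert a uniformly chosen segment of the genome."""
    i, j = sorted(rng.sample(range(len(genome) + 1), 2))
    return genome[:i] + [-x for x in reversed(genome[i:j])] + genome[j:]

def expected_curve(n, max_k, runs, rng):
    """curve[k] = mean BP/2 after k random inversions of the identity genome."""
    identity = list(range(1, n + 1))
    base = adjacencies(identity)
    totals = [0.0] * (max_k + 1)
    for _ in range(runs):
        g = identity
        for k in range(1, max_k + 1):
            g = random_inversion(g, rng)
            totals[k] += len(base - adjacencies(g)) / 2
    return [t / runs for t in totals]

def estimate_k(observed_bp_half, curve):
    """Step (2): smallest K whose simulated mean BP/2 reaches the observed value."""
    for k, mean in enumerate(curve):
        if mean >= observed_bp_half:
            return k
    return len(curve) - 1  # observed value lies beyond the simulated range

rng = random.Random(1)
curve = expected_curve(n=120, max_k=60, runs=20, rng=rng)
print(estimate_k(10, curve))
```

Because each inversion creates at most two breakpoints, the simulated mean BP/2 never exceeds K, so the estimate is never below the observed BP/2.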
True Evolutionary Distance (t.e.d.) Estimators for Gene Order Data • IEBP: Inverting the Expected BreakPoint distance • EDE: Empirically Derived Estimator
True Evolutionary Distance Estimators • [Figure: two scatter plots for 120 genes under inversion-only evolution, BP vs K and Exact-IEBP vs K, where K is the actual number of inversions]
Variance of True Evolutionary Distance Estimators • There are new distance-based phylogeny reconstruction methods (though designed for DNA sequences), such as Weighbor [Bruno et al. 2000] • These methods use the variance of good t.e.d.’s, and yield more accurate trees than NJ • Variance estimates for the t.e.d.’s [Wang WABI’02] give Weighbor(IEBP) and Weighbor(EDE) • [Figure: K vs Exact-IEBP (120 genes)]
Using T.E.D. Helps • 120 genes, 160 leaves • Uniformly random tree • Transpositions/inverted transpositions only • 180 runs per figure • [Figure: plot with a 5% annotation]
Observations • EDE is the best distance estimator when used with NJ and Weighbor. • True evolutionary distance estimators are reliable even when we do not know the GNT model parameters (the probability ratios of the three types of events).
Outline • Genome rearrangement evolution • Genome rearrangement phylogeny reconstruction • Application • Other methods • Future research
Campanulaceae cpDNA • 13 taxa (tobacco as outgroup) • 105 gene segments • GRAPPA finds 216 trees with shortest breakpoint length (out of 654,729,075 trees) • Running time: • BPAnalysis takes 2 centuries on a Sun workstation • GRAPPA takes 1.5 hours on a 512-node supercluster • About 2300-fold speedup on a single node
Campanulaceae [Moret et al. ISMB 2001] • Strict consensus of 216 optimal trees found by GRAPPA • 6 out of 10 max. edges found • [Figure: consensus tree on Adenophora, Cyananthus, Tobacco, Merciera, Trachelium, Triodanis, Legousia, Platycodon, Wahlenbergia, Symphandra, Codonopsis, Campanula, Asyneuma]
Outline • Genome rearrangement evolution • Genome rearrangement phylogeny reconstruction • Application • Other methods • Future research
“Fast” Approaches for Genome Rearrangement Phylogeny • Basic technique: encode data as strings and apply maximum parsimony • Running time exponential in the number of genomes, but polynomial in the number of genes (faster than GRAPPA) • MPBE [ISMB’00]: Maximum Parsimony using Binary Encodings • MPME [Boore et al. Nature ’95, PSB’02]: Maximum Parsimony using Multi-state Encodings • The length of a tree using these two methods is a lowerbound of the true breakpoint length [Bryant ’01]
Maximum Parsimony using Binary Encoding (MPBE)
Input genomes (circular):
A: 1 2 3 4 = -4 -3 -2 -1
B: 1 -4 -3 -2 = 2 3 4 -1
C: 1 2 -3 -4 = 4 3 -2 -1
MPBE strings (one column per adjacency; 1 = present, 0 = absent):
    (1,2)  (2,3)  (3,4)  (4,1)  (1,-4)  (-2,1)  (2,-3)  (-3,-4)  (-4,1)
A:    1      1      1      1      0       0       0        0        0
B:    0      1      1      0      1       1       0        0        0
C:    1      0      0      0      0       0       1        1        1
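The binary strings for A, B and C can be reproduced with a short sketch. It assumes that each column corresponds to one adjacency, that (a, b) and (-b, -a) count as the same adjacency, and that columns are listed in order of first appearance; the function names are illustrative.

```python
def canonical_adjacency(a, b):
    """(a, b) and (-b, -a) are the same adjacency read on opposite strands."""
    return min((a, b), (-b, -a))

def mpbe_encode(genomes):
    """Return the adjacency columns and one 0/1 string per genome."""
    columns = []     # canonical adjacencies, in order of first appearance
    per_genome = {}
    for name, g in genomes.items():
        n = len(g)
        adj = [canonical_adjacency(g[i], g[(i + 1) % n]) for i in range(n)]
        per_genome[name] = set(adj)
        for a in adj:
            if a not in columns:
                columns.append(a)
    strings = {
        name: [int(c in present) for c in columns]
        for name, present in per_genome.items()
    }
    return columns, strings

genomes = {"A": [1, 2, 3, 4], "B": [1, -4, -3, -2], "C": [1, 2, -3, -4]}
columns, strings = mpbe_encode(genomes)
for name, bits in strings.items():
    print(name, bits)
# A [1, 1, 1, 1, 0, 0, 0, 0, 0]
# B [0, 1, 1, 0, 1, 1, 0, 0, 0]
# C [1, 0, 0, 0, 0, 0, 1, 1, 1]
```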
Maximum Parsimony using Multistate Encoding (MPME)
Input genomes (circular):
A: 1 2 3 4 = -4 -3 -2 -1
B: 1 -4 -3 -2 = 2 3 4 -1
C: 1 2 -3 -4 = 4 3 -2 -1
MPME strings (one site per signed gene; the state is the signed gene that follows it):
      1   2   3   4  -1  -2  -3  -4
A:    2   3   4   1  -4  -1  -2  -3
B:   -4   3   4  -1   2   1  -2  -3
C:    2  -3  -2   3   4  -1  -4   1
We use PAUP to solve Maximum Parsimony => Constraint: number of states per site cannot exceed 32
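The multistate strings for A, B and C can be reproduced with the sketch below. It assumes that the site for signed gene x records the signed gene immediately following x, read on whichever strand x appears, and that sites are ordered 1..n followed by -1..-n. The function name is illustrative.

```python
def mpme_encode(genome):
    """One site per signed gene; the state is the signed gene that follows it."""
    n = len(genome)
    successor = {}
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        successor[a] = b      # forward strand: a is followed by b
        successor[-b] = -a    # reverse strand: -b is followed by -a
    sites = list(range(1, n + 1)) + [-x for x in range(1, n + 1)]
    return [successor[s] for s in sites]

print(mpme_encode([1, 2, 3, 4]))     # [2, 3, 4, 1, -4, -1, -2, -3]
print(mpme_encode([1, -4, -3, -2]))  # [-4, 3, 4, -1, 2, 1, -2, -3]
print(mpme_encode([1, 2, -3, -4]))   # [2, -3, -2, 3, 4, -1, -4, 1]
```

Each site can take up to 2n - 2 states, which is why datasets with many genes can exceed a 32-state limit per site.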
NJ vs MP (120 genes, 160 genomes) • All three event types equiprobable • Datasets that exceed the 32-state limit for MPME are dropped
Inversion Phylogeny • Inversion median has higher running time than breakpoint median • Inversion phylogeny overall has shorter running time than breakpoint phylogeny, and returns more accurate trees [Moret et al. WABI ’02]
DCM-GRAPPA [Moret & Tang 2003] • Disk-Covering Method: divide the original problem into subproblems [Huson, Nettles, Parida, Warnow and Yooseph, 1998] • Uses inversion distance • DCM-GRAPPA: can now process thousands of genomes, each having hundreds of genes
Ongoing and Future Research • Genome rearrangement phylogeny with unequal gene content (duplications, deletions, etc.) • Non-uniform genome rearrangement models (segment-length dependent model, hotspots)
Acknowledgements • University of Texas: Tandy Warnow (Advisor), Robert K. Jansen, Stacia Wyman, Luay Nakhleh, Usman Roshan, Cara Stockham, Jerry Sun • University of New Mexico: Bernard M.E. Moret, David Bader, Jijun Tang, Mi Yan • Central Washington University: Linda Raubeson
Phylolab • Department of Computer Sciences, University of Texas at Austin • Please visit us at http://www.cs.utexas.edu/users/phylo/