Optimal Sequence Length Requirements for Phylogenetic Tree Reconstruction with Indels

Arun Ganesh (UC Berkeley), with Qiuyi (Richard) Zhang (UC Berkeley, Google).


Presentation Transcript


  1. Optimal Sequence Length Requirements for Phylogenetic Tree Reconstruction with Indels. Arun Ganesh (UC Berkeley), with Qiuyi (Richard) Zhang (UC Berkeley, Google)

  2. Setup • Start with a “model” binary tree • Leaves = extant species (Image source: Bulbapedia)

  12. Setup • Start with a “model” binary tree • Sample a uniformly random fixed-length bit string (DNA) for the root • DNA is inherited down the tree with mutations. On each edge, each bit: • substitutes with some probability • inserts a random bit with some probability • deletes with some probability • The algorithm is given the leaf bitstrings and must reconstruct the tree with high probability
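The mutation model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `mutate_along_edge` and `evolve`, the parameters `p_sub`, `p_ins`, `p_del`, the use of the same probabilities on every edge, and the exact order of events per bit (delete or substitute first, then possibly insert a random bit to the right) are all choices made here, not details taken from the slides.

```python
import random

def mutate_along_edge(bits, p_sub, p_ins, p_del, rng):
    """Pass a bitstring down one edge (event order per bit is an assumption)."""
    child = []
    for b in bits:
        if rng.random() < p_del:
            pass                      # the bit is deleted
        else:
            if rng.random() < p_sub:
                b = 1 - b             # the bit is substituted (flipped)
            child.append(b)
        if rng.random() < p_ins:
            child.append(rng.randint(0, 1))  # insert a random bit to the right
    return child

def evolve(tree, bits, params, rng, out=None):
    """Evolve `bits` down a tree given as nested pairs; leaves are name strings."""
    out = {} if out is None else out
    if isinstance(tree, str):
        out[tree] = bits
        return out
    for child in tree:
        evolve(child, mutate_along_edge(bits, *params, rng), params, rng, out)
    return out

# Hypothetical example: 4 leaves, a 64-bit root, the same (p_sub, p_ins, p_del) everywhere.
rng = random.Random(0)
root = [rng.randint(0, 1) for _ in range(64)]
leaves = evolve((("A", "B"), ("C", "D")), root, (0.05, 0.02, 0.02), rng)
```

A reconstruction algorithm would see only `leaves` (the four bitstrings, generally of different lengths) and would have to recover the nesting `(("A", "B"), ("C", "D"))`.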

  13. Goal • Reconstruct the tree with high probability • Using as short a sequence length as possible (as a function of […]) • While tolerating mutation probabilities that are as large as possible

  14. Motivation • Many practical applications: • Reconstructing paths of migration • Linking mutations to disease • Determining origins of pathogens and likely paths of contamination • Informing policy on conservation of species (Image source: Tim Lohrentz)

  15. Prior work • One method: try to align sequences, reducing to the substitution-only case. (Image source: Wikipedia)

  16. Prior work • Theorem [DMR05]: Can reconstruct the tree in the substitution-only case using […] bits. • With length-[…] bitstrings, the problem is impossible. • Requires […] (Kesten-Stigum threshold). • [BRZ95, Iof96, EKP00, BKM01, MSW04]: If […], need […] bits for reconstruction. • If we have good alignment methods, we’ve solved the problem to optimality!

  17. Prior work • Unfortunately, multiple sequence alignment (MSA) is NP-hard, and heuristics used in practice may induce problematic biases.

  18. Prior work • [GWEJ07, LG08, WSH08] give empirical evidence for biases in MSA. • [DR10, ABH12] provide some guarantees for reconstruction with indels, but require polynomial sequence lengths or very small […]. • What can we handle with […]? We show the answer is […]. (Figure labels: DR10, GZ18, ABH12, DMR05, ESSW99)

  19. Our contribution • In particular, if: • […] for all edges (Kesten-Stigum threshold) • Then we can reconstruct the tree with […]. • […] is off by a small multiplicative constant, and otherwise this result is optimal in every possible sense! • […] needed to avoid empty leaf strings

  20. Roadmap • 5. Future Directions

  25. Distance estimation (section: Estimating distances using bitwise correlation) • Well known: distance estimates that concentrate well suffice to reconstruct the tree. • High-level algorithm: • 1. Use distances to identify siblings • 2. Use distances to compute distances from parents • 3. Recurse on parents
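The three steps above can be sketched for the idealized case where the pairwise distances are exact and additive along the tree. The talk works with noisy estimates, so this is an assumption-laden illustration rather than the authors' procedure, and the names `find_siblings`, `merge_siblings`, and `reconstruct` are hypothetical. It uses two standard facts about additive tree distances: leaves i and j are siblings exactly when d(i, k) − d(j, k) is the same for every other node k, and their parent p satisfies d(p, k) = (d(i, k) + d(j, k) − d(i, j)) / 2.

```python
from itertools import combinations

def find_siblings(dist, nodes, tol=1e-9):
    """Step 1: return a pair whose distance difference to all other nodes is constant."""
    for i, j in combinations(nodes, 2):
        diffs = [dist[i][k] - dist[j][k] for k in nodes if k not in (i, j)]
        if max(diffs) - min(diffs) <= tol:
            return i, j
    raise ValueError("no sibling pair found; distances may not be additive")

def merge_siblings(dist, nodes, i, j, parent):
    """Step 2: replace siblings i, j by their parent and record the parent's distances."""
    others = [k for k in nodes if k not in (i, j)]
    dist[parent] = {}
    for k in others:
        d = (dist[i][k] + dist[j][k] - dist[i][j]) / 2
        dist[parent][k] = d
        dist[k][parent] = d
    return others + [parent]

def reconstruct(dist, leaves):
    """Step 3: repeat on the parents until only three nodes remain (needs >= 4 leaves)."""
    dist = {a: dict(row) for a, row in dist.items()}  # work on a copy
    nodes, merges = list(leaves), []
    while len(nodes) > 3:
        i, j = find_siblings(dist, nodes)
        parent = f"p{len(merges)}"
        nodes = merge_siblings(dist, nodes, i, j, parent)
        merges.append((parent, i, j))
    return merges, nodes, dist

# Hypothetical tree ((a,b),(c,d)) with edge lengths a:1, b:2, c:1, d:3, internal edge 1.
D = {
    "a": {"b": 3, "c": 3, "d": 5},
    "b": {"a": 3, "c": 4, "d": 6},
    "c": {"a": 3, "b": 4, "d": 4},
    "d": {"a": 5, "b": 6, "c": 4},
}
merges, remaining, dist = reconstruct(D, ["a", "b", "c", "d"])
```

Here the first merge joins `a` and `b`, and the new parent ends up at distance 2 from `c` and 4 from `d`, matching the edge lengths used to build `D`.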

  26. Bitwise correlation • To estimate distance in the substitution-only case, can use bitwise correlation (a linear rescaling of Hamming similarity). • Think of bits as ±1 instead of 0-1. Let […] be the […]th bit of […]’s bitstring. Bitwise correlation is […]. (Figure: example pairs with 50% similarity and 84% similarity. Image source: Clipart Library)
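A small sketch of the relationship between bitwise correlation and Hamming similarity. It assumes equal-length bitstrings, the sign convention 0 → +1 and 1 → −1 (the opposite convention gives the same correlation), and the average-of-products definition of correlation; the function names are hypothetical.

```python
def to_signs(bits):
    """Map 0/1 bits to +1/-1 (sign convention chosen for this sketch)."""
    return [1 - 2 * b for b in bits]

def bitwise_correlation(x, y):
    """Average of the products of corresponding +/-1 bits."""
    sx, sy = to_signs(x), to_signs(y)
    return sum(a * b for a, b in zip(sx, sy)) / len(sx)

def hamming_similarity(x, y):
    """Fraction of positions where the two bitstrings agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Agreeing positions contribute +1 and disagreeing ones -1, so
# correlation = 2 * similarity - 1.
x = [0] * 25
y = [0] * 21 + [1] * 4           # 21 of 25 positions agree: 84% similarity
sim = hamming_similarity(x, y)
corr = bitwise_correlation(x, y)
```

Under this rescaling, 50% similarity corresponds to correlation 0 (unrelated strings) and 84% similarity to correlation 0.68.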

  27. Bitwise correlation • Claim: If we define edge lengths as […], then […].
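One standard normalization consistent with a claim of this shape is worked out below. It is an assumption here, not a transcription of the slide: the notation p_e (substitution probability on edge e), σ (±1 bits), and P(a, b) (the path between a and b) is introduced only for this sketch, and the slide may have used a different constant factor in the edge length.

```latex
% Assumed notation: p_e < 1/2 is the substitution probability on edge e,
% \sigma_{a,i} \in \{+1,-1\} is the i-th bit of node a,
% P(a,b) is the set of edges on the path between a and b.
\[
  d(e) := -\ln\bigl(1 - 2p_e\bigr), \qquad d(a,b) := \sum_{e \in P(a,b)} d(e).
\]
% One edge (u,v): the bit is kept w.p. 1 - p_e and flipped w.p. p_e, so
\[
  \mathbb{E}\bigl[\sigma_{u,i}\,\sigma_{v,i}\bigr] = (1 - p_e) - p_e = 1 - 2p_e = e^{-d(e)}.
\]
% Flips on distinct edges are independent, so the factors multiply along the path:
\[
  \mathbb{E}\bigl[\sigma_{a,i}\,\sigma_{b,i}\bigr]
  = \prod_{e \in P(a,b)} (1 - 2p_e) = e^{-d(a,b)}.
\]
```

With this choice, distances add along paths and the expected bitwise correlation decays exponentially in the distance, which is what makes the correlation usable as a distance estimate.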

  28. Concentration of bitwise correlation • How well does it concentrate? • Rough analysis: • Bitwise correlation has standard deviation […]. • For the correlation to concentrate at distance […], need […].
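A quick numerical check of how the noise in bitwise correlation scales with the sequence length. This sketch assumes the rough analysis refers to the familiar fact that an average of k independent ±1 terms has standard deviation 1/√k; the symbol k for sequence length, the function names, and the trial counts are choices made here. It measures the correlation between two unrelated random strings, where the true signal is zero.

```python
import random
import statistics

def correlation_of_unrelated(k, rng):
    """Bitwise correlation of two independent uniformly random +/-1 strings of length k."""
    return sum(rng.choice((-1, 1)) * rng.choice((-1, 1)) for _ in range(k)) / k

def empirical_std(k, trials=400, seed=0):
    """Empirical standard deviation of the correlation over repeated trials."""
    rng = random.Random(seed)
    samples = [correlation_of_unrelated(k, rng) for _ in range(trials)]
    return statistics.pstdev(samples)

k = 400
noise = empirical_std(k)      # should be close to 1 / sqrt(k) = 0.05
```

If the expected correlation at distance d is e^(−d), as in a normalization like the one in the preceding claim, then the signal stands out from noise of order 1/√k only when e^(−d) is well above 1/√k, that is, for d below roughly (1/2) ln k.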

  29. Roadmap • 5. Future Directions (section: Using block signatures)

  30. Handling indels • Problem with bitwise correlation when indels are introduced: which bit in […]’s bitstring does bit […] of […]’s bitstring correspond to? What if it doesn’t appear at all? (Figure: bit correspondence in the substitution-only case versus with indels)

  31. Blockwise correlation • To handle shifts due to insertions and deletions, split bitstrings into blocks. • Assume […] everywhere (not too hard to generalize). • If we split bitstrings into blocks of length, say, […], most bits will stay within a block throughout the tree. • Any bit shifts by […] positions in expectation, and by at most […] with high probability, on one edge.

  32. Blockwise correlation • Define the signature of block […] in a bitstring as the sum of the bits in that block, divided by […]. • Signatures are robust to shifts, so they behave like bits in the substitution-only case, i.e. […] behaves like bitwise correlation. • Fixing any series of indels, […] is non-zero in expectation only if bits […] and […] correspond to each other.
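The signature idea can be sketched as follows. Several details are assumptions made for illustration: bits are mapped to ±1 before summing, the normalizer is the square root of the block length (so that a block of uniformly random bits has a signature of variance 1), trailing bits that do not fill a block are dropped, and the blockwise correlation is taken as the average product of matching signatures. The function names are hypothetical.

```python
import math

def signatures(bits, block_len):
    """Per-block sum of +/-1 bits, divided by sqrt(block_len) (assumed normalizer)."""
    signs = [1 - 2 * b for b in bits]
    n_blocks = len(signs) // block_len          # trailing partial block is dropped
    return [
        sum(signs[i * block_len:(i + 1) * block_len]) / math.sqrt(block_len)
        for i in range(n_blocks)
    ]

def blockwise_correlation(x, y, block_len):
    """Average product of the signatures of corresponding blocks."""
    sx, sy = signatures(x, block_len), signatures(y, block_len)
    m = min(len(sx), len(sy))
    if m == 0:
        raise ValueError("bitstrings are shorter than one block")
    return sum(a * b for a, b in zip(sx, sy)) / m

parent = [0, 0, 0, 0, 1, 1, 1, 1]
child = parent[1:]                              # one deletion shifts every later bit
sig_parent = signatures(parent, 4)
sig_child = signatures(child, 4)
```

In this toy example the deletion shifts all remaining bits by one position, yet the first block of `child` still shares three of its four bits with the first block of `parent`, so the two signatures stay positively correlated.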

  33. Blockwise correlation • Define […] • Lemma: […] (Annotations on the lemma: an error term to account for the tiny fraction of bits that move in or out of blocks; a decay in the number of corresponding bits between the blocks due to deletions)

  34. Blockwise correlation • Define […] • Lemma: […] • Concentration is a bit trickier because […] are not independent; for the sake of time, we will ignore this for now…

  35. Roadmap • 5. Future Directions (section: Reconstructing bitstrings)

  36. Challenges with […] • With […], bitwise correlation only concentrates at distance […]; the previous method only reconstructs the first […] levels.

  37. Challenges with […] • Use estimator […]. Correct in expectation! • But the variance of the correlation of […] and […] is large.

  38. Challenges with […] • To get better concentration, use a median-of-means approach. • Use the approach on the previous slide to estimate these distances (and thus […]): […] • Can show the median concentrates!
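For reference, a generic median-of-means estimator looks like the sketch below. How the talk forms its groups of estimates is not recoverable from the slide, so splitting the samples into consecutive equal-sized groups is an assumption, and `median_of_means` is a hypothetical name.

```python
import statistics

def median_of_means(samples, n_groups):
    """Split samples into n_groups consecutive groups, average each, take the median."""
    size = len(samples) // n_groups
    if size == 0:
        raise ValueError("need at least one sample per group")
    means = [
        statistics.mean(samples[g * size:(g + 1) * size])
        for g in range(n_groups)
    ]
    return statistics.median(means)

# One wild outlier ruins the plain mean but affects only one group mean.
samples = [1.0] * 9 + [1000.0]
plain_mean = statistics.mean(samples)
robust = median_of_means(samples, n_groups=5)
```

The point of the design is that each group mean only needs to be reasonably accurate with probability above one half. The median is then wrong only if more than half of the groups fail, which becomes exponentially unlikely in the number of groups when the groups are independent.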

  39. Roadmap • 5. Future Directions (section: Reconstructing signatures)

  40. Reconstructing signatures • Signatures are robust to shifts, so they behave similarly to bits in the substitution-only case. • This suggests our algorithm: apply the reconstruction scheme to signatures. • There are some technical challenges in the analysis that we need to overcome.

  41. Challenges with reconstruction with indels • In reconstructing signatures, bits appearing in blocks of ancestors but not children, or vice versa, may induce noise that is non-zero in expectation in the recursive estimator. • We show that since the noise also “decays”, it is tiny in expectation, so misalignment does not ruin the reconstructed signatures. (Figure labels: ancestor, which we condition on; descendant; signal; noise; decayed signal; noise)

  42. Challenges with reconstruction with indels • […] and […] are not independent, which makes the concentration analysis for the recursive estimator more challenging. • We show that the covariance of the reconstructed blockwise correlations is small, i.e. […] and […] are almost completely independent.

  43. Roadmap • 5. Future Directions (section: Future directions)

  44. Open questions/future directions • Our reconstruction guarantees are optimal up to the constant in the exponent of […]: what is the right constant? • We only use […] bits of information per […]-bit sequence, so there should be room for improvement. • An algorithm using all the bits of information might also tell us more about the evolutionary history. • Can we remove some of the strong assumptions in the model? • What if there isn’t sitewise independence of mutations? • What if the root bitstring isn’t chosen uniformly at random?

  45. Thank You! Questions? (Come chat at poster 99 tomorrow)
