Optimal Sequence Length Requirements for Phylogenetic Tree Reconstruction with Indels
Arun Ganesh (UC Berkeley)
With Qiuyi (Richard) Zhang (UC Berkeley, Google)

Setup
• Start with a “model” binary tree
• Leaves = extant species
Image source: Bulbapedia

Setup
• Start with a “model” binary tree
• Sample a uniformly random -bit string (DNA) for the root
• DNA is inherited down the tree with mutations. On edge , each bit:
  - Substitutes w.p.
  - Inserts a random bit w.p.
  - Deletes w.p.
• The algorithm is given the leaf bitstrings and must reconstruct the tree with high probability

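A minimal sketch of the mutation process on a single edge, in Python. The names `mutate_along_edge`, `p_sub`, `p_ins`, and `p_del` are placeholders chosen for illustration. The sketch assumes that the three events are mutually exclusive for each bit and that an inserted bit lands immediately to the right of the bit that triggered it; the slides do not pin down either detail.

```python
import random


def mutate_along_edge(bits, p_sub, p_ins, p_del, rng):
    """Pass a list of 0/1 bits across one tree edge (illustrative model only).

    Assumption: each bit independently experiences at most one event,
    chosen with probabilities p_del, p_sub, p_ins (which must sum to <= 1).
    """
    if p_sub + p_ins + p_del > 1:
        raise ValueError("event probabilities must sum to at most 1")
    child = []
    for b in bits:
        r = rng.random()
        if r < p_del:
            continue                          # bit is deleted
        elif r < p_del + p_sub:
            child.append(1 - b)               # bit is substituted (flipped)
        elif r < p_del + p_sub + p_ins:
            child.append(b)                   # bit survives unchanged ...
            child.append(rng.randint(0, 1))   # ... and a random bit is inserted after it
        else:
            child.append(b)                   # no event on this bit
    return child


if __name__ == "__main__":
    rng = random.Random(0)
    root = [rng.randint(0, 1) for _ in range(20)]
    leaf = mutate_along_edge(root, p_sub=0.1, p_ins=0.05, p_del=0.05, rng=rng)
    print(len(root), len(leaf))
```

Applying the function once per edge along a root-to-leaf path yields that leaf's bitstring, whose length generally differs from the root's once insertions and deletions are allowed.
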
Goal
• Reconstruct the tree with high probability
• Using as short a sequence length as possible (as a function of )
• While tolerating mutation probabilities that are as large as possible

Motivation
• Many practical applications:
  - Reconstructing paths of migration
  - Linking mutations to disease
  - Determining origins of pathogens and likely paths of contamination
  - Informing policy on conservation of species
Image source: Tim Lohrentz

Prior work
• One method: try to align sequences, reducing to the substitution-only case.
Image source: Wikipedia

Prior work
• Theorem [DMR05]: Can reconstruct the tree in the substitution-only case using bits.
  - With length bitstrings, problem is impossible.
  - Requires (Kesten-Stigum threshold).
• [BRZ95, Iof96, EKP00, BKM01, MSW04]: If , need bits for reconstruction.
• If we have good alignment methods, we’ve solved the problem to optimality!

Prior work
• Unfortunately, multiple sequence alignment is NP-hard, and heuristics used in practice may induce problematic biases.

Prior work
• [GWEJ07, LG08, WSH08] give empirical evidence for biases in MSA.
• [DR10, ABH12] provide some guarantees for reconstruction with indels, but require polynomial sequence lengths or very small .
• What can we handle with ?
[Figure labels: DR10, GZ18, ABH12, DMR05, ESSW99]
• We show the answer is .

Our contribution
• In particular, if:
  - for all edges (Kesten-Stigum threshold)
• Then we can reconstruct the tree with .
• is off by a small multiplicative constant, and otherwise this result is optimal in every possible sense!
• needed to avoid empty leaf strings

Roadmap
5. Future Directions

Distance estimation [Estimating distances using bitwise correlation]
• Well known: distance estimates that concentrate well suffice to reconstruct the tree.
• High-level algorithm:
  1. Use distances to identify siblings
  2. Use distances to compute distances from parents
  3. Recurse on parents

Bitwise correlation
• To estimate distance in the substitution-only case, we can use bitwise correlation (a linear rescaling of Hamming similarity).
• Think of bits as instead of -. Let be th bit of ’s bitstring. Bitwise correlation is .
[Figure labels: 50% similarity, 84% similarity]
Image source: Clipart Library

Bitwise correlation
• Claim: If we define edge lengths as
• then

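A small Python sketch of bitwise correlation and of why a logarithmic edge length makes distances add along a path. It assumes bits are encoded as +1/-1 and uses the edge length -ln(1 - 2p) for substitution probability p, a standard convention in substitution-only models that may differ from the exact formula on the slide. The function names are illustrative.

```python
import math


def bitwise_correlation(x, y):
    """Average of x_i * y_i over two equal-length sequences of +1/-1 bits.

    Equals 2 * (Hamming similarity) - 1.
    """
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a * b for a, b in zip(x, y)) / len(x)


def edge_length(p_sub):
    """Assumed edge length for substitution probability p_sub < 1/2."""
    return -math.log(1.0 - 2.0 * p_sub)


def path_flip_probability(p1, p2):
    """Probability that a bit ends up flipped after two independent edges."""
    return p1 * (1.0 - p2) + p2 * (1.0 - p1)


if __name__ == "__main__":
    # Three of four positions agree: similarity 75%, correlation 0.5.
    print(bitwise_correlation([1, 1, -1, -1], [1, -1, -1, -1]))
    # Edge lengths add along a path under this convention.
    combined = edge_length(path_flip_probability(0.1, 0.2))
    print(combined, edge_length(0.1) + edge_length(0.2))
```

Under this convention the expected correlation of one bit across an edge is 1 - 2p, so across a path it is the product of the per-edge factors, which is exp(-d) for the summed edge lengths d.
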
Concentration of bitwise correlation
• How well does it concentrate?
• Rough analysis:
  - Bitwise correlation has standard deviation .
  - For the correlation to concentrate at distance , need .

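The rough analysis can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent sites, a sequence of `k` bits (a placeholder name), and the convention that expected correlation at distance d is exp(-d). An average of `k` independent +1/-1 products has standard deviation at most 1/sqrt(k), so the signal exp(-d) is visible only while it exceeds that noise level.

```python
import math


def correlation_std_bound(k):
    """Upper bound on the standard deviation of an average of k +1/-1 terms."""
    return 1.0 / math.sqrt(k)


def max_resolvable_distance(k, margin=1.0):
    """Largest d with exp(-d) >= margin / sqrt(k).

    `margin` is a hypothetical safety factor: how many standard deviations
    the signal should exceed the noise by.
    """
    return 0.5 * math.log(k) - math.log(margin)


if __name__ == "__main__":
    for k in (100, 10_000, 1_000_000):
        print(k, correlation_std_bound(k), max_resolvable_distance(k))
```

The resolvable distance grows only like half the logarithm of the sequence length, which is why short sequences limit how far apart two nodes can be before the plain estimator stops being informative.
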
Roadmap [Using block signatures]
5. Future Directions

Handling indels
• Problem with bitwise correlation when indels are introduced: which bit in ’s bitstring does bit of ’s bitstring correspond to?
• In the substitution-only case:
• With indels:
• What if it doesn’t appear at all?

Blockwise correlation
• To handle shifts due to insertions and deletions, split bitstrings into blocks.
• Assume everywhere (not too hard to generalize).
• If we split bitstrings into blocks of length, say, , most bits will stay within a block throughout the tree.
• Any bit shifts by positions in expectation, at most with high probability on one edge.

Blockwise correlation
• Define signature of block in bitstring, as sum of bits in block , divided by .
• Signatures are robust to shifts, so they behave like bits in the substitution-only case, i.e. behaves like bitwise correlation.
• Fixing any series of indels, is non-zero in expectation only if bits and correspond to each other.

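A short Python sketch of block signatures. It assumes +1/-1 bits and normalizes each block sum by the square root of the block length so that a random block has unit variance; the slide's actual normalizer is not recoverable here, so treat `block_len` and the square-root scaling as assumptions.

```python
import math


def block_signatures(bits, block_len):
    """Signature of each consecutive block: sum of its +1/-1 bits / sqrt(block_len).

    A trailing partial block is dropped for simplicity.
    """
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    n_blocks = len(bits) // block_len
    scale = math.sqrt(block_len)
    return [
        sum(bits[i * block_len:(i + 1) * block_len]) / scale
        for i in range(n_blocks)
    ]


if __name__ == "__main__":
    bits = [1, 1, 1, 1, -1, -1, 1, -1]
    shifted = bits[1:] + [1]          # simulate a one-position shift
    print(block_signatures(bits, 4))
    print(block_signatures(shifted, 4))
```

A one-position shift swaps at most one bit in and one bit out of each block, so each block sum changes by at most 2 and each signature by at most 2/sqrt(block_len). This is the sense in which signatures tolerate small shifts where individual bit positions do not.
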
Blockwise correlation
• Define
• Lemma:
  - Error term to account for the tiny fraction of bits that move in/out of blocks
  - Decay in number of corresponding bits between the blocks due to deletions
• Concentration is a bit trickier because are not independent; for the sake of time, we will ignore this for now…

Roadmap [Reconstructing bitstrings]
5. Future Directions

Challenges with
• With , bitwise correlation only concentrates at distance , so the previous method only reconstructs the first levels.

Challenges with
• Use estimator
• Correct in expectation! But variance of correlation of and is large.

Challenges with
• To get better concentration, use a median-of-means approach.
• Use the approach on the previous slide to estimate these distances (and thus ):
• Can show the median concentrates!

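For reference, a generic median-of-means estimator in Python. This is the textbook technique, not necessarily the exact grouping used in the talk: how the noisy distance estimates are partitioned there is not shown, so `samples` and `n_groups` are placeholders.

```python
import statistics


def median_of_means(samples, n_groups):
    """Split samples into n_groups equal chunks, average each, return the median.

    Leftover samples that do not fill a complete chunk are ignored.
    """
    if n_groups <= 0 or n_groups > len(samples):
        raise ValueError("n_groups must be between 1 and len(samples)")
    size = len(samples) // n_groups
    means = [
        sum(samples[g * size:(g + 1) * size]) / size
        for g in range(n_groups)
    ]
    return statistics.median(means)


if __name__ == "__main__":
    noisy = [1, 1, 1, 1, 1, 1, 100, 1, 1]   # one heavy outlier
    print(sum(noisy) / len(noisy))           # plain mean is dragged upward
    print(median_of_means(noisy, 3))         # the median of group means is not
```

The appeal of the design is that a few high-variance estimates can corrupt only a minority of groups, so the median stays close to the true value even when the plain mean does not.
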
Roadmap [Reconstructing signatures]
5. Future Directions

Reconstructing signatures
• Signatures are robust to shifts, so they behave similarly to bits in the substitution-only case.
• This suggests our algorithm: apply the reconstruction scheme to signatures.
• There are some technical challenges in the analysis that we need to overcome.

Challenges with reconstruction with indels
• In reconstructing signatures, bits appearing in blocks of ancestors but not children (or vice versa) may induce noise that is non-zero in expectation in the recursive estimator.
• We show that since the noise also “decays”, it is tiny in expectation, so misalignment does not ruin the reconstructed signatures.
[Figure labels: ancestor which we condition on; descendant; signal; noise; decayed signal; noise]

Challenges with reconstruction with indels
• and are not independent, which makes concentration analysis for the recursive estimator more challenging.
• We show that the covariance of the reconstructed blockwise correlations is small, i.e. and are almost completely independent.

Roadmap [Future directions]
5. Future Directions

Open questions/future directions
• Our reconstruction guarantees are optimal up to the constant in the exponent of . What is the right constant?
  - We only use bits of information per -bit sequence, so there should be room for improvement.
  - An algorithm using all the bits of information might also tell us more about the evolutionary history.
• Can we remove some of the strong assumptions in the model?
  - What if there isn’t sitewise independence of mutations?
  - What if the root bitstring isn’t chosen uniformly at random?

Thank you! Questions?
(Come chat at poster 99 tomorrow)