
Efficient Algorithms for Substring Near Neighbor Problem


Presentation Transcript


  1. Efficient Algorithms for Substring Near Neighbor Problem
  Alexandr Andoni, Piotr Indyk (MIT)

  2. What’s SNN?
  • SNN ≈ Text Indexing with mismatches
  • Text Indexing:
  • Construct a data structure on a text T[1..n], s.t.
  • Given a query P[1..m], it finds the occurrences of P in T
  • Text indexing with mismatches:
  • Given P, find the substrings of T that are equal to P except in ≤ R characters
  • Motivation: e.g., computational biology (BLAST)
  • Example: T = GAGTAACTCAATA, P = AGTA (with R = 1, both AGTA and AATA match)
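
To make the definition concrete, here is a minimal brute-force sketch of text indexing with mismatches (the function names are illustrative, not from the paper). It runs in O(n·m) time per query, which is precisely the cost the data structures below are designed to beat:

```python
def hamming(p, q):
    """Number of positions at which equal-length strings p and q differ."""
    return sum(a != b for a, b in zip(p, q))

def mismatch_occurrences(T, P, R):
    """All start positions i such that T[i:i+m] equals P except in <= R chars."""
    m = len(P)
    return [i for i in range(len(T) - m + 1) if hamming(T[i:i + m], P) <= R]

# The slide's example: with R = 1, both AGTA (exact) and AATA (one mismatch) match.
print(mismatch_occurrences("GAGTAACTCAATA", "AGTA", 1))   # -> [1, 9]
```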

  3. Outline
  • General approach
  • View: Near Neighbor in Hamming
  • Focus: reducing space
  • Background
  • Locality-Sensitive Hashing (LSH)
  • Solution
  • Reducing query & preprocessing
  • Redesign LSH
  • Concluding remarks

  4. Approach (Or, why SNN?)
  • SNN = a near neighbor problem in the Hamming metric with m dimensions:
  • Construct a data structure on D = {all substrings of T of length m}, s.t.
  • Given P, it finds a point in D that is at distance ≤ R from P
  • ⇒ Use a NN data structure for Hamming
  • Example: T = GAGTAACTCAATA, D = {GAGT, AGTA, GTAA, …, AATA}, P = AGTA
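
The reduction itself is a one-liner; a sketch (with an illustrative helper name):

```python
def substring_points(T, m):
    """The dataset D of the slide: every length-m substring of T, viewed as
    a point in m-dimensional Hamming space."""
    return [T[i:i + m] for i in range(len(T) - m + 1)]

D = substring_points("GAGTAACTCAATA", 4)   # ['GAGT', 'AGTA', 'GTAA', ..., 'AATA']
# Any Hamming near neighbor structure built on D now answers SNN queries
# for patterns of this one fixed length m.
```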

  5. Approximate NN
  • The exact NN problem seems hard (i.e., hard to solve without exponential space or O(n) query time)
  • Approximate NN is easier
  • Defined for an approximation factor c = 1+ε as:
  • OK to report a point at distance ≤ cR (when there is a point at distance ≤ R)
  • (Figure: a query q with balls of radius R and cR around it)

  6. Our contribution
  • Problem: NN needs m in advance
  • Would have to construct a data structure for each m ≤ M
  • Here: an approximate SNN data structure for unknown m
  • Without degradation in space or query time
  • Our algorithm for SNN is based on LSH:
  • Supports patterns of length m ≤ M
  • Optimal* space: n^{1+1/c}
  • Optimal* query time: n^{1/c}
  • Slightly worse preprocessing time if c > 3
  • (* Optimal w.r.t. LSH, modulo subpolynomial factors)
  • Also extends to l1

  7. Outline
  • General approach
  • View: Near Neighbor in Hamming
  • Focus: reducing space
  • Background
  • Locality-Sensitive Hashing (LSH)
  • Solution
  • Reducing query & preprocessing
  • Redesign LSH
  • Concluding remarks

  8. Locality-Sensitive Hashing
  • Based on a family of hash functions {g}
  • For points P[1..m], Q[1..m]:
  • If dist(P,Q) ≤ R, then Pr_g[g(P) = g(Q)] = “medium”
  • If dist(P,Q) > cR, then Pr_g[g(P) = g(Q)] = “low”
  • Idea:
  • Construct L hash tables with random g1, g2, …, gL
  • For a query P, look at the buckets g1(P), g2(P), …, gL(P)
  • Space: L·n
  • Query time: L
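
A generic sketch of the scheme the slide describes: build L hash tables, then probe only the L buckets of the query. All names are illustrative; `is_close` would check dist ≤ cR:

```python
from collections import defaultdict

def build_lsh(points, funcs):
    """One hash table per function g: bucket every point by g(point)."""
    tables = []
    for g in funcs:
        table = defaultdict(list)
        for p in points:
            table[g(p)].append(p)
        tables.append(table)
    return tables

def query_lsh(tables, funcs, q, is_close):
    """Scan only the buckets g1(q), ..., gL(q); return a close point if found."""
    for g, table in zip(funcs, tables):
        for p in table.get(g(q), []):
            if is_close(p, q):
                return p
    return None
```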

  9. LSH for Hamming
  • Hash function g:
  • Projection on k random coordinates
  • E.g.: g1(“AGTA”) = “AA” (k=2)
  • L = #hash tables = n^{1/c}
  • k = |log n / log(1 − cR/m)| < m · log n
  • Example (R=1): T = GAGTAACTCAATA, D = {GAGT, AGTA, GTAA, …, AATA}, P = AGTA
  • HT1: GT → GAGT; AA → AGTA, AATA; GA → GTAA; …
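
A sketch of the projection family and of the parameter choices as reconstructed from the slide (assuming cR < m so the logarithm is defined; the paper's exact constants may differ):

```python
import math
import random

def projection_family(m, k, L):
    """L hash functions, each projecting a length-m string onto k random
    coordinates (kept sorted), e.g. g1('AGTA') = 'AA' for k = 2."""
    funcs = []
    for _ in range(L):
        coords = sorted(random.sample(range(m), k))
        funcs.append(lambda s, c=coords: "".join(s[j] for j in c))
    return funcs

def lsh_params(n, m, R, c):
    """L = n^(1/c) tables and k = |log n / log(1 - cR/m)|, per the slide."""
    L = math.ceil(n ** (1.0 / c))
    k = math.ceil(abs(math.log(n) / math.log(1.0 - c * R / m)))
    return k, L
```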

  10. Outline
  • General approach
  • View: Near Neighbor in Hamming
  • Focus: reducing space
  • Background
  • Locality-Sensitive Hashing (LSH)
  • Solution
  • Reducing query & preprocessing
  • Redesign LSH
  • Concluding remarks

  11. Unknown m
  • Bad news:
  • k depends on m!
  • Distinct m ⇒ distinct hash tables
  • Example (R=1, m=3): T = GAGTAACTCAATA, D = {GAG, AGT, …, ACT, …}, P = AGT
  • g1(“AGT”) = “AT”
  • HT1: GG → GAG; AT → AGT, ACT, …; …

  12. Solution
  • Let’s just reuse the same data structure for all m
  • g(“AGTA”) = “AA”
  • On “AGT” we have to guess the last char:
  • g(“AGT?”) = “A?”
  • Like in [exact] text indexing…
  • Example (R=1): T = GAGTAACTCAATA, D = {GAGT, AGTA, …, ACTC, …}, P = AGT
  • HT1: GT → GAGT; AA → AGTA, AATA; GA → GTAA; AC → ACTC; …

  13. Tries*!
  • Replace HT1 with a trie on g1(suffixes)
  • Stop the search when it goes outside P (see the sketch below)
  • Same analysis!
  • Example (R=1): T = GAGTAACTCAATA, D = {GAGT, AGTA, …, ACTC, …}, P = AGT
  • (Figure: a trie whose branches lead to the buckets AGTA, AATA, ACTC, AACT; the search for P = AGT descends to the subtree containing AGTA.)
  * Tries have been used with LSH before in [MS02], but in a different context
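
A minimal sketch of the trie variant, assuming each g's coordinate list is sorted so the descent can stop at the first coordinate that falls outside P. One such trie would replace each hash table; names are illustrative, and the returned candidates still need verification against R:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.positions = []   # suffix start positions whose projection passes here

def build_trie(T, coords):
    """Trie on g(suffix) for every suffix of T, where g projects onto the
    increasing coordinate list `coords`."""
    root = TrieNode()
    for i in range(len(T)):
        node, suffix = root, T[i:]
        for j in coords:
            if j >= len(suffix):
                break                     # suffix ends before this coordinate
            node = node.children.setdefault(suffix[j], TrieNode())
            node.positions.append(i)
    return root

def query_trie(root, P, coords):
    """Descend along g(P); stop once a coordinate lies outside P, as in
    g('AGT?') = 'A?' -- every position below the stop point is a candidate."""
    node = root
    for j in coords:
        if j >= len(P):
            break                         # outside P: take the whole subtree
        if P[j] not in node.children:
            return []
        node = node.children[P[j]]
    return node.positions
```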

  14. Resulting performance
  • Space: n^{1+1/c} (using compressed tries, one trie takes n space)
  • Optimal!
  • Query time: n^{1/c} · m (m = length of P)
  • Not [yet] really optimal: originally, one could do dimension reduction
  • Can improve to n^{1/c} + m·n^{o(1)}
  • Preprocessing time: n^{1+1/c} · M (M = max m)
  • Not optimal (optimal = n^{1+1/c})
  • Can improve to n^{1+1/c} + M^{1/3} · n^{1+o(1)}
  • Optimal for c < 3

  15. Outline
  • General approach
  • View: Near Neighbor in Hamming
  • Focus: reducing space
  • Background
  • Locality-Sensitive Hashing (LSH)
  • Solution
  • Reducing query & preprocessing
  • Redesign LSH
  • Concluding remarks

  16. Better query & preprocessing
  • Redesign LSH to improve query and preprocessing:
  • Query: n^{1/c} · m ⇒ n^{1/c} + m·n^{o(1)}
  • Preprocessing: n^{1+1/c} · M ⇒ n^{1+1/c} + n^{1+o(1)} · M
  • Idea for the new LSH:
  • Use the same number of hash tables/tries (L = n^{1/c})
  • But use “less randomness” in choosing the hash functions g1, g2, …, gL
  • S.t. each gi looks random, but the g’s are not independent

  17. New LSH scheme
  • Old scheme:
  • Choose L hash functions gi
  • Each gi = projection on k random coordinates
  • New scheme:
  • Construct the L functions gi from a smaller number of “base” hash functions
  • A “base” hash function = projection on k/2 random coordinates
  • {gi, i = 1..L} = all pairs of “base” hash functions
  • Need only ~L^{1/2} “base” hash functions! (see the sketch below)
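
A sketch of the pairing trick, assuming (as the slide's example suggests) that a pair <u1, u2> simply concatenates the two half-projections; w is chosen so that (w choose 2) ≥ L:

```python
import itertools
import math
import random

def paired_family(m, k, L):
    """Roughly sqrt(2L) 'base' functions on k/2 coordinates each; the L
    functions g are (the first L of) all pairs <u_i, u_j>."""
    w = math.ceil((2 * L) ** 0.5) + 1        # guarantees w*(w-1)/2 >= L
    base = [sorted(random.sample(range(m), k // 2)) for _ in range(w)]
    funcs = []
    for u1, u2 in itertools.islice(itertools.combinations(base, 2), L):
        coords = u1 + u2                     # concatenated half-projections
        funcs.append(lambda s, c=coords: "".join(s[j] for j in c))
    return funcs
```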

  18. Example
  • k = 4, w = #base fns = 4, L = (w choose 2) = (4 choose 2) = 6
  • Base functions u1, u2, u3, u4 (each a projection on k/2 = 2 coordinates)
  • Combined functions: g1 = <u1, u2>, g2 = <u1, u3>, g3 = <u1, u4>, …

  19. Saving time
  • Can save time since there are fewer “base” hash functions
  • E.g.: computing fingerprints
  • Want to compute FP(gi(P)) for i = 1..L
  • FP(gi(P)) = (Σ_j P[j] · χ_j^i · 2^j) mod prime
  • Old way: would take L · m time for the L functions g
  • New way: takes L^{1/2} · m time for the L^{1/2} functions ui
  • Need only L time to combine the FP(u(P)) into the FP(g(P))
  • If g = <u1, u2>, then FP(g(P)) = (FP(u1(P)) + FP(u2(P))) mod prime
  • Total: L + L^{1/2} · m
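
A sketch of the fingerprint bookkeeping, reading χ_j^i as the indicator that coordinate j is selected by g_i and hashing characters via their code points; the modulus is an illustrative choice:

```python
PRIME = (1 << 61) - 1   # an arbitrary large prime for the sketch

def fp_base(P, coords):
    """FP(u(P)) = (sum over selected coordinates j of P[j] * 2^j) mod prime."""
    return sum(ord(P[j]) * pow(2, j, PRIME) for j in coords) % PRIME

def fp_pair(fp_u1, fp_u2):
    """If g = <u1, u2> (disjoint coordinates), combining is O(1):
    FP(g(P)) = (FP(u1(P)) + FP(u2(P))) mod prime."""
    return (fp_u1 + fp_u2) % PRIME

# Cost: sqrt(L) base fingerprints at m time each, then L O(1) combinations,
# for the slide's total of L + sqrt(L) * m.
```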

  20. Better query & preproc (2)
  • E.g., for the query:
  • Use fingerprints to leap faster in the trie
  • Yields time n^{1/c} + n^{1/(2c)} · m (since L = n^{1/c})
  • To get n^{1/c} + n^{o(1)} · m, generalize:
  • g = a tuple of t base functions
  • a base function = projection on k/t random coordinates
  • Other details are similar to the fingerprints
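
The same construction generalizes; a sketch with t-tuples, choosing w so that (w choose t) ≥ L:

```python
import itertools
import math
import random

def tuple_family(m, k, L, t):
    """Each g concatenates t 'base' projections on k/t coordinates each;
    w = ceil(t * L^(1/t)) base functions give (w choose t) >= L tuples."""
    w = math.ceil(t * L ** (1.0 / t))
    base = [sorted(random.sample(range(m), k // t)) for _ in range(w)]
    funcs = []
    for combo in itertools.islice(itertools.combinations(base, t), L):
        coords = [j for u in combo for j in u]
        funcs.append(lambda s, c=coords: "".join(s[j] for j in c))
    return funcs
```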

  21. Better preprocessing (3)
  • For preprocessing, can get n^{1+1/c} + n^{1+o(1)} · M
  • Can improve to n^{1+1/c} + n^{1+o(1)} · M^{1/3}
  • Can construct a trie in n · M^{1/3} time (instead of n · M)
  • Using FFT, etc.

  22. Outline
  • General approach
  • View: Near Neighbor problem in Hamming metric
  • Focus: reducing space
  • Background
  • Locality-Sensitive Hashing (LSH)
  • Solution = LSH + Tries
  • Reducing query & preprocessing
  • Redesign LSH
  • Concluding remarks

  23. Conclusions
  • Problem:
  • Substring Near Neighbor (a.k.a. text indexing with mismatches)
  • Approach:
  • View as NN in m-dimensional Hamming space
  • Use LSH
  • Challenge:
  • Variable-length patterns w/o degradation in performance
  • Solution:
  • Space/query optimal (w.r.t. LSH)
  • Preprocessing optimal (w.r.t. LSH) for c < 3

  24. Extensions
  • Extends to l1
  • Nontrivial, since it needs quite different LSH functions
  • Preprocessing is slightly worse: n^{1+1/c} + n^{1+o(1)} · M^{2/3}
  • Uses the “Less-than-matching” problem [Amir-Farach’95]

  25. Remarks
  • Other approaches?
  • Or, why LSH for SNN?
  • Because a better SNN ⇒ a better NN…
  • And LSH is the “best” known algorithm for high-dimensional NN (using reasonable space)

  26. Thanks!
