10 likes | 137 Views
Redundant Bit Vectors for Fast High-Dim Lookup. Database & CCSP Research Groups. Jonathan Goldstein, John Platt, Chris Burges. Problem Statement. Our Three Key Ideas. Partitioning the Queries. bin. Query. S 2. S 1. Use redundancy to combat high dimensionality.
E N D
Redundant Bit Vectors for Fast High-Dim Lookup Database & CCSP Research Groups Jonathan Goldstein, John Platt, Chris Burges Problem Statement Our Three Key Ideas Partitioning the Queries bin Query S2 S1 • Use redundancy to combat high dimensionality. • Use bit vector indices to keep the representation small. • Use redundancy by partitioning the queries, not the data. S3 dim 1 BV1 BV2 BV3 BV4 Query S4 S5 S1 S5 S4 S2 Data sphere S6 S3 dim 2 • Find all data spheres that contain the query point • Low, non-zero error is OK BV5 BV7 BV8 BV6 Construct BV6: 101101 Bit Vector Indices • Redundancy: point stored as “yes” result for multiple bins per dimension • When querying, a single bit vector index is picked per dimension; ANDed together. • The results are post-filtered using the actual data spheres. Applications Bit vector indices • Server for audio fingerprinting: look up matches for audio • Fingerprinting in general • Pub/sub filtering 1 0 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 0 0 1 Point ID AND Existing Technology Results for RARE • Space partitioning with no redundancy • R-trees, SS-trees, etc. • worse than linear scan for truly high dimensional data! • Advantages: • Small: 1 bit/point/query part • Fast: 1 CPU cycle operates on 32 points in parallel ! • 56x faster than linear scan • Introduces no false negatives • Post-filtering 1000x faster than full linear scan