Bandwidth-Efficient Continuous Query Processing over DHTs Yingwu Zhu
Background • Instantaneous Query • Continuous Query
Instantaneous Query (1) • Documents are indexed • Node responsible for keyword t stores the IDs of documents containing that term (i.e., inverted lists) • Retrieve "one-time" relevant docs • Latency is a top priority • Query Q = t1 Λ t2 Λ … • Fetch the lists of doc IDs stored under t1, t2, … • Intersect these lists • E.g.: Google search engine
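To make the fetch-and-intersect step concrete, here is a minimal Python sketch of resolving an instantaneous query against per-term inverted lists; the index contents are hypothetical stand-ins for what the responsible DHT nodes would return.

# Minimal sketch (not from the paper): resolving an instantaneous query
# by intersecting the inverted lists fetched for each query term.

# Hypothetical inverted lists, keyed by term, as the responsible
# DHT nodes would store them.
INVERTED_LISTS = {
    "cat": {1, 4, 7, 19, 20},
    "dog": {1, 5, 7, 26},
    "cow": {2, 4, 8, 18},
    "bat": {1, 8, 31},
}

def instantaneous_query(terms):
    """Fetch the inverted list of every query term and intersect them."""
    lists = [INVERTED_LISTS.get(t, set()) for t in terms]
    if not lists:
        return set()
    return set.intersection(*lists)

print(instantaneous_query(["cat", "dog"]))  # {1, 7}, matching the slide example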
Instantaneous Query (2) • Example: nodes A–D store the inverted lists cat: 1,4,7,19,20; dog: 1,5,7,26; cow: 2,4,8,18; bat: 1,8,31 • To answer "cat Λ dog", the lists for "cat" and "dog" are fetched and intersected • Send result: Docs 1, 7
Continuous Query (1) • Reverse the roles of documents and queries • Queries are indexed • Query Q = t1 Λ t2 Λ … is stored under one of its terms t1, t2, … • Question 1: How is the index term selected? (query indexing) • "Push" new relevant docs (incrementally) • Enabled by "long-lived" queries • E.g.: Google News Alert feature
Continuous Query (2) • Upon insertion of a new doc D = t1 Λ t2 • The document node contacts the nodes responsible for the inverted query lists of D's keywords t1 and t2 • Question 2: How are these nodes (query nodes, QN) located? (document announcement) • Resolve the query lists into the final list of queries satisfied by D • Question 3: What is the resolution strategy? (query resolution) • E.g., Term Dialogue, Bloom filters (Infocom'06) • Notify the owners of the satisfied queries
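The overall flow, indexing each query under one term and then announcing each new document to the nodes responsible for its terms, can be sketched as follows; the in-memory registry and function names are illustrative, not the paper's API, and the DHT is abstracted away as a dictionary.

# Illustrative sketch (hypothetical names): continuous query processing flow,
# with the DHT abstracted as a dict from term -> list of stored queries.

query_index = {}  # index term -> list of (query_id, set of all query terms)

def index_query(query_id, terms, pick_index_term):
    """Store the query under one of its terms (Question 1: query indexing)."""
    index_term = pick_index_term(terms)
    query_index.setdefault(index_term, []).append((query_id, set(terms)))

def announce_document(doc_terms):
    """Contact the query node of every doc term (Question 2) and resolve the
    queries stored there (Question 3), returning the satisfied query IDs."""
    doc_terms = set(doc_terms)
    satisfied = set()
    for term in doc_terms:
        for query_id, query_terms in query_index.get(term, []):
            if query_terms <= doc_terms:  # every query term appears in the doc
                satisfied.add(query_id)
    return satisfied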
Query Resolution: Term Dialogue • Example: a doc containing "cat", "dog", and "cow" is announced; the inverted query list for "cat" holds three queries whose remaining terms are Q1: dog, Q2: horse & dog, Q3: horse & cow (other nodes hold the inverted lists for "dog" and "cow") • 1. Document announcement reaches the node holding the list for "cat" • 2. That query node asks the doc node about "dog" & "cow" • 3. Reply: "11" (bit vector: both present) • 4. It then asks about "horse" • 5. Reply: "0" (bit vector: absent) • Only Q1 is satisfied: notify the owner of Q1
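A rough sketch of this dialogue is below. It is simplified to one probe per query rather than batching shared terms across queries, and it assumes the query node can ask the document node about any set of terms and get back one bit per term; the message format is invented for illustration.

# Hypothetical, simplified sketch of a Term Dialogue: the query node probes the
# document node for the terms it still needs and receives a bit vector back.

def term_dialogue(doc_terms, pending_queries):
    """doc_terms: set of terms in the announced document.
    pending_queries: dict query_id -> set of terms still to be checked.
    Returns the IDs of queries fully satisfied by the document."""
    doc_terms = set(doc_terms)
    satisfied = []
    for query_id, remaining in pending_queries.items():
        asked = sorted(remaining)
        # One dialogue round: ask about `asked`, receive one bit per term.
        bits = ["1" if t in doc_terms else "0" for t in asked]
        if all(b == "1" for b in bits):
            satisfied.append(query_id)
    return satisfied

# Matches the slide: doc = {cat, dog, cow}; queries indexed under "cat".
print(term_dialogue({"cat", "dog", "cow"},
                    {"Q1": {"dog"}, "Q2": {"horse", "dog"}, "Q3": {"horse", "cow"}}))
# -> ['Q1']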
Query Resolution: Bloom filters • Same setup: a doc containing "cat", "dog", and "cow"; the inverted query list for "cat" holds Q1: dog, Q2: horse & dog, Q3: horse & cow • 1. The doc announcement carries a Bloom filter of the doc's terms ("10110") • 2. The query node tests query terms against the filter locally; the surviving term "dog" is confirmed with a Term Dialogue round • 3. Reply: "1" (bit vector) • Q1 is satisfied: notify the owner of Q1
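As a rough illustration of how the announcement-side Bloom filter prunes the dialogue, here is a toy Bloom filter; the filter size, hash choice, and confirmation step are assumptions for the sketch, not the paper's parameters.

# Toy sketch: the document announcement carries a small Bloom filter of the
# doc's terms; the query node tests terms locally and only falls back to a
# Term Dialogue bit for terms that pass the filter (possible false positives).

import hashlib

M = 64  # filter size in bits (illustrative)

def _positions(term, k=3):
    digest = hashlib.sha1(term.encode()).digest()
    return [digest[i] % M for i in range(k)]

def make_bloom(terms):
    bits = 0
    for term in terms:
        for pos in _positions(term):
            bits |= 1 << pos
    return bits

def maybe_contains(bloom, term):
    return all(bloom & (1 << pos) for pos in _positions(term))

doc_terms = {"cat", "dog", "cow"}
bloom = make_bloom(doc_terms)

for term in ["dog", "horse"]:
    if not maybe_contains(bloom, term):
        print(term, "-> definitely absent, queries needing it are pruned locally")
    else:
        # Possible false positive: confirm with one Term Dialogue bit.
        print(term, "-> passes the filter, confirm via Term Dialogue:",
              "1" if term in doc_terms else "0")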
Motivation • Latency is not the primary concern here, but bandwidth is a key design issue • Different query indexing schemes incur different costs • Different query resolution strategies incur different costs • Goal: design a bandwidth-efficient continuous query system with "proper" query indexing (Question #1), document announcement (Question #2), and query resolution (Question #3) approaches
Contributions • Novel query indexing schemes (Question #1) • Focus of this talk! • Multicast-based document announcement (Question #2) • In the paper • Adaptive query resolution (Question #3) • Makes intelligent decisions in resolving query terms • Minimizes the bandwidth cost • In the full tech report
Design • Focus on simple keyword queries, e.g., Q = t1 Λ t2 Λ … Λ tn • Leverage DHTs • Location & storage of documents and continuous queries • Query indexing • How to choose index terms for queries? • Doc announcement, query resolution • Not covered in this talk!
Current Indexing Schemes • Random Indexing (RI) • Optimal Indexing (OI)
Random Indexing (RI) • Randomly chooses a term as the index term • Q = t1 Λ … Λ tm • Index term ti is selected uniformly at random • Q is indexed at the DHT node responsible for ti • Pros: simple • Cons: • Popular terms are more likely to become index terms • Load imbalance • Introduces many irrelevant queries into query resolution, wasting bandwidth
Optimal Indexing (OI) • Q = t1 Λ … Λ tm • Index term ti is chosen deterministically: the most selective term, i.e., the one with the lowest frequency • Q is indexed at the DHT node responsible for ti • Pros: • Maximizes load balance & minimizes bandwidth cost • Cons: • Assumes perfect knowledge of term statistics • Impractical, e.g., due to the large number of documents, node churn, continuous doc updates, …
Solution 1: MHI • Minimum Hash Indexing • Order query terms by their hashes • Select the term with the minimum hash as the index term • Q = t1 Λ … Λ tm • Index term ti is deterministically chosen s.t. h(ti) < h(tx) for all x ≠ i • Q is indexed at the DHT node responsible for ti
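A minimal sketch of the three index-term choices discussed here (random, optimal given term frequencies, and minimum-hash), using SHA-1 as a stand-in for the DHT's hash function; the frequency table passed to the OI rule is assumed to exist only for illustration.

# Sketch of the index-term selection rules; h() uses SHA-1 as a stand-in
# for whatever hash function the underlying DHT uses.

import hashlib
import random

def h(term):
    return int.from_bytes(hashlib.sha1(term.encode()).digest(), "big")

def ri_index_term(terms):
    """Random Indexing: any term, chosen uniformly at random."""
    return random.choice(list(terms))

def oi_index_term(terms, term_freq):
    """Optimal Indexing: the most selective (least frequent) term;
    assumes (unrealistically) perfect term statistics."""
    return min(terms, key=lambda t: term_freq[t])

def mhi_index_term(terms):
    """Minimum Hash Indexing: the term with the smallest hash value."""
    return min(terms, key=h)

query = ["cat", "dog", "horse"]
print(mhi_index_term(query))  # deterministic, and needs no term statistics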
RI vs. MHI • Example: D = {t2, t4, t5, t6}, drawn from terms t1 … t7 where h(ti) < h(tj) for i < j • 3 queries, all irrelevant to D: • Q1 = t1 Λ t2 Λ t4 • Q2 = t3 Λ t4 Λ t5 • Q3 = t3 Λ t5 Λ t6 • (1) RI: each query has two of its three terms in D, so each of Q1, Q2, and Q3 is considered in query resolution with probability 67% (terms t1, t2, t3, t4, t5, and t6 would then need to be resolved) • (2) MHI: all of them are filtered out! Bandwidth savings! • How?
MHI: filtering irrelevant queries! • Under MHI, Q1 = t1 Λ t2 Λ t4 is indexed under t1, while Q2 = t3 Λ t4 Λ t5 and Q3 = t3 Λ t5 Λ t6 are indexed under t3 (the minimum-hash term of each query) • D = {t2, t4, t5, t6} is announced only to the nodes responsible for t2, t4, t5, and t6; their query lists are empty, so no action is taken • Q1, Q2, and Q3 are disregarded in query resolution, saving bandwidth!
MHI • Pros: • Simple and deterministic • Does not require term stats • Saves bandwidth over RI (up to 39.3% for various query types) • Cons: • A popular term can still become the index term whenever it has the minimum hash among a query's terms! • Load imbalance & irrelevant queries to process
Solution 2: SAP-MHI • MHI is good but may still index queries under popular terms • SAmPling-based MHI (SAP-MHI) • Sampling (a synopsis of the K most popular terms) + MHI • Avoid indexing queries under these K popular terms • Challenge: term frequencies are duplicate-sensitive aggregates; synopses may be gossiped over multiple DHT overlay links, so naive counting would overestimate term frequencies! • Borrow the idea of robust duplicate-sensitive aggregation from sensor networks
SAP-MHI • Duplicate-sensitive aggregation • Goal: a synopsis of the K most popular terms • Based on a coin-tossing experiment CT(y) • Toss a fair coin until either the first head occurs or y coin tosses end up with no head, and return the number of tosses • Each node a: • Produces a local synopsis Sa containing K popular terms (the terms with the highest values of CT(y)) • Gossips Sa to its neighbor nodes • Upon receiving a synopsis Sb from a neighbor b, aggregates Sa and Sb into a new synopsis Sa (max() operations) • Thus, each node holds a synopsis of the K popular terms after a sufficient number of gossip rounds • Intuition: a term that appears in more documents runs the CT(y) experiment more times, so its maximum CT(y) value tends to be larger than those of rare terms
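A small sketch of the coin-tossing experiment and the max-based merge that keeps the synopsis from being inflated by duplicate gossip messages; the per-occurrence trial and the top-K cut-off are modeled in a simplified way and are not the paper's exact construction.

# Simplified sketch of the SAP-MHI synopsis machinery: CT(y) coin tossing per
# term occurrence, plus a max-merge so that receiving the same synopsis twice
# over different overlay links does not inflate any term's value.

import random

def ct(y):
    """Toss a fair coin up to y times; return the number of tosses until the
    first head, or y if no head occurs."""
    for tosses in range(1, y + 1):
        if random.random() < 0.5:  # head
            return tosses
    return y

def local_synopsis(term_occurrences, k, y=32):
    """Run one CT(y) trial per term occurrence, keep each term's maximum value,
    then retain the K terms with the highest values."""
    best = {}
    for term in term_occurrences:
        best[term] = max(best.get(term, 0), ct(y))
    top_k = sorted(best, key=best.get, reverse=True)[:k]
    return {t: best[t] for t in top_k}

def merge(sa, sb, k):
    """Duplicate-tolerant aggregation: take the max per term, then re-cut to K."""
    merged = dict(sa)
    for term, value in sb.items():
        merged[term] = max(merged.get(term, 0), value)
    top_k = sorted(merged, key=merged.get, reverse=True)[:k]
    return {t: merged[t] for t in top_k}

sa = local_synopsis(["the", "the", "the", "dht", "cat"], k=2)
print(merge(sa, sa, k=2) == sa)  # True: merging a duplicate copy changes nothing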
SAP-MHI: Indexing Example • Query Q = t1 Λ t2 Λ t3 Λ t4 Λ t5, where h(t1) < h(t2) < h(t3) < h(t4) < h(t5) • Synopsis S = {t1, t2} (the popular terms) • Q is indexed at the node responsible for t3, the minimum-hash term not in S, instead of t1
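Building on the MHI sketch above, SAP-MHI's selection rule is a small change: skip any term that appears in the popular-term synopsis. The fallback when every term of a query is popular is an assumption of this sketch, since the slides do not spell out that corner case.

# Sketch of SAP-MHI index-term selection: minimum hash among the terms that
# are NOT in the popular-term synopsis. The fallback when all terms are
# popular is an assumption, not taken from the paper.

import hashlib

def h(term):
    return int.from_bytes(hashlib.sha1(term.encode()).digest(), "big")

def sap_mhi_index_term(terms, synopsis):
    candidates = [t for t in terms if t not in synopsis]
    if not candidates:            # assumed corner case: every term is popular
        candidates = list(terms)
    return min(candidates, key=h)

# With synopsis {t1, t2}, the query from the slide is indexed under t3
# (the minimum-hash non-popular term) rather than t1.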
SAP-MHI vs. MHI • SAP-MHI improves load balance over MHI with increasing synopsis size K, for Skew queries.
SAP-MHI vs. MHI • Bloom filters are used in query resolution.
SAP-MHI vs. MHI • Term Dialogue is used in query resolution.
SAP-MHI vs. MHI • This shows why SAP-MHI saves bandwidth over MHI!
Summary • Focus on a simple keyword query model • Bandwidth is a top priority • Query indexing impacts bandwidth cost • Goal: sift out as many irrelevant queries as possible! • MHI and SAP-MHI • SAP-MHI is the more viable solution • Load is better balanced, and more bandwidth is saved! • Sampling cost is controlled • The number of popular terms is relatively low • Membership of the popular-term set does not change rapidly • Document announcement & adaptive query resolution further cut down bandwidth consumption (not covered in this talk)