Emiran Curtmola @ UC San Diego, Alin Deutsch @ UC San Diego, K.K. Ramakrishnan @ AT&T, Divesh Srivastava @ AT&T
Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Motivation
[Figure labels: DATA, ONLINE COMMUNITIES]
• Typical such applications are centralized
  • Hosted online communities
  • Search engines
• Limitations
  • Disintermediation of publishers from queriers
  • Publishers need to give up their data
  • Central site controls visibility of publishers to queriers
  • Publishers lose their right to privacy
SIGMOD, June 2010

New Requirements for Publishing in Online Communities
• Free data exchange within the community
• Some users want to remain autonomous
• User privacy (i.e., not all users may want to reveal their true identity)
  • Publishers express their opinions anonymously to avoid association with sensitive or controversial issues (e.g., politics, race, religion)
• User autonomy + privacy suggest a decentralized infrastructure

Privacy Guarantee: Publisher k-anonymity
• Make it safer for publishers to join and post data
• Prevent association of sensitive topics with the publishers that contribute to them, even if some nodes are compromised
Publisher k-anonymity: for every publisher p and data item d, hide p in a k-protected crowd of publishers: there are at least k-1 other potential publishers of the same d

The Virtual Newspaper Community (Design Requirements)
• Allow publishers to keep complete control over their data
• Disseminate queries in the network, not data
• Publishers answer queries at their own discretion
• Published data is not traceable back to publishers, even if some nodes are compromised
[Figure: the community data collection, made up of publishers P1 to P8, each holding local XML data]
Query Q1: find the articles mentioning the Olympics in Beijing
Query Q2: find the articles about Tibet
Query Q3: find the articles mentioning poverty
Query Q4: find the articles mentioning money in Hong Kong
How to query ad-hoc distributed data sources while preserving user privacy?

Challenges in Querying Distributed Sources
• Set up an infrastructure that handles
  • Distribution of data
  • A large number of decentralized publishers and consumers
  • User privacy
  • Efficient query routing (to avoid flooding the network)

A Query Dissemination Tree (QDT)
• Build an overlay network to act as a distributed index
• Peers are organized into logical query dissemination trees (QDTs)
• Use QDTs to disseminate queries using node summaries
[Figure: a QDT whose numbered nodes are routers and whose leaves P1 to P8 are publishers]
• P1’s advertised set of terms: Beijing, Tibet, stocks, poverty, money
• P2’s advertised set of terms: Beijing, yak tea, Hong Kong, poverty
• Node 3’s summary (set of terms), the union of its subtrees’ summaries: Beijing, Tibet, stocks, poverty, money, yak tea, Hong Kong
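A minimal sketch of how a node summary can be computed as the union of its subtrees' summaries. The `Node` class and the assumption that P1 and P2 are the only publishers below node 3 are illustrative, not taken from the talk; the term sets are the ones on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical QDT node: a router (has children) or a publisher (has terms)."""
    name: str
    advertised: set = field(default_factory=set)   # non-empty only for publishers
    children: list = field(default_factory=list)   # non-empty only for routers

    def summary(self) -> set:
        """Own advertised terms plus the union of all subtrees' summaries."""
        terms = set(self.advertised)
        for child in self.children:
            terms |= child.summary()
        return terms


p1 = Node("P1", {"Beijing", "Tibet", "stocks", "poverty", "money"})
p2 = Node("P2", {"Beijing", "yak tea", "Hong Kong", "poverty"})
# Assumption: node 3 has exactly P1 and P2 below it.
router3 = Node("3", children=[p1, p2])

print(sorted(router3.summary()))
```

With these inputs the union reproduces the seven terms listed for node 3.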

Query Routing in a QDT
• Q3 = “poverty”
• At each node, check set inclusion: is the query contained in the node’s summary?
• Bloom filter pruning: subtrees whose summary does not match are not visited
• Only P1 and P2 publish articles about poverty
[Figure: Q3 travels from the root down to P1 and P2 only; the other subtrees are pruned]
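The pruning step can be sketched as a recursive descent that forwards the query only into subtrees whose summary contains every query term. This is an illustration under stated assumptions: the three-router topology and P3's term set are hypothetical, and exact Python sets stand in for the Bloom filters used in the actual system.

```python
# Hypothetical topology: router -> children.
tree = {
    "1": ["2", "3"],
    "2": ["P3"],
    "3": ["P1", "P2"],
}
# Advertised terms per publisher (P3's terms are made up for the example).
advertised = {
    "P1": {"Beijing", "Tibet", "stocks", "poverty", "money"},
    "P2": {"Beijing", "yak tea", "Hong Kong", "poverty"},
    "P3": {"Olympics"},
}


def summary(node):
    """Publisher: its advertised terms. Router: union of its children's summaries."""
    if node in advertised:
        return advertised[node]
    return set().union(*(summary(child) for child in tree[node]))


def route(node, query):
    """Return the publishers the query reaches, pruning non-matching subtrees."""
    if not query <= summary(node):      # set inclusion test fails: prune
        return []
    if node in advertised:              # reached a publisher
        return [node]
    reached = []
    for child in tree[node]:
        reached += route(child, query)
    return reached


print(route("1", {"poverty"}))  # ['P1', 'P2']
```

A real deployment would cache each summary at its node instead of recomputing it per query.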

Privacy Preservation in QDTs
• Q3 = “poverty”
• Minimum information at each node
  • No node has global information
  • Node summaries are vectors of counters (Bloom filters) representing hash values of advertised data items
• Queries reach publishers in such a manner that users cannot tell whether a publisher chose not to respond or has no matching documents
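A sketch of a summary kept as a vector of counters, assuming a standard counting Bloom filter. The class name, the filter size, the number of hash functions, and the use of SHA-256 are illustrative choices, not details from the talk.

```python
import hashlib


class CountingBloomFilter:
    """Hypothetical node summary: counters indexed by hashes of advertised terms."""

    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, term):
        # Derive num_hashes positions by salting the term with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, term):
        for pos in self._positions(term):
            self.counters[pos] += 1

    def remove(self, term):
        # Only meaningful for a term that was added earlier.
        for pos in self._positions(term):
            self.counters[pos] -= 1

    def might_contain(self, term):
        # False positives are possible; false negatives are not.
        return all(self.counters[pos] > 0 for pos in self._positions(term))

    def merge(self, other):
        """Fold a child's summary into this one by adding counters."""
        self.counters = [a + b for a, b in zip(self.counters, other.counters)]


edge = CountingBloomFilter()
edge.add("poverty")
print(edge.might_contain("poverty"))  # True
```

Counters, unlike plain bits, let a node withdraw a term without rebuilding the whole summary, and the summary stores only hash positions rather than the terms themselves.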

Publisher Privacy if Nodes Are Compromised
• If an edge node is compromised
  • Risk: individual updates of node summaries (from publishers to edge routers) may expose the publishers
  • Solution: publisher k-anonymity. Hide users in protected crowds of at least k publishers and...
[Figure: the QDT with P1, P2 and P3 grouped into a protected crowd]

Secure Multi-Party (SMP)
• Solution: publisher k-anonymity. Hide users in protected crowds of at least k publishers and use secure multi-party (SMP) computation inside crowds to advertise updates of published terms to the edge routers
[Figure: P1, P2 and P3 form a publisher 3-anonymous protected crowd; messages inside the crowd are labeled +R, +Upd1, +Upd2, +Upd3 and -R; edge router 4 receives Upd1 + Upd2 + Upd3]
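One common way to realize such an aggregate is a masked secure sum, sketched below. This is an assumption about the message flow, not the protocol from the paper: the first publisher hides its update under a random mask R, each following publisher adds its own update to the masked running total, and the mask is removed before the aggregate reaches the edge router. The function name and the modulus are hypothetical.

```python
import secrets

MOD = 2**32  # assumed counter modulus, so masked values look uniformly random


def secure_sum(updates):
    """Aggregate one counter vector per publisher without exposing any single one.

    updates: list of equal-length lists of non-negative integers.
    Returns the element-wise sum, as the edge router would see it.
    """
    length = len(updates[0])
    # Random mask R, known only to the initiating publisher.
    mask = [secrets.randbelow(MOD) for _ in range(length)]
    # The initiator sends its update hidden under the mask (+R +Upd1).
    running = [(m + u) % MOD for m, u in zip(mask, updates[0])]
    # Every other publisher only ever sees a masked partial sum.
    for update in updates[1:]:
        running = [(r + u) % MOD for r, u in zip(running, update)]
    # The mask is subtracted (-R) before the total goes to the edge router.
    return [(r - m) % MOD for r, m in zip(running, mask)]


print(secure_sum([[1, 0, 2], [0, 1, 1], [3, 0, 0]]))  # [4, 1, 3]
```

The router learns only the total, so an individual update cannot be attributed to a specific member of the crowd.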

Publisher Privacy if Nodes Are Compromised
• If an internal node is compromised
  • Risk: the node summary of advertised terms is exposed
  → The downstream subtree may contain sensitive content, but the crowd of publishers is now even bigger
[Figure: the QDT with a larger protected crowd below the compromised internal node]

Design Trade-off: Privacy Preservation vs. Performance
The tree topology introduces congestion at upper QDT levels during query dissemination
How to relieve the congestion?

Techniques for Load Balancing
• Overlaying multiple logical QDTs over the same underlay network
• A physical node belongs to multiple logical QDTs but at different levels
• Goal: organize the nodes into QDTs such that the distribution of tree levels for a node is uniform across the QDTs

Overlaying Multiple QDTs: 4-QDTs
[Figure: four trees, QDT1 to QDT4, overlaid on the same nodes; node 1 is marked in each tree]

Query Routing for Multiple QDTs
• Partition the community data collection into disjoint blocks
• Build one QDT per block B
  • QDT_i groups all publishers with terms in B_i
• Routing a query
  • Terms in the query determine the relevant blocks
  • Send the query to the corresponding QDT
  • Check the full query with publishers
Q3 = “poverty”: Q3 falls in B4, so use QDT4
[Figure: QDT1 to QDT4; Q3 is routed on QDT4 to the publishers holding “poverty”]
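A sketch of the block lookup, assuming the partition is available as a term-to-block table. Only "poverty" in B4 is stated on the slide; every other assignment in `BLOCK_OF_TERM`, and the function name, are hypothetical.

```python
# Hypothetical partition of advertised terms into disjoint blocks B1..B4.
BLOCK_OF_TERM = {
    "Olympics": 1,   # assumed
    "Beijing": 1,    # assumed
    "Hong Kong": 3,  # assumed
    "money": 4,      # assumed
    "poverty": 4,    # from the slide: Q3 falls in B4
}


def relevant_qdts(query_terms):
    """Return the sorted indices of the QDTs whose blocks contain a query term."""
    return sorted({BLOCK_OF_TERM[t] for t in query_terms if t in BLOCK_OF_TERM})


print(relevant_qdts({"poverty"}))             # [4]
print(relevant_qdts({"Hong Kong", "money"}))  # [3, 4]
```

A query whose terms span several blocks gets several candidate trees; whichever tree it is routed on, the publishers still evaluate the full query.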

Relieving the Congestion
• Q1 = “Olympics”, “Beijing”
• Q3 = “poverty”
[Figure: the four trees QDT1 to QDT4, with Q1 and Q3 disseminated on different trees]

Queries Spanning Multiple Blocks
• Q4 = “Hong Kong”, “money”
• Route Q4 on both trees?
• Query selectivity optimization techniques: choose the most selective QDT to route on by maintaining only 1-3% of popular data items (see paper)
[Figure: QDT3 and QDT4, the two candidate trees for Q4]

The Design Space: How Many QDTs Balance the Load?
• Build one QDT for all advertised terms
  • Con: traffic congestion in upper levels
  • Pro: most aggressive pruning based on conjunctions
• Build one QDT per advertised term
  • Con: tree maintenance (as many QDTs as terms)
  • Con: single-term queries are less selective (more traffic)
  • Pro: congestion-free
• The “sweet spot” is expected to lie between the above extremes (our solution)
[Figure labels: ideal load, our solution]

Finding the Sweet Spot
• Empirical fact: the upper two levels in a QDT are the most congested
• Model: cyclical permutation of nodes on the tree levels
  • Nr. of QDTs for load balance = nr. of legal permutations (i.e., without breaking the fairness property)
  • Fairness property: all routers appear precisely once in the top two levels of any QDT
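One possible reading of the cyclical-permutation model, sketched under loud assumptions: routers are placed in breadth-first order, the top two levels hold the root plus its `fanout` children, and each new QDT rotates the router list by that many positions. The function name and the rotation rule are illustrative and not the paper's analytical model.

```python
def rotated_orders(routers, fanout):
    """Return one breadth-first router ordering per QDT.

    The first (1 + fanout) routers of each ordering fill the top two levels.
    Rotations stop before any router would appear in the top two levels twice.
    """
    top = 1 + fanout                     # root plus its children
    num_qdts = len(routers) // top       # rotations that keep the fairness property
    orders = []
    for i in range(num_qdts):
        shift = i * top
        orders.append(routers[shift:] + routers[:shift])
    return orders


# Made-up example: 12 routers, fanout 2.
for order in rotated_orders(list(range(1, 13)), fanout=2):
    print("top two levels:", order[:3])
```

With 12 routers and fanout 2 this yields four orderings, and each router sits in the top two levels of exactly one of them.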

Measuring Throughput
• Overall throughput depends heavily on the most congested node
• Look at node stress in terms of nr. of messages
  • going into a node: Processing Load at a node (PLoad)
  • going out of a node: Forwarding Load at a node (FLoad)
• Throughput indicator: compare how far apart the following are
  • ideal load (avg. load for 1-QDT = nr. msgs / nr. nodes)
  • peak load (k-QDTs)
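The indicator can be sketched as a relative gap between peak and ideal load. The function names and the sample numbers are made up for illustration; expressing the gap as a percentage of the ideal load is an assumption about how the distance is reported.

```python
def ideal_load(num_msgs, num_nodes):
    """Average load per node for a single QDT: nr. msgs / nr. nodes."""
    return num_msgs / num_nodes


def distance_from_ideal(per_node_load, ideal):
    """Gap between the most congested node and the ideal load, in percent."""
    peak = max(per_node_load.values())
    return 100 * (peak - ideal) / ideal


# Made-up numbers: 1000 messages over 10 nodes, busiest node handles 132.
ideal = ideal_load(1000, 10)
print(distance_from_ideal({"n1": 132, "n2": 90, "n3": 101}, ideal))  # 32.0
```

The same function applies to PLoad and FLoad; only the per-node message counts differ.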

Maximize Throughput: When is Peak Closest to Ideal Load?
• Experiment 1: PLoad for Scribe QDT topology
  • Result: the nr. of QDTs for load balance found experimentally coincides with that given by our analytical model
  • Load balance with
    • How close: peak within 32% of ideal PLoad
    • How close: peak within 923% of ideal FLoad
To balance FLoad, node fanouts need to be the same
• Experiment 2: FLoad for fanout-balanced QDT topologies
  • How close: peak within 18% of ideal PLoad
  • How close: peak within 130% of ideal FLoad

Take-Away
• We propose a novel publishing infrastructure
  • Empowers publishers to join and post without being associated with (sensitive) content
• Generic solution: it extracts the maximum load balance supported by the QDT topology

Thank you!