Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Emiran Curtmola @ UC San Diego, Alin Deutsch @ UC San Diego, K.K. Ramakrishnan @ AT&T, Divesh Srivastava @ AT&T


Presentation Transcript


  1. Load-Balanced Query Dissemination in Privacy-Aware Online Communities
     Emiran Curtmola @ UC San Diego, Alin Deutsch @ UC San Diego, K.K. Ramakrishnan @ AT&T, Divesh Srivastava @ AT&T

  2. Motivation
     [Figure labels: DATA, ONLINE COMMUNITIES]
     • Such applications are typically centralized
       • Hosted online communities
       • Search engines
     • Limitations
       • Disintermediation of publishers from queriers
       • Publishers need to give up their data
       • Central site controls visibility of publishers to queriers
       • Publishers lose their right to privacy
     SIGMOD, June 2010

  3. New Requirement for Publishing in Online Communities
     • Free data exchange within the community
     • Some users want to remain autonomous
     • User privacy (i.e., not all users may want to reveal their true identity)
     • Publishers express their opinions anonymously to avoid association with sensitive or controversial issues (e.g., politics, race, religion)
     • User autonomy + privacy suggest a decentralized infrastructure

  4. Privacy Guarantee: Publisher k-anonymity
     • Make it safer for publishers to join and post data
     • Prevent association of sensitive topics with the publishers that contribute to them, even if nodes are compromised
     Publisher k-anonymity: for every publisher p and data item d, hide p in a k-protected crowd of publishers, i.e., there are at least k-1 other potential publishers of the same d
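
The definition on this slide can be checked mechanically. The sketch below is illustrative only: the function name `is_k_anonymous` and the `crowd` mapping (data item to the set of publishers that could have posted it) are assumptions made for the example, not part of the presentation.

```python
def is_k_anonymous(potential_publishers, k):
    """True if every data item has at least k potential publishers."""
    return all(len(pubs) >= k for pubs in potential_publishers.values())

# Hypothetical crowd data: item -> publishers that could have posted it.
crowd = {
    "poverty": {"P1", "P2", "P3"},
    "Tibet": {"P1", "P2", "P3"},
}

ok_for_3 = is_k_anonymous(crowd, 3)  # each item has 3 candidates
ok_for_4 = is_k_anonymous(crowd, 4)  # no item has 4 candidates
```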

  5. The Virtual Newspaper Community (Design Requirements)
     • Allow publishers to keep complete control over their data
     • Disseminate queries in the network, not data
     • Publishers answer queries at their own discretion
     • Published data is not traceable back to publishers, even if nodes are compromised
     [Figure: the community data collection; publishers P1 to P8, each holding local XML data]
     • Query Q1: find the articles mentioning the Olympics in Beijing
     • Query Q2: find the articles about Tibet
     • Query Q3: find the articles mentioning poverty
     • Query Q4: find the articles that give the money in Hong Kong
     How to query ad-hoc distributed data sources while preserving user privacy?

  6. Challenges in Querying Distributed Sources
     • Infrastructure setup that addresses
       • Distribution of data
       • Large number of decentralized publishers and consumers
       • User privacy
       • Efficient query routing (to avoid flooding the network)

  7. A Query Dissemination Tree (QDT)
     • Build an overlay network to act as a distributed index
     • Peers are organized into logical query dissemination trees (QDTs)
     • Use QDTs to disseminate queries using node summaries
     [Figure: example QDT with numbered routers and publishers P1 to P8; a node's summary is the union of its subtrees' summaries]
     • Node 3's summary (set of terms): Beijing, Tibet, stocks, poverty, money, yak tea, Hong Kong
     • P1's advertised set of terms: Beijing, Tibet, stocks, poverty, money
     • P2's advertised set of terms: Beijing, yak tea, Hong Kong, poverty
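
The rule that a node's summary is the union of its subtrees' summaries can be sketched as follows. This is a minimal illustration using plain sets; the `Node` class and the `summary` function are hypothetical names, and placing P1 and P2 directly under node 3 is an assumption about the figure's topology.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    terms: set = field(default_factory=set)       # advertised terms (publishers only)
    children: list = field(default_factory=list)  # child routers or publishers

def summary(node):
    """Union of the node's own terms and the summaries of its subtrees."""
    result = set(node.terms)
    for child in node.children:
        result |= summary(child)
    return result

p1 = Node("P1", {"Beijing", "Tibet", "stocks", "poverty", "money"})
p2 = Node("P2", {"Beijing", "yak tea", "Hong Kong", "poverty"})
router3 = Node("3", children=[p1, p2])  # assumed placement of P1 and P2

node3_summary = summary(router3)
```

Under that assumption, `node3_summary` contains the seven terms listed for node 3 on the slide.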

  8. Query Routing in a QDT
     • Q3 = “poverty”
     • At each node, check set inclusion of the query in the node's summary
     • Bloom filter pruning
     • Only P1 and P2 publish articles about poverty
     [Figure: Q3 forwarded from the root of the QDT down to publishers P1 and P2; other subtrees are pruned]
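
The pruning step can be sketched with exact sets standing in for Bloom filters. Everything named here is an assumption made for illustration: the `route` function, the three-router topology (`r1`, `r2`, `r3`), and P3's advertised term are hypothetical and are not taken from the slide's figure.

```python
def route(children, summaries, node, query, reached):
    """Forward `query` below `node` only where the summary includes every query term."""
    if not query <= summaries[node]:
        return reached              # prune this subtree
    if node not in children:        # no children: the node is a publisher
        reached.append(node)
        return reached
    for child in children[node]:
        route(children, summaries, child, query, reached)
    return reached

# Hypothetical topology: root r1 with routers r2 and r3 below it.
children = {"r1": ["r2", "r3"], "r2": ["P1", "P2"], "r3": ["P3"]}
summaries = {
    "P1": {"Beijing", "Tibet", "stocks", "poverty", "money"},
    "P2": {"Beijing", "yak tea", "Hong Kong", "poverty"},
    "P3": {"Olympics"},             # hypothetical terms for P3
}
summaries["r2"] = summaries["P1"] | summaries["P2"]
summaries["r3"] = set(summaries["P3"])
summaries["r1"] = summaries["r2"] | summaries["r3"]

reached = route(children, summaries, "r1", {"poverty"}, [])
```

With real Bloom filters the inclusion test may return false positives, so a query can occasionally be forwarded into a subtree that has no match; it is never wrongly pruned.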

  9. Privacy Preservation in QDTs
     • Minimum information at each node
     • No node has global information
     • Node summaries are vectors of counters (Bloom filters) representing hash values of advertised data items
     • Queries reach publishers in such a way that users cannot tell whether a publisher chose not to respond or has no matching documents
     [Figure: Q3 = “poverty” routed through the QDT to the publishers holding matching articles]
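
A summary kept as a vector of counters behaves like a counting Bloom filter. The sketch below is one possible realization, not the authors' implementation: the class name `CountingBloom`, the filter size, the number of hash functions, and the use of SHA-256 are all assumptions.

```python
import hashlib

class CountingBloom:
    """Vector of counters indexed by hash values of advertised terms."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.counters = [0] * size

    def _positions(self, term):
        # Derive `hashes` counter positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, term):
        for pos in self._positions(term):
            self.counters[pos] += 1

    def remove(self, term):
        for pos in self._positions(term):
            self.counters[pos] -= 1

    def might_contain(self, term):
        # True can be a false positive; False is always correct.
        return all(self.counters[pos] > 0 for pos in self._positions(term))

node_summary = CountingBloom()
node_summary.add("poverty")      # advertised by one publisher
node_summary.add("poverty")      # advertised by a second publisher
node_summary.remove("poverty")   # one publisher withdraws the term
still_present = node_summary.might_contain("poverty")
node_summary.remove("poverty")   # the second publisher withdraws it too
now_absent = not node_summary.might_contain("poverty")
```

Counters rather than bits let a term be withdrawn by one publisher without erasing it for the others, and the node stores hash positions instead of the terms themselves.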

  10. Publisher Privacy if Nodes Are Compromised
      • If an edge node is compromised
        • Risk: individual updates of node summaries (from publishers to edge routers) may expose the publishers
        • Solution: publisher k-anonymity. Hide users in protected crowds of at least k publishers and...
      [Figure: QDT with a protected crowd of publishers highlighted below an edge router]

  11. Secure Multi-Party (SMP)
      • Solution: publisher k-anonymity. Hide users in protected crowds of at least k publishers and use secure multi-party (SMP) computation inside crowds to advertise updates of published terms to the edge routers
      [Figure: publisher 3-anonymous protected crowd of P1, P2, P3 below edge router 4; labels +Upd1, +Upd2, +Upd3, +R, -R; the edge router receives Upd1 + Upd2 + Upd3]
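
One plausible reading of the +R / -R labels in the figure is a ring-style secure sum: the first publisher masks its update with a random vector R, each other publisher adds its own update to the running total, and R is subtracted before the total reaches the edge router. The sketch below simulates that reading in a single process; the function name `secure_sum`, the modulus, and the message order are assumptions rather than details confirmed by the slide.

```python
import secrets

MODULUS = 2**32  # assumed counter range; all arithmetic is done modulo this value

def secure_sum(updates):
    """Sum equal-length counter vectors so that no partial sum is seen unmasked."""
    width = len(updates[0])
    # R: random mask known only to the initiating publisher.
    mask = [secrets.randbelow(MODULUS) for _ in range(width)]
    # The initiator sends Upd1 + R to the next publisher in the crowd.
    running = [(u + r) % MODULUS for u, r in zip(updates[0], mask)]
    # Every other publisher adds its own update to the masked running total.
    for update in updates[1:]:
        running = [(acc + u) % MODULUS for acc, u in zip(running, update)]
    # The initiator removes R; only the aggregate reaches the edge router.
    return [(acc - r) % MODULUS for acc, r in zip(running, mask)]

upd1, upd2, upd3 = [1, 0, 2], [0, 1, 1], [3, 0, 0]
aggregate = secure_sum([upd1, upd2, upd3])
```

The edge router learns only the aggregate, so an individual update cannot be attributed to a single member of the crowd from that message alone.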

  12. Publisher Privacy if Nodes Are Compromised
      • If an internal node is compromised
        • Risk: the node summary of advertised terms is exposed
        • The downstream subtree may contain sensitive content, but the crowd of publishers is now even bigger
      [Figure: QDT with a protected crowd of publishers highlighted below an internal node]

  13. Design Trade-off: Privacy Preservation vs. Performance
      The tree topology introduces congestion at upper QDT levels during query dissemination
      How to relieve the congestion?

  14. Techniques for Load Balancing
      • Overlaying multiple logical QDTs over the same underlay network
      • A physical node belongs to multiple logical QDTs but at different levels
      • Goal: organize the nodes into QDTs such that the distribution of tree levels for a node is uniform across the QDTs

  15. Overlaying Multiple QDTs: 4-QDTs
      [Figure: four QDTs (QDT1, QDT2, QDT3, QDT4) overlaid on the same nodes; node 1 is marked in each tree]

  16. Query Routing for Multiple QDTs
      • Partition the community data collection into disjoint blocks
      • Build one QDT per block Bi
      • QDTi groups all publishers with terms in Bi
      • Routing a query
        • Terms in the query determine the relevant blocks
        • Send the query to the corresponding QDT
        • Check the full query with publishers
      Example: Q3 = “poverty”; Q3 falls in B4, so use QDT4
      [Figure: QDT1 to QDT4; Q3 is routed on QDT4 to the publishers holding “poverty”]
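
The block lookup can be sketched as follows. The slide does not say how terms are assigned to blocks; hashing each term into one of four blocks is an assumption made here for illustration, as are the names `block_of` and `relevant_qdts`.

```python
import hashlib

NUM_QDTS = 4  # matches the 4-QDT example; any positive number works

def block_of(term, num_blocks=NUM_QDTS):
    """Map a term to exactly one block (assumed hash partitioning)."""
    digest = hashlib.sha256(term.lower().encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_blocks

def relevant_qdts(query_terms, num_blocks=NUM_QDTS):
    """Indices of the QDTs whose blocks contain at least one query term."""
    return sorted({block_of(term, num_blocks) for term in query_terms})

single_term = relevant_qdts(["poverty"])             # always exactly one QDT
multi_term = relevant_qdts(["Hong Kong", "money"])   # one or two QDTs
```

Because the blocks are disjoint, a single-term query always maps to exactly one QDT, while a multi-term query may span several.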

  17. Relieving the Congestion
      [Figure: QDT1 to QDT4 with queries Q1 = “Olympics”, “Beijing” and Q3 = “poverty”]

  18. Queries Spanning Multiple Blocks
      • Q4 = “Hong Kong”, “money”
      • Route Q4 on both trees (QDT3 and QDT4)?
      • Query selectivity optimization techniques: choose the most selective QDT to route on by maintaining only 1-3% of popular data items (see paper)
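
The slide defers the details of the selectivity optimization to the paper. As a rough sketch of the idea only, assume a small table of frequency counts is kept for the most popular terms and that any term missing from the table is treated as rare; the query is then routed on the QDT of its rarest term. The function name and the counts below are hypothetical.

```python
def pick_term_to_route_on(query_terms, popular_counts):
    """Return the query term assumed to be the most selective."""
    # Terms absent from the popular-items table are treated as rare (count 0).
    return min(query_terms, key=lambda term: popular_counts.get(term, 0))

# Hypothetical popularity table covering only a few frequent terms.
popular_counts = {"money": 5000, "Beijing": 3200}

chosen = pick_term_to_route_on(["Hong Kong", "money"], popular_counts)
```

Here the query would be routed on the QDT holding “Hong Kong”, and the full conjunction would still be checked with the publishers it reaches.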

  19. The Design Space: How Many QDTs Balance the Load?
      • Build one QDT for all advertised terms
        • Con: traffic congestion in upper levels
        • Pro: most aggressive pruning based on conjunctions
      • Build one QDT per advertised term
        • Con: tree maintenance (as many QDTs as terms)
        • Con: single-term queries are less selective (more traffic)
        • Pro: congestion-free
      “Sweet spot” expected to lie between the above extremes
      [Figure labels: ideal load; our solution]

  20. Finding the Sweet Spot
      • Empirical fact: the upper two levels in a QDT are the most congested
      • Model: cyclical permutation of nodes on the tree levels
        nr. of QDTs for load balance = nr. of legal permutations (i.e., without breaking the fairness property)
      Fairness property: all routers appear precisely once in the top two levels of any QDT
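
Under one reading of the fairness property (each router sits in the top two levels of exactly one QDT), the model can be sketched by cyclically shifting a level-ordered list of routers. The uniform fanout, the shift size, and the function name `cyclic_placements` are assumptions for illustration; the paper's analytical model may differ.

```python
def cyclic_placements(routers, fanout):
    """Level-order router placements, one per QDT, obtained by cyclic shifts."""
    slots = 1 + fanout                  # root plus its children = top two levels
    num_qdts = len(routers) // slots    # shifts possible before a router repeats on top
    placements = []
    for i in range(num_qdts):
        shift = i * slots
        placements.append(routers[shift:] + routers[:shift])
    return placements

routers = [f"r{i}" for i in range(12)]
placements = cyclic_placements(routers, fanout=2)
top_levels = [placement[:3] for placement in placements]  # top two levels of each QDT
```

With 12 routers and fanout 2 this yields four QDTs, and every router appears in the top two levels of exactly one of them.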

  21. Measuring Throughput
      • Overall throughput depends heavily on the most congested node
      • Look at node stress in terms of nr. of messages
        • going into a node: Processing Load at a node (PLoad)
        • going out of a node: Forwarding Load at a node (FLoad)
      • Throughput indicator: compare how far the peak load (k-QDTs) is from the ideal load
        ideal load (avg. load for 1-QDT) = nr. msgs / nr. nodes
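
The two load measures and the ideal load can be computed from a log of (sender, receiver) messages. The sketch below is illustrative: the function name `load_report`, the message format, and expressing the gap as a percentage above the ideal load are assumptions, and the function expects at least one message.

```python
from collections import Counter

def load_report(messages, nodes):
    """Peak PLoad and FLoad compared with the ideal (average) load."""
    pload = Counter(dst for _, dst in messages)  # messages going into a node
    fload = Counter(src for src, _ in messages)  # messages going out of a node
    ideal = len(messages) / len(nodes)           # nr. msgs / nr. nodes
    peak_p = max(pload[n] for n in nodes)
    peak_f = max(fload[n] for n in nodes)
    return {
        "ideal": ideal,
        "peak_pload": peak_p,
        "peak_fload": peak_f,
        "pload_pct_over_ideal": 100 * (peak_p - ideal) / ideal,
        "fload_pct_over_ideal": 100 * (peak_f - ideal) / ideal,
    }

# Hypothetical trace: node a forwards to b and c, node b forwards to d and e.
nodes = ["a", "b", "c", "d", "e"]
messages = [("a", "b"), ("a", "c"), ("b", "d"), ("b", "e")]
report = load_report(messages, nodes)
```

In this trace forwarding is concentrated on two nodes while every receiver gets a single message, so the peak FLoad lies further from the ideal load than the peak PLoad does.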

  22. Maximize Throughput: When Is Peak Closest to Ideal Load?
      • Experiment 1: PLoad for Scribe QDT topology
        • Result: the nr. of QDTs for load balance found experimentally coincides with that given by our analytical model
        • Load balance with
          • How close: 32% closest to ideal PLoad
          • How close: 923% closest to ideal FLoad
      To balance FLoad, node fanouts need to be the same
      • Experiment 2: FLoad for fanout-balanced QDT topologies
        • How close: 18% closest to ideal PLoad
        • How close: 130% closest to ideal FLoad

  23. Take-Away
      • Propose a novel publishing infrastructure
      • Empowers publishers to join and post without being associated with (sensitive) content
      • Generic solution: it extracts the maximum load balance supported by the QDT topology

  24. Thank you!