Trustworthy Distributed Search and Retrieval over the Internet
Yung-Ting Chuang
Electrical and Computer Engineering, University of California, Santa Barbara
May 3, 2013
Committee Members:
• Professor P. Michael Melliar-Smith, Chair
• Professor Louise E. Moser
• Professor Timothy P. Sherwood
• Professor Volkan Rodoplu
Yung-Ting Chuang's Ph.D. Defense
Outline
• Motivation
• Trustworthy Distributed Search and Retrieval
• Protecting against Malicious Attacks in iTrust
• Membership Management for iTrust
• Statistical Inference and Dynamic Adaptation for iTrust
• Conclusions and Future Work
Motivation
• Information is accessed over the Internet using centralized search engines
  • Benefits: efficient, robust, and scalable
  • Drawbacks: depends on administrators remaining benign
• Thus, we present a decentralized and distributed search and retrieval system
  • Benefits: prevents censorship and filtering of information
  • Drawbacks:
    • Needs more network bandwidth
    • Difficult to infer membership size and identify malicious nodes
Trustworthy Distributed Search and Retrieval
1. Related Work
2. Design of iTrust
3. Implementation of iTrust
4. User Interface of iTrust
5. Performance Evaluation of iTrust
6. Summary
1. Related Work
• Survey by Mischke and Risson on distributed search:
  • Structured: require nodes to be organized in an overlay network
    • Distributed Hash Table (DHT), Ring, Tree, Skip Lists
  • Unstructured: typically gossip-based, and use randomization
    • Flooding / Broadcast => Gnutella
    • Random walk and data replication => Sarshar, GIA, Lv
    • Key-based routing => Freenet
    • Direct routing => Pub-2-Sub
    • Square root function => Cohen, Zhong, Ferreira
• P2P systems concerned with security, privacy, and trust
  • Quasar: uses a structured overlay and protects users' sensitive information
  • OneSwarm: uses a combination of trusted and untrusted peers and protects the privacy of the users
  • GOSSPLE: fully decentralized system for social acquaintances using a gossip protocol
2. Design of iTrust a) Distribution of Metadata
[Figure: the Source of Information distributes metadata to nodes in the membership]
2. Design of iTrust b) Distribution of a Request
[Figure: the Requester of Information distributes a request; labels: Source of Information, Request Encounters Metadata]
2. Design of iTrust c) Retrieval of Information
[Figure: labels: Source of Information, Requester of Information, Request Matched]
3. Implementation of the iTrust System
4. User Interface of iTrust
5. Performance Evaluation of iTrust a) Analytical Model
• Notation
  • Membership contains n participating nodes
  • x is the proportion of participating nodes that are operational
  • Metadata are distributed to m nodes
  • Requests are distributed to r nodes
  • k nodes report matches to a requesting node (for the same metadata and the same request)
5. Performance Evaluation of iTrust a) Analytical Model
• Probability of k matches: [equation not recovered]
• Probability of one or more matches: [equation not recovered]
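The formulas on this slide did not survive extraction. A minimal sketch of how such probabilities could be computed is given here, under an assumption that is not stated in the slides: the r request nodes are drawn uniformly without replacement from the n members, and round(m·x) of the metadata-holding nodes are operational, so that the number of matches follows a hypergeometric distribution. The function names are illustrative only.

```python
from math import comb


def prob_k_matches(n, m, r, x, k):
    """Probability that exactly k of the r requested nodes hold the metadata.

    Assumption (not from the slides): round(m * x) operational metadata
    holders, and r nodes sampled without replacement from n (hypergeometric).
    """
    holders = round(m * x)
    # Impossible outcomes: more matches than holders or requests, or more
    # non-matching picks than there are non-holders.
    if k < 0 or k > min(holders, r) or r - k > n - holders:
        return 0.0
    return comb(holders, k) * comb(n - holders, r - k) / comb(n, r)


def prob_one_or_more(n, m, r, x):
    """Probability of at least one match: the complement of zero matches."""
    return 1.0 - prob_k_matches(n, m, r, x, 0)
```

For example, `prob_one_or_more(n, m, r, 1.0)` can be compared against `prob_one_or_more(n, m, r, 0.4)` to see how the match probability falls as the operational proportion x decreases.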
5. Performance Evaluation of iTrust a) Analytical Model
5. Performance Evaluation of iTrust b) Analysis vs. Emulation
6. Summary
• Problem we are trying to solve:
  • Centralized search engines can be tampered with to bias the results, or to conceal or censor information
• Our solutions and contributions:
  • We have implemented iTrust, which is a decentralized distributed search and retrieval system with no centralized mechanisms and no centralized control
  • We have demonstrated that the match probability is high, even if some participating nodes are subverted or non-operational
Protecting against Malicious Attacks in iTrust
1. Background
2. Related Work
3. Foundations
4. Detecting Malicious Attacks
5. Defending against Malicious Attacks
6. Performance Evaluation
7. Summary
1. Background
• Potential attacks:
  • Nodes do not match requests
  • Nodes do not return responses to the requester
• Effect of such attacks:
  • Probability of a match is decreased
• Existing work that addresses attacks:
  • Place nodes on a blacklist (Jesi)
  • Maintain a reputation or trust score (Condie)
• Our solution to such attacks:
  • Estimate the proportion of malicious nodes
  • Increase the number of nodes to which requests are distributed in order to restore the match probability
2. Related Work
• Work related to our detection algorithm:
  • Exponential Weighted Moving Average (EWMA)
    • Roberts et al.: for discovering anomalies and issuing alerts
  • Chi-squared test
    • Goonatilake: for detecting intrusions
    • Press et al.: for balancing weights of buckets
    • Belen and Heckert: for determining similarity between two models
  • EWMA and chi-squared test
    • Ye and Chen: for anomaly detection and intrusion detection
• Work related to our defensive adaptation algorithm:
  • Morselli: uses a feedback mechanism to adjust the replicas to improve search results
  • Leng: uses a maintainer to determine, update, and eliminate the data replicas
3. Foundations a) Normalization
• We cannot use requests that return k=0 responses
  • Because there might be no metadata to match
• The probability of k matches is negligibly small when k is large
• Thus, we exclude requests for k=0 and for k > K
• Our normalization equation is: [equation not recovered]
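The normalization equation itself is missing from the extracted slide. One plausible reading of the bullets above, offered only as an assumption, is that the probabilities for k = 1..K are rescaled so that they sum to one after the k=0 and k > K buckets are dropped. A minimal sketch (the function name `normalize` is hypothetical):

```python
def normalize(probs, K):
    """Rescale the probabilities of k = 1..K matches so they sum to 1.

    probs[k] is the (empirical or analytical) probability of k matches.
    Assumption: plain renormalization over the retained buckets; the slide's
    own equation was not recovered.
    """
    retained = probs[1:K + 1]  # drop k = 0 and every k > K
    total = sum(retained)
    if total == 0:
        raise ValueError("no probability mass in buckets 1..K")
    return {k: p / total for k, p in enumerate(retained, start=1)}
```

The same function can be applied to both the observed frequencies and the analytical probabilities, so that the two distributions are compared over identical buckets.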
3. Foundations b) Exponential Weighted Moving Average
• The EWMA method is computed as follows: [equation not recovered]
  where c is the weighting factor for the EWMA method
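The EWMA equation is not present in the extracted text. The sketch below uses the textbook form of an exponentially weighted moving average, with c weighting the newest observation; whether the slide applies c to the new observation or to the previous average is an assumption here, as are the function names.

```python
def ewma_update(previous, observation, c):
    """One EWMA step: blend the new observation into the running average.

    Assumption: c in [0, 1] weights the newest observation (textbook form).
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("weighting factor c must lie in [0, 1]")
    return c * observation + (1.0 - c) * previous


def ewma_series(observations, c):
    """Running EWMA over a sequence, seeded with the first observation."""
    averages = []
    current = None
    for value in observations:
        current = value if current is None else ewma_update(current, value, c)
        averages.append(current)
    return averages
```

A larger c makes the average react faster to recent responses; a smaller c smooths out noise at the cost of slower detection.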
3. Foundations c) Chi-Squared vs. Modified Chi-Squared
• Pearson's chi-squared statistic: $\chi^2 = \sum_{k=1}^{K} (o_k - e_k)^2 / e_k$
• Pearson's modified chi-squared statistic: [equation not recovered]
where:
  • $o_k$: the actual number of observations that fall into the kth bucket
  • $e_k$: the expected number of observations for the kth bucket
  • $K$: the number of buckets into which the observations fall
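As a rough illustration of the two statistics, the sketch below computes Pearson's chi-squared and one common modified variant (Neyman's form, which divides by the observed count instead of the expected count). The slide's own modified formula was not recovered, so treating it as Neyman's form is an assumption, and the function names are hypothetical.

```python
def pearson_chi_squared(observed, expected):
    """Pearson's statistic: sum over buckets of (o_k - e_k)^2 / e_k."""
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def modified_chi_squared(observed, expected):
    """A modified statistic (Neyman's form): (o_k - e_k)^2 / o_k.

    Assumption: this stands in for the slide's modified statistic, whose
    exact definition is missing from the extracted text.
    """
    if any(o <= 0 for o in observed):
        raise ValueError("observed counts must be positive")
    return sum((o - e) ** 2 / o for o, e in zip(observed, expected))
```

With observed counts [10, 20] and expected counts [15, 15], the first function gives 25/15 + 25/15 and the second gives 25/10 + 25/20, which shows how the choice of denominator changes the weight given to each bucket.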
3. Foundations d) Chi-Squared vs. Modified Chi-Squared
4. Detecting Malicious Attacks a) Detection Algorithm
• Collects responses to its requests using the EWMA method
• Normalizes the empirical probabilities
• Uses the modified chi-squared test to compare the empirical probabilities against the analytical probabilities for x=1.0, 0.7, 0.4, and 0.2
• Chooses the x with the smallest chi-squared value as the estimate x’
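The selection step can be sketched as follows. This is an illustration under several assumptions, not the thesis implementation: the analytical model is taken to be hypergeometric, the probabilities are renormalized over k = 1..K, and the distance used is Pearson's form rather than the modified statistic named on the slide (whose formula was not recovered). All function names are hypothetical.

```python
from math import comb


def prob_k(n, m, r, x, k):
    """Assumed hypergeometric model with round(m * x) operational holders."""
    holders = round(m * x)
    if k < 0 or k > min(holders, r) or r - k > n - holders:
        return 0.0
    return comb(holders, k) * comb(n - holders, r - k) / comb(n, r)


def normalized_model(n, m, r, x, K):
    """Analytical probabilities for k = 1..K, rescaled to sum to 1."""
    raw = [prob_k(n, m, r, x, k) for k in range(1, K + 1)]
    total = sum(raw)
    if total == 0:
        raise ValueError("model has no mass in buckets 1..K")
    return [p / total for p in raw]


def chi_squared_distance(empirical, model):
    """Pearson-style distance; infinite if mass is observed where none is expected."""
    total = 0.0
    for o, e in zip(empirical, model):
        if e == 0.0:
            if o > 0.0:
                return float("inf")
            continue
        total += (o - e) ** 2 / e
    return total


def estimate_x(empirical, n, m, r, K, candidates=(1.0, 0.7, 0.4, 0.2)):
    """Return the candidate x whose model is closest to the empirical data."""
    return min(
        candidates,
        key=lambda x: chi_squared_distance(empirical, normalized_model(n, m, r, x, K)),
    )
```

Here `empirical` is the list of normalized, EWMA-smoothed probabilities for k = 1..K; the candidate list matches the x values named on the slide.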
4. Detecting Malicious Attacks b) Example
5. Defending against Malicious Attacks a) Defensive Adaptation Algorithm
1. Initialize r = 0
2. Calculate yo based on the current r with the given n, m, and x
3. Determine whether yo is greater than the expected match probability
  • If not, increase r by 1 and go back to step 2
  • If so, return r
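The loop above can be sketched in a few lines. The match-probability function is an assumption (a hypergeometric model with round(m·x) operational metadata holders, standing in for the quantity the slide calls yo), and the upper bound of n on r is added here so that the search terminates; neither detail comes from the slides.

```python
from math import comb


def prob_one_or_more(n, m, r, x):
    """Assumed model: probability that r sampled nodes include a holder."""
    holders = round(m * x)
    if r > n - holders:
        return 1.0  # more requests than non-holders: a match is certain
    return 1.0 - comb(n - holders, r) / comb(n, r)


def required_r(n, m, x, target):
    """Smallest r whose match probability exceeds the target, else None."""
    for r in range(0, n + 1):  # step 1 starts at r = 0; n bounds the search
        if prob_one_or_more(n, m, r, x) > target:  # steps 2 and 3
            return r
    return None  # target not reachable even when every node is asked
```

Calling `required_r(n, m, x_estimate, target)` with a lower estimate of x yields a larger r, which is the mechanism the slide describes for restoring the match probability.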
5. Defending against Malicious Attacks b) Example
6. Performance Evaluation a) Varying the number of nodes
6. Performance Evaluation
7. Summary
• Problem we are trying to solve in this chapter:
  • Absence of centralized control makes it difficult to determine the proportion of non-operational nodes in the network
• Our solution and contributions:
  • A node can estimate the proportion of non-operational nodes in the network based on the responses to its requests
  • A node calculates the number of nodes to which the requests are distributed to maintain a high match probability
  • A node infers useful but unobservable information about the network as a whole by observing aspects of the behaviors of individual nodes that are visible to it
Membership Management for iTrust
1. Background
2. Related Work
3. iTrust Membership Protocols
4. Foundations
5. Performance Evaluation
6. Extended Scenario
7. Summary
1. Background
• Churn: nodes joining and leaving the membership
• Challenging tasks:
  • Estimating membership and membership size
  • Estimating churn
• Existing work that addresses churn:
  • Passive Monitoring (Sen et al., Gummadi et al.)
  • Active Probing (Chu et al., Liang, Bhagwan et al.)
  • Gossiping (Binzenhöfer, Pruteanu et al.)
• Our approach to address churn:
  • Nodes do not predict churn characteristics in advance
  • Each node maintains its local view of the membership and uses statistical inference to update its view
2. Related Work
• Work related to membership management:
  • Zage: biases neighbor selections toward beneficial nodes
  • SCAMP: nodes discover joining and leaving nodes through gossiping
  • CYCLON: nodes maintain a small and fixed-size neighbor list, with a shuffling protocol for large networks
  • Newcast: each node periodically selects a peer to exchange and update its membership list
• Work related to churn:
  • Binzenhöfer and Pruteanu et al.: estimate the churn rate through gossiping
  • Stutzbach & Rejaie: study churn characteristics and highlight problems that cause biased peer selections
  • Paulo et al.: maintain a dynamic mapping of flows according to the current set of neighbors
  • Liu: presents an age-based membership protocol with a conservative neighbor maintenance scheme under churn
  • Horowitz et al.: rely on the departure and arrival of nodes to estimate the current network size, without requiring any additional communication
3. iTrust Membership Protocols a) Joining the Membership
[Figure: labels: Joining Node, Bootstrapping Node]
3. iTrust Membership Protocols b) Leaving the Membership
[Figure: label: Leaving Node]
3. iTrust Membership Protocols c) Distributing Metadata
[Figure: labels: Source Node, Discover New Node, Discover Leaving Node]
3. iTrust Membership Protocols d) Distributing Requests
[Figure: labels: Requesting Node, Redistribute Metadata, Discover Leaving Node, Discover New Node]
4. Foundations a) Metrics
• LND: Leaves Not Detected
• JND: Joins Not Detected
• MA: Membership Accuracy
• MP: Match Probability for a request
• RT: Response Time required for a request
• MC: Message Cost per time unit
5. Performance Evaluation a) Retry R Membership Protocol
• Motivation:
  • When a node distributes a request message to R nodes, some of those nodes might have left the membership, so the node might receive fewer than R responses
• Solution:
  • We allow a node to keep sending its message to additional nodes until it receives R responses
• Our input variables for the Retry R Membership Protocol:
  • Try: the number of times that a requesting node sends its request message in an attempt to receive R responses
  • TryMax: the maximum Try value
5. Performance Evaluation b) Adaptive RR Membership Protocol
• Our Churn Estimator (CE) is: CE = (Left + Joined) / NumNodes, where
  • Left: number of nodes that were detected as non-operational
  • Joined: number of nodes that were discovered to have joined
  • NumNodes: number of requests that a requesting node sent
• The Requesting Rate (RR) is:
  • if CE > RRMin / RRMax then RR = RRMax x CE
  • else RR = RRMin
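A minimal sketch of the two rules on this slide follows. The churn estimator is taken to be (Left + Joined) / NumNodes, which is inferred from the listed variables and from the combined protocol's CE expression rather than read from the slide's own equation; the function names are illustrative.

```python
def churn_estimator(left, joined, num_nodes):
    """Fraction of contacted nodes found to have left or newly joined.

    Assumption: CE = (Left + Joined) / NumNodes; the slide's equation
    was not preserved in the extracted text.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return (left + joined) / num_nodes


def requesting_rate(ce, rr_min, rr_max):
    """Scale the requesting rate with churn, never dropping below rr_min."""
    if ce > rr_min / rr_max:
        return rr_max * ce
    return rr_min
```

The threshold RRMin / RRMax is exactly the churn level at which RRMax x CE equals RRMin, so the rate changes continuously as CE crosses it.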
5. Performance Evaluation c) Message Cost vs. Membership Accuracy
5. Performance Evaluation d) Combined Adaptive Membership
• Start infinite loop
  • if current time reaches nextTime
    • while Try <= 2 and resRec < R
      • make request to (R - resRec) nodes and get responses array
      • determine left, joined, N, responded from responses array
      • resRec = resRec + responded
      • Try = Try + 1
    • CE = (left + joined) / (R + R - resRec)
    • if CE > 1 / RRMax
      • RR = RRMax x CE
    • else
      • RR = 1
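One round of the pseudocode above can be sketched as a plain function. Several details are assumptions made for illustration: `send_request` is a hypothetical callback that contacts the given number of nodes and returns `(left, joined, responded)`; Try starts at 1; the left and joined counts are accumulated across tries; and the CE denominator is the total number of nodes contacted, which equals R + (R - resRec) when resRec is the response count after the first try. The scheduling around nextTime is left out because the slide does not say how RR sets it.

```python
def combined_adaptive_round(send_request, R, rr_max, try_max=2):
    """Run one request round and return (CE, RR).

    send_request(count) -> (left, joined, responded) is a hypothetical
    callback, not part of any real iTrust API.
    """
    if R < 1:
        raise ValueError("R must be at least 1")
    tries = 1
    res_rec = left = joined = contacted = 0
    while tries <= try_max and res_rec < R:
        want = R - res_rec  # only ask for the responses still missing
        l, j, responded = send_request(want)
        left += l
        joined += j
        res_rec += responded
        contacted += want
        tries += 1
    ce = (left + joined) / contacted  # churn estimator for this round
    rr = rr_max * ce if ce > 1 / rr_max else 1  # RRMin is fixed at 1 here
    return ce, rr
```

With R = 10, a first try that yields 8 responses, 2 departures, and 1 join, and a second try that yields the 2 missing responses, the function contacts 12 nodes in total and computes CE = 3 / 12.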
5. Performance Evaluation e) Performance Tuning
• Combined Adaptive with Try=2, RRMax = 100, 50, 30
5. Performance Evaluation e) Message Cost vs. Membership Accuracy
6. Extended Scenario a) Combined Adaptive Membership Protocol
7. Summary
• Problem we are trying to solve in this chapter:
  • We cannot accurately estimate the joining or leaving rates, or maintain an accurate view of the membership, when the system has high membership churn
• Our solution and contributions:
  • We presented an adaptive membership management protocol, which uses random sampling to discover newly joining and leaving nodes
  • Based on the responses it receives to its requests, a node calculates the churn estimator and dynamically adjusts its requesting rate to update its local view of the membership
  • Our membership protocol exploits the messages already required by the messaging protocol
Statistical Inference and Dynamic Adaptation for iTrust
1. Background
2. Model for iTrust
3. Dynamic Adaptation Algorithm
4. Performance Evaluation
5. Summary
1. Background
• Problems that co-exist in a fully distributed system:
  • High membership churn
  • Large proportion of malicious nodes
• Our approach to address both problems:
  • Use random sampling
  • Apply statistical inference techniques to estimate:
    • Membership churn with a large proportion of malicious nodes
    • Proportion of malicious nodes in the presence of high membership churn
2. Model for iTrust a) System and Fault Model
• We consider the following scenarios:
  • A node leaves the membership voluntarily
  • A node leaves the membership involuntarily
  • A malicious node responds to a request but does not report a match
• Parameters for membership churn:
  • JR: Joining Rate
  • LR: Leaving Rate
• Parameters for detecting malicious nodes:
  • X: Proportion of non-malicious nodes
3. Dynamic Adaptation Algorithm a) Parameters and Variables
• n: Size of the node’s current view of the membership
• m: Number of nodes to which the metadata are distributed
• r: Number of nodes to which the requests are distributed
• IE: Intersection estimator obtained by random sampling:
  • nIE: Estimate of n in I
  • mIE: Estimate of m in I
  • rIE: Estimate of r in I
• left: Number of nodes that were detected as non-operational
• numNodes: Number of nodes to which a requesting node sent its request
[Figure: labels: nIE, mIE, rIE]