Trustworthy Distributed Search and Retrieval over the Internet

Trustworthy Distributed Search and Retrieval over the Internet Yung-Ting Chuang Electrical and Computer Engineering University of California, Santa Barbara May 3, 2013 Committee Members: Professor P. Michael Melliar-Smith, Chair Professor Louise E. Moser Professor Timothy P. Sherwood Professor Volkan Rodoplu Yung-Ting Chuang's Ph.D. Defense

Outline • Motivation • Trustworthy Distributed Search and Retrieval • Protecting against Malicious Attacks in iTrust • Membership Management for iTrust • Statistical Inference and Dynamic Adaptation for iTrust • Conclusions and Future Work

Motivation • Information is accessed over the Internet using centralized search engines • Benefits: efficient, robust, and scalable • Drawbacks: depends on administrators remaining benign • Thus, we present a decentralized and distributed search and retrieval system • Benefits: prevents censorship and filtering of information • Drawbacks: • Needs more network bandwidth • Difficult to infer membership size and malicious nodes

Trustworthy Distributed Search and Retrieval • Related Work • Design of iTrust • Implementation of iTrust • User Interface of iTrust • Performance Evaluation of iTrust • Summary

1. Related Work • Survey by Mischke and Risson on distributed search: • Structured - Requires nodes to be organized in an overlay network • Distributed Hash Table (DHT), Ring, Tree, Skip Lists • Unstructured - Typically gossip-based, and uses randomization • Flooding / Broadcast => Gnutella • Random walk and data replication => Sarshar, GIA, Lv • Key-based routing => Freenet • Direct routing => Pub-2-Sub • Square root function => Cohen, Zhong, Ferreira • P2P systems concerned with security, privacy, and trust • Quasar - Uses a structured overlay and protects users' sensitive information • OneSwarm - Uses a combination of trusted and untrusted peers and protects the privacy of the users • GOSSPLE - Fully decentralized system for social acquaintances using a gossip protocol

2. Design of iTrust a) Distribution of Metadata [Figure labels: Source of Information]

2. Design of iTrust b) Distribution of a Request [Figure labels: Source of Information; Requester of Information; Request Encounters Metadata]

2. Design of iTrust c) Retrieval of Information [Figure labels: Source of Information; Requester of Information; Request Matched]

3. Implementation of the iTrust System

4. User Interface of iTrust

5. Performance Evaluation of iTrust a) Analytical Model • Notation • Membership contains n participating nodes • x is the proportion of participating nodes that are operational • Metadata are distributed to m nodes • Requests are distributed to r nodes • k nodes report matches to a requesting node (for the same metadata and the same request)

5. Performance Evaluation of iTrust a) Analytical Model • Probability of k matches: (equation shown on slide) • Probability of one or more matches: (equation shown on slide)
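
The equations on this slide are not preserved in the transcript. As a minimal sketch, assume the metadata are held by m of the n nodes, a fraction x of those holders is operational, and the r requested nodes are chosen uniformly at random without replacement. Under that assumption the number of matches k follows a hypergeometric distribution. The function names and the rounding of m·x to a whole number of holders are illustrative choices, not taken from the slides.

```python
from math import comb


def match_probability(n: int, m: int, r: int, x: float, k: int) -> float:
    """P(exactly k of the r requested nodes are operational metadata holders).

    Assumption (not from the slides): hypergeometric sampling, with
    round(m * x) operational holders among n nodes.
    """
    holders = round(m * x)
    if k < 0 or k > min(holders, r) or r - k > n - holders:
        return 0.0
    return comb(holders, k) * comb(n - holders, r - k) / comb(n, r)


def at_least_one_match(n: int, m: int, r: int, x: float = 1.0) -> float:
    """P(one or more matches) = 1 - P(no match)."""
    return 1.0 - match_probability(n, m, r, x, 0)


if __name__ == "__main__":
    # Illustrative parameters only.
    for x in (1.0, 0.7, 0.4, 0.2):
        print(x, round(at_least_one_match(n=1000, m=60, r=60, x=x), 4))
```

Under this model, lowering x shrinks the pool of operational holders, which is why the match probability drops when nodes are subverted or non-operational.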

5. Performance Evaluation of iTrust a) Analytical Model

5. Performance Evaluation of iTrust b) Analysis vs. Emulation

6. Summary • Problem we are trying to solve: • Centralized search engines can be tampered with to bias the results, or to conceal or censor information • Our solutions and contributions: • We have implemented iTrust, which is a decentralized distributed search and retrieval system with no centralized mechanisms and no centralized control • We have demonstrated that the match probability is high, even if some participating nodes are subverted or non-operational

Protecting against Malicious Attacks in iTrust • Background • Related Work • Foundations • Detecting Malicious Attacks • Defending against Malicious Attacks • Performance Evaluation • Summary

1. Background • Potential attacks: • Nodes do not match requests • Nodes do not return responses to the requester • Effect of such attacks: • Probability of a match is decreased • Existing work that addresses attacks: • Place nodes on a blacklist (Jesi) • Maintain a reputation or trust score (Condie) • Our solution to such attacks: • Estimate the proportion of malicious nodes • Increase the number of nodes to which requests are distributed in order to restore the match probability

2. Related Work • Work related to our detection algorithm: • Exponential Weighted Moving Average (EWMA) • Roberts et al. - For discovering anomalies and issuing alerts • Chi-squared test • Goonatilake - For detecting intrusions • Press et al. - For balancing weights of buckets • Belen and Heckert - For determining similarity between two models • EWMA and chi-squared test • Ye and Chen - For anomaly detection and intrusion detection • Work related to our defensive adaptation algorithm: • Morselli - Uses a feedback mechanism to adjust the replicas to improve search results • Leng - Uses a maintainer to determine, update, and eliminate the data replicas

3. Foundations a) Normalization • We cannot use requests that return k = 0 responses • Because there might be no metadata to match • Probability of k matches is negligibly small when k is large • Thus, we exclude requests for k = 0 and for k > K • Our normalization equation: (equation shown on slide)

3. Foundations b) Exponential Weighted Moving Average • The EWMA method is computed as follows: (equation shown on slide), where c is the weighting factor for the EWMA method
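
The EWMA equation itself is not preserved in the transcript. The sketch below uses the standard EWMA recurrence, new = c · observation + (1 − c) · previous, with c as the weighting factor named on the slide. Whether the slide weights the newest observation by c or by 1 − c is not recoverable here, so that orientation is an assumption, as are the function names.

```python
from typing import Iterable, List


def ewma_update(previous: float, observation: float, c: float) -> float:
    """One EWMA step; c is the weight given to the newest observation (assumed)."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("weighting factor c must lie in [0, 1]")
    return c * observation + (1.0 - c) * previous


def ewma_series(observations: Iterable[float], c: float, initial: float = 0.0) -> List[float]:
    """Running EWMA over a sequence of observations, starting from `initial`."""
    smoothed = []
    current = initial
    for obs in observations:
        current = ewma_update(current, obs, c)
        smoothed.append(current)
    return smoothed
```

A larger c makes the average react faster to recent responses; a smaller c smooths out noise at the cost of slower detection.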

3. Foundations c) Chi-Squared vs. Modified Chi-Squared • Pearson’s chi-squared statistic: $\chi^2 = \sum_{k=1}^{K} (o_k - e_k)^2 / e_k$ • Pearson’s modified chi-squared statistic: (equation shown on slide) where: • o_k: the actual number of observations that fall into the kth bucket • e_k: the expected number of observations for the kth bucket • K: the number of buckets into which the observations fall

3. Foundations d) Chi-Squared vs. Modified Chi-Squared

4. Detecting Malicious Attacks a) Detection Algorithm • Collects responses to its requests using the EWMA method • Normalizes the empirical probabilities • Uses the modified chi-squared test to compare the empirical probabilities against the analytical probabilities for x = 1.0, 0.7, 0.4, and 0.2 • Chooses the x with the smallest chi-squared value as the estimate x’
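
The detection steps above can be sketched as follows. This is an illustration under stated assumptions, not the slides' implementation: the analytical probabilities come from a hypergeometric model with round(m · x) operational metadata holders, the standard Pearson statistic stands in for the modified chi-squared statistic (whose formula is not in the transcript), and the EWMA smoothing step is omitted. All function names are hypothetical.

```python
from math import comb
from typing import List, Sequence


def analytic_probs(n: int, m: int, r: int, x: float, K: int) -> List[float]:
    """P(k matches) for k = 1..K, normalized to sum to 1 (k = 0 and k > K excluded)."""
    holders = round(m * x)  # assumption: whole number of operational holders
    raw = [
        comb(holders, k) * comb(n - holders, r - k) / comb(n, r) if k <= r else 0.0
        for k in range(1, K + 1)
    ]
    total = sum(raw)
    return [p / total for p in raw] if total > 0 else raw


def chi_squared(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Standard Pearson statistic; a tiny floor avoids division by zero."""
    return sum((o - e) ** 2 / max(e, 1e-12) for o, e in zip(observed, expected))


def estimate_x(counts: Sequence[float], n: int, m: int, r: int, K: int,
               candidates: Sequence[float] = (1.0, 0.7, 0.4, 0.2)) -> float:
    """counts[k-1] = number of requests that returned k responses, k = 1..K."""
    total = sum(counts[:K])
    if total <= 0:
        raise ValueError("need at least one request with 1..K responses")
    empirical = [c / total for c in counts[:K]]  # normalization step
    # Compare probabilities rather than counts; scaling does not change the argmin.
    scores = {x: chi_squared(empirical, analytic_probs(n, m, r, x, K)) for x in candidates}
    return min(scores, key=scores.get)
```

Restricting the comparison to a few candidate values of x keeps the test cheap; a finer grid of candidates would give a finer estimate at the cost of more comparisons.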

4. Detecting Malicious Attacks b) Example

5. Defending against Malicious Attacks a) Defensive Adaptation Algorithm • Step 1: Initialize r ← 0 • Step 2: Calculate yo based on the current r with the given n, m, and x • Step 3: Determine whether yo is greater than the expected match probability • If not, increase r by 1 and go back to step 2 • If so, return r
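
A minimal sketch of this loop, assuming the match probability is the hypergeometric "one or more matches" probability with round(m · x) operational metadata holders. The function names are hypothetical, and the sketch treats reaching the target exactly as sufficient, which differs slightly from the slide's "greater than".

```python
from math import comb


def at_least_one_match(n: int, m: int, r: int, x: float) -> float:
    """P(one or more matches) under the assumed hypergeometric model."""
    holders = round(m * x)
    return 1.0 - comb(n - holders, r) / comb(n, r)


def requests_needed(n: int, m: int, x: float, target: float) -> int:
    """Smallest r whose match probability reaches the target."""
    for r in range(0, n + 1):  # r starts at 0 and grows by 1, as on the slide
        if at_least_one_match(n, m, r, x) >= target:
            return r
    raise ValueError("target match probability is unreachable for these n, m, x")
```

For example, with n = 10, m = 2 and a target of 0.25, halving x from 1.0 to 0.5 raises the required r from 2 to 3 under this model.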

5. Defending against Malicious Attacks b) Example

6. Performance Evaluation a) Varying the number of nodes

6. Performance Evaluation

7. Summary • Problem we are trying to solve in this chapter: • Absence of centralized control makes it difficult to determine the proportion of non-operational nodes in the network • Our solution and contributions: • A node can estimate the proportion of non-operational nodes in the network based on the responses to its requests • A node calculates the number of nodes to which the requests are distributed to maintain a high match probability • A node infers useful but unobservable information about the network as a whole by observing aspects of the behaviors of individual nodes that are visible to it

Membership Management for iTrust • Background • Related Work • iTrust Membership Protocols • Foundations • Performance Evaluation • Extended Scenario • Summary

1. Background • Churn - Nodes joining and leaving the membership • Challenging tasks: • Estimating membership and membership size • Estimating churn • Existing work that addresses churn: • Passive Monitoring (Sen et al., Gummadi et al.) • Active Probing (Chu et al., Liang, Bhagwan et al.) • Gossiping (Binzenhofer, Pruteanu et al.) • Our approach to address churn: • Nodes don’t predict churn characteristics in advance • Each node maintains its local view of the membership and uses statistical inference to update its view

2. Related Work • Work related to membership management: • Zage - Biases neighbor selections toward beneficial nodes • SCAMP - Nodes discover joining and leaving nodes through gossiping • CYCLON - Nodes maintain a small, fixed-size neighbor list, with a shuffling protocol for large networks • Newcast - Each node periodically selects a peer to exchange and update its membership list • Work related to churn: • Binzenhofer and Pruteanu et al. - Estimate the churn rate through gossiping • Stutzbach and Rejaie - Study churn characteristics and highlight problems that cause biased peer selections • Paulo et al. - Maintain a dynamic mapping of flows according to the current set of neighbors • Liu - Presents an age-based membership protocol with a conservative neighbor maintenance scheme under churn • Horowitz et al. - Rely on the departure and arrival of nodes to estimate the current network size, without requiring any additional communication

3. iTrust Membership Protocols a) Joining the Membership [Figure labels: Joining Node; Bootstrapping Node]

3. iTrust Membership Protocols b) Leaving the Membership [Figure labels: Leaving Node]

3. iTrust Membership Protocols c) Distributing Metadata [Figure labels: Source Node; Discover New Node; Discover Leaving Node]

3. iTrust Membership Protocols d) Distributing Requests [Figure labels: Requesting Node; Redistribute Metadata; Discover Leaving Node; Discover New Node]

4. Foundations a) Metrics • LND: Leaves Not Detected • JND: Joins Not Detected • MA: Membership Accuracy • MP: Match Probability for a request • RT: Response Time required for a request • MC: Message Cost per time unit

5. Performance Evaluation a) Retry R Membership Protocol • Motivation: • When a node distributes a request message to R nodes, some of those nodes might have left the membership, so it might receive fewer than R responses • Solution: • We allow a node to keep sending its message to additional nodes until it receives R responses • Our input variables for the Retry R Membership Protocol: • Try: The number of times that a requesting node sends its request message in an attempt to receive R responses • TryMax: The maximum Try value

5. Performance Evaluation b) Adaptive RR Membership Protocol • Our Churn Estimator (CE): (equation shown on slide) where • Left: Number of nodes that were detected as non-operational • Joined: Number of nodes that were discovered to have joined • NumNodes: Number of requests that a requesting node sent • The Requesting Rate (RR) is: if CE > RRMin / RRMax then RR ← RRMax × CE else RR ← RRMin
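
The requesting-rate rule above can be sketched as follows. The churn estimator equation is missing from the transcript; the sketch assumes CE = (Left + Joined) / NumNodes, which is consistent with the variables listed on this slide and with the CE expression on the later Combined Adaptive Membership slide, but it remains an inference. Function names are hypothetical.

```python
def churn_estimator(left: int, joined: int, num_nodes: int) -> float:
    """Assumed form: fraction of contacted nodes that turned out to have left or joined."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return (left + joined) / num_nodes


def requesting_rate(ce: float, rr_min: float, rr_max: float) -> float:
    """RR = RRMax * CE when CE > RRMin / RRMax, otherwise RRMin (as on the slide)."""
    if ce > rr_min / rr_max:
        return rr_max * ce
    return rr_min
```

The threshold RRMin / RRMax is exactly the point at which RRMax × CE would fall below RRMin, so the rule amounts to scaling the rate with observed churn while never dropping below the minimum rate.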

5. Performance Evaluation c) Message Cost vs. Membership Accuracy

5. Performance Evaluation d) Combined Adaptive Membership
• Start infinite loop
  • if current time reaches nextTime
    • while Try <= 2 and resRec < R
      • make request to (R - resRec) nodes and get responses array
      • determine left, joined, N, responded from responses array
      • resRec = resRec + responded
      • Try = Try + 1
    • CE = (left + joined) / (R + R - resRec)
    • if CE > 1 / RRMax
      • RR = RRMax × CE
    • else
      • RR = 1
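
One pass of the loop body above might look like the sketch below. Several details are assumptions: `probe` is a hypothetical callback that contacts a batch of nodes and returns (responded, left, joined) counts, the scheduling around nextTime is left out, and the churn estimator divides by the total number of nodes actually contacted rather than the slide's R + R - resRec expression, whose exact reading is ambiguous in the transcript.

```python
from typing import Callable, Tuple

# Hypothetical callback: given a batch size, returns (responded, left, joined).
Probe = Callable[[int], Tuple[int, int, int]]


def adaptive_round(probe: Probe, R: int, rr_max: float, try_max: int = 2) -> Tuple[float, float]:
    """Retry until R responses or try_max attempts, then derive CE and RR."""
    if R <= 0:
        raise ValueError("R must be positive")
    res_rec = left = joined = contacted = 0
    tries = 0
    while tries < try_max and res_rec < R:
        batch = R - res_rec              # only ask for the responses still missing
        responded, l, j = probe(batch)
        contacted += batch
        res_rec += responded
        left += l
        joined += j
        tries += 1
    ce = (left + joined) / contacted     # assumed denominator: nodes contacted
    rr = rr_max * ce if ce > 1 / rr_max else 1
    return ce, rr
```

Combining the retry loop with the adaptive rate lets a single round both top up missing responses and feed the observed joins and leaves into the next requesting rate.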

5. Performance Evaluation e) Performance Tuning • Combined Adaptive with Try = 2, RRMax = 100, 50, 30

5. Performance Evaluation e) Message Cost vs. Membership Accuracy

6. Extended Scenario a) Combined Adaptive Membership Protocol

7. Summary • Problem we are trying to solve in this chapter: • We cannot accurately estimate the joining or leaving rates, or maintain an accurate view of the membership, when the system has high membership churn • Our solution and contributions: • We presented an adaptive membership management protocol, which uses random sampling to discover newly joined and departed nodes • Based on the responses it receives to its requests, a node calculates the churn estimator and dynamically adjusts its requesting rate to update its local view of the membership • Our membership protocol exploits the messages already required by the messaging protocol

Statistical Inference and Dynamic Adaptation for iTrust • Background • Model for iTrust • Dynamic Adaptation Algorithm • Performance Evaluation • Summary

1. Background • Problems that co-exist in a fully distributed system • High membership churn • Large proportion of malicious nodes • Our approach to address both problems: • Use random sampling • Apply statistical inference techniques to estimate: • Membership churn with a large proportion of malicious nodes • Proportion of malicious nodes in the presence of high membership churn

2. Model for iTrust a) System and Fault Model • We consider the following scenarios: • A node leaves the membership voluntarily • A node leaves the membership involuntarily • A malicious node responds to a request but does not report a match • Parameters for membership churn: • JR: Joining Rate • LR: Leaving Rate • Parameters for detecting malicious nodes: • X: Proportion of non-malicious nodes

3. Dynamic Adaptation Algorithm a) Parameters and Variables • n: Size of the node’s current view of the membership • m: Number of nodes to which the metadata are distributed • r: Number of nodes to which the requests are distributed • IE: Intersection estimator obtained by random sampling: • nIE: Estimate of n in I • mIE: Estimate of m in I • rIE: Estimate of r in I • left: Number of nodes that were detected as non-operational • numNodes: Number of requests that a requesting node sent [Figure labels: nIE, mIE, rIE]
