510 likes | 650 Views
Trustworthy Information Distribution and Retrieval. Michael Melliar-Smith University of California, Santa Barbara. Research conducted in collaboration with Louise E. Moser, Isai Michel Lombera and Yung-Ting Chuang
E N D
Trustworthy Information Distribution and Retrieval Michael Melliar-Smith University of California, Santa Barbara Research conducted in collaboration with Louise E. Moser, Isai Michel Lombera and Yung-Ting Chuang Supported in part by NSF Grant CNS 10-16103
InformationAccessovertheInternet • Modern society and commerce depend on access to information over the Internet • Information is accessed over the Internet using centralized search engines and search indexes • Internet search engines are centralized for efficiency and scalability • We cannot assume that centralized search engines will always deliver the information we seek, uncensored and unbiased • iTrust is a system for publishing, searching for, and retrieving information over the Internet that provides trustworthy access to information METIS'2011 iTrust Michael Melliar-Smith
What Is Trust? • "Trust is the mental background to delegation" (Castelfranchi and Falcone [1]) • When you delegate an activity, you trust that: • Good things will happen • Bad things will not happen • However, what is good and what is bad is a matter of your intent • Thus, it is not possible to provide a universal definition of trust • Several useful surveys [2], [3] on trust, related to reliability, security, and privacy METIS'2011 iTrust Michael Melliar-Smith
Trust in Algorithms • Trust in a strong encryption algorithm is based on estimates of the cost of breaking the keys • Publication of the encryption algorithm • Independent validation of the strength of the algorithm • No need to trust the people who created the algorithm • We aim to provide the same kind of trust for iTrust • Published algorithms • Statistical analysis • No need to trust an administrator Trust a social network with a large number of users METIS'2011 iTrust Michael Melliar-Smith
Related Work • Useful surveys [4], [5], [6] on publish/subscribe and distributed search, categorized as: • Structured – Require nodes to be organized in an overlay network, e.g., Distributed Hash Tables (DHTs), rings, trees • More efficient than unstructured • Involve administrative control and additional overhead for constructing and maintaining the overlay network • Still incur a trust risk through administrative control • Unstructured – Gossip-based; typically use randomization • Gnutella [7] - Great grandfather of unstructured systems; uses flooding of requests to find information • Freenet [8] - More sophisticated and efficient than Gnutella, because it learns from previous requests • Zhong and Shen [9] - Uses random walks; number of nodes visited by a request is proportional to the square root of the content popularity • Ferreira, et al. [10] - Uses random walks; replicates both the queries and the data in a sparse network METIS'2011 iTrust Michael Melliar-Smith
Related Work • Several recent information distribution and retrieval systems are concerned with security, privacy, and trust • Quasar [11] – A probabilistic publish/subscribe system, using a sparse structured overlay that is concerned with the release of sensitive information by a central node • OneSwarm [12] – A peer-to-peer data sharing system that uses a combination of trusted and untrusted peers • Part of an effort to provide an alternative to cloud computing that does not depend on centralized trust • Initial goal is to protect the privacy of the users • Uses trusted intermediary nodes to preserve anonymity METIS'2011 iTrust Michael Melliar-Smith
Objectives of iTrust • Main Objective: Provide users with information that is important to them • Publish information for other people to access • Search for, and retrieve, published information • Detect that the system is under attack • When the system is under attack, adapt to increase the probability of information distribution and retrieval, with some increase in costs METIS'2011 iTrust Michael Melliar-Smith
Non-Objectives of iTrust • No attempt to prevent the distribution of misinformation • No attempt to maintain secrecy and privacy • No attempt to minimize communication, processing and storage costs • Costs are greater than for conventional, centralized Internet search engines • We assume that the additional costs are acceptable, given the primary objective • Information distribution and retrieval are probabilistic METIS'2011 iTrust Michael Melliar-Smith
Characteristics of iTrust • Falls into the general category of random walk publish/subscribe systems • All nodes are equal • No central search engine • Replication of metadata and requests • Random search or random walk • No supernodes • Distributed membership • No central control over membership METIS'2011 iTrust Michael Melliar-Smith
Basic Idea of iTrust • The participating nodes constitute the membershipof iTrust • A source nodeis a participating node that • Produces information, which it makes available to other participating nodes • Produces metadata (keywords) describing the information, and distributes the metadata to a subset of participating nodes chosen at random • Arequesting nodeis a participating node that • Generates requests containing metadata and distributes the requests to a subset of participating nodes chosen at random • A participating node that receives a request • Compares the metadata in the request with the metadata it holds • If it finds a match, it returns the URL of the associated information to the requesting node • The requesting node then uses the URL to retrieve the information from the source node METIS'2011 iTrust Michael Melliar-Smith
Source of Information Distribution of Metadata METIS'2011 iTrust Michael Melliar-Smith
Source of Information Request Encounters Metadata Requester of Information Distribution of a Request METIS'2011 iTrust Michael Melliar-Smith
Source of Information Requester of Information Retrieval of Information Request Matched METIS'2011 iTrust Michael Melliar-Smith
Probabilistic Analysis • Membership contains n participating nodes • Metadata are distributed to m nodes • Requests are distributed to r nodes • Of the participating nodes, a proportion x are operational METIS'2011 iTrust Michael Melliar-Smith
Probabilistic Analysis • A request is distributed to r nodes (r trials) • Probability of no match on 1st trial: n-m • Probability of no match on 2nd trial: n-m-1 • Probability of no match on rth trial: n-m-r+1 • Probability q of no match on r trials: n-mn-m-1 … n-m-r+1 = (n-m)! (n-r)! n n-1 n-r+1 n n-1 n-r+1 n! (n-m-r)! METIS'2011 iTrust Michael Melliar-Smith
Probabilistic Analysis • Probability p of a match on r trials: 1 - n-mn-m-1 … n-m-r+1 = 1 - (n-m)! (n-r)! • m=r =n p > 1 – e-1 0.6321 • m=r =2n p > 1 – e-2 0.8647 • m=r =2n p > 1 – e-4 0.9817 • If m+r > n, then p = 1 n n-1 n-r+1 where n ≥ m+r n! (n-m-r)! METIS'2011 iTrust Michael Melliar-Smith
Probabilistic Analysis • Now assume that only a proportion x of the participating nodes are operational • Probability that 1st node has the metadata: m • Probability that 1st node has the metadata and is operational: mx • Probability of no match on 1st trial: 1- mx = n-mx • Probability of no match on 2nd trial: n-mx-1 • Probability of no match on rth trial: n-mx-r+1 n n n n n-1 n-r+1 METIS'2011 iTrust Michael Melliar-Smith
Probabilistic Analysis • Probability q of no match on r trials: n-mxn-mx-1 … n-mx-r+1 = (n-mx)! (n-r)! • Probability p of a match on r trials: 1 - n-mxn-mx-1 … n-mx-r+1 = 1 - (n-mx)! (n-r)! • If mx+r > n, then p = 1 n n-1 n-r+1 n! (n-mx-r)! n n-1 n-r+1 where n ≥ mx+r n! (n-mx-r)! METIS'2011 iTrust Michael Melliar-Smith
Probability of a Match METIS'2011 iTrust Michael Melliar-Smith
Probability of a Match METIS'2011 iTrust Michael Melliar-Smith
Time-to-Live • Metadata can be provided with a time-to-live • Receiver of the metadata deletes the metadata when the time-to-live expires • Similarly, a request can be provided with a time-to-live • Receiver of the request stores the request until the time-to-live and then deletes the request • Receiver attempts to match newly arrived metadata with the metadata in the request until the time-to-live METIS'2011 iTrust Michael Melliar-Smith
Small Information • Many information items are small • Distribute the information itself, rather than the metadata about the information METIS'2011 iTrust Michael Melliar-Smith
Different Classes of Nodes • Some nodes are less capable, or are only intermittently connected to the network • Distribute the metadata and the requests only to the more capable nodes • Less capable nodes might have more powerful proxy nodes or home agents METIS'2011 iTrust Michael Melliar-Smith
Forwarding Metadata and Requests • To exploit the parallelism of the Internet, the originator of the metadata or request does not necessarily send the metadata or request to all m or r of the participating nodes • When a node receives the metadata or request, with some probability, it forwards the metadata or request to another participating node selected at random • Doing so introduces some variability in the number of nodes to which the metadata and requests are distributed METIS'2011 iTrust Michael Melliar-Smith
Differential Distribution • If there are many more requests than metadata, it might be appropriate to distribute the metadata to more nodes and the requests to fewer nodes • Similarly, long-lived metadata and requests might be distributed to more nodes than short-lived metadata and requests • Likewise, frequently requested metadata might be distributed to more nodes than rarely requested metadata METIS'2011 iTrust Michael Melliar-Smith
Network Load • We are investigating the effects of metadata and request distribution on the network load and also on the load of participating nodes • If the network load is too high, it might be necessary to reduce the number of nodes to which the metadata and requests are distributed METIS'2011 iTrust Michael Melliar-Smith
Membership • The membership of participating nodes need not be exact and up-to-date • Small differences in the membership are equivalent to small proportions of non-operational nodes • It is essential, to the iTrust strategy, that the membership should not be centrally managed • Thus, we employ a membership algorithm that is based on iTrust itself METIS'2011 iTrust Michael Melliar-Smith
Membership Algorithm • A node wishing to join the membership contacts any current member to obtain the current membership • It does so using mechanisms that are outside the iTrust strategy, perhaps Email, Twitter, etc. • It then publishes its joining the membership, through the iTrust distribution and retrieval mechanisms • All nodes periodically request and retrieve information about new nodes that have joined the membership METIS'2011 iTrust Michael Melliar-Smith
Joining the Membership METIS'2011 iTrust Michael Melliar-Smith
Discovering New Members METIS'2011 iTrust Michael Melliar-Smith
Leaving the Membership METIS'2011 iTrust Michael Melliar-Smith
Rapidly Changing Memberships • At times of rapid membership change, it might be appropriate to request and retrieve membership information more frequently, with increased computation and communication costs • At times of rapid membership change, it might be appropriate to distribute the metadata and requests to more nodes to compensate for inaccurate membership information METIS'2011 iTrust Michael Melliar-Smith
Large Memberships • Large memberships (perhaps millions of nodes) might be expensive to retrieve and store • The potentially high rate of notifications of membership changes for a large membership might impose a heavy load on the network • We are investigating strategies for creating and maintaining memberships in which each node is aware of only a small subset of the membership • We are also investigating the effects of small subsets of the membership, on the effectiveness of information distribution and retrieval METIS'2011 iTrust Michael Melliar-Smith
Encryption • In iTrust, there is no intention to use encryption to ensure secrecy or privacy at the node level • Necessarily, metadata and requests must be readable by large numbers of nodes and, thus, they are public • However, encryption can be used to make it prohibitively expensive for routers to use deep packet inspection to censor metadata or requests • For this purpose, iTrust uses standard public key encryption METIS'2011 iTrust Michael Melliar-Smith
Encrypted Metadata and Requests • When a node sends metadata or requests, it encrypts the message with • Its private key • Destination’s public key • The sending node includes its public key in the message • Some receiver nodes might not yet have its information in their membership tables • When a node finds that a request matches its metadata, • It uses the public key in the request to encrypt the response reporting the match to the requester • The response supplies the URL that directs the requester to the source of the information, and also the source’s public key METIS'2011 iTrust Michael Melliar-Smith
Potential Malicious Attacks • A malicious attacker might seed the network with covertly subverted nodes that behave normally, except that they fail to report matches involving information that the attacker wants to suppress • A malicious attacker must ensure that a large number of nodes that participate in matching have been subverted • In iTrust, it is important to detect a malicious attack, and to prevent the malicious attack from being effective METIS'2011 iTrust Michael Melliar-Smith
Detecting a Malicious Attack • In iTrust, it is likely that a request will result in several reports of matches • The probability of multiple reports of matches depends on: • Number n of participating nodes • Number m of nodes to which the metadata are distributed • Number r of nodes to which the requests are distributed • Probability x that a node is operational • The effect of a malicious attack is to increase the probability that a substantial number of subverted nodes appear to be non-operational for certain metadata or requests METIS'2011 iTrust Michael Melliar-Smith
Detecting a Malicious Attack 0.6 1000 Node Network with Distribution to 60 Nodes Percentage of nodes operational 100% 0.5 80% 0.4 60% Probability of Number of Matches 0.3 40% 20% 0.2 0.1 0 k=0 k=1 k=2 k=3 k=4 k=5 Number of Matches METIS'2011 iTrust Michael Melliar-Smith
Responding to a Malicious Attack • If the iTrust network is under attack, it is appropriate to increase the number of nodes to which the metadata and requests are distributed • We are investigating an adaptive algorithm that increases the number of nodes to which the metadata and requests are distributed, as the probability of an attack increases, i.e., our estimate of the number of subverted nodes increases METIS'2011 iTrust Michael Melliar-Smith
Prototype Implementation of iTrust • Based on the Apache Web server, compiled with several PHP standard modules and library extensions • Uses HTTP for distribution of metadata and requests, and for retrieval of documents • Multiple iTrust nodes can be installed on a single Web server by creating multiple virtual Web sites on that server • Comprises three components: • Web Server Foundation • Application Infrastructure • Public Interface METIS'2011 iTrust Michael Melliar-Smith
The Components of iTrust METIS'2011 iTrust Michael Melliar-Smith
Web Server Foundation • cURL is used for inter-node communication and resource-specific actions • SQLite database tables are used to store node, metadata, and resource information • A node uses the SQLite LIKE function to match the metadata in a request with the metadata that it holds • The session module tracks and distinguishes users • The log module is used for debugging and for simulation • The PHP Extension Community Library (PECL) for HTTP is used for inter-node search and requests METIS'2011 iTrust Michael Melliar-Smith
Application Infrastructure • The metadata XML engine scans the resources and creates an XML list describing the relationship between the metadata and the resource • The node and resource-related helper functions insert nodes into the membership, insert keywords into the database, and upload or fetch resources • The Apache Tika and Lucene packages are used to generate metadata from resources, if the user opts not to generate the metadata manually • The WordNet dictionary is used to provide spell checking and synonym suggestions METIS'2011 iTrust Michael Melliar-Smith
Public Interface • Comprises two kinds of interfaces: • Computer interfaces – Handle all inter-node communication such as queries, resource distribution, and metadata list distribution • Request is sent to participating nodes using computer interfaces in a simple inbox-type fashion • A participating nodes reads its inbox for queries, and sends back a response if it has a match • Human interfaces – Consist of PHP HTML Web pages • Administrator can add nodes or metadata keywords using HTML form text boxes • User generates requests using HTML form text boxes • User settings and statistics Web pages provide feedback on the membership size, resource count, etc. METIS'2011 iTrust Michael Melliar-Smith
Prototype Implementation of iTrust METIS'2011 iTrust Michael Melliar-Smith
Prototype Implementation of iTrust METIS'2011 iTrust Michael Melliar-Smith
Prototype Implementation of iTrust METIS'2011 iTrust Michael Melliar-Smith
Simulation Results Based on the Implementation METIS'2011 iTrust Michael Melliar-Smith
Conclusion and Future Work • We have described iTrust [13], [14], a trustworthy information distribution and retrieval network • We plan to do experimental evaluations of the prototype implementation using PlanetLab • We are investigating other implementations of iTrust based on: • SMS • Twitter • What else? We need your advice. • We plan to make the iTrust source code, tools, documentation, etc. freely available for all to use METIS'2011 iTrust Michael Melliar-Smith
References [1] C. Castelfranchi and R. Falcone, Principles of Trust for MAS: Cognitive Anatomy, Social Importance, and Quantification, Proceedings of the International Conference on Multi-Agent Systems, Paris, France, 72-79. [2] D. Artz and Y. Gil, A Survey of Trust in Computer Science and the Semantic Web, Journal of WebSemantics, 5:58-71, Elsevier, 2007. [3] T. Grandison and M. Sloman, A Survey of Trust in Internet Applications, IEEE Communications Survey Tutorials, 4(4):2-16, 2000. [4] P. T. Eugster, P. A. Felber, R. Guerraoui and A. M. Kermarrec, The Many Faces of Publish/Subscribe, ACM Computing Surveys, 35(2):114-131, June 2005. [5] J. Mischke and B. Stiller, A Methodology for the Design of Distributed Search in P2P Middleware, IEEE Network 19(1):30-37, January 2004. [6] J. Risson and T. Moors, Survey of Research Towards Robust Peer-to-Peer Networks: Search Methods, Technical Report UNSW-EE-P2P-1-1. University of New South Wales, September 2007, RFC 4982, http://tools.ietf.org/html/rfc-4821. [7] Gnutella, http://gnutella.wego.com/. [8] I. Clarke, O. Sandberg, B. Wiley and T. Hong, Freenet: A Distributed Anonymous Information Storage and Retrieval System, Proceedings of the Workshop on Design Issues in Anonymity and Unobservability, Lecture Notes in Computer Science, Berkeley, CA, July 2000, 46-66. [9] M. Zhong and K. Shen, Popularity-Based Random Walks for Peer-to-Peer Search under the Square-Root Principle, Lecture Notes in Computer Science 4490, 2007, 877-880. [10] R. A. Ferreira, M. K. Ramanathan, A. Awan, A. Grama and S. Jagannathan, Search with Probabilistic Guarantees in Unstructured Peer-to-Peer Networks, Proceedings of the Fifth IEEE International Conference on Peer-to-PeerComputing, Konstance, Germany, August 2005, 165-172. [11] B. Wong and S. Guha, Quasar: A Probabilistic Publish-Subscribe System for Social Networks, Proceedings of the 7th International Workshop on Peer-to-Peer Systems, Tampa Bay, FL, February 2008. [12] T. Isdal, M. Piatek, A. Krishnamurthy and T. Anderson, Privacy Preserving P2P Data Sharing with OneSwarm, Technical Report UW-CSE, Department of Computer Science, University of Washington, 2009. [13] I. Michel Lombera, Y. T. Chuang, P. M. Melliar-Smith and L. E. Moser, Trustworthy Distribution and Retrieval of Information over HTTP and the Internet, Proceedings of the Third International Conference on the Evolving Internet,INTERNET 2011, Luxembourg, June 2011. [14] Y. T. Chuang, I. Michel Lombera, L. E. Moser and P. M. Melliar-Smith, Trustworthy Distributed Search and Retrieval over the Internet, Proceedings of the 2011 WORLDCOMP, International Conference on Internet Computing,ICOMP, Las Vegas, NV, July 2011. METIS'2011 iTrust Michael Melliar-Smith