Adaptive Content Management in Structured P2P Communities Jussi Kangasharju Keith W. Ross David A. Turner
Content • Introduction • Related Work • Adaptive Algorithms • Experimental Results • Optimization Theory • Conclusion
Introduction (1) • P2P file sharing is the dominant traffic type on the Internet • Two types of P2P systems • Unstructured, e.g., KaZaA and Gnutella • Nodes are not organized into highly structured overlays • Content is randomly assigned to nodes • Structured, e.g., CAN and Chord • Distributed hash table (DHT) substrates are used • Nodes are organized into highly structured overlays • Keys are deterministically assigned to nodes
Introduction (2) • We consider DHT-based P2P file-sharing communities • P2P community: a collection of intermittently connected nodes • Nodes contribute storage, content, and bandwidth to the rest of the community • When a node in the community wants a file • It tries to retrieve the file from the other nodes in the community • If the file is not found, the community retrieves the file from outside • The file is cached and a copy is forwarded to the requesting node
Introduction (3) • Addresses the problem of content management in P2P file-sharing communities • Proposes algorithms to adaptively manage content • Goal: minimize the average delay, i.e., the time from when a node makes a query for a file until the node receives the file in its entirety • File transfer delays >> lookup delays • Intra-community file transfers occur at relatively fast rates compared with file transfers into the community
Introduction (4) • The problem is equivalent to adaptively managing content to maximize the intra-community hit rate • Replication: how should content be replicated to provide satisfactory hit rates? • Replacement: how does a node decide which files to keep and which to evict? • Contributions • Algorithms for dynamically replicating and replacing files in a P2P community • No a priori assumptions about file request rates or nodal up probabilities • Simple, adaptive, and fully distributed • An analytical optimization theory to benchmark the adaptive replication algorithms • For complete-file replication • For the case when files are segmented and erasure codes are used
Related Work • Squirrel [8] • A distributed, serverless, P2P web caching system • Built on top of the Pastry DHT substrate • Focuses on protocol design and implementation • Does not address replication and file replacement • [13] and [14] study optimal replication in unstructured peer-to-peer networks • Goal: reduce random search times
DHT Substrate • Each node has access to the API of a DHT substrate • Given a file j, the substrate determines an ordered list of the currently up nodes • For a given value of K, the list is (i1, i2, …, iK) • i1 is the first-place winner for file j
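As an illustration, a minimal sketch of how a substrate might produce such an ordered winner list. The rendezvous-style hashing and all names here are assumptions for illustration, not the paper's actual substrate:

```python
import hashlib

def winners(file_id: str, up_nodes: list, k: int) -> list:
    """Rank the up nodes for a file by hashing (file, node) pairs;
    the k best-ranked nodes are the winners i1, ..., iK. Any node
    computing this over the same up set gets the same ordered list."""
    return sorted(
        up_nodes,
        key=lambda n: hashlib.sha1(f"{file_id}:{n}".encode()).hexdigest(),
    )[:k]
```

Because the ranking is deterministic, every node resolves the same first-place winner for a given file.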
LRU Algorithms (1) • Fundamental problem: • "How can we adaptively add and remove replicas, in a distributed manner and as a function of evolving demand, to maximize the hit probability?" • Suppose X is a node that wants file j • Basic LRU algorithm • X uses the substrate to determine i1, the first-place winner for j • If i1 doesn't have j, i1 retrieves j from outside the community and puts a copy in its storage • If i1 needs to make room for j, the LRU replacement policy is used • i1 sends j to X • X does not put j in its storage
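A minimal sketch of the first-place winner's behaviour under the basic LRU algorithm. Counting capacity in files rather than bytes and the `fetch_outside` callback are simplifying assumptions of this sketch:

```python
from collections import OrderedDict

class LRUNode:
    """First-place winner behaviour under the basic LRU algorithm."""
    def __init__(self, capacity: int):
        self.capacity = capacity          # capacity in files (simplified)
        self.store = OrderedDict()        # file -> contents, in LRU order

    def request(self, file_id, fetch_outside):
        if file_id in self.store:             # hit: refresh recency
            self.store.move_to_end(file_id)
            return self.store[file_id]
        data = fetch_outside(file_id)         # miss: fetch from outside
        self.store[file_id] = data            # cache a copy locally
        if len(self.store) > self.capacity:   # evict least recently used
            self.store.popitem(last=False)
        return data                           # sent to the requester X
```

The requesting node X only receives the returned copy; it never inserts j into its own store.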
LRU Algorithms (2) • Under the basic LRU algorithm, a request can be a "miss" even when the file is cached at some up node within the community • Top-K LRU algorithm • When i1 doesn't have j, i1 determines i2, …, iK and pings each of these K-1 nodes to see whether any of them has j • If so, i1 retrieves j from one of those nodes and puts a copy in its storage • Otherwise, i1 retrieves j from outside the community • The algorithm replicates content • Without any a priori knowledge of request patterns or nodal up probabilities • In a fully distributed manner
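The Top-K variant can be sketched as follows, with plain ordered dictionaries standing in for node storage and capacity counted in files (both simplifying assumptions):

```python
from collections import OrderedDict

def lru_put(store, capacity, file_id, data):
    """Insert into an OrderedDict kept in LRU order, evicting if full."""
    store[file_id] = data
    store.move_to_end(file_id)
    if len(store) > capacity:
        store.popitem(last=False)

def top_k_request(winner_stores, capacity, file_id, fetch_outside):
    """winner_stores[0] belongs to i1, the rest to i2..iK.
    i1 serves the file, pinging the other winners before going outside."""
    i1 = winner_stores[0]
    if file_id in i1:                         # hit at the first-place winner
        i1.move_to_end(file_id)
        return i1[file_id]
    for peer in winner_stores[1:]:            # ping i2, ..., iK
        if file_id in peer:                   # intra-community hit
            lru_put(i1, capacity, file_id, peer[file_id])
            return i1[file_id]
    lru_put(i1, capacity, file_id, fetch_outside(file_id))  # community miss
    return i1[file_id]
```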
Observations • The Top-K LRU algorithm is simple, but its performance is significantly below the theoretical optimum • Two observations: • LRU lets unpopular files linger in nodes; intuitively, if we do not store the less popular files, the popular files can have more replicas • More than one node must be searched to find files in the file-sharing system
MFR Algorithm (1) • Most Frequently Requested (MFR) algorithm • Has near-optimal performance • Each node i maintains an estimate of aj(i), the local request rate for file j • aj(i) = the number of requests node i has seen for file j, divided by the amount of time node i has been up • Each node i stores the files with the highest aj(i) values, packing in as many files as possible
MFR Algorithm (2) • MFR retrieval and replacement policy • When node i receives a request for file j, it updates aj(i) • If i doesn't have j and MFR says it should, i retrieves j from outside and puts j in its storage • If i needs to make room for j, the MFR replacement policy is used • More than one node must be searched • "Ping" dynamics influence aj(i) so that the numbers of replicas across all nodes become nearly optimal
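A sketch of MFR storage management at a single node, under the simplifying assumptions of a fixed uptime and capacity counted in files:

```python
class MFRNode:
    """Keep the files with the highest local request rates
    a_j(i) = requests seen / node uptime (uptime fixed here)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.requests = {}     # file -> request count seen by this node
        self.uptime = 1.0      # simplification: constant uptime
        self.store = set()

    def rate(self, file_id):
        return self.requests.get(file_id, 0) / self.uptime

    def should_store(self, file_id):
        """MFR says to store j if it ranks within capacity by a_j(i)."""
        ranked = sorted(self.requests, key=self.rate, reverse=True)
        return file_id in ranked[: self.capacity]

    def on_request(self, file_id, fetch):
        self.requests[file_id] = self.requests.get(file_id, 0) + 1
        if file_id not in self.store and self.should_store(file_id):
            fetch(file_id)                          # retrieve the file
            self.store.add(file_id)
            while len(self.store) > self.capacity:  # evict lowest-rate file
                self.store.remove(min(self.store, key=self.rate))
        return file_id in self.store
```

Note how a newly popular file displaces a previously stored file only once its observed rate overtakes it.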
MFR Algorithm (3) • One option: "ping" the top-K winners in parallel • Retrieve the file from any node that has it • Each ping counts as a request: nodes update their request rates and manage their storage with MFR • However, this approach does not give better performance • Instead: sequentially request j from the top-K winners • Stop the sequential requests once j is found
Experiment Results (1) • Simulation experiments • 100 nodes and 10,000 files • Request probabilities follow a Zipf distribution with parameter 0.8 or 1.2 • All files have the same size • Each node contributes the same amount of storage • Measure the hit performance of the algorithms
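The Zipf request distribution used in the experiments can be generated as follows (function names are illustrative; qj is proportional to 1/j^alpha):

```python
import random

def zipf_probs(num_files: int, alpha: float) -> list:
    """Normalized Zipf request probabilities: q_j proportional to 1/j^alpha."""
    weights = [1.0 / (j ** alpha) for j in range(1, num_files + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_request(probs, rng=random):
    """Draw one file index (0-based) according to the request distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With alpha = 0.8 the distribution is flatter; with alpha = 1.2 requests concentrate more heavily on the most popular files.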
Experiment Results (2) • LRU performs better than the non-cooperative algorithm but significantly worse than the theoretical optimum
Experiment Results (4) • Using a K greater than 1 improves the hit probability • K beyond 5 gives insignificant improvement
Experiment Results (5) • The number of replicas changes over time; the graphs report average values • The optimal scheme replicates the more popular files much more aggressively • The optimal scheme does not store the less popular files
Experiment Results (7) • The MFR algorithm's replica profile is very close to optimal • Thus, the hit rates are also very close to optimal
Analysis of MFR (1) • An analytical procedure for calculating the steady-state replica profile and hit probability of Top-K MFR for the case K = I • The results still serve as excellent approximations when K is small • Assumptions: • I is the number of nodes • J is the number of distinct files • pi is the "up" probability of node i • Si is the amount of shared storage (in bytes) at node i • bj is the size (in bytes) of file j • qj is the request probability for file j • The request probabilities for the J files are known
Analysis of MFR (2) • The procedure sequentially places copies of files into the nodes • Ti is the remaining unallocated storage at node i • xij equals 1 if a copy of file j has been placed at node i • Initialization: Ti = Si, xij = 0, and vj = qj/bj • Step 1: find the file j with the largest value of vj • Step 2: sequentially examine the winning nodes for j until a node i is found such that Ti >= bj and xij = 0; then set xij = 1, vj = vj(1 - pi), and Ti = Ti - bj • If there is no node with Ti >= bj and xij = 0, remove file j from further consideration • Return to Step 1 if not all files have been removed from consideration
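The placement procedure above can be sketched directly. Examining nodes in index order stands in for the DHT winner ordering per file (an assumption of this sketch):

```python
def mfr_replica_profile(p, S, b, q):
    """Steady-state replica profile of Top-K MFR for K = I.
    p[i]: up probability of node i, S[i]: storage of node i,
    b[j]: size of file j, q[j]: request probability of file j.
    Nodes are examined in index order, approximating the per-file
    winner ordering a real DHT substrate would provide."""
    I, J = len(p), len(b)
    T = list(S)                          # remaining storage per node
    x = [[0] * J for _ in range(I)]      # x[i][j] = 1 if copy of j at i
    v = [q[j] / b[j] for j in range(J)]  # value of the next copy of j
    active = set(range(J))
    while active:
        j = max(active, key=lambda f: v[f])    # step 1: best file
        for i in range(I):                     # step 2: next winning node
            if T[i] >= b[j] and x[i][j] == 0:
                x[i][j] = 1                    # place a copy
                v[j] *= (1 - p[i])             # discount the next copy
                T[i] -= b[j]
                break
        else:
            active.remove(j)                   # no room left for file j
    return x
```

With tight storage, the procedure reproduces the qualitative behaviour seen in the experiments: popular files crowd out unpopular ones.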
Optimization Theory (1) • An analytical theory for optimal replication in P2P communities • Two cases: complete-file replication (no fragmentation), and files segmented and erasure coded • No fragmentation: with xij = 1 if node i stores a copy of file j, the problem is
maximize Σj qj [ 1 − Πi (1 − pi)^xij ]
subject to Σj bj xij <= Si for each node i, with xij ∈ {0, 1}
Optimization Theory (2) • The problem is NP-complete • Consider a special case • Homogeneous up probabilities: pi = p • nj = number of replicas of file j • The problem becomes
maximize Σj qj [ 1 − (1 − p)^nj ]
subject to Σj bj nj <= S, where S is the total community storage and nj ∈ {0, 1, …, I}
• This special case can be solved efficiently by dynamic programming
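A sketch of the dynamic program for this homogeneous special case, assuming integer file sizes and at most `max_copies` (= I) replicas per file; names are illustrative:

```python
def optimal_replicas(p, S, b, q, max_copies):
    """Maximize sum_j q[j]*(1 - (1-p)**n[j]) subject to
    sum_j b[j]*n[j] <= S, with 0 <= n[j] <= max_copies.
    dp[s] = best objective over files processed so far using <= s storage."""
    dp = [0.0] * (S + 1)
    choice = []
    for j in range(len(b)):
        new = [float("-inf")] * (S + 1)
        pick = [0] * (S + 1)
        for s in range(S + 1):
            for n in range(min(max_copies, s // b[j]) + 1):
                val = dp[s - n * b[j]] + q[j] * (1 - (1 - p) ** n)
                if val > new[s]:
                    new[s], pick[s] = val, n
        dp = new
        choice.append(pick)
    n_opt, s = [0] * len(b), S       # backtrack the replica counts
    for j in reversed(range(len(b))):
        n_opt[j] = choice[j][s]
        s -= n_opt[j] * b[j]
    return n_opt, dp[S]
```

For example, with p = 0.5, S = 3, unit-size files, and q = (0.8, 0.2), the optimum gives the popular file two replicas and the unpopular file one.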
Optimization Theory (3) • An upper bound on the performance of adaptive management algorithms for the case of erasures • File j is coded into Rj erasures • Any Mj of the Rj erasures suffice to reconstruct the file • Each erasure has size bj/Mj • Assume homogeneous up probabilities, pi = p • The r-th erasure of file j is denoted erasure jr, r = 1, …, Rj • njr is the number of copies of erasure jr stored in the community of nodes
Optimization Theory (4) • Let Yjr be a 0-1 random variable that is 1 if at least one of the njr copies of erasure jr is at some up node, so that P(Yjr = 1) = 1 − (1 − p)^njr • A request for file j is a hit if at least Mj of the Rj erasures are available, i.e., if Yj1 + … + YjRj >= Mj
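Under equal replication (njr = n for all r) and assuming the erasure availabilities Yjr are independent, the per-file hit probability is a binomial tail, which can be computed as:

```python
from math import comb

def erasure_hit_prob(p, n, R, M):
    """Hit probability for a file coded into R erasures, any M of which
    reconstruct it, when each erasure has n replicas and each node is up
    with probability p. Each erasure is available with probability
    pi = 1 - (1-p)**n; a hit needs at least M available (independence
    across erasures is an assumption of this sketch)."""
    pi = 1 - (1 - p) ** n
    return sum(comb(R, r) * pi**r * (1 - pi)**(R - r)
               for r in range(M, R + 1))
```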
Optimization Theory (5) • By Theorem 2.2 of Boland et al. [24], the hit probability P(Yj1 + … + YjRj >= Mj) is Schur concave in (nj1, …, njRj), so for a fixed number of copies it is maximized by replicating all erasures of file j equally • The problem becomes
maximize Σj qj P(Yj1 + … + YjRj >= Mj)
subject to Σj (bj/Mj)(nj1 + … + njRj) <= S
Optimization Theory (6) • Special case: no erasures (Rj = Mj = 1) • Solving the continuous relaxation gives
nj = c + log_{1/(1−p)} (qj/bj)
where c is a constant chosen so that the storage constraint holds with equality • qj/bj plays a key role in determining the number of replicas • This is an upper bound on the true optimum because it optimizes over continuous rather than integer variables
Conclusion • Structured/DHT designs can potentially improve search and download performance • Proposed the Top-K MFR algorithm: simple, fully distributed, adaptive, and near-optimal • Introduced an optimization methodology for benchmarking the performance of adaptive algorithms • The methodology can also be applied to designs that use erasures