P2P systems: epidemic scheduling, content placement and user profiling. Laurent Massoulié Thomson, Paris Research Lab. Outline. Epidemic schemes for live streaming Rate-optimality Delay-optimality Content placement Optimisation framework Adaptive replication User profiling
P2P systems: epidemic scheduling, content placement and user profiling • Laurent Massoulié • Thomson, Paris Research Lab
Outline • Epidemic schemes for live streaming • Rate-optimality • Delay-optimality • Content placement • Optimisation framework • Adaptive replication and 3/4-competitiveness • User profiling • Spectral clustering • Linear programming
Context • P2P systems for live streaming on the Internet • PPLive, CoolStreaming, Sopcast, TVants, TVUPlay, Joost…
Network constraints • Graph connecting nodes • Capacities assigned to edges • Achievable broadcast rate [Edmonds, 73]: • Equals maximal number of edge-disjoint spanning trees that can be packed in graph • Coincides with minimum over receivers of max-flow ( = min-cut) between source and receiver
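The min-cut characterisation above can be checked numerically on a small graph. The following is a minimal sketch, not the speaker's code: it assumes a directed graph given as a dict of dicts of integer capacities, and the names `max_flow`, `broadcast_rate` and the toy graph are hypothetical. It computes the max-flow to each receiver with a plain Edmonds-Karp search and takes the minimum.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow; capacity is {u: {v: c}} for directed edges u -> v."""
    # Build residual capacities, with reverse edges initialised to 0.
    residual = {source: {}, sink: {}}
    for u, nbrs in capacity.items():
        residual.setdefault(u, {})
        for v, c in nbrs.items():
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

def broadcast_rate(capacity, source, receivers):
    """Minimum over receivers of the max-flow from the source."""
    return min(max_flow(capacity, source, r) for r in receivers)

# Toy graph (illustrative values only).
toy = {"s": {"a": 2, "b": 1}, "a": {"b": 1}, "b": {}}
rate = broadcast_rate(toy, "s", ["a", "b"])
```

On this toy graph both receivers have max-flow 2 from the source, so the broadcast rate is 2.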
Random Useful chunk selection and Edmonds’ theorem [LM, A. Twigg, C. Gkantsidis & P. Rodriguez] • [Figure: packet sets held at the nodes of the graph, e.g. 1 2 4 5 7 8] • Based on local information • No explicit construction of spanning trees • When injection rate at source is strictly feasible, Markov process is ergodic • Chunks successfully broadcast with bounded delay
Network with access (node) constraints • Scarce resource: access capacity • Complete communication graph: everyone can send to anyone • Bound on maximum streaming rate λ: • Let c_i = uplink bandwidth of node i • Necessary condition for feasibility: …
Deprived Peer / Random Useful Chunk [LM, A. Twigg, C. Gkantsidis & P. Rodriguez] • [Figure: sender’s packets 1 2 4 5 7 8 and the packet sets of potential receivers 1 and 2] • Source policy: sends “fresh” packets if any (fresh = not yet sent to anyone)
Deprived Peer / Random Useful Chunk [LM, A. Twigg, C. Gkantsidis & P. Rodriguez] • [Figure: sender’s packets 1 2 4 5 7 8 and the packet sets of potential receivers 1 and 2] • Neighborhood management: periodically add a random neighbor and suppress the least deprived neighbor • Fixed neighborhood sizes
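One sender step of the deprived peer / random useful chunk rule can be sketched as follows. This is a hedged illustration: it assumes that deprivation is measured as the number of the sender's packets a neighbor lacks, that ties go to the first neighbor encountered, and the function name and the packet sets are hypothetical rather than read off the figure.

```python
import random

def deprived_peer_random_useful(sender_packets, neighbours, rng=random):
    """Pick the most deprived neighbour, then a random packet useful to it.

    neighbours maps a neighbour name to the set of packets it holds.
    Returns (neighbour, packet), or None if no neighbour lacks anything.
    """
    best, best_useful = None, set()
    for name, held in neighbours.items():
        useful = sender_packets - held  # packets the sender has and the neighbour lacks
        if len(useful) > len(best_useful):  # assumption: deprivation = count of useful packets
            best, best_useful = name, useful
    if best is None:
        return None
    return best, rng.choice(sorted(best_useful))

# Illustrative values only.
sender = {1, 2, 4, 5, 7, 8}
neighbours = {"receiver1": {1, 5, 7, 8}, "receiver2": {1, 4}}
choice = deprived_peer_random_useful(sender, neighbours)
```

Here receiver2 lacks four of the sender's packets against two for receiver1, so it is selected and receives one of its missing packets at random.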
Main result • When λ < λ*, the Markov process is ergodic. • Hence all packets are received at all nodes after a delay that is bounded in probability
Multiple commodities • Several sources s, • Dedicated receiver sets V(s) • Can overlap • Sources are not receivers • Nodes cannot relay commodities they don’t consume …
Multiple commodities • Necessary conditions for feasibility: • Bundled most deprived / random useful: do not distinguish between commodities when • measuring deprivation • choosing random useful packet • System is ergodic when conditions hold with strict inequality
Symmetric Networks (c_1 = c_2 = ... = c_N = 1 chunk/sec) • Previous lower bound reads log2(N) • Achievable [J. Mundinger & R. Weber] • [Figure: tree rooted at the source, nodes labelled t+1, t, t-1, t-2, t-3] • Makes use of log2(N) trees; not robust against churn
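The log2(N) figure follows from a doubling argument: with unit upload capacity, each node holding a chunk can hand over at most one copy per time unit, so the number of holders at most doubles at each step. A minimal sketch, assuming N counts every node that must hold the chunk (source included) and using a hypothetical function name:

```python
def min_broadcast_steps(n_nodes):
    """Smallest number of unit-time steps for a chunk to reach n_nodes holders."""
    holders, steps = 1, 0  # initially only the source holds the chunk
    while holders < n_nodes:
        holders *= 2  # every current holder uploads one copy
        steps += 1
    return steps

# Sizes matching the tree examples N = 4, 8, 16, 32.
steps_by_size = {n: min_broadcast_steps(n) for n in (4, 8, 16, 32)}
```

For these four sizes the step counts are 2, 3, 4 and 5, that is log2(N).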
A look at the corresponding trees • [Figure panels: N=4, N=8, N=16, N=32]
Random target / latest useful packet • [Figure: sender’s packets 1 2 4 5 7 8, the receiver’s packets, and the latest useful packet]
Random target / latest useful packet [T. Bonald, LM, F. Mathieu, D. Perino & A. Twigg] • For arbitrary injection rate λ<1 and constant x>0, each peer receives a fraction 1-1/x of packets in time log2(N)+O(x) • I.e.: diffusion at rates arbitrarily close to optimal is feasible under optimal delay (plus a constant)
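One sender step of the random target / latest useful packet rule might look as follows. This is a sketch under stated assumptions: packets are numbered in order of creation, so "latest" means the highest number, the target is drawn uniformly among the other peers, and the function name and packet sets are illustrative.

```python
import random

def random_target_latest_useful(sender, packets, rng=random):
    """packets maps each peer to the set of packet numbers it holds.

    Returns (target, packet); packet is None when the sender holds
    nothing that the chosen target lacks.
    """
    candidates = [peer for peer in packets if peer != sender]
    target = rng.choice(candidates)  # target chosen without looking at its state
    useful = packets[sender] - packets[target]
    if not useful:
        return target, None
    return target, max(useful)  # assumption: highest number = latest packet

# Illustrative values only.
state = {"sender": {1, 2, 4, 5, 7, 8}, "receiver": {1, 2, 3, 8}}
step = random_target_latest_useful("sender", state)
```

With a single possible target, the useful packets are 4, 5 and 7, and the rule sends packet 7.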
Open questions • Delay optimality in heterogeneous environments • Cost optimality • Convergence time scale
Outline • Epidemic schemes for live streaming • Rate-optimality • Delay-optimality • Content placement • Optimisation framework • Adaptive replication • User profiling • Spectral clustering • Linear programming
Problem statement • N users • Storage capacity: m objects • Service capacity: B requests • Local accesses are free • Request rate: ν_f for object f • Request duration: 1 • Aim: minimize number of lost requests
Optimal placement structure • Let M_f = number of replicas of object f • Schedulable region: request rates x_f verifying • Effective arrival rates: times K if objects can be split into K sub-objects of size 1/K
Hot/Warm/Cold partition • Sort objects according to popularity: ν_1 ≥ ν_2 ≥ … • Replicate everywhere (M_f = N) the most popular objects 1,…,f(1) • Partial replication of objects f(1)+1,…,f(2): • No replication of objects f > f(2) • f(1) and f(2): such that “warm objects” generate requests at rate BN, and all memory is used
Adaptive replication • Replication policy: • Create new replica for object f after each dropped request • Remove object chosen at random • Ignoring object-specific capacity constraints, caricature dynamics: Equilibrium:
Adaptive replication (ctd) • Compare to full replication of only top popular objects, i.e. • Then reductions to offered rates verify “Value of foresight” is less than 25%...
Outline • Epidemic schemes for live streaming • Rate-optimality • Delay-optimality • Content placement • Optimisation framework • Adaptive replication • User profiling • Spectral clustering • Linear programming
User profiling • Aim: predict tastes of users • Applications: • Further optimization of placement • Recommender Systems
Netflix dataset • 17,770 movies, rated by 480,000 users
The planted partition model • Users partitioned into clusters k=1,…,K • Each pair of users (i,j): conflict level C(i,j) in [0,1] (e.g., fraction of movies rated differently) • Statistical assumptions: • C(i,j) independent over i<j • E(C(i,j)) = b_kl D/N if users i, j belong to clusters k, l
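The example conflict level given above, the fraction of movies rated differently, can be computed as in this minimal sketch. It assumes ratings are stored as dicts from movie to rating, that the fraction is taken over movies both users rated, and that pairs with no movie in common get conflict level 0; the function name and the sample ratings are hypothetical.

```python
def conflict_level(ratings_i, ratings_j):
    """Fraction of commonly rated movies on which two users disagree, in [0, 1]."""
    common = ratings_i.keys() & ratings_j.keys()
    if not common:
        return 0.0  # assumption: no overlap counts as no conflict
    disagreements = sum(ratings_i[m] != ratings_j[m] for m in common)
    return disagreements / len(common)

# Illustrative ratings only.
user_a = {"m1": 5, "m2": 3, "m3": 1}
user_b = {"m1": 5, "m2": 4, "m4": 2}
level = conflict_level(user_a, user_b)
```

The two users share movies m1 and m2 and disagree on one of them, giving a conflict level of 0.5.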
A spectral algorithm • Step 1: find suitable “de-noised” descriptors of users • Form normalized eigenvectors x(1),…,x(K) associated with the K largest (in absolute value) eigenvalues of the conflict matrix • To each user i, assign vector z_i = (x_i(1),…,x_i(K))
A spectral algorithm • Step 2: do crude clustering on descriptors • Pick a random set of A users u(1),…,u(A) • Identify the pair with closest descriptors (for L2 norm) and remove one of them, until only K users are left, say v(1),…,v(K) • Cluster the nodes according to proximity of their descriptors to the cluster exemplars v(1),…,v(K)
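Step 2 can be sketched in pure Python as below, taking the descriptors z_i from Step 1 as given. The function name, the choice to drop the second member of the closest pair, and the toy descriptors are assumptions made for illustration; the slides do not say which member of the pair is removed.

```python
import math
import random

def crude_clustering(descriptors, k, a, rng=random):
    """Prune a random sample of a users down to k exemplars, then assign
    every user to its nearest exemplar (L2 distance).

    descriptors is a list of equal-length tuples; requires k <= a <= len(descriptors).
    Returns (exemplar indices, cluster label per user).
    """
    exemplars = rng.sample(range(len(descriptors)), a)
    while len(exemplars) > k:
        closest = None  # (distance, position of the pair member to drop)
        for x in range(len(exemplars)):
            for y in range(x + 1, len(exemplars)):
                d = math.dist(descriptors[exemplars[x]], descriptors[exemplars[y]])
                if closest is None or d < closest[0]:
                    closest = (d, y)
        exemplars.pop(closest[1])  # assumption: drop the second member of the pair
    labels = [
        min(range(k), key=lambda c: math.dist(z, descriptors[exemplars[c]]))
        for z in descriptors
    ]
    return exemplars, labels

# Toy descriptors: two well-separated groups of three users.
toy_z = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
         (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
exemplars, labels = crude_clustering(toy_z, k=2, a=4)
```

Sampling 4 of the 6 toy users always includes both groups, and the closest pair is always within a group, so the pruning ends with one exemplar per group.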
Theorem • Assume that • Fixed number K of clusters, each of size of order N • Matrix (b_kl) has full rank K • D ≥ C log(N) for some constant C • Then with probability 1-o(1), the algorithm correctly partitions a fraction 1-o(1) of nodes for suitable A (1 << A << D^(1/2)) • Main tool: control of spectral structure of E-R graph adjacency matrix when average degree D ≥ C log(N) [Feige-Ofek]
Open question • Brute force Maximum Likelihood: retrieves clusters when D>>1 • Efficient procedure under this assumption?
Another algorithmic version of Netflix • Objective: for user n, find inference of all unknown ratings that maximizes number of users fully agreeing with user n • NP-hard (badly so) • Probabilistic model • Users belong to clusters k=1,…,K, with sizes a(k)N • Within a cluster, identical ratings (i.i.d., +1 or -1 w.p. ½ for each movie, F movies in total) • Each rating of each user: revealed w.p. p
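The probabilistic model above is easy to sample, which helps when experimenting with reconstruction schemes. A minimal sketch, assuming cluster sizes are passed directly as integers (standing in for a(k)N) and using hypothetical names `sample_model` and `agrees`; the second function checks whether a candidate full rating vector fully agrees with a user's revealed ratings.

```python
import random

def sample_model(cluster_sizes, n_movies, p, rng):
    """Draw one instance of the clustered rating model.

    Returns (cluster index per user, full rating vector per cluster,
    revealed ratings per user as a dict movie -> rating).
    """
    # One i.i.d. +1/-1 rating vector per cluster, shared by all its users.
    cluster_ratings = [
        [rng.choice((-1, 1)) for _ in range(n_movies)] for _ in cluster_sizes
    ]
    cluster_of_user = [k for k, size in enumerate(cluster_sizes) for _ in range(size)]
    # Each rating of each user is revealed independently with probability p.
    revealed = [
        {m: cluster_ratings[k][m] for m in range(n_movies) if rng.random() < p}
        for k in cluster_of_user
    ]
    return cluster_of_user, cluster_ratings, revealed

def agrees(full_vector, revealed_ratings):
    """True if the full rating vector matches every revealed rating."""
    return all(full_vector[m] == r for m, r in revealed_ratings.items())

# Illustrative parameters only.
clusters, ratings, revealed = sample_model([3, 2], n_movies=20, p=0.5,
                                           rng=random.Random(0))
```

By construction, every user fully agrees with the rating vector of their own cluster, which is the hidden structure the objective above tries to recover.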
Proposed algorithm (inspiration: compressive sensing; see [Decoding by linear programming, Candès & Tao]) • Consider user 1 • For suitable cost function g, determine full rating vectors X(n), compatible with known ratings (i.e. P_n X(n) = Y(n)), that minimize • A proxy to (intractable) minimization of
Conditions for optimality • Assume optimum of (II): “clustered” reconstruction X**(n) such that X**(n)=X**(1) for all indices n ∈ A • Then optimum of (I) such that X*(n)=X*(1), n ∈ A, provided:
Application to probabilistic model • Necessary condition for hidden cluster to be optimal: • Sufficient condition for LP algorithm to retrieve hidden cluster, under choice g = |·|: • Differ by factor at most K-1
Outlook • Clustering • Robustness of proposed schemes to statistical modeling assumptions • Efficient (distributed?) implementations