‘Small-World File-Sharing Communities’ Iamnitchi, A., Ripeanu, M., Foster, I. İsmail GÜNES 2003700287
OVERVIEW • Introduction • Intuition • The Data-Sharing Graph • Three Data-Sharing Communities • Small-World Data-Sharing Graph • Human Nature or Zipf’s Law • Small-World Data-Sharing Graph: Significance for Mechanism Design • Conclusion
Introduction • Optimizing performance trade-offs requires understanding user behavior • Analyze user behavior in 3 file-sharing communities to design efficient mechanisms • Propose a new structure (the data-sharing graph) and justify its uses
Intuition • Understanding the system may help design efficient solutions: • The relationship between file popularity and cache size, • Search guided first to the nodes with high degree. • The study of networks started with Euler’s solution and gained momentum with the Internet. • Recurring patterns in real networks: • Power-law distributions, • Small worlds
The Data-Sharing Graph • Captures the virtual relationship between users who request the same data. • Definition: a graph in which nodes are users and an edge connects 2 users with similar interests in data. • Analyze the graphs of 3 file-sharing communities, • Discover that these graphs are small worlds, • Use the data-sharing graph to identify new structures in real networks.
Three Data-Sharing Communities • Three communities: • 1) A high-energy physics collaboration, • 2) The Web, • 3) Kazaa, a peer-to-peer file-sharing system. • Description of each community and its traces, • The file popularity and user activity distributions of each trace have a high impact: • A user with high activity becomes a highly connected node, • Highly popular files produce dense clusters.
Three Data-Sharing Communities • The D0 Experiment: a High-Energy Physics Collaboration; • A virtual organization comprising hundreds of physicists from more than 70 institutions in 18 countries. • The purpose is to share physics results worldwide. • The analyzed logs cover 6 months of 2002: about 23,000 jobs submitted by more than 300 users, involving more than 2,500,000 requests for about 200,000 distinct files.
The D0 Experiment(Cont’d) • The distribution of the number of files per job and file popularity.
The D0 Experiment(Cont’d) • Daily activity: • The number of requests per day • User activity: • The number of requests submitted by each user during the 6-month interval • In D0, file popularity doesn’t fit the Zipf’s law typical of web requests.
The Web • A five-day record from May 1999 of all HTTP requests from a large organization (Boeing) to the Web. • A user is taken to be an IP address. • 60,826 users sent 16.5 million web requests, of which 4.7 million were distinct.
The Kazaa Peer-to-Peer Network • A popular peer-to-peer file-sharing system with more than 4 million concurrent users. • Kazaa nodes dynamically elect ‘supernodes’ • Regular nodes connect to supernodes and act as querying clients to them • Control information is encrypted
The Kazaa Peer-to-Peer Network • Only information about the files requested for download can be gathered; information about the files searched for cannot. • The trace covers five days of Kazaa traffic, during which 14,404 users downloaded 976,184 files, of which 116,509 were distinct.
SMALL-WORLD DATA-SHARING GRAPH • Users are nodes in the graph and 2 users are connected if they have similar interests in data • Similarity criterion: the size of the intersection of their request sets compared to some threshold • The similarity criterion has two degrees of freedom: • The length of the time interval • The threshold on the number of common requests
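The construction described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `(user, item, time)` record layout, the function name `data_sharing_graph`, and the sample trace are all assumptions. The two parameters `t_start`/`t_end` and `threshold` stand for the two degrees of freedom, and each edge carries the number of distinct items the two users both requested.

```python
from collections import defaultdict
from itertools import combinations

def data_sharing_graph(trace, t_start, t_end, threshold):
    """Return {(u, v): weight} for user pairs sharing >= threshold items.

    trace: iterable of (user, item, time) records (hypothetical layout).
    """
    # Collect the set of distinct items each user requested in the interval.
    requests = defaultdict(set)
    for user, item, time in trace:
        if t_start <= time < t_end:
            requests[user].add(item)

    # Invert to item -> users so only pairs that share an item are examined.
    users_of = defaultdict(set)
    for user, items in requests.items():
        for item in items:
            users_of[item].add(user)

    # Edge weight = number of distinct items both users requested.
    weights = defaultdict(int)
    for users in users_of.values():
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1

    # Keep only the pairs that meet the similarity threshold.
    return {pair: w for pair, w in weights.items() if w >= threshold}

# Made-up trace, for illustration only.
trace = [
    ("alice", "f1", 0), ("alice", "f2", 1), ("bob", "f1", 2),
    ("bob", "f2", 3), ("carol", "f2", 4), ("carol", "f3", 50),
]
strict = data_sharing_graph(trace, t_start=0, t_end=10, threshold=2)
loose = data_sharing_graph(trace, t_start=0, t_end=10, threshold=1)
```

With the stricter threshold only the alice-bob pair survives; lowering it to 1 also links carol to both of them, which shows how the threshold changes the density of the graph.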
Distribution of Weights • Think of data-sharing graphs as weighted graphs • 2 users are connected by an edge labeled with the number of shared requests during a period. • The distribution of weights highlights differences among the sharing communities;
Degree Distribution • The Kazaa data-sharing graph is the closest to a power-law, while D0 graphs clearly are not power-law.
Small-World Characteristics • Watts-Strogatz definition: A graph G(V,E) is a small world if it has a small average path length and a large clustering coefficient, much larger than that of a random graph with the same number of nodes and edges. • The Clustering Coefficient: A measure of how well connected a node’s neighbors are with each other. • CCu, CC1, CC2 [formulas not preserved]
Small-World Characteristics(cont’d) • Clustering coefficient of a random graph [formula not preserved] • Average Path Length: the average of all distances. • For large graphs, measuring all-pair distances is computationally expensive. • An approximation is used (5%). • lr (average path length) [formula not preserved]
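As an illustration of the clustering coefficient, the sketch below computes the standard Watts-Strogatz per-node value, its average over all nodes, and the triangle-based variant (3 × triangles / connected triples). Reading these as the slide's CCu, CC1 and CC2 is an assumption, since the slide's own expressions are not available here. The adjacency-dict layout and function names are hypothetical, and nodes with fewer than two neighbors are counted as 0 by convention.

```python
from itertools import combinations

def local_clustering(adj, u):
    """Fraction of pairs of u's neighbors that are themselves connected."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention chosen here for degree-0 and degree-1 nodes
    links = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
    return links / (k * (k - 1) / 2)

def average_clustering(adj):
    """Mean of the per-node clustering coefficients."""
    return sum(local_clustering(adj, u) for u in adj) / len(adj)

def transitivity(adj):
    """3 * triangles / connected triples, counted per center node."""
    closed = 0   # triples whose two end points are also linked
    triples = 0  # all paths of length 2, grouped by their center node
    for u in adj:
        k = len(adj[u])
        triples += k * (k - 1) // 2
        closed += sum(1 for v, w in combinations(adj[u], 2) if w in adj[v])
    return closed / triples if triples else 0.0

# A triangle a-b-c with one extra node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cc_avg = average_clustering(adj)   # (1 + 1 + 1/3 + 0) / 4
cc_tri = transitivity(adj)         # 3 closed triples out of 5
```

The two graph-level values differ (about 0.58 versus 0.6 here) because the average gives every node equal weight, while the triangle-based form gives more weight to high-degree nodes.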
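One way to avoid the all-pairs computation is to run breadth-first search from a sample of source nodes only. The sketch below does that and also computes the textbook random-graph baselines (clustering equal to edge density, path length about ln|V| / ln(mean degree)). The sampling scheme is an assumed stand-in, not necessarily the approximation the authors used, and the baseline formulas are the standard estimates rather than a transcription of the slide.

```python
import math
import random
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sampled_avg_path_length(adj, n_sources, seed=0):
    """Average distance from n_sources random sources to reachable nodes."""
    rng = random.Random(seed)
    sources = rng.sample(sorted(adj), min(n_sources, len(adj)))
    total = pairs = 0
    for s in sources:
        for node, d in bfs_distances(adj, s).items():
            if node != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

def random_graph_baselines(n_nodes, n_edges):
    """Standard estimates for a random graph of the same size.

    Assumes a mean degree greater than 1, otherwise the log ratio is undefined.
    """
    cc_r = 2 * n_edges / (n_nodes * (n_nodes - 1))   # edge density
    mean_degree = 2 * n_edges / n_nodes
    l_r = math.log(n_nodes) / math.log(mean_degree)
    return cc_r, l_r

# Path graph a-b-c-d; sampling all 4 sources gives the exact average.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
apl = sampled_avg_path_length(path, n_sources=4)
cc_r, l_r = random_graph_baselines(n_nodes=100, n_edges=200)
```

A graph is then judged against these baselines: a clustering coefficient far above `cc_r` together with a path length close to `l_r` is the small-world signature described above.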
Small-World Characteristics(cont’d) • The data-sharing graphs for the three systems display small-world properties(large clustering coefficient, small average path length)
Small-World Characteristics(cont’d) • The data-sharing graphs with different durations and similarity criteria are all small worlds • Well-connected clusters • Short paths between any 2 nodes
HUMAN NATURE OR ZIPF’S LAW? • Question: Are the small-world properties consequences of previously documented patterns, or do they reflect a new observation concerning users’ preferences in data? • 2 directions to answer the causality question: • Examine the data-sharing graph definition and ask whether the large clustering coefficient is a result of that definition (affiliation networks) • Analyze the effect of well-known patterns in file access, such as time locality and the file popularity distribution (influences of Zipf’s law and time and space locality)
Affiliation(Preference) Networks • A social network in which the actors are linked by common membership in groups or clubs of some kind. • Collaboration networks, movie actors etc. • Bipartite graphs; • 2 types of vertices, for actors and groups • Edges link nodes of different types only • Unipartite projection; • Undirected edges that connect actors in the same group
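The unipartite projection can be sketched as follows. The `{group: members}` layout and the function name are assumptions for illustration. Mapping files to groups and users to actors, as in the sample data, is one way to read a data-sharing graph with threshold 1 as such a projection.

```python
from collections import defaultdict
from itertools import combinations

def project_onto_actors(memberships):
    """Link every pair of actors that share at least one group.

    memberships: {group: iterable of actors} (hypothetical layout).
    """
    adj = defaultdict(set)
    for members in memberships.values():
        members = set(members)
        for actor in members:
            adj[actor]  # touch the key so single-member groups still appear
        # Each group becomes a complete subgraph among its members.
        for a, b in combinations(members, 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)

# Made-up memberships: files play the role of groups, users of actors.
memberships = {"f1": ["alice", "bob", "carol"], "f2": ["carol", "dave"]}
projection = project_onto_actors(memberships)
```

Because every group turns into a clique, a single popular file requested by many users already creates a dense cluster in the projected graph.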
Affiliation Networks(cont’d) • Characteristics of projections of bipartite graphs: • 1. Larger clustering coefficient than random graphs: members of a group form a complete subgraph in the one-mode projection • 2. Degree distribution far from the Poisson distribution of a random graph: there are 2 degree distributions (of actors and of groups)
Affiliation Networks(cont’d) • Consider a bipartite affiliation graph of N actors and M groups • Pj : The probability that an actor is part of exactly j groups • Pk : The probability that a group consists of exactly k members • 3 functions are defined to compute the average node degree and clustering coefficient of the unipartite affiliation network: $f_0(x) = \sum_j P_j x^j$, $g_0(x) = \sum_k P_k x^k$, $G_0(x) = f_0\left(g_0'(x)/g_0'(1)\right)$ • AvgDegree $= G_0'(1)$ • Clustering Coef. $C = \frac{M}{N}\,\frac{g_0'''(1)}{G_0''(1)}$
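To make the model concrete, the sketch below evaluates the average degree G0'(1) and the clustering coefficient from the two membership distributions, using the standard generating-function results for projected bipartite random graphs. The derivatives at x = 1 reduce to factorial moments, and M/N is obtained from the ratio of mean memberships per actor to mean group size. The dict-based input format and function names are assumptions.

```python
def factorial_moment(dist, order):
    """Sum over k of k(k-1)...(k-order+1) * dist[k].

    Equals the order-th derivative at x = 1 of the generating function
    sum_k dist[k] * x**k.
    """
    total = 0.0
    for k, prob in dist.items():
        term = prob
        for i in range(order):
            term *= (k - i)
        total += term
    return total

def projection_stats(p_actor, p_group):
    """Model average degree and clustering of the unipartite projection.

    p_actor[j]: probability an actor belongs to exactly j groups.
    p_group[k]: probability a group has exactly k members.
    """
    mu = factorial_moment(p_actor, 1)   # f0'(1), mean groups per actor
    f2 = factorial_moment(p_actor, 2)   # f0''(1)
    nu = factorial_moment(p_group, 1)   # g0'(1), mean members per group
    g2 = factorial_moment(p_group, 2)   # g0''(1)
    g3 = factorial_moment(p_group, 3)   # g0'''(1)

    avg_degree = mu * g2 / nu                    # G0'(1) by the chain rule
    G2 = f2 * (g2 / nu) ** 2 + mu * g3 / nu      # G0''(1)
    m_over_n = mu / nu                           # since N * mu = M * nu
    clustering = m_over_n * g3 / G2 if G2 else 0.0
    return avg_degree, clustering

# Sanity check: every actor is in exactly 1 group, every group has 3 members,
# so the projection is a set of disjoint triangles.
avg_degree, clustering = projection_stats({1: 1.0}, {3: 1.0})
```

In the disjoint-triangles case the model returns degree 2 and clustering 1, as expected; measured values from a real trace can then be compared against the model in the same way.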
Affiliation Networks(cont’d) • The table confirms our intuition: • There is a difference between the measured and modeled parameter values • The table shows 2 observations: • The actual clustering coefficient is always larger than the theoretical one, • The average degree is always smaller than the theoretical one • We can compare the 3 communities by comparing their distance from the theoretical model
Influences of Zipf’s Law and Time and Space Locality • Event frequency follows a Zipf distribution in many systems • Time locality: users are not uniformly active during a period, but follow some patterns (download more on weekends, holidays etc.) • The question is: “Are the patterns we identified in the data-sharing graph, especially the large clustering coefficient, an inherent consequence of these well-known behaviors?” • To answer, generate random traces that keep the documented characteristics but break the user-request association • Build the data-sharing graphs from these synthetic traces, then analyze and compare their properties.
Synthetic Traces • Each trace record contains the user ID, the item requested and the request time. • Relationships in a trace: • (1) User-Time • (2) Request-Time • (3) User-Request • (4) User • (5) Time • (6) Request • The aim is: • To break relationship (3), which requires breaking (1), (2), or both • To preserve relationships (4), (5) and (6)
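One simple way to obtain such a trace is to permute the item column across records while leaving users and times in place. This is an assumed scheme shown for illustration, not necessarily the procedure the authors followed. It keeps the number of requests per user, the request times and the item popularity, while breaking the user-request and request-time associations. The record layout and function name are hypothetical.

```python
import random
from collections import Counter

def shuffle_requests(trace, seed=0):
    """Return a synthetic trace with items randomly reassigned to records.

    trace: list of (user, item, time) records (hypothetical layout).
    """
    rng = random.Random(seed)
    items = [item for _, item, _ in trace]
    rng.shuffle(items)  # same multiset of items, new random order
    # Users and times stay where they were; only the item column moves.
    return [(user, item, time) for (user, _, time), item in zip(trace, items)]

# Made-up trace, for illustration only.
trace = [
    ("alice", "f1", 0), ("alice", "f2", 1), ("bob", "f1", 2),
    ("bob", "f3", 3), ("carol", "f1", 4), ("carol", "f4", 5),
]
synthetic = shuffle_requests(trace, seed=1)

# Item popularity and per-user request counts are unchanged.
same_popularity = (Counter(i for _, i, _ in synthetic)
                   == Counter(i for _, i, _ in trace))
same_activity = (Counter(u for u, _, _ in synthetic)
                 == Counter(u for u, _, _ in trace))
```

Note that a user may receive the same item twice after shuffling, so the number of distinct items per user can shrink; a stricter scheme would need to reject such assignments.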
Properties of Synthetic Data-Sharing Graphs • Three characteristics of the synthetic data-sharing graphs are relevant: • 1) The number of nodes in synthetic graphs is significantly different from that in their corresponding real graphs • 2) The synthetic data-sharing graphs are always connected • 3) The synthetic data-sharing graphs are “less” small worlds than their corresponding real graphs • These imply that user preferences for files have a significant influence on the data-sharing graphs • Identifying small-world properties is not sufficient to characterize the clustering of users.
Small-World Data-Sharing Graph: Significance for Mechanism Design • The data-sharing graph can reveal the structure of an organization by identifying interest-based clusters of users; this information can then be used to optimize the organization’s infrastructure (servers, network topology etc.) • The data-sharing graph matters for mechanism design from 2 perspectives: • Its structure • Its small-world properties
Small-World Data-Sharing Graph: Significance for Mechanism Design • Relevance of the Graph Structure: • Efficient update • File replication • Job management • Relevance of the Small-World Feature: • File-location
Conclusion • A new structure, the “Data-Sharing Graph”, is proposed • It captures the relationship between users who request the same data • The properties of data-sharing graphs in 3 communities are presented • The effects of Zipf’s law and human nature on small-world characteristics are examined • The properties may be used for new peer-to-peer mechanism design