Estimating Clique Composition and Size Distributions from Sampled Network Data
Minas Gjoka, Emily Smith, Carter T. Butts
University of California, Irvine

Outline
• Problem statement
• Estimation methodology
• Results with real-life graphs

Cliques
• A complete subgraph that contains i vertices is an order-i clique (order-1, order-2, order-3, ..., order-i).
• A maximal clique is a clique that is not included in a larger clique.
• Example: an order-4 clique on vertices a, b, c, d contains 4 non-maximal order-3 cliques: {a, b, c}, {a, b, d}, {a, c, d}, {b, c, d}.

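The two definitions above can be checked on a small graph. The sketch below is a brute-force illustration only, not the counting algorithm used in the talk; the function names and the example graph (an order-4 clique on a, b, c, d) are assumptions introduced here.

```python
from itertools import combinations

def cliques_of_order(adj, i):
    """Return all order-i cliques (as frozensets) by brute-force enumeration."""
    return [frozenset(c) for c in combinations(sorted(adj), i)
            if all(b in adj[a] for a, b in combinations(c, 2))]

def is_maximal(adj, clique):
    """A clique is maximal if no outside vertex is adjacent to all of its members."""
    return not any(clique <= adj[w] for w in adj if w not in clique)

# Hypothetical example: an order-4 clique on vertices a, b, c, d.
adj = {v: set("abcd") - {v} for v in "abcd"}

triangles = cliques_of_order(adj, 3)                    # the order-3 cliques
maximal_triangles = [c for c in triangles if is_maximal(adj, c)]
four_cliques = cliques_of_order(adj, 4)                 # the single order-4 clique
```

Brute force is exponential in the graph size, so this is only suitable for toy graphs like the one on the slide.
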
Counting of Cliques
• $C_i$ is the count of order-i cliques (maximal or non-maximal).
• Clique distribution of G: $C = (C_1, C_2, C_3, C_4) = (0, 1, 2, 1)$
• Goal 1: Estimate $C_i$ (for all i) in graph G from sampled network data.
[Figure: example graph G with vertices 1-8, with its order-1 to order-4 cliques marked]

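Counting of Cliques: Vertex Attributes
• Vertex attribute vector X with p attribute values, p <= N (p = 3 in the example).
• A composition vector u records how many vertices of a clique carry each attribute value. Examples for order-3 cliques in graph G: u = [3 0 0], u = [2 1 0], u = [2 0 1].
• Clique composition distribution of G: $C_u$ is the count of cliques with composition u.
• Goal 2: Estimate $C_u$ (for all u) in graph G from sampled network data.
[Figure: example graph G with vertices 1-8, colored by attribute value]
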
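A minimal sketch of how a composition vector u and the counts $C_u$ could be computed by brute force. The graph, the attribute assignment, and the function names are hypothetical; the sketch assumes attribute values are coded 1..p.

```python
from collections import Counter
from itertools import combinations

def composition(clique, attr, p):
    """Composition vector u: u[k-1] = number of clique members with attribute value k."""
    u = [0] * p
    for v in clique:
        u[attr[v] - 1] += 1
    return tuple(u)

def composition_counts(adj, attr, p, order):
    """Count the cliques of the given order, grouped by composition vector."""
    counts = Counter()
    for c in combinations(sorted(adj), order):
        if all(b in adj[a] for a, b in combinations(c, 2)):
            counts[composition(c, attr, p)] += 1
    return counts

# Hypothetical graph: two triangles {1, 2, 3} and {3, 4, 5} sharing vertex 3.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

attr = {1: 1, 2: 1, 3: 1, 4: 2, 5: 1}   # hypothetical attribute values, p = 3
counts = composition_counts(adj, attr, p=3, order=3)
```

Here triangle {1, 2, 3} has composition (3, 0, 0) and triangle {3, 4, 5} has composition (2, 1, 0).
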
Motivation
• Counting of cliques
  • cliques describe local structure (clustering, cohesive subgroups)
  • algorithmic implications of cliques in an engineering context
  • cliques are used as input in network models
• Sampled network data
  • unknown graphs with access limitations
  • massive known graphs

Related Work
• Model-based methods
  • do not scale
  • do not help with counting
• Design-based methods
  • subgraph (or motif) counting tools that use sampling, e.g. MFinder, FANMOD, MODA
  • no support for subgraphs of size larger than 10
  • no support for vertex attributes
  • biased estimation

Methodology
• Collect an egocentric network sample $H_1, \dots, H_n$
  • Collect a probability sample of n nodes from the graph G(V, E): $V_j$, $X[V_j]$, j = 1..n
    • uniform independence sampling
    • weighted independence sampling
    • link-trace sampling
    • each with or without replacement
  • Fetch the egonet of each sampled node: $G[V_j]$, j = 1..n
• Calculate the clique count $C_i$ (or $C_u$) in each egonet $H_j$
  • can use existing exact clique counting algorithms
  • clique type is determined by the counting algorithm
• Apply an estimation method that combines the per-egonet calculations
  • Clique Degree Sums (CDS): labeling of neighbors not required, more space efficient
  • Distinct Clique Counting (CC): higher accuracy
[Figure: example graph G(V, E) with vertices 1-8; n = 2 sampled nodes, their egonets, and the order-3 clique count C3 in each egonet]

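The first two steps can be sketched as follows, assuming the whole graph is available as an adjacency dictionary and egos are drawn by uniform independence sampling with replacement. The edge list is a hypothetical 8-node graph containing the triangles {2, 5, 6}, {1, 7, 8} and {4, 5, 8}; the edge (3, 4) and all function names are assumptions made for illustration.

```python
import random
from itertools import combinations

def egonet(adj, ego):
    """Induced subgraph on the ego and its neighbors, as an adjacency dict."""
    nodes = adj[ego] | {ego}
    return {v: adj[v] & nodes for v in nodes}

def clique_degree(ego_adj, ego, i):
    """Number of order-i cliques in the egonet that contain the ego."""
    nbrs = sorted(ego_adj[ego])
    return sum(1 for c in combinations(nbrs, i - 1)
               if all(b in ego_adj[a] for a, b in combinations(c, 2)))

def sample_egonets(adj, n, rng):
    """Uniform independence sampling with replacement: n egos with their egonets."""
    nodes = sorted(adj)
    egos = [rng.choice(nodes) for _ in range(n)]
    return [(v, egonet(adj, v)) for v in egos]

# Hypothetical 8-node graph (three triangles plus one pendant edge).
edges = [(2, 5), (2, 6), (5, 6), (1, 7), (1, 8), (7, 8),
         (4, 5), (4, 8), (5, 8), (3, 4)]
adj = {v: set() for v in range(1, 9)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

sample = sample_egonets(adj, n=2, rng=random.Random(0))
per_ego_counts = [(ego, clique_degree(h, ego, 3)) for ego, h in sample]
```

Each egonet is processed independently, which is what makes the per-egonet counting step easy to run in parallel.
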
Labeling of neighbors
• Distinct Clique Counting (CC): labeled neighbors. A clique that appears in several egonets can be recognized as the same clique.
• Clique Degree Sums (CDS): unlabeled neighbors. Only the per-egonet clique counts are needed.
[Figure: graph G with vertices 1-9 and n = 2 sampled egonets $V_j$, $X[V_j]$, $G[V_j]$; the order-3 clique count C3 is calculated once with labeled neighbors and once with unlabeled neighbors]

Clique Degree Sums (unlabeled neighbors)
• Order-i clique degree: $d_{ij}$ is the number of order-i cliques that node j belongs to. Example: in egonet $H_8$ of graph G(V, E), $d_{38} = 2$.
• Order-i clique degree sum over all nodes: $D_i = \sum_{j \in V} d_{ij}$. Every order-i clique is counted once by each of its i members, so $D_i = i \, C_i$.
• Example (order 3, N = 8): $D_3 = d_{31} + d_{32} + d_{33} + d_{34} + d_{35} + d_{36} + d_{37} + d_{38} = 1 + 1 + 0 + 1 + 2 + 1 + 1 + 2 = 9$, and $D_3 = 3 C_3$.
• Over the sampled nodes S, with $\pi_j$ the inclusion probability of node j, $\hat{C}_i = \frac{1}{i} \sum_{j \in S} \frac{d_{ij}}{\pi_j}$ is a design-unbiased Horvitz-Thompson estimator of $C_i$.
• With $d_{uj}$ the number of u-cliques that node j belongs to and $|u|$ the order of a u-clique, $\hat{C}_u = \frac{1}{|u|} \sum_{j \in S} \frac{d_{uj}}{\pi_j}$ is a design-unbiased Horvitz-Thompson estimator of $C_u$.

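A minimal sketch of the Clique Degree Sums estimate, using the order-3 clique degrees from the worked example (1, 1, 0, 1, 2, 1, 1, 2). The function name is hypothetical, and the design assumed for the check is uniform sampling of n = 3 nodes without replacement, so that every node has inclusion probability n/N.

```python
from itertools import combinations

def cds_estimate(clique_degrees, inclusion_probs, order):
    """Horvitz-Thompson estimate of the clique degree sum, divided by the clique order."""
    d_hat = sum(d / pi for d, pi in zip(clique_degrees, inclusion_probs))
    return d_hat / order

d3 = [1, 1, 0, 1, 2, 1, 1, 2]      # order-3 clique degrees of nodes 1..8
N, n, order = len(d3), 3, 3

# Full census: every inclusion probability is 1, which recovers D3 / 3.
census = cds_estimate(d3, [1.0] * N, order)

# Design-unbiasedness check: average the estimate over every possible sample.
samples = list(combinations(d3, n))
mean_estimate = sum(cds_estimate(s, [n / N] * n, order) for s in samples) / len(samples)
```

Only the clique degree of each sampled ego is needed, so no neighbor labels and no per-clique storage are required.
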
Clique Degree Sums: Estimator Variance
• We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$ from the node inclusion probabilities $\pi_j$ and the joint node inclusion probabilities $\pi_{jk}$.
• Sampling designs
  • Uniform independence sampling
  • Weighted independence sampling
  • Link-trace sampling
  • each without or with replacement
• Example: uniform independence sampling without replacement, with n sampled nodes out of N nodes in total: $\pi_j = \frac{n}{N}$ and $\pi_{jk} = \frac{n(n-1)}{N(N-1)}$ for $j \neq k$.

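A sketch of a variance estimate for the clique degree sum under uniform independence sampling without replacement. It assumes the textbook Horvitz-Thompson variance estimator, which may differ in form from the expression shown in the talk; the function name is hypothetical, and the sample must contain at least two nodes.

```python
from itertools import combinations

def ht_variance_uis_wor(values, N):
    """HT variance estimate of the HT total, for n = len(values) nodes drawn
    uniformly without replacement from N nodes (requires n >= 2)."""
    n = len(values)
    pi = n / N                                # node inclusion probability
    pi2 = n * (n - 1) / (N * (N - 1))         # joint inclusion probability
    var = sum((1 - pi) / pi ** 2 * y * y for y in values)
    var += sum(2 * (pi2 - pi * pi) / pi2 * a * b / (pi * pi)
               for a, b in combinations(values, 2))
    return var

d3 = [1, 1, 0, 1, 2, 1, 1, 2]                 # order-3 clique degrees, N = 8
N, n, order = len(d3), 3, 3

samples = list(combinations(d3, n))
totals = [sum(s) * N / n for s in samples]    # HT estimates of D3
true_var = sum((t - sum(d3)) ** 2 for t in totals) / len(samples)
mean_var_hat = sum(ht_variance_uis_wor(s, N) for s in samples) / len(samples)

# The variance of the clique count estimate follows by scaling with 1 / order**2.
var_c_hat = ht_variance_uis_wor(samples[0], N) / order ** 2
```

Averaged over all possible samples, the variance estimate matches the true variance of the estimated total, which is the sense in which it is unbiased.
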
Distinct Clique Counting (labeled neighbors)
• $c_i$ is the number of distinct i-cliques in $H_1, \dots, H_n$, and $\pi_c$ is the inclusion probability of i-clique c.
• $\hat{C}_i = \sum_{c=1}^{c_i} \frac{1}{\pi_c}$ is a design-unbiased Horvitz-Thompson estimator of $C_i$.
• Sampling designs
  • Uniform independence sampling
  • Weighted independence sampling
  • Link-trace sampling
  • each with or without replacement
• Example design used next: uniform independence sampling with replacement.

Distinct Clique Counting (labeled neighbors): example
• Graph G with N = 8 nodes and three order-3 cliques (a, b, c); n = 4, UIS with replacement.
• Observed order-3 cliques: {6, 5, 2}, {6, 5, 2}, {8, 1, 7}, {8, 1, 7}
• Distinct order-3 cliques: {6, 5, 2}, {8, 1, 7}
[Figure: graph G with vertices 1-8 and the sampled egonets]

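A sketch of the Distinct Clique Counting estimate for this example. It assumes that under uniform independence sampling with replacement an order-i clique is observed whenever at least one of its i members is drawn as an ego, which gives an inclusion probability of 1 - (1 - i/N)^n; the function names are hypothetical.

```python
def clique_inclusion_prob_uis_wr(order, N, n):
    """Probability that at least one of the clique's members is among n uniform draws."""
    return 1.0 - (1.0 - order / N) ** n

def cc_estimate(observed_cliques, order, N, n):
    """Sum 1 / inclusion probability over the distinct observed cliques of one order."""
    distinct = {frozenset(c) for c in observed_cliques if len(set(c)) == order}
    pi = clique_inclusion_prob_uis_wr(order, N, n)
    return len(distinct) / pi

# Observed order-3 cliques from the example: two distinct cliques, each seen twice.
observed = [(6, 5, 2), (6, 5, 2), (8, 1, 7), (8, 1, 7)]
estimate = cc_estimate(observed, order=3, N=8, n=4)   # 2 / (1 - (5/8)**4)
```

Duplicates can only be removed because neighbors are labeled, and the set of distinct cliques is what drives the $O(c_i)$ space cost of this method.
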
Computational complexity
• Space complexity to count $C_i$ or $C_u$
  • $O(1)$ for the Clique Degree Sums method
  • $O(c_i)$ or $O(c_u)$ for the Distinct Clique Counting method, where $c_i$ and $c_u$ are the numbers of distinct observed cliques
• Time complexity
  • from $O(3^{N/3})$ to $O(n \cdot 3^{D/3})$, where N is the graph size, D is the maximum degree, and n is the sample size
  • from $O(n \cdot 3^{D/3})$ to $O(3^{D/3})$ via parallel computations per egonet

Benefits of our methodology
• Full knowledge of the graph is not required
• Fast estimation for massive known graphs
• Estimation or exact computation is easily parallelizable for massive known graphs
• Estimation with or without neighbor labels
• Supports vertex attributes
• Supports a variety of sampling designs

Simulation Results: Facebook New Orleans
• Egonet sample size n = 1,000
• Uniform independence sampling, without replacement
• 1000 simulations
• Error metric: Normalized Mean Absolute Error (NMAE)
[Figures: simulation results for Clique Degree Sums and Distinct Clique Counting]

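A minimal sketch of the error metric, assuming the common definition of NMAE as the mean absolute deviation of the estimates from the true count, divided by the true count. The function name and the numbers in the example are hypothetical.

```python
def nmae(estimates, true_value):
    """Normalized mean absolute error of repeated estimates of one true count."""
    mean_abs_error = sum(abs(e - true_value) for e in estimates) / len(estimates)
    return mean_abs_error / true_value

# Hypothetical example: three simulated estimates of a true count of 3.
error = nmae([2.0, 4.0, 3.0], true_value=3.0)   # (1 + 1 + 0) / 3 / 3
```

In a simulation study the list of estimates would hold one value per simulation run for a given clique order or composition.
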
Which estimation method to use? Heuristic
• Average Edge Count = (all edges between egos and neighbors) / (unique edges between egos and neighbors)
• Example (graph G, N = 8, n = 3): Average Edge Count = 9 / 6 = 1.5
[Figure: graph G with vertices 1-8 and the three sampled egonets]

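The ratio can be computed directly from the sampled egos. The sketch below uses a hypothetical 6-node graph, not the graph on the slide, chosen so that it reproduces the same 9 / 6 ratio; the function name is an assumption.

```python
def average_edge_count(adj, egos):
    """All ego-neighbor edges (with repeats) divided by the unique ego-neighbor edges."""
    all_edges = [frozenset((ego, v)) for ego in egos for v in adj[ego]]
    return len(all_edges) / len(set(all_edges))

# Hypothetical graph: triangle 1-2-3, with one extra neighbor attached to each corner.
adj = {1: {2, 3, 4}, 2: {1, 3, 5}, 3: {1, 2, 6}, 4: {1}, 5: {2}, 6: {3}}

ratio = average_edge_count(adj, egos=[1, 2, 3])   # 9 edges observed, 6 unique
single = average_edge_count(adj, egos=[4])        # one ego, no overlap
```

A ratio of 1 means the sampled egonets do not overlap on ego-neighbor edges; larger values indicate more overlap between egonets.
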
Estimation Results: Facebook '09
• Facebook '09 crawled dataset [1]
• 36,628 unique egonets
[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, "Walking in Facebook: A Case Study of Unbiased Sampling of OSNs", IEEE INFOCOM 2010.

Estimation Results: vertex attributes, Facebook '09
• Complemented dataset with gender attributes
• about 6 million users

• Unbiased estimation methods of clique distributions
  • Clique Degree Sums
  • Distinct Clique Counting
• Facebook cliques
• Future work
  • support estimation of any subgraphs (beyond cliques)

References
• [1] M. Gjoka, E. Smith, C. T. Butts, "Estimating Clique Composition and Size Distributions from Sampled Network Data", IEEE NetSciCom '14.
• [2] Facebook datasets: http://odysseas.calit2.uci.edu/research/osn.html
• [3] Python code for Clique Estimators: http://tinyurl.com/clique-estimators

Thank you!