Opportunities and Challenges in Uncertainty Quantification in Complex Interacting Systems. Exponential random graph models for social networks and their role in uncertainty quantification. Pip Pattison University of Melbourne. University of Southern California April 13-14, 2009.
Joint work with
University of Melbourne: Garry Robins, Peng Wang, Galina Daraganova, Johan Koskinen, Dean Lusher
Oxford University/Groningen University: Tom Snijders
Building on work with: Mark Handcock (University of Washington), Dave Hunter (Penn State University), Martina Morris (University of Washington), Steve Goodreau (University of Washington)
and earlier work by: Frank, Strauss, Wasserman …
Main argument
• Many social processes (e.g. diffusion of information, diseases) depend on social interactions and social relationships, and different types of networks enable and constrain social processes in distinctive ways
• We therefore need to understand/model both the structure and dynamics of different types of networks and social interaction, and the nature of network-mediated social processes
• Appropriate quantification of uncertainty requires good models and hence advances in:
  • Measurement and design
  • Understanding of relevant social processes
  • Careful model development and empirical testing
• The exponential random graph modelling approach provides a coherent framework for model development and uncertainty quantification
Outline 1. Exponential random graph models for social networks 2. Models for large networks from partial network data 3. Where next?
Models for social networks: the problem
To develop a statistical framework for modelling networks that appropriately represents what we know/propose about network tie formation processes:
• Exogenous effects: shared characteristics, interests and affiliations, and spatial propinquity, all matter
• Endogenous network effects: e.g., clustering, comparison and attachment processes
To do so in a way that affords a consistent approach to the analysis of various forms of longitudinal and cross-sectional social relational data
Network variables
We assume a fixed set of actors.
We consider a system of tie variables $Y = [Y_{ij}]$, where $Y_{ij} = 1$ if $i$ has a tie to $j$ and $Y_{ij} = 0$ otherwise. A realisation of $Y$ is denoted by $y = [y_{ij}]$.
Note that:
• We consider the system of all tie variables at once
• The variables are associated with relational ties between actors
• We do not assume that the variables are independent
Local interactivity and dependence
Local interactions: define two network tie variables to be neighbours if they are conditionally dependent, given the values of all other tie variables.
A neighbourhood is a set of mutually neighbouring variables and corresponds to a potential network configuration: e.g. $\{Y_{12}, Y_{13}, Y_{23}\}$ corresponds to the triangle on nodes 1, 2 and 3.
Dependence structure: a hypothesis about which ties are neighbours
A useful dependence structure: the social circuit model (Snijders, Pattison, Robins & Handcock, 2006)
Assume that two tie variables are neighbours if:
• they share an actor
• their presence would create a 4-cycle (in the accompanying figure, red ties are already present)
A 4-cycle is a closed structure that can sustain mutual social monitoring and influence, as well as levels of trustworthiness within which obligations and expectations might proliferate (e.g., Coleman, 1988; Bearman et al, 2005)
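Hammersley-Clifford theorem (Besag, 1974) applied to networks (Frank & Strauss, 1986)
$$P(Y = y) = \frac{1}{\kappa(\theta)} \exp\Big[ \sum_Q \theta_Q z_Q(y) \Big]$$
where:
• the summation is over all neighbourhoods $Q$
• $\theta_Q$ is the parameter for neighbourhood $Q$
• $z_Q(y) = \prod_{Y_{ij} \in Q} y_{ij}$ is the network statistic, which signifies whether all ties in $Q$ are observed in $y$
• $\kappa(\theta) = \sum_y \exp\big[ \sum_Q \theta_Q z_Q(y) \big]$ is the normalizing quantity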
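To make the role of the normalizing quantity concrete, the following is a minimal sketch that enumerates every graph on a very small node set and computes the model probabilities directly. It assumes a homogeneous model with just two statistics (edge count and triangle count); the function names and parameter values are illustrative, not taken from the talk. Enumeration is only feasible for a handful of nodes, which is why simulation-based methods are needed in practice.

```python
from itertools import combinations, product
from math import exp

def statistics(edges, n):
    """Return (edge count, triangle count) for an undirected graph on n nodes."""
    adj = {frozenset(e) for e in edges}
    triangles = sum(
        1
        for a, b, c in combinations(range(n), 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= adj
    )
    return len(adj), triangles

def ergm_distribution(n, theta_edge, theta_triangle):
    """Enumerate all graphs on n nodes; return {tuple of edges: probability}."""
    dyads = list(combinations(range(n), 2))
    weights = {}
    for pattern in product((0, 1), repeat=len(dyads)):
        edges = tuple(d for d, on in zip(dyads, pattern) if on)
        n_edges, n_tri = statistics(edges, n)
        # Unnormalized weight exp[sum of parameter * statistic]
        weights[edges] = exp(theta_edge * n_edges + theta_triangle * n_tri)
    kappa = sum(weights.values())  # normalizing quantity: sum over all graphs
    return {g: w / kappa for g, w in weights.items()}

# Illustrative (assumed) parameter values on 4 nodes: 2^6 = 64 graphs
dist = ergm_distribution(4, theta_edge=-1.0, theta_triangle=0.5)
```

The ratio of any two probabilities does not involve the normalizing quantity, so it can be checked by hand: the complete graph on 4 nodes has 6 edges and 4 triangles, giving a ratio of exp(-6 + 2) relative to the empty graph.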
This means…
We model the probability of a network in terms of propensities (parameters) for configurations of certain types to occur:
• Edges
• Stars (k-stars)
• Multiple paths (k-2-paths)
• Multiple triangles (k-triangles)
Equality and other constraints on model parameters
• Assume (in the first instance) that isomorphic configurations have equal parameters, so there is one parameter for each class of network configurations
• It may sometimes also be convenient to assume a relationship between related parameters, e.g. for the star configurations with parameters $\theta_2, \theta_3, \theta_4, \ldots$:
$$\theta_k = -\theta_{k-1}/\lambda, \quad \text{for } k > 2 \text{ and } \lambda \geq 1 \text{ a (fixed) constant}$$
We then obtain a single parameter ($\theta_2$) for the family of star configurations, with the alternating star statistic:
$$S^{[\lambda]}(y) = \sum_k (-1)^k \frac{S_k(y)}{\lambda^{k-2}}$$
• Likewise for k-2-paths and k-triangles …
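As a worked illustration of the alternating star statistic, the sketch below computes it from a degree sequence. It assumes that the number of k-stars is the sum over nodes of "degree choose k" and that the sum runs over k = 2, …, n-1; the range of the sum and the function names are assumptions made for this example.

```python
from math import comb

def star_count(degrees, k):
    """Number of k-stars: each node of degree d is the centre of C(d, k) k-stars."""
    return sum(comb(d, k) for d in degrees)

def alternating_star(degrees, lam=2.0):
    """Alternating star statistic with fixed constant lam (assumed >= 1).

    Sums (-1)^k * S_k / lam^(k-2) for k = 2, ..., n-1 (assumed range).
    """
    n = len(degrees)
    return sum(
        (-1) ** k * star_count(degrees, k) / lam ** (k - 2)
        for k in range(2, n)
    )

# A star on 4 nodes (one centre, three leaves): S_2 = 3 and S_3 = 1,
# so the statistic with lam = 2 is 3 - 1/2.
value = alternating_star([3, 1, 1, 1], lam=2.0)
```

Because the signs alternate and the weights shrink geometrically, high-order stars contribute progressively less, which is what lets a single parameter stand in for the whole family.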
Exponential random graph models for networks
We can then model the probability of a network in terms of propensities for certain families of network configurations to occur: edge, alt-star, alt-2-path, alt-triangle
The zone of order k of a set A of nodes
Define $Z_k(A)$, the zone of order $k$ of $A$ in the network $y$, to be the set of nodes within $k$ steps of some node in $A$.
[Figure: zones of order 0 to 2 for the two-node set comprising the marked vertices]
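The zone definition can be sketched as a breadth-first search that stops after k steps. The adjacency-dictionary representation and the function name are assumptions for illustration.

```python
from collections import deque

def zone(adj, A, k):
    """Return the set of nodes within k steps of some node in A.

    adj maps each node to the set of its neighbours (undirected network).
    """
    dist = {a: 0 for a in A}
    queue = deque(A)
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k steps
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# Path network 0 - 1 - 2 - 3 - 4
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
zones = [zone(path, {0, 1}, k) for k in range(3)]
```

For the two-node set {0, 1} on this path, the zones of order 0, 1 and 2 grow by one node at each step.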
Three generations of models (so far) for nondirected graphs
Tie variables $Y_{ij}$ and $Y_{kl}$ are conditionally independent unless:
• Bernoulli (Erdős-Rényi): $\{i,j\} = \{k,l\}$, i.e. $Z_0(\{i,j\}) = Z_0(\{k,l\})$ (same zone of order 0)
• Markov: $Z_0(\{i,j\}) \cap Z_0(\{k,l\}) \neq \emptyset$ (overlapping zones of order 0)
• Realisation-dependent (social circuit): $Z_1(\{i,j\}) \supseteq Z_0(\{k,l\})$ and $Z_1(\{k,l\}) \supseteq Z_0(\{i,j\})$ (zero-order zones for $\{i,j\}$ and $\{k,l\}$ jointly embedded in first-order zones for $\{k,l\}$ and $\{i,j\}$)
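Hierarchy of models
• Bernoulli: $Z_0\{i,j\} = Z_0\{k,l\}$
• Markov: $Z_0\{i,j\} \cap Z_0\{k,l\} \neq \emptyset$
• Social circuit: $Z_1\{i,j\} \supseteq Z_0\{k,l\}$ and $Z_1\{k,l\} \supseteq Z_0\{i,j\}$
• Degree/closure interaction: $Z_1\{i,j\} \supseteq Z_0\{k,l\}$ or $Z_1\{k,l\} \supseteq Z_0\{i,j\}$
• Three-path: $Z_1\{i,j\} \cap Z_0\{k,l\} \neq \emptyset$ and $Z_1\{k,l\} \cap Z_0\{i,j\} \neq \emptyset$
[Figure: configurations for each model, labelled by numbers of nodes m and h, including the (m,h)-coat-hanger]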
Simulation
A valuable tool for understanding a model (and hence for understanding how the various configuration propensities combine to give networks a particular structural signature)
Increasing triangulation
• 60 nodes: Fix the density at 0.05 (i.e. 88 or 89 edges)
• Varying the propensity for alternating-triangles
• The movie shows one representative graph from each simulated distribution
From centralization to segmentation • 60 nodes: Fix the density at 0.05 (i.e. 88 or 89 edges) • Varying the propensity for alternating-stars • The movie shows one representative graph from each simulated distribution
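A fixed-density simulation of this kind can be sketched with a Metropolis sampler that moves one edge at a time, so the edge count never changes. This is a simplified stand-in, not the software used for the movies: it uses a plain triangle count rather than the alternating-triangle or alternating-star statistics, and the function names, step count and parameter value are assumptions.

```python
import math
import random
from itertools import combinations

def count_triangles(nbrs):
    """Count triangles; each is seen once from each of its three edges."""
    total = sum(len(nbrs[a] & nbrs[b]) for a in nbrs for b in nbrs[a] if a < b)
    return total // 3

def simulate_fixed_density(n, n_edges, theta_triangle, steps, seed=0):
    """Metropolis sampler over graphs with exactly n_edges edges."""
    rng = random.Random(seed)
    dyads = list(combinations(range(n), 2))
    edge_list = rng.sample(dyads, n_edges)
    edge_set = set(edge_list)
    nbrs = {v: set() for v in range(n)}
    for a, b in edge_list:
        nbrs[a].add(b)
        nbrs[b].add(a)
    for _ in range(steps):
        idx = rng.randrange(n_edges)
        a, b = edge_list[idx]          # existing edge proposed for removal
        c, d = rng.choice(dyads)       # dyad proposed to receive the edge
        if (c, d) in edge_set:
            continue                   # not an empty dyad: no move this step
        nbrs[a].discard(b)
        nbrs[b].discard(a)
        # Change in triangle count: gained at (c, d) minus lost at (a, b)
        delta = len(nbrs[c] & nbrs[d]) - len(nbrs[a] & nbrs[b])
        if rng.random() < math.exp(min(0.0, theta_triangle * delta)):
            nbrs[c].add(d)
            nbrs[d].add(c)
            edge_set.remove((a, b))
            edge_set.add((c, d))
            edge_list[idx] = (c, d)
        else:
            nbrs[a].add(b)             # reject: restore the removed edge
            nbrs[b].add(a)
    return nbrs

# 60 nodes at density 0.05 (88 edges), with an assumed triangle parameter
graph = simulate_fixed_density(60, 88, theta_triangle=1.0, steps=5000, seed=3)
```

Running the sampler over a range of parameter values and inspecting `count_triangles(graph)` gives a rough sense of how the propensity shapes the structural signature while density stays fixed.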
The inference problem
Given an observed network, can we infer the propensities for various configurations in the model that may have generated it, and quantify their uncertainty?
We use Markov chain Monte Carlo maximum likelihood estimation (MCMCMLE; Snijders, 2002) implemented in PNet (Wang, Robins & Pattison): http://www.sna.unimelb.edu.au/pnet/pnet.html
This uses the Polyak-Ruppert variant of the Robbins-Monro procedure to solve for $\theta$ in the moment equation $E_\theta\{z(Y)\} = z(y)$
See also statnet: http://csde.washington.edu/statnet (Handcock et al, 2003)
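The idea of solving the moment equation by stochastic approximation can be sketched on a toy case. The example below assumes an edge-only (Bernoulli) model, where networks can be sampled exactly and the solution is known to be the log-odds of the observed density; it is not the multi-phase algorithm of Snijders (2002) or the PNet implementation. The gain sequence, the scaling constant and the function name are all assumptions.

```python
import math
import random

def robbins_monro_edge(n_dyads, z_obs, iterations=2000, seed=1):
    """Solve E_theta{z(Y)} = z_obs for an edge-only model.

    Robbins-Monro updates with Polyak-Ruppert averaging: the estimate is the
    average of the iterates over the second half of the run.
    """
    rng = random.Random(seed)
    theta = 0.0
    scale = 0.25 * n_dyads  # crude bound on Var(z); stands in for a derivative estimate
    tail = []
    for t in range(1, iterations + 1):
        p = 1.0 / (1.0 + math.exp(-theta))
        # Simulate the statistic under the current parameter (exact sampling here)
        z_sim = sum(rng.random() < p for _ in range(n_dyads))
        gain = t ** -0.75  # slowly decreasing gain (assumed schedule)
        theta -= gain * (z_sim - z_obs) / scale
        if t > iterations // 2:
            tail.append(theta)
    return sum(tail) / len(tail)

# 20 nodes give 190 dyads; 38 observed edges correspond to density 0.2
estimate = robbins_monro_edge(n_dyads=190, z_obs=38)
target = math.log(0.2 / 0.8)
```

For models with dependence, the exact sampling step would be replaced by a Markov chain simulation of the network, which is where the MCMC in MCMCMLE comes in.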
Example 1: Co-authorship in the journal Social Networks (Sunbelt XXVIII)
Parameter estimates

Parameter                   Estimate    Standard error
Edge                        -8.736936   0.378
Isolates                    -1.954926   0.253
Alt-star (λ = 2.00)          0.083514   0.148
Alt-triangle (λ = 2.00)      3.186715   0.100
Alt-2-path (λ = 2.00)       -0.043458   0.027

                            Edge    Isolates  Alt-star  Alt-triangle  Alt-2-path
Observed statistics         568     120.0     1018.6    631.2         1333.2
Mean statistics for model   567.3   112.0     1017.4    631.9         1327.1
Heuristic goodness of fit: degree statistics
The t statistic locates the observed value of each statistic in the distribution of statistics associated with the ERGM simulated using model parameters: if |t| ≤ 2, the observed statistic is within the envelope expected by the model.

Statistic             Observed  Simulated mean (sd)  t
# 2-stars             1921      1789.0 (128.66)      1.03
# 3-stars             3401      2808.3 (555.8)       1.07
Std Dev degree dist   2.187     2.108 (0.102)        0.77
Skew degree dist      2.071     1.775 (0.281)        1.06
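The t statistic described here is the observed value minus the simulated mean, divided by the simulated standard deviation. A minimal sketch, with hypothetical function names and the threshold of 2 taken from the rule of thumb above:

```python
def t_ratio(observed, sim_mean, sim_sd):
    """Locate an observed statistic in the simulated distribution."""
    return (observed - sim_mean) / sim_sd

def within_envelope(observed, sim_mean, sim_sd, threshold=2.0):
    """True when the observed statistic is within the envelope expected by the model."""
    return abs(t_ratio(observed, sim_mean, sim_sd)) <= threshold

# 2-star row of the table: observed 1921, simulated mean 1789.0, sd 128.66
t_two_stars = t_ratio(1921, 1789.0, 128.66)
ok_two_stars = within_envelope(1921, 1789.0, 128.66)
```

Applied to the 2-star row, this reproduces the tabulated value of 1.03 after rounding.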
Heuristic goodness of fit: path-based measures

Statistic    Observed  Simulated mean (sd)  t
# 2-paths    1921      1789.0 (128.66)      1.03
# 3-paths    8728      7671.9 (1295.0)      0.82

Geodesic distribution
Quartile  Median for sampled graphs  Observed
First     553                        553
Second    553                        553
Third     553                        553
Heuristic goodness of fit: closure measures

Statistic                  Observed  Simulated mean (sd)  t
# 1-triangles              389       303.5 (18.8)         4.56
# 4-cycles                 1033      520.3 (86.2)         5.95
# (1,1)-coathangers        5189      3372.1 (565.0)       3.22
# (1,2)-coathangers        15275     8191.6 (2731.9)      2.59
# cliques of size 4        294       90.6 (13.5)          15.09
Global Clustering          0.607     0.510 (0.02)         4.65
Mean Local Clustering      0.396     0.309 (0.02)         3.92
Variance Local Clustering  0.217     0.168 (0.01)         6.62
Model 1

Effect               Parameter  Std Err
Edge                  0.2238    2.07641
K-Star (λ = 2.00)    -0.8892    0.55314
AKT-T (λ = 2.00)      1.2592    0.26601
A2P-T (λ = 2.00)     -0.1545    0.02705
Model 1: goodness of fit

Effect                     Observed  Mean      Std dev   t-ratio
# 2-stars                  2904      2668.6    767.3     0.31
# 3-stars                  13752     10890.5   4149.0    0.69
# 1-triangles              451       348.1     107.8     0.95
# 2-triangles              4617      2423.8    1069.5    2.05
# bow-ties                 28853     14366.1   7675.1    1.89
# 3-paths                  56721     51831.7   12677.6   0.39
# 4-cycles                 3880      2812.5    1262.0    0.85
# (1,1)-coathangers        18463     12554.4   5150.0    1.15
# cliques of size 4        448       164.1     75.6      3.75
# cliques of size 5        234       23.1      17.1      12.3
Std Dev degree dist        5.44      3.92      0.43      3.50
Skew degree dist           0.39      -0.26     0.386     1.70
Global Clustering          0.47      0.39      0.017     4.22
Mean Local Clustering      0.45      0.42      0.026     3.11
Variance Local Clustering  0.03      0.02      0.014     1.01
Model 2

Effect                Parameter  Std Err
Edge                  -0.3017    2.83532
1-triangle             0.7466    0.20502
2-triangle            -0.1548    0.05167
3-path                -0.0167    0.00572
4-cycle                0.0713    0.03052
(1,1)-coathanger       0.0333    0.01131
Clique of size 4       0.4025    0.17083
Alt-star (λ = 2.00)   -0.1969    0.81080
Model 2: goodness of fit

Effect                      Observed  Mean      Std dev  t-ratio
# 2-stars                   2904      2845.2    294.6    0.20
# 3-stars                   13752     13171.8   2057.3   0.28
# bow-ties                  28853     27164.9   7767.5   0.22
# cliques of size 5         234       208.4     116.5    0.22
# alt-triangles (2.00)      406.4     400.6     29.2     0.20
# alt-indpt. 2-path (2.00)  1138.2    1115.9    61.7     0.36
Std Dev degree dist         5.44      5.264     0.420    0.41
Skew degree dist            0.395     0.306     0.386    0.23
Global Clustering           0.466     0.458     0.026    0.30
Mean Local Clustering       0.498     0.465     0.032    1.01
Variance Local Clustering   0.031     0.023     0.01     0.87
What have we learnt about network topology?
• “Social circuit” models appear to reflect social processes underlying network formation better than simple Markovian neighbourhoods, having a “capacity for actors to transform as well as reproduce long-standing structures, frameworks and networks of interaction” (Emirbayer & Goodwin, 1994)
• Hypotheses about relationships among the values of related parameters can provide a practical and effective means of incorporating important higher-order configurations
• We may often need to add terms for cliques of size greater than 3
• It may sometimes be necessary to go beyond the social circuit model
• [And network effects do depend on actor and relational attributes, and are often mutually dependent across multiple and multi-mode networks]
Moreno’s network dream “If we ever get to the point of charting a whole city or a whole nation, we would have … a picture of a vast solar system of intangible structures, powerfully influencing conduct, as gravitation does in space. Such an invisible structure underlies society, and has its influence in determining the conduct of society as a whole.” J. L. Moreno, New York Times, April 13, 1933 (via James Moody)
The problem: Estimating models for large networks from sampled data
Many networks of interest, including community-level networks and biological networks, are very large, and observing a complete network can be costly and difficult.
We consider the problem of estimating the model
$$P(Y = y) = \frac{1}{\kappa(\theta)} \exp\Big\{ \sum_p \theta_p z_p(y) \Big\}$$
using data from snowball sampling designs, assuming, for the moment, a model with social circuit dependence assumptions.
Handcock & Gile (2007) and Koskinen et al (2008) consider the same problem as a missing data problem.
Daraganova et al (2008): A partial network among Brimbank respondents from a snowball sampling design (yellow = wave 1, green = wave 2, red = wave 3)
Snowball sampling designs
Multi-wave snowball sampling: we observe the ties of:
• Wave 0: nodes in $Z_0$
• Wave 1: nodes in $Z_1$ but not $Z_0$
• Wave 2: nodes in $Z_2$ but not $Z_1$
Notation:
• $y_k$: network on $Z_k \setminus Z_{k-1}$
• $y_{kl}$: ties from $Z_k$ to $Z_l \setminus Z_{l-1}$
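The wave structure of a snowball design can be sketched as repeated expansion from a seed set, where wave k contains the nodes first reached at step k. The adjacency-dictionary representation, the function name and the toy network are assumptions for illustration.

```python
def snowball_waves(adj, seeds, n_waves):
    """Return the waves as a list of sets: seeds, then nodes first reached at each step.

    adj maps each node to the set of its neighbours (undirected network).
    """
    seen = set(seeds)
    waves = [set(seeds)]
    for _ in range(n_waves):
        # Nodes named by the previous wave that have not been reached before
        frontier = {v for u in waves[-1] for v in adj[u]} - seen
        waves.append(frontier)
        seen |= frontier
    return waves

# Toy network: path 0 - 1 - 2 - 3 - 4 with an extra tie 1 - 5
net = {0: {1}, 1: {0, 2, 5}, 2: {1, 3}, 3: {2, 4}, 4: {3}, 5: {1}}
waves = snowball_waves(net, seeds={0}, n_waves=2)
```

Taking the union of the first k+1 sets recovers the zone of order k of the seed set.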
Conditional estimation strategy
We make the social circuit assumption and, following Besag (1974), a positivity assumption: $\Pr(Y_0 = 0 \mid \text{rest}) > 0$.
We show that:
$$\log \frac{\Pr(Y_0 = y_0 \mid \text{rest})}{\Pr(Y_0 = 0 \mid \text{rest})} = \sum_p \theta_p \left[ z_p(y_{0+1}) - z_p(y_{0+1}^{0}) \right]$$
where $y^{0}$ is equal to $y$ but with all entries in $y_0$ set to 0.
Defining $1/c = \Pr(Y_0 = 0 \mid \text{rest})$ yields:
$$\Pr(Y_0 = y_0 \mid \text{rest}) = \frac{1}{c} \exp\Big( \sum_p \theta_p \left[ z_p(y_{0+1}) - z_p(y_{0+1}^{0}) \right] \Big)$$
and hence the capacity to use observed data on $y_{0+1}$ to obtain conditional MLEs of $\theta = [\theta_p]$.
3-wave snowball sample
For a 3-wave sample and the positivity assumption $\Pr(Y_0 = 0, Y_1 = 0 \mid \text{rest}) > 0$, we obtain:
$$\Pr(Y_0 = y_0, Y_1 = y_1 \mid \text{rest}) = \frac{1}{c} \exp\Big( \sum_p \theta_p \left[ z_p(y_{0+1+2}) - z_p(y_{0+1+2}^{0,1}) \right] \Big)$$
where $y^{0,1}$ is equal to $y$ but with all entries in $y_0$ and $y_1$ set to 0.
Hence we can use observed data on $y_{0+1+2}$ to obtain conditional MLEs of $\theta$.
MCMCMLEs from single networks sampled from the random graph distribution with known parameters (-4, 0.2, -0.2, 1), n = 150
[Figure: panels for alt-star, edge, alt-2-path and alt-triangle, with the true value marked]
MCMCMLEs from $y_0$, conditional on $y_{01}$, $y_1$ (size of $Z_0$ = 10)
[Figure: panels for alt-star, edge, alt-2-path and alt-triangle]
Conditional MCMCMLEs from $y_1$, conditional on $y_0$, $y_{01}$, $y_{12}$, $y_2$ and assigning isolated nodes and dyads in $Z_0$ to $Z_1$ (size of $Z_0$ = 10)
[Figure: panels for alt-star, edge, alt-2-path and alt-triangle]
What if we ignore the sampling design and use “available cases”?
MCMCMLEs from network on $Z_0 + Z_1 + Z_2$
[Figure: panels for alt-star, edge, alt-2-path and alt-triangle]
Simulation study
Using data on $y_{0+1}$
For a fixed model:
Edge          -4.0
Alt-star       0.2
Alt-triangle   1.0
Alt-2-path    -0.2
Size of node set/random seed sets:
150 (15, 30, 50, 100, 150)
500 (30, 50, 70, 100, 200)
1000 (30, 50, 70, 100, 200)
Estimating network on Z0 given observed network on Z1 and ties between Z0 and Z1 (n = 150)
Estimating network on Z0 given observed network on Z1 and ties between Z0 and Z1 (n = 500)
Estimating network on Z0 given observed network on Z1 and ties between Z0 and Z1 (n = 1000)
Simulation study: summary findings
• RMSE and bias decline as seed set size increases for fixed n
• For a sufficiently large seed set size, bias is small for each n
• For given n and seed set size, bias is greatest for edge and alt-star effects
• For given seed set size, bias is greater as n increases, although the effect is less pronounced for alt-triangle and alt-2-path effects
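For reference, bias and RMSE in a study of this kind are computed from repeated estimates of a parameter whose true value is known. A minimal sketch, with a hypothetical function name and made-up estimates that are not results from the study:

```python
from math import sqrt

def bias_and_rmse(estimates, true_value):
    """Return (bias, RMSE) of repeated estimates against a known true value."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    rmse = sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, rmse

# Hypothetical edge-parameter estimates from two replications; true value -4.0
bias, rmse = bias_and_rmse([-4.5, -3.5], true_value=-4.0)
```

The toy case shows why both are reported: the two estimates are unbiased on average, yet each misses the true value by 0.5, which RMSE captures.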