Sampling Research Questions. Bruce D. Spencer, Statistics Department and Institute for Policy Research, Northwestern University. SAMSI Workshop, 10/21/10.
Introduction • At the end of the opening workshop the group in Sampling, Modeling, and Inference raised a number of open questions related to sampling. • Today I will discuss those questions, most of which are still unsolved.
Goal of Sample-Based Inference • What is the target of the inference? • a stochastic model that generated a network or set of networks • population of networks, e.g., dynamic networks • multiple networks on a single population of nodes • single network
Various Network Sampling Designs • Conventional sample design to learn about the network • probabilities do not depend on observed data • E.g., Current Population Survey • Adaptive sample design using the network • probabilities may depend on observed data • E.g., respondent-driven sampling (RDS); ego-centric samples; link-tracing designs • Two-phase sampling to target further investigation of missing data or measurement error • Subsampling (?) to reduce computational burden at possible loss of efficiency
Conventional Sampling Design to Learn about the Network(s) Samples of nodes or of edges - used for • description of network(s) • prediction of future state of network • prediction of links/gaps/nodes • fitting a model to the graph
Limitations from Sampling • Sampling introduces random error into the estimates (and possibly bias, since Ef(X) ≠ f(EX) for nonlinear f) • Sampling variance needs to be estimated, maybe bias does too; may be problematic for small samples • Some population characteristics may not be “estimable” from a sample • E.g., maximum path length between any two nodes? • Number of components in a general graph? • What does “estimable” mean?
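The point that Ef(X) ≠ f(EX) for nonlinear f can be checked exactly on a toy case. The sketch below enumerates every simple random sample of size 2 from a made-up four-value population (the values and function names are illustrative assumptions, not from the talk). The sample mean is unbiased for the population mean, but its square is biased for the squared population mean.

```python
from fractions import Fraction
from itertools import combinations


def expected_over_srs(population, n, statistic):
    """Exact expectation of statistic(sample) over all simple random
    samples of size n drawn without replacement."""
    samples = list(combinations(population, n))
    return sum(Fraction(statistic(s)) for s in samples) / len(samples)


def sample_mean(s):
    return Fraction(sum(s), len(s))


def squared_sample_mean(s):
    # Nonlinear function f(x) = x^2 applied to the sample mean.
    return sample_mean(s) ** 2


population = [1, 2, 3, 10]  # hypothetical population values
n = 2

pop_mean = Fraction(sum(population), len(population))
mean_of_means = expected_over_srs(population, n, sample_mean)
mean_of_squares = expected_over_srs(population, n, squared_sample_mean)

# Bias of the plug-in estimator (sample mean)^2 for (population mean)^2.
bias = mean_of_squares - pop_mean ** 2
print("E[mean] =", mean_of_means, " population mean =", pop_mean)
print("bias of squared mean =", bias)
```

In this example the bias equals the sampling variance of the sample mean, which is why it shrinks as the sample grows.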
Limitations from Sampling • If elements of interest (edges/non-edges, stars, motifs, etc.) have unequal probabilities of being observed, then • need to know the probabilities and adjust for them • or, need to have a model that explains the population • or, sometimes, both.
E.g.: Induced Graph Sampling • Undirected parent graph $(V, G)$ • Sample nodes $S \subseteq V$ • Observe $G(S) \subseteq G$: observe edge/non-edge between $u, v$ iff $u, v \in S$ • Conventional sampling with possibly unequal probabilities (including multiple-frame stratified multi-stage): probability of including $u_1, u_2, \ldots, u_j$ and excluding $v_1, v_2, \ldots, v_k$ knowable for any $j, k$ • Denote inclusion probabilities by $\pi_u = P(u \in S)$, $\pi_{uv} = P(u, v \in S)$, $\pi_{uvw} = P(u, v, w \in S)$, and so on.
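H-T Estimators of Triad Distribution • Define $T_{k,u,v,w} = 1$ if $u, v, w$ are distinct vertices sharing $k$ edges and $= 0$ otherwise • $T_k = \sum_{\{u,v,w\} \subseteq V} T_{k,u,v,w}$ is the number of triads in $V$ with $k$ edges, $0 \le k \le 3$ • Horvitz-Thompson (H-T) estimator: $\hat{T}_k = \sum_{\{u,v,w\} \subseteq S} T_{k,u,v,w} / \pi_{uvw}$, where $S$ is the node sample and $\pi_{uvw} = P(u, v, w \in S)$ • Other totals estimated similarly, e.g., number of stars or other motifs.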
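A minimal sketch of the Horvitz-Thompson idea for triad counts, assuming the simplest conventional design: a simple random sample (SRS) of n out of N nodes without replacement. In that case every triple of nodes has the same inclusion probability n(n-1)(n-2) / (N(N-1)(N-2)). The toy graph and the function names are illustrative assumptions. Unequal-probability designs would need the design's own triple inclusion probabilities instead.

```python
from fractions import Fraction
from itertools import combinations


def triad_counts(nodes, edges):
    """Count unordered node triples by the number of edges they share (0..3)."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(sorted(nodes), 3):
        k = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        counts[k] += 1
    return counts


def ht_triad_estimates(sample, edges, N):
    """H-T estimates of T_0..T_3 from an SRS of nodes; requires len(sample) >= 3.
    Only the induced subgraph G(S) is used, as if the rest were unobserved."""
    n = len(sample)
    pi_uvw = Fraction(n * (n - 1) * (n - 2), N * (N - 1) * (N - 2))
    s = set(sample)
    observed_edges = [e for e in edges if set(e) <= s]
    return [Fraction(c) / pi_uvw for c in triad_counts(sample, observed_edges)]


# Hypothetical parent graph on 6 nodes.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
true_counts = triad_counts(nodes, edges)

# Average the estimates over every possible SRS of size 4 to check unbiasedness.
samples = list(combinations(nodes, 4))
per_sample = [ht_triad_estimates(s, edges, len(nodes)) for s in samples]
average = [sum(est[k] for est in per_sample) / len(samples) for k in range(4)]
print("true:", true_counts, "average of H-T estimates:", average)
```

Averaging over all possible samples reproduces the population counts exactly, which is the design-unbiasedness property. Any single sample's estimate can still be far from the truth when n is small.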
Degree Distribution • $d_u$: degree of node $u$ (its number of edges) • $M$: maximum degree in $(V, G)$ • $N_r$: number of nodes of degree $r$, $0 \le r \le M$ • $(F_0, F_1, \ldots, F_M)$ is the degree distribution, with $F_r = N_r / N$, where $N$ is the number of nodes • Degree distribution of the sample can differ from degree distribution of the population: “Subnets of Scale-Free Networks are Not Scale-Free: Sampling Properties of Networks,” Stumpf, Wiuf, May (PNAS, 2005)
Estimation of Degree Distribution • Induced subgraph from SRS (simple random sample) of size $n$ from $(V, G)$ • $N_r$: number of nodes of degree $r$ in parent graph • $N_r(S)$: number of nodes of degree $r$ in subgraph
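One way to connect the sample and parent degree counts, sketched here as an assumption rather than as the estimator shown on the slide: under an SRS of n out of N nodes, a node of parent degree j is sampled with probability n/N, and the number of its neighbors among the other n-1 sampled nodes is hypergeometric. The expected sample counts are therefore linear in the parent counts through an upper-triangular matrix, which can be back-substituted to get a moment-type estimate. The function names are hypothetical, and the inversion needs the maximum degree M to be at most n-1.

```python
from fractions import Fraction
from math import comb


def mixing_matrix(N, n, M):
    """P[r][j]: probability that a node of parent degree j is sampled
    and has degree r in the induced subgraph (SRS of n out of N nodes)."""
    denom = comb(N - 1, n - 1)
    P = []
    for r in range(M + 1):
        row = []
        for j in range(M + 1):
            if j >= r and n - 1 - r >= 0:
                hyper = Fraction(comb(j, r) * comb(N - 1 - j, n - 1 - r), denom)
                row.append(Fraction(n, N) * hyper)
            else:
                row.append(Fraction(0))
        P.append(row)
    return P


def expected_sample_counts(N_pop, N, n):
    """Expected N_r(S) given parent counts N_pop = [N_0, ..., N_M]."""
    M = len(N_pop) - 1
    P = mixing_matrix(N, n, M)
    return [sum(P[r][j] * N_pop[j] for j in range(M + 1)) for r in range(M + 1)]


def estimate_population_counts(N_sample, N, n):
    """Back-substitute E[N_r(S)] = sum_j P[r][j] N_j; assumes M <= n - 1."""
    M = len(N_sample) - 1
    P = mixing_matrix(N, n, M)
    est = [Fraction(0)] * (M + 1)
    for r in range(M, -1, -1):
        tail = sum(P[r][j] * est[j] for j in range(r + 1, M + 1))
        est[r] = (Fraction(N_sample[r]) - tail) / P[r][r]
    return est


# Hypothetical parent graph: path 0-1-2-3-4 plus one isolated node (N = 6).
N_pop = [1, 2, 3]  # one node of degree 0, two of degree 1, three of degree 2
expected = expected_sample_counts(N_pop, N=6, n=4)
recovered = estimate_population_counts(expected, N=6, n=4)
print("expected sample counts:", expected)
print("recovered parent counts:", recovered)
```

Applied to the counts from a single observed sample rather than their expectation, the back-substitution can give negative or highly variable estimates, which relates to the small-sample variance concern raised elsewhere in the talk.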
Partial Recap • Using induced graph subsamples from conventional samples where joint inclusion probabilities are known, we can estimate • population values of descriptive statistics based on totals • degree distribution. • (Only undirected graphs at one point in time discussed.) • What about • other descriptive statistics • model fitting • large variances when sample size small • adaptive samples?
Approaches to Model Fitting • You trust* your model. • Under certain conditions** on the sample design and the model, you can ignore the way the sample was selected and treat the sample as having been generated from the model. • The sampling mechanism needs to be carefully examined to make sure it meets the requirements, which depend on the model being used. * Reagan and others, “trust but verify” ** Handcock and Gile (2010 AoAS) call the condition “amenability” and relate it to “ignorability” (Rubin 1976).
Approaches to Model Fitting • “Model as descriptive statistic”. You do not necessarily believe the model, but you want to fit the model the way you would if you completely observed the population. • Anathema to many social scientists. . . • E.g., in ERGMs, model fitting for population depends on sufficient statistics that are population totals. One can estimate them with H-T estimates (or alternatives) and then fit model. (Pavel Krivitsky poster) • I have not investigated how to implement for other models. • If both approaches are tried, “large” differences in fits can indicate model misspecification.
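To make the "model as descriptive statistic" idea concrete, here is a minimal sketch for the simplest possible ERGM, the edges-only model P(G) ∝ exp(θ · number of edges). This is an illustrative assumption and not the method of the poster cited above. For that model the maximum likelihood estimate of θ with a fully observed graph is the logit of the graph density, so an H-T estimate of the edge total can be plugged in. The sketch assumes an SRS of nodes with induced-subgraph observation, and the function names are hypothetical.

```python
from math import comb, log


def ht_edge_total(sample, edges, N):
    """H-T estimate of the number of edges in the parent graph from an SRS
    of nodes, using only edges with both endpoints in the sample."""
    n = len(sample)
    s = set(sample)
    observed = sum(1 for u, v in edges if u in s and v in s)
    # Dividing by pi_uv = n(n-1) / (N(N-1)), written as a multiplication.
    return observed * N * (N - 1) / (n * (n - 1))


def edges_only_ergm_theta(edge_total, N):
    """Plug-in estimate of theta for the edges-only ERGM: logit of density.
    Undefined when the estimated density is exactly 0 or 1."""
    density = edge_total / comb(N, 2)
    return log(density / (1 - density))


# Hypothetical parent graph on N = 6 nodes and one sample of 4 nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
sample = [0, 1, 2, 3]

edge_hat = ht_edge_total(sample, edges, N=6)
theta_hat = edges_only_ergm_theta(edge_hat, N=6)
print("estimated edge total:", edge_hat, " theta:", theta_hat)
```

Richer ERGMs have no closed-form fit, so the estimated sufficient statistics would have to be passed to a fitting routine. That step is not sketched here.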
Adaptive Sampling • Probabilities of observations depend on data from sampled units. • Provides more information about network than conventional samples (Frank). Note: variances may be too large when sample is conventional but sparse. • Probabilities of observing triads and larger structures are typically unavailable; even for dyads, probabilities are known for ego-centric designs but not for link-tracing designs (Handcock and Gile 2010). • To use the full data, one must either estimate the unknown probabilities (hard!!) or rely on a model, provided the amenability condition can be verified and the model validated. • E.g., when using conventional unequal probability samples to estimate a population total, the amenability condition typically does not hold.
Model Validation • Model validation is important, but challenging when sampling probabilities are unknown. • At the heart of every adaptive sample is a conventional sample. • Use conventional sample to fit model as descriptive statistic. Compare result to model fitted under assumption of ignorability/amenability for (i) conventional sample and (ii) larger and more informative adaptive sample.
Recap • What is the population (network, or set of networks) from which sample is selected? • Sample design (and inference) to learn about the network • Static • Over time • Description of network • Prediction of future state of network and prediction of links/gaps/nodes
Recap • Sample design (and inference) using the network to learn about a population • Respondent Driven Sampling • Adaptive Sampling • Others • Static and over time
Recap • Subsampling design (and inference) to • Ease computational burden • Target further investigation to learn about measurement error • When can inferences be made based on sample design information to provide approximate unbiasedness whether or not model is valid?
Recap • How can model inferences be made? • What models? • Exponential random graph models • Mixed membership stochastic block models • Latent space models • Agent based models • What network characteristics (what summary statistics)?
Recap • What is effect of measurement error (and missing data, non-response) on inferences about network? • RDS samples • Others • How to design and analyze randomized experiments when subjects are part of a static network? Dynamic? • Google experiments • Experiments on adolescents in schools (e.g., drug counseling, obesity “treatment”) – effects on peers