Nonparametric, Model-Assisted Estimation for a Two-Stage Sampling Design. Mark Delorey. Joint work with F. Jay Breidt and Jean Opsomer. September 8, 2005. Research supported by EPA Cooperative Agreements R829095 and R829096.
Motivation • In resource monitoring and assessment, time and expense constraints may make two-stage sampling more efficient • Select a sample of watersheds; sample different bodies of water within selected watersheds • Select a sample of lakes; sample at different locations in selected lakes • Samples are not always sufficiently dense in small watersheds; availability of cheap auxiliary information (primarily from GIS) suggests incorporating a model • Auxiliary information may be available on different scales • Often there are many study variables; rather than fit a model for each one, we would like one set of weights that can be applied reasonably well to all variables, i.e., weights that do not depend on the study variable
Outline • Two-stage structure • Model-free, model-assisted, and model-based estimators • Penalized splines • Simulation results • Properties of model-assisted estimator using penalized spline
Two-Stage Structure • Population of elements $U = \{1, \dots, k, \dots, N\}$ is partitioned into clusters or primary sampling units (PSUs), $U_1, \dots, U_i, \dots$ So, $U = \bigcup_i U_i$ and $N = \sum_i N_i$, where $N_i$ is the number of elements or secondary sampling units (SSUs) in $U_i$.
Case A: Cluster Level Auxiliaries (Our focus) • The auxiliary information is available for all clusters in the population • Leads to regression modeling of quantities associated with the clusters, such as cluster totals and means • Cluster quantities can be computed for all clusters • Population quantities can be computed from cluster estimates • Example: Lake represents a cluster; auxiliary information is elevation
Case B: Complete Element Level Auxiliaries • The auxiliary information is available for all elements in the population • Leads to regression modeling of quantities associated with the elements • Cluster and population quantities can then be computed from element estimates and observations • Example: EMAP hexagon is cluster; lake is element; auxiliary information is elevation
Case C: Limited Element Level Auxiliaries • The auxiliary information is available for all elements in selected clusters only • Leads to regression modeling of quantities associated with the elements • Regression estimators can be used for cluster-level quantities only for the clusters selected in the first-stage sample • Population-level quantities can be estimated using design-based estimators • Example: Aerial photography of selected sites (clusters); for each point (element) in site, we have percent forested, urban, industrial
Case D: Limited Cluster Level Auxiliaries • The auxiliary information is available for all clusters in the first-stage sample • Not a very interesting case • Design-based estimator can be used for population quantities • Example: Cluster is lake; auxiliary information is measure of size which is not available until site is visited
Sampling • First stage: A sample of clusters, $s_I$, is selected based on a design $p_I(\cdot)$ with inclusion probabilities $\pi_{Ii}$ and $\pi_{Iij}$ • $\pi_{Ii}$ and $\pi_{Iij}$ are the first- and second-order inclusion probabilities, respectively • Second stage: For every $i \in s_I$, a sample $s_i$ is drawn from $U_i$ based on the design $p_i(\cdot \mid s_I)$ • We typically require the second-stage design to be invariant and independent of the first stage
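The two stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both stages use simple random sampling without replacement (the slides do not fix the designs here), and the function name `two_stage_sample` and its arguments are hypothetical.

```python
import random

def two_stage_sample(cluster_sizes, n_psu, n_ssu, seed=0):
    """Draw a two-stage sample (assumed design: SRS at both stages).

    cluster_sizes: list of N_i, one entry per PSU.
    Returns {PSU index i: sorted element indices drawn from U_i}.
    """
    rng = random.Random(seed)
    # First stage: select the set s_I of PSUs.
    s_I = sorted(rng.sample(range(len(cluster_sizes)), n_psu))
    sample = {}
    for i in s_I:
        # Second stage: the rule used inside PSU i depends only on N_i,
        # not on which other PSUs were selected (invariance).
        n_i = min(n_ssu, cluster_sizes[i])
        sample[i] = sorted(rng.sample(range(cluster_sizes[i]), n_i))
    return sample

sizes = [50, 120, 80, 200, 60, 150]
s = two_stage_sample(sizes, n_psu=3, n_ssu=10, seed=1)
pi_I = 3 / len(sizes)  # first-order inclusion probability of each PSU under SRS
```

Under this assumed design every PSU has the same first-stage inclusion probability; a pps design would replace the first `rng.sample` call with a draw whose probabilities depend on cluster size.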
Other Notation • $t_y = \sum_{k \in U} y_k$ is the total for the variable $y$ over the entire population • Where required, we will assume the population model $\mu_i = f(x_i) + \varepsilon_i$, where $\mu_i$ is the mean of the $y$'s in PSU $i$ • $x_i$ is some auxiliary variable that is a known quantity (usually a total or mean) for PSU $i$
The Estimators (for population totals) • Model-free • Model-assisted • Model-based
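Model-Free Estimator • If no information other than the sampling design is available, the Horvitz-Thompson estimator is often used: $\hat{t}_{y,\mathrm{HT}} = \sum_{i \in s_I} \hat{t}_i / \pi_{Ii}$, where $\hat{t}_i = \sum_{k \in s_i} y_k / \pi_{k|i}$ estimates the total of PSU $i$ and $\pi_{k|i}$ is the second-stage inclusion probability of element $k$ given that PSU $i$ was selected • Notes: • Always design unbiased • Variance is large for small sample sizes • Does not make use of auxiliary information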
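As a worked illustration of the two-stage Horvitz-Thompson estimator, the sketch below weights each observation by the inverse of its second-stage inclusion probability and each estimated PSU total by the inverse of its first-stage probability. The function name `ht_total` and the toy data are made up for this example.

```python
def ht_total(sample, pi_I, pi_cond):
    """Two-stage Horvitz-Thompson estimate of a population total.

    sample:  {psu: [y values observed in that PSU]}
    pi_I:    {psu: first-stage inclusion probability of the PSU}
    pi_cond: {psu: [second-stage inclusion probability of each observed element]}
    """
    total = 0.0
    for i, ys in sample.items():
        # Estimated total of PSU i from its second-stage sample.
        t_i_hat = sum(y / p for y, p in zip(ys, pi_cond[i]))
        # Expand the estimated PSU total to the population level.
        total += t_i_hat / pi_I[i]
    return total

sample = {"A": [1.0, 2.0], "B": [3.0]}
pi_I = {"A": 0.5, "B": 0.5}
pi_cond = {"A": [0.5, 0.5], "B": [0.25]}
estimate = ht_total(sample, pi_I, pi_cond)  # (6 + 12) / 0.5 = 36.0
```

No auxiliary information enters the computation, which is why the estimator is design unbiased but can be highly variable when few PSUs are sampled.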
Model-Assisted Estimator • The estimator is built on the PSU totals predicted by the model • Properties: • Asymptotically unbiased and consistent even if the model is misspecified • Variance is generally smaller than with HT, but larger than with the model-based estimator • Can incorporate auxiliary information
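For orientation, one standard way to write a model-assisted (generalized difference) estimator of a total with cluster-level auxiliaries is sketched below. The notation is an assumption, not taken from the slides: $U_I$ denotes the population of PSUs, $\hat{t}_i$ the design-based estimate of the total of PSU $i$, and $\hat{t}_i^{\,\mathrm{mod}}$ the PSU total predicted by the model.

```latex
\hat{t}_{y,\mathrm{MA}}
  = \sum_{i \in U_I} \hat{t}_i^{\,\mathrm{mod}}
  + \sum_{i \in s_I} \frac{\hat{t}_i - \hat{t}_i^{\,\mathrm{mod}}}{\pi_{Ii}}
```

The first sum uses model predictions for every cluster, which is possible in Case A because the auxiliary information is known for all clusters. The second sum is a design-weighted correction based on the residuals in the sampled clusters; it is this term that keeps the estimator asymptotically design unbiased when the model is wrong.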
Model-Based Estimator • The estimator is built on the PSU means predicted by the model • Properties: • Unbiased if the model is correctly specified • Variance is generally smaller than with HT • Can incorporate auxiliary information
Notes on the Models • 3 different models considered: • Linear • Penalized spline with random effect for PSU • Penalized spline with no random effect for PSU • The model specification is extended for the penalized spline with random effect for PSU; there, $y_{ij}$ is the response for the $j$th element in PSU $i$
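Penalized Splines (P-Splines) • With a linear model, we assume the mean function is linear in the auxiliary variable • For a penalized spline, the linear form is replaced by a spline built on $K$ fixed knots $\kappa_1 < \dots < \kappa_K$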
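A minimal sketch of a penalized spline fit may help make the idea concrete. It assumes a truncated-power basis $1, x, \dots, x^p, (x - \kappa_1)_+^p, \dots, (x - \kappa_K)_+^p$ with a ridge penalty on the knot coefficients only; the degree (1 by default), the knot locations, and the penalty value are illustrative choices, and all function names are hypothetical. The linear system is solved with a small hand-written routine so that the example needs no external libraries.

```python
def truncated_basis(x, knots, degree=1):
    """Basis row: 1, x, ..., x^p, (x - k_1)_+^p, ..., (x - k_K)_+^p (degree >= 1)."""
    poly = [x ** d for d in range(degree + 1)]
    trunc = [max(x - k, 0.0) ** degree for k in knots]
    return poly + trunc

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def fit_pspline(x, y, knots, lam, degree=1):
    """Penalized least squares: minimise RSS + lam * sum(knot coefficients^2)."""
    X = [truncated_basis(xi, knots, degree) for xi in x]
    p = degree + 1 + len(knots)
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    for j in range(degree + 1, p):
        XtX[j][j] += lam  # penalise only the truncated (knot) terms
    return solve(XtX, Xty)

def predict(beta, x0, knots, degree=1):
    return sum(b * v for b, v in zip(beta, truncated_basis(x0, knots, degree)))

x = [i / 10 for i in range(11)]
y = [2.0 + 3.0 * xi for xi in x]          # exactly linear data
knots = [0.25, 0.5, 0.75]
beta = fit_pspline(x, y, knots, lam=1.0)  # knot coefficients shrink to ~0
```

Because the polynomial part is left unpenalized, exactly linear data are reproduced for any penalty value; increasing `lam` pulls a wiggly fit back toward the linear model, and decreasing it lets the knots absorb more curvature.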
Simulation Study • 500 PSUs; the number of SSUs per cluster ~ Uniform(50, 400) • $\mu_{\mathrm{PSU}} = f(\pi_I) + \varepsilon$, where $f(\cdot)$ is one of eight functions and $\varepsilon \sim N(0, \sigma^2 I)$ • We use first-order inclusion probabilities proportional to size (pps) • Auxiliary data is often proportional to size of cluster • Generate the response of interest $y_{ij} = \mu_i + \varepsilon_{ij}$, where $y_{ij}$ is the $j$th element in the $i$th cluster and $\varepsilon_{ij} \sim$ iid $N(0, \sigma^2)$
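The population described on this slide could be generated roughly as follows. Several details are assumptions rather than facts from the slides: cluster sizes are drawn as integers, the auxiliary variable is taken to be the size-proportional first-stage inclusion probability, the mean function `f` is a made-up linear stand-in (the eight functions are not listed), and the two standard deviations and the number of sampled PSUs are arbitrary.

```python
import random

def f(p):
    """Hypothetical mean function; the talk's eight functions are not given."""
    return 1.0 + 2.0 * p

def simulate_population(n_psu=500, n_sampled=50, sigma_mu=0.1, sigma_y=0.5, seed=0):
    rng = random.Random(seed)
    # Number of SSUs per cluster, Uniform(50, 400), taken here as integers.
    sizes = [rng.randint(50, 400) for _ in range(n_psu)]
    total_size = sum(sizes)
    # First-order inclusion probabilities proportional to size (pps).
    pi_I = [n_sampled * n_i / total_size for n_i in sizes]
    # PSU means: smooth function of the auxiliary plus PSU-level noise.
    mu = [f(p) + rng.gauss(0.0, sigma_mu) for p in pi_I]
    # Element responses: y_ij = mu_i + eps_ij.
    y = [[m + rng.gauss(0.0, sigma_y) for _ in range(n_i)]
         for m, n_i in zip(mu, sizes)]
    return sizes, pi_I, mu, y

sizes, pi_I, mu, y = simulate_population()
```

With these settings the inclusion probabilities sum to the number of sampled PSUs and none exceeds one, so they are valid for a fixed-size pps design.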
Why not use model-based? • In survey contexts, such as those found in environmental monitoring, it is often desirable to obtain a single set of survey weights that can be used to predict any study variable. To accommodate this: • The smoothing parameter for the spline is selected by fixing the degrees of freedom for the smooth rather than by using a data-driven approach • With a model-based estimator, the sampling design is ignored and estimates rely solely on the form of $f(\cdot)$
Properties of Model-Assisted Estimator • The penalized spline estimator is a linear operator • It is location and scale invariant, in the sense that […], provided an intercept is kept in the model and […]
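Linearity means the estimate can be written as a weighted sum of the observed values, with weights that depend only on the design and the auxiliary variable. The sketch below illustrates this under simplifying assumptions that are not from the slides: a single-stage view in which the sampled unit values are treated as observed, a linear truncated-power basis, a penalty `lam` fixed by hand, and regression-estimator weights of the form $w_i = (1 + g^\top x_i)/\pi_i$. All names are hypothetical.

```python
def basis(x, knots):
    """Linear truncated-power basis row: [1, x, (x - k_1)_+, ..., (x - k_K)_+]."""
    return [1.0, x] + [max(x - k, 0.0) for k in knots]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def model_assisted_weights(x_pop, sample_idx, pi, knots, lam):
    """Weights w_i such that the estimated total of any y is sum_i w_i * y_i."""
    p = 2 + len(knots)
    pop_rows = [basis(x, knots) for x in x_pop]
    rows = {i: pop_rows[i] for i in sample_idx}
    # G = sum_s x_i x_i' / pi_i + lam * D; D penalises only the knot terms.
    G = [[sum(r[a] * r[b] / pi[i] for i, r in rows.items()) for b in range(p)]
         for a in range(p)]
    for j in range(2, p):
        G[j][j] += lam
    # Known population totals of the basis and their design-based estimates.
    t_X = [sum(r[a] for r in pop_rows) for a in range(p)]
    t_X_ht = [sum(r[a] / pi[i] for i, r in rows.items()) for a in range(p)]
    g = solve(G, [t - th for t, th in zip(t_X, t_X_ht)])
    return {i: (1.0 + sum(ga * ra for ga, ra in zip(g, r))) / pi[i]
            for i, r in rows.items()}

x_pop = [i / 19 for i in range(20)]
sample_idx = [0, 3, 7, 10, 14, 19]
pi = [0.3] * 20
w = model_assisted_weights(x_pop, sample_idx, pi, knots=[0.33, 0.66], lam=1.0)

# The same weights serve any study variable observed on the sample.
y1 = {i: 4.0 + x_pop[i] ** 2 for i in sample_idx}
est_y1 = sum(w[i] * y1[i] for i in sample_idx)
est_shifted = sum(w[i] * (5.0 + 2.0 * y1[i]) for i in sample_idx)
```

Because the intercept column is unpenalized, the weights sum to the population size, so shifting and rescaling the study variable shifts and rescales the estimate in the same way. Nothing in the weights depends on `y1`, which is what allows one set of weights to be reused across study variables.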
Properties of Model-Assisted Estimator • Under mild assumptions, the penalized spline estimator is design-consistent for $t_y$, in the sense that […], and has the following asymptotic distributional property: […]
Properties of Model-Assisted Estimator • Again, under mild assumptions, the estimator […] • The previous two results lead to: […]
Summary • Two-stage sampling designs are used frequently in natural resource monitoring and assessment • Samples are often sparse; model-free estimators will have high variance • Model-based estimators make use of auxiliary information and have good properties provided the model is correctly specified • Modeling with P-splines addresses the problem of correctly specifying the model • Often, a model can't be fit to all study variables; model-assisted estimators still have reasonably good properties when weights from one model are applied to all study variables