Nonparametric, Model-Assisted Estimation for a Two-Stage Sampling Design

Mark Delorey • Joint work with F. Jay Breidt and Jean Opsomer • September 8, 2005 • Research supported by EPA Cooperative Agreements R829095 and R829096

Motivation • In resource monitoring and assessment, time and expense constraints may make two-stage sampling more efficient • Select a sample of watersheds; sample different bodies of water within selected watersheds • Select a sample of lakes; sample at different locations in selected lakes • Samples are not always sufficiently dense in small watersheds; availability of cheap auxiliary information (primarily from GIS) suggests incorporating a model • Auxiliary information may be available on different scales • Often there are many study variables; rather than fit a model for each one, we would like one set of weights that can be applied reasonably well to all variables, i.e., an estimator that is a weighted sum of the sampled observations, with weights that do not depend on the study variable

Outline • Two-stage structure • Model-free, model-assisted, and model-based estimators • Penalized splines • Simulation results • Properties of model-assisted estimator using penalized spline

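Two-Stage Structure • The population of elements $U = \{1, \ldots, k, \ldots, N\}$ is partitioned into $N_I$ clusters or primary sampling units (PSUs), $U_1, \ldots, U_i, \ldots, U_{N_I}$. So, $U = \bigcup_{i=1}^{N_I} U_i$ and $N = \sum_{i=1}^{N_I} N_i$, where $N_i$ is the number of elements or secondary sampling units (SSUs) in $U_i$.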

Case A: Cluster Level Auxiliaries (Our focus) • The auxiliary information is available for all clusters in the population • Leads to regression modeling of quantities associated with the clusters, such as cluster totals and means • Cluster quantities can be computed for all clusters • Population quantities can be computed from cluster estimates • Example: Lake represents a cluster; auxiliary information is elevation

Case B: Complete Element Level Auxiliaries • The auxiliary information is available for all elements in the population • Leads to regression modeling of quantities associated with the elements • Cluster and population quantities can then be computed from element estimates and observations • Example: EMAP hexagon is cluster; lake is element; auxiliary information is elevation

Case C: Limited Element Level Auxiliaries • The auxiliary information is available for all elements in selected clusters only • Leads to regression modeling of quantities associated with the elements • Regression estimators can be used for cluster-level quantities only for the clusters selected in the first-stage sample • Population-level quantities can be estimated using design-based estimators • Example: Aerial photography of selected sites (clusters); for each point (element) in site, we have percent forested, urban, industrial

Case D: Limited Cluster Level Auxiliaries • The auxiliary information is available for all clusters in the first-stage sample • Not a very interesting case • Design-based estimator can be used for population quantities • Example: Cluster is lake; auxiliary information is measure of size which is not available until site is visited

Sampling • First stage: A sample of clusters, $s_I$, is selected based on a design $p_I(\cdot)$ with inclusion probabilities $\pi_{Ii}$ and $\pi_{Iij}$ • $\pi_{Ii}$ and $\pi_{Iij}$ are the first- and second-order inclusion probabilities, respectively • Second stage: For every $i \in s_I$, a sample $s_i$ is drawn from $U_i$ based on the design $p_i(\cdot \mid s_I)$ • The second-stage design is typically required to be invariant and independent of the first stage

Other Notation • $t_y = \sum_{k \in U} y_k$ is the total for the variable $y$ over the entire population • Where required, we will assume the population model $\mu_i = f(x_i) + \varepsilon_i$, where $\mu_i$ is the mean of the $y$'s in PSU $i$ • $x_i$ is some auxiliary variable that is a known quantity (usually a total or mean) for PSU $i$

The Estimators (for population totals) • Model-free • Model-assisted • Model-based

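Model-Free Estimator • If no information other than the sampling design is available, the Horvitz-Thompson (HT) estimator is often used: $\hat{t}_{HT} = \sum_{i \in s_I} \hat{t}_i / \pi_{Ii}$, where $\hat{t}_i = \sum_{k \in s_i} y_k / \pi_{k|i}$ estimates the total of PSU $i$, $\pi_{Ii}$ is the first-stage inclusion probability of PSU $i$, and $\pi_{k|i}$ is the inclusion probability of element $k$ given that PSU $i$ was selected • Notes: • Always design unbiased • Variance is large for small sample sizes • Does not make use of auxiliary information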
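The two-stage Horvitz-Thompson computation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `ht_two_stage`, the dictionary keys, and the toy numbers are all assumptions made for the example.

```python
def ht_two_stage(sample):
    """Horvitz-Thompson estimate of a population total under two-stage sampling.

    `sample` is a list of sampled PSUs (hypothetical layout); each PSU is a dict with
      'pi_I' : first-stage inclusion probability of the PSU
      'y'    : observed y values for the sampled SSUs in that PSU
      'pi_k' : conditional inclusion probability of each sampled SSU
    """
    total = 0.0
    for psu in sample:
        # Second stage: expand sampled elements to an estimated PSU total.
        t_i_hat = sum(y / p for y, p in zip(psu["y"], psu["pi_k"]))
        # First stage: expand the estimated PSU total to the population.
        total += t_i_hat / psu["pi_I"]
    return total


# Toy data, invented for illustration only.
sample = [
    {"pi_I": 0.5, "y": [2.0, 4.0], "pi_k": [0.5, 0.5]},
    {"pi_I": 0.25, "y": [10.0], "pi_k": [0.2]},
]
estimate = ht_two_stage(sample)
```

No auxiliary information enters the calculation; only the design probabilities and the observed values are used.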

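Model-Assisted Estimator • $\hat{t}_{MA} = \sum_{i \in U_I} \hat{m}_i + \sum_{i \in s_I} (\hat{t}_i - \hat{m}_i) / \pi_{Ii}$, where $U_I$ is the set of all PSUs, $\hat{t}_i$ is the estimated total of sampled PSU $i$, $\pi_{Ii}$ is its first-stage inclusion probability, and $\hat{m}_i$ is the PSU total predicted by the model • Properties: • Asymptotically unbiased and consistent even if model is misspecified • Variance is generally smaller than with HT, but larger than with the model-based estimator • Can incorporate auxiliary information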
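A model-assisted estimator of this kind is commonly written in generalized difference form: sum the model predictions over every PSU in the population, then add a design-weighted correction built from the sampled residuals. The sketch below assumes that form and Case A (predictions available for all PSUs); the function name, argument layout, and numbers are hypothetical.

```python
def model_assisted_total(pred_all, sample):
    """Generalized difference estimator of a population total.

    pred_all : model-predicted PSU totals for every PSU in the population
    sample   : list of (psu_index, t_i_hat, pi_I) for the sampled PSUs, where
               t_i_hat is the estimated PSU total from the second-stage sample
               and pi_I is the first-stage inclusion probability
    """
    # Model part: predictions summed over the whole population of PSUs.
    synthetic = sum(pred_all)
    # Design part: inverse-probability-weighted residuals on the sample.
    correction = sum((t_hat - pred_all[i]) / pi for i, t_hat, pi in sample)
    return synthetic + correction


# Toy numbers, invented for illustration only.
pred_all = [10.0, 20.0, 30.0, 40.0]
sample = [(0, 12.0, 0.5), (2, 27.0, 0.5)]
estimate = model_assisted_total(pred_all, sample)

# With all predictions set to zero the estimator reduces to Horvitz-Thompson.
ht_only = model_assisted_total([0.0] * 4, sample)
```

The correction term is what protects against a misspecified model: if the predictions are poor, the weighted residuals pull the estimate back toward the design-based answer.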

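Model-Based Estimator • $\hat{t}_{MB} = \sum_{i \in U_I} N_i \hat{\mu}_i$, where $U_I$ is the set of all PSUs, $N_i$ is the number of elements in PSU $i$, and $\hat{\mu}_i$ is the $i$th PSU mean predicted by the model • Properties: • Unbiased if model is correctly specified • Variance is generally smaller than with HT • Can incorporate auxiliary information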

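Notes on the Models • Three different models considered • Linear • Penalized spline with random effect for PSU • Penalized spline with no random effect for PSU • Extend model specification for penalized spline with random effect for PSU: $y_{ij} = \mu_i + \varepsilon_{ij} = f(x_i) + \varepsilon_i + \varepsilon_{ij}$, where $y_{ij}$ is the response for the $j$th element in PSU $i$ and $\varepsilon_i$ is the PSU-level random effect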

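Penalized Splines (P-Splines) • With a linear model, we assume $f(x) = \beta_0 + \beta_1 x$ • For a penalized spline, $f(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} b_k (x - \kappa_k)_+$, where $\kappa_1 < \ldots < \kappa_K$ are $K$ fixed knots and $(u)_+ = \max(u, 0)$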
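A penalized spline fit can be illustrated with a truncated-linear basis and a ridge-type penalty on the knot coefficients only. The sketch below is one common formulation, not necessarily the one used in the talk; the basis degree, knot placement, penalty value `lam`, and test function are all assumptions made for the example.

```python
import numpy as np


def truncated_linear_basis(x, knots):
    """Design matrix with columns [1, x, (x - k_1)_+, ..., (x - k_K)_+]."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    Z = np.maximum(x[:, None] - knots[None, :], 0.0)
    return np.column_stack([np.ones_like(x), x, Z])


def fit_pspline(x, y, knots, lam):
    """Minimize ||y - C beta||^2 + lam * (sum of squared knot coefficients)."""
    C = truncated_linear_basis(x, knots)
    # Penalize only the knot coefficients; intercept and slope are left free.
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))
    return np.linalg.solve(C.T @ C + lam * D, C.T @ np.asarray(y, dtype=float))


# Illustrative data: a smooth curve plus noise (hypothetical values).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=50)
knots = np.linspace(0.1, 0.9, 9)

coef = fit_pspline(x, y, knots, lam=0.01)
fitted = truncated_linear_basis(x, knots) @ coef

# Exactly linear data are reproduced for any positive penalty.
coef_lin = fit_pspline(x, 1.0 + 2.0 * x, knots, lam=1.0)
```

Large values of `lam` shrink the knot coefficients toward zero and the fit toward a straight line; small values let the fit follow the data more closely.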

Simulation Study • 500 PSUs; the number of SSUs per cluster ~ Uniform(50, 400) • $\boldsymbol{\mu}_{PSU} = f(\mathbf{x}_I) + \boldsymbol{\varepsilon}$, where $f(\cdot)$ is one of eight functions and $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$ • We use first-order inclusion probabilities proportional to size (pps) • Auxiliary data is often proportional to size of cluster • Generate the response of interest $y_{ij} = \mu_i + \varepsilon_{ij}$, where $y_{ij}$ is the $j$th element in the $i$th cluster and $\varepsilon_{ij} \sim$ iid $N(0, \sigma^2)$
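A population of this shape could be generated as in the following sketch. Only the number of PSUs, the range of cluster sizes, and the pps design come from the slide; the mean function `f`, the two standard deviations, the first-stage sample size, and the choice of auxiliary variable are hypothetical stand-ins, since the actual settings are not given in the text.

```python
import numpy as np

rng = np.random.default_rng(2005)

n_psu = 500
# Number of SSUs per PSU, drawn uniformly from the integers 50..400.
N_i = rng.integers(50, 401, size=n_psu)

# Assumption: the auxiliary variable is cluster size rescaled to (0, 1].
x = N_i / N_i.max()


def f(x):
    """Hypothetical mean function; the eight functions of the study are not listed."""
    return 1.0 + 2.0 * (x - 0.5)


sigma_psu = 0.1   # hypothetical PSU-level standard deviation
sigma_elem = 0.4  # hypothetical element-level standard deviation

# PSU means: mu_i = f(x_i) + eps_i.
mu = f(x) + rng.normal(0.0, sigma_psu, size=n_psu)

# Element responses: y_ij = mu_i + eps_ij, one array per PSU.
y = [m + rng.normal(0.0, sigma_elem, size=n) for m, n in zip(mu, N_i)]

# First-order inclusion probabilities proportional to size (pps).
n_sample = 50  # hypothetical number of PSUs in the first-stage sample
pi_I = n_sample * N_i / N_i.sum()
```

With these sizes every inclusion probability stays below 1, so no PSU has to be taken with certainty.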

First Four Functions

Second Four Functions

Some Simulation Results

More Simulation Results

Why not use model-based? • In survey contexts, such as those found in environmental monitoring, it is often desirable to obtain a single set of survey weights that can be used to predict any study variable. To accommodate this: • The smoothing parameter for the spline is selected by fixing the degrees of freedom for the smooth rather than using a data-driven approach • With a model-based estimator, the sampling design is ignored and estimates rely solely on the form of $f(\cdot)$
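Fixing the degrees of freedom means choosing the penalty so that the trace of the smoother matrix hits a target value, without looking at any response variable. The sketch below assumes a truncated-linear basis with a ridge penalty on the knot coefficients and finds the penalty by bisection; the function names, search bounds, knots, and target of 4 degrees of freedom are illustrative assumptions.

```python
import numpy as np


def smoother_df(C, D, lam):
    """Degrees of freedom: trace of the smoother C (C'C + lam D)^{-1} C'."""
    return np.trace(np.linalg.solve(C.T @ C + lam * D, C.T @ C))


def lambda_for_df(C, D, target_df, lo=1e-8, hi=1e8, iters=200):
    """Bisection on the log scale; df decreases as lam increases."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if smoother_df(C, D, mid) > target_df:
            lo = mid  # still too flexible: increase the penalty
        else:
            hi = mid
    return np.sqrt(lo * hi)


# Illustrative design: 100 points, 9 knots (hypothetical values).
x = np.linspace(0.0, 1.0, 100)
knots = np.linspace(0.1, 0.9, 9)
Z = np.maximum(x[:, None] - knots[None, :], 0.0)
C = np.column_stack([np.ones_like(x), x, Z])
D = np.diag([0.0, 0.0] + [1.0] * len(knots))

lam = lambda_for_df(C, D, target_df=4.0)
achieved = smoother_df(C, D, lam)
```

Because no `y` appears anywhere in the calculation, the same penalty and hence the same weights can be reused for every study variable.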

Relative MSE (Fitting to bump)

Relative Bias (Fitting to bump)

Relative Variance (Fitting to bump)

Properties of Model-Assisted Estimator • The penalized spline estimator, $\hat{t}_y$, is a linear operator • It is location and scale invariant, in the sense that $\hat{t}_{a+by} = aN + b\,\hat{t}_y$, provided an intercept is kept in the model and …
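Linearity means the estimator can be written as a weighted sum of the estimated PSU totals, with weights that do not involve the study variable. The sketch below derives such weights for a difference estimator whose predictions come from a design-weighted penalized spline fit. That fitting scheme, the truncated-linear basis, the equal inclusion probabilities, and every name and number here are assumptions chosen for illustration.

```python
import numpy as np


def basis(x, knots):
    """Columns [1, x, (x - k_1)_+, ..., (x - k_K)_+]."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    Z = np.maximum(x[:, None] - knots[None, :], 0.0)
    return np.column_stack([np.ones_like(x), x, Z])


def spline_weights(C_pop, sample_idx, pi, lam, D):
    """Weights w such that the estimator equals w @ t_hat for any variable."""
    C_s = C_pop[sample_idx]
    inv_pi = 1.0 / pi
    # Design-weighted penalized normal equations on the sample.
    A = C_s.T @ (C_s * inv_pi[:, None]) + lam * D
    # Known population basis totals minus their design-based estimates.
    gap = C_pop.sum(axis=0) - C_s.T @ inv_pi
    return inv_pi + inv_pi * (C_s @ np.linalg.solve(A, gap))


rng = np.random.default_rng(1)
x_pop = rng.uniform(0.0, 1.0, size=200)      # auxiliary value for every PSU
knots = np.linspace(0.1, 0.9, 5)
C_pop = basis(x_pop, knots)
D = np.diag([0.0, 0.0] + [1.0] * len(knots))

sample_idx = rng.choice(200, size=40, replace=False)
pi = np.full(40, 40 / 200)                   # equal probabilities, for simplicity
w = spline_weights(C_pop, sample_idx, pi, lam=0.5, D=D)

# The same weights serve two unrelated (made-up) study variables.
t_hat_a = rng.normal(10.0, 2.0, size=40)
t_hat_b = rng.normal(-3.0, 1.0, size=40)
est_a = w @ t_hat_a
est_b = w @ t_hat_b
```

In this sketch the weights reproduce the known population totals of the unpenalized columns (the intercept and `x`), which is why keeping an intercept in the model matters for invariance.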

Properties of Model-Assisted Estimator • Under mild assumptions, the penalized spline estimator, $\hat{t}_y$, is design-consistent for $t_y$ and, suitably centered and scaled, has an asymptotic normal distribution

Properties of Model-Assisted Estimator • Again, under mild assumptions, the estimator • The previous two results lead to:

Summary • Two-stage sampling designs are used frequently in natural resource monitoring and assessment • Samples are often sparse; model-free estimators will have high variance • Model-based estimators make use of auxiliary information and have good properties provided the model is correctly specified • Modeling with p-splines reduces the risk of misspecifying the model • Often, a model can't be fit to every study variable; model-assisted estimators still have reasonably good properties when weights from one model are applied to all study variables
