A balanced Sampling approach for multiway stratification design for small area estimation


  1. A balanced Sampling approach for multiway stratification design for small area estimation Piero Demetrio Falorsi - Paolo Righi ISTAT

  2. Index • The issue of multivariate-multidomain sampling strategy • The proposed sampling strategy • Balanced sample for multiway stratification • Modified GREG estimator • The algorithm for the sample size definition • Application fields and experiments

  3. 1. The issue of multivariate-multidomain sampling strategy When planning a sample strategy for a survey aiming at producing estimates for several domains (defined as non-nested partitions of the population) an issue is to define the sample size so that the sampling errors of domain estimates of several parameters are lower than given thresholds. A sampling strategy is proposed here dealing with multivariate‑multidomain surveys when the overall sample size must satisfy budget constraints. The standard solution of a stratification given by cross-classification of the domain variables is often not feasible because the number of strata can be larger than the overall sample size. Moreover, even if the overall sample size allows covering all the strata, the resulting allocation could lead to an inefficient design.
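The feasibility problem described above can be illustrated with a small sketch. The population, the two partitions (20 regions and 24 activity classes) and the sample size below are hypothetical values chosen only for illustration; the helper name `count_cross_strata` is likewise an assumption.

```python
import numpy as np

def count_cross_strata(*domain_labels):
    """Number of non-empty cells in the cross-classification of the partitions."""
    cells = np.stack(domain_labels, axis=1)   # one row of domain codes per unit
    return len(np.unique(cells, axis=0))      # distinct combinations = cross strata

# Hypothetical population: N units, each assigned to one region and one activity class.
rng = np.random.default_rng(0)
N, n = 5000, 300                              # assumed population and sample sizes
region = rng.integers(0, 20, size=N)
activity = rng.integers(0, 24, size=N)

n_cross = count_cross_strata(region, activity)                     # cross-classified strata
n_margins = len(np.unique(region)) + len(np.unique(activity))      # marginal domains only
print(n_cross, n_margins, n)
```

In this setting the number of cross-classified strata exceeds the sample size, while the number of marginal domains is far smaller, which is the situation multi-way stratification is meant to handle.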

  4. 1. The issue of multivariate-multidomain sampling strategy Population Planned and actual sample with cross-classification stratification

  5. 1. The issue of multivariate-multidomain sampling strategy Example: Business Structural Statistics, with 36,000 cross-classification strata

  6. 1. The issue of multivariate-multidomain sampling strategy Standard strategy: the standard solution for obtaining planned domains adopts a cross-stratified sampling design, combining the domains. Consequences: • when the population size in many strata is small, the stratification scheme can be inefficient; • if the different partitions into domains of interest are not nested, the allocation of the sample in the cross‑classified strata may differ substantially from the optimal allocation for the domains of a given partition; • the sample size needed to cover all strata may be too large for the survey's budget constraints; • in surveys repeated over time, the statistical burden may become excessive if some strata contain only a few population units.

  7. 1. The issue of multivariate-multidomain sampling strategy One possible solution is multi-way stratification. Several sophisticated solutions have been proposed to keep the sample size under control in all the categories of the stratifying variables without using a cross-classification design. These methods are generally referred to as multi-way stratification techniques and have been developed under two main approaches: • Latin squares or Latin lattices schemes (Bryant et al., 1960; Jessen, 1970): independence between rows and columns is assumed, and these methods work only if all the cross-strata exist in the population. • Controlled rounding problems solved via linear programming (Causey et al., 1985; Sitter and Skinner, 1994): these methods are computationally very complex, do not always reach a solution, and do not allow the inclusion probabilities (both simple and joint) to be computed immediately. The main weaknesses of these approaches derive from their computational complexity and from the fact that a solution is not always reached.

  8. 2. The proposed sampling strategy The aim of this work is to define a sampling strategy that is optimal with regard to both the sampling scheme and the estimator used, exploiting the available auxiliary information in both phases: • Define a probabilistic sampling method • Realize a multiway stratification based on balanced sampling, controlling the sample size of the marginal domains • Use a modified GREG estimator • Define the sample allocation, aiming at controlling the sampling errors on the margins, using a variance estimator that takes into account jointly the regression model underlying the GREG estimator and the balanced sampling design • The strategy may take into account a simple (Fay-Herriot) small area estimator. The proposed overall sampling strategy is easy to implement, and software has been developed for each phase. It can be extended to different contexts (considering the anticipated variance or the use of indirect small area estimators). It is possible to develop a sampling strategy for small area estimation that considers the sampling and estimation phases jointly.

  9. 2. The proposed sampling strategy Notation. Denote with: U the population of size N; U_b the b-th partition of U into M_b domains U_bd (b = 1,…,B; d = 1,…,M_b); the value of the r-th (r = 1,…,R) variable of interest in the k‑th population unit; the domain membership indicator; n the overall fixed sample size; the r-th parameter of interest.

  10. 3. Balanced sampling and multi-way stratification Balanced sampling is a class of designs using auxiliary information. Properties have been studied in the • model based approach (Royall and Herson, 1973; Valliant et al., 2000); • design based approach (Deville and Tillé, 2004, 2005). In the following we consider the design based or model assisted approach

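  11. 3. Balanced sampling and multi-way stratification Let us define the sampling design p(·) with inclusion probabilities $\boldsymbol{\pi} = (\pi_1, \dots, \pi_N)'$ as a design which assigns a probability p(s) to each sample s such that $E_p(\boldsymbol{\lambda}) = \boldsymbol{\pi}$, $\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_N)'$ being a vector of sample indicators. Let $\mathbf{z}_k = (z_{k1}, \dots, z_{kQ})'$ be a vector of Q auxiliary variables known for each unit in the population. The sampling design p(s) is said to be balanced with respect to the Q auxiliary variables if and only if it satisfies the balancing equations given by $\sum_{k \in s} \mathbf{z}_k / \pi_k = \sum_{k \in U} \mathbf{z}_k$, $1/\pi_k$ being the sample weight.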
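A minimal sketch of how the balancing condition can be checked for a given sample: the Horvitz-Thompson estimates of the auxiliary totals must reproduce the known population totals. The toy population, the constant inclusion probability of 0.5 and the function name `balance_gap` are assumptions made for illustration.

```python
import numpy as np

def balance_gap(z, pi, sample):
    """HT estimate of the auxiliary totals minus the true totals (zero if balanced)."""
    z = np.asarray(z, dtype=float)
    pi = np.asarray(pi, dtype=float)
    sample = np.asarray(sample)
    ht_totals = (z[sample] / pi[sample][:, None]).sum(axis=0)  # sum over s of z_k / pi_k
    return ht_totals - z.sum(axis=0)                           # minus sum over U of z_k

# Hypothetical population of 6 units, inclusion probability 0.5 for every unit.
pi = np.full(6, 0.5)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 7.0])
z = np.column_stack([pi, x])   # balancing on pi fixes the sample size, x is a size variable

gap_balanced = balance_gap(z, pi, [0, 2, 5])     # x values 1 + 3 + 7 = 11 = 22 / 2
gap_unbalanced = balance_gap(z, pi, [0, 1, 2])   # x values 1 + 2 + 3 = 6
```

Here the first sample satisfies both balancing equations, while the second has the right size but misses the total of `x`.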

  12. 3. Balanced sampling and multi-way stratification Multi-way stratification design can represent a special case of balanced design, when for unit k the auxiliary variable vector is the indicator of its membership in the domains of the different partitions, multiplied by its inclusion probability. The z vector, in this case, is defined as $\mathbf{z}_k = \pi_k \boldsymbol{\delta}_k$, where $\boldsymbol{\delta}_k$ is the vector of domain membership indicators of unit k and $\pi_k$ its inclusion probability. The balancing equations assure that, for each selected sample s, the size of the subsample $s \cap U_{bd}$ is a non-random quantity and is $n_{bd} = \sum_{k \in U_{bd}} \pi_k$.

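  13. 3. Balanced sampling and multi-way stratification For multiway stratification the balancing equations become $\sum_{k \in s} \boldsymbol{\delta}_k = \sum_{k \in U} \pi_k \boldsymbol{\delta}_k = \mathbf{n}$, where $\boldsymbol{\delta}_k$ is the vector of domain membership indicators of unit k and $\pi_k$ its inclusion probability, $n_{bd}$ being the sample size for the d-th domain of the b-th partition and $\mathbf{n} = (n_{11}, \dots, n_{bd}, \dots, n_{BM_B})'$.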
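The multiway case can be sketched as follows: the domain indicators of every partition are stacked into one matrix, the planned domain sizes are the sums of the inclusion probabilities within each domain, and a sample is acceptable when its realized domain counts match them. The two partitions (region and sector), the probabilities and the helper name `domain_indicators` are hypothetical.

```python
import numpy as np

def domain_indicators(*partitions):
    """Stack one-hot domain membership indicators of all partitions column-wise."""
    blocks = [np.eye(labels.max() + 1)[labels] for labels in partitions]
    return np.hstack(blocks)

# Hypothetical population of 8 units classified by 2 regions and 2 sectors.
region = np.array([0, 0, 0, 0, 1, 1, 1, 1])
sector = np.array([0, 1, 0, 1, 0, 1, 0, 1])
pi = np.full(8, 0.5)

delta = domain_indicators(region, sector)       # N x (M_1 + M_2) indicator matrix
planned = (pi[:, None] * delta).sum(axis=0)     # planned size of each marginal domain

def realized(sample):
    """Realized number of sampled units in each marginal domain."""
    return delta[np.asarray(sample)].sum(axis=0)

ok_sample = realized([0, 1, 4, 5])    # two units per region and per sector
bad_sample = realized([0, 2, 4, 6])   # right size per region, all units in sector 0
```

Only the first sample respects every marginal size, even though both contain four units.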

  14. 3. Balanced sampling and multi-way stratification A relevant drawback of balanced sampling has long been the difficulty of implementing a general procedure that yields a multivariate balanced random sample. Deville and Tillé (2004) proposed a sample selection method (the cube method) that draws balanced samples for a large set of auxiliary variables and with respect to different vectors of inclusion probabilities. A free macro for the selection of balanced samples for large data sets may be downloaded (SAS or R routine) from http://www.insee.fr/fr/nom_df_met/outils_stat/cube/accueil_cube.htm Deville and Tillé (2000) show that, with our specification of the auxiliary vectors, the balancing equations can be satisfied exactly, while in general the balancing equations are only approximately respected.
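As a rough illustration of the idea behind the cube method, the sketch below implements a simplified flight phase only: the vector of inclusion probabilities performs a random walk inside the subspace that preserves the balancing equations until no further move is possible. This is not the SAS or R routine mentioned above; the function name `flight_phase`, the tolerance and the toy population are assumptions, and the landing phase is omitted.

```python
import numpy as np

def flight_phase(pi, Z, rng, tol=1e-9):
    """Simplified flight phase: round pi towards 0/1 while keeping sum(z_k / pi_k * p_k) fixed."""
    pi = np.asarray(pi, dtype=float)
    A = (np.asarray(Z, dtype=float) / pi[:, None]).T     # Q x N, columns z_k / pi_k
    p = pi.copy()
    while True:
        free = np.flatnonzero((p > tol) & (p < 1 - tol))  # units not yet decided
        if free.size == 0:
            break
        _, sv, vt = np.linalg.svd(A[:, free], full_matrices=True)
        rank = int(np.sum(sv > tol))
        if rank >= free.size:                             # no direction left in the null space
            break
        u = vt[-1]                                        # direction with A[:, free] @ u = 0
        pf = p[free]
        pos, neg = u > tol, u < -tol
        # Largest steps keeping every coordinate inside [0, 1] in each direction.
        lam1 = min(np.min((1 - pf[pos]) / u[pos], initial=np.inf),
                   np.min(pf[neg] / -u[neg], initial=np.inf))
        lam2 = min(np.min(pf[pos] / u[pos], initial=np.inf),
                   np.min((1 - pf[neg]) / -u[neg], initial=np.inf))
        # Random choice of direction that keeps the expectation of p equal to pi.
        step = lam1 if rng.random() < lam2 / (lam1 + lam2) else -lam2
        p[free] = np.clip(pf + step * u, 0.0, 1.0)
    p[p < tol] = 0.0
    p[p > 1 - tol] = 1.0
    return p

# Hypothetical two-way example: 8 units, 2 regions x 2 sectors, z_k = pi_k * delta_k.
region = np.array([0, 0, 0, 0, 1, 1, 1, 1])
sector = np.array([0, 1, 0, 1, 0, 1, 0, 1])
delta = np.hstack([np.eye(2)[region], np.eye(2)[sector]])
pi = np.full(8, 0.5)
p = flight_phase(pi, pi[:, None] * delta, np.random.default_rng(1))
```

After the flight phase the marginal domain sizes implied by `p` still equal the planned ones, and at most as many entries as there are balancing variables can remain fractional.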

  15. 4. Modified GREG estimator In the context of multi-variate estimation, the r-th parameter of interest is The modified GREG estimator is (through a specific domain weight) The superpopulation working model is
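For orientation, the sketch below computes a standard GREG estimator of a population total, not the modified domain-weighted version used in this strategy: the Horvitz-Thompson estimate is corrected by the regression of the target variable on auxiliary variables with known totals. The function name `greg_total`, the linear working model and the toy data are assumptions.

```python
import numpy as np

def greg_total(y_s, X_s, d_s, tx):
    """Standard GREG estimate of a total: HT(y) + (t_x - HT(x))' beta_hat."""
    XtD = X_s.T * d_s                                  # X' D with D = diag(sample weights)
    beta = np.linalg.solve(XtD @ X_s, XtD @ y_s)       # weighted least squares coefficients
    return d_s @ y_s + (tx - d_s @ X_s) @ beta

# Hypothetical population where y is an exact linear function of x.
x = np.arange(1.0, 11.0)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x

sample = np.array([0, 3, 7])
d = np.full(sample.size, 1 / 0.3)                      # weights 1 / pi_k with pi_k = 0.3
estimate = greg_total(y[sample], X[sample], d, X.sum(axis=0))
```

When the working model fits perfectly, as in this toy case, the GREG estimate reproduces the true total whatever sample is drawn, which is why a good auxiliary model reduces the variance.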

  16. 4. Modified GREG estimator: variance Variance of the Horvitz-Thompson estimator with the balanced sampling Deville and Tillé (2005) proposed an approximation of the variance expression for HT estimator and the overall domain with

  17. 4. Modified GREG estimator: variance Starting from the result by Deville (2005) it is possible to derive the approximate expression of the variance for the modified GREG estimator under balanced sampling being and

  18. 5. The algorithm for the sample size definition In order to calculate the inclusion probabilities it is necessary to fix the sample size for each domain so that the constraints on the sampling errors are satisfied. Considering each marginal partition separately would give a different set of inclusion probabilities for each of them. In our methodology we calculate a single set of inclusion probabilities through a two-step procedure: • Optimisation (calculation of the optimal probabilities) • Calibration (calculation of the “working” probabilities)

  19. 5. The algorithm for the sample size definition Residual term Optimisation: the calculation of the inclusion probabilities (sample size and domain allocation) is carried out with the aim of minimizing the expected sampling errors on several domains and estimates: • Multiple domains • Multiple variables The problem is solved through the system The solution can be obtained through the Chromy algorithm (the one used in the allocation software MAUSS, which can be downloaded from www.istat.it)

  20. 5. The algorithm for the sample size definition Calibration: the optimal inclusion probabilities lead to non-integer values for the domain sample sizes • Rounding of the expected domain sample sizes to the next integer; • Calculation of the “working” probabilities nearest to the optimal ones The problem is defined through the system The solution is obtained by means of the Newton algorithm (with some modifications), the same one used in the calibration software Genesees, which can be downloaded from www.istat.it
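The calibration step can be sketched with iterative proportional fitting (raking), used here as a simple stand-in for the modified Newton algorithm: the optimal probabilities are rescaled domain by domain until their sums match the integer domain sample sizes of every partition. The function name `rake_probabilities`, the toy data and the integer targets are assumptions, and the sketch assumes that the targets of the different partitions add up to the same overall n and that no probability is pushed above 1.

```python
import numpy as np

def rake_probabilities(pi_opt, memberships, targets, n_iter=200, tol=1e-10):
    """Rescale probabilities so that their sums match the target size of every marginal domain."""
    pi = np.asarray(pi_opt, dtype=float).copy()
    targets = [np.asarray(t, dtype=float) for t in targets]
    for _ in range(n_iter):
        max_gap = 0.0
        for labels, n_b in zip(memberships, targets):
            current = np.bincount(labels, weights=pi, minlength=len(n_b))  # sums per domain
            max_gap = max(max_gap, np.abs(current - n_b).max())
            pi *= (n_b / current)[labels]        # proportional adjustment within each domain
        if max_gap < tol:                        # all margins already matched before adjusting
            break
    return pi

# Hypothetical population: 12 units, 3 regions x 2 sectors, optimal probabilities of 0.3.
region = np.repeat([0, 1, 2], 4)
sector = np.tile([0, 1], 6)
pi_opt = np.full(12, 0.3)                        # margins 1.2 per region, 1.8 per sector
working = rake_probabilities(pi_opt, [region, sector], [[2, 1, 1], [2, 2]])
```

The resulting working probabilities sum to the integer targets within every region and every sector, so they can be passed to a balanced selection procedure.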

  21. 6. Application fields and experiments Artificial data Population: contingency table. Variable for the allocation and estimation model

  22. 6. Application fields and experiments Artificial data Compared sampling designs and expected CV (%)

  23. 6. Application fields and experiments Real data A simulation on real enterprise data (N=10,392) has been carried out to evaluate the effects of planning the sample size for small domains of estimation (Falorsi et al., 2006): • U1 partition: regions (20 domains); • U2 partition: economic activity by size class (24 domains); • Cross-classification strata containing population units: 360; • Variables of interest: value added and labour cost; • the sample sizes of the U1 and U2 partitions have been planned separately by means of a compromise allocation; • the 2 allocations guarantee a CV of 34.5% for U1 and 8.7% for U2 with regard to the variable number of employers (supposed known at the sampling stage); • the overall sample size is n=360

  24. 6. Application fields and experiments Real data The experiment examines a situation characterizing many real survey contexts in which the overall sample size n is fixed and the marginal sample sizes are determined by a quite simple rule being a compromise between the Allocation Proportional to Population size (APP) and the allocation uniform for each domain of a given partition: The probabilities of both designs for U1 and U2 partitions have been obtained as solution of the calibration problem below where the initial probabilities are set uniformly equal to
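One way to picture such a compromise rule is a convex combination of the proportional and the uniform allocation. The combination form, the weight `alpha` and the function name `compromise_allocation` are assumptions made for illustration and are not taken from the experiment itself.

```python
import numpy as np

def compromise_allocation(N_d, n, alpha=0.5):
    """Mix of proportional-to-size and uniform allocation over the domains of one partition."""
    N_d = np.asarray(N_d, dtype=float)
    proportional = n * N_d / N_d.sum()             # allocation proportional to population size
    uniform = np.full(len(N_d), n / len(N_d))      # same sample size in every domain
    return alpha * proportional + (1 - alpha) * uniform

# Hypothetical partition with three domains and an overall sample of 60 units.
sizes = compromise_allocation([100, 300, 600], n=60, alpha=0.5)
```

With `alpha` equal to 1 the rule reduces to the proportional allocation and with `alpha` equal to 0 to the uniform one; intermediate values protect small domains while still following the population distribution. The resulting sizes are generally not integers.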

  25. 6. Application fields and experiments: Real data

  26. 7. Extension to the Fay Herriot Model

  27. 7. Extension to the Fay Herriot Model

  28. 7. Extension to the Fay Herriot Model