Statistics Canada’s Small Area Estimation Product: BUPF 1.0 (Best Unbiased Prediction via Filtering). SAE-SPORD Project Team Statistics Research and Innovation Division Statistics Canada, Ottawa (for presentation to FLMM_LMIWG Workshop on Oct 17, 2007, Vancouver, BC).
Project: SAE-SPORD (Small Area Estimation for Statistical Product Oriented R&D) Team: Avi Singh (Project Leader), François Verret, Claude Nadeau, Pin Yuan Acknowledgments: Meth Res Block Fund, Labour Stat Div, FLMM-LMIWG
Outline 1. SAE: Introduction 2. SAE: Visual Depiction 3. Product BUPF: Description 4. BUPF Application to Labour Force Survey 5. BUPF Demonstration (GUI Sample Screen-shots) 6. Concluding Remarks and Future Work
1. SAE: Introduction • Direct estimates for small areas (or domains) not reliable; e.g., for provinces, annual LFS estimates of Managers in Manufacturing and Utilities (a three-digit occupation code A39) are not reliable. Here provinces could be deemed as small areas. • Data Requirements: Provincial estimates of employment by 3-digit occupation codes
1. SAE: Introduction …cont. • Need more sample to get more reliable estimates • A cost-effective alternative: use a model such as the common mean model; e.g., assume the proportion employed in A39 is common across provinces • Quality of estimates depends on the validity of the model.
1. SAE: Introduction …cont. • Model provides an indirect (or synthetic) estimate at the area level. • For the common mean model, multiply the national total by the provincial population proportion to get the indirect estimate, e.g., for NL: 1.7% × 92,734 ≈ 1,582
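The common mean calculation on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `synthetic_estimate` and its parameter names are hypothetical, and the inputs are the rounded figures quoted on the slide. With the rounded 1.7% share the product is about 1,576; the slide's 1582 presumably reflects an unrounded population share.

```python
def synthetic_estimate(national_total: float, area_pop_share: float) -> float:
    """Indirect (synthetic) estimate under the common mean model.

    Hypothetical helper: multiplies the national total by the area's
    share of the population, as described on the slide.
    """
    if not 0.0 <= area_pop_share <= 1.0:
        raise ValueError("area_pop_share must be between 0 and 1")
    return national_total * area_pop_share


# Figures quoted on the slide for NL (the 1.7% share is rounded).
nl_indirect = synthetic_estimate(92_734, 0.017)
print(round(nl_indirect, 1))
```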
1. SAE: Introduction …cont. • A combination of the two estimates (direct and indirect) may provide a reasonable estimate with adequate precision depending on the level of small area. • The direct estimate is not precise but unbiased, while the indirect estimate is generally precise but not unbiased.
1. SAE: Introduction …cont. • SAE combines the direct and the indirect estimates in an optimal way: • SAE for area d = (shrinkage factor for d) × (direct estimate for d) + (1 − shrinkage factor for d) × (indirect estimate for d) • If the shrinkage factor is 10%, the SAE takes only 10% of the direct estimate and 90% of the indirect estimate. If it is 50%, the direct and indirect estimates have equal say in the composite.
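The composite formula above can be written as a short Python sketch. The function name `composite_estimate` is hypothetical, and the direct estimate of 2,000 is an invented illustrative value (the slides do not give one); the indirect value 1,582 is the NL figure quoted earlier.

```python
def composite_estimate(direct: float, indirect: float, shrinkage: float) -> float:
    """Weighted combination of a direct and an indirect estimate.

    shrinkage is the weight given to the direct estimate (0 to 1).
    """
    if not 0.0 <= shrinkage <= 1.0:
        raise ValueError("shrinkage must be between 0 and 1")
    return shrinkage * direct + (1.0 - shrinkage) * indirect


direct_nl = 2000.0    # hypothetical direct estimate, for illustration only
indirect_nl = 1582.0  # indirect estimate quoted on the earlier slide

sae_10 = composite_estimate(direct_nl, indirect_nl, 0.10)  # mostly indirect
sae_50 = composite_estimate(direct_nl, indirect_nl, 0.50)  # equal weights
print(sae_10, sae_50)
```

A shrinkage factor of 0 returns the indirect estimate unchanged and a factor of 1 returns the direct estimate unchanged.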
1. SAE: Introduction …cont. • The relative size of the shrinkage factor depends on variability in modeling error (in the indirect estimate) and sampling error (in the direct estimate). • Effective sample size for SAE is more than that for the direct estimate.
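One common way to make this dependence concrete is the weight used in standard area-level models (of the Fay-Herriot type): the model error variance divided by the sum of the model error variance and the sampling variance. The slides do not state that BUPF uses exactly this expression, so the sketch below is an assumption, and the name `shrinkage_factor` is hypothetical.

```python
def shrinkage_factor(model_var: float, sampling_var: float) -> float:
    """Weight on the direct estimate in a standard area-level model.

    Assumed form: model_var / (model_var + sampling_var). A noisy direct
    estimate (large sampling_var) receives a small weight.
    """
    if model_var < 0 or sampling_var < 0:
        raise ValueError("variances must be non-negative")
    total = model_var + sampling_var
    if total == 0:
        raise ValueError("at least one variance must be positive")
    return model_var / total


print(shrinkage_factor(1.0, 1.0))  # equal variances give equal weights
print(shrinkage_factor(1.0, 9.0))  # noisy direct estimate gets little weight
```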
1. SAE: Introduction (Modeling Requirements) • Direct estimates from other small areas (termed indirect data) are needed for modeling purposes; i.e., for predicting the estimate for the area of interest. • Need enough small areas for adequate modeling. Subdivide provinces into subprovincial areas: • ER, or ER by age by gender, instead of province, although it is the province level that is of interest.
1. SAE: Introduction (Modeling Requirements…cont.) • Beneficial to have an Auxiliary Information Source (Administrative/Census): need true population totals at the area level for all areas. • Using an auxiliary source can improve modeling with the indirect data.
1. SAE: Introduction (Modeling Requirements…cont.) • Examples of Auxiliary Information for LFS Application • Administrative source: number of employment beneficiary claims at the area level; number with employment income • Population Census based demographic projections: subpopulation counts
1. SAE: Introduction (Modeling Requirements…cont.) • The model predictor based on indirect data and auxiliary data provides an indirect estimate for the area of interest. • The model can be simple, such as the common mean model, which doesn’t use any auxiliary data, or it can be more advanced.
1. SAE: Introduction (Modeling Requirements…cont.) • All indirect estimates are biased, but the bias can be low if the model is good. • Combining direct and indirect estimates gives rise to estimates more precise than either one. • Benchmarking (the sum of small area total estimates within a subgroup of areas equals the direct estimate of the subgroup) helps in reducing model bias.
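The benchmarking condition can be illustrated with a simple ratio adjustment applied after estimation. This is a swapped-in, simpler technique used only to show the constraint; BUPF is described as self-benchmarking, with the constraint handled within the modeling itself, and the slides do not give its method. The function name and all numbers below are hypothetical.

```python
def ratio_benchmark(sae_totals: list[float], direct_group_total: float) -> list[float]:
    """Scale small area totals so they sum to the group's direct estimate."""
    current_sum = sum(sae_totals)
    if current_sum == 0:
        raise ValueError("small area totals sum to zero; cannot rescale")
    factor = direct_group_total / current_sum
    return [factor * total for total in sae_totals]


# Hypothetical SAEs for three areas in one subgroup, and the subgroup's
# direct estimate.
sae_totals = [120.0, 80.0, 50.0]
benchmarked = ratio_benchmark(sae_totals, 275.0)
print(benchmarked, sum(benchmarked))
```

Every area is scaled by the same factor, so relative sizes among areas are preserved while the subgroup total matches the direct estimate.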
1. SAE: Introduction (User Concerns) • Detailed area-level requirements may vary from user to user. • However, cannot go to a very low level for two reasons: precision of SAEs may not be adequate, and auxiliary data may not be available. • Bias concerns due to use of indirect estimates for borrowing information; models may not be perfect, but one chosen with care may be useful. • SAE methodology involves a trade-off between bias and precision
1. SAE: Introduction (User Concerns…cont.) • External validation of SAE can be done periodically using the census. • Also, validation by ‘local area’ knowledge • Confidentiality concerns (this may or may not be a problem: the smaller the area, the larger the error in the SAE, which gives built-in protection)
2. SAE: A Visual Depiction For Employment in A39 • However, with the usual SAE model the overall total is not preserved!
2. SAE: A Visual Depiction...cont. For Employment in A39 • Benchmarking ensures that the total stays the same after modeling
3. Product BUPF: Description • STC’s SAE product based on the client need identification (re: SAE Workshop in Feb ’05; see www.flmm-lmi.org for proceedings) • Main Features • Menu-driven software system • Sampling design is fully taken into account • Self-benchmarking for protection against model breakdowns • Area collapsing to include areas with no or few observations in the modeling process • Extensive model diagnostics and evaluation of estimates • Existing software packages (such as SAS PROC MIXED, MLwiN, WinBUGS) are not satisfactory
3. Product BUPF 1.0: Description • Part I : Data Preparations • Part II: Modeling Preparations • Part III: Model Selection and Diagnostics • Part IV: Small Area Estimation and Evaluation • Part V: Summary Report
4. BUPF Application to LFS • Empirical results presented here are not yet final. • Two main components of the product • Modeling component (for increasing effective sample size) • Estimation component (combining direct and indirect)
4. BUPF Application to LFS…cont • Model: Direct Estimate for Area d = True value + sampling error • True Value= Predictor + Model error • Predictor = x1β1+ x2β2+…; it gives rise to indirect or synthetic estimates. • X-variables considered: # reported income, # employment beneficiary, age-sex counts, etc. all at the small area level
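The two-level model on this slide can be written compactly as follows. The symbols are assumptions introduced here for illustration and do not appear in the slides: y_d is the direct estimate for area d, θ_d the true value, e_d the sampling error, v_d the model error, and x_{dk} the k-th auxiliary variable for area d.

```latex
\begin{align}
  y_d      &= \theta_d + e_d, \\
  \theta_d &= x_{d1}\beta_1 + x_{d2}\beta_2 + \dots + v_d .
\end{align}
```

Under this notation the indirect (synthetic) estimate is the predictor part, x_{d1}β_1 + x_{d2}β_2 + …, with the coefficients replaced by their estimates.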
5. STC’s SAE Product Demonstration BUPF 1.0 Demo
6. Concluding Remarks and Future Work • Several unique features in the BUPF product for SAE such as self-benchmarking, domain collapsing for nonsampled domains, and extensive diagnostics. • The Graphical User Interface (GUI) for the product is useful as a systematic checklist or as a virtual analyst for efficient production; also useful for training and product demonstration.
6. Concluding Remarks and Future Work…cont. • Complete the beta version of BUPF 1.0; the current version is only an alpha (prototype) and is not suitable for production. • Plan for a validation study with Census 2006.
For more information, please contact avi.singh@statcan.ca Thank you…Merci
Appendix Product BUPF 1.0: Detailed Description
A1. Product BUPF 1.0: Description • Part I : Data Preparations • M1 : Data Specification • M2 : Task Specification • The definition of Small Area Modeling domains (SAM domains) is very important • Direct estimates, population counts and auxiliary data must be available at this level • # of SAM domains should be high enough for proper modeling • Here, SAM domain = ER(73) by Age(4) by Gender(2)
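The size of this cross-classification can be checked with a short Python sketch. It reads "ER(73) by Age(4) by Gender(2)" as 73 ER categories, 4 age groups and 2 genders; the labels used below are hypothetical placeholders, since the slides do not list the actual categories.

```python
from itertools import product

# Hypothetical labels; only the category counts (73, 4, 2) come from the slide.
er_codes = [f"ER{i:02d}" for i in range(1, 74)]
age_groups = ["age1", "age2", "age3", "age4"]
genders = ["g1", "g2"]

# Each SAM domain is one (ER, age group, gender) combination.
sam_domains = list(product(er_codes, age_groups, genders))
print(len(sam_domains))
```

Direct estimates, population counts and auxiliary data would each need one value per element of `sam_domains`.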
A2. Product BUPF 1.0: Description • Part II : Modeling Preparations • M3 : Benchmark Constraints & Baseline Model • Self-benchmarking is important to protect against model breakdowns as no model is perfect • Options: no benchmark constraint (BC), global BC, regional BC • M4 : Domain Collapsing • Improved alternative to leaving SAM domains with small sample sizes outside of the model • M5 : Variance Smoothing
A3. Product BUPF 1.0: Description • Part III : Model Selection and Diagnostics • M6 : Model Selection • Standard Forward and Backward procedures implemented • M7 : Variance Component • Needed to find the proper shrinkage to move indirect to direct • M8 : Innovation Sequence • Makes it possible to diagnose the model with standard “iid N(0,1)” error tests • M9 : Model Diagnostics • Residual Plots, QQ-plots, R-square, Chi-square test for overdispersion and for model adequacy…
A4. Product BUPF 1.0: Description • Part IV : Small Area Estimation and Evaluation • M10 : Small Area Estimation • M11 : Evaluation of Estimates • Check for relative difference between direct and SAE • Other measures
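The relative-difference check in M11 can be sketched as below. The slides do not give the exact definition BUPF uses, so this assumes the common form (SAE minus direct, divided by direct); the function name and the example values are hypothetical.

```python
def relative_difference(direct: float, sae: float) -> float:
    """Relative difference of the SAE from the direct estimate.

    Assumed definition: (sae - direct) / direct.
    """
    if direct == 0:
        raise ValueError("direct estimate is zero; relative difference undefined")
    return (sae - direct) / direct


# Hypothetical direct estimate and SAE for one area.
rel_diff = relative_difference(2000.0, 1791.0)
print(f"{rel_diff:.2%}")
```

Large relative differences for areas with reasonably precise direct estimates would be a natural place to look for model problems.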
A5. Product BUPF 1.0: Description • Part V : Summary Report • M12 : Overall Summary • Sampling Design and Data Sources (Part I) • Input Diagnostics (Part II) • Modeling Diagnostics (Part III) • Output Diagnostics (Part IV)