420 likes | 432 Views
Learn about the novel integrated selective editing system used by Spain's National Statistical Institute to enhance the quality of data collected in the Farm Structure Surveys (FSS). This process combines selective editing with the Banff automatic system, providing valuable insights into the intricacies of data processing in the agricultural sector. Discover how this approach improves accuracy and reliability in agricultural data analysis.
E N D
Using Selective Editing Combined with an Automatic System in the FSS of SpainDolores LorcaNational Statistical Institute of Spain
Summary • An integrated editing process that combines selective editing and the generalized edit and imputation system Banff • We use Banff to detect the suspicious units and a score function for selective editing • Spanish FSS: the different types of data (crop, livestock, employment) contribute to the complexity of this process • Some results obtained from the traditional microediting approach and from selective editing are compared
Traditional microediting approach • The subject matter expert specifies the edits • The processing department makes tailored-made programs for each survey to detect the edit failures • The edit failures are manually reviewed
New integrated edit and imputation process 1) Initial editing prior to selective editing 2) Selective editing procedure 3) Automatic system process (BANFF)
1) Initial editing prior to selective editing Controls of consistence are established in the data collection phase carried out by interviewers
2) Selective editing procedure Score functions are built to determine and prioritize the survey suspect units to be reviewed manually due to their significant weight on the final estimates
3) Automatic system process (BANFF). The automatic system process is carried out using the generalized system Banff, developed by Statistics Canada
Study case:Farm Structure Survey (FSS) The Spanish FSS collects different types of data such as: • Utilised agricultural land • Cultivated Land by kind of crop • Types of livestock • The structure and the amount of farm employment • Machinery and equipment
The main characteristics of the FSS • FSS is carried out every 2 years • It consists of a farm panel drawn from the last Agrarian Census • The sample design is a single stage design with stratification of the farms according to geographical area, type of farming (TF) and size • Data collection is carried out by interviewers
FSS Estimators: total estimate of the jth variable in stratum h Fh is the sample weight for the stratum h nh is the sample size in stratum h Xhji denotes the jth variable value for the sampled unit i in stratum h.
Initial editing prior to selective editing • Initial editing is carried out by interviewers in the NSI’s provincial offices • In this phase, all fatal errors are corrected Most of these fatal errors come from balance edits
Selective editing procedure • The goal: To select the survey units with suspicious values that may have a significant effect on survey estimates • Key variable chosen: Utilized Agricultural Land (UAL), Cultivated Land (CL), Woody Crops (WCs), Olive Grove (OG),Vineyard (VY), Animal Units (AU), Annual Labour Units (ALU)
Selective editing: crop variables • Relative stability over time • Anomalous variations, from the previous year to the current one, can be a sign of data errors • We determine the units with anomalous and significant variations of the selected cropvariables
Steps of selective editing procedure: crop variables 1) In each stratum, we obtain the units with anomalous variations with respect to the previous period of the analyzed variables, using the Hidiroglou-Berthelot (1986) method of outlier detection (PROC OUTLIER of Banff system) 2)The units for manual editing are selected among the outliers identified previously having a significant weight on the population total estimates using a score function
(1) step: Hidiroglou-Berthelot method PROC OUTLIER of BANFF system
(1) step: Hidiroglou-Berthelot method Effect ehji for each unit i: ehji=shji(max(Fht-1xhjit , Fht-1xhjit-1))exp exp=1
(1) step: Hidiroglou-Berthelot method M, Q1,Q3: median, the first quartile and the third quartile of the transformed ehji values of the variable being processed dQ1=max(M-Q1,|A*M) dQ3=max(Q3-M,|A*M)
(1) step: Hidiroglou-Berthelot method (M-C dQ1 ,M+CdQ3) C=5
(2) step: scaled local score function (Latouche and Berthelot 1992):
Setting threshold value (Lawrence and Mckenzie 2000): ahj is the threshold value of the jth variable in stratum h, SE(Xhj) is the standard error of the jth variable in stratum h, nh is the sample size in stratum h and k is a value such as :
Using the Lawrence and Mckenzie formula ensures that the bias due to not editing some of the survey units is less than k% of the variance of the estimate. The value of k is set to 10%
Within eachs stratum,the values Δhji are sorted in descending order • Then, the outliers with score Δhji > ahj are selected for manual editing
Selective editing: employment variable • ALU variable • One ALU is equivalent to the work carried out by one person on a full-time basis over one year • Using auxiliary information to estimate the expected amended value:the ratio between the employment number in agriculture obtained in t and t-2 through the Force labour Survey (FLS)
Selective Editing: livestock variables • The FSS collects the existing livestock in the farm on the day of the interview • A farm can have a strong livestock variation depending on the interview date The selective editing procedure for livestock is different to the rest of variables
Animal Units (AU) Livestock data are expressed in AUs which are obtained by applying a coefficient to each species and type in order to group different species in one common unit
Steps of Selective Editing procedure: livestock 1) Units that fail some of the edits, which are specified in the traditional microediting approach, are selected as suspicious units 2) For each suspicious unit or edit failure, an estimate of the expected amended response of AU variable is calculated 3) We determine, among the suspicious units detected at the previous step, those units with a significant weight on the total estimate of the AU variable
Edits specified in the traditional microediting approach • yhji < chj • yhji is the jth variable (types of livestock) for the unit i in stratum h • chj is a constant determined by the historical empirical distributions
Estimate of the expected amended response of AU variable: • chj expressed in AU, i.e. x’hji • Magnitude of failure for the suspicious unit i • ehji=xhji-x’hji
The threshold is calculated using the Lawrence and Mckenzie formula as in previous cases • Within each stratum,the values Δhji are sorted in descending order • Then, the edit failures with score Δhji > ahj are selected for manual editing
Global score function Ghi=max j(Δhji )
Macroediting and selective editing approach • In first place, a selection of the strata with the largest variation with respect to the previous period of the analysed variables is carried out • After, the steps of selective editing procedure are applied only to the farms of the selected strata
threshold value for the strata • In each region, the hj values are sorted in descending order. • We determine a threshold value, j* and strata with hj >j* are selected This threshold value is set to 3%.
Results • Farm number: 3690 • We compare the results obtained for the following editing procedures: • (A) Traditional microediting approach • (B) Selective editing procedure • (C) Macroediting and selective editing approach
Table 2:Change rates of total estimate for the CL variable (%)
Further research • Banff will be applied to the rest of units that have not been edited in the selective editing procedure • Different methods of imputation will be tested
Final remarks • Integrating the PROC OUTLIER of Banff to detect suspicious units and a score function to select units for manually editing has been useful in the Spanish FSS • Reduction in cost and processing time would be attained using this approach • Response burden is reduced from carrying out less number of recontacts