Q2010 Helsinki
Integrating databases over time: what about representativeness in longitudinal integrated and panel data?
Silvia Biffignandi, Bergamo University (silvia.biffignandi@unibg.it)
Alessandro Zeli, Istat (zeli@ista.it)

Outline
• The problem and the research objectives
• Description of two longitudinal databases
• Quality analyses of these databases
• Conclusions

The problem and the research objectives
• National statistical institutes (NSIs) usually carry out business surveys at different points in time, using different samples for different surveys as well as different samples over time
• Users need more and more statistical information
• New strategies are required

The problem and the research objectives
User needs for longitudinal data:
• a) understanding aggregate changes in a variable, such as the employment rate, over time
• b) studying the time-varying economic characteristics (such as employment) of an individual

... and the research objectives
Our task:
• to construct two longitudinal databases, based on various sources and on different criteria
• to verify the consistency between estimates based on the databases and population data

Description of two longitudinal databases
• IDB (technically integrated database)
• Panel

Description of two longitudinal databases
• microdata
• target population: enterprises with 20 employees or more (40% in terms of employment and 60% in terms of value added)
• variables: balance sheet data; SBS regulation data
• period: 1998-2004

Description of two longitudinal databases
1) IDB (technically integrated database)
IDB data by source (percentage by source), years 1998-2004
Codes description:
• ril = 9: only BIL
• ril = 5: PMI non-respondents, data integrated from the BIL source
• ril = 3: SCI non-respondents, data integrated from the BIL source
• ril = 2: SCI non-respondents, donor imputation
• ril = 1: SCI respondents
• ril = 0: PMI respondents

Description of two longitudinal databases
2) Panel
• a catch-up panel database
• it takes business transformations into account
• integrity criterion: all variables in the panel have to be present for all enterprises over the whole panel period

Description of two longitudinal databases
2) Panel
• Step 1: enterprises (with at least 20 persons employed) that responded to the SCI-PMI surveys in the starting year (1998), plus all enterprises with at least 100 employees (even if non-respondents) if the BIL source is available (integration)
• Step 2: the continuity criterion is applied to the enterprises selected in step 1
• Step 3: persistence criterion (i.e. enterprises that responded in 1998 or have data in the BIL for at least 4 years are included in the panel)

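The selection steps on this slide can be sketched as a simple filter. This is only one possible reading of the criteria, not the authors' procedure: all field names are hypothetical, "BIL source is available" is assumed to mean BIL data in the starting year, and step 2 (continuity through business transformations) is left out because the slides do not describe it in enough detail.

```python
from dataclasses import dataclass, field

START_YEAR = 1998
PANEL_YEARS = range(1998, 2005)  # 1998-2004, the period stated on the slides


@dataclass
class Enterprise:
    # Hypothetical record layout; it only mirrors the criteria named on the slide.
    ident: str
    employed_start: int      # persons employed in the starting year
    responded_start: bool    # respondent to the SCI-PMI surveys in the starting year
    bil_years: set = field(default_factory=set)  # years with BIL (balance sheet) data


def step1_candidate(e):
    """Step 1: respondents with >= 20 persons employed, plus units with >= 100
    employees that have BIL data (assumed here: in the starting year)."""
    respondent = e.responded_start and e.employed_start >= 20
    integrated = e.employed_start >= 100 and START_YEAR in e.bil_years
    return respondent or integrated


def step3_persistent(e, min_bil_years=4):
    """Step 3: respondent in the starting year, or BIL data for at least 4 panel years."""
    years_in_period = [y for y in e.bil_years if y in PANEL_YEARS]
    return e.responded_start or len(years_in_period) >= min_bil_years


def build_panel(enterprises):
    """Return the identifiers that pass steps 1 and 3 (step 2 is not modelled)."""
    return [e.ident for e in enterprises if step1_candidate(e) and step3_persistent(e)]


firms = [
    Enterprise("A", 25, True),
    Enterprise("B", 150, False, {1998, 1999, 2000, 2001}),
    Enterprise("C", 150, False, {1998}),
    Enterprise("D", 15, True),
]
print(build_panel(firms))
```

In this toy input, A enters as a respondent, B enters through BIL integration with four years of data, C fails the persistence criterion, and D is below the size threshold.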
Quality analyses
Verify that the population structure is the same across the different databases (especially the sectoral composition).
We apply two different approaches:
• a statistical analysis of the differences between the distributions of some important variables in IDB/panel and in the universe;
• an index of representativeness related to the main categorical variables.

Quality analyses: Difference between distributions 1.a)
1.a) Spearman's rank correlation for the distributions of value added, persons employed and turnover across economic divisions, years 1998-2004:
• IDB and universe: minimum 95.2, maximum 99.8
• panel and universe: minimum 90.9, maximum 97
• In both cases the correlation is very high in each year.
• The measure gives only an ordinal ranking of the relative changes among the economic divisions.

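As an illustration of the comparison on this slide, the sketch below computes Spearman's rank correlation between division shares in the universe and in a database. The shares are made-up numbers, not the authors' data, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share this 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Hypothetical value-added shares of five economic divisions.
universe_shares = [0.30, 0.25, 0.20, 0.15, 0.10]
panel_shares = [0.25, 0.30, 0.22, 0.13, 0.10]
rho = spearman(universe_shares, panel_shares)
print(round(rho, 3))
```

Because only ranks enter the calculation, two databases can correlate highly even when the share levels differ, which is the limitation noted in the last bullet.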
Quality analyses: Difference between distributions 2.a)
2.a) Fligner-Policello test of stochastic equality for the distributions of shares of turnover, value added and employment by division of economic activity, years 1998-2004:
• IDB vs universe
• panel vs universe
• The test is not significant for any variable in any year in the panel.

Quality analyses: Representativeness indexes 2.)
R-indexes (representativeness), RISQ project (see, for instance, Schouten and Cobben, 2007; Shlomo et al., 2009).
They support quality comparisons across surveys or registers by comparing the response:
• to different surveys that share the same target population
• to a survey during data collection
• to a survey longitudinally

Quality analyses: Representativeness indexes 2.)
a) Weak representativity R2 indicator, i.e. the average response propensity over the categories is constant
Response probability estimation is required: usually a logistic regression model
In our study the auxiliary variables are:
• industrial division as classified in NACE Rev. 1.1 (2-digit sectors)
• 3 size classes: 20 to 49, 50 to 249, and 250 and over persons employed

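A minimal sketch of how such an indicator can be computed, assuming the definition common in the RISQ literature, R = 1 - 2 S(ρ), where S(ρ) is the standard deviation of the estimated response propensities. Whether the R2 indicator on the slide coincides exactly with this form is an assumption. Propensities are estimated here by plain cell rates, which is what a logistic model saturated in the categorical auxiliary variables would fit, not necessarily the authors' model. The cell labels and counts are invented.

```python
import math


def r_indicator(cells):
    """R = 1 - 2 * S(rho), with one propensity per population unit.

    cells maps a stratum label (e.g. sector x size class) to a pair
    (population size, number of units present in the database).
    """
    propensities = []
    for population, included in cells.values():
        rate = included / population            # cell-level propensity estimate
        propensities.extend([rate] * population)
    n = len(propensities)
    mean = sum(propensities) / n
    variance = sum((p - mean) ** 2 for p in propensities) / (n - 1)
    return 1.0 - 2.0 * math.sqrt(variance)


# Hypothetical counts for three size classes.
cells = {
    "20-49": (600, 240),
    "50-249": (300, 180),
    "250+": (100, 80),
}
print(round(r_indicator(cells), 3))
```

Values close to 1 mean the propensities hardly vary across cells; the more the inclusion rates differ between cells, the lower the indicator.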
Quality analyses: Representativeness indexes b)
b) If X is an auxiliary variable with H classes, a marginal indicator (MR2) is proposed (formula not preserved in this transcript); its definition uses the centralised regression parameter for category h.

Quality analyses: Representativeness indexes b)
R2 index and lower bound

Quality analyses: Representativeness indexes b)
Marginal R-index by enterprise size (panel years 1998, 2001, 2004)

Quality analyses: Representativeness indexes b)
Marginal R-index by section of economic activity (panel years 1998, 2001, 2004)

Quality analyses: Representativeness indexes 2.)
• R2 indexes are very high in each year included in the panel
• the marginal R-index remains essentially the same, with a slight decrease over the period
• medium-large enterprises (50 persons employed and over) are overrepresented
• small enterprises (between 20 and 49 persons employed) are underrepresented
• the service sector is underrepresented
Summing up:
• we are quite confident that the level of representativeness is appropriate in the global context.

Concluding remarks
• the integration of administrative data is promising for longitudinal database construction
• IDB and panel estimates are satisfactory
• the panel allows a gain of information at reasonable cost and effort: for instance, grouping enterprises according to classifications selected by the user (gazelles, or best performers) other than the ordinary classifications used in the sample design (economic activity, size, geographical area)
Further research
• more representativity indicator analyses
• panel update criteria

Thank you for your attention!

R-indexes (representativeness), RISQ project, Schouten and Cobben (2007)
• $s_i$ is a selection indicator that takes value 1 if unit $i$ is selected in the sample and 0 otherwise
• $\pi_i$ is the first-order inclusion probability
Weak representativity (R2 indicator): a response subset is representative for a categorical variable X with H categories if the average response propensity over the categories is constant:
$$\bar{\rho}_h = \frac{1}{N_h} \sum_{k=1}^{N_h} \rho_{h,k} = \bar{\rho}, \qquad h = 1, \dots, H,$$
where $N_h$ is the population size of category h, $\rho_{h,k}$ is the response propensity of unit k in class h, and the summation is over all units in this category.

R-indexes (representativeness), RISQ project, Schouten and Cobben (2007)
The response probability
Auxiliary variables in our study are:
• industrial division as classified in NACE Rev. 1.1 (2-digit sectors)
• 3 size classes: 20 to 49, 50 to 249, and 250 and over persons employed

R-indexes (representativeness), Schouten and Cobben (2007)
Bias (formulas not preserved in this transcript)

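For orientation, the relation between the R-indicator and nonresponse bias that is usually quoted in the R-indicator literature is reproduced below. It is an assumption that the slide showed this expression. Here $\bar{y}_r$ is the respondent mean of a survey variable $y$, $C(y, \rho)$ is the population covariance between $y$ and the response propensities, $S(\cdot)$ denotes a population standard deviation, $\bar{\rho}$ is the average propensity, and $R(\rho) = 1 - 2S(\rho)$.

```latex
\[
  \bigl| B(\bar{y}_r) \bigr|
  \;\approx\; \frac{\lvert C(y, \rho) \rvert}{\bar{\rho}}
  \;\le\; \frac{S(\rho)\, S(y)}{\bar{\rho}}
  \;=\; \frac{\bigl(1 - R(\rho)\bigr)\, S(y)}{2\,\bar{\rho}}
\]
```

The inequality is the Cauchy-Schwarz bound on the covariance, so a value of R close to 1 caps the worst-case bias for any variable $y$.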
Fligner-Policello test
• Task: verify the equality of the distributions, for IDB or panel versus the universe, of the relative shares of the economic divisions in the totals of the three variables considered, for each year.
• The Fligner-Policello test requires no assumptions of normality, of equal variances, or that the two distributions have a similar shape.
• It is a test of stochastic equality between two distributions; rejecting the null hypothesis means that the two distributions differ in probability.
• If the null hypothesis is rejected, the sign of the F-P statistic indicates which of the two distributions is dominant: a positive sign means that panel shares have a higher probability of taking greater values than the population shares.
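The test statistic can be sketched in a few lines using the textbook placement form. This is not the authors' code: the function names and the sample shares are hypothetical, ties are counted as one half, and the p-value relies on the large-sample normal approximation, which is rough for small samples. With the universe passed as x and the panel as y, a positive statistic follows the sign convention described on the slide.

```python
import math


def placements(a, b):
    """For each value in a, count the values in b below it (a tie counts one half)."""
    return [sum(1.0 if v < u else 0.5 if v == u else 0.0 for v in b) for u in a]


def fligner_policello(x, y):
    """Return (statistic, two-sided p-value); positive when y tends to exceed x."""
    p = placements(x, y)  # placements of the x values among y
    q = placements(y, x)  # placements of the y values among x
    p_bar = sum(p) / len(p)
    q_bar = sum(q) / len(q)
    v1 = sum((pi - p_bar) ** 2 for pi in p)
    v2 = sum((qj - q_bar) ** 2 for qj in q)
    denom = 2.0 * math.sqrt(v1 + v2 + p_bar * q_bar)
    if denom == 0.0:
        raise ValueError("statistic undefined: the two samples do not overlap")
    stat = (sum(q) - sum(p)) / denom
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # 2 * (1 - Phi(|stat|))
    return stat, p_value


# Hypothetical division shares (per cent) for one variable and one year.
universe = [1.0, 3.0, 5.0]
panel = [2.0, 4.0, 6.0]
stat, p_value = fligner_policello(universe, panel)
print(round(stat, 4), round(p_value, 4))
```

A non-significant result, as reported for the panel, means there is no evidence that shares in one database tend to be larger than in the other.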