Methodology of Allocating Generic Field to its Details. Jessica Andrews Nathalie Hamel François Brisebois ICESIII - June 19, 2007. Outline. Background Information on Tax Data Objective Current Methodology Other Methodologies Considered Comparison of the Methodologies
Methodology of Allocating Generic Field to its Details Jessica Andrews Nathalie Hamel François Brisebois ICESIII - June 19, 2007
Outline • Background Information on Tax Data • Objective • Current Methodology • Other Methodologies Considered • Comparison of the Methodologies • Future Work and Conclusions
Tax Data • Statistics Canada receives annual data from Canada Revenue Agency (CRA) on incorporated (T2) businesses • Tax data: • Balance Sheet • Income Statement • 88 different Schedules
Tax Data • About 700 different fields to report • Most companies provide only 30-40 fields • Only 8 fields are actually required by CRA (section totals) • Non-farm revenue • Non-farm expenses • Farm revenue • Farm expenses • Assets • Liabilities • Shareholder Equity • Net Income/Loss
Objective • To impute the missing detail variables • Why? • Tax data users need detailed data (tax replacement project, TRP) • Concepts and definitions differ between tax and survey data • A subset of details linked to the same generic can be mapped to different survey variables (Chart of Accounts)
Challenges to meet • Methodology must • Work well for a large number of details • Be capable of dealing with details which are rarely reported and those which are frequently reported • Give good micro results for tax replacement, but also give good macro results when examined at the NAICS or full database level
First attempt to complete Tax Data • Edit rules • Outlier detection within a record • Deterministic edits (to ensure the record balances within each section) • Review and manual corrections • Overlap between fiscal periods • Negative values • Consistency edits between tax variables • Outlier detection between records (Hidiroglou-Berthelot) • CORTAX balancing edits • Deterministic imputation of key variables • Inventories • Depreciation • Salaries and wages
GDA Concepts • Corporations can use either generic or detail fields to report their results
GDA Concepts • Block is defined by a generic and its details • Generic field is not a total • Goal is to impute the most significant detail variables when a generic amount has been reported • GDA: Generic to detail allocation
Current method • Uses imputation classes based on industry codes and size of company • First 2 digits of NAICS (about 25 industries) • Three sizes of revenue (boundaries of 5 and 25 million) • Calculates ratios within imputation classes for each block • Uses all non-zero and non-missing details • Uses only details reported at least 10% of the time (5% for block General Farm Expense) • Assigns ratios to businesses with a generic
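The class-ratio calculation on this slide can be sketched roughly as follows. This is a minimal illustration, not the production system: the record layout (`naics`, `revenue`, `details`), the function names, the handling of the exact 5 and 25 million boundaries, the use of a ratio of sums within each class, and the denominator for the 10% reporting rule are all assumptions.

```python
from collections import defaultdict

def size_class(revenue, low=5_000_000, high=25_000_000):
    """Three revenue size classes; which side the boundaries fall on is assumed."""
    if revenue < low:
        return "small"
    if revenue < high:
        return "medium"
    return "large"

def class_ratios(records, min_report_rate=0.10):
    """records: dicts with 'naics' (str), 'revenue', 'details' (name -> amount).

    Returns {(naics2, size): {detail: ratio}}; ratios sum to 1 within a class.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    n_in_class = defaultdict(int)
    for rec in records:
        key = (rec["naics"][:2], size_class(rec["revenue"]))
        n_in_class[key] += 1
        for name, amount in rec["details"].items():
            if amount:  # keep only non-zero, non-missing details
                totals[key][name] += amount
                counts[key][name] += 1
    ratios = {}
    for key, sums in totals.items():
        # Drop details reported by fewer than min_report_rate of the class.
        kept = {
            d: s for d, s in sums.items()
            if counts[key][d] / n_in_class[key] >= min_report_rate
        }
        total = sum(kept.values())
        if total:
            ratios[key] = {d: s / total for d, s in kept.items()}
    return ratios
```

A business that reports only a generic amount would then receive the ratios of its own (industry, size) class.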
Current method • Originally proposed as a solution with good macro (aggregate) results • Now need good micro (business) level results for TRP • Problems • Imputation classes are frequently not homogeneous in terms of distribution • A large number of small imputation classes
Other methods considered • Historic imputation method • Scores method • Cluster method
Historic imputation method • Assumes distributions of details are the same from one year to the next • Problems • A change in business strategies/properties will not be considered this way • Most businesses which report details in the previous year will report them also in the current year, leaving few businesses which could be imputed with this method (~5% on all blocks tested) • Requires use of another method for remaining businesses
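As a rough illustration of the historic idea, a business's own detail shares from the previous year can be applied to its current generic amount. The function name and data layout below are hypothetical, and returning `None` simply marks the case where another method is needed.

```python
def historic_allocation(generic_amount, previous_details):
    """Spread a generic amount using the business's own prior-year detail shares.

    previous_details: detail name -> amount reported in year n-1.
    Returns None when there is no usable history.
    """
    total = sum(v for v in previous_details.values() if v)
    if not total:
        return None  # no history: fall back to another method
    return {
        d: generic_amount * v / total
        for d, v in previous_details.items()
        if v
    }
```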
Scores method • Uses response/non response models for each detail • Groups businesses into imputation classes on the basis of percentiles of response probability • Calculates ratios within imputation classes • Assigns ratios to businesses with a generic
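The grouping step can be sketched as below, assuming response probabilities for one detail have already been estimated by some model (the model itself is not shown). The function name, the number of classes, and the equal-sized rank groups are illustrative assumptions.

```python
def percentile_classes(probabilities, n_classes=4):
    """Assign each business to a class 0..n_classes-1 by the rank of its
    estimated response probability (equal-sized rank groups).

    probabilities: business id -> estimated probability of reporting the detail.
    """
    ranked = sorted(probabilities, key=probabilities.get)
    n = len(ranked)
    return {
        biz: min(i * n_classes // n, n_classes - 1)
        for i, biz in enumerate(ranked)
    }
```

Ratios would then be calculated within each class and assigned to businesses reporting only a generic amount.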
Scores method • Problems • Need to create a model for each detail • Difficult to resolve what to do in the case of blocks with many details (5 or more) which are frequently reported • This method was excluded due to its difficulty in coping with blocks with a moderate to large number of details
Cluster method • Divides businesses into imputation classes on the basis of response patterns to details • Uses clustering or dominant detail method • Uses discriminatory models (parametric or not) to assign businesses with generic to imputation classes • Calculates ratios within imputation classes • Assigns ratios to businesses with a generic
Cluster method • Problems • For certain blocks it can be difficult to find good variables on which to discriminate • Issue of how often clustering method and models should be reviewed
Comparing the methods • Estimate distributions of known data for year n from ratios calculated for year n-1 • Create a benchmark file • Reported details in years n-1 and n • Put all details into generic fields in year n • Calculate ratios from businesses in year n-1 for all methods • Assign ratios to businesses in year n • Compare the results to the reported fields
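The benchmark test for a single business might look like the sketch below: the reported year-n details are collapsed into a generic amount, reallocated with ratios from year n-1, and compared with what was actually reported. The function name and the choice of a simple difference as the error measure are assumptions.

```python
def benchmark_errors(reported_details, ratios):
    """Collapse reported details into a generic, reallocate, and compare.

    reported_details: detail -> amount reported in year n.
    ratios: detail -> ratio calculated from year n-1 (any method).
    Returns detail -> (estimated - reported).
    """
    generic = sum(reported_details.values())  # pretend only a generic was reported
    names = set(reported_details) | set(ratios)
    return {
        d: generic * ratios.get(d, 0.0) - reported_details.get(d, 0.0)
        for d in names
    }
```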
Comparing the methods • Compare the results at the micro (businesses) and the macro (aggregate) levels • Compare true and estimated distributions
Comparing the methods • Macro statistics for the jth detail in the block
Comparing the methods • Micro Statistics • Median Pseudo CV for the jth detail and ith business in the block
Comparing the methods • Micro Statistics • Median Pearson Contingency Coefficient for the jth detail and ith business in the block • f values represent the marginal distributions • d² represents the degree of dependency (depends on n, r and c)
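For reference, the textbook Pearson contingency coefficient for an r × c table is given below. The slide's own formula is not recoverable from the text, so its notation (the f values and d²) may differ from this standard form. Here $f_{ab}$ are the observed cell frequencies, $f_{a\cdot}$ and $f_{\cdot b}$ the row and column totals, and $n$ the grand total.

```latex
\chi^2 = \sum_{a=1}^{r}\sum_{b=1}^{c}
  \frac{\left(f_{ab} - f_{a\cdot}\, f_{\cdot b}/n\right)^2}{f_{a\cdot}\, f_{\cdot b}/n},
\qquad
C = \sqrt{\frac{\chi^2}{\chi^2 + n}}
```

C equals 0 under exact independence and approaches, but never reaches, 1 as the dependency grows; its attainable maximum depends on r and c.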
Comparing the methods • We show results for Block 8230: Other Revenue • This block has 20 details covering revenue distribution • Important for clients as used in many surveys • The scores method is not shown as it is difficult to implement with this many details
Cluster methodology • Most blocks use dominant detail (attractor x) clusters to define the imputation classes • A business i belongs to cluster j of attractor x, where x > 50, if $100 \cdot y_{ij} / \sum_{k} y_{ik} \ge x$, where $y_{ij}$ is the total value reported by business i in detail j. If this statement is not true for any detail, then the business is assigned to an additional cluster, J+1, where J is the number of details in the block.
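A minimal sketch of the dominant detail rule follows, assuming non-negative detail amounts and a share measured against the sum of the reported details. The function name, the default threshold of 60, and the use of "at least x percent" rather than a strict inequality are assumptions, since the slide's formula did not survive.

```python
def dominant_detail_cluster(details, detail_order, threshold=60.0):
    """Return the 1-based cluster number for one business.

    details: detail name -> reported amount (missing or None treated as 0).
    detail_order: the block's detail names; cluster j corresponds to detail j.
    threshold: attractor x in percent; must exceed 50 so that at most one
    detail can dominate.
    """
    total = sum(details.get(d) or 0.0 for d in detail_order)
    if total > 0:
        for j, name in enumerate(detail_order, start=1):
            share = 100.0 * (details.get(name) or 0.0) / total
            if share >= threshold:
                return j
    # No dominant detail: the extra "mixed" cluster.
    return len(detail_order) + 1
```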
Cluster methodology • Distribution ratios to details are calculated for each cluster • Discriminatory models are then created (nonparametric for most blocks) to assign businesses with a generic • Use variables on industry (NAICS), location (province), size (revenue, log revenue), details and totals of details in other blocks
Cluster methodology • Generic amounts are assigned to details in the following 3 ways • If generic amount and no details reported then ratios are assigned as calculated • If generic amount and all details with ratio greater than 0% are reported then ratios are assigned as calculated • If generic amount and some details but not all are reported, then ratios are pro-rated and generic is assigned only to details which were not reported
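The three allocation cases can be sketched as one function. This is an illustration under assumptions: the function name and data layout are hypothetical, a detail counts as "reported" when it has a non-zero amount, and the function returns only the portion of the generic allocated to each detail, not the final detail value.

```python
def allocate_generic(generic, reported, ratios):
    """Allocate a generic amount to details.

    reported: detail -> amount already reported (absent or 0 = not reported).
    ratios: detail -> class ratio (expected to sum to 1).
    """
    positive = {d: r for d, r in ratios.items() if r > 0}
    unreported = {d: r for d, r in positive.items() if not reported.get(d)}
    # Cases 1 and 2 (none or all details reported): ratios as calculated.
    # Case 3 (some reported): pro-rate over the unreported details only.
    target = unreported if unreported else positive
    scale = sum(target.values())
    if scale == 0:
        return {}
    return {d: generic * r / scale for d, r in target.items()}
```

For example, with ratios of 0.5, 0.25 and 0.25 and only the first detail reported, the generic is split evenly between the other two details.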
Cluster methodology • Gives better micro results • Improved data for tax replacement • Macro results remain similar to current methodology • Micro results are consistent year to year
Future work and conclusions • The cluster methodology will be implemented for reference year 2006 for the Income Statement • Model fitting and implementation for Balance Sheet will follow • Review of models and clustering methods as deemed appropriate
Contact Information / Coordonnées Jessica.andrews@statcan.ca Francois.brisebois@statcan.ca Nathalie.hamel@statcan.ca