650 likes | 812 Views
The Use of Administrative Sources for Statistical Purposes. Matching and Integrating Data from Different Sources. What is Matching?. Linking data from different sources Exact Matching - linking records from two or more sources, often using common identifiers
E N D
The Use of Administrative Sourcesfor Statistical Purposes Matching and Integrating Data fromDifferent Sources
What is Matching? • Linking data from different sources • Exact Matching - linking records from two or more sources, often using common identifiers • Probabilistic Matching - determining the probability that records from different sources should match, using a combination of variables
Why Match? • Combining data sets can give more information than is available from individual data sets • Reduce response burden • Build efficient sampling frames • Impute missing data • To allow data integration
Models for Data Integration • Statistical registers • Statistics from mixed source models • Split population model • Split data approach • Pre-filled questionnaires • Using administrative data for non-responders • Using administrative data for estimation • Register-based statistical systems
Mixed Source Models • Traditionally one statistical output was based on one statistical survey • Very little integration or coherence • Now there is a move towards more integrated statistical systems • Outputs are based on several sources
Split Population Model • One source of data for each unit • Different sources for different parts of the population
Split Data Approach • Several sources of data for each unit
Pre-filled Questionnaires • Survey questionnaires are pre-filled with data from other sources where possible • Respondents check that the information is correct, rather than completing a blank questionnaire • This reduces response burden ...... but may introduce a bias!
Using Administrative Data for Non-responders • Administrative data are used directly to supply variables for units that do not respond to a statistical survey • Often used for less important units, so that response-chasing resources can be focused on key units
Using Administrative Data for Estimation • Administrative data are used as auxiliary variables to improve the accuracy of statistical estimation • Often used to estimate for small sub-populations or small geographic areas
Matching Keys • Data fields used for matching e.g. • Reference Number • Name • Address • Postcode/Zip Code/Area Code • Birth/Death Date • Classification (e.g. ISIC, ISCO) • Other variables (age, occupation, etc.)
Distinguishing Power 1 • This relates to the uniqueness of the matching key • Some keys or values have higher distinguishing powers than others • High - reference number, full name, full address • Low - sex, age, city, nationality
Distinguishing Power 2 • Can depend on level of detail • Born 1960, Paris • Born 23 June 1960, rue de l’Eglise, Montmartre, Paris • Choose variables, or combinations of variables with the highest distinguishing power
Match • A pair that represents the same entity in reality A A
Non-match • A pair that represents two different entities in reality A B
Possible Match • A pair for which there is not enough information to determine whether it is a match or a non-match A a
False Match • A pair wrongly designated as a match in the matching process (false positive) A = B
False Non-match • A pair which is a match in reality, but is designated as a non-match in the matching process (false negative) A A
Clerical Matching • Requires clerical resources • Expensive • Inconsistent • Slow • Intelligent
Automatic Matching • Minimises human intervention • Cheap • Consistent • Quick • Limited intelligence
The Solution • Use an automatic matching tool to find obvious matches and no-matches • Refer possible matches to specialist staff • Maximise automatic matching rates and minimise clerical intervention
Standardisation • Generally used for text variables • Abbreviations and common terms are replaced with standard text • Common variations of names are standardised • Postal codes, dates of birth etc. are given a common format
Blocking • If the file to be matched against is very large, it may be necessary to break it down into smaller blocks to save processing time • e.g. if the record to be matched is in a certain town, only match against other records from that town, rather than all records for the whole country
Blocking • Blocking must be used carefully, or good matches will be missed • Experiment with different blocking criteria on a small test data set • Possible to have two or more passes with different blocking criteria to maximise matches
Parsing • Names and words are broken down into matching keyse.g. Steven Vale stafan val Stephen Vael stafan val • Improves success rates by allowing matching where variables are not identical
Scoring • Matched pairs are given a score based on how closely the matching variables agree • Scores determine matches, possible matches and non-matches
How to DetermineX and Y • Mathematical methodse.g. Fellegi / Sunter method • Trial and Error • Data contents and quality may change over time so periodic reviews are necessary
Enhancements • Re-matching files at a later date reduces false non-matches (if at least one file is updated) • Link to data cleaning software, e.g. address standardisation
Matching Software • Commercial products e.g. SSAName3, Trillium, Automatch • In-house products e.g. ACTR (Statistics Canada) • Open-source products e.g. FEBRL • No “off the shelf” products - all require tuning to specific needs
Internet Applications • Google (and other search engines) • www.google.com • Cascot – an automatic coding tool based on text matching • http://www2.warwick.ac.uk/fac/soc/ier/publications/software/cascot/choose_classificatio/ • Address finders e.g. Postes Canada • http://www.postescanada.ca/tools/pcl/bin/advanced-f.asp
Software Applications • Trigram method applied in SAS code (freeware) for matching in the Eurostat business demography project • Similar approach in UNECE “Data Locator” search tool • Works by comparing groups of 3 letters, and counting matching groups
Trigram Method • Match “Steven Vale” • Ste/tev/eve/ven/en /n V/ Va/Val/ale • To “Stephen Vale” • Ste/tep/eph/phe/hen/en /n V/ Va/Val/ale • 6 matching trigrams • And “Stephen Vael” • Ste/tep/eph/phe/hen/en /n V/ Va/Vae/ael • 4 matching trigrams • Parsing would improve these scores
Matching Records Without a Common IdentifierThe UK Experience by Steven Vale (Eurostat / ONS) and Mike Villars (ONS)
The Challenge • The UK statistical business register relies on several administrative sources • It needs to match records from these different sources to avoid duplication • There is no system of common business identification numbers in UK
The Solution • Records are matched using business name, address and post code • The matching software used is Identity Systems / SSA-NAME3 • Matching is mainly automatic via batch processing, but a user interface also allows the possibility of clerical matching
Batch Processing 1 • Name is compressed to form a namekey, the last word of the name is the major key • Major keys are checked against those of existing records at decreasing levels of accuracy until possible matches are found • The name, address and post codes of possible matches are compared, and a score out of 100 is calculated
Batch Processing 2 • If the score is >79 it is considered to be a definite match • If the score is between 60 and 79 it is considered a possible match, and is reported for clerical checking • If the score is <60 it is considered a non-match
Clerical Processing • Possible matches are checked and linked where appropriate using an on-line system • Non-matches with >9 employment are checked - if no link is found they are sent a Business Register Survey questionnaire • Samples of definite matches and smaller non-matches are checked periodically
Problems Encountered 1 • “Trading as” or “T/A” in the namee.g. Mike Villars T/A Mike’s Coffee Bar, Bar would be the major key, but would give too many matches as there are thousands of bars in the UK. • Solution - split the name so that the last word prior to “T/A” e.g. Villars is the major key, improving the quality of matches.
Problems Encountered 2 • The number of small non-matched units grows over time leading to increasing duplication • Checking these units is labour intensive • Solutions • Fine tune matching parameters • Re-run batch processes • Use extra information e.g. legal form / company number where available
Future Developments • Clean and correct addresses prior to matching using “QuickAddress” and the Post Office Address File • Links to geographical referencing • Business Index - plans to link registers of businesses across UK government departments • Unique identifiers?