1. RECORD LINKAGE 201: VISION FOR DATA INTEGRATION TO ACTION AND IMPLEMENTATION Russell S. Kirby, Ph.D., M.S., F.A.C.E.
Department of Maternal and Child Health
School of Public Health
University of Alabama at Birmingham
2. Objectives Place record linkage in a broad framework for planning, analysis, and public health action
Focus on key issues in planning, implementation, evaluation, and utilization of record linkage projects with administrative public health databases
Avoid falling asleep listening to a boring presentation right after lunch
4. What is Record Linkage? If we assume there is a single record as well as a file of records and all records relate to some entities: persons, businesses, addresses, etc . . . Record linkage is the operation that, using the identifying information contained in the single record, seeks another record in the file referring to the same entity.
Ivan Fellegi, Statistics Canada
5. A Long History Based on this definition, record linkage has been around for a long time!
In public health, modern methods date back only to the 1960s, and its broad use is truly a phenomenon of the 1990s into the present decade.
6. Population Health Informatics Record linkage should not be undertaken as an end unto itself.
Rather, projects should be done within a broad informatics context, with scientifically sound strategies. Data quality issues should be a paramount concern at all steps in the record linkage process.
Ideally, record linkage should be done within the context of a theoretical framework and a research study design.
8. Maternal and Child Health Most of our databases represent administrative data
Most of these data focus on aspects of disease processes or systems of care (traditional medical model)
While some of our databases are population-based, some are program-based (and by no means are all public health programs population-based)
9. COMPONENTS OF AN IDEAL STATEWIDE PERINATAL DATABASE 1. Linkages relating to the index pregnancy
16. Elvis Presley on Love: You don't know what you've got,
until you LOSE it . . .
17. Kirby on Data in Databases You don't know what you've got,
until you USE it . . .
18. RECORD LINKAGE: Who, What, Why, When, Where, How? Which question is primary?
19. RECORD LINKAGE: Why? What is the purpose of the study?
Does a record linkage make sense?
would a simple numerator/denominator analysis suffice?
can the linkage be conducted in a manner that supports the use of the resultant database for other projects?
is a record linkage technically feasible?
is a record linkage necessary?
20. RECORD LINKAGE: How? Manual versus automated linkage
The theoretical basis for record linkage
deterministic methods
probabilistic methods
The need for identifiers
Record linkage with names and dates
Software: buy specialized, use statistical software package, develop your own?
Statistical evaluation of linkage results is imperative, regardless of the method
21. RECORD LINKAGE: Who? What personnel should do the linkage?
dedicated linkage specialists?
statisticians/programmers/analysts?
Should linkage staff be subjected to personality profiles?
What cases/events qualify for the linkage?
22. RECORD LINKAGE: What? What databases should be linked?
What are the functional relationships between the records in each of the candidate datasets? Are they sufficient to answer the research question?
How does the linkage support the programmatic/research needs for which the linkage was proposed?
Is there a plan for data warehousing or systematic data integration?
23. RECORD LINKAGE: Where? Where should the linkage be done?
statistical agency?
epidemiological agency?
university research center?
contract to vendor?
Don't forget the importance of spatial identifiers:
consider geocoding as another aspect of record linkage
24. RECORD LINKAGE: When? How often should linkages be done?
The periodicity of routine linkages is predicated on the programmatic need for timeliness and reporting, e.g.:
infant deaths: link immediately
hospital discharges and birth certificates: quarterly or annually may be appropriate
linkages to support passive case-finding registries: periodicity defined by registry needs
25. With all this in mind . . . Let's review some perspectives from the experts on how to do record linkage with public health databases.
26. TOP TEN LIST: TEN BEST WAYS TO DO BAD PUBLIC HEALTH RECORD LINKAGE
27. "Just have someone else do the linkage for you, then use the 'don't ask, don't tell' method perfected by the military. That way, what you don't know doesn't hurt you!" -- Anonymous correspondent, summer of 2002
31. "Ma'am, you can have any color car you want, so long as it's black" -- Henry Ford, 1920s
39. KEY ISSUES Why link?
To link, or not to link? . . . or
I link, therefore I am?
Defining the nature of the problem
What is the purpose?
What do the records in each dataset represent?
What will we do with the results?
40. Why Link? (select the best answer) We cannot answer the research or policy question without linking the databases.
We have to under the terms of our grant or cooperative agreement.
Integrating record linkage into the routine data management process of our program enables us to assess the program's effectiveness and efficiency on a continual basis.
41. Why Not Link? (select the best answer) Lack of funding.
Staff don't have training.
Necessary hardware/software/data storage unavailable.
Bureaucratic inertia.
Turf battles between programs.
Question doesn't warrant linkage.
Some of the above?
All of the above?
43. First Steps Before conducting a record linkage, carefully examine the broad informatics, program and research context.
Above all else, consider the purpose of the linkage project in relation to the planned approach and other potential uses of the resulting linked dataset.
Hint: if you only ask the people on your team about other potential uses, the uses identified will mostly be within the same frame of reference for your own approach.
Let's carefully explore the question of linking birth certificates and Medicaid pregnancy claims data.
44. What do the records represent? Medicaid claims database
Pregnant women/mothers
Women who are not pregnant or may be pregnant (including the elderly)
Infants and children
Men (what a concept!)
Birth certificates
Live births
Fetal deaths
45. What do the records represent (continued)?
Some questions to consider:
What records are included in the claims database? Are there systematic exclusions (e.g. global bills for Medicaid managed care recipients)? Does the database include only paid claims?
Are there records in the Medicaid database that may not represent prenatal services?
Are there potentially multiple records per patient in the claims database?
What is the purpose of the linkage?
46. What do the records represent (continued)?
Some questions to consider:
Is the focus of the study on mothers, infants, or mother-infant dyads?
How do the concepts of residence and occurrence affect the likelihood that an event will be included in either database?
What is the relationship between Medicaid eligibility and utilization?
What a priori expectations are there concerning which records will and will not match?
47. Some possible purposes of the linkage Link all Medicaid-eligible pregnant women with their birth outcomes?
Link all Medicaid-paid deliveries with their birth certificates?
Link all Medicaid-eligible pregnant women with their infants (or all Medicaid-eligible infants with their mothers)?
Create a proxy measure for socio-economic status for vital statistics analyses?
Create Medicaid pregnancy episodes of care records?
Other purposes?
48. Issues with residence and occurrence in the context of linking Medicaid and vital statistics records Vital statistics datasets include all resident and occurrence events in the state thanks to the VSCP-NAPHSIS interstate exchange agreement. This includes live births, fetal deaths, and deaths, but does not extend to non-vital statistics records.
Medicaid program data are generally state-specific, and state residence is part of the eligibility requirement. A woman who gives birth in your state, but is a resident of another state, may have been a Medicaid participant there, but you'll never know. Some states have special programs governing reimbursement for Medicaid services provided by physicians/health care facilities in other states.
If records fail to match, might it be due to differences in reporting requirements and eligibility?
49. What is the relationship between eligibility and utilization? Remember that eligibility data are just that. Unfortunately, some persons who meet eligibility requirements never apply or get signed up, while others do but never access the services for which they are eligible.
Consider linking your eligibility data with service utilization data not only to find out which clients actually used the program, but also for the insights you might gain from the utilization data themselves.
50. What records will and won't match? Some pregnancies involving Medicaid-eligible women or Medicaid pregnancy claims do not result in live births.
Fetal deaths
Spontaneous or induced abortions
Some Medicaid-eligible women may not have been residents of the study area at the time of the vital event.
Over-reliance on unique identifiers (SSNs, service IDs) can lead to both mis-matched and unmatched records.
What's in a name, anyway?
51. Record Linkage Methods Generally there are two classes of linkage methodologies
Deterministic linkage methods
Probabilistic linkage methods
52. Linking data deterministically
53. Which variables are common to both datasets? Run PROC CONTENTS on each dataset.
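The comparison of variable lists can be sketched outside SAS as well. The following is a minimal Python illustration, not the presenter's method; the field names are invented for the example, and in practice the lists would come from the PROC CONTENTS output for each file.

```python
# Hypothetical variable lists for two files (illustrative names only).
birth_fields = ["mom_first", "mom_last", "mom_dob", "child_dob",
                "zip_code", "hosp", "birth_weight"]
screen_fields = ["MOM_FIRST", "mom_last", "mom_dob", "child_dob",
                 "zip_code", "hosp", "screen_result"]


def common_variables(fields_a, fields_b):
    """Return the names present in both lists, compared case-insensitively."""
    a = {name.strip().lower() for name in fields_a}
    b = {name.strip().lower() for name in fields_b}
    return sorted(a & b)


shared = common_variables(birth_fields, screen_fields)
print(shared)
```

Variables that carry the same information under different names will not be found this way; they still need to be paired by hand.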
54. A Word of Caution On the previous slide, mention was made of using SAS.
If you plan to do record linkage using Microsoft Access without complex Visual Basic code, DON'T! The same applies to other relational database software.
Linkages based solely on straightforward JOINs will allow significant error to remain in your matched results.
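One kind of error a plain JOIN leaves behind is the missed match caused by formatting differences. The sketch below imitates an exact-equality join in Python; the records and field names are made up for illustration.

```python
def exact_join(left, right, keys):
    """Pair records whose key values are identical, as a plain equality JOIN would."""
    index = {}
    for rec in right:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    pairs = []
    for left_rec in left:
        for right_rec in index.get(tuple(left_rec[k] for k in keys), []):
            pairs.append((left_rec, right_rec))
    return pairs


# Hypothetical records: the first person appears in both files,
# but the surname is keyed differently in each.
birth = [{"last": "O'Brien", "dob": "2002-03-01"},
         {"last": "Smith", "dob": "2002-03-02"}]
claims = [{"last": "OBRIEN ", "dob": "2002-03-01"},
          {"last": "Smith", "dob": "2002-03-02"}]

pairs = exact_join(birth, claims, ["last", "dob"])
print(len(pairs))  # only the Smith records pair up
```

The O'Brien record is silently dropped, which is why the linkage results have to be evaluated rather than assumed.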
55. An Even Stronger Word of Caution If you plan to conduct a deterministic match using a single identifying variable, or requiring a match on that variable together with others, DON'T!
A good example of this is the Social Security Number.
On the other hand, once you have linked records, assigning a common identifier to both datasets will facilitate future data processing.
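To see why a match on a single identifier deserves suspicion, it helps to check how well the other fields agree for each pair the identifier produces. This is a hedged Python sketch with invented records; the scenario of two different people sharing one SSN in a file is an assumption made for the example.

```python
def ssn_pairs_with_check(left, right, check_fields):
    """Pair on SSN alone, then count how many other fields agree for each pair."""
    by_ssn = {}
    for rec in right:
        if rec["ssn"]:
            by_ssn.setdefault(rec["ssn"], []).append(rec)
    results = []
    for left_rec in left:
        for right_rec in by_ssn.get(left_rec["ssn"], []):
            agree = sum(1 for f in check_fields if left_rec[f] == right_rec[f])
            results.append((left_rec["id"], right_rec["id"], agree))
    return results


# Hypothetical data: two claims records carry the same SSN.
birth = [{"id": 1, "ssn": "111", "first": "ANN", "dob": "1975-01-01"}]
claims = [{"id": "A", "ssn": "111", "first": "ANN", "dob": "1975-01-01"},
          {"id": "B", "ssn": "111", "first": "BABY", "dob": "2002-06-30"}]

results = ssn_pairs_with_check(birth, claims, ["first", "dob"])
suspect = [pair for pair in results if pair[2] == 0]
print(results)
print(suspect)
```

Pairs that agree on the identifier and on nothing else are candidates for a mis-match, not confirmed links.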
56. And now, back to our regular program . . .
57. Mother's information (birth certificate field = newborn screen field)
Birth_mom_legal = Screen_mom_legal_last
Birth_mom_mid = Screen_mom_mid
Birth_mom_first = Screen_mom_first
Birth_mother_dob = Screen_mother_dob
58. Infant's information (birth certificate field = newborn screen field)
Birth_child_last = Screen_child_last
Birth_child_mid = Screen_child_mid
Birth_child_first = Screen_child_first
_Birth_gender = _Screen_gender
Birth_child_dob = Screen_date
59. Other information There could also be related fields that don't specifically identify the individual, but are useful for record linkage:
(birth certificate field = newborn screen field)
Birth_zip_code = Screen_zip_code
Birth_hosp = Screen_hosp
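The field pairings on the last three slides can be written down as a lookup table and used to compare one birth certificate with one newborn screen record. The Python sketch below uses a subset of the field names from the slides; the standardization rule (trim and upper-case) and the sample values are assumptions for illustration.

```python
# (birth certificate field, newborn screen field), taken from the slides.
FIELD_PAIRS = [
    ("Birth_mom_legal", "Screen_mom_legal_last"),
    ("Birth_mom_first", "Screen_mom_first"),
    ("Birth_mother_dob", "Screen_mother_dob"),
    ("Birth_child_dob", "Screen_date"),
    ("Birth_zip_code", "Screen_zip_code"),
    ("Birth_hosp", "Screen_hosp"),
]


def standardize(value):
    """Trim and upper-case a value; treat empty strings as missing (None)."""
    if value is None:
        return None
    text = str(value).strip().upper()
    return text or None


def compare(birth_rec, screen_rec):
    """Return {birth_field: True/False/None}; None means a value is missing."""
    outcome = {}
    for birth_name, screen_name in FIELD_PAIRS:
        b = standardize(birth_rec.get(birth_name))
        s = standardize(screen_rec.get(screen_name))
        outcome[birth_name] = None if b is None or s is None else b == s
    return outcome


# Hypothetical records.
birth = {"Birth_mom_legal": "Garcia", "Birth_mom_first": "Maria ",
         "Birth_mother_dob": "1980-05-04", "Birth_child_dob": "2002-11-01",
         "Birth_zip_code": "35294", "Birth_hosp": "012"}
screen = {"Screen_mom_legal_last": "GARCIA", "Screen_mom_first": "maria",
          "Screen_mother_dob": "", "Screen_date": "2002-11-01",
          "Screen_zip_code": "35205", "Screen_hosp": "012"}

comparison = compare(birth, screen)
print(comparison)
```

Keeping "missing" separate from "disagrees" matters later, when deciding how much a non-agreeing field should count against a pair.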
60. Missing data Look for missing data in linkage variables.
What do you do when you find it?
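A first step is simply to measure how much is missing in each candidate linkage variable. A minimal Python sketch, with made-up records and field names:

```python
def missing_rates(records, fields):
    """Return {field: fraction of records where the field is None or blank}."""
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(
            1 for rec in records
            if rec.get(field) is None or str(rec.get(field)).strip() == ""
        )
        rates[field] = missing / total if total else 0.0
    return rates


# Hypothetical records.
records = [
    {"mom_dob": "1980-01-01", "ssn": "123"},
    {"mom_dob": "", "ssn": None},
    {"mom_dob": "1979-05-05", "ssn": " "},
    {"mom_dob": "1990-09-09", "ssn": "456"},
]

rates = missing_rates(records, ["mom_dob", "ssn"])
print(rates)
```

What to do with the result (drop the variable, use it only in later passes, or go back to the source) remains a judgment call.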
61. Duplicate records Look for records that share the same values for your vector of matching variables.
What do you do when you find records that share these values?
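Finding such records amounts to grouping on the vector of matching variables and keeping the groups with more than one member. The Python sketch below uses invented records; twins are one plausible (assumed) reason two legitimate records share every matching value.

```python
from collections import defaultdict


def duplicate_groups(records, key_fields):
    """Return {key vector: [record ids]} for key vectors shared by 2+ records."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in key_fields)].append(rec["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical records; ids 1 and 2 could be twins.
records = [
    {"id": 1, "mom_last": "NGUYEN", "mom_dob": "1979-03-03", "child_dob": "2002-07-07"},
    {"id": 2, "mom_last": "NGUYEN", "mom_dob": "1979-03-03", "child_dob": "2002-07-07"},
    {"id": 3, "mom_last": "BROWN", "mom_dob": "1981-08-08", "child_dob": "2002-07-09"},
]

dupes = duplicate_groups(records, ["mom_last", "mom_dob", "child_dob"])
print(dupes)
```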
62. Ranking of linkage variables Which variables are the best variables?
How much missing data in each variable?
What do you know about the variables?
How do you decide?
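One possible heuristic for ranking, which is an assumption of this sketch and not something the slides prescribe, is to estimate how often two records would agree on a variable purely by chance: the lower that probability, the more discriminating the variable. A minimal Python version with invented records:

```python
from collections import Counter


def chance_agreement(records, field):
    """Probability that two records drawn at random (with replacement) agree
    on the field, computed over non-missing values only."""
    values = [rec[field] for rec in records if rec.get(field) not in (None, "")]
    n = len(values)
    if n == 0:
        return None
    counts = Counter(values)
    return sum((count / n) ** 2 for count in counts.values())


# Hypothetical records.
records = [
    {"sex": "F", "dob": "1980-01-01"},
    {"sex": "M", "dob": "1981-02-02"},
    {"sex": "F", "dob": "1982-03-03"},
    {"sex": "M", "dob": "1983-04-04"},
]

chance = {field: chance_agreement(records, field) for field in ["sex", "dob"]}
ranked = sorted(chance, key=chance.get)  # most discriminating first
print(chance)
print(ranked)
```

This number should be read together with the missing-data rate and with what is known about how each variable is collected.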
63. The art of creating a linkage algorithm Use the most discriminating combination of variables first
Loosen criteria as you go along
64. The art of creating a linkage algorithm
65. Create id in data set Allows you to easily merge back with original data
Easy as:
data new;
set old;
id=_n_;
run;
66. Sort by chosen linkage variables What happens when you don't use BY variables? For example:
DATA LINKED;
MERGE BCERT MED; /* no BY statement: observations are paired by position, not by key */
RUN;
Be sure you unduplicate the output file (i.e., NODUPKEY option in PROC SORT)
67. Merge by chosen linkage variables Create data set with only linked records
Keep track of the link level (the level of linkage where records matched)
Don't discard records that fail to match at each step
Consider allowing full replacement prior to running each new iteration
68. Re-merge to get unlinked datasets Unlinked data sets contain only variables from that data set
Unlinked records sent to next level of linkage algorithm
69. Last step Combine all linked data sets
Investigate unlinked records
Look for systematic errors responsible for non-linking
Look for biases
Evaluate quality of links in linked records
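The stepwise algorithm described on slides 63 to 69 can be sketched as a series of passes, each with looser criteria than the last. The Python version below is an illustration under stated assumptions, not the presenter's SAS code: it links only key combinations that are unique in both files, records the pass number as the link level, and sends only unlinked records to the next pass. The records and field names are invented.

```python
def link_pass(left, right, keys, level):
    """Link records that agree exactly on all keys; return links and leftovers."""
    def build_index(records):
        index = {}
        for rec in records:
            values = tuple(rec.get(k) for k in keys)
            if all(v not in (None, "") for v in values):  # skip missing keys
                index.setdefault(values, []).append(rec)
        return index

    left_index, right_index = build_index(left), build_index(right)
    links = []
    for values, left_recs in left_index.items():
        right_recs = right_index.get(values, [])
        if len(left_recs) == 1 and len(right_recs) == 1:  # unambiguous only
            links.append({"left_id": left_recs[0]["id"],
                          "right_id": right_recs[0]["id"],
                          "level": level})
    linked_left = {link["left_id"] for link in links}
    linked_right = {link["right_id"] for link in links}
    rest_left = [rec for rec in left if rec["id"] not in linked_left]
    rest_right = [rec for rec in right if rec["id"] not in linked_right]
    return links, rest_left, rest_right


def multi_pass(left, right, passes):
    """Run the passes in order, most discriminating key set first."""
    all_links = []
    for level, keys in enumerate(passes, start=1):
        links, left, right = link_pass(left, right, keys, level)
        all_links.extend(links)
    return all_links, left, right


# Hypothetical records.
birth = [
    {"id": 1, "last": "SMITH", "first": "ANN", "dob": "1980-01-01", "zip": "35294"},
    {"id": 2, "last": "JONES", "first": "BETH", "dob": "1975-06-15", "zip": "35205"},
    {"id": 3, "last": "LEE", "first": "CARA", "dob": "1982-09-09", "zip": "35203"},
]
screen = [
    {"id": "A", "last": "SMITH", "first": "ANN", "dob": "1980-01-01", "zip": "35294"},
    {"id": "B", "last": "JONES", "first": "ELIZABETH", "dob": "1975-06-15", "zip": "35205"},
    {"id": "C", "last": "PARK", "first": "DANA", "dob": "1990-02-02", "zip": "35210"},
]
passes = [["last", "first", "dob", "zip"], ["last", "dob", "zip"]]

links, unlinked_birth, unlinked_screen = multi_pass(birth, screen, passes)
print(links)
print([rec["id"] for rec in unlinked_birth], [rec["id"] for rec in unlinked_screen])
```

Because every link carries its level, links made under the loosest criteria can be reviewed separately, and the leftover records are available for the investigation of systematic non-linking.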
70. Probabilistic Record Linkage Uses probabilities to determine whether a pair of records refer to the same individual
Calculates weights to quantify the likelihood that a pair of records are a true match
Computationally intensive: each record in one dataset is compared with every record in the other dataset
Probabilistic weights may be either non-specific or value-specific
71. General (Non-Specific) Weights Agreement on a specific variable
Example:
- Agreement on date of birth receives a higher weight than agreement on sex
- Disagreement on sex receives a higher penalty than disagreement on date of birth
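The slides do not give a formula for these weights. A common choice, assumed here, is the Fellegi-Sunter log-ratio form: with m the probability that a field agrees for a true match and u the probability that it agrees for a non-match, the agreement weight is log2(m/u) and the disagreement weight is log2((1-m)/(1-u)). The m and u values in this Python sketch are invented for illustration.

```python
import math


def agreement_weight(m, u):
    """Weight added when the field agrees (log-ratio form)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added (a negative number) when the field disagrees."""
    return math.log2((1 - m) / (1 - u))


# Hypothetical m and u probabilities for two fields.
PARAMS = {
    "dob": {"m": 0.95, "u": 0.003},  # chance agreement on a birth date is rare
    "sex": {"m": 0.98, "u": 0.5},    # chance agreement on sex is about one in two
}

weights = {
    field: {"agree": agreement_weight(p["m"], p["u"]),
            "disagree": disagreement_weight(p["m"], p["u"])}
    for field, p in PARAMS.items()
}
for field, w in weights.items():
    print(field, round(w["agree"], 2), round(w["disagree"], 2))
```

With these assumed values, agreement on date of birth earns far more than agreement on sex, while disagreement on sex costs slightly more than disagreement on date of birth, which is the pattern the slide describes.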
72. Value Specific Weights Agreement on a specific value of the variable being compared
Example: Comparing initials using value specific weights
- Agreement on initial Z receives higher weight than match on initial S
- Disagreement on initial S receives higher penalty than disagreement on Z
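For the agreement side, a value-specific weight can be sketched by replacing the field-wide chance of agreement with the relative frequency of the particular value. This is one common approach, assumed here and not taken from the slides; the sketch covers agreement weights only, and the initial counts and the m value are invented.

```python
import math
from collections import Counter


def value_specific_agreement_weights(values, m=0.95):
    """Return {value: log2(m / relative frequency of that value)}."""
    counts = Counter(values)
    n = len(values)
    return {value: math.log2(m / (count / n)) for value, count in counts.items()}


# Hypothetical file in which the initial S is common and Z is rare.
initials = ["S"] * 90 + ["Z"] * 10

value_weights = value_specific_agreement_weights(initials)
print({value: round(w, 2) for value, w in value_weights.items()})
```

Agreement on the rare initial carries more weight than agreement on the common one, because it is less likely to happen by chance.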
73. Benefit of Weights Weights objectively reflect our confidence in a match
Individual choice in cutting off low weights
74. Probabilistic Linkage Methods Some SAS programmers write their own probabilistic code
Software packages
- Very expensive
- Difficult to use
- Some applications are available as freeware or shareware
75. Choosing Probabilistic Software
76. Linkage Evaluation A significant advantage of probabilistic methods is that evaluation of the linkage results is an explicit step in the methodology.
The analyst must determine what level of tolerance will be applied for acceptance of a matched pair of records.
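Applying a tolerance usually means comparing each pair's total weight with one or two cutoffs. The Python sketch below assumes a two-cutoff scheme with a middle band set aside for manual review; the cutoff values and the scored pairs are invented, and choosing real cutoffs is exactly the analyst's decision the slide refers to.

```python
def classify(total_weight, lower, upper):
    """Classify a record pair by its total weight against two cutoffs."""
    if total_weight >= upper:
        return "link"
    if total_weight <= lower:
        return "non-link"
    return "review"


# Hypothetical scored pairs: (left id, right id, total weight).
scored_pairs = [("1", "A", 14.2), ("2", "B", 5.0), ("3", "C", -6.3)]
LOWER, UPPER = 0.0, 10.0  # illustrative cutoffs, not recommendations

decisions = [(left, right, classify(weight, LOWER, UPPER))
             for left, right, weight in scored_pairs]
print(decisions)
```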
77. Document, Document, Document Even if you plan to remain in your current job for the next 30 years, the importance of careful documentation in programs, output, data dictionaries, and reports cannot be stressed strongly enough.
Retain statistical program logs, keep track of the provenance of input datasets, and document all decisions made concerning methods and their application.
78. Data Warehouses Be wary of warehouses, lest you fall into the trap of believing they are all things to all people.
More specifically
When linkages within the warehouse are made solely on the basis of unique identifiers, caveat emptor.
Always ask the question of how the linkages for the warehouse were done, and more importantly, for what purpose.
79. Data Warehouses (continued) The term data warehouse means different things to different people.
For some, it's a perfect one-to-many/many-to-one linkage repository
For others, it's a library of databases containing records of unknown or untested relationship to one another
For still others, it is a Swiss cheese data cube in which some regions are fully populated and linked across data sources, while others contain data measured at differing levels of aggregation, while others contain unlinked records, while still others are empty
80. And finally, one more time . . .
81. Evaluate before you analyze Don't assume the linkage has been done correctly, whether you did it yourself or it was done by someone else.
Each time the linkage is done the results must be evaluated, whether you use deterministic or probabilistic linkage algorithms.
Compare values on non-linkage variables as well as those used to conduct the linkage, across all observations in the dataset.
Create pairwise linkage scores and throw out linkages between records that don't meet your minimum criteria.
If you publish reports or submit manuscripts, it is imperative that information on how the linkage was done and how the results were evaluated prior to analysis be included in your methods.
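One of the checks above, comparing values on variables that were not used to make the link, can be sketched as an agreement rate over the linked pairs. In this Python illustration the linked pairs and the choice of birth weight as the non-linkage variable are assumptions.

```python
def agreement_rate(linked_pairs, field):
    """Share of linked pairs agreeing on the field, among pairs where both
    records have a value for it. Returns None if no pair is comparable."""
    comparable = [
        (a, b) for a, b in linked_pairs
        if a.get(field) not in (None, "") and b.get(field) not in (None, "")
    ]
    if not comparable:
        return None
    agreeing = sum(1 for a, b in comparable if a[field] == b[field])
    return agreeing / len(comparable)


# Hypothetical linked pairs; birth weight (grams) was not a linkage variable.
linked_pairs = [
    ({"birth_weight": 3200}, {"birth_weight": 3200}),
    ({"birth_weight": 2950}, {"birth_weight": 2950}),
    ({"birth_weight": 4100}, {"birth_weight": 4100}),
    ({"birth_weight": 3400}, {"birth_weight": 1800}),
]

rate = agreement_rate(linked_pairs, "birth_weight")
print(rate)
```

A low agreement rate on an independent variable points either to false links or to data quality problems in one of the sources; both are worth knowing before any analysis.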
83. But we've always done it this way . . . (or, close enough for government work) Why do the linkages once a year?
Consider building linkages into the routine processing of records as they are filed or reported.
Even if linkages are done annually, consider creating a database in which links across subjects can cross reporting years. This can result in a self-correcting feedback loop that enables additional unmatched records to be linked later on the basis of more current information.
84. THE TEN COMMANDMENTS OF RECORD LINKAGE
95. The life which is unexamined is not worth living - Plato (428-348 B.C.)
96. The database which is unexamined is not worth analyzing - Kirby, (1954- A.D.)
97. Contact Information Russell S. Kirby, PhD, MS, FACE
Department of Maternal and Child Health School of Public Health, University of Alabama at Birmingham
Email: rkirby@uab.edu
Telephone: 205-934-2985
99. What is reality?
100. CONTROLLING THE URGE TO MERGE: DIAGNOSIS AND TREATMENT OF A NEW CLINICAL PSYCHOSIS AFFECTING PUBLIC HEALTH WORKERS AND RESEARCHERS Russell S. Kirby, Ph.D., M.S., F.A.C.E.
Originally described Dec. 1996,
revised at UAB Nov. 2002
102. Impulse-Control Disorders Not Elsewhere Classified (269) 312.34 Intermittent Explosive Disorder (269)
312.32 Kleptomania (269)
312.33 Pyromania (270)
312.31 Pathological Gambling (271)
312.39 Trichotillomania (272)
312.35 Urge to Merge (272)
312.30 Impulse-Control Disorder NOS (272)
103. 312.35 Urge to Merge
A. Recurrent failure to resist impulses to link public health and/or clinical medical records that result in ill-conceived, often unscientific linkage strategies and linked files which may be inappropriate for the research purposes for which they were created.
B. The urge to merge manifested by researchers and analysts is often stimulated by external forces (administrators) but is grossly out of proportion to any precipitating bureaucratic stressors.
C. The urge to merge is not better accounted for by Conduct Disorder, Manic Episode, Substance Dependence, or Antisocial Personality Disorder.
104. Some clinical features of the urge to merge psychosis subject observed constantly mumbling about the need for a unique identifier
subject suffers from multiple tools disorders (see DSM-4R for diagnostic criteria), e.g.
if Access doesn't work, subject tries SAS
if direct importation doesn't work, subject converts files to spreadsheets first, then into statistical file formats
105. Some clinical features of the urge to merge psychosis (continued) subject given to making grandiose statements, e.g.
if you can't drill down, then roll up
linked files are data rich and information poor
electronic data rules; paper is for illiterates
subject often forgets why research projects are being done, as the linkage task becomes both primary and primal
106. If this is you . . . There is hope.
Join the national community of LA (Linkers Anonymous) and practice its iterative twelve-step algorithm.
Talk to your colleagues and co-workers; in time, they may come to understand, or at least become more tolerant.
Remember, you don't have to go it alone!