THRio Database Linkage and THRio Database Issues
Database matching • There are several systems that do not “talk” to each other • SINAN – reportable diseases (TB, AIDS) • SIM – Mortality • SICOM – Pharmaceutical database (ARVs) • THRio – Our DB • Original plan • Match THRio with each of the other 3 DBs above
Database matching • Problems • There is no unique identifier common to all systems • We use name, gender, DOB and mother’s name as surrogates • The information is not uniform – many missing variables – especially mother’s name • THRio • Standardization of name abbreviations • Double data entry • Not enough – names are still misspelled • The other databases – even worse • No QC
Database matching • Proposed strategy • Compare different approaches • Translated SOUNDEX • Reclink – probabilistic linkage • Other algorithms • Apply to different examples and get sensitivity/specificity for each one • SICOM • Sequential matching • Match TB before doing the sequential matching
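The sensitivity/specificity comparison mentioned above could be computed roughly as follows. This is only a sketch: the function name, the representation of links as sets of ID pairs, and the `n_pairs` argument (the total number of candidate pairs examined) are assumptions made for illustration, not part of the project.

```python
def sensitivity_specificity(predicted, gold, n_pairs):
    """Score one linkage method against a reference standard.

    predicted, gold: sets of (id_in_db_a, id_in_db_b) pairs (hypothetical layout).
    n_pairs: total number of candidate pairs that were examined.
    """
    tp = len(predicted & gold)   # links found that are true
    fp = len(predicted - gold)   # links found that are false
    fn = len(gold - predicted)   # true links that were missed
    tn = n_pairs - tp - fp - fn  # pairs correctly left unlinked
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec
```

Running the same function on the output of each method (Soundex-based, Reclink, other algorithms) against the same reference set would give directly comparable figures.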
Database matching • The project was split: • ARV database revisited • Development of a new algorithm for database linkage
Database matching • ARV database revisited • Consistency problems (as pointed out before) • First HAART abstracted for THRio • Inconsistency confirmed • Dates did not match (40%) • Drugs did not match • Now the full ART history will be collected (from HAART onward only) • Should we insist and compare the database with the whole history?
Database matching • Development of algorithm for database linkage • Using Python to implement the interface • Adapted soundex algorithm • “Gestalt” algorithm – a rather hyperbolic name • Direct field comparisons • Including a hierarchical structure for searching and comparing records • This takes advantage of differences in the amount of information available • Computational problems • Optimization
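As an illustration of the “gestalt” name comparison, Python's standard library offers `difflib.SequenceMatcher`, which is closely related to Ratcliff/Obershelp gestalt pattern matching. Whether the project used this exact class is an assumption; the function name and the normalization step below are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity score in [0, 1] for two names.

    Names are upper-cased and extra whitespace is collapsed first
    (an assumed normalization, not taken from the slides).
    """
    a = " ".join(a.upper().split())
    b = " ".join(b.upper().split())
    return SequenceMatcher(None, a, b).ratio()


# One substituted letter in a four-letter name: 3 matching characters
# out of 8 in total, so the ratio is 2 * 3 / 8 = 0.75.
print(name_similarity("JOSE", "JOZE"))
```

The score is what makes cutoffs possible: exact field comparison gives only yes or no, while this gives a graded measure of how close two spellings are.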
Database matching • Blocking • Speeds up computation • Only records that are already somewhat similar are compared • Soundex • First and last names • Mother’s first and last names • First name and mother’s last name • Blocks needed to be expanded to account for errors in the first letter of the first and last names
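A minimal sketch of blocking on the three Soundex key combinations listed above. It uses the standard American Soundex, not the adapted/translated version the project developed, and the record field names (`first`, `last`, `mother_first`, `mother_last`) are hypothetical.

```python
from collections import defaultdict

# Standard Soundex digit for each coded consonant.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Standard 4-character Soundex; returns "" for an empty name.

    Only plain A-Z letters are used; accented letters are dropped,
    which an adapted version would need to handle.
    """
    letters = [c for c in (name or "").upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "HW":  # H and W do not separate equal codes
            prev = code
    return "".join(out)[:4].ljust(4, "0")


def block_keys(rec):
    """Blocking keys for one record, skipping keys with missing parts."""
    first, last = soundex(rec.get("first")), soundex(rec.get("last"))
    m_first = soundex(rec.get("mother_first"))
    m_last = soundex(rec.get("mother_last"))
    keys = set()
    if first and last:
        keys.add(("name", first, last))
    if m_first and m_last:
        keys.add(("mother", m_first, m_last))
    if first and m_last:
        keys.add(("mixed", first, m_last))
    return keys


def candidate_pairs(records_a, records_b):
    """Index pairs (i, j) that share at least one blocking key."""
    index = defaultdict(list)
    for j, rec in enumerate(records_b):
        for key in block_keys(rec):
            index[key].append(j)
    pairs = set()
    for i, rec in enumerate(records_a):
        for key in block_keys(rec):
            for j in index.get(key, ()):
                pairs.add((i, j))
    return pairs
```

Only the pairs returned by `candidate_pairs` would go on to the full comparison. Because Soundex keeps the first letter as is, an error in that letter puts a record in the wrong block, which is the weakness the last bullet refers to.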
Database matching • Full comparison • All fields exactly the same • Small error in DOB • Similar names (gestalt) – generates scores • A combination of the above • Several “levels” created • Have to choose 2 cutoffs • Below the lower cutoff: not a match • Above the upper cutoff: definitely a match • In between: have to decide manually
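The two-cutoff decision can be sketched as below. The cutoff values 0.70 and 0.90 are placeholders chosen for illustration only; the slides do not state the values actually used.

```python
def classify(score, lower=0.70, upper=0.90):
    """Turn a similarity score into one of three decisions.

    lower and upper are hypothetical cutoffs; they would have to be
    tuned on a sample with known matches.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("cutoffs must satisfy 0 <= lower <= upper <= 1")
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "manual review"
```

Moving the cutoffs closer together reduces the manual workload but raises the risk of wrong automatic decisions, so the choice is a trade-off between the two.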
Database matching • Computational problems – testing phase • Using PostgreSQL and Python • Too slow when matching with the TB database • > 100,000 records • Changed the algorithm to Python only • Computational times (currently) • THRio x SIM (12,689 X 2,922) • 3-4 minutes • THRio x TB (12,689 X 102,919) • 100-105 minutes
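As a rough check on the run times above, the number of record pairs in the full cross product (an upper bound, since blocking skips most of them) can be computed directly. The variable names are only for illustration.

```python
# Record counts taken from the slide above.
THRIO, SIM, TB = 12_689, 2_922, 102_919

pairs_sim = THRIO * SIM  # 37,077,258 possible pairs
pairs_tb = THRIO * TB    # 1,305,939,191 possible pairs

print(f"THRio x SIM: {pairs_sim:,} pairs")
print(f"THRio x TB:  {pairs_tb:,} pairs")
print(f"TB / SIM ratio: {pairs_tb / pairs_sim:.1f}")
```

The TB problem is about 35 times larger than the SIM one, which is in the same range as the reported increase in run time (100-105 minutes against 3-4 minutes).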
Database matching • Results • First we chose a sample of the mortality database • Year 2005 • AIDS only • 871 records • Matched with THRio database • 10,344 records at the time
Database matching • Compared Manual x Reclink x Algorithm • We were going to use the manual linkage as the gold standard • The algorithm found 13 additional correct matches • We used the manual matches combined with those extra matches as the standard
Database matching • The algorithm outperformed both RecLink and manual check • But only after some adjustments • That was just the “training phase” • The only apparent mistake still has to be checked – it may be a twin brother • Full info and only one different letter in the first name • We still have to test it again with a different sample and with TB
Database matching • THRio (latest) x SIM (2003-2005) • 340 matches (total) • Only 79 (23%) to be manually checked • This means that both DBs have good quality, at least in terms of completeness • Ended up with 273 matches and one possible mistake • When we actually implement it… • Extra check with date of last annotation in the chart
Database matching • Challenge • TB database • Data quality is much poorer than SIM • Might lead to lower sensitivity • Will lead to much more manual checking • Development of an interface to help with this work
Database matching • THRio (latest) x TB (1995-2005) • 6453 matches (total) • 3870 (60%) to be manually checked • 721 (11%) with names only • Quality is much worse than SIM • Many duplicates • Proposed solutions: • Reduce time frame (for prospective TB cases only) • Use date of TB diagnosis to exclude duplicates • GUI to help
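One way the date of TB diagnosis could be used to exclude duplicates is to collapse notifications for the same patient that fall close together in time. This is a sketch under assumptions: the function name is hypothetical and the 180-day window is an arbitrary placeholder, not a value from the project.

```python
from datetime import date, timedelta


def collapse_notifications(diagnosis_dates, window_days=180):
    """Keep one diagnosis date per TB episode for a single patient.

    A notification dated within window_days of the episode's first date
    is treated as a duplicate of that episode (assumed rule).
    """
    episodes = []
    for d in sorted(diagnosis_dates):
        if episodes and d - episodes[-1] <= timedelta(days=window_days):
            continue  # same episode, drop as duplicate
        episodes.append(d)
    return episodes
```

For example, notifications dated 2001-03-10, 2001-03-25 and 2003-07-01 would collapse into two episodes, starting on the first and the last of those dates.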
Database matching • Further discussion for mortality: • Which database to use? • All causes X HIV-AIDS as a basic cause • Patients may be dying of other causes • Municipality X State • Patients may live in other cities • The municipality only records deaths that occurred in the city
Data analysis issues • Complex structure • Currently 17 tables with information • Dates are not date fields • We need dates!!! • We don’t collect information about specific visits • Each record covers the period from the last annotation up to the current one – could mean multiple visits • Definitions are hard to make
Data analysis issues • All the events have to be based on dates • Partially missing dates • In general I’ll accept missing days – set to the 15th • What to use as a surrogate? • For data collected under the study – date of last annotation • What about baseline data?
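The rule for partially missing dates could look like this in Python. The function name and the separate year/month/day inputs are assumptions, since the slides do not say how the non-date fields are stored.

```python
from datetime import date


def impute_date(year, month=None, day=None):
    """Build a date, setting a missing day to the 15th.

    Returns None when the year or month is missing, so the caller can
    fall back on a surrogate such as the date of last annotation.
    """
    if year is None or month is None:
        return None
    if day is None:
        day = 15  # mid-month, as described above
    return date(year, month, day)
```

Returning `None` rather than guessing a month keeps the surrogate decision explicit and in one place.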
Data analysis issues • Definition of Baseline data • Study began on September 1st, 2005 • Baseline data collection finished in June 2006 • “Baseline form” doesn’t mean baseline information • Is it baseline for the study or for the patient? • What about new patients? Do they have baseline data?
Data analysis issues • Definition of a new patient • We have two “candidate” dates • Date of enrollment in the clinic • Could be long before HIV diagnosis • Date of HIV diagnosis • Could be long before enrollment in that clinic • A “new” patient is not necessarily new, depending on what we want • Do we need newly diagnosed or newly enrolled? • Should we use both?
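If both definitions are kept, each patient can simply carry two flags. The sketch below assumes the study start date of September 1st, 2005 as the reference point; the function and key names are hypothetical.

```python
from datetime import date

STUDY_START = date(2005, 9, 1)  # study start date


def new_patient_flags(enrollment_date, hiv_diagnosis_date,
                      study_start=STUDY_START):
    """Flag a patient as newly enrolled and/or newly diagnosed.

    A missing date (None) yields False for the corresponding flag.
    """
    return {
        "newly_enrolled": (enrollment_date is not None
                           and enrollment_date >= study_start),
        "newly_diagnosed": (hiv_diagnosis_date is not None
                            and hiv_diagnosis_date >= study_start),
    }
```

A patient diagnosed in 1999 who enrolled in the clinic in January 2006 would be newly enrolled but not newly diagnosed, which is exactly the case the slide is worried about.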
Data analysis issues • Several possible outcomes • Primary outcome of study (TB) • Secondary outcome (death) • Operational outcomes • Waiting for PPD • PPD placed and read • Reactive PPD • INH started • How to deal with all of these?
Data analysis issues • General output for data analysis • For each patient, look for baseline status • As of Sept 2005 or at enrollment • Look for all changes over time • Need the dates!!! • Set up like a database for survival analysis • For every change repeat records with • Initial status • Initial date • Final status • Final date • Possible to customize for specific outcomes
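The survival-style layout described above, one record per status change, could be built along these lines. The function name, the dictionary keys and the status labels in the example are hypothetical.

```python
from datetime import date


def status_intervals(patient_id, baseline_status, baseline_date, changes):
    """Return one row per status change for a single patient.

    changes: iterable of (change_date, new_status) tuples, in any order.
    """
    rows = []
    status, start = baseline_status, baseline_date
    for change_date, new_status in sorted(changes):
        rows.append({
            "patient_id": patient_id,
            "initial_status": status,
            "initial_date": start,
            "final_status": new_status,
            "final_date": change_date,
        })
        # The next interval begins where this one ended.
        status, start = new_status, change_date
    return rows


rows = status_intervals(
    "P001",
    "waiting for PPD",
    date(2005, 9, 1),
    [(date(2006, 2, 15), "INH started"),
     (date(2005, 11, 3), "reactive PPD")],
)
for row in rows:
    print(row)
```

Customizing for a specific outcome then amounts to filtering or recoding the status values before the rows are built. A final censoring row per patient could be added in the same way if the analysis needs it.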