THRio

THRio Database Linkage and THRio Database Issues

Database matching • There are several systems that do not “talk” to each other • SINAN – reportable diseases (TB, AIDS) • SIM – Mortality • SICOM – Pharmaceutical database (ARVs) • THRio – Our DB • Original plan • Match THRio with all other 3 DBs above

Database matching • Problems • There is no unique identifier common to all systems • We use name, gender, DOB and mother’s name as surrogates • The information is not uniform – many missing variables – especially mother’s name • THRio • Standardization of name abbreviations • Double data entry • Not enough – names are misspelled • The other databases – even worse • No QC
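
A minimal sketch of how the surrogate fields might be standardized before any comparison. The field names (`name`, `gender`, `dob`, `mother_name`) and the exact cleaning rules are assumptions for illustration; they are not the THRio standardization rules.

```python
import re
import unicodedata


def normalize_name(name):
    """Return an upper-case, accent-free, single-spaced version of a name."""
    if not name:
        return ""
    # Decompose accented characters and drop the combining marks ("ã" -> "a").
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter or a space, then collapse whitespace.
    letters = re.sub(r"[^A-Za-z ]+", " ", plain)
    return " ".join(letters.upper().split())


def surrogate_key(record):
    """Build a comparison tuple from the four surrogate fields.

    The dictionary keys used here are hypothetical field names.
    """
    return (
        normalize_name(record.get("name")),
        (record.get("gender") or "").strip().upper(),
        record.get("dob") or "",
        normalize_name(record.get("mother_name")),
    )
```

Missing values (especially mother's name) come out as empty strings, so later steps can tell "missing" apart from "different".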

Database matching • Proposed strategy • Compare different approaches • Translated SOUNDEX • Reclink – probabilistic linkage • Other algorithms • Apply to different examples and get sensitivity/specificity for each one • SICOM • Sequential matching • Match TB before doing the sequential
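
A sketch of how sensitivity and specificity could be computed for each linkage approach, assuming every approach returns a set of (THRio id, other-DB id) pairs and that a reference set of true pairs is available. The function name and arguments are illustrative.

```python
def sensitivity_specificity(predicted_pairs, true_pairs, all_pairs):
    """Compare predicted links against a reference standard.

    predicted_pairs: pairs the method declared as matches
    true_pairs:      pairs accepted as real matches (the reference standard)
    all_pairs:       every pair that was evaluated
    """
    predicted = set(predicted_pairs)
    truth = set(true_pairs)
    universe = set(all_pairs)

    tp = len(predicted & truth)             # real matches that were found
    fn = len(truth - predicted)             # real matches that were missed
    fp = len(predicted - truth)             # declared matches that are wrong
    tn = len(universe - predicted - truth)  # correctly rejected pairs

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Because non-matching pairs vastly outnumber matches in record linkage, specificity tends to sit close to 1 for any reasonable method, so sensitivity is usually the more informative figure.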

Database matching • The project was split: • ARV database revisited • Development of a new algorithm for database linkage

Database matching • ARV database revisited • Consistency problems (as pointed out before) • First HAART abstracted for THRio • Inconsistency confirmed • Dates did not match (40%) • Drugs did not match • Now all the ART history will be collected (since HAART only) • Should we insist and compare the database with the whole history?

Database matching • Development of algorithm for database linkage • Using Python to implement the interface • Adapted soundex algorithm • “Gestalt” algorithm – rather hyperbolic • Direct field comparisons • Including a hierarchical structure for searching and comparing records • Means taking advantage of differences in amount of information available • Computational problems • Optimization
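
A sketch of a “gestalt” name score. It assumes that Python's standard `difflib.SequenceMatcher`, which is based on Ratcliff/Obershelp gestalt pattern matching, is an acceptable stand-in; the slides do not say which implementation was actually used.

```python
from difflib import SequenceMatcher


def gestalt_score(a, b):
    """Similarity between two strings, from 0.0 (nothing shared) to 1.0 (identical)."""
    # Case and surrounding whitespace are ignored; accents are not handled here.
    left = " ".join(a.upper().split())
    right = " ".join(b.upper().split())
    if not left and not right:
        return 0.0  # two missing values carry no evidence of a match
    return SequenceMatcher(None, left, right).ratio()
```

A misspelling that changes one letter of a long name lowers the score only slightly, which is what makes this kind of score useful next to exact field comparisons.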

Database matching • Blocking • Speeds up computation • I’ll be concerned with records that are a little similar to begin with • Soundex • First and last names • Mother’s first and last names • First name and mother’s last name • Needed to expand to account for errors in the first and last names’ first letter
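
A sketch of Soundex-based blocking using the three key combinations listed above. It uses the standard American Soundex, not the adapted version mentioned in the slides, and it does not include the expansion for first-letter errors. The field names are hypothetical, and names are assumed to be already stripped of accents.

```python
from collections import defaultdict

_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Standard four-character Soundex code of a name ('' if it has no letters)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def _first_last(full_name):
    parts = (full_name or "").split()
    return (parts[0], parts[-1]) if parts else ("", "")


def blocking_keys(record):
    """Soundex keys for name, mother's name, and first name + mother's last name."""
    first, last = _first_last(record.get("name"))
    m_first, m_last = _first_last(record.get("mother_name"))
    keys = set()
    for label, a, b in (("NN", first, last), ("MM", m_first, m_last),
                        ("NM", first, m_last)):
        if a and b:
            keys.add((label, soundex(a), soundex(b)))
    return keys


def candidate_pairs(left, right):
    """Index pairs (i, j) of records that share at least one blocking key."""
    index = defaultdict(set)
    for j, rec in enumerate(right):
        for key in blocking_keys(rec):
            index[key].add(j)
    pairs = set()
    for i, rec in enumerate(left):
        for key in blocking_keys(rec):
            for j in index.get(key, ()):
                pairs.add((i, j))
    return pairs
```

Only the pairs returned by `candidate_pairs` would go on to the full comparison, which is where the speed-up comes from.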

Database matching • Full comparison • All fields exactly the same • Small error in DOB • Similar names (gestalt) – generates scores • A combination of the above • Several “levels” created • Have to choose 2 cutoffs • Not a match • Definitely a match • Have to manually decide
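
The two-cutoff decision can be sketched as below. The cutoff values are placeholders chosen for illustration; the slides do not report the values that were used.

```python
def classify_pair(score, lower=0.75, upper=0.92):
    """Turn a similarity score into one of three decisions.

    lower and upper are hypothetical cutoffs, not the ones used in THRio.
    """
    if lower > upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if score >= upper:
        return "match"          # definitely a match
    if score < lower:
        return "non-match"      # not a match
    return "manual review"      # in between: someone has to decide
```

Moving the cutoffs closer together reduces manual work at the cost of more automatic errors; moving them apart does the opposite.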

Database matching • Computational problems – testing phase • Using PostgreSQL and Python • Too slow when matching with the TB database • > 100,000 records • Changed the algorithm to Python only • Computational times (currently) • THRio x SIM (12,689 X 2,922) • 3-4 minutes • THRio x TB (12,689 X 102,919) • 100-105 minutes

Database matching • Results • First we chose a sample of the mortality database • Year 2005 • AIDS only • 871 records • Matched with THRio database • 10,344 records at the time

Database matching • Compared Manual x Reclink x Algorithm • We were going to use the manual linkage as the gold standard • The algorithm found 13 extra right matches • We used the combination of those as the standard


Database matching • The algorithm outperformed both RecLink and manual check • But only after some adjustments • That was just the “training phase” • The only mistake still has to be checked: it may be a twin brother • Full info and only one different letter in the first name • We still have to test it again with a different sample and with TB

Database matching • THRio (latest) x SIM (2003-2005) • 340 matches (total) • Only 79 (23%) to be manually checked • This means that both DBs have good quality, at least in terms of completeness • Ended up with 273 matches and one possible mistake • When we actually implement it… • Extra check with date of last annotation in the chart

Database matching • Challenge • TB database • Data quality is much poorer than SIM • Might lead to lower sensitivity • Will lead to much more manual checking • Development of interface to help work

Database matching • THRio (latest) x TB (1995-2005) • 6453 matches (total) • 3870 (60%) to be manually checked • 721 (11%) with names only • Quality is much worse than SIM • Many duplicates • Proposed solutions: • Reduce time frame (for prospective TB cases only) • Use date of TB diagnosis to exclude duplicates • GUI to help
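
One possible reading of “use date of TB diagnosis to exclude duplicates”, sketched under assumptions: notifications for the same person whose diagnosis dates fall within a short window are treated as one episode. The 30-day window and the `(patient_key, diagnosis_date)` layout are hypothetical.

```python
from datetime import date


def drop_duplicate_notifications(records, window_days=30):
    """Keep one notification per TB episode.

    records: iterable of (patient_key, diagnosis_date) tuples.
    A record is dropped when the same patient already has a kept record
    whose diagnosis date is at most window_days earlier.
    """
    kept = []
    last_kept = {}
    for key, diagnosed in sorted(records):
        previous = last_kept.get(key)
        if previous is None or (diagnosed - previous).days > window_days:
            kept.append((key, diagnosed))
            last_kept[key] = diagnosed
    return kept
```

Records far apart in time are kept as separate episodes, since a patient can have TB more than once.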

Database matching • Further discussion for mortality: • What database to use? • All causes X HIV-AIDS as a basic cause • Patients may be dying of other causes • Municipality X State • Patients may live in other cities • Municipality just records deaths that occurred in the city

Data analysis issues

Data analysis issues • Complex structure • Currently 17 tables with information • Dates are not date fields • We need dates!!! • We don’t collect information about specific visits • It is the information since last annotation up to the current one – could mean multiple visits • Definitions are hard to make

Data analysis issues • All the events have to be based on dates • Partially missing dates • In general I’ll accept missing days – set to the 15th • What to use as a surrogate? • For data collected under the study – date of last annotation • What about baseline data?
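
The rule for missing days can be sketched as a small helper. Storing dates as separate year, month and day values is an assumption about the data layout.

```python
from datetime import date


def impute_partial_date(year, month, day=None, default_day=15):
    """Build a date, using the 15th when only the day is missing.

    Returns None when the year or month is missing, because those
    cases need a different surrogate (e.g. date of last annotation).
    """
    if year is None or month is None:
        return None
    return date(year, month, day if day is not None else default_day)
```

The 15th exists in every month, so the imputed date is always valid, and the error introduced is at most about two weeks.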

Data analysis issues • Definition of baseline data • Study begins on September 1st 2005 • Baseline data collection finished in June 2006 • “Baseline form” doesn’t mean baseline information • Is it baseline for the study or for the patient? • What about new patients? Do they have baseline data?

Data analysis issues • Definition of a new patient • We have two “candidate” dates • Date of enrollment in the clinic • Could be long before HIV diagnosis • Date of HIV diagnosis • Could be long before enrollment in that clinic • A “new” patient is not necessarily new, depending on what we want • Do we need newly diagnosed or newly enrolled? • Should we use both?
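
If both candidate dates are kept, each patient can carry two separate flags and the analysis can pick whichever definition it needs. In this sketch, using the study start (September 1st 2005) as the reference date is an assumption.

```python
from datetime import date

STUDY_START = date(2005, 9, 1)  # study start date given in the slides


def new_patient_flags(enrollment_date, hiv_diagnosis_date, reference=STUDY_START):
    """Flag a patient as newly enrolled and/or newly diagnosed.

    A missing date yields None for that flag instead of a guess.
    """
    def on_or_after(value):
        return None if value is None else value >= reference

    return {
        "newly_enrolled": on_or_after(enrollment_date),
        "newly_diagnosed": on_or_after(hiv_diagnosis_date),
    }
```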

Data analysis issues • Several possible outcomes • Primary outcome of study (TB) • Secondary outcome (death) • Operational outcomes • Waiting for PPD • PPD placed and read • Reactive PPD • INH started • How to deal with all of these?

Data analysis issues • General output for data analysis • For each patient, look for baseline status • As of Sept 2005 or at enrollment • Look for all changes in time • Need the dates!!! • Set up like a database for survival analysis • For every change repeat records with • Initial status • Initial date • Final status • Final date • Possible to customize for specific outcomes
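
A sketch of the survival-style layout: one row per interval between status changes, each with initial status, initial date, final status and final date. The input format (a list of `(date, status)` tuples per patient) and the status labels are illustrative assumptions.

```python
from datetime import date


def to_intervals(patient_id, events, end_date):
    """Turn dated status changes into one row per interval.

    events:   list of (date, status) tuples, starting with the baseline status
    end_date: end of follow-up; the last status is carried up to this date
    """
    ordered = sorted(events)
    rows = []
    # Each consecutive pair of events defines one interval.
    for (start, status), (stop, next_status) in zip(ordered, ordered[1:]):
        rows.append({"patient_id": patient_id,
                     "initial_status": status, "initial_date": start,
                     "final_status": next_status, "final_date": stop})
    # Close the follow-up with an interval in which the status does not change.
    if ordered and ordered[-1][0] < end_date:
        last_date, last_status = ordered[-1]
        rows.append({"patient_id": patient_id,
                     "initial_status": last_status, "initial_date": last_date,
                     "final_status": last_status, "final_date": end_date})
    return rows
```

For a specific outcome, the rows can then be filtered or collapsed, for example keeping only intervals that end with INH started.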

Thank you!
