Maushumi Mavinkurve Director, Center for Data Matching NYC Department of Health and Mental Hygiene

Implementation of Probabilistic Matching in NYC Chronic Hepatitis B and NYC A1C Registries, and Implications Towards an MPI Maushumi Mavinkurve Director, Center for Data Matching NYC Department of Health and Mental Hygiene October 17th, 2008 Integrated Surveillance Seminar

Overview • Describe data quality challenges in disease surveillance • Describe probabilistic matching techniques • Implementation of probabilistic matching • NYC Chronic Hepatitis B Registry (LVR) • NYC Hemoglobin A1C Registry (NYCAR) • Proposed challenges and benefits of a master patient index (MPI) for NYC

Public Health Surveillance • Public health surveillance process includes: • Collection of data on a specific disease or condition via standardized information systems • Analysis and interpretation of the data • Dissemination of information to individuals who can act on it • Utilization of information to facilitate the necessary response to effectively deal with the public health issue

Surveillance Data Quality Issues • Accuracy • Non-standardized across different data sources • Multiple laboratory systems • De-duplication of reports • Exact duplicates • Multiple events linked to a unique person • Non-relevant information

Impact of Data Quality Issues in Surveillance • Impacts on surveillance reporting • Over or underestimates of true cases • Geographical misrepresentation (missing address) • Increases costs • Additional staff required to address data quality issues • Increases inefficiencies • Timeliness for patient or provider follow up

Addressing Data Quality Challenges Modern disease surveillance information systems: • Validate data at time of collection • Minimize inaccurate or incomplete data • Standardize different data to a uniform structure • Integrate matching technology to create • Patient indexes (person-centric systems vs event-centric systems) • Provider indexes • Facility indexes

What is Probabilistic Matching? • Rule-based match algorithms • Standardize data • Parse data into smaller tokens • Create fields that enhance matching • Adapt to specific data: incorporate uniqueness or frequency of data values when comparing records • Process data in blocks: viable to use on large-volume data sets
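The steps on this slide can be sketched roughly as follows. This is a minimal illustration and not the registry's actual algorithm: the record layout (`name` as "Last, First", `dob` as an ISO date), the blocking key (birth year plus last-name initial), the constant `M_PROB`, and all function names are assumptions made for the example. The weights follow the common log-ratio form in which agreement on a rare value counts for more than agreement on a common one.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations

M_PROB = 0.95  # assumed probability that a field agrees for a true match

def standardize(value):
    """Uppercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^A-Z0-9 ]", " ", value.upper()).split())

def parse(record):
    """Split a 'Last, First' name into standardized tokens; keep the birth date."""
    last, _, first = record["name"].partition(",")
    return {"last": standardize(last), "first": standardize(first), "dob": record["dob"]}

def candidate_pairs(parsed):
    """Blocking: only compare records sharing birth year and last-name initial."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(parsed):
        blocks[rec["dob"][:4] + rec["last"][:1]].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)

def build_frequencies(parsed):
    """Count how often each name token occurs, per field."""
    return {f: Counter(rec[f] for rec in parsed) for f in ("last", "first")}

def field_weight(a, b, freq, total):
    """Log-ratio weight; u approximates the chance of agreeing on this value by coincidence."""
    u = freq[a] / total
    if a == b:
        return math.log2(M_PROB / u)
    return math.log2((1 - M_PROB) / (1 - u))

def pair_weight(a, b, freqs, total):
    """Sum the field weights for last and first name."""
    return sum(field_weight(a[f], b[f], freqs[f], total) for f in ("last", "first"))
```

Blocking keeps the number of comparisons manageable: only records that share a block key are scored against each other, rather than every possible pair.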

Evaluating Match Algorithm • Outcome of a potential match is a weight or likelihood that 2 records are the same entity • Surveillance programs identify thresholds for match algorithm • Prior to reviewing results of match algorithm: • Identify implications for precision (positive predictive value, PPV) vs negative predictive value (NPV) • Evaluation of health code mandate • Practical issues • Surveillance reporting • Identify guidelines or criteria to review matches
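For reference, PPV and NPV can be written in terms of reviewed record pairs. The symbols are notation assumed here, not taken from the slides: TP is the number of pairs linked that truly are the same entity, FP the pairs linked that are different entities, TN the pairs left unlinked that are different entities, and FN the pairs left unlinked that are the same entity.

```latex
\[
\mathrm{PPV} = \frac{TP}{TP + FP},
\qquad
\mathrm{NPV} = \frac{TN}{TN + FN}
\]
```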

Identifying Thresholds • Goal: maximize precision (PPV) • Sacrifice some negative predictive value (NPV) • Surveillance programs can decide to review ambiguous matches • Therefore: set high thresholds
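A small sketch of the trade-off described on this slide, under assumed names: `classify` places a pair weight into a link, review, or non-link zone, and `summarize` counts false links and missed matches for pairs whose true status is already known. The two-threshold layout and the sample weights are illustrative assumptions only.

```python
def classify(weight, link_threshold, review_threshold):
    """Assign a scored record pair to one of three zones."""
    if weight >= link_threshold:
        return "link"
    if weight >= review_threshold:
        return "review"
    return "non-link"

def summarize(scored_pairs, link_threshold, review_threshold):
    """Count outcomes for pairs given as (weight, truly_same_entity)."""
    summary = {"link": 0, "false_link": 0, "review": 0, "missed_match": 0}
    for weight, same in scored_pairs:
        zone = classify(weight, link_threshold, review_threshold)
        if zone == "link":
            summary["link"] += 1
            if not same:
                summary["false_link"] += 1
        elif zone == "review":
            summary["review"] += 1
        elif same:
            summary["missed_match"] += 1
    return summary

# Hypothetical reviewed pairs: (match weight, truly the same entity?)
pairs = [(12.0, True), (10.5, True), (9.0, True), (7.5, False),
         (7.0, True), (4.0, False), (2.0, False)]
low = summarize(pairs, link_threshold=4.0, review_threshold=4.0)
high = summarize(pairs, link_threshold=8.0, review_threshold=8.0)
```

With these made-up weights, the low threshold produces two false links and no missed matches, while the high threshold produces no false links and one missed match. Setting a review threshold below the link threshold sends the ambiguous pairs to staff instead of discarding them.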

Outcome of Probabilistic Matching Entity-centric, relational registry system

Background of Hepatitis B in NYC • Decline in acute Hepatitis B incidence rates (per 100,000 persons) from 11.5 in 1985 to 1.6 in 2006 • In NYC, the burden of chronic Hepatitis B infection is as much as 2x higher within specific populations • MSM • IDU • Persons born in regions where HBsAg prevalence is >2% • Need for continued surveillance and monitoring Source: Recommendations for identification and public health management of persons with chronic Hepatitis B infection http://www.cdc.gov/mmwr/preview/mmwrhtml/rr5708a1.htm

Hepatitis B Surveillance Activities • Monitor disease trends • Aggregate descriptive reporting aimed to guide prevention and intervention efforts • Outreach with newly infected • Educational materials to new cases reported to the registry

NYC Hepatitis B Registry • Legacy application, built in-house in 1999 • Automatic weekly batch uploads of laboratory reports • Data entry of provider reports • System did not index on patients (event-based), could not link 2 reports for the same person. • Program utilized staff to build and apply deterministic match algorithms • Resource intensive • Version control

NYC Liver Virus Registry (LVR) • Implemented in October 2008, built in-house • Migrated all legacy data • Web-based application • Person-centric - integrates probabilistic matching • Consolidated views of all information for a person • Ability to conduct longitudinal analysis

LVR Probabilistic Matching • Created a match algorithm based on fields unique to patient from laboratory and provider reports • Processed all legacy data (~380,000 records) • Program evaluated algorithm and identified thresholds • Results: the match algorithm linked ~380,000 reports to ~111,000 unique persons • Probabilistic matching enhanced de-duplication by 1% compared to the legacy deterministic algorithm

LVR Challenges & Successes • Challenges: • Iterative review process time and resource intensive • Evaluation against legacy deterministic match • Identifying target PPV and NPV • Successes: • Long term savings on time and resources • Streamlined system • Longitudinal analysis • More accurate case counting • Enhanced data quality

Implementing Probabilistic Matching with NYC Hemoglobin A1C Registry (NYCAR)

What is Diabetes? • Diabetes is a chronic disease caused by inadequate insulin levels or sensitivity leading to elevated blood sugar levels • Blood sugar levels can be measured by • Plasma glucose • Fingerstick glucose • Glycosylated hemoglobin or A1C (goal is <7%) • Persistently high blood sugar levels can cause • Heart disease and stroke • Kidney failure • Blindness • Nerve damage and amputation

Diabetes Burden in NYC • Diabetes is epidemic in NYC • Prevalence has more than doubled over the past 10 years. • Approximately 500,000 New Yorkers have diabetes • An additional ~200,000 New Yorkers have diabetes, but have not yet been diagnosed • Approximately 1 in 8 adults have diabetes • In 2006, diabetes was the 4th leading cause of death in NYC

Use of Traditional Public Health Surveillance for Chronic Disease Disease reporting to public health agency to: • Monitor trends: describe glycemic control in NYC • Identify special populations: target individuals with poor control • Communicate with provider community: feedback to providers and their patients • Control epidemics: decrease complications and improve quality of life

Hemoglobin A1C Tests • A1C is a measure of average blood sugar control over the preceding 3 months (goal <7%) • A1C is used to: • Monitor individual’s blood sugar control • Guide changes in medication therapy • Indicate risk of diabetes complications • Most people who get A1Cs have diabetes so it is a marker for diabetes status THEREFORE, AN A1C REGISTRY WILL PROVIDE A MECHANISM FOR TRACKING INDIVIDUALS WITH DIABETES

Implementation of NYCAR • Based on existing NY State / NYC laboratory reporting system • Amendment to NYC health code, Article 13 which mandates communicable disease reporting, to include A1C • Public hearing Summer 2005 • Approval of amendment December 2005 • Went into effect January 15, 2006 • Laboratories submitting data to NY State and NYC subject to mandate • Report information on patient, ordering provider and facility, testing facility and result • Submit via secure network • Receive ~5,000 new lab reports daily – High Volume

Objectives of New York City A1C Registry (NYCAR) • Surveillance and epidemiology • Track trends on the population level • Provider feedback and communication • Quarterly provider reports in comparison to peers • Quarterly rosters of patients stratified by A1C level • Patient feedback (via provider) • Letters with A1C information • Local resources • Deliver resources to providers/patients All of the above require matching and data linkages

Components of A1C Registry • Information collected by laboratory reports includes: • Individual name, address, date of birth, sex • Name and address of ordering provider, ordering facility and testing facility • A1C test collection date and result

NYCAR Probabilistic Methodology • Created 3 separate matching models: • Patient • Provider • Ordering Facility • Obtained a representative sample of data • For each model, created a match algorithm utilizing fields that uniquely identify each entity • Name (patient, provider, ordering facility), patient date of birth, gender, address, providerID, telephone number, etc. • Provided match results to program to review and identify thresholds

Program Threshold Evaluation • Due to volume of reports, impractical for staff to review all ambiguous matches: need to set thresholds • Method to identify thresholds using a sample • 2 reviewers and 1 tie-breaker scored matches referencing guidelines • Utilized a sampling method within weight ranges • Identified specific weight or threshold at which target precision rates were met based on review
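One way the review procedure on this slide could be sketched, with every name and parameter below being an assumption for illustration: `sample_by_weight_range` draws pairs from each weight band, `adjudicate` resolves two reviewer scores with a tie-breaker, and `lowest_threshold_meeting` returns the lowest weight cutoff at which the reviewed pairs at or above it reach the target precision. As a simplification, precision is computed directly on the reviewed sample without re-weighting bands to their share of the full data.

```python
import random
from collections import defaultdict

def sample_by_weight_range(pairs, band_width, per_band, seed=0):
    """Draw up to per_band pairs from each weight band of size band_width."""
    rng = random.Random(seed)
    bands = defaultdict(list)
    for pair in pairs:
        bands[int(pair["weight"] // band_width)].append(pair)
    sample = []
    for band in sorted(bands):
        members = bands[band]
        sample.extend(rng.sample(members, min(per_band, len(members))))
    return sample

def adjudicate(reviewer_a, reviewer_b, tie_breaker):
    """Use the tie-breaker's score only when the two reviewers disagree."""
    return reviewer_a if reviewer_a == reviewer_b else tie_breaker

def lowest_threshold_meeting(reviewed, target_precision):
    """reviewed holds (weight, is_true_match); return the lowest qualifying cutoff or None."""
    best = None
    for cutoff in sorted({weight for weight, _ in reviewed}, reverse=True):
        kept = [is_match for weight, is_match in reviewed if weight >= cutoff]
        if sum(kept) / len(kept) >= target_precision:
            best = cutoff
    return best
```

For example, with hypothetical reviewed pairs `[(12, True), (11, True), (10, True), (9, True), (8, False), (7, True), (6, False)]`, a 90% target yields a threshold of 9, while relaxing the target to 80% lowers it to 7.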

Deploying Probabilistic Matching • All new incoming A1C lab reports parsed into 3 staging entities: • patient, provider and facilities • Each entity is matched against existing respective entities in the registry • If matched above thresholds, linked to an existing record • If below thresholds, a new entity is created (patient, provider or facility) • Provider Reports and Rosters and Patient Letters are generated using an in-house developed application which reads from the registry
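The link-or-create flow on this slide might look roughly like the sketch below. It is a toy in-memory illustration, not the NYCAR implementation: the report field names, the `Registry` class, the `toy_score` function, and the choice to treat a weight equal to the threshold as a link are all assumptions.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Registry:
    """Toy in-memory index for one entity type (patient, provider, or facility)."""
    threshold: float
    entities: dict = field(default_factory=dict)  # entity id -> linked records
    _ids: count = field(default_factory=lambda: count(1))

    def link_or_create(self, record, score):
        """Link to the best-scoring entity at or above threshold, else create one."""
        best_id, best_weight = None, float("-inf")
        for entity_id, linked in self.entities.items():
            weight = max(score(record, existing) for existing in linked)
            if weight > best_weight:
                best_id, best_weight = entity_id, weight
        if best_id is not None and best_weight >= self.threshold:
            self.entities[best_id].append(record)
            return best_id
        new_id = next(self._ids)
        self.entities[new_id] = [record]
        return new_id

def stage(report):
    """Parse one lab report into three staging entities (hypothetical field names)."""
    return {
        "patient": {"name": report["patient_name"], "dob": report["patient_dob"]},
        "provider": {"name": report["provider_name"], "provider_id": report["provider_id"]},
        "facility": {"name": report["facility_name"], "phone": report["facility_phone"]},
    }

def process_report(report, registries, score):
    """Match each staged entity against its own registry; return the entity ids."""
    staged = stage(report)
    return {kind: registries[kind].link_or_create(rec, score) for kind, rec in staged.items()}

def toy_score(a, b):
    """Stand-in for a probabilistic weight: number of exactly agreeing fields."""
    return sum(1.0 for key in a if a[key] == b.get(key))
```

Keeping a separate registry and threshold per entity type mirrors the three separate matching models, so that a patient threshold can be tuned independently of the provider and facility thresholds.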

Facility Report (Pages 1 and 2) Note: All information in this slide is fictitious

Provider Report Note: All information in this slide is fictitious

Patient Letter

Challenges and Successes • Challenges • Quality of record linkage • Need sufficient information for successful linkage of multiple tests per individual as well as master provider and facility indexing • Maintaining accurate facility-provider linkage • Effect of laboratory variation on availability of data • Reviewing thresholds is time and resource intensive • Successes • Entire process is seamless, electronic and automated • Handles a high volume of data • Ability to conduct longitudinal analysis

Is NYC ready for an MPI?

NYC Current Status • Modernizing several disease registries: • Chronic Hepatitis B - completed • NYCAR – completed • STD – requirements completed • TB – requirements completed • HIV – planning Is this an opportune time to develop an MPI?

Planning an MPI: Challenges • Each registry program has requirements for matching based on: • Patient population • Data quality and volume • Dissemination/use of surveillance data • Foster consensus among disease programs • Higher risk of a security breach • Legal barriers to creating an MPI • Analysis of health code by reportable disease • Political barriers to creating an MPI

Planning an MPI: Benefits • Pooling data from different sources could enhance PPV and NPV of the match • Streamline IT resources • Support staff • Infrastructure • Ability to conduct syndemic surveillance and investigation • More efficient use of limited resources

Acknowledgements • Diabetes Prevention and Control Program: Lynn Silver, Shadi Chamany, Angela Merges, Charlotte Neuhaus, Bahman Tabei, Cindy Driver, Leslie Korenda • Division of Informatics and Information Technology: Don Weiner, Stephen Giannotti, Namrata Kumar, Jisen Ho, Laura Goodman • Bureau of Chronic Disease Control: Katherine Bornschlegl, Magdalena Berger, Emily Lumeng • Division of Epidemiology: Lorna Thorpe, Bonnie Kerker, Jenna Mandel-Ricci, Ram Koppaka

Questions? Maushumi Mavinkurve Director, Center for Data Matching NYC Department of Health and Mental Hygiene mmavinku@health.nyc.gov (P) 212 515 5182
