Square wheels: electronic medical records for discovery research in rheumatoid arthritis ^ genetic Robert M. Plenge, M.D., Ph.D. October 30, 2009 NCRR sponsored "Using EHR Data for Discovery Research" HARVARD MEDICAL SCHOOL
Key questions What are the regulatory obstacles impacting your work? What are the resource needs required to replicate your work at other institutions? What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?
Key questions How can I implement your approach, and how much better is it?
genotype phenotype clinical care
genotype bottleneck phenotype clinical care
October 2009: >30 RA risk loci, together explaining ~35% of the genetic burden of disease.
[Figure: timeline of RA risk locus discovery; axis years 1978, 1987, 2003, 2004, 2005, 2007, 2008, 2009. Loci shown: REL, BLK, TAGAP, CD28, TRAF6, PTPRC, FCGR2A, PRDM1, CD2-CD58, CD40, CCL21, CD244, IL2RB, TNFRSF14, PRKCQ, PIP4K2C, IL2RA, AFF3, TNFAIP3, STAT4, TRAF1-C5, IL2-IL21, HLA DR4, “shared epitope” hypothesis, PADI4, PTPN22, CTLA4.]
Latest GWAS in 25,000 case-control samples with replication in 20,000 additional samples: >10 new loci (Raychaudhuri et al., in press, Nature Genetics).
genotype phenotype bottleneck clinical care
Genetic predictors of response to anti-TNF therapy in RA: PTPRC/CD45 allele, n = 1,283 patients, P = 0.0001. Submitted to Arthritis & Rheumatism.
How can we collect DNA and detailed clinical data on >20,000 RA patients?
What are the options for collecting clinical data and DNA for genetic studies?
Content of EMRs
EMRs are increasingly utilized!
• Narrative data = free-form written text
  • info about symptoms, medical history, medications, exam, impression/plan
• Codified data = structured format
  • age, demographics, and billing codes
This is not a new idea… Sens: 89% PPV: 57% Gabriel (1994) Arthritis and Rheumatism
…but EMR data are “dirty” Conclusion: The sole reliance on such databases for the diagnosis of RA can result in substantial misdiagnosis. Gabriel (1994) Arthritis and Rheumatism
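The two figures quoted from Gabriel (1994) measure different things, which is why a rule can look good on one and poor on the other. A minimal sketch of the arithmetic follows; the chart-review counts are made up for illustration (they are not from the paper) and were chosen only so the ratios land near the quoted values.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true RA cases that the database rule flags."""
    return tp / (tp + fn)


def ppv(tp: int, fp: int) -> float:
    """Fraction of flagged patients who turn out to have RA."""
    return tp / (tp + fp)


# Hypothetical counts (assumption, not data from Gabriel 1994):
# tp = flagged and truly RA, fp = flagged but not RA, fn = RA but not flagged.
tp, fp, fn = 80, 60, 10

sens = sensitivity(tp, fn)
precision = ppv(tp, fp)
print(f"sensitivity = {sens:.0%}, PPV = {precision:.0%}")
```

With counts like these, most true cases are captured, yet over 40% of the flagged patients would be misclassified as RA.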
4 million patients → ICD9 RA and/or CCP checked (goal = high sensitivity) → 31,171 patients → Classification algorithm (goal = high PPV) → 3,585 RA patients → Discarded blood for DNA; clinical subsets
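The first stage of this funnel is a deliberately broad screen. The sketch below shows one way such a screen could look; the `Patient` record, its field names, and the use of the ICD-9 714 code family as the "ICD9 RA" criterion are assumptions for illustration, not details given on the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Patient:
    # Hypothetical record layout; a real research datamart will differ.
    patient_id: str
    icd9_codes: List[str]
    ccp_checked: bool


def passes_screen(p: Patient) -> bool:
    """High-sensitivity screen: any ICD9 RA code OR a CCP test on file."""
    # Assumption: codes beginning with "714" stand in for "ICD9 RA".
    has_ra_code = any(code.startswith("714") for code in p.icd9_codes)
    return has_ra_code or p.ccp_checked


patients = [
    Patient("A", ["714.0"], ccp_checked=False),   # RA code only
    Patient("B", ["401.9"], ccp_checked=False),   # neither criterion
    Patient("C", ["250.00"], ccp_checked=True),   # CCP test only
]

screened = [p for p in patients if passes_screen(p)]
print([p.patient_id for p in screened])
```

The screen uses a logical OR because the goal at this stage is to miss as few true cases as possible; the high-PPV classification step comes afterwards.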
Our library of RA phenotypes (Qing Zeng, Shawn Murphy)
• Natural language processing (NLP)
  • disease terms (e.g., RA, lupus)
  • medications (e.g., methotrexate)
  • autoantibodies (e.g., CCP, RF)
  • radiographic erosions
• Codified data
  • ICD9 disease codes
  • prescription medications
  • laboratory autoantibodies
‘Optimal’ algorithm to classify RA: NLP + codified data. Codified data and NLP data feed a regression model with a penalty parameter (to avoid over-fitting). Tianxi Cai, Kat Liao
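To make the idea of a penalty parameter concrete, here is a minimal sketch of a penalized logistic regression. It is not the authors' model: the slide does not say which penalty was used, so an L2 (ridge) penalty is assumed here, and the three binary features and eight labelled rows are invented toy data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, penalty=0.0, lr=0.1, steps=2000):
    """Gradient descent on mean log-loss + (penalty / 2) * sum(w_j ** 2).

    The intercept b is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        # The penalty adds penalty * w_j to each weight's gradient.
        w = [wj - lr * (gj + penalty * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_proba(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy data (assumption). Columns: ICD9 RA code present (codified),
# RA mentioned in notes (NLP), erosions mentioned in notes (NLP).
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
     [1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 1, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # hypothetical chart-review labels

w_plain, b_plain = fit_logistic(X, y, penalty=0.0)
w_pen, b_pen = fit_logistic(X, y, penalty=1.0)


def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))


print("unpenalized weight norm:", round(norm(w_plain), 2))
print("penalized weight norm:  ", round(norm(w_pen), 2))
```

The penalty shrinks the weights toward zero, which limits how strongly the model can latch onto a feature that happens to separate a small training set. An L1 penalty would be another option when the aim is to drop uninformative features entirely.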
High PPV with adequate sensitivity ✪392 out of 400 (98%) had definite or possible RA!
This means more patients! ~25% more subjects with the complete algorithm: 3,585 subjects (3,334 with true RA) versus 3,046 subjects (2,680 with true RA)
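The counts on this slide can be turned into PPVs and relative gains directly. The sketch below only does arithmetic on the four numbers shown; the label "comparison" is a placeholder, since the slide does not name the second algorithm.

```python
# Counts copied from the slide.
complete = {"classified": 3585, "true_ra": 3334}
comparison = {"classified": 3046, "true_ra": 2680}  # unnamed on the slide

ppv_complete = complete["true_ra"] / complete["classified"]
ppv_comparison = comparison["true_ra"] / comparison["classified"]

# Relative gain of the complete algorithm over the comparison.
gain_true_ra = complete["true_ra"] / comparison["true_ra"] - 1
gain_classified = complete["classified"] / comparison["classified"] - 1

print(f"PPV, complete algorithm: {ppv_complete:.1%}")
print(f"PPV, comparison:         {ppv_comparison:.1%}")
print(f"More true RA patients:   {gain_true_ra:.1%}")
print(f"More classified in all:  {gain_classified:.1%}")
```

The "~25% more" figure corresponds to the gain in patients with true RA (about 24%); the gain in total classified subjects is smaller, at about 18%.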
Linking the Datamart-Crimson NLP data Codified data
Status of i2b2 Crimson collection
• Over 3,000 samples collected to date
  • cost = $10 per sample
• DNA extracted on >2,400 buffy coats
  • cost = $20 per sample
  • >90% had ≥1 µg of DNA
  • >99% had ≥5 µg of DNA after WGA
• Genotyping of 384 SNPs (RA risk alleles, AIMs, other) is ongoing at Broad Institute
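The per-sample costs above allow a rough budget for a larger cohort. This is a back-of-the-envelope sketch only: the 20,000-patient target comes from the earlier question in the talk, and genotyping, staffing, and datamart costs are not quoted on the slides, so they are left out.

```python
COLLECTION_COST = 10       # USD per sample, from the slide
EXTRACTION_COST = 20       # USD per sample, from the slide
TARGET_PATIENTS = 20_000   # cohort size posed earlier in the talk

per_sample = COLLECTION_COST + EXTRACTION_COST
total = per_sample * TARGET_PATIENTS

# Excludes genotyping, personnel, and infrastructure (not given in the talk).
print(f"${per_sample} per sample, ${total:,} for {TARGET_PATIENTS:,} patients")
```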
Status of i2b2 Crimson collection
• Measured autoantibodies from plasma
  • 5 autoantibodies in ~380 RA patients
  • ~85% are CCP+, ~35% ANA+, ~15% TPO+
• Question: are non-RA autoantibodies present at increased frequency in RA patients vs matched controls?
Stay tuned…more data soon!
Key questions How can I implement your approach, and how much better is it?
Key questions What are the regulatory obstacles impacting your work? What are the resource needs required to replicate your work at other institutions? What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?
Regulatory obstacles • IRB approval • De-identified vs truly anonymous • Open question: sharing of genetic data
Key questions What are the regulatory obstacles impacting your work? What are the resource needs required to replicate your work at other institutions? What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?
Resources required
• Building a research DataMart
  • clinical EMR ≠ research EMR
  • multiple FTEs to build/maintain
• NLP expertise
  • open-source software available
  • iterative process for fine-tuning
• Clinical expertise
  • understand nature of clinical data
Resources required (cont.)
• Statistical expertise
  • simple algorithm is not sufficient
  • prepare for the unexpected!
  • true for narrative and codified
• Biospecimen collection, DNA extraction
  • varies by institution
  • Crimson
  • Broad Institute
Key questions What are the regulatory obstacles impacting your work? What are the resource needs required to replicate your work at other institutions? What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?
Clinical features of patients CCP has an OR = 1.5 for predicting erosions
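For readers less familiar with odds ratios, the sketch below shows how an OR of this size arises from a 2×2 table. The counts are invented purely to illustrate the calculation and are not the study's data.

```python
def odds_ratio(exposed_cases: int, exposed_noncases: int,
               unexposed_cases: int, unexposed_noncases: int) -> float:
    """Odds of the outcome among the exposed divided by odds among the unexposed."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Hypothetical 2x2 table (assumption, not data from the talk):
#              erosions   no erosions
#   CCP+          60          40
#   CCP-          50          50
ccp_or = odds_ratio(60, 40, 50, 50)
print(f"OR = {ccp_or:.1f}")
```

In this made-up table the odds of erosions are 60:40 among CCP-positive patients and 50:50 among CCP-negative patients, a modest association.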
Subset patients in clinically meaningful ways: causes of mortality. NLP + codified data, together with statistical modeling, to define cardiovascular disease.
Non-responder to anti-TNF therapy: NLP + codified data, together with statistical modeling, to define treatment response.
Responder to anti-TNF therapy: NLP + codified data, together with statistical modeling, to define treatment response.
Post-marketing surveillance of adverse events (pharmacovigilance): NLP + codified data, together with statistical modeling, to define treatment response.
Options for clinical + DNA Conclusion: NLP + codified data, together with appropriate statistical modeling, can yield accurate clinical data.
Options for clinical + DNA Conclusion: We can collect DNA and plasma in a high-throughput manner.
Options for clinical + DNA Conclusion: The cost is reasonable...even for >20,000 RA patients!
genotype phenotype clinical care
Acknowledgments: Zak Kohane, Susanne Churchill, Vivian Gainer, Kat Liao, Tianxi Cai, Shawn Murphy, Qing Zeng, Soumya Raychaudhuri, Beth Karlson, Pete Szolovits, Lee-Jen Wei, Lynn Bry (Crimson), Sergey Goryachev, Barbara Mawn & many others! Namaste!
Narrative data (NLP text extractions) Codified data (ICD9 codes, etc)