Discovering Patterns in Adverse Drug Reactions

Student: Ernst Joham • Supervisor: Associate Prof Jiuyong Li • Associate Supervisor: Dr. Jan Stanek

Outline • Background • Motivation • Research questions • Literature review • Data Mining process • Results • Conclusion

Background • What is data mining? Data mining is used to discover unexpected, interesting and valuable information in datasets. • A high percentage of hospital admissions and prolonged hospitalisations are due to adverse drug reactions (ADRs). • What can cause ADRs? • The dosage given to the patient • More than one drug taken at the same time • Ingredients in drugs that can trigger an adverse reaction.

Background • Problems with medical datasets • Medical data is more diverse and complex • Ethical and legal issues • Data quality • Missing values • Noise • Ownership • Lack of information

Motivation • To discover patterns in medical datasets successfully • To find the algorithms best suited to handling noise and missing values in medical datasets • To cope better with the complexity and diversity of medical datasets

Research Questions • The aim of the research was to use data mining methods in an attempt to produce relevant results from real-world medical data. • The following research questions were answered: (1) Is it possible to discover patterns in sparse datasets? (2) What patterns can be identified through data mining for ADRs?

Literature review (techniques) • Decision Tree, Logistic programs, K nearest neighbour and Bayesian classifier techniques have been applied to medical datasets (Lavrač 1999). • Lee et al. (2000) state that techniques that easily extract specific knowledge are the key to medical decision making. • A study on drug discovery showed that neural networks performed better than logistic regression, but decision trees performed better at identifying active compounds (Obenshain 2004).

Literature review (process model) • Medical data mining applications that are expected to discover new knowledge should follow a five-stage process model (Wang 2000): • planning tasks • developing data mining hypotheses • preparing data • selecting data mining tools • evaluating data mining results. • Cios & Moore (2002) state that success depends on following the DMKD process model, which adds several steps to the CRISP-DM model and has been applied to several medical problem domains.

Literature review (problems with medical datasets) • Brown & Kros (2003) focused on the impact of missing data and how existing methods can help. They categorise methods for dealing with missing data into: • Use complete data only • Delete selected cases or variables • Data imputation • Model-based approaches • Some researchers have focused on data cleansing tools to help eliminate noise, but these achieve only reasonable results (Zhu & Wu 2004).
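The first three categories of missing-data handling listed above can be illustrated with a minimal Python sketch. The toy records, field names and helper functions are hypothetical and are not taken from the ADR dataset or from Brown & Kros (2003). Model-based approaches are not shown.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"age_days": 400, "route": "Oral"},
    {"age_days": None, "route": "Topical"},
    {"age_days": 2500, "route": None},
    {"age_days": 900, "route": "Oral"},
]

def complete_cases(rows):
    """Use complete data only: keep rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_variable(rows, name):
    """Delete a selected variable from every row."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]

def impute_mean(rows, name):
    """Data imputation: replace missing numeric values with the observed mean."""
    observed = [r[name] for r in rows if r[name] is not None]
    fill = mean(observed)
    return [{**r, name: fill if r[name] is None else r[name]} for r in rows]

complete = complete_cases(records)
without_route = drop_variable(records, "route")
imputed = impute_mean(records, "age_days")
print(len(complete), imputed[1]["age_days"])
```

Complete-case analysis discards half of this toy sample, which shows why deletion becomes costly when a dataset is already small and sparse.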

Literature review • According to Zhu & Wu (2004), attribute noise is more difficult to handle and includes: • (1) Incorrect attribute values • (2) Missing or don’t know attribute values • (3) Incomplete attributes or don’t care values

Data Mining Process • The project used the six-step CRISP-DM data mining process • Understand the main aim of the project • Understand the dataset
ADRDATE | Agedays | BRAND | DRUG | ID | Prob | ROUTE | Recov | Severity | URNO | ATC
31/01/2007 | | Lyclear | Permethrin | 707 | Cert | Topical | Rec | Minor | unknown | P03AC04
9/06/2003 | 14367 | Tegretol CR | Carbamazepine | 4 | Cert | Oral | Rec | | ax6cx8z | N03AF01
11/06/2003 | 1 4173 | Zoloft | Sertraline | 5 | Unc | Oral | | | ax66486 | N06AB06

Data Mining Process • Summary of missing values • Total: 1286 records

Data Mining Process • Data in .csv format • R programming language • Rattle tool for data mining • Data preparation • Remove duplicates • Correct misspelled words • Correct meanings of values • Find missing ATC (Anatomical Therapeutic Chemical) values • Leave missing values for the rest of the dataset

Data Mining Process • Data transformation • Date when the patient was admitted to hospital for ADRs (October-March = 1, April-September = 0) • Patient age, binned into categories with roughly equal numbers of records (0-2 years old = 1, 2-5 years old = 2, 5-11 years old = 3, 11-16 years old = 4, above 16 years of age = 5) • Route of administration of the medication that caused the ADR, either oral or intravenous (Oral = 1, Intravenous = 0) • Whether the patient recovered from the ADR (Recovered = 0, Not recovered = 1) • Whether the drug given to the patient is an antibiotic (Antibiotics = 1, Not Antibiotics = 0)
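The coding scheme above can be sketched in Python as follows. This is an illustration rather than the project's R/Rattle code, and it rests on several assumptions that the slides do not state: ages are converted from days with 365.25 days per year, each age band excludes its upper bound, routes other than oral or intravenous stay missing, and a drug counts as an antibiotic when its ATC code starts with J01 (antibacterials for systemic use). The function names and the sample row are hypothetical.

```python
from datetime import datetime

DAYS_PER_YEAR = 365.25  # assumption: the slides do not say how days map to years

def encode_adrdate(date_text):
    """October-March -> 1, April-September -> 0 (dates written as d/m/yyyy)."""
    month = datetime.strptime(date_text, "%d/%m/%Y").month
    return 1 if month >= 10 or month <= 3 else 0

def encode_age(age_days):
    """Map age in days to the five age categories; None stays missing."""
    if age_days is None:
        return None
    years = age_days / DAYS_PER_YEAR
    # Assumption: upper bounds are exclusive (exactly 2 years falls in category 2).
    for category, upper in enumerate((2, 5, 11, 16), start=1):
        if years < upper:
            return category
    return 5

def encode_route(route):
    """Oral -> 1, Intravenous -> 0; any other or missing route is left as None."""
    return {"Oral": 1, "Intravenous": 0}.get(route)

def encode_antibiotic(atc_code):
    """Assumption: ATC codes beginning with J01 are treated as antibiotics."""
    if not atc_code:
        return None
    return 1 if atc_code.startswith("J01") else 0

# Hypothetical input row using the column names of the dataset.
row = {"ADRDATE": "9/06/2003", "Agedays": 14367, "ROUTE": "Oral", "ATC": "N03AF01"}
encoded = {
    "ADRDATE": encode_adrdate(row["ADRDATE"]),
    "AGE": encode_age(row["Agedays"]),
    "ROUTE": encode_route(row["ROUTE"]),
    "ANTIBIOTICS": encode_antibiotic(row["ATC"]),
}
print(encoded)
```

Keeping missing inputs as None instead of forcing them into a category matches the preparation step of leaving missing values in place.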

Data Mining Process • [Figure: panels labelled ADRDATE, AGE, RECOV, ATC and ROUTE]

Data Mining Process • Modelling phase • Logistic regression • Decision tree • Risk pattern algorithm • Evaluation phase • Deployment

Results • Results for the logistic regression technique
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -1.901353    0.466304   -4.077  4.55e-05 ***
ADRDATE      0.136312    0.285722    0.477     0.633
AGEDAYS      0.002067    0.115482    0.018     0.986
ROUTE        0.059532    0.290016    0.205     0.837
ANTIBIOTICS -0.181255    0.300150   -0.604     0.546
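As a reading aid for the coefficient table, the sketch below recomputes each z value as estimate divided by standard error, derives a two-sided p-value from the standard normal distribution, and exponentiates the estimate to express it on the odds scale. It is written in Python rather than R, the helper names are illustrative, and the only inputs are the estimates and standard errors reported above.

```python
import math

# Estimates and standard errors copied from the coefficient table.
coefficients = {
    "(Intercept)": (-1.901353, 0.466304),
    "ADRDATE": (0.136312, 0.285722),
    "AGEDAYS": (0.002067, 0.115482),
    "ROUTE": (0.059532, 0.290016),
    "ANTIBIOTICS": (-0.181255, 0.300150),
}

def wald_z(estimate, std_error):
    """Wald statistic: the estimate measured in standard-error units."""
    return estimate / std_error

def two_sided_p(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

for name, (estimate, std_error) in coefficients.items():
    z = wald_z(estimate, std_error)
    print(f"{name:12s} z={z:7.3f}  p={two_sided_p(z):.3g}  exp(estimate)={math.exp(estimate):.3f}")
```

Only the intercept has a small p-value. None of the four predictors is significant at conventional levels, so this model gives little evidence that any of them is associated with the outcome.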

Results • Decision Tree Result
1) root 1035 473 1 (0.4570048 0.5429952)
  2) AGE>=3.5 407 140 0 (0.6560197 0.3439803)
    4) ADRDATE< 0.5 203 61 0 (0.6995074 0.3004926) *
    5) ADRDATE>=0.5 204 79 0 (0.6127451 0.3872549)
      10) AGE>=4.5 100 35 0 (0.6500000 0.3500000)
        20) ROUTE>=0.5 79 27 0 (0.6582278 0.3417722) *
        21) ROUTE< 0.5 21 8 0 (0.6190476 0.3809524)
          42) RECOV=Yes 18 6 0 (0.6666667 0.3333333) *
          43) RECOV=NO 3 1 1 (0.3333333 0.6666667) *

Results • Decision Tree Result
      11) AGE< 4.5 104 44 0 (0.5769231 0.4230769)
        22) ROUTE< 0.5 77 30 0 (0.6103896 0.3896104) *
        23) ROUTE>=0.5 27 13 1 (0.4814815 0.5185185) *
  3) AGE< 3.5 628 206 1 (0.3280255 0.6719745)
    6) ROUTE< 0.5 236 109 1 (0.4618644 0.5381356)
      12) RECOV=NO 24 6 0 (0.7500000 0.2500000)
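The tree output can be read node by node. The Python sketch below assumes the usual rpart column order (node number, split, n, loss, predicted class, class probabilities, with * marking a leaf) and a two-class outcome coded 0/1. The parser and its field names are illustrative only. It relies on two properties of this layout: the children of node k are numbered 2k and 2k+1, and loss divided by n equals the probability of the class that was not predicted.

```python
import re

# One printed tree line: "node) split n loss yval (p0 p1) [*]"
LINE = re.compile(
    r"^\s*(\d+)\)\s+(.*?)\s+(\d+)\s+(\d+)\s+(\d+)\s+\(([\d.]+)\s+([\d.]+)\)\s*(\*?)\s*$"
)

def parse_node(text):
    """Split one line of tree output into named fields."""
    match = LINE.match(text)
    if match is None:
        raise ValueError(f"not a tree line: {text!r}")
    node, split, n, loss, yval, p0, p1, star = match.groups()
    node = int(node)
    return {
        "node": node,
        "parent": node // 2 if node > 1 else None,  # children of k are 2k and 2k+1
        "depth": node.bit_length() - 1,             # the root has depth 0
        "split": split,
        "n": int(n),
        "loss": int(loss),
        "yval": int(yval),
        "probs": (float(p0), float(p1)),
        "leaf": star == "*",
    }

root = parse_node("1) root 1035 473 1 (0.4570048 0.5429952)")
leaf = parse_node("4) ADRDATE< 0.5 203 61 0 (0.6995074 0.3004926) *")

for node in (root, leaf):
    share = node["loss"] / node["n"]  # share of records outside the predicted class
    print(node["node"], node["split"], node["yval"], round(share, 7))
```

For the root, 473 of 1035 records fall outside the predicted class 1, which reproduces the printed probability 0.4570048 for class 0.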

Results • Risk patterns for NO
Pattern | No. of conditions | (unlabelled) | Risk ratio | (unlabelled) | Conditions
1 | 3 | 3.0324 | 2.4852 | 269 7 | ADRDATE=1, AGEDAYS=3, ANTIBIOTICS=0
2 | 2 | 3.1002 | 2.5582 | 624616 | AGEDAYS=3, ANTIBIOTICS=0
3 | 3 | 2.5663 | 2.1904 | 2596 | ADRDATE=1, AGEDAYS=4, ROUTE=1
4 | 3 | 2.5375 | 2.1757 | 34268 | AGEDAYS=4, ROUTE=1, ANTIBIOTICS=0
• Pattern 1 where Risk Ratio = 2.48 • Agedays = between 5-11 years old • Adrdate = months between October-March • Antibiotics = No
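For readers unfamiliar with the measure, a risk ratio compares how often the outcome occurs among records that match a pattern with how often it occurs among the remaining records. The Python sketch below uses the standard epidemiological definition. Whether the risk pattern algorithm computes it in exactly this way is an assumption, and the counts are hypothetical because the cohort sizes cannot be recovered from the slide.

```python
def risk_ratio(match_events, match_non_events, rest_events, rest_non_events):
    """Risk of the outcome inside the pattern divided by the risk outside it."""
    risk_in_pattern = match_events / (match_events + match_non_events)
    risk_outside = rest_events / (rest_events + rest_non_events)
    return risk_in_pattern / risk_outside

def odds_ratio(match_events, match_non_events, rest_events, rest_non_events):
    """Odds of the outcome inside the pattern divided by the odds outside it."""
    return (match_events / match_non_events) / (rest_events / rest_non_events)

# Hypothetical counts: 40 records match the pattern, 20 of them with the outcome;
# 200 records do not match, 40 of them with the outcome.
rr = risk_ratio(20, 20, 40, 160)
oratio = odds_ratio(20, 20, 40, 160)
print(round(rr, 2), round(oratio, 2))
```

A risk ratio above 1 means the outcome is more frequent among records matching the pattern. On this reading, pattern 1 marks a group in which the outcome is roughly two and a half times as frequent as among the remaining records.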

Conclusion • Build a data mining process to answer the problem posed. • Use algorithms that work for medical applications. • Noise and missing values do pose a problem, but reasonable results can still be achieved. • More relevant patterns can be produced for medical experts if as much information as possible is included in the dataset.

References
• Brown, ML & Kros, JF 2003, 'Data mining and the impact of missing data', Industrial Management & Data Systems, vol. 103, pp. 611-621.
• Cios, K & Moore, GW 2002, 'Uniqueness of medical data mining', Artificial Intelligence in Medicine, vol. 26, no. 1-2, pp. 1-24.
• CRISP_DM 2000, Cross Industry Standard Process for Data Mining, viewed 27 August 2008, <http://www.crisp-dm.org/Partners/index.htm>.
• Li, J, Fu, AW-c, He, H, Chen, J, Jin, H, McAullay, D, Williams, G, Sparks, R & Kelman, C 2005, Mining risk patterns in medical data, ACM, Chicago, Illinois, USA.
• Lavrač, N 1999, 'Selected techniques for data mining in medicine', Artificial Intelligence in Medicine, vol. 16, no. 1, pp. 3-23.
• Lee, I-N, Liao, S-C & Embrechts, M 2000, 'Data mining techniques applied to medical information', Medical Informatics & the Internet in Medicine, vol. 25, no. 2, pp. 81-102.
• Obenshain, MK 2004, 'Application of Data Mining Techniques to Healthcare Data', Infection Control and Hospital Epidemiology, vol. 25, no. 8, pp. 690-695.
• Safety of Medicines 2002, A Guide to Detecting and Reporting Adverse Drug Reactions: Why Health Professionals Need to Take Action, WHO publications, viewed 15 April 2008, <http://whqlibdoc.who.int/hq/2002/WHO_EDM_QSM_2002.2.pdf>.
• Wang, H & Wang, S 2008, 'Medical knowledge acquisition through data mining', paper presented at the IEEE International Symposium on IT in Medicine and Education (ITME 2008), Xiamen.
• Zhu, X, Khoshgoftaar, T, Davidson, I & Zhang, S 2007, 'Editorial: Special issue on mining low-quality data', Knowledge and Information Systems, vol. 11, no. 2, pp. 131-136.
