Efficient Active Learning for Electronic Medical Record De-identification
Privacy and Bias in Data Science, Session S45
Muqun (Rachel) Li, Privacy Analytics, an IQVIA company
#IS19
Disclosure
• I disclose the following relevant relationship with commercial interests:
• I am an employee of Privacy Analytics, an IQVIA company, which builds and markets de-identification products.
2019 Informatics Summit | amia.org
Learning Objectives
• After participating in this session, the learner should be better able to:
• Understand the challenge posed by human annotation cost in de-identifying natural language data, and discuss several active learning algorithms that reduce this expenditure when de-identifying electronic medical records and clinical study reports.
What is De-identification?
Removal of Protected Health Information (PHI):
• Direct identifiers (DI): name, SSN, medical record number, …
• Quasi-identifiers (QI): date of birth, zip code, gender, ethnicity, …
De-identification is a particular challenge for natural language data.
Automated De-identification Tools
• Rule based: relies on local knowledge and hand-crafted rules, which are not always easy to gather
• Machine learning based: models are inferred automatically from annotated training data, which requires a certain amount of human annotation
• Hybrid: integrates the best of both, but needs both local knowledge and human annotation
Machine Learning Based De-identification
Framed as Named Entity Recognition:
• Conditional Random Fields (CRFs): capture dependencies between type labels; what we use in this work (via MIST)
• Recurrent Neural Networks (RNNs): need no hand-crafted features or rules, and can extract features automatically
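As a rough illustration (not the MIST implementation), a CRF-based de-identifier consumes hand-crafted per-token features of roughly this shape; the feature names and one-token context window here are hypothetical:

```python
# Sketch of hand-crafted token features for CRF-style NER de-identification.
# Feature names and window size are illustrative, not taken from MIST.
def token_features(tokens, i):
    """Build a feature dict for tokens[i]; neighboring tokens give the
    CRF the local context it uses to model label dependencies."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),   # capitalized words often start names
        "is_digit": word.isdigit(),   # pure digits suggest IDs or dates
        "suffix3": word[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Admitted", "by", "Dr.", "Smith", "on", "01/02/2014"]
feats = [token_features(tokens, i) for i in range(len(tokens))]
```

The `prev` feature lets the model learn, for example, that a capitalized token following "Dr." is very likely a NAME-type PHI token.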
Machine Learning De-identification Workflow
• Randomly sample the unannotated natural language data
• Human annotators label the sampled data, producing gold standard data
• Train PHI detection models on the gold standard data
• Apply the resulting detection tools to produce protected data
Scalability Challenge
• Appropriately trained systems perform well: the i2b2 2014 challenge saw a best F-measure of 0.964, later raised to 0.979 by a deep neural network
• De-identification models should not be used off-the-shelf: on the 2016 CEGS N-GRID shared task, the best F-measure was around 0.8
• The state of the art constantly needs sufficiently high-quality training data; poorly trained systems mean more human correction time and cost
• Can we engineer a system to learn faster?
Why Active Learning?
Hypothesis: if more informative data are actively requested, less training data is needed while performance is maintained or even improved.
The pool-based loop: learn a model from the labeled training set; select queries from the unlabeled pool U; an oracle (a human annotator) labels them; repeat.
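The pool-based loop above can be sketched generically; the `label_fn`, `train_fn`, and `score_fn` interfaces are hypothetical stand-ins for the annotator, the model trainer, and a selection heuristic:

```python
import random

def active_learning_loop(pool, label_fn, train_fn, score_fn,
                         batch_size=10, rounds=3, seed=0):
    """Generic pool-based active learning sketch (interfaces assumed):
    label_fn plays the oracle, train_fn fits a model on labeled pairs,
    score_fn(model, item) returns an informativeness score."""
    rng = random.Random(seed)
    pool = list(pool)
    # Seed the labeled set with a random initial batch (passive start).
    initial = [pool.pop(rng.randrange(len(pool))) for _ in range(batch_size)]
    labeled = [(x, label_fn(x)) for x in initial]
    model = train_fn(labeled)
    for _ in range(rounds):
        if not pool:
            break
        # Query: rank the unlabeled pool by informativeness, take the top batch.
        pool.sort(key=lambda x: score_fn(model, x), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        labeled += [(x, label_fn(x)) for x in batch]
        model = train_fn(labeled)  # retrain after every annotated batch
    return model, labeled
```

With smaller `batch_size`, the model is retrained more often between queries, which matches the slide's observation that small batches learn faster but cost more re-training time.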
Active Learning De-identification Workflow
• Randomly sample an initial batch of the unannotated natural language data for human annotation
• Train PHI detection models on the resulting gold standard data
• Query: selection-criteria heuristics choose the next batch of data for human annotation
• Retrain and repeat; the resulting detection tools produce protected data
Selection Criteria Heuristics
• Least Confidence with Upper Bound (LCUB)
• Entropy with Lower Bound (ELB)
• Return On Investment (ROI): trades off the net contribution of human correction per false positive and per false negative against the average reading cost per token, including the expected ROI of tokens labeled as non-PHI
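The uncertainty scores behind LCUB and ELB are standard; a minimal sketch follows. The exact role of the upper/lower bounds on the original slide was lost, so the `bound` filter here is an assumption, and ROI is omitted because it needs the cost estimates described above:

```python
import math

def least_confidence(probs):
    """Least-confidence score, 1 - max_y P(y | x); higher = less confident."""
    return 1.0 - max(probs)

def entropy(probs):
    """Shannon entropy of the label distribution; higher = more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(candidates, score, batch_size, bound=None):
    """Rank (item, label-probabilities) candidates by an uncertainty score.
    The optional bound (an assumption standing in for the LCUB/ELB bounds)
    discards items whose score falls below a useful threshold."""
    scored = [(score(probs), item) for item, probs in candidates]
    if bound is not None:
        scored = [(s, item) for s, item in scored if s >= bound]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [item for _, item in scored[:batch_size]]
```

For example, a token whose model distribution is (0.5, 0.5) outranks one at (0.9, 0.1) under either score, so it is queried for human annotation first.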
Real-World Clinical Trials Dataset
• 370 documents
• 312,991 tokens
• 12 PHI types
• 7,098 PHI instances
Preliminary Analysis
Design of Simulation Experiments
• Query strategies: LCUB, ELB, ROI, and Random
• Batch sizes: 10, 5, and 1
• Initial batch of 10 documents
• For each strategy, multiple cases were run (LCUB cases 1…n, ELB cases 1…m, ROI cases 1…k) and the best case of each was selected
AL Learning Rate
• The advantage of active learning becomes more apparent with smaller batch sizes
• Active learning surpasses passive learning
Reduction in Training Time
• Smaller batch sizes need fewer training documents than bigger batch sizes, but incur more re-training time
• Active learning needs less training data than passive learning
i2b2 2006 Dataset
• 889 discharge summaries
• Real identifiers replaced by synthetic information
Lessons Learned
• Active learning can reach comparable or higher performance with less training data than passive learning
• Smaller batch sizes mean faster learning, but can also result in more re-training time
• ROI is usually the most stable criterion, but does not always perform the best
Summary and Future Work
• Active learning for training-data selection in natural language de-identification generally yields more efficient learning than passive learning
• Future work: collect data on actual human correction costs and contributions in real-world problems
• An adaptive batch-sizing strategy might lead to better training
• Deep neural networks might be considered for the active learning system
References
• [1] U.S. Department of Health and Human Services, "Standards for privacy and individually identifiable health information. Final rule," vol. 67, no. 157, pp. 53181-53273, 2002.
• [2] W. W. Chapman, P. M. Nadkarni, L. Hirschman, L. W. D'Avolio, G. K. Savova and O. Uzuner, "Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions," Journal of the American Medical Informatics Association, vol. 18, no. 5, pp. 540-3, 2011.
• [3] S. Velupillai, H. Dalianis, M. Hassel and G. Nilsson, "Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial," International Journal of Medical Informatics, vol. 78, no. 12, pp. e19-26, 2009.
• [4] I. Neamatullah, M. L. Douglass, A. Reisner, M. Villarroel, W. Long, P. Szolovits, G. Moody, R. Mark and G. Clifford, "Automated de-identification of free-text medical records," BMC Medical Informatics and Decision Making, vol. 8, no. 1, p. 1, 2008.
• [5] S. M. Meystre, F. J. Friedlin, B. R. South, S. Shen and M. H. Samore, "Automatic de-identification of textual documents in the electronic health record: a review of recent research," BMC Medical Research Methodology, vol. 10, no. 1, p. 70, 2010.
• [6] O. Ferrández, B. R. South, S. Shen, F. J. Friedlin, M. H. Samore and S. M. Meystre, "BoB, a best-of-breed automated text de-identification system for VHA clinical documents," Journal of the American Medical Informatics Association, vol. 20, no. 1, pp. 77-83, 2013.
• [7] J. Aberdeen, S. Bayer, R. Yeniterzi, B. Wellner, C. Clark, D. Hanauer, B. Malin and L. Hirschman, "The MITRE Identification Scrubber Toolkit: design, training, and assessment," International Journal of Medical Informatics, vol. 79, no. 12, pp. 849-59, 2010.
• [8] B. Settles, "Biomedical named entity recognition using conditional random fields and rich feature sets," Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104-7, 2004.
• [9] F. Dernoncourt, J. Y. Lee, O. Uzuner and P. Szolovits, "De-identification of patient notes with recurrent neural networks," Journal of the American Medical Informatics Association, vol. 24, no. 3, pp. 596-606, 2017.
• [10] A. Stubbs, M. Filannino and O. Uzuner, "De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID Shared Tasks Track 1," Journal of Biomedical Informatics, vol. 75, pp. S4-S18, 2017.
• [11] A. Stubbs and O. Uzuner, "Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus," Journal of Biomedical Informatics, vol. 58, pp. S20-S29, 2015.
• [12] B. Settles, "Active learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1-14, 2012.
• [13] B. Settles and M. Craven, "An analysis of active learning strategies for sequence labeling tasks," Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1070-9, 2008.
Thank you!
• Questions? Email me at rli@privacy-analytics.com