Explore statistical approaches for analyzing Traditional Chinese Medicine patient records, focusing on disease profiles, symptom-group correlations, and herb effectiveness. Discover new medical knowledge through empirical hypotheses exploration and collaboration with TCM Data Center. Enhance healthcare with personalized treatments.
ChengXiang (“Cheng”) Zhai, Department of Computer Science; affiliated with the Carl R. Woese Institute for Genomic Biology, the Department of Statistics, and the School of Information Sciences, University of Illinois, Urbana-Champaign. http://czhai.cs.illinois.edu, czhai@illinois.edu. Statistical Approaches to Analysis of Traditional Chinese Medicine Patient Records. Distinguished Lecture in Causal Discovery, Univ. of Pittsburgh, May 18, 2017
Outline Overview of TIMAN Group Traditional Chinese Medicine (TCM) data analysis KnowEng Center and opportunities for collaboration
My Research Group: TIMAN (Text Information Management & Analysis) http://timan.cs.uiuc.edu • Current: 11 Ph.D. students, 5 MS students, 2 undergraduates, 1 visiting scholar • Alumni: 27 Ph.D. students, 40 MS students, 20 undergraduates • Academia + Industry Funding:
Research Roadmap of TIMAN [Diagram labels: Information Retrieval; Text Mining; Decision Support; Big Raw Data (biomedical literature, medical records, health forums, …); Small Relevant Data; users (biomedical researchers, physicians, patients, …)] • We emphasize: • development of general and robust computational methods • optimization of human-computer collaboration: computers help humans find patterns in data, and humans train computers to make them intelligent
Our Work in Medical & Health Informatics • Medical case retrieval: how can we find similar medical cases in medical literature, in online forums, …? • Medical knowledge discovery: how can we analyze EHRs together with other knowledge bases to discover medical knowledge (e.g., effectiveness of drugs, ADRs)? [Diagram labels: EHR (Patient Records); Similar Medical Cases; Medical Knowledge; Improved Health Care]
Today’s Talk: Analysis of Traditional Chinese Medicine Patient Records • Disease Profiling (Wang et al. IEEE BIBM 2016): probabilistic generative model • Patient Subcategorization (Huang et al. ACM BCB 2016): improve clustering by exploiting domain knowledge • Survival Analysis (Huang et al. 2017, submission under review): integration of EMRs and biomedical information network
Traditional Chinese medicine (TCM) • Tu Youyou (屠呦呦): 2015 Nobel Prize in Physiology or Medicine (jointly with William C. Campbell and Satoshi Ōmura) for discovering artemisinin (青蒿素) • More than 2,500 years of Chinese medical practice • Complementary and alternative medicine • Knowledge is not well documented • A significant challenge to inherit/use TCM knowledge • An interesting opportunity for data mining!
Vision of TCM Data Mining • TCM patient records = “experimental results” of herbs on patients • Potentially discover effective herbs for treating particular groups of patients • Effective chemical ingredients in effective herbs • Combination with western medicine + genomics + biomedical knowledge • Discover new medical knowledge and provide a scientific foundation for TCM • TCM philosophy: individualized experiments • Large space of empirical hypotheses explored
Collaboration with Beijing TCM Data Center • Data Sets: > 300,000 clinical cases • Collected EMRs since 2007 from six hospitals • Data stored in a clinical data warehouse • Key Collaborators: • Dr. Runshun Zhang, Dr. Jie Liu: Guang’anmen Hospital, China Academy of Chinese Medical Sciences • Prof. Xuezhong Zhou, Department of Computer Science, Beijing Jiaotong University • PhD students at UIUC: Sheng Wang, Edward Huang
Outline • Disease Profiling (Wang et al. IEEE BIBM 2016): probabilistic generative model. S. Wang, E. Huang, R. Zhang, X. Zhang, B. Liu, X. Zhou, C. Zhai, A Conditional Probabilistic Model for Joint Analysis of Symptoms, Diseases, and Herbs in Traditional Chinese Medicine Patient Records, IEEE BIBM 2016. • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)
Problem definition • Input: TCM patient records (with symptoms, diseases, and herbs) • Output: disease profile: • Typical symptom groups associated with disease • Typical herb groups associated with disease • Correlations between symptom groups and herb groups
Challenge: mixed vocabulary • Each record may contain multiple diseases • Example: a patient may have anemia, hypertension, and chronic gastritis at the same time • The symptoms in such a record come from different diseases
Our solution • A conditional probabilistic topic model • Explicitly model each disease as one topic • Each disease has its own distribution of symptoms and herbs • Difference from topic models (e.g., PLSA/LDA) • Diseases as priors on topics. • A topic with two coordinated subtopics (one for herbs and one for symptoms) • Difference from previous work on TCM analysis • Modeling asymmetric causal relations between diseases and {symptoms, herbs}
The Conditional Symptom-Herb Model • $\Pr(s, h \mid t) = \sum_{d \in D_t} \Pr(s \mid d)\,\Pr(h \mid d)\,\Pr(d \mid t)$ • $\Pr(s, h \mid t)$: observed patient record (symptom s and herb h for patient t) • $D_t$: all diseases of patient t • $\Pr(s \mid d)$: typical symptoms of disease d • $\Pr(h \mid d)$: typical herbs of disease d • $\Pr(d \mid t)$: likelihood of patient t having disease d • Optimization: find the Pr(s|d), Pr(h|d), and Pr(d|t) that maximize the probability of all observed {Pr(s,h|t)} (EM algorithm)
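The EM optimization named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the factorization Pr(s,h|t) = Σ_d Pr(s|d) Pr(h|d) Pr(d|t) with the sum restricted to the diseases listed in record t, treats every symptom-herb pair in a record as one observation, and uses hypothetical names (`fit_symptom_herb_model`, `herb_a`, `herb_b`) and toy data.

```python
import random
from collections import defaultdict

def normalize(weights):
    """Scale positive weights so they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}

def fit_symptom_herb_model(records, n_iter=50, seed=0, eps=1e-9):
    """records: list of (diseases, symptoms, herbs), each a list of strings."""
    rng = random.Random(seed)
    diseases = sorted({d for ds, _, _ in records for d in ds})
    symptoms = sorted({s for _, ss, _ in records for s in ss})
    herbs = sorted({h for _, _, hs in records for h in hs})

    # Random positive start for Pr(s|d) and Pr(h|d);
    # uniform Pr(d|t) over each record's own diseases.
    p_s = {d: normalize({s: rng.random() + 0.1 for s in symptoms}) for d in diseases}
    p_h = {d: normalize({h: rng.random() + 0.1 for h in herbs}) for d in diseases}
    p_d = [normalize({d: 1.0 for d in ds}) for ds, _, _ in records]

    for _ in range(n_iter):
        count_s = {d: defaultdict(float) for d in diseases}
        count_h = {d: defaultdict(float) for d in diseases}
        count_d = [defaultdict(float) for _ in records]
        for t, (ds, ss, hs) in enumerate(records):
            for s in ss:
                for h in hs:
                    # E-step: which of this record's diseases explains (s, h)?
                    posterior = normalize(
                        {d: p_s[d][s] * p_h[d][h] * p_d[t][d] for d in ds})
                    for d, z in posterior.items():
                        count_s[d][s] += z
                        count_h[d][h] += z
                        count_d[t][d] += z
        # M-step: re-estimate from expected counts (eps keeps values positive).
        p_s = {d: normalize({s: count_s[d][s] + eps for s in symptoms}) for d in diseases}
        p_h = {d: normalize({h: count_h[d][h] + eps for h in herbs}) for d in diseases}
        p_d = [normalize({d: count_d[t][d] + eps for d in ds})
               for t, (ds, _, _) in enumerate(records)]
    return p_s, p_h, p_d

# Toy data: single-disease records plus one record with mixed vocabulary.
gastritis = (["chronic gastritis"], ["abdominal pain"], ["herb_a"])
hypertension = (["hypertension"], ["dizziness"], ["herb_b"])
mixed = (["chronic gastritis", "hypertension"],
         ["abdominal pain", "dizziness"], ["herb_a", "herb_b"])
records = [gastritis] * 3 + [hypertension] * 3 + [mixed]
p_s, p_h, p_d = fit_symptom_herb_model(records)
print(p_s["chronic gastritis"], p_h["chronic gastritis"], p_d[-1])
```

Restricting the E-step to the diseases actually recorded for a patient is what lets the single-disease records anchor each disease's profile, so the symptoms and herbs of the mixed record can be attributed to the right disease.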
Evaluation: Dataset • 10,907 TCM patient records from digestive system treatment • 3,000 symptoms, 97 diseases, and 652 herbs • Most frequently occurring disease: chronic gastritis • Most frequently occurring symptoms: abdominal pain and chills • Ground truth: 27,285 manually curated herb-symptom relationships
Herb-symptom relationships [Table: top 10 herb-symptom relationships identified by our method but not by frequent pattern mining]
Summary of Disease Profiling • A new probabilistic model to analyze traditional Chinese medicine patient records • Discovers meaningful TCM knowledge and outperforms previous work • Can be used to develop a practically useful clinical decision-making system • Future Work • Build an application system (e.g., recommending herbs) • Analyze effectiveness of herbs
Outline • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016): improve clustering by exploiting domain knowledge. E. Huang, S. Wang, R. Zhang, B. Liu, X. Zhou, C. Zhai, PaReCat: Patient Record Subcategorization for Precision Traditional Chinese Medicine, ACM BCB 2016. • Survival Analysis (Huang et al. 2017, submission under review)
Approach • Use both herb and symptom information for clustering • Challenge: data sparseness • Solution: Leverage domain knowledge (TCM dictionary); using an embedding approach
Data Description • 2,276 medical records, manually annotated by doctors • 3 level 1 labels • 51 level 2 labels • 274 level 3 labels
Clustering • Agglomerative clustering • Cosine similarity as affinity measure • Clustered with level 2 labels as ground truth • Number of clusters = number of level 2 labels • Two feature types • Clustering using only symptoms • Clustering with both symptoms and herbs • Didn’t do clustering using only herbs (symptoms are assumed to be always available)
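The clustering setup on this slide can be sketched as below. This is only an illustration under stated assumptions: the slide does not name a linkage criterion, so average linkage is assumed; records are represented as sets of binary features; the function names and the toy symptoms and herbs (`herb_a`, `herb_b`) are hypothetical.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two binary feature sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def agglomerative(records, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per record."""
    clusters = [[i] for i in range(len(records))]

    def linkage(pair):
        # Mean pairwise cosine similarity between two clusters.
        a, b = pair
        sims = [cosine(records[i], records[j])
                for i in clusters[a] for j in clusters[b]]
        return sum(sims) / len(sims)

    while len(clusters) > n_clusters:
        # Merge the two most similar clusters (a < b, so deleting b is safe).
        a, b = max(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a].extend(clusters[b])
        del clusters[b]

    labels = [0] * len(records)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Feature type 1: symptoms only.
symptoms_only = [
    {"abdominal pain"},
    {"abdominal pain", "chills"},
    {"cough"},
    {"cough", "fever"},
]
# Feature type 2: symptoms and herbs together (herb names are made up).
herbs = [{"herb_a"}, {"herb_a"}, {"herb_b"}, {"herb_b"}]
symptoms_and_herbs = [s | h for s, h in zip(symptoms_only, herbs)]

labels_symptoms = agglomerative(symptoms_only, n_clusters=2)
labels_both = agglomerative(symptoms_and_herbs, n_clusters=2)
print(labels_symptoms, labels_both)
```

In the setting described on the slide, `n_clusters` would be set to the number of level 2 labels so that the result can be scored against them.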
Evaluation • Adjusted Rand Index (ARI) • Maximum score of 1.0 • Counts overlaps in contingency table • Adjusted for chance
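For reference, the Adjusted Rand Index can be computed directly from the contingency table of ground-truth labels against cluster labels. The sketch below follows the standard pair-counting definition; the function name is hypothetical, and it assumes at least two records.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI from the contingency table of two labelings (needs len >= 2)."""
    n = len(truth)
    # Pairs that fall in the same cell, same row, and same column.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2         # best achievable agreement
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)

same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_renaming, crossed)
```

A clustering that matches the labels up to renaming scores 1.0, while a clustering that agrees less than chance would predict scores below zero.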
Summary of Subcategorization • First study of subcategorization of TCM records • Clustering is improved by using a TCM dictionary • Experimental results show that using the TCM dictionary is beneficial • Future work: comparative analysis of similar subcategories and effectiveness of herbs
Outline • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review): integration of EMRs and biomedical information network. E. W. Huang, S. Wang, B. Li, R. Zhang, B. Liu, R. Zhang, J. Liu, X. Zhou, H. Lin, C. Zhai, HEMnet: Integration of Electronic Medical Records with Molecular Interaction Networks and Domain Knowledge for Survival Analysis, under review.
Challenge in Analysis of EMR: Missing Data • Many data mining methods assume availability of all features • Assumption usually doesn’t hold for EMR • Example: doctors do not perform all medical tests on all patients • Mean imputation • Most common method to fill in missing values • Introduces noise rather than reducing it
Challenge in Analysis of EMR: Semantic mismatch • Features that mean the same thing, but occupy different spaces in the vocabulary • Example: “hypertension” and “high blood pressure” • A similar problem in text mining can be solved with word2vec or other word embedding techniques, but such techniques (e.g., med2vec) are insufficient for EMR
Solution: External Knowledge • Idea: use external knowledge (expert curated, literature mined, etc.) to fill in missing context information • Molecular interaction networks, such as protein-protein interaction (PPI) networks (protein-protein edges) • Known drug targets (drug-protein edges) • Known drug-symptom relationships (drug-symptom edges) • But how can we incorporate all of this external information into EMRs?
HEMnet • Create a heterogeneous information network: HEterogeneous Electronic Medical network (HEMnet) • Use co-occurrence information to build network • Example: create edge between node d and node s if, for a patient, d is a drug that is prescribed to treat some symptom s. • With this co-occurrence network, we can add in the PPI network and other domain knowledge to create the HEMnet
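A minimal sketch of the construction described on this slide follows. It is an assumption-laden illustration, not the authors' pipeline: co-occurrence of a drug and a symptom in the same record is used as a stand-in for "prescribed to treat", the record layout and edge-type names are hypothetical, and the external edges are made-up placeholders for PPI and drug-target knowledge.

```python
from itertools import product

def build_hemnet(patient_records, external_edges):
    """Return a set of typed edges: (edge_type, node_u, node_v).

    Nodes are (node_type, name) pairs so that names cannot collide
    across node types.
    """
    edges = set()
    # Co-occurrence edges from the EMRs.
    for record in patient_records:
        for drug, symptom in product(record["drugs"], record["symptoms"]):
            edges.add(("drug-symptom", ("drug", drug), ("symptom", symptom)))
    # External knowledge: PPI edges, drug targets, and so on.
    for edge_type, u, v in external_edges:
        edges.add((edge_type, u, v))
    return edges

patient_records = [
    {"drugs": ["drug_a"], "symptoms": ["cough", "fever"]},
    {"drugs": ["drug_a", "drug_b"], "symptoms": ["cough"]},
]
external_edges = [
    ("drug-protein", ("drug", "drug_a"), ("protein", "P1")),
    ("protein-protein", ("protein", "P1"), ("protein", "P2")),
]
hemnet = build_hemnet(patient_records, external_edges)
nodes = {node for _, u, v in hemnet for node in (u, v)}
print(len(nodes), "nodes,", len(hemnet), "edges")
```

Keeping the edge type on every edge matters because the network is heterogeneous: later steps can treat a drug-protein edge differently from a protein-protein edge.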
Data Description • Lung cancer data set • 43 patients with squamous-cell lung carcinoma • 90 patients with non-squamous-cell lung carcinoma • 449 unique features (excluding proteins) • 133 patient records with high degree of sparsity • Each record missing an average of 407 out of 449 features • Patient record matrix: 133x449 matrix • Each record also contains the patient’s survival information • Number of days until death • If no death event, number of days until hospital discharge • HEMnet • 11,911 nodes (most of these are proteins) • 379,715 edges • 23 edge types
Using HEMnet to enrich EMRs • With the HEMnet, we now wish to fill in the missing “contexts” of the medical records • Use a network embedding technique, ProSNet, which has been shown to be useful in protein function prediction • word2vec: in a sentence, a word’s neighbors should predict the word’s context • ProSNet: in a graph, a node’s neighbors should predict the node’s context • Get a low-dimensional vector for each node • Get a pairwise cosine similarity matrix of nodes • Multiply the patient record matrix by this similarity matrix
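The last three steps on this slide (node vectors, pairwise cosine similarity, matrix product) can be illustrated as follows. The embedding vectors here are invented for the example rather than produced by ProSNet, and the sketch assumes the similarity matrix is restricted to the features that index the columns of the patient record matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def enrich(patient_matrix, embeddings):
    """Multiply the patient-by-feature matrix by the feature similarity matrix.

    embeddings[k] is the vector of the feature in column k.
    """
    sim = [[cosine(u, v) for v in embeddings] for u in embeddings]
    return [
        [sum(row[k] * sim[k][j] for k in range(len(row))) for j in range(len(sim))]
        for row in patient_matrix
    ]

features = ["hypertension", "high blood pressure", "cough"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # made-up 2-d node vectors
patient_matrix = [
    [1, 0, 0],  # record mentions only "hypertension"
    [0, 1, 0],  # record mentions only "high blood pressure"
]
enriched = enrich(patient_matrix, embeddings)
for row in enriched:
    print([round(x, 3) for x in row])
```

Before enrichment the two toy records share no nonzero column; afterwards each record carries weight on the other's synonym, which is how the similarity matrix addresses both missing values and semantic mismatch.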
ProSNet Network Embedding [Wang et al. 17] [Figure: heterogeneous network (input) → ProSNet dimensionality reduction → node/entity vector space with entities E1 to E5 on axes x, y, z (output)] • General: applicable to any heterogeneous network, e.g., a gene, drug, disease network • Efficient: 60K nodes and 10M edges take less than 30 minutes on a 12-core CPU
Path-based node similarity score [Equation image not recoverable; its annotations:] • Score of a path of type M that starts at node u and ends at node v • Weights on the different dimensions of Xu • Weights on the different dimensions of Xv • Global bias for path type M • Compact node representation
Fast online optimization • Challenge: • The heterogeneous network may have millions of nodes and edges • Methods based on random walk with restart can only handle networks with ~10K nodes • Solution: Online learning • Randomly sample a path in the network at each step • Only optimize the nodes on this path • Sample according to node degree • Inherently parallelized • Example: a network with 60K nodes and 10M edges takes less than 30 minutes on a 12-core CPU
Experiments • Compare performance of the HEMnet-enriched patient record matrix with the baseline patient record matrix • Clustering experiment • Hospitals are interested in whether a patient will survive • With patient survival functions as ground truth, we want to separate patients into two groups • Thus, the best clustering is one in which the two clusters have the most different survival functions • Survival functions can be estimated with Kaplan-Meier curves, then compared with the log-rank test • A low p-value means the two clusters have significantly different survival rates • So, the lower the p-value, the more successful the method was in separating relatively healthy patients from those with short survival times
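The evaluation described on this slide can be sketched with the textbook Kaplan-Meier estimator and two-group log-rank test. This is an illustration with made-up survival data, not the talk's experiment: each patient is a hypothetical `(days, died)` pair, where `died=False` means the patient was censored at hospital discharge, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math

def kaplan_meier(samples):
    """samples: list of (days, died). Returns [(time, survival probability)]."""
    curve, survival = [], 1.0
    for t in sorted({days for days, died in samples if died}):
        at_risk = sum(1 for days, _ in samples if days >= t)
        deaths = sum(1 for days, died in samples if days == t and died)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

def logrank_p_value(group_a, group_b):
    """Two-group log-rank test; returns the p-value."""
    pooled = group_a + group_b
    observed_minus_expected, variance = 0.0, 0.0
    for t in sorted({days for days, died in pooled if died}):
        n_a = sum(1 for days, _ in group_a if days >= t)   # at risk in group A
        n = sum(1 for days, _ in pooled if days >= t)      # at risk overall
        d_a = sum(1 for days, died in group_a if days == t and died)
        d_all = sum(1 for days, died in pooled if days == t and died)
        observed_minus_expected += d_a - d_all * n_a / n
        if n > 1:
            variance += d_all * (n_a / n) * (1 - n_a / n) * (n - d_all) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# Two hypothetical clusters: short survivors and long survivors.
cluster_short = [(10, True), (20, True), (30, True), (40, True), (50, True)]
cluster_long = [(100, True), (200, True), (300, False), (400, True), (500, False)]
p_value = logrank_p_value(cluster_short, cluster_long)
print(kaplan_meier(cluster_long))
print(p_value)
```

A clustering method is then judged by how small this p-value is when the test is applied to the two clusters it produces.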
Squamous-Cell Lung Carcinoma Results • With HEMnet: p-value = 0.001020 • Baseline: p-value = 0.01267
Non-Squamous-Cell Lung Carcinoma Results • With HEMnet: p-value = 0.003652 • Baseline: p-value = 0.02333
Summary of Survival Analysis • HEMnet-enriched patient records improve clustering performance over the baseline • With the baseline feature matrix, neither cancer subtype was significantly separated at the 0.01 level, whereas with the HEMnet-enriched feature matrix both were • Therefore, the external information (PPI network and domain knowledge) helps address the issues associated with missing data and semantic mismatch
Summary: Analysis of Traditional Chinese Medicine Patient Records • Disease Profiling (Wang et al. IEEE BIBM 2016): probabilistic generative model • Patient Subcategorization (Huang et al. ACM BCB 2016): improve clustering by exploiting domain knowledge • Survival Analysis (Huang et al. 2017, submission under review): integration of EMRs and biomedical information network