This paper by Guozhu Dong introduces Pattern-Aided Regression Modeling (PXR) and its application in prediction modeling. The contrast pattern aided regression algorithm CPXR is discussed, along with diverse predictor-response relationships. The efficacy of CPXR in traumatic brain injury (TBI) and heart failure (HF) outcome prediction is highlighted. The methodology's potential applications and results in problem-solving are explored, emphasizing the importance of capturing diverse relationships for accurate predictions.
Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State University Prediction is difficult, even when it is about the past. Please cite this paper on CPXR: Guozhu Dong & Vahid Taslimitehrani. Pattern-Aided Regression Modeling and Prediction Model Analysis. To appear in IEEE TKDE.
Overview • Introduction • Pattern aided regression modeling: PXR • new regression model type • Contrast pattern aided regression algorithm: CPXR • Diverse predictor-response relationships • CPXR(Log): Traumatic brain injury (TBI) and heart failure (HF) outcome prediction • Potential applications of the CPXR methodology • Other pattern aided problem solving results/apps • Most are contrast pattern aided results • Some use patterns only; some use something extra • Concluding remarks
Prediction is difficult • Prediction is difficult, especially if it is about the future • Niels Bohr, Nobel laureate in Physics • Danish Proverb • Those who have knowledge, don't predict. Those who predict, don't have knowledge. • Lao Tzu, 6th Century BC Chinese Philosopher • Prediction is difficult, even when it is about the past. • Guozhu Dong Guozhu Dong: Pattern Aided Regression Modeling
Preliminaries on prediction using regression LR: Linear regression • Training dataset: {(xi, yi) | 1 <= i <= n} • xi: vector of predictor variables • yi: value of response variable • Regression model evaluation
Teaser 1: Performance of contrast pattern aided regression (CPXR) RMSE reduction = [RMSE(LR)-RMSE(M)] / RMSE(LR) CPXR: highest accuracy in 41 out of 50 datasets Average RMSE reduction (relative to LR) of 42% in 50 datasets, much higher than that of best competing method CPXR achieved 60+% RMSE reduction in 10 out of the 50. CPXR is better than LR in all 50 datasets. LR: Linear regression GBM: Gradient Boosting (generalization of AdaBoost)
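The RMSE reduction measure above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy numbers are invented here and do not come from the slides or the reported experiments.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

def rmse_reduction(y_true, pred_lr, pred_m):
    """[RMSE(LR) - RMSE(M)] / RMSE(LR), as defined on the slide."""
    base = rmse(y_true, pred_lr)
    if base == 0:
        raise ValueError("RMSE(LR) is zero; the reduction is undefined")
    return (base - rmse(y_true, pred_m)) / base

# Toy numbers, invented for illustration only.
y = [1.0, 2.0, 3.0, 4.0]
pred_lr = [2.0, 3.0, 4.0, 5.0]  # off by 1.0 everywhere, so RMSE(LR) = 1.0
pred_m = [1.5, 2.5, 3.5, 4.5]   # off by 0.5 everywhere, so RMSE(M) = 0.5
reduction = rmse_reduction(y, pred_lr, pred_m)
print(f"RMSE reduction: {reduction:.0%}")
```

A positive value means model M is more accurate than the LR baseline; a value of 0.42 would correspond to the 42% average reduction quoted above.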
Teaser 2: AUC of ROC curves for CPXR(Log) vs other methods on HF and TBI classification results HF (with Mayo researchers) TBI
CPXR is good because PXR can effectively capture diverse predictor-response relationships to find good PXR models Definition: The data (for given application) contains diverse predictor-response relationships if it contains different subgroups whose best-fit local models are highly different [Dong+Taslimitehrani TKDE15] Diverse predictor-response relationships are the main reason why the best state-of-the-art regression methods often perform poorly
Preliminaries: Patterns & contrast patterns • Pattern: condition describing set of objects • EG: age <= 35 & rank = full professor • It describes all full professors with age <= 35. • A pattern describes a region of data in a low dimensional subspace of high dimensional data • Contrast patterns: conditions that distinguish objects in different classes/conditions • Contrast patterns are useful: • They are strongly associated with issues of importance
Contrast patterns thru example • CP: A1=b & A3=e • It matches all C1 objects. • It matches no C2 objects. • Its mds = {t1, t2} • Generally: A pattern is a CP if it matches many more objects in one class than in other classes (aka emerging patterns) • mds(P): the set of objects matching P. • An equivalence class: A set of patterns with same mds (having same behavior). Pick one as representative.
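As a rough sketch of these definitions, the Python below represents a pattern as a set of attribute=value conditions and computes its matching dataset mds(P) and its per-class support. The table on the original slide is not preserved in this transcript, so the four objects and their labels are invented, chosen only so that the pattern A1=b & A3=e behaves as described; `matches`, `mds` and `support` are illustrative names.

```python
def matches(pattern, obj):
    """True if the object satisfies every attribute=value condition."""
    return all(obj.get(attr) == val for attr, val in pattern.items())

def mds(pattern, data):
    """Matching dataset: ids of all objects that match the pattern."""
    return {tid for tid, obj in data.items() if matches(pattern, obj)}

def support(pattern, data, labels, cls):
    """Fraction of the objects of class cls that match the pattern."""
    members = [tid for tid in data if labels[tid] == cls]
    return sum(matches(pattern, data[tid]) for tid in members) / len(members)

# Invented toy table (the slide's own table is not available).
data = {
    "t1": {"A1": "b", "A2": "c", "A3": "e"},
    "t2": {"A1": "b", "A2": "d", "A3": "e"},
    "t3": {"A1": "a", "A2": "c", "A3": "e"},
    "t4": {"A1": "b", "A2": "d", "A3": "f"},
}
labels = {"t1": "C1", "t2": "C1", "t3": "C2", "t4": "C2"}

cp = {"A1": "b", "A3": "e"}
print(sorted(mds(cp, data)))            # ids of the matching objects
print(support(cp, data, labels, "C1"))  # support in class C1
print(support(cp, data, labels, "C2"))  # support in class C2
```

In practice a pattern counts as a contrast pattern when its support in one class is much higher than in the others; the threshold for "much higher" is a tuning choice not given on the slide.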
Why are contrast pattern based approaches successful, and why do they have big potential? • Contrasting is meaningful • Contrasting risk indicator advantages in survival/wellbeing • Contrasting is built into human (animal) instincts • Focusing on contrast patterns focuses on important issues • High (3-7) dim contrast patterns capture important novel multi-variable interactions related to goals • Opportunity: Humans often use low dimensional CPs (?brain’s computing power is low & lack of data?) • They use independence assumption for high dimension apps • WE WANT TO DO BETTER! WE CAN DO BETTER!!!
Pattern aided regression model • A PXR model is represented by a tuple PM = ((P1, f1, w1), ..., (Pk, fk, wk), fd) • Each Pi is a pattern • Each fi is a local regression model for data satisfying Pi • fd is a default local regression model • Each wi is the weight for Pi • The regression function of PM is defined by [equation shown on slide]
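The equation on this slide is not preserved in the transcript. The sketch below assumes the form given in the published CPXR paper as I understand it: for an instance x, take the weighted average of the local models whose patterns match x, and fall back to fd when no pattern matches. Treat that form as an assumption; the patterns, local models and weights in the example are invented.

```python
def pxr_predict(x, triples, f_default):
    """Predict with a PXR model given as (pattern, local_model, weight) triples.

    Assumed form: weighted average of f_i(x) over the patterns P_i that
    x matches; f_default(x) when x matches no pattern. Weights are
    assumed to be positive.
    """
    matched = [(f, w) for pattern, f, w in triples if pattern(x)]
    if not matched:
        return f_default(x)
    total_weight = sum(w for _, w in matched)
    return sum(w * f(x) for f, w in matched) / total_weight

# Invented two-pattern model over predictor variables u, v, z.
model = [
    (lambda x: x["u"] <= 5, lambda x: 2 * x["u"] + 1, 1.0),  # (P1, f1, w1)
    (lambda x: x["v"] > 3, lambda x: 10 - x["z"], 3.0),      # (P2, f2, w2)
]

def f_default(x):
    return 0.5 * x["u"]

both = pxr_predict({"u": 4, "v": 5, "z": 2}, model, f_default)     # P1 and P2 match
neither = pxr_predict({"u": 8, "v": 1, "z": 0}, model, f_default)  # falls back to fd
only_p1 = pxr_predict({"u": 4, "v": 1, "z": 0}, model, f_default)  # only P1 matches
print(both, neither, only_p1)
```

Because matching datasets of different patterns can overlap, an instance may be covered by several local models at once, which is why some rule for combining them is needed.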
A pictorial illustration of a simple PXR model Different patterns can involve different sets of variables [describing data regions in different subspaces] Matching datasets of different patterns can overlap
An example PXR model u, v, z: predictor variables, y: response variable ((P1, f1, w1), (P2, f2, w2), fd) gives a PXR model
Discussion • PXR is a strict generalization of Piecewise Linear Regression (PLR) • PLR can be viewed as trying to model diverse PR relationships, but it is limited in modeling capabilities and computing algorithms [and they didn’t see DPR] • Often a PXR model uses few patterns (e.g. 7) • Local regression models • model type can be complex or simple • we often use simple ones such as linear or piecewise linear models
Diversity of predictor-response relationships • Different pattern-model pairs emphasize different sets of variables • Different pattern-model pairs use highly different regression functions • Each pattern-model pair captures a highly distinct kind of behavior • Diverse predictor-response relationships may be neutralized at the global level
DPR in TBI (traumatic brain injury)
How CPXR builds PXR models (flow diagram): Training Data D for regression -> Baseline Regression Model f0 -> D = LE U SE (LE: Large Error, SE: Small Error) -> Mine CPs of LE & SE -> Representative CPs -> Local Regression Models for CPs -> PXR model
How CPXR builds PXR models: (D, f0) -> (LE, SE) -> Ps and fs -> PXR • Start with the training dataset D • Build a baseline regression model f0 (or use a given f0) • Split data into LE (large error) and SE (small error), based on f0’s prediction error • Mine CPs; remove some CPs • Build corresponding local regression models for remaining CPs, and select patterns to construct PXR • Control overfitting: don’t use P if |mds(P)| <= #vars • Many technical details, including variable binning, splitting data, search objectives, baseline model type, local model type …
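The first steps of this pipeline can be sketched as follows. This is a simplified illustration, not the CPXR algorithm itself: the baseline is a one-variable least-squares line, and the LE/SE split rule (largest-error instances that together account for a fraction `rho` of the total absolute error) is an assumption, since the slide does not state the rule. Contrast pattern mining and pattern selection are omitted. The only rule taken directly from the slide is the overfitting guard, which drops P when |mds(P)| <= #vars.

```python
def fit_line(xs, ys):
    """Baseline f0: ordinary least squares for y = a + b*x (one predictor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def split_le_se(xs, ys, f0, rho=0.5):
    """Split instance indices into LE (large error) and SE (small error).

    Assumed rule: LE holds the largest-error instances whose errors
    together reach a fraction rho of the total absolute error.
    """
    errors = [abs(f0(x) - y) for x, y in zip(xs, ys)]
    order = sorted(range(len(xs)), key=lambda i: errors[i], reverse=True)
    total = sum(errors)
    le, accumulated = [], 0.0
    for i in order:
        if accumulated >= rho * total:
            break
        le.append(i)
        accumulated += errors[i]
    se = [i for i in order if i not in le]
    return sorted(le), sorted(se)

def usable(pattern_mds, num_vars):
    """Overfitting guard from the slide: skip P if |mds(P)| <= #vars."""
    return len(pattern_mds) > num_vars

# Invented toy data: the last point breaks the linear trend.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 2, 3, 4, 20]
f0 = fit_line(xs, ys)
le, se = split_le_se(xs, ys, f0)
print("LE:", le, "SE:", se)
```

Contrast patterns would then be mined to distinguish LE from SE, and a local regression model fitted on mds(P) for each surviving pattern P.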
Summary of empirical results • CPXR is highly accurate for building regression models • Outperforms other regression methods, often by big margins • On accuracy • On overfitting • On sensitivity to noise • Experiments show: Diverse predictor-response relationships occur often in real life, for data with >=3 dimensions • We used 50 real datasets, and 20+ synthetic ones
Previous regression methods Interpretability is low Linear regression (LR): uses a linear function Piecewise linear regression (PLR): splits one variable into intervals, uses a different linear function for each interval Support vector regression (SVR): SVM like, but minimizing prediction error Bayesian additive regression trees (BART): ensemble of (hundreds of) decision trees Neural networks, Gradient Boosting …
Experiments used 50 real datasets used in previous regression studies 6 example datasets
CPXR achieved large RMSE reduction (accuracy improvement) consistently RMSE reduction = [RMSE(LR)-RMSE(M)] / RMSE(LR) GBM: Gradient Boosting (generalization of AdaBoost) We also tried other competitors but they are not competitive CPXR is not better than other methods on random data CPXR: highest accuracy in 41 out of 50 datasets (4 competitors) Average RMSE reduction (relative to LR) of 42% in 50 datasets, much higher than that of best competing method CPXR achieved 60+% RMSE reduction in 10 out of the 50. CPXR is better than LR in all 50 datasets.
Box plot of RMSE reduction CPXR(LP)’s median > Q3 of PLR, LR, BART CPXR(LP)’s Q1 > median of PLR, LR, BART CPXR(LL) is a variant of CPXR
Evaluation on overfitting • CPXR is more accurate than PLR, LR and BART on testing data; its model complexity is fairly low. • CPXR has smaller relative accuracy drop (from training to test)
Evaluation on sensitivity to noise Build PXR on clean training data Compare accuracies on • training data • noise-added test data
CPXR’s outperformance vs degree of diversity of PR-Relationships PIP: Positive impact pattern Diff = ratio of largest coefficient local model making big improvement of pairs of local models High: CPXR has large RMSE reduction; Low: CPXR has low RMSE reduction
Diverse Predictor-Response Relationships May Neutralize Each Other at High Level • Example: We considered a set S1 of 4 variables, S2=S1 + 2 variables, on soil water content data • For LR, S2 does not give improvement over S1 • Ditto for PLR, SVR, BART • For CPXR, PXR model on S2 gives 20% RMSE improvement over PXR model on S1 • The new variables are involved in most of the diverse PR relationships (the patterns in the PXR model) • These relationships somehow cancelled each other’s effect at the whole data set level, so they were missed by LR etc.
Other apps of CPXR, besides building accurate prediction models • Analysis on a given prediction model • On what kinds of data it makes large prediction errors • How to correct those prediction errors • Do important models in science and medicine have systematic mistakes? • Analysis on comparing two given prediction models, w.r.t. their differences • Discovering policy errors, niche opportunities, … • Discovering true importance of variables (medicine) • Discovering intricate multi-variable interactions • ……
CPXR for logistic regression modeling and results on outcome prediction for HF/TBI The PXR-CPXR approach is not limited to linear regression. We adapted it for logistic regression to get CPXR(Log) We used CPXR(Log) for outcome prediction for traumatic brain injury patients & heart failure patients. CPXR(Log) is much more accurate than standard logistic regression and SVM CPXR(Log) also identifies important variables that are considered unimportant by standard logistic regression
Results on TBI
AUC of ROC curves for CPXR(Log) and other methods (on HF and TBI) HF TBI
Details on CPXR(Log) Model for TBI
Work on Heart Failure Patient Risk Prediction • Vahid Taslimitehrani, Guozhu Dong, Naveen L. Pereira, Maryam Panahiazar, Jyotishman Pathak: Developing an EHR-driven Heart Failure Risk Prediction Model using CPXR(Log). Submitted to journal • Mayo’s EHR Data: • Patient’s demographic data -- age, gender, race and ethnicity. • Lab results -- cholesterol, sodium, hemoglobin and lymphocytes, and EF. • Medications -- Angiotensin Converting Enzyme (ACE) inhibitors, Angiotensin Receptor Blockers (ARBs), β-adrenoceptor antagonists (β-blockers), Statins, and Calcium Channel Blocker (CCB). • 26 major chronic conditions (co-morbidities). • Many variables; many are important/needed
Finished CPXR-based projects/papers up to March 2015 Vahid Taslimitehrani, Guozhu Dong, Naveen L. Pereira, Maryam Panahiazar, Jyotishman Pathak: Developing an EHR-driven Heart Failure Risk Prediction Model using CPXR(Log). Submitted to journal Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov A. Pachepsky. Measurement Scale Effect on Prediction of Soil Water Retention Curve and Saturated Hydraulic Conductivity. Submitted. Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. In Proceedings of IEEE International Conference on BioInformatics and BioEngineering (BIBE) 2014 Building accurate loan default risk models for a private company. 2014-2015. Looking for collaborators
Summary of strengths of PXR/CPXR More accurate than state-of-the-art methods Philosophy: Using patterns to identify data groups where given model makes large prediction errors that can be corrected systematically Using different pattern-model pairs to model diverse predictor-response variable relationships The approach is better suited to high dimensional data than other methods Offering insights to mistakes in business strategies …
I have focused on PXR/CPXR. I will now discuss other pattern aided problem solving methods and applications • Other methods: • Contrast pattern aided clustering • Contrast pattern based classification and improvement of traditional methods • Contrast pattern aided gene ranking for complex diseases • Contrast pattern aided outlier detection • Applications: Diagnosis of diseases, study of complex diseases, blog analysis, compound selection for drug design, crime environ analysis, apartment rental price prediction, activity recognition, …
We published a book on CDM in 2012; 3 out of 6 parts on applications Preliminaries and Measures on Contrasts Contrast Mining Algorithms Mining Generalized Contrasts Contrast Mining for Classification & Clustering Contrast Mining for Bioinformatics & Chemoinformatics Contrast Mining for Special Application Domains 44 contributing authors, from ~dozen countries; not comprehensive Methods used by many scientists.
My recent results in this area • CAEP-style classification: discriminative power aggregation of emerging patterns • Outlier detection / intrusion detection: almost model free; using discriminative pattern length • Clustering quality evaluation using patterns (quality, abundance, diversity): no distance function • CP based clustering and cluster description: no distance func needed • Interaction based gene/SNP ranking for complex diseases • Contrast pattern aided regression: Effectively handling diverse predictor-response relationships
Key Challenges for Pattern Aided Problem Solving Using CPs more efficient • For each problem to solve, our general approach is to use a selected pattern set to help reach our goals. • Q: • (1) What kinds of pattern sets? • (2) How to use the patterns in the set? • (3) How to efficiently search for desired pattern sets? • We need effective techniques • There are millions of (contrast) patterns • The search space is huge
CPCQ Clustering Quality Index • CPCQ Rationale: A high-quality clustering, capturing natural concepts in data, should have many diversified high-quality contrast patterns (CPs) contrasting its clusters. • A CP characterizes its home cluster and discriminates its home cluster against other clusters. • Home cluster of a CP: the cluster where it has highest frequency among all clusters • Think of a cluster as a class.
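The home-cluster definition above can be illustrated with a small Python sketch. It assumes that "frequency" means the fraction of a cluster's objects matching the pattern; the function name and the toy blog-like clusters are invented, and the CPCQ index itself is not computed here.

```python
def home_cluster(pattern, clusters):
    """Return the name of the cluster where the pattern is most frequent.

    pattern: dict of attribute=value conditions.
    clusters: dict mapping cluster name to a non-empty list of objects.
    Frequency is assumed to be the fraction of matching objects.
    """
    def frequency(objects):
        hits = sum(
            all(obj.get(attr) == val for attr, val in pattern.items())
            for obj in objects
        )
        return hits / len(objects)

    return max(clusters, key=lambda name: frequency(clusters[name]))

# Invented toy clustering.
clusters = {
    "K1": [
        {"topic": "sports", "length": "short"},
        {"topic": "sports", "length": "long"},
    ],
    "K2": [
        {"topic": "politics", "length": "short"},
        {"topic": "sports", "length": "short"},
    ],
}
home = home_cluster({"topic": "sports"}, clusters)
print(home)
```

Treating each cluster as a class, a pattern with high frequency in its home cluster and low frequency elsewhere is a contrast pattern for that cluster.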
CPC Algorithm: Contrast Pattern Based Clustering, aimed to maximize CPCQ 1. Select Seed CPs 2. Assign Patterns to CP Group G1 of Clusters, Using MPQ 3. Assign Patterns as CPs of Clusters, Using Tuple Overlap 4. Assign Tuples to Clusters, Using Tuple Overlap [Diagram: items-by-tuples matrix showing clusters C1 and C2, seed patterns S1 and S2, and pattern sets PS(C1) and PS(C2)]
Clustering data using CPC into groups, each having succinct informative group descriptions • EG: Given a collection of texts/blogs (collected at ASU). • We cluster the blogs into four groups • each group is associated with a small set of patterns • the patterns clearly indicate what the groups are about.
CAEP: Semi-supervised chemical compound screening with few training samples A special feature is ECP (an adaptation of CAEP)’s ability to accurately classify molecules on the basis of very small training sets containing only a few (e.g. 3 per class) compounds. This feature is highly relevant for virtual compound screening when very few experimental hits are available as templates. Reference: Jens Auer et al. Simulation of sequential screening experiments using emerging chemical patterns. Medicinal Chemistry, 4(1):80–90, 2008. [from its abstract]
IBIG vs IG & FC for Colon Cancer Traditional gene ranking methods: Fold change & entropy: rank genes by considering impact of one gene at a time. Are they suitable for complex diseases? Genes lowly ranked by FC could be highly ranked by IBIG
Quote from our Contrast Data Mining book … the most important contribution of contrast mining will come when we no longer need … simplifying approaches to handle … challenge of high dimensional data, when we have developed the methodology to systematically analyze, and accurately use, sets of multi-feature contrast patterns …. … contrast mining has made useful progress …. Success … will have a large impact on … handling of intrinsically complex processes, such as complex diseases whose behaviors are influenced by the interaction of multiple … factors.
Potentials (1): Develop New Pattern Aided Methods • Selecting set of patterns can help solve existing challenging problem in much better way • Identify such problems, work on them … • Using sets of patterns can lead to systematic ways of handling multiple multi-variable interactions • Contrast mining, pattern aided problem solving, & pattern aided data analytics, have potential to help effectively handle challenges of high dimensional data
Potentials (2): Use Current Pattern Aided Methods to Solve Problems • Our developed methods can help perform regression more accurately, characterize & correct errors of given models, for vital applications in science, medicine, & economics • Improving scientific models • Changing what we believe in? • Our developed clustering, classification, outlier detection, gene ranking, multiple multi-variable interaction mining/selection methods can help solve challenging problems and offer new insights
www.cs.wright.edu/~gdong Questions Next step – wish to find collaborators To work on high impact prediction modeling problems