Medical Data Mining Carlos Ordonez University of Houston Department of Computer Science
Outline • Motivation • Main data mining techniques: • Constrained Association Rules • OLAP Exploration and Analysis • Other classical techniques: • Linear Regression • PCA • Naïve Bayes • K-Means • Bayesian Classifier
Motivation: why inside a DBMS? • DBMSs offer a level of security unavailable with flat files. • Databases have built-in features that optimize extraction and simple analysis of datasets. • We can increase the complexity of these analysis methods while still keeping the benefits offered by the DBMS. • We can analyze large amounts of data in an efficient manner.
Our approach • Avoid exporting data outside the DBMS • Exploit SQL and UDFs • Accelerate computations with query optimization and pushing processing into main memory
Constrained Association Rules • Association rules – technique for identifying patterns in datasets using confidence • Looks for relationships between the variables • Detects groups of items that frequently occur together in a given dataset • Rules are in the format X => Y • The set of items X are often found in conjunction with the set of items Y
The Constraints • Group Constraint • Determines which variables can occur together in the final rules • Item Constraint • Determines which variables will be used in the study • Allows the user to ignore some variables • Antecedent / Consequent Constraint • Determines the side of the rule that a variable can appear on
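As a concrete illustration of how support, confidence, and the antecedent/consequent constraint interact, the following Python sketch uses made-up item names (smoke, diab, LAD) and a toy transaction list — not the actual medical data:

```python
# Illustrative sketch of support/confidence and an antecedent/consequent
# constraint. Item names (smoke, diab, LAD) and transactions are made up.
transactions = [
    {"smoke", "diab", "LAD"},
    {"smoke", "LAD"},
    {"diab", "LAD"},
    {"smoke", "diab"},
    {"smoke", "diab", "LAD"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(Y|X): support of X union Y divided by support of X."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Antecedent/consequent constraint: disease items may only appear on
# the right-hand side of a rule.
consequent_only = {"LAD"}

def satisfies_constraints(antecedent, consequent):
    return not (antecedent & consequent_only)

# Rule {smoke, diab} => {LAD}: support 2/5, confidence (2/5)/(3/5) = 2/3
conf = confidence({"smoke", "diab"}, {"LAD"}, transactions)
```

The item and group constraints would be enforced the same way: as predicates that prune candidate itemsets before rules are generated.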
Experiment • Input dataset: p=25, n=655 • Three types of Attributes: • P: perfusion measurements • R: risk factor • D: heart disease measurements
Experiments • This table summarizes the impact of constraints on number of patterns and running time.
Experiments • This Figure shows rules predicting no heart disease in groups.
Experiments • This figure shows groups of rules predicting heart disease.
Experiments • These figures show some selected cover rules, predicting absence or existence of disease.
OLAP Exploration and Analysis • Definition: • Input table F with n records • Cube dimensions: D={D1,D2,…,Dd} • Measure dimensions: A={A1,A2,…,Ae} • In OLAP processing, the basic idea is to compute aggregations on a measure Ai by subsets of dimensions G, G ⊆ D.
OLAP Exploration and Analysis • Example: • Cube with three dimensions (D1,D2,D3) • Each face represents a subcube on two dimensions • Each cell represents a subcube on one dimension
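The cube lattice described above can be sketched in a few lines of Python; the fact table, dimension names, and measure values below are illustrative, not the actual medical data:

```python
from itertools import combinations

# Hypothetical fact table F: cube dimensions D1..D3 and one measure A1.
F = [
    {"D1": "M", "D2": "old",   "D3": "smoker",    "A1": 70.0},
    {"D1": "F", "D2": "old",   "D3": "nonsmoker", "A1": 40.0},
    {"D1": "M", "D2": "young", "D3": "smoker",    "A1": 60.0},
    {"D1": "F", "D2": "young", "D3": "nonsmoker", "A1": 30.0},
]
dims = ["D1", "D2", "D3"]

def aggregate(F, G, measure="A1"):
    """Average the measure grouped by the dimension subset G (one cuboid)."""
    groups = {}
    for row in F:
        key = tuple(row[dim] for dim in G)
        groups.setdefault(key, []).append(row[measure])
    return {k: sum(v) / len(v) for k, v in groups.items()}

# Enumerate all 2^d cuboids of the lattice, from the apex G=() down to G=D
lattice = {G: aggregate(F, G) for r in range(len(dims) + 1)
           for G in combinations(dims, r)}
```

In a DBMS each cuboid corresponds to a GROUP BY over the dimension subset G; the dictionary comprehension above simply enumerates all such subsets.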
OLAP Statistical Tests • We proposed the use of statistical tests on pairs of OLAP subcubes to analyze their relationship • Statistical tests allow us to show mathematically that a pair of subcubes are significantly different from each other
OLAP Statistical Tests • The null hypothesis H0 states μ1=μ2, and the goal is to find groups where H0 can be rejected with high confidence 1-p. • The so-called alternative hypothesis H1 states μ1≠μ2. • We use a two-tailed test, which allows finding a significant difference on both tails of the Gaussian distribution in order to compare means in either order (μ1>μ2 or μ2>μ1). • The test relies on the following equation to compute a random variable z.
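The equation itself does not survive the slide conversion. The standard unpooled two-sample z statistic, which is consistent with the description above (subcube means μ1, μ2, variances σ1², σ2², and sizes n1, n2), is:

```latex
z = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 / n_1 \; + \; \sigma_2^2 / n_2}}
```

H0 is rejected at confidence 1-p when |z| exceeds the two-tailed critical value for p.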
Experiments • n = 655 • d = 21 • e = 4 • Includes patient information, habits, and perfusion measurements as dimensions • Measures are the stenosis, or amount of narrowing, of the four main arteries of the human heart
Experiment Evaluation • Heart data set: Group pairs with significant measure differences at p=0.01
Experiment Evaluation • Summary of medical results at p=0.01 • The most important dimensions are OLDYN, SEX, and SMOKE.
Comparing Reliability of OLAP Statistical Tests and Association Rules • Both techniques were adapted to a common footing for comparison • Association Rules: added post-process pairing • OLAP Statistical Tests: added constraints • Cases under study • Association Rules (HH) – both rules have high confidence • AdmissionAfterOpen(1), AorticDiagnosis(0/1)=>NetMargin(0/1) • High confidence, but also high p-value • Data is crowded around the AR boundary point
Comparing Reliability of OLAP Statistical Tests and Association Rules • Association Rules: High/High • We can see that the data is crowded around the boundary point for Association Rules • The two Gaussians are not significantly different • Conclusion: both techniques agree; OLAP Statistical Tests are more reliable
Comparing Reliability of OLAP Statistical Tests and Association Rules • Association Rules: Low/Low • Once again the boundary point comes into play • The two Gaussians are not significantly different • Conclusion: both techniques agree
Comparing Reliability of OLAP Statistical Tests and Association Rules • Association Rules: High/Low • Ambiguous
Results from TMHS dataset • Mainly a financial dataset • Revolves around the opening of a new medical center for treating heart patients • Results from Association Rules • Found 4051 rules with confidence>=0.7 and support>=5% • AfterOpen=1, Elder=1 => Low Charges • After the center opened, the elderly enjoyed low charges • AfterOpen=0, Elder=1 => High Charges • Before the center opened, the elderly were associated with high charges • Results from OLAP Statistical Tests • Found 1761 pairs with p-value<0.01 and support>=5% • Walk-in, insurance (commercial/medicare) => charges (high/low) • The total charges to a patient depend on the patient's insurance when the admission source is a walk-in • AorticDiagnosis=0, AdmissionSource (Walk-in / Transfer) => lengthOfStay (low / high) • If the diagnosis is not aortic disease, then the length of stay depends on how the patient was admitted.
Machine Learning techniques • PCA • Regression: Linear and Logistic • Naïve Bayes • Bayesian classification
Principal Component Analysis • Dimensionality reduction technique for high-dimensional data (e.g. microarray data). • Exploratory data analysis, by finding hidden relationships between attributes. Assumptions: • Linearity of the data. • Statistical importance of mean and covariance. • Large variances have important dynamics.
Principal Component Analysis • Rotation of the input space to eliminate redundancy. • Most variance is preserved. • Minimal correlation between attributes. • UᵀX is the new rotated space. • Select the k most representative components of U (k<d). • Solving PCA is equivalent to solving the SVD, defined by the eigen-problem: X=UEVᵀ, XXᵀ=UE²Uᵀ • U: left eigenvectors • E: the singular values (E² are the eigenvalues of XXᵀ) • V: the right eigenvectors
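A minimal sketch of PCA via SVD, following the slide's convention that X is d×n (attributes by records), so that XXᵀ = UE²Uᵀ; the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
# Toy data: d attributes (rows) by n records (columns), centered per
# attribute so the PCA eigen-problem applies to X X^T.
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)

# SVD: X = U E V^T, hence X X^T = U E^2 U^T
U, E, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                  # keep the k most representative components (k < d)
Uk = U[:, :k]
Z = Uk.T @ X           # rotated, reduced data: k x n

# Sanity check: X X^T reconstructed from the eigen-decomposition
err = np.linalg.norm(X @ X.T - U @ np.diag(E**2) @ U.T)
```

The rows of Z are uncorrelated by construction, which is the "minimal correlation between attributes" property above.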
Linear Regression • There are two main applications for linear regression: • Prediction or forecasting of the output or variable of interest Y • Fit a model from the observed Y and the input variables X. • For values of X given without an accompanying value of Y, the model can be used to predict the output of interest Y. • Given input data X={x1,x2,…,xn} with d dimensions Xa, and the response or variable of interest Y, linear regression finds a set of coefficients β to model: Y = β0+β1X1+…+βdXd+ɛ.
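A minimal least-squares sketch of the model above, with synthetic data and made-up coefficients (the intercept β0 is handled by a column of ones):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0, 0.5])          # illustrative coefficients
Y = 1.0 + X @ beta_true + 0.01 * rng.normal(size=n)   # beta0 = 1, small noise

# Fit beta = (beta0, beta1, ..., betad) by least squares on [1 | X]
A = np.hstack([np.ones((n, 1)), X])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)

# Prediction: a new x given without its accompanying Y
x_new = np.array([1.0, 0.0, -1.0])
y_hat = beta[0] + x_new @ beta[1:]
```

This is ordinary least squares, the fitting step that the Bayesian variable selection on the next slide builds upon.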
Linear Regression with SSVS • Bayesian variable selection • Quantify the strength of the relationship between Y and a number of explanatory variables Xa. • Assess which Xa may have no relevant relationship with Y. • Identify which subsets of the Xa contain redundant information about Y. • The goal is to find the subset of explanatory variables Xγ which best predicts the output Y, with the regression model Y = βγXγ+ɛ. • We use Gibbs sampling, an MCMC algorithm, to estimate the probability distribution π(γ|Y,X) of a model fitting the output variable Y. • Other techniques, like stepwise variable selection, perform only a partial search for the model that best explains the output variable. • Stochastic Search Variable Selection finds the most likely subsets of variables based on their posterior probabilities.
Linear Regression in the DBMS • Bayesian variable selection is implemented entirely inside the DBMS with SQL and UDFs for efficient use of memory and processor resources. • Our algorithms and storage layouts for tables in the DBMS have a significant impact on execution performance. • Compared to the statistical package R, our implementations scale to large data sets.
Linear regression: Experimental results Cancer microarray data, where γ denotes the selected gene numbers.
Logistic Regression Similar to linear regression, but the data is fitted to a logistic curve. This technique is used to predict the probability of occurrence of an event. P(Y=1|x) = π(x) π(x) = 1/(1+e^(−g(x))), where g(x) = β0+β1X1+β2X2+…+βdXd
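The logistic curve above can be evaluated directly; the coefficients β below are made up for illustration (fitting them would require maximum likelihood, not shown):

```python
import math

# Hypothetical coefficients for g(x) = beta0 + beta1*x1 + beta2*x2
beta = [-1.0, 0.8, 0.3]

def g(x):
    """Linear predictor g(x) = beta0 + sum_i beta_i * x_i."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

def pi(x):
    """Logistic curve: P(Y=1|x) = 1 / (1 + e^(-g(x)))."""
    return 1.0 / (1.0 + math.exp(-g(x)))

p = pi([2.0, 1.0])   # g = -1.0 + 1.6 + 0.3 = 0.9, so p is above one half
```

Whatever g(x) is, π(x) always lands in (0, 1), which is why the output can be read as a probability.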
Logistic Regression: Experimental results Model: • med655 • Train • n = 491 • d = 15 • y = LAD>=70% • Test • n = 164
Naïve Bayes (NB) • Naïve Bayes is one of the most popular classifiers: • Easy to understand. • Produces a simple model structure. • Robust, with a solid mathematical background. • Can be computed incrementally. • Classification is achieved in linear time. • However, it assumes the attributes are independent.
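A minimal Gaussian Naïve Bayes sketch illustrating the points above — one-pass incremental sufficient statistics (count, sum, sum of squares per dimension) and the per-dimension independence assumption. The tiny training set is invented:

```python
import math
from collections import defaultdict

# Invented training set: (features, class label)
train = [
    ([5.0, 1.0], 0), ([4.5, 0.8], 0), ([5.2, 1.1], 0),
    ([7.0, 2.0], 1), ([6.8, 2.2], 1), ([7.3, 1.9], 1),
]

# Per class, accumulate n, sum, and sum of squares per dimension:
# this is what makes NB incremental and computable in one table scan.
stats = defaultdict(lambda: {"n": 0, "s": None, "q": None})
for x, c in train:
    st = stats[c]
    if st["s"] is None:
        st["s"] = [0.0] * len(x)
        st["q"] = [0.0] * len(x)
    st["n"] += 1
    for i, v in enumerate(x):
        st["s"][i] += v
        st["q"][i] += v * v

def predict(x):
    """Pick the class maximizing log prior + sum of per-dim Gaussian log pdfs."""
    best, best_score = None, -math.inf
    total = sum(st["n"] for st in stats.values())
    for c, st in stats.items():
        score = math.log(st["n"] / total)          # class prior
        for i, v in enumerate(x):
            mu = st["s"][i] / st["n"]
            var = max(st["q"][i] / st["n"] - mu * mu, 1e-9)
            # independence assumption: per-dimension densities multiply,
            # so their logs add
            score += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        if score > best_score:
            best, best_score = c, score
    return best
```

Because only n, sum, and sum of squares are kept, the sufficient statistics map directly onto SQL aggregates, which is how NB can be computed inside a DBMS.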
Bayesian Classifier • Why Bayesian: • A Bayesian classifier based on class decomposition using EM clustering. • Robust models with good accuracy and low over-fit. • The classifier adapts to skewed distributions and overlapping sets of data points by building local models based on clusters. • The EM algorithm is used to fit the mixtures per class. • The Bayesian classifier is composed of a mixture of k distributions or clusters per class.
Bayesian Classifier Based on K-Means (BKM) • Motivation • Bayesian classifiers are accurate and efficient. • A generalization of the Naïve Bayes algorithm. • Model accuracy can be tuned by varying the number of clusters, setting class priors, and making a probability-based decision. • EM is a distance-based clustering algorithm. • Two phases are involved: • Building the predictive model. • Scoring a new data set based on the computed predictive model.
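The two phases can be sketched roughly as follows. This is an illustration, not the authors' BKM algorithm: k-means (here in 1-D) decomposes each class into clusters for the build phase, and a crude prior-weighted nearest-cluster score stands in for the probability-based decision in the scoring phase. All data is synthetic:

```python
import math
import random

random.seed(0)

# Synthetic 1-D data per class; class 0 deliberately has two modes,
# which is the kind of shape a single Gaussian (plain NB) handles poorly.
data = {0: [1.0, 1.2, 0.9, 5.0, 5.2, 4.8],
        1: [9.0, 9.1, 8.9, 9.2, 8.8, 9.05]}

def kmeans_1d(xs, k, iters=20):
    """Plain 1-D k-means: assign to nearest center, recompute means."""
    centers = random.sample(xs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[j].append(x)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

# Phase 1: build the predictive model (k clusters per class, plus priors)
k = 2
model = {g: kmeans_1d(xs, k) for g, xs in data.items()}
total = sum(len(xs) for xs in data.values())
priors = {g: len(xs) / total for g, xs in data.items()}

# Phase 2: score a new point — nearest cluster per class, weighted by prior
def score(x):
    def class_score(g):
        dist = min((x - c) ** 2 for c in model[g])
        return math.log(priors[g]) - dist   # crude surrogate for log-likelihood
    return max(model, key=class_score)
```

The decomposition into per-class clusters is what lets the classifier tune accuracy by varying k, as the slide notes.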
Example • A medical dataset with n = 655 rows is used, with a varying number of clusters k. • The dataset has d = 25 dimensions, which include the diseases to be predicted, risk factors, and perfusion measurements. • Null values in a dimension have been replaced with the mean of that dimension. • Here we predict accuracy for LAD and RCA (2 diseases). • Accuracy is good for k up to 8.
Example: medical med655 • n = 655 • d = 15 • g = 0,1 • G represents whether the patient developed heart disease or not. wbcancer • n = 569 • d = 7 • g = 0,1 • G represents whether the cancer is benign or malignant. • Features describe the characteristics of cell nuclei obtained from an image of a breast mass.
BKM & NB Models BKM: med655 BKM: wbcancer NB: wbcancer NB: med655
Cluster Means and Weights • Means are initialized around the global mean based on Gaussian initialization. • The table below shows the cluster means for 9 of the d dimensions. • The weight of a cluster is initialized to 1.0/k, where k is the number of clusters.
The DBMS Group • Students: • Zhibo Chen • Carlos Garcia-Alvarado • Mario Navas • Sasi Kumar Pitchaimalai • Ahmad Qwasmeh • Rengan Xu • Manish Limaye
Publications • Ordonez, C., Chen, Z., Evaluating Statistical Tests on OLAP Cubes to Compare Degree of Disease, IEEE Transactions on Information Technology in Biomedicine 13(5): 756-765 (2009) • Chen, Z., Ordonez, C., Zhao, K., Comparing Reliability of Association Rules and OLAP Statistical Tests, ICDM Workshops 2008: 8-17 • Ordonez, C., Zhao, K., A Comparison between Association Rules and Decision Trees to Predict Multiple Target Attributes, Intelligent Data Analysis (IDA), to appear in 2011 • Navas, M., Ordonez, C., Baladandayuthapani, V., On the Computation of Stochastic Search Variable Selection in Linear Regression with UDFs, IEEE ICDM Conference, 2010 • Navas, M., Ordonez, C., Baladandayuthapani, V., Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs, Proc. ACM KDD Workshop on Large-scale Data Mining: Theory and Applications (LDMTA), 2010 • Ordonez, C., Pitchaimalai, S.K., Bayesian Classifiers Programmed in SQL, IEEE Transactions on Knowledge and Data Engineering (TKDE) 22(1): 139-144 (2010) • Pitchaimalai, S.K., Ordonez, C., Garcia-Alvarado, C., Comparing SQL and MapReduce to Compute Naive Bayes in a Single Table Scan, Proc. ACM CIKM Workshop on Cloud Data Management (CloudDB), 2010 • Navas, M., Ordonez, C., Efficient Computation of PCA with SVD in SQL, KDD Workshop on Data Mining using Matrices and Tensors, 2009