Biological Data Mining Cheminformatics 1 Patrick Hoffman
Tonight's Topics • Review Lab (Excel, Weka, R-project, Clementine) SarToxDem4.zip • Review • Regression? • SarPredict Classify? • Flattening – exploding • Best SarPredict Classifier (Naïve Bayes, others in Weka) • Naïve Bayes – explained • R-code for probability density: pnorm, dnorm • Association Rules • Predictive Tox – SarToxDem • Data – ISIS Keys, MolConnZ descriptors • PCA, MDS, Sammon plots • Other clustering techniques – Comparison??
Lab – Understand & find best classifier • Download SarTox-Dem2.zip • Unzip (SarTox-Dem2.csv) • load into Excel (modify?) • load into Weka (visualize) • load into Clementine (output to table) • Load into R-project • filename <- "c:/MLCourse/SarTox-Dem4.csv" • csv <- read.csv(filename) • attach(csv) • Histograms of Act-5/BAct-5 (Excel, R, Clementine - overlays)
Example - SAR Data (SarPredict.csv) • Structural Activity Relationship • 960 chemicals (records) • 26 data fields (variables) • 11 Biological Activity measures • 11 Chemical descriptors • 4 Quality Control variables
Regression vs Classification • Regression was hard !!! • On Active vs Inactive – Two class problem • Easier problem • Problems • R-Groups are text strings (Flatten or explode) • Unbalanced classes • Naïve Bayes
Naïve Bayes with 11 chemical descriptors
Correctly Classified Instances       923       96.1458 %
Incorrectly Classified Instances      37        3.8542 %
Kappa statistic                        0.0494
K&B Relative Info Score           -68022.5337 %
K&B Information Score               -166.1076 bits   -0.173 bits/instance
Class complexity | order 0           230.8551 bits    0.2405 bits/instance
Class complexity | scheme            203.4613 bits    0.2119 bits/instance
Complexity improvement (Sf)           27.3938 bits    0.0285 bits/instance
Mean absolute error                    0.0677
Root mean squared error                0.1858
Relative absolute error               87.8712 %
Root relative squared error           95.2927 %
Total Number of Instances            960
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
1        0.974    0.961      1       0.98       Inactive
0.026    0        1          0.026   0.051      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 922    0 | a = Inactive
  37    1 | b = Active
Flattening? – Exploding? 4 categorical columns exploded so that each text value gets its own 0/1 indicator column:
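One way to do the exploding in R (a sketch; the data frame name sar and the column name R3.fmla are assumptions, the actual names in SarPredict.csv may differ):
# model.matrix builds a 0/1 indicator column for each level of a factor
sar$R3.fmla <- factor(sar$R3.fmla)
flags <- model.matrix(~ R3.fmla - 1, data = sar)   # "- 1" drops the intercept: one column per level
sar.exploded <- cbind(sar, flags)                  # append the indicator columns to the data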
Naïve Bayes with 36 chemical descriptors
Correctly Classified Instances       912       95      %
Incorrectly Classified Instances      48        5      %
Kappa statistic                        0.2475
K&B Relative Info Score           -72819.8932 %
K&B Information Score               -177.8225 bits   -0.1852 bits/instance
Class complexity | order 0           230.8551 bits    0.2405 bits/instance
Class complexity | scheme            249.0367 bits    0.2594 bits/instance
Complexity improvement (Sf)          -18.1817 bits   -0.0189 bits/instance
Mean absolute error                    0.067
Root mean squared error                0.1959
Relative absolute error               87.051  %
Root relative squared error          100.4683 %
Total Number of Instances            960
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.979    0.763    0.969      0.979   0.974      Inactive
0.237    0.021    0.321      0.237   0.273      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 903   19 | a = Inactive
  29    9 | b = Active
Overall accuracy is worse, but the Active class is better.
Naïve Bayes – 36 descriptors, no normalization
Correctly Classified Instances       878       91.4583 %
Incorrectly Classified Instances      82        8.5417 %
Kappa statistic                        0.3111
K&B Relative Info Score          -135125.7751 %
K&B Information Score               -329.9703 bits   -0.3437 bits/instance
Class complexity | order 0           230.8551 bits    0.2405 bits/instance
Class complexity | scheme            382.5589 bits    0.3985 bits/instance
Complexity improvement (Sf)         -151.7039 bits   -0.158  bits/instance
Mean absolute error                    0.0968
Root mean squared error                0.2564
Relative absolute error              125.6851 %
Root relative squared error          131.4682 %
Total Number of Instances            960
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.928    0.421    0.982      0.928   0.954      Inactive
0.579    0.072    0.25       0.579   0.349      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 856   66 | a = Inactive
  16   22 | b = Active
Voting Feature Intervals
Voting feature intervals classifier
Time taken to build model: 0.05 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances       728       75.8333 %
Incorrectly Classified Instances     232       24.1667 %
Kappa statistic                        0.1168
K&B Relative Info Score          -909872.6115 %
K&B Information Score              -2221.8629 bits   -2.3144 bits/instance
Class complexity | order 0           230.8551 bits    0.2405 bits/instance
Class complexity | scheme            759.1737 bits    0.7908 bits/instance
Complexity improvement (Sf)         -528.3187 bits   -0.5503 bits/instance
Mean absolute error                    0.357
Root mean squared error                0.4215
Relative absolute error              463.5091 %
Root relative squared error          216.1394 %
Total Number of Instances            960
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.762    0.342    0.982      0.762   0.858      Inactive
0.658    0.238    0.102      0.658   0.177      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 703  219 | a = Inactive
  13   25 | b = Active
PART Classifier?
Time taken to build model: 1.48 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances       912       95      %
Incorrectly Classified Instances      48        5      %
Kappa statistic                        0.2475
K&B Relative Info Score           -30559.9665 %
K&B Information Score                -74.6259 bits   -0.0777 bits/instance
Class complexity | order 0           230.8551 bits    0.2405 bits/instance
Class complexity | scheme          18419.7155 bits   19.1872 bits/instance
Complexity improvement (Sf)       -18188.8604 bits  -18.9467 bits/instance
Mean absolute error                    0.0641
Root mean squared error                0.214
Relative absolute error               83.283  %
Root relative squared error          109.7552 %
Total Number of Instances            960
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.979    0.763    0.969      0.979   0.974      Inactive
0.237    0.021    0.321      0.237   0.273      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 903   19 | a = Inactive
  29    9 | b = Active
Duplicate Active from 38 to 304
Time taken to build model: 0.06 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      1029       83.9315 %
Incorrectly Classified Instances     197       16.0685 %
Kappa statistic                        0.5996
K&B Relative Info Score            60187.846  %
K&B Information Score                486.9928 bits    0.3972 bits/instance
Class complexity | order 0           990.6589 bits    0.808  bits/instance
Class complexity | scheme           1151.1811 bits    0.939  bits/instance
Complexity improvement (Sf)         -160.5222 bits   -0.1309 bits/instance
Mean absolute error                    0.1714
Root mean squared error                0.3647
Relative absolute error               45.9283 %
Root relative squared error           84.4536 %
Total Number of Instances           1226
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.856    0.211    0.925      0.856   0.889      Inactive
0.789    0.144    0.643      0.789   0.709      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 789  133 | a = Inactive
  64  240 | b = Active
Naïve Bayes Classifier
First, what is a Bayes classifier? Bayes' theorem:
P(Ck|x) = p(x|Ck) P(Ck) / p(x)
Ck = class, x = attribute vector
P(Ck|x) = posterior probability, P(Ck) = prior probability, p(x|Ck) = class-conditional density, p(x) = unconditional density
Bayes Classifier
Simply choose the class with the largest posterior probability given the feature vector x: assign x to Ck when P(Ck|x) > P(Cj|x) for every other class Cj, which is the same as p(x|Ck)P(Ck) > p(x|Cj)P(Cj) since p(x) cancels.
Problem: what are p(x|Ck) and p(x|Cj)?
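A toy illustration of the decision rule in R with a single Gaussian attribute, using dnorm for p(x|Ck) (a sketch: the class means and standard deviations are made-up numbers; the priors come from the 922/38 class counts above):
x <- 4.2                                   # the observed attribute value
# class-conditional densities p(x|Ck), assuming one Gaussian per class
p.x.given.active   <- dnorm(x, mean = 5.0, sd = 1.0)
p.x.given.inactive <- dnorm(x, mean = 2.0, sd = 1.5)
# priors P(Ck) estimated from class frequencies (38 Active, 922 Inactive)
p.active   <- 38 / 960
p.inactive <- 922 / 960
# choose the class with the larger p(x|Ck) * P(Ck); p(x) cancels out
if (p.x.given.active * p.active > p.x.given.inactive * p.inactive) "Active" else "Inactive"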
If one knew the real density function there would be no problem. Normally, p(x|Ck) is a multivariate joint probability density function. Options for estimating it: • 1. If there is enough data, build histograms • 2. Guess the distribution (Gaussian?) • 3. Calculate the mean and std. dev. of each attribute • 4. Use a parametric method: the mean and std. dev. become the parameters used to calculate p(x|Ck)
The Standard Normal or Gaussian Density function of a single variable
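For reference, with mean μ and standard deviation σ the density is p(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²)); in R this is exactly what dnorm computes:
mu <- 0; sigma <- 1; x <- 1.5
(1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))   # the formula by hand
dnorm(x, mean = mu, sd = sigma)                                   # same value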
Problems • A full multivariate Gaussian in d dimensions gives d(d+3)/2 parameters to estimate for the joint density function (d means plus d(d+1)/2 covariance entries) • Time consuming, difficult, and it might not be the correct density function • Many of the dimensions or attributes might be independent?
Why not build a d-dimensional histogram for each Class ? This would approximate the joint density function.
Curse of dimensionality!!! • Say 10 bins (or values) for each dimension (attribute) • d=2: 100 bins • d=3: 1,000 bins • d=4: 10,000 bins, etc. • All that multiplied by the number of classes • Usually not enough data or time • Not enough data to fill the bins
Naïve or Simple Bayes is the answer. • Assume all dimensions or attributes are independent! • Simple probability product rule: P(X|Ck) --> P(A1|Ck) * P(A2|Ck) * ... * P(Ad|Ck) for d attributes • One can estimate each P(Ai|Ck) as a Gaussian, or build a histogram for each attribute in a training set • 10 dimensions with 10 bins each becomes 10 × 10 = 100 bins, not 10^10
Discretization (binning) is better • Building the histograms is better if you have enough data • MLC++ has both Naïve Bayes (assumes Gaussian) and Discrete Naïve Bayes • Several binning techniques (see Kohavi) • The entropy-based method is very good (see http://www.cs.uml.edu/~fjara/mineset/id3/id3_example/id3example.html)
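MLC++'s entropy-based binning is not reproduced here, but a simple equal-frequency binning in base R looks like this (a sketch with 10 bins, assuming any numeric attribute vector x):
x <- iris$Sepal.Length                                      # any numeric attribute
cuts <- quantile(x, probs = seq(0, 1, length.out = 11))     # 11 break points = 10 equal-frequency bins
binned <- cut(x, breaks = unique(cuts), include.lowest = TRUE)
table(binned)                                               # instance count per bin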
Simple NB in R – Which class is most likely to have all dimensions between 0 and 5? (pnorm calculates the cumulative probability)
data(iris)        # get data
attach(iris)      # make names available
hi = 5            # set hi limit for possible dimensions
lo = 0            # set lo limit
# a vector of the data for each class
a1 = Sepal.Length[1:50]
a2 = Sepal.Length[51:100]
a3 = Sepal.Length[101:150]
b1 = Sepal.Width[1:50]
b2 = Sepal.Width[51:100]
b3 = Sepal.Width[101:150]
c1 = Petal.Width[1:50]
c2 = Petal.Width[51:100]
c3 = Petal.Width[101:150]
d1 = Petal.Length[1:50]
d2 = Petal.Length[51:100]
d3 = Petal.Length[101:150]
# probability of each dimension of each class being in the range [lo, hi]
p1setosa     = pnorm(hi, mean(a1), sd(a1)) - pnorm(lo, mean(a1), sd(a1))
p1versicolor = pnorm(hi, mean(a2), sd(a2)) - pnorm(lo, mean(a2), sd(a2))
p1virginica  = pnorm(hi, mean(a3), sd(a3)) - pnorm(lo, mean(a3), sd(a3))
p2setosa     = pnorm(hi, mean(b1), sd(b1)) - pnorm(lo, mean(b1), sd(b1))
p2versicolor = pnorm(hi, mean(b2), sd(b2)) - pnorm(lo, mean(b2), sd(b2))
p2virginica  = pnorm(hi, mean(b3), sd(b3)) - pnorm(lo, mean(b3), sd(b3))
p3setosa     = pnorm(hi, mean(c1), sd(c1)) - pnorm(lo, mean(c1), sd(c1))
p3versicolor = pnorm(hi, mean(c2), sd(c2)) - pnorm(lo, mean(c2), sd(c2))
p3virginica  = pnorm(hi, mean(c3), sd(c3)) - pnorm(lo, mean(c3), sd(c3))
p4setosa     = pnorm(hi, mean(d1), sd(d1)) - pnorm(lo, mean(d1), sd(d1))
p4versicolor = pnorm(hi, mean(d2), sd(d2)) - pnorm(lo, mean(d2), sd(d2))
p4virginica  = pnorm(hi, mean(d3), sd(d3)) - pnorm(lo, mean(d3), sd(d3))
# naive independence assumption: multiply the per-dimension probabilities
psetosa     = p1setosa * p2setosa * p3setosa * p4setosa
pversicolor = p1versicolor * p2versicolor * p3versicolor * p4versicolor
pvirginica  = p1virginica * p2virginica * p3virginica * p4virginica
psetosa
pversicolor
pvirginica
Better NB in R – Which class is most likely to have all dimensions between 0 and 5? (pnorm calculates the cumulative probability)
## A better way using loops (hi and lo are set on the previous slide)
csv = iris
N = ncol(csv)            # number of columns
N = N - 1                # don't do the class column
R = nrow(csv)
stats = matrix(0, N, 3)  # store the probabilities for each class and each dimension
probs = matrix(1, 3, 1)  # final probabilities for each class
# loop for 3 classes
for (lp2 in 1:3) {
  # get mean and sd for each class and each dimension
  # loop for each dimension
  for (lp1 in 1:N) {
    clix1 = (lp2 - 1) * 50 + 1
    clix2 = clix1 + 49
    d1 = csv[clix1:clix2, lp1]   # where each class's data is
    m = mean(d1)
    s = sd(d1)
    stats[lp1, lp2] = pnorm(hi, m, s) - pnorm(lo, m, s)
    probs[lp2] = probs[lp2] * stats[lp1, lp2]
  }
}
stats
probs
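For comparison with the hand-rolled version above, a packaged Naïve Bayes (one Gaussian per attribute per class) is available in the e1071 package; a sketch, assuming the package is installed:
library(e1071)
data(iris)
model <- naiveBayes(Species ~ ., data = iris)    # fits a Gaussian per attribute per class
predict(model, iris[1:5, -5])                    # predicted classes for the first 5 rows
predict(model, iris[1:5, -5], type = "raw")      # class posterior probabilities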
This SAR Example • Regression failed • Classification failed • Any other machine learning tricks?
Association rules • Look for possible rules that have high confidence and support • There can be very many; a good method will let you specify what you are looking for • Each rule covers only a small piece of the dimensional space • Binning or discretization is usually necessary; smart or entropy binning is best • Example: S5 > 6.405 and 'R3 fmla' = "CN-" and 'R4 fmla' = "C4H9-"
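The course uses Clementine's GRI (next slide); in R the arules package offers apriori as an alternative (a sketch, assuming the package is installed and a data frame sar of SAR attributes; numerics are binned first):
library(arules)
# apriori wants factor/transaction data, so bin numeric columns first (e.g. with cut)
sar.binned <- as.data.frame(lapply(sar, function(col)
  if (is.numeric(col)) cut(col, breaks = 4) else as.factor(col)))
trans <- as(sar.binned, "transactions")
rules <- apriori(trans, parameter = list(support = 0.01, confidence = 0.8))
inspect(head(sort(rules, by = "confidence"), 10))   # top 10 rules by confidence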
Association Rules – so far • Clementine's GRI is the best (it does its own binning) • The rules below are only for the selection index Active • Support = percentage of instances in the dataset where the antecedents are true • Confidence = percentage of those supporting instances where the consequent is also true
Association Rules - Be Wary • In the dataset there are only 8 instances behind the rules with confidence = 100% • Some of the rules can be redundant • Looking at the output one might think there are 15 such instances • Generally one wants "good" rules where both support and confidence are high.
Example – Predictive Toxicology • Project Objective • Understand the relationship between chemical structure and liver isozyme inhibition. • Data Overview • 100,000 chemicals (records) • 280 data fields (variables) • 1 biological assay • 4 liver isozyme assays • 275 chemical descriptors • 166 Substructure Search Keys – ISIS/Host • 109 Electro-topological State Indicators – MolConnZ
Smaller version - SarTox-Dem4.csv • 82 keys • 76 descriptors • 1550 instances (records, rows) • 1 id column • 5 activity measurements • 5 binned activity measurements • Analyze/Predict the last columns: Act-5, BAct-5 • 1280 non-toxic and 269 toxic
Analysis Stages • Metadata Overview & Data Cleansing • Isozyme activity binning • Classifying & Clustering • Association Rules • TTEST/Feature Reduction • Visualizations
Metadata Overview & Cleansing • 10 ISIS keys and 5 MolConn Z descriptors had zero values. • In our analyses, these fields were eliminated from the dataset, thereby reducing the number of descriptors and keys to 260. • Many records contained missing values: • Biological Assay: ~49,000 • Isozyme 1: ~50,000 Isozyme 3: ~55,000 • Isozyme 2: ~50,000 Isozyme 4: ~50,000 • About 24,000 records have all values of the biological activity and four liver isozymes
Pearson Cross Correlations: 260 descriptors × 260 descriptors correlation matrix plot, colored from – correlation through no correlation to + correlation.
ISIS Keys with PatchGrid™: 150 chemical classes × 166 ISIS keys (key on / key off), showing the ISIS key composition of each chemical class.
ISIS Keys with PatchGrid™: the same 150 classes × 166 keys, with the ISIS keys clustered to show chemical classes with similar keys.
Data Binning: Bio Activity (low activity to high activity) and Isozymes 1–4 (low inhibition to high inhibition), each binned for the analyses that follow.
Association Rules - Visually: Key 1, Key 2, and Descriptor A (binned low / medium / high); example rule: Isozyme 1 Inhibition = high if key1 > 0.5 & key2 > 0.5 & Descriptor A > 3.6.
Sub-Selection Overview: biological activity of the sub-selection, high inhibition of the sub-selection across all classes, and the class with a high % inhibition.
Narrowing Down • Identify important dimensions in the sub-selection • Apply important dimensions from Association Rules • Select a single chemical class with high % inhibition • Use a TTest or F-test to reduce keys and descriptors (see the sketch below)
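A sketch of the TTest step in R, ranking keys/descriptors by how well they separate the two classes (the data frame dat and the two-level class column BAct.5 are assumptions):
y <- dat$BAct.5                                              # two-level class factor
numeric.cols <- sapply(dat, is.numeric)
pvals <- sapply(dat[, numeric.cols], function(col)
  tryCatch(t.test(col ~ y)$p.value, error = function(e) NA)) # constant columns give NA
top20 <- names(sort(pvals))[1:20]                            # the 20 most separating attributes
top20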
Classification Via Cost Matrix (off-diagonal confusion-matrix cells are the false positives and false negatives)
Cost Matrix: 0 1 / 1 0
  Confusion Matrix:   a      b   <-- classified as
                   19579    308 | a = nontoxic
                     439    696 | b = toxic
  Class Accuracy: 96.4% overall, 98.5% nontoxic, 61.3% toxic, 79.9% class average; Precision .978
Cost Matrix: 0 1 / 100 0
  Confusion Matrix:   a      b   <-- classified as
                   17239   2648 | a = nontoxic
                     183    952 | b = toxic
  Class Accuracy: 86.5% overall, 86.7% nontoxic, 83.9% toxic, 85.3% class average; Precision .989
Cost Matrix: 0 1 / 500 0
  Confusion Matrix:   a      b   <-- classified as
                   13753   6134 | a = nontoxic
                      97   1038 | b = toxic
  Class Accuracy: 70.4% overall, 69.2% nontoxic, 91.5% toxic, 80.3% class average; Precision .993 = 13753/(13753+97)
Precision is the percentage of chemicals classified as nontoxic that actually are nontoxic.
Naïve Bayes – all 158 attrib.
Classifier: NaiveBayes -x 10 -v -o -i -k -t
Correctly Classified Instances      1245       91.6789 %
Incorrectly Classified Instances     113        8.3211 %
Kappa statistic                        0.7071
K&B Relative Info Score            76254.3843 %
K&B Information Score                457.5652 bits    0.3369 bits/instance
Class complexity | order 0           811.2275 bits    0.5974 bits/instance
Class complexity | scheme           3181.6444 bits    2.3429 bits/instance
Complexity improvement (Sf)        -2370.4169 bits   -1.7455 bits/instance
Mean absolute error                    0.0856
Root mean squared error                0.2781
Relative absolute error               34.4662 %
Root relative squared error           78.9588 %
Total Number of Instances           1358
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.888    0.078    0.658      0.888   0.756      Toxic
0.922    0.112    0.98       0.922   0.95       Non-Toxic
=== Confusion Matrix ===
   a    b   <-- classified as
 175   22 | a = Toxic
  91 1070 | b = Non-Toxic
NB with top 20 attrib. (TTest)
Classifier: NaiveBayes -x 10 -v -o -i -k -t
=== Stratified cross-validation ===
Correctly Classified Instances      1274       93.8144 %
Incorrectly Classified Instances      84        6.1856 %
Kappa statistic                        0.7453
K&B Relative Info Score            90035.977  %
K&B Information Score                540.2618 bits    0.3978 bits/instance
Class complexity | order 0           811.2275 bits    0.5974 bits/instance
Class complexity | scheme           1578.972  bits    1.1627 bits/instance
Complexity improvement (Sf)         -767.7445 bits   -0.5653 bits/instance
Mean absolute error                    0.0648
Root mean squared error                0.233
Relative absolute error               26.0612 %
Root relative squared error           66.1472 %
Total Number of Instances           1358
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.761    0.032    0.802      0.761   0.781      Toxic
0.968    0.239    0.96       0.968   0.964      Non-Toxic
=== Confusion Matrix ===
   a    b   <-- classified as
 150   47 | a = Toxic
  37 1124 | b = Non-Toxic
Best: Logistic with only 20 attributes
Classifier: Logistic -x 10 -v -o -i -k -t
=== Stratified cross-validation ===
Correctly Classified Instances      1319       97.1281 %
Incorrectly Classified Instances      39        2.8719 %
Kappa statistic                        0.8789
K&B Relative Info Score           107051.552  %
K&B Information Score                642.364  bits    0.473  bits/instance
Class complexity | order 0           811.2275 bits    0.5974 bits/instance
Class complexity | scheme            192.6762 bits    0.1419 bits/instance
Complexity improvement (Sf)          618.5513 bits    0.4555 bits/instance
Mean absolute error                    0.0465
Root mean squared error                0.1547
Relative absolute error               18.7318 %
Root relative squared error           43.9414 %
Total Number of Instances           1358
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.848    0.008    0.949      0.848   0.895      Toxic
0.992    0.152    0.975      0.992   0.983      Non-Toxic
=== Confusion Matrix ===
   a    b   <-- classified as
 167   30 | a = Toxic
   9 1152 | b = Non-Toxic
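The run above uses Weka's Logistic; a comparable check in R would be glm with a binomial family (a sketch, reusing the assumed dat and top20 from the TTest slide and assuming BAct.5 is a two-level factor):
dat20 <- dat[, c(top20, "BAct.5")]
fit <- glm(BAct.5 ~ ., data = dat20, family = binomial)    # logistic regression on the 20 attributes
prob.toxic <- predict(fit, type = "response")              # fitted probability of the second factor level
table(predicted = prob.toxic > 0.5, actual = dat20$BAct.5) # simple confusion table at a 0.5 cutoff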
Principal Component Analysis • Linear transformations • Creating new dimensions that are linear combinations of the old attributes (dimensions) • The new dimensions are chosen to maximize the variation of the data • PC1 and PC2 contain the most variation of the data • Typically one plots PC1 vs PC2 and shows class labels (however they might not separate the classes the best) • There are at most N components, where N is the smaller of the number of rows and columns of the data • Closely related to Singular Value Decomposition; essentially finding the eigenvalues and eigenvectors of the (covariance) matrix
Principal Component Analysis • Plotting one PC vs another PC can be considered "clustering" using the Euclidean distance measure. • One can then view the class labels to see how good the clustering was. • Possibly making it into a "visual classifier" • It takes some work to know the "important" attributes • It is also "feature reduction", since one might use only the first few PCs for classification. • One can even "mix" PCs with original attributes (see the sketch below)
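A sketch of the PCA plot in R with prcomp, coloring points by class (assumes the csv data frame and BAct.5 class column from earlier; in practice drop the id and activity columns before running PCA):
X <- csv[, sapply(csv, is.numeric)]                  # keys and descriptors (numeric columns) only
pca <- prcomp(X, center = TRUE, scale. = TRUE)       # scale. = TRUE gives a correlation-based PCA
summary(pca)$importance[, 1:5]                       # variance explained by the first few PCs
plot(pca$x[, 1], pca$x[, 2], col = as.factor(csv$BAct.5),
     xlab = "PC1", ylab = "PC2", pch = 19)           # PC1 vs PC2 with class labels as color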
SarTox-Dem4.csv PCA: using all 158 keys and descriptors does not help.