Biological Data Mining Cheminformatics 1 Patrick Hoffman
Tonight's Topics • Review Lab (Excel, Weka, R-project, Clementine) SarToxDem4.zip • Review • Regression? • SarPredict Classify? • Flattening – exploding • Best SarPredict Classifier (Naïve Bayes, others in Weka) • Naïve Bayes – explained • R-code for probability density: pnorm, dnorm • Association Rules • Predictive Tox – SarToxDem • Data – ISIS Keys, MolConnZ descriptors • PCA, MDS, Sammon plots • Other clustering techniques – Comparison??
Lab – Understand & find best classifier • Download SarTox-Dem2.zip • Unzip (SarTox-Dem2.csv) • load into Excel (modify?) • load into Weka (visualize) • load into Clementine (output to table) • Load into R-project • filename <- "c:/MLCourse/SarTox-Dem4.csv" • csv <- read.csv(filename) • attach(csv) • Histograms of Act-5/BAct-5 (Excel, R, Clementine - overlays)
Example - SAR Data (SarPredict.csv) • Structural Activity Relationship • 960 chemicals (records) • 26 data fields (variables) • 11 Biological Activity measures • 11 Chemical descriptors • 4 Quality Control variables
Regression vs Classification • Regression was hard !!! • On Active vs Inactive – Two class problem • Easier problem • Problems • R-Groups are text strings (Flatten or explode) • Unbalanced classes • Naïve Bayes
Naïve Bayes with 11 chemical descriptors
Correctly Classified Instances       923       96.1458 %
Incorrectly Classified Instances      37        3.8542 %
Kappa statistic                        0.0494
K&B Relative Info Score           -68022.5337 %
K&B Information Score               -166.1076 bits   -0.173 bits/instance
Class complexity | order 0           230.8551 bits    0.2405 bits/instance
Class complexity | scheme            203.4613 bits    0.2119 bits/instance
Complexity improvement (Sf)           27.3938 bits    0.0285 bits/instance
Mean absolute error                    0.0677
Root mean squared error                0.1858
Relative absolute error               87.8712 %
Root relative squared error           95.2927 %
Total Number of Instances            960
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
1        0.974    0.961      1       0.98       Inactive
0.026    0        1          0.026   0.051      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 922    0 | a = Inactive
  37    1 | b = Active
Flattening? – Exploding? 4 categorical columns exploded so that each text value gets its own 0/1 indicator column:
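One way to do the exploding in R (a sketch; the data frame name sar and the column name R3.fmla are assumptions, the actual names in SarPredict.csv may differ):
# model.matrix builds a 0/1 indicator column for each level of a factor
sar$R3.fmla <- factor(sar$R3.fmla)
flags <- model.matrix(~ R3.fmla - 1, data = sar)   # "- 1" drops the intercept: one column per level
sar.exploded <- cbind(sar, flags)                  # append the indicator columns to the data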
Naïve Bayes with 36 chemical descriptors
Correctly Classified Instances       912       95      %
Incorrectly Classified Instances      48        5      %
Kappa statistic                        0.2475
K&B Relative Info Score           -72819.8932 %
K&B Information Score               -177.8225 bits   -0.1852 bits/instance
Class complexity | order 0           230.8551 bits    0.2405 bits/instance
Class complexity | scheme            249.0367 bits    0.2594 bits/instance
Complexity improvement (Sf)          -18.1817 bits   -0.0189 bits/instance
Mean absolute error                    0.067
Root mean squared error                0.1959
Relative absolute error               87.051  %
Root relative squared error          100.4683 %
Total Number of Instances            960
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.979    0.763    0.969      0.979   0.974      Inactive
0.237    0.021    0.321      0.237   0.273      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 903   19 | a = Inactive
  29    9 | b = Active
Overall accuracy is worse, but the Active class is better.
Naïve Bayes – 36 descriptors, no normalization
Correctly Classified Instances       878       91.4583 %
Incorrectly Classified Instances      82        8.5417 %
Kappa statistic                        0.3111
K&B Relative Info Score          -135125.7751 %
K&B Information Score               -329.9703 bits   -0.3437 bits/instance
Class complexity | order 0           230.8551 bits    0.2405 bits/instance
Class complexity | scheme            382.5589 bits    0.3985 bits/instance
Complexity improvement (Sf)         -151.7039 bits   -0.158  bits/instance
Mean absolute error                    0.0968
Root mean squared error                0.2564
Relative absolute error              125.6851 %
Root relative squared error          131.4682 %
Total Number of Instances            960
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.928    0.421    0.982      0.928   0.954      Inactive
0.579    0.072    0.25       0.579   0.349      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 856   66 | a = Inactive
  16   22 | b = Active
Voting Feature Intervals
Voting feature intervals classifier
Time taken to build model: 0.05 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances       728       75.8333 %
Incorrectly Classified Instances     232       24.1667 %
Kappa statistic                        0.1168
K&B Relative Info Score          -909872.6115 %
K&B Information Score              -2221.8629 bits   -2.3144 bits/instance
Class complexity | order 0           230.8551 bits    0.2405 bits/instance
Class complexity | scheme            759.1737 bits    0.7908 bits/instance
Complexity improvement (Sf)         -528.3187 bits   -0.5503 bits/instance
Mean absolute error                    0.357
Root mean squared error                0.4215
Relative absolute error              463.5091 %
Root relative squared error          216.1394 %
Total Number of Instances            960
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.762    0.342    0.982      0.762   0.858      Inactive
0.658    0.238    0.102      0.658   0.177      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 703  219 | a = Inactive
  13   25 | b = Active
PART Classifier?
Time taken to build model: 1.48 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances       912       95      %
Incorrectly Classified Instances      48        5      %
Kappa statistic                        0.2475
K&B Relative Info Score           -30559.9665 %
K&B Information Score                -74.6259 bits   -0.0777 bits/instance
Class complexity | order 0           230.8551 bits    0.2405 bits/instance
Class complexity | scheme          18419.7155 bits   19.1872 bits/instance
Complexity improvement (Sf)       -18188.8604 bits  -18.9467 bits/instance
Mean absolute error                    0.0641
Root mean squared error                0.214
Relative absolute error               83.283  %
Root relative squared error          109.7552 %
Total Number of Instances            960
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.979    0.763    0.969      0.979   0.974      Inactive
0.237    0.021    0.321      0.237   0.273      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 903   19 | a = Inactive
  29    9 | b = Active
Duplicate Active from 38 to 304
Time taken to build model: 0.06 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      1029       83.9315 %
Incorrectly Classified Instances     197       16.0685 %
Kappa statistic                        0.5996
K&B Relative Info Score            60187.846  %
K&B Information Score                486.9928 bits    0.3972 bits/instance
Class complexity | order 0           990.6589 bits    0.808  bits/instance
Class complexity | scheme           1151.1811 bits    0.939  bits/instance
Complexity improvement (Sf)         -160.5222 bits   -0.1309 bits/instance
Mean absolute error                    0.1714
Root mean squared error                0.3647
Relative absolute error               45.9283 %
Root relative squared error           84.4536 %
Total Number of Instances           1226
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.856    0.211    0.925      0.856   0.889      Inactive
0.789    0.144    0.643      0.789   0.709      Active
=== Confusion Matrix ===
   a    b   <-- classified as
 789  133 | a = Inactive
  64  240 | b = Active
Naïve Bayes Classifier
First, what is a Bayes classifier? Bayes' theorem:
P(Ck|x) = p(x|Ck) P(Ck) / p(x)
Ck = class, x = attribute vector
P(Ck|x) = posterior probability, P(Ck) = prior probability, p(x|Ck) = class-conditional density, p(x) = unconditional density
Bayes Classifier
Simply choose the class with the largest posterior probability given the feature vector x: assign x to Ck when P(Ck|x) > P(Cj|x) for every other class Cj, which is the same as p(x|Ck)P(Ck) > p(x|Cj)P(Cj) since p(x) cancels.
Problem: what are p(x|Ck) and p(x|Cj)?
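A toy illustration of the decision rule in R with a single Gaussian attribute, using dnorm for p(x|Ck) (a sketch: the class means and standard deviations are made-up numbers; the priors come from the 922/38 class counts above):
x <- 4.2                                   # the observed attribute value
# class-conditional densities p(x|Ck), assuming one Gaussian per class
p.x.given.active   <- dnorm(x, mean = 5.0, sd = 1.0)
p.x.given.inactive <- dnorm(x, mean = 2.0, sd = 1.5)
# priors P(Ck) estimated from class frequencies (38 Active, 922 Inactive)
p.active   <- 38 / 960
p.inactive <- 922 / 960
# choose the class with the larger p(x|Ck) * P(Ck); p(x) cancels out
if (p.x.given.active * p.active > p.x.given.inactive * p.inactive) "Active" else "Inactive"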
If one knew the real density function there would be no problem. Normally, p(x|Ck) is a multivariate joint probability density function. Options for estimating it: • 1. If there is enough data, build histograms • 2. Guess the distribution (Gaussian?) • 3. Calculate the mean and std. dev. of each attribute • 4. Use a parametric method: the mean and std. dev. become the parameters used to calculate p(x|Ck)
The Standard Normal or Gaussian Density function of a single variable
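For reference, with mean μ and standard deviation σ the density is p(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²)); in R this is exactly what dnorm computes:
mu <- 0; sigma <- 1; x <- 1.5
(1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))   # the formula by hand
dnorm(x, mean = mu, sd = sigma)                                   # same value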
Problems • A full multivariate Gaussian in d dimensions gives d(d+3)/2 parameters to estimate for the joint density function (d means plus d(d+1)/2 covariance entries) • Time consuming, difficult, and it might not be the correct density function • Many of the dimensions or attributes might be independent?
Why not build a d-dimensional histogram for each Class ? This would approximate the joint density function.
Curse of dimensionality!!! • Say 10 bins (or values) for each dimension (attribute) • d=2: 100 bins • d=3: 1,000 bins • d=4: 10,000 bins, etc. • All that multiplied by the number of classes • Usually not enough data or time • Not enough data to fill the bins
Naïve or Simple Bayes is the answer. • Assume all dimensions or attributes are independent! • Simple probability product rule: P(X|Ck) --> P(A1|Ck) * P(A2|Ck) * ... * P(Ad|Ck) for d attributes • One can estimate each P(Ai|Ck) as a Gaussian, or build a histogram for each attribute in a training set • 10 dimensions with 10 bins each becomes 10 × 10 = 100 bins, not 10^10
Discretization (binning) is better • Building the histograms is better if you have enough data • MLC++ has both Naïve Bayes (assumes Gaussian) and Discrete Naïve Bayes • Several binning techniques (see Kohavi) • The entropy-based method is very good (see http://www.cs.uml.edu/~fjara/mineset/id3/id3_example/id3example.html)
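MLC++'s entropy-based binning is not reproduced here, but a simple equal-frequency binning in base R looks like this (a sketch with 10 bins, assuming any numeric attribute vector x):
x <- iris$Sepal.Length                                      # any numeric attribute
cuts <- quantile(x, probs = seq(0, 1, length.out = 11))     # 11 break points = 10 equal-frequency bins
binned <- cut(x, breaks = unique(cuts), include.lowest = TRUE)
table(binned)                                               # instance count per bin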
Simple NB in R – Which class is most likely to have all dimensions between 0 and 5? (pnorm calculates the cumulative probability)
data(iris)        # get data
attach(iris)      # make names available
hi = 5            # set hi limit for possible dimensions
lo = 0            # set lo limit
# a vector of the data for each class
a1 = Sepal.Length[1:50]
a2 = Sepal.Length[51:100]
a3 = Sepal.Length[101:150]
b1 = Sepal.Width[1:50]
b2 = Sepal.Width[51:100]
b3 = Sepal.Width[101:150]
c1 = Petal.Width[1:50]
c2 = Petal.Width[51:100]
c3 = Petal.Width[101:150]
d1 = Petal.Length[1:50]
d2 = Petal.Length[51:100]
d3 = Petal.Length[101:150]
# probability of each dimension of each class being in the range [lo, hi]
p1setosa     = pnorm(hi, mean(a1), sd(a1)) - pnorm(lo, mean(a1), sd(a1))
p1versicolor = pnorm(hi, mean(a2), sd(a2)) - pnorm(lo, mean(a2), sd(a2))
p1virginica  = pnorm(hi, mean(a3), sd(a3)) - pnorm(lo, mean(a3), sd(a3))
p2setosa     = pnorm(hi, mean(b1), sd(b1)) - pnorm(lo, mean(b1), sd(b1))
p2versicolor = pnorm(hi, mean(b2), sd(b2)) - pnorm(lo, mean(b2), sd(b2))
p2virginica  = pnorm(hi, mean(b3), sd(b3)) - pnorm(lo, mean(b3), sd(b3))
p3setosa     = pnorm(hi, mean(c1), sd(c1)) - pnorm(lo, mean(c1), sd(c1))
p3versicolor = pnorm(hi, mean(c2), sd(c2)) - pnorm(lo, mean(c2), sd(c2))
p3virginica  = pnorm(hi, mean(c3), sd(c3)) - pnorm(lo, mean(c3), sd(c3))
p4setosa     = pnorm(hi, mean(d1), sd(d1)) - pnorm(lo, mean(d1), sd(d1))
p4versicolor = pnorm(hi, mean(d2), sd(d2)) - pnorm(lo, mean(d2), sd(d2))
p4virginica  = pnorm(hi, mean(d3), sd(d3)) - pnorm(lo, mean(d3), sd(d3))
# naive independence assumption: multiply the per-dimension probabilities
psetosa     = p1setosa * p2setosa * p3setosa * p4setosa
pversicolor = p1versicolor * p2versicolor * p3versicolor * p4versicolor
pvirginica  = p1virginica * p2virginica * p3virginica * p4virginica
psetosa
pversicolor
pvirginica
Better NB in R – Which class is most likely to have all dimensions between 0 and 5? (pnorm calculates the cumulative probability)
## A better way using loops (hi and lo are set on the previous slide)
csv = iris
N = ncol(csv)            # number of columns
N = N - 1                # don't do the class column
R = nrow(csv)
stats = matrix(0, N, 3)  # store the probabilities for each class and each dimension
probs = matrix(1, 3, 1)  # final probabilities for each class
# loop for 3 classes
for (lp2 in 1:3) {
  # get mean and sd for each class and each dimension
  # loop for each dimension
  for (lp1 in 1:N) {
    clix1 = (lp2 - 1) * 50 + 1
    clix2 = clix1 + 49
    d1 = csv[clix1:clix2, lp1]   # where each class's data is
    m = mean(d1)
    s = sd(d1)
    stats[lp1, lp2] = pnorm(hi, m, s) - pnorm(lo, m, s)
    probs[lp2] = probs[lp2] * stats[lp1, lp2]
  }
}
stats
probs
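For comparison with the hand-rolled version above, a packaged Naïve Bayes (one Gaussian per attribute per class) is available in the e1071 package; a sketch, assuming the package is installed:
library(e1071)
data(iris)
model <- naiveBayes(Species ~ ., data = iris)    # fits a Gaussian per attribute per class
predict(model, iris[1:5, -5])                    # predicted classes for the first 5 rows
predict(model, iris[1:5, -5], type = "raw")      # class posterior probabilities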
This SAR Example • Regression failed • Classification failed • Any other machine learning tricks?
Association rules • Look for possible rules that have high confidence and support • There can be very many; a good method will let you specify what you are looking for • Each rule covers only a small piece of the dimensional space • Binning or discretization is usually necessary; smart or entropy binning is best • Example: S5 > 6.405 and 'R3 fmla' = "CN-" and 'R4 fmla' = "C4H9-"
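The course uses Clementine's GRI (next slide); in R the arules package offers apriori as an alternative (a sketch, assuming the package is installed and a data frame sar of SAR attributes; numerics are binned first):
library(arules)
# apriori wants factor/transaction data, so bin numeric columns first (e.g. with cut)
sar.binned <- as.data.frame(lapply(sar, function(col)
  if (is.numeric(col)) cut(col, breaks = 4) else as.factor(col)))
trans <- as(sar.binned, "transactions")
rules <- apriori(trans, parameter = list(support = 0.01, confidence = 0.8))
inspect(head(sort(rules, by = "confidence"), 10))   # top 10 rules by confidence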
Association Rules – so far • Clementine's GRI is the best (it does its own binning) • The rules below are only for the selection index Active • Support = percentage of instances in the dataset where the antecedents are true • Confidence = percentage of those supporting instances where the consequent is also true
Association Rules - Be Wary • In the dataset there are only 8 instances behind the rules with confidence = 100% • Some of the rules can be redundant • Looking at the output one might think there are 15 such instances • Generally one wants "good" rules where both support and confidence are high.
Example – Predictive Toxicology • Project Objective • Understand the relationship between chemical structure and liver isozyme inhibition. • Data Overview • 100,000 chemicals (records) • 280 data fields (variables) • 1 biological assay • 4 liver isozyme assays • 275 chemical descriptors • 166 Substructure Search Keys – ISIS/Host • 109 Electro-topological State Indicators – MolConnZ
Smaller version - SarTox-Dem4.csv • 82 keys • 76 descriptors • 1550 instances (records, rows) • 1 id column • 5 activity measurements • 5 binned activity measurements • Analyze/Predict the last columns: Act-5, BAct-5 • 1280 non-toxic and 269 toxic
Analysis Stages • Metadata Overview & Data Cleansing • Isozyme activity binning • Classifying & Clustering • Association Rules • TTEST/Feature Reduction • Visualizations
Metadata Overview & Cleansing • 10 ISIS keys and 5 MolConn Z descriptors had zero values. • In our analyses, these fields were eliminated from the dataset, thereby reducing the number of descriptors and keys to 260. • Many records contained missing values: • Biological Assay: ~49,000 • Isozyme 1: ~50,000 Isozyme 3: ~55,000 • Isozyme 2: ~50,000 Isozyme 4: ~50,000 • About 24,000 records have all values of the biological activity and four liver isozymes
Pearson Cross Correlations: 260 descriptors × 260 descriptors correlation matrix plot, colored from – correlation through no correlation to + correlation.
ISIS Keys with PatchGrid™: 150 chemical classes × 166 ISIS keys (key on / key off), showing the ISIS key composition of each chemical class.
ISIS Keys with PatchGrid™: the same 150 classes × 166 keys, with the ISIS keys clustered to show chemical classes with similar keys.
Data Binning: Bio Activity (low activity to high activity) and Isozymes 1–4 (low inhibition to high inhibition), each binned for the analyses that follow.
Association Rules - Visually: Key 1, Key 2, and Descriptor A (binned low / medium / high); example rule: Isozyme 1 Inhibition = high if key1 > 0.5 & key2 > 0.5 & Descriptor A > 3.6.
Sub-Selection Overview: biological activity of the sub-selection, high inhibition of the sub-selection across all classes, and the class with a high % inhibition.
Narrowing Down • Identify important dimensions in the sub-selection • Apply important dimensions from Association Rules • Select a single chemical class with high % inhibition • Use a TTest or F-test to reduce keys and descriptors (see the sketch below)
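A sketch of the TTest step in R, ranking keys/descriptors by how well they separate the two classes (the data frame dat and the two-level class column BAct.5 are assumptions):
y <- dat$BAct.5                                              # two-level class factor
numeric.cols <- sapply(dat, is.numeric)
pvals <- sapply(dat[, numeric.cols], function(col)
  tryCatch(t.test(col ~ y)$p.value, error = function(e) NA)) # constant columns give NA
top20 <- names(sort(pvals))[1:20]                            # the 20 most separating attributes
top20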
Classification Via Cost Matrix (off-diagonal confusion-matrix cells are the false positives and false negatives)
Cost Matrix: 0 1 / 1 0
  Confusion Matrix:   a      b   <-- classified as
                   19579    308 | a = nontoxic
                     439    696 | b = toxic
  Class Accuracy: 96.4% overall, 98.5% nontoxic, 61.3% toxic, 79.9% class average; Precision .978
Cost Matrix: 0 1 / 100 0
  Confusion Matrix:   a      b   <-- classified as
                   17239   2648 | a = nontoxic
                     183    952 | b = toxic
  Class Accuracy: 86.5% overall, 86.7% nontoxic, 83.9% toxic, 85.3% class average; Precision .989
Cost Matrix: 0 1 / 500 0
  Confusion Matrix:   a      b   <-- classified as
                   13753   6134 | a = nontoxic
                      97   1038 | b = toxic
  Class Accuracy: 70.4% overall, 69.2% nontoxic, 91.5% toxic, 80.3% class average; Precision .993 = 13753/(13753+97)
Precision is the percentage of chemicals classified as nontoxic that actually are nontoxic.
Naïve Bayes – all 158 attrib.
Classifier: NaiveBayes -x 10 -v -o -i -k -t
Correctly Classified Instances      1245       91.6789 %
Incorrectly Classified Instances     113        8.3211 %
Kappa statistic                        0.7071
K&B Relative Info Score            76254.3843 %
K&B Information Score                457.5652 bits    0.3369 bits/instance
Class complexity | order 0           811.2275 bits    0.5974 bits/instance
Class complexity | scheme           3181.6444 bits    2.3429 bits/instance
Complexity improvement (Sf)        -2370.4169 bits   -1.7455 bits/instance
Mean absolute error                    0.0856
Root mean squared error                0.2781
Relative absolute error               34.4662 %
Root relative squared error           78.9588 %
Total Number of Instances           1358
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.888    0.078    0.658      0.888   0.756      Toxic
0.922    0.112    0.98       0.922   0.95       Non-Toxic
=== Confusion Matrix ===
   a    b   <-- classified as
 175   22 | a = Toxic
  91 1070 | b = Non-Toxic
NB with top 20 attrib. (TTest)
Classifier: NaiveBayes -x 10 -v -o -i -k -t
=== Stratified cross-validation ===
Correctly Classified Instances      1274       93.8144 %
Incorrectly Classified Instances      84        6.1856 %
Kappa statistic                        0.7453
K&B Relative Info Score            90035.977  %
K&B Information Score                540.2618 bits    0.3978 bits/instance
Class complexity | order 0           811.2275 bits    0.5974 bits/instance
Class complexity | scheme           1578.972  bits    1.1627 bits/instance
Complexity improvement (Sf)         -767.7445 bits   -0.5653 bits/instance
Mean absolute error                    0.0648
Root mean squared error                0.233
Relative absolute error               26.0612 %
Root relative squared error           66.1472 %
Total Number of Instances           1358
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.761    0.032    0.802      0.761   0.781      Toxic
0.968    0.239    0.96       0.968   0.964      Non-Toxic
=== Confusion Matrix ===
   a    b   <-- classified as
 150   47 | a = Toxic
  37 1124 | b = Non-Toxic
Best: Logistic with only 20 attributes
Classifier: Logistic -x 10 -v -o -i -k -t
=== Stratified cross-validation ===
Correctly Classified Instances      1319       97.1281 %
Incorrectly Classified Instances      39        2.8719 %
Kappa statistic                        0.8789
K&B Relative Info Score           107051.552  %
K&B Information Score                642.364  bits    0.473  bits/instance
Class complexity | order 0           811.2275 bits    0.5974 bits/instance
Class complexity | scheme            192.6762 bits    0.1419 bits/instance
Complexity improvement (Sf)          618.5513 bits    0.4555 bits/instance
Mean absolute error                    0.0465
Root mean squared error                0.1547
Relative absolute error               18.7318 %
Root relative squared error           43.9414 %
Total Number of Instances           1358
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.848    0.008    0.949      0.848   0.895      Toxic
0.992    0.152    0.975      0.992   0.983      Non-Toxic
=== Confusion Matrix ===
   a    b   <-- classified as
 167   30 | a = Toxic
   9 1152 | b = Non-Toxic
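The run above uses Weka's Logistic; a comparable check in R would be glm with a binomial family (a sketch, reusing the assumed dat and top20 from the TTest slide and assuming BAct.5 is a two-level factor):
dat20 <- dat[, c(top20, "BAct.5")]
fit <- glm(BAct.5 ~ ., data = dat20, family = binomial)    # logistic regression on the 20 attributes
prob.toxic <- predict(fit, type = "response")              # fitted probability of the second factor level
table(predicted = prob.toxic > 0.5, actual = dat20$BAct.5) # simple confusion table at a 0.5 cutoff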
Principal Component Analysis • Linear transformations • Creating new dimensions that are linear combinations of the old attributes (dimensions) • The new dimensions are chosen to maximize the variation of the data • PC1 and PC2 contain the most variation of the data • Typically one plots PC1 vs PC2 and shows class labels (however they might not separate the classes the best) • There are at most N components, where N is the smaller of the number of rows and columns of the data • Closely related to Singular Value Decomposition; essentially finding the eigenvalues and eigenvectors of the (covariance) matrix
Principal Component Analysis • Plotting one PC vs another PC can be considered "clustering" using the Euclidean distance measure. • One can then view the class labels to see how good the clustering was. • Possibly making it into a "visual classifier" • It takes some work to know the "important" attributes • It is also "feature reduction", since one might use only the first few PCs for classification. • One can even "mix" PCs with original attributes (see the sketch below)
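A sketch of the PCA plot in R with prcomp, coloring points by class (assumes the csv data frame and BAct.5 class column from earlier; in practice drop the id and activity columns before running PCA):
X <- csv[, sapply(csv, is.numeric)]                  # keys and descriptors (numeric columns) only
pca <- prcomp(X, center = TRUE, scale. = TRUE)       # scale. = TRUE gives a correlation-based PCA
summary(pca)$importance[, 1:5]                       # variance explained by the first few PCs
plot(pca$x[, 1], pca$x[, 2], col = as.factor(csv$BAct.5),
     xlab = "PC1", ylab = "PC2", pch = 19)           # PC1 vs PC2 with class labels as color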
SarTox-Dem4.csv PCA: using all 158 keys and descriptors does not help.