Explore and analyze large quantities of data to extract meaningful patterns and discover previously unknown information. This presentation discusses the importance of data mining in both commercial and scientific fields, with a focus on applications in biology and biomedical informatics.
Techniques for Finding Patterns in Large Amounts of Data: Applications in Biology
Vipin Kumar, William Norris Professor and Head, Department of Computer Science
kumar@cs.umn.edu | www.cs.umn.edu/~kumar
What is Data Mining? Many definitions, e.g.: • Non-trivial extraction of implicit, previously unknown, and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Why Data Mining? Commercial Viewpoint • Lots of data is being collected and warehoused • Web data • Yahoo! collects 10GB/hour • purchases at department/grocery stores • Walmart records 20 million transactions per day • Bank/Credit Card transactions • Computers have become cheaper and more powerful • Competitive Pressure is Strong • Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Data Mining? Scientific Viewpoint • Data collected and stored at enormous speeds (GB/hour) • remote sensors on a satellite • NASA EOSDIS archives over 1 petabyte of earth science data per year • telescopes scanning the skies • Sky survey data • gene expression data • scientific simulations • terabytes of data generated in a few hours • Data mining may help scientists • in automated analysis of massive data sets • in hypothesis formation
Data Mining for Biomedical Informatics • Recent technological advances are helping to generate large amounts of both medical and genomic data • High-throughput experiments/techniques • Gene and protein sequences • Gene-expression data • Biological networks and phylogenetic profiles • Electronic medical records • The IBM-Mayo Clinic partnership has created a database of 5 million patients • Single Nucleotide Polymorphisms (SNPs) • Data mining offers potential solutions for the analysis of such large-scale data • Automated analysis of patient histories for customized treatment • Prediction of the functions of anonymous genes • Identification of putative binding sites in protein structures for drug/chemical discovery • [Figure: protein interaction network]
Origins of Data Mining • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems • Traditional techniques may be unsuitable due to • Enormity of data • High dimensionality of data • Heterogeneous, distributed nature of data • [Figure: Venn diagram placing data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems]
Data Mining Tasks • Clustering • Predictive Modeling • Anomaly Detection • Association Rules
How can data mining help biologists? • Data mining is particularly effective when the pattern/format of the final knowledge is known in advance, which is common in biology: • Protein complexes (clustering and association patterns) • Gene regulatory networks (predictive models) • Protein structure/function (predictive models) • Motifs (association patterns) • We will look at two examples: • Clustering of ESTs and microarray data • Identifying protein functional modules from protein complexes
Predictive Modeling: Classification • Find a model for the class attribute as a function of the values of the other attributes • [Figure: decision tree model for predicting credit worthiness, with tests on Employed (Yes/No), Education (High school/Undergrad vs. Graduate), and Number of years (> 7 vs. < 7)]
General Approach for Building a Classification Model • [Figure: a training set with quantitative and categorical attributes plus a class label is used to learn a classifier; the induced model is then applied to a test set]
Classification Techniques • Base Classifiers • Decision Tree based Methods • Rule-based Methods • Nearest-neighbor • Neural Networks • Naïve Bayes and Bayesian Belief Networks • Support Vector Machines • Ensemble Classifiers • Boosting, Bagging, Random Forests
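As a concrete illustration (not part of the original slides), here is a minimal sketch that trains one base classifier (a decision tree) and one ensemble (a random forest), assuming scikit-learn is available; the synthetic dataset is hypothetical.

```python
# Minimal sketch (assumes scikit-learn is installed): training a base
# classifier and an ensemble classifier on a small synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 200 records, 5 attributes, binary class label
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)         # base classifier
forest = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)  # ensemble (bagged trees)

print("decision tree accuracy:", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```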
Examples of Classification Task • Predicting tumor cells as benign or malignant • Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil • Classifying credit card transactions as legitimate or fraudulent • Categorizing news stories as finance, weather, entertainment, sports, etc. • Identifying intruders in cyberspace
Classification for Protein Function Prediction • Features could be several motifs, structure, domains, etc. • Important features (motifs, conserved domains etc) can be extracted using automated feature extraction techniques and using domain knowledge • Classification model can then be built on the extracted features and trained using several annotated proteins • Two proteins sharing significantly common features have a greater probability of having the same function • Several supervised and unsupervised schemes can be applied for this purpose
Constructing a Decision Tree • Key computation: the class counts in each partition induced by a candidate test condition • Splitting on Employed: Employed = Yes gives Worthy 4, Not Worthy 3; Employed = No gives Worthy 0, Not Worthy 3 • Splitting on Education: Graduate gives Worthy 2, Not Worthy 2; High School/Undergrad gives Worthy 2, Not Worthy 4 • The tree is grown by repeatedly choosing the test condition that best separates the classes (e.g. Employed = Yes vs. Employed = No) and splitting on it
Design Issues of Decision Tree Induction • How should training records be split? • Method for specifying test condition • depending on attribute types • Measure for evaluating the goodness of a test condition • How should the splitting procedure stop? • Stop splitting if all the records belong to the same class or have identical attribute values • Early termination
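To make "evaluating the goodness of a test condition" concrete, the sketch below computes the GINI index for the Employed split from the credit-worthiness example; GINI is only one of several possible impurity measures, and the helper functions are illustrative.

```python
# Sketch of one common measure (GINI index) for evaluating a test condition;
# the counts follow the Employed = Yes / No example from the earlier slide.
def gini(counts):
    """GINI impurity of a node given its class counts, e.g. [4, 3]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Weighted GINI of a split given the class counts of each child node."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# Employed = Yes -> [Worthy: 4, Not Worthy: 3]; Employed = No -> [Worthy: 0, Not Worthy: 3]
parent = gini([4, 6])
split = weighted_gini([[4, 3], [0, 3]])
print("GINI before split:", parent, "after split:", split)  # lower is better
```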
Application of Rule-Based Classifier • A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule • The rule r1 covers a hawk => Bird • The rule r3 covers the grizzly bear => Mammal
Ordered Rule Set • Rules are rank ordered according to their priority • An ordered rule set is known as a decision list • When a test record is presented to the classifier • It is assigned to the class label of the highest ranked rule it has triggered • If none of the rules fired, it is assigned to the default class
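A minimal sketch of a decision list is shown below; the rules r1–r3 are hypothetical stand-ins for the ones pictured on the slide. The first rule that covers a record assigns its class, and a default class is used when no rule fires.

```python
# Sketch of an ordered rule set (decision list): rules are tried in rank
# order and the first rule that covers the record assigns the class label.
# The rules below are hypothetical stand-ins for r1..r3 on the slide.
rules = [
    (lambda r: r["gives_birth"] == "no" and r["can_fly"] == "yes", "Bird"),    # r1
    (lambda r: r["lives_in_water"] == "yes", "Fish"),                          # r2 (hypothetical)
    (lambda r: r["gives_birth"] == "yes" and r["blood"] == "warm", "Mammal"),  # r3
]
DEFAULT_CLASS = "Unknown"

def classify(record):
    for covers, label in rules:
        if covers(record):       # highest-ranked triggered rule wins
            return label
    return DEFAULT_CLASS         # no rule fired

hawk = {"gives_birth": "no", "can_fly": "yes", "lives_in_water": "no", "blood": "warm"}
print(classify(hawk))  # -> Bird
```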
Nearest Neighbor Classifiers • Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck • [Figure: compute the distance from the test record to the training records and choose the k "nearest" records]
Nearest-Neighbor Classifiers • Requires three things • The set of stored records • Distance metric to compute distance between records • The value of k, the number of nearest neighbors to retrieve • To classify an unknown record: • Compute distance to other training records • Identify k nearest neighbors • Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
Nearest Neighbor Classification… • Choosing the value of k: • If k is too small, sensitive to noise points • If k is too large, neighborhood may include points from other classes
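A small sketch of the three steps listed on the previous slide, using plain NumPy; the toy records and the duck/goose labels are hypothetical.

```python
# Minimal k-NN sketch: compute distances, find the k nearest training
# records, and take a majority vote of their class labels.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance metric
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)      # class labels of those neighbors
    return votes.most_common(1)[0][0]                 # majority vote

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.3]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> "duck"
```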
Bayes Classifier • A probabilistic framework for solving classification problems • Conditional probability: P(Y | X) = P(X, Y) / P(X) • Bayes theorem: P(Y | X) = P(X | Y) P(Y) / P(X)
Using Bayes Theorem for Classification • Approach: • compute posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem • Maximum a-posteriori: Choose Y that maximizes P(Y | X1, X2, …, Xd) • Equivalent to choosing value of Y that maximizes P(X1, X2, …, Xd|Y) P(Y) • How to estimate P(X1, X2, …, Xd | Y )?
Naïve Bayes Classifier • Assume independence among attributes Xi when the class is given: • P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj) • Can estimate P(Xi | Yj) for all Xi and Yj from data • A new point is classified to Yj if P(Yj) ∏i P(Xi | Yj) is maximal
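A minimal sketch of the naive Bayes decision rule on a tiny hypothetical categorical dataset; the probabilities are estimated from raw counts, so smoothing is omitted for brevity.

```python
# Sketch of the naive Bayes decision rule: estimate P(Xi | Yj) and P(Yj) from
# counts, then pick the class that maximizes P(Yj) * prod_i P(Xi | Yj).
from collections import Counter, defaultdict

data = [  # (Employed, Education) -> class; hypothetical records
    (("yes", "graduate"), "worthy"), (("yes", "high_school"), "worthy"),
    (("yes", "graduate"), "worthy"), (("no", "high_school"), "not_worthy"),
    (("no", "graduate"), "not_worthy"), (("yes", "high_school"), "not_worthy"),
]

prior = Counter(label for _, label in data)          # class counts for P(Yj)
cond = defaultdict(Counter)                          # counts for P(Xi | Yj)
for x, label in data:
    for i, v in enumerate(x):
        cond[(i, label)][v] += 1

def predict(x):
    best, best_p = None, -1.0
    for label, n in prior.items():
        p = n / len(data)                            # P(Yj)
        for i, v in enumerate(x):
            p *= cond[(i, label)][v] / n             # P(Xi | Yj), no smoothing
        if p > best_p:
            best, best_p = label, p
    return best

print(predict(("yes", "graduate")))                  # -> "worthy"
```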
Artificial Neural Networks (ANN) • Training an ANN means learning the weights of the neurons • [Figure: network of weighted connections between neurons and common activation functions]
Perceptron • Single-layer network: contains only input and output nodes • Activation function: f(w, x) = sign(w · x) • Learning: initialize the weights (w0, w1, …, wd); repeat — for each training example (xi, yi), compute f(w, xi) and update the weights w(k+1) = w(k) + λ (yi − f(w(k), xi)) xi — until a stopping condition is met
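A sketch of the perceptron loop above on a hypothetical linearly separable toy dataset; the update rule and stopping condition follow the bullets on the slide.

```python
# Sketch of the perceptron learning loop:
# w <- w + lambda * (y_i - f(w, x_i)) * x_i, repeated until no mistakes remain.
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    X = np.hstack([np.ones((len(X), 1)), X])     # prepend a bias input for w0
    w = np.zeros(X.shape[1])                     # initialize the weights
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            pred = np.sign(w @ xi) or 1.0        # f(w, x) = sign(w . x)
            if pred != yi:
                w += lr * (yi - pred) * xi       # weight update rule
                errors += 1
        if errors == 0:                          # stopping condition: no mistakes
            break
    return w

# Hypothetical linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
print(train_perceptron(X, y))
```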
Perceptron Learning Rule • Since f(w,x) is a linear combination of input variables, decision boundary is linear • For nonlinearly separable problems, perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly, and hence more powerful learning techniques are needed
Characteristics of ANN • Multilayer ANNs are universal approximators • Can handle redundant attributes because the weights are learned automatically • Gradient descent may converge to a local minimum • Model building can be very time consuming, but testing can be very fast
Support Vector Machines • Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines • Which hyperplane is better, B1 or B2? How do you define "better"?
Support Vector Machines • Find the hyperplane that maximizes the margin => B1 is better than B2
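A sketch of the maximum-margin idea, assuming scikit-learn is available; the toy points are hypothetical, and a large C is used to approximate a hard margin.

```python
# Sketch: fit a linear SVM, which chooses the separating hyperplane
# with the maximum margin between the two classes.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],          # class +1
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)           # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]                # hyperplane w.x + b = 0
margin = 2.0 / np.linalg.norm(w)                      # width of the margin
print("w =", w, "b =", b, "margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```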
Evaluating Classifiers • Confusion matrix (rows: actual class, columns: predicted class):
            Predicted +             Predicted −
Actual +    a: TP (true positive)   b: FN (false negative)
Actual −    c: FP (false positive)  d: TN (true negative)
Accuracy • Most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + FN + FP + TN)
Problem with Accuracy • Consider a 2-class problem • Number of Class 0 examples = 9990 • Number of Class 1 examples = 10 • If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % • This is misleading because the model does not detect any class 1 example • Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc)
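The sketch below reproduces the arithmetic above from a confusion matrix: a model that always predicts class 0 scores 99.9% accuracy while detecting none of the rare class.

```python
# Sketch illustrating the class-imbalance problem: high accuracy,
# zero recall on the rare class.
def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # detection rate for the rare class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision

# 9990 class-0 examples, 10 class-1 examples, model always predicts class 0
print(metrics(tp=0, fn=10, fp=0, tn=9990))             # accuracy 0.999, recall 0.0
```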
ROC (Receiver Operating Characteristic) • A graphical approach for displaying trade-off between detection rate and false alarm rate • Developed in 1950s for signal detection theory to analyze noisy signals • ROC curve plots TPR against FPR • Performance of a model represented as a point in an ROC curve • Changing the threshold parameter of classifier changes the location of the point
ROC Curve (TPR,FPR): • (0,0): declare everything to be negative class • (1,1): declare everything to be positive class • (1,0): ideal • Diagonal line: • Random guessing • Below diagonal line: • prediction is opposite of the true class
ROC (Receiver Operating Characteristic) • To draw an ROC curve, the classifier must produce continuous-valued output • Outputs are used to rank test records, from the most likely positive-class record to the least likely • Many classifiers produce only discrete outputs (i.e., the predicted class) • How to get continuous-valued outputs? Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, and SVMs can all be adapted to output class scores or probability estimates
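A sketch, assuming scikit-learn, of turning continuous scores into an ROC curve; the test labels and scores are hypothetical.

```python
# Sketch: ranking test records by a continuous score and sweeping the
# decision threshold traces out the ROC curve (TPR vs. FPR).
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])                   # hypothetical test labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])  # e.g. estimated P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("FPR:", fpr)
print("TPR:", tpr)
print("area under ROC curve:", auc(fpr, tpr))
```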
Methods for Classifier Evaluation • Holdout • Reserve k% for training and (100 − k)% for testing • Random subsampling • Repeated holdout • Cross validation • Partition data into k disjoint subsets • k-fold: train on k − 1 partitions, test on the remaining one • Leave-one-out: k = n • Bootstrap • Sampling with replacement • .632 bootstrap: accuracy = (1/b) Σi (0.632 × acci + 0.368 × accs), where acci is the accuracy on the i-th bootstrap sample and accs is the accuracy on the full data
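A sketch of k-fold cross-validation, assuming scikit-learn; the synthetic dataset and the choice of a decision tree as the classifier are hypothetical.

```python
# Sketch of k-fold cross-validation: the data are split into k disjoint
# partitions, and each partition is used once as the test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)   # 10-fold
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```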
Methods for Comparing Classifiers • Given two models: • Model M1: accuracy = 85%, tested on 30 instances • Model M2: accuracy = 75%, tested on 5000 instances • Can we say M1 is better than M2? • How much confidence can we place on accuracy of M1 and M2? • Can the difference in performance measure be explained as a result of random fluctuations in the test set?
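One common way to approach these questions is to attach a confidence interval to each accuracy estimate; the sketch below uses a normal approximation to the binomial, which is an assumption for illustration rather than a method stated on the slide.

```python
# Sketch: approximate confidence interval for accuracy measured on n
# test instances, using a normal approximation to the binomial.
import math

def accuracy_interval(acc, n, z=1.96):          # z = 1.96 for ~95% confidence
    se = math.sqrt(acc * (1 - acc) / n)         # standard error of the estimate
    return acc - z * se, acc + z * se

print("M1 (85% on 30):  ", accuracy_interval(0.85, 30))    # wide interval
print("M2 (75% on 5000):", accuracy_interval(0.75, 5000))  # narrow interval
```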
Data Mining Book For further details and sample chapters see www.cs.umn.edu/~kumar/dmbook