Explore and analyze large quantities of data to extract meaningful patterns and discover previously unknown information. This presentation discusses the importance of data mining in both commercial and scientific fields, with a focus on applications in biology and biomedical informatics.
Techniques for Finding Patterns in Large Amounts of Data: Applications in Biology
Vipin Kumar, William Norris Professor and Head, Department of Computer Science
kumar@cs.umn.edu | www.cs.umn.edu/~kumar
What is Data Mining? Many definitions, e.g.: • Non-trivial extraction of implicit, previously unknown, and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Why Data Mining? Commercial Viewpoint • Lots of data is being collected and warehoused • Web data • Yahoo! collects 10GB/hour • purchases at department/grocery stores • Walmart records 20 million transactions per day • Bank/Credit Card transactions • Computers have become cheaper and more powerful • Competitive Pressure is Strong • Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Data Mining? Scientific Viewpoint • Data collected and stored at enormous speeds (GB/hour) • remote sensors on a satellite • NASA EOSDIS archives over 1 petabyte of earth science data per year • telescopes scanning the skies • Sky survey data • gene expression data • scientific simulations • terabytes of data generated in a few hours • Data mining may help scientists • in automated analysis of massive data sets • in hypothesis formation
Data Mining for Biomedical Informatics • Recent technological advances are helping to generate large amounts of both medical and genomic data • High-throughput experiments/techniques • Gene and protein sequences • Gene-expression data • Biological networks and phylogenetic profiles • Electronic medical records • The IBM-Mayo Clinic partnership has created a database of 5 million patients • Single Nucleotide Polymorphisms (SNPs) • Data mining offers potential solutions for the analysis of such large-scale data • Automated analysis of patient histories for customized treatment • Prediction of the functions of anonymous genes • Identification of putative binding sites in protein structures for drug/chemical discovery • [Figure: protein interaction network]
Origins of Data Mining • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems • Traditional techniques may be unsuitable due to • Enormity of data • High dimensionality of data • Heterogeneous, distributed nature of data • [Figure: Venn diagram placing data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems]
Data Mining Tasks • Clustering • Predictive Modeling • Anomaly Detection • Association Rules
How can data mining help biologists? • Data mining is particularly effective when the pattern/format of the final knowledge is known in advance, which is common in biology: • Protein complexes (clustering and association patterns) • Gene regulatory networks (predictive models) • Protein structure/function (predictive models) • Motifs (association patterns) • We will look at two examples: • Clustering of ESTs and microarray data • Identifying protein functional modules from protein complexes
Predictive Modeling: Classification • Find a model for the class attribute as a function of the values of the other attributes • [Figure: decision tree model for predicting credit worthiness, with tests on Employed (Yes/No), Education (High school/Undergrad vs. Graduate), and Number of years (> 7 vs. < 7)]
General Approach for Building a Classification Model • [Figure: a training set with quantitative and categorical attributes plus a class label is used to learn a classifier; the induced model is then applied to a test set]
Classification Techniques • Base Classifiers • Decision Tree based Methods • Rule-based Methods • Nearest-neighbor • Neural Networks • Naïve Bayes and Bayesian Belief Networks • Support Vector Machines • Ensemble Classifiers • Boosting, Bagging, Random Forests
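As a concrete illustration (not part of the original slides), here is a minimal sketch that trains one base classifier (a decision tree) and one ensemble (a random forest), assuming scikit-learn is available; the synthetic dataset is hypothetical.

```python
# Minimal sketch (assumes scikit-learn is installed): training a base
# classifier and an ensemble classifier on a small synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 200 records, 5 attributes, binary class label
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)         # base classifier
forest = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)  # ensemble (bagged trees)

print("decision tree accuracy:", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```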
Examples of Classification Task • Predicting tumor cells as benign or malignant • Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil • Classifying credit card transactions as legitimate or fraudulent • Categorizing news stories as finance, weather, entertainment, sports, etc. • Identifying intruders in cyberspace
Classification for Protein Function Prediction • Features could be several motifs, structure, domains, etc. • Important features (motifs, conserved domains etc) can be extracted using automated feature extraction techniques and using domain knowledge • Classification model can then be built on the extracted features and trained using several annotated proteins • Two proteins sharing significantly common features have a greater probability of having the same function • Several supervised and unsupervised schemes can be applied for this purpose
Constructing a Decision Tree • Key computation: the class counts in each partition induced by a candidate test condition • Splitting on Employed: Employed = Yes gives Worthy 4, Not Worthy 3; Employed = No gives Worthy 0, Not Worthy 3 • Splitting on Education: Graduate gives Worthy 2, Not Worthy 2; High School/Undergrad gives Worthy 2, Not Worthy 4 • The tree is grown by repeatedly choosing the test condition that best separates the classes (e.g. Employed = Yes vs. Employed = No) and splitting on it
Design Issues of Decision Tree Induction • How should training records be split? • Method for specifying test condition • depending on attribute types • Measure for evaluating the goodness of a test condition • How should the splitting procedure stop? • Stop splitting if all the records belong to the same class or have identical attribute values • Early termination
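To make "evaluating the goodness of a test condition" concrete, the sketch below computes the GINI index for the Employed split from the credit-worthiness example; GINI is only one of several possible impurity measures, and the helper functions are illustrative.

```python
# Sketch of one common measure (GINI index) for evaluating a test condition;
# the counts follow the Employed = Yes / No example from the earlier slide.
def gini(counts):
    """GINI impurity of a node given its class counts, e.g. [4, 3]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Weighted GINI of a split given the class counts of each child node."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# Employed = Yes -> [Worthy: 4, Not Worthy: 3]; Employed = No -> [Worthy: 0, Not Worthy: 3]
parent = gini([4, 6])
split = weighted_gini([[4, 3], [0, 3]])
print("GINI before split:", parent, "after split:", split)  # lower is better
```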
Application of Rule-Based Classifier • A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule • The rule r1 covers a hawk => Bird • The rule r3 covers the grizzly bear => Mammal
Ordered Rule Set • Rules are rank ordered according to their priority • An ordered rule set is known as a decision list • When a test record is presented to the classifier • It is assigned to the class label of the highest ranked rule it has triggered • If none of the rules fired, it is assigned to the default class
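A minimal sketch of a decision list is shown below; the rules r1–r3 are hypothetical stand-ins for the ones pictured on the slide. The first rule that covers a record assigns its class, and a default class is used when no rule fires.

```python
# Sketch of an ordered rule set (decision list): rules are tried in rank
# order and the first rule that covers the record assigns the class label.
# The rules below are hypothetical stand-ins for r1..r3 on the slide.
rules = [
    (lambda r: r["gives_birth"] == "no" and r["can_fly"] == "yes", "Bird"),    # r1
    (lambda r: r["lives_in_water"] == "yes", "Fish"),                          # r2 (hypothetical)
    (lambda r: r["gives_birth"] == "yes" and r["blood"] == "warm", "Mammal"),  # r3
]
DEFAULT_CLASS = "Unknown"

def classify(record):
    for covers, label in rules:
        if covers(record):       # highest-ranked triggered rule wins
            return label
    return DEFAULT_CLASS         # no rule fired

hawk = {"gives_birth": "no", "can_fly": "yes", "lives_in_water": "no", "blood": "warm"}
print(classify(hawk))  # -> Bird
```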
Nearest Neighbor Classifiers • Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck • [Figure: compute the distance from the test record to the training records and choose the k "nearest" records]
Nearest-Neighbor Classifiers • Requires three things • The set of stored records • Distance metric to compute distance between records • The value of k, the number of nearest neighbors to retrieve • To classify an unknown record: • Compute distance to other training records • Identify k nearest neighbors • Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
Nearest Neighbor Classification… • Choosing the value of k: • If k is too small, sensitive to noise points • If k is too large, neighborhood may include points from other classes
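A small sketch of the three steps listed on the previous slide, using plain NumPy; the toy records and the duck/goose labels are hypothetical.

```python
# Minimal k-NN sketch: compute distances, find the k nearest training
# records, and take a majority vote of their class labels.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance metric
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)      # class labels of those neighbors
    return votes.most_common(1)[0][0]                 # majority vote

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.3]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> "duck"
```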
Bayes Classifier • A probabilistic framework for solving classification problems • Conditional probability: P(Y | X) = P(X, Y) / P(X) • Bayes theorem: P(Y | X) = P(X | Y) P(Y) / P(X)
Using Bayes Theorem for Classification • Approach: • compute posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem • Maximum a-posteriori: Choose Y that maximizes P(Y | X1, X2, …, Xd) • Equivalent to choosing value of Y that maximizes P(X1, X2, …, Xd|Y) P(Y) • How to estimate P(X1, X2, …, Xd | Y )?
Naïve Bayes Classifier • Assume independence among attributes Xi when the class is given: • P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj) • Can estimate P(Xi | Yj) for all Xi and Yj from data • A new point is classified to Yj if P(Yj) ∏i P(Xi | Yj) is maximal
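A minimal sketch of the naive Bayes decision rule on a tiny hypothetical categorical dataset; the probabilities are estimated from raw counts, so smoothing is omitted for brevity.

```python
# Sketch of the naive Bayes decision rule: estimate P(Xi | Yj) and P(Yj) from
# counts, then pick the class that maximizes P(Yj) * prod_i P(Xi | Yj).
from collections import Counter, defaultdict

data = [  # (Employed, Education) -> class; hypothetical records
    (("yes", "graduate"), "worthy"), (("yes", "high_school"), "worthy"),
    (("yes", "graduate"), "worthy"), (("no", "high_school"), "not_worthy"),
    (("no", "graduate"), "not_worthy"), (("yes", "high_school"), "not_worthy"),
]

prior = Counter(label for _, label in data)          # class counts for P(Yj)
cond = defaultdict(Counter)                          # counts for P(Xi | Yj)
for x, label in data:
    for i, v in enumerate(x):
        cond[(i, label)][v] += 1

def predict(x):
    best, best_p = None, -1.0
    for label, n in prior.items():
        p = n / len(data)                            # P(Yj)
        for i, v in enumerate(x):
            p *= cond[(i, label)][v] / n             # P(Xi | Yj), no smoothing
        if p > best_p:
            best, best_p = label, p
    return best

print(predict(("yes", "graduate")))                  # -> "worthy"
```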
Artificial Neural Networks (ANN) • Training an ANN means learning the weights of the neurons • [Figure: network of weighted connections between neurons and common activation functions]
Perceptron • Single-layer network: contains only input and output nodes • Activation function: f(w, x) = sign(w · x) • Learning: initialize the weights (w0, w1, …, wd); repeat — for each training example (xi, yi), compute f(w, xi) and update the weights w(k+1) = w(k) + λ (yi − f(w(k), xi)) xi — until a stopping condition is met
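A sketch of the perceptron loop above on a hypothetical linearly separable toy dataset; the update rule and stopping condition follow the bullets on the slide.

```python
# Sketch of the perceptron learning loop:
# w <- w + lambda * (y_i - f(w, x_i)) * x_i, repeated until no mistakes remain.
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    X = np.hstack([np.ones((len(X), 1)), X])     # prepend a bias input for w0
    w = np.zeros(X.shape[1])                     # initialize the weights
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            pred = np.sign(w @ xi) or 1.0        # f(w, x) = sign(w . x)
            if pred != yi:
                w += lr * (yi - pred) * xi       # weight update rule
                errors += 1
        if errors == 0:                          # stopping condition: no mistakes
            break
    return w

# Hypothetical linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
print(train_perceptron(X, y))
```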
Perceptron Learning Rule • Since f(w,x) is a linear combination of input variables, decision boundary is linear • For nonlinearly separable problems, perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly, and hence more powerful learning techniques are needed
Characteristics of ANN • Multilayer ANNs are universal approximators • Can handle redundant attributes because the weights are learned automatically • Gradient descent may converge to a local minimum • Model building can be very time consuming, but testing can be very fast
Support Vector Machines • Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines • Which hyperplane is better, B1 or B2? How do you define "better"?
Support Vector Machines • Find the hyperplane that maximizes the margin => B1 is better than B2
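A sketch of the maximum-margin idea, assuming scikit-learn is available; the toy points are hypothetical, and a large C is used to approximate a hard margin.

```python
# Sketch: fit a linear SVM, which chooses the separating hyperplane
# with the maximum margin between the two classes.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],          # class +1
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)           # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]                # hyperplane w.x + b = 0
margin = 2.0 / np.linalg.norm(w)                      # width of the margin
print("w =", w, "b =", b, "margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```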
Evaluating Classifiers • Confusion matrix (rows: actual class, columns: predicted class):
            Predicted +             Predicted −
Actual +    a: TP (true positive)   b: FN (false negative)
Actual −    c: FP (false positive)  d: TN (true negative)
Accuracy • Most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + FN + FP + TN)
Problem with Accuracy • Consider a 2-class problem • Number of Class 0 examples = 9990 • Number of Class 1 examples = 10 • If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % • This is misleading because the model does not detect any class 1 example • Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc)
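The sketch below reproduces the arithmetic above from a confusion matrix: a model that always predicts class 0 scores 99.9% accuracy while detecting none of the rare class.

```python
# Sketch illustrating the class-imbalance problem: high accuracy,
# zero recall on the rare class.
def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # detection rate for the rare class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision

# 9990 class-0 examples, 10 class-1 examples, model always predicts class 0
print(metrics(tp=0, fn=10, fp=0, tn=9990))             # accuracy 0.999, recall 0.0
```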
ROC (Receiver Operating Characteristic) • A graphical approach for displaying trade-off between detection rate and false alarm rate • Developed in 1950s for signal detection theory to analyze noisy signals • ROC curve plots TPR against FPR • Performance of a model represented as a point in an ROC curve • Changing the threshold parameter of classifier changes the location of the point
ROC Curve (TPR,FPR): • (0,0): declare everything to be negative class • (1,1): declare everything to be positive class • (1,0): ideal • Diagonal line: • Random guessing • Below diagonal line: • prediction is opposite of the true class
ROC (Receiver Operating Characteristic) • To draw an ROC curve, the classifier must produce continuous-valued output • Outputs are used to rank test records, from the most likely positive-class record to the least likely • Many classifiers produce only discrete outputs (i.e., the predicted class) • How to get continuous-valued outputs? Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, and SVMs can all be adapted to output class scores or probability estimates
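A sketch, assuming scikit-learn, of turning continuous scores into an ROC curve; the test labels and scores are hypothetical.

```python
# Sketch: ranking test records by a continuous score and sweeping the
# decision threshold traces out the ROC curve (TPR vs. FPR).
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])                   # hypothetical test labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])  # e.g. estimated P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("FPR:", fpr)
print("TPR:", tpr)
print("area under ROC curve:", auc(fpr, tpr))
```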
Methods for Classifier Evaluation • Holdout • Reserve k% for training and (100 − k)% for testing • Random subsampling • Repeated holdout • Cross validation • Partition data into k disjoint subsets • k-fold: train on k − 1 partitions, test on the remaining one • Leave-one-out: k = n • Bootstrap • Sampling with replacement • .632 bootstrap: accuracy = (1/b) Σi (0.632 × acci + 0.368 × accs), where acci is the accuracy on the i-th bootstrap sample and accs is the accuracy on the full data
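A sketch of k-fold cross-validation, assuming scikit-learn; the synthetic dataset and the choice of a decision tree as the classifier are hypothetical.

```python
# Sketch of k-fold cross-validation: the data are split into k disjoint
# partitions, and each partition is used once as the test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)   # 10-fold
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```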
Methods for Comparing Classifiers • Given two models: • Model M1: accuracy = 85%, tested on 30 instances • Model M2: accuracy = 75%, tested on 5000 instances • Can we say M1 is better than M2? • How much confidence can we place on accuracy of M1 and M2? • Can the difference in performance measure be explained as a result of random fluctuations in the test set?
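One common way to approach these questions is to attach a confidence interval to each accuracy estimate; the sketch below uses a normal approximation to the binomial, which is an assumption for illustration rather than a method stated on the slide.

```python
# Sketch: approximate confidence interval for accuracy measured on n
# test instances, using a normal approximation to the binomial.
import math

def accuracy_interval(acc, n, z=1.96):          # z = 1.96 for ~95% confidence
    se = math.sqrt(acc * (1 - acc) / n)         # standard error of the estimate
    return acc - z * se, acc + z * se

print("M1 (85% on 30):  ", accuracy_interval(0.85, 30))    # wide interval
print("M2 (75% on 5000):", accuracy_interval(0.75, 5000))  # narrow interval
```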
Data Mining Book For further details and sample chapters see www.cs.umn.edu/~kumar/dmbook