Data Mining and Bioinformatics

Data Mining and Bioinformatics April 30, 2004

What is Data Mining? • Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage. (SAS Institute) • Example: detecting suspicious transactions with credit cards

A Newer Definition • Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

The “Beers and Diapers” Story • Analyze sales records • Beers & diapers frequently occur together in customer orders • Put beers next to diapers • Sales volume increases dramatically • Explanation?
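The analysis step in this story amounts to counting how often pairs of items appear in the same order. A minimal sketch, assuming a handful of made-up orders (the `orders` data and the `pair_support` helper are hypothetical, not taken from the original story):

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy orders, invented for illustration only.
orders = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]


def pair_support(orders):
    """Return the fraction of orders that contain each unordered item pair."""
    counts = Counter()
    for order in orders:
        # Sort so that each pair has one canonical key, e.g. ("beer", "diapers").
        for pair in combinations(sorted(order), 2):
            counts[pair] += 1
    return {pair: n / len(orders) for pair, n in counts.items()}


support = pair_support(orders)
top_pair, top_support = max(support.items(), key=lambda kv: kv[1])
print(top_pair, top_support)
```

A high co-occurrence rate only says the items are bought together; it does not explain why, which is the question the slide leaves open.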

Why Do Data Mining • Do you know the differences between the following concepts? • Data • Information • Knowledge • Difference between data mining and data analysis • The latter is more specific

What do We Aim to Mine? • Relationships and summaries • Models (global summary of a data set) • Linear equations, clusters, graphs, tree structures • Prediction, classification, interpretation • Patterns (local, restricted regions) • Recurrent patterns, rules • Unusualness - Anomaly detection • Analogy to data compression

The Whole KDD Process • KDD: Knowledge Discovery in Databases • Selecting the target data • Preprocessing the data • Transforming them if necessary • Performing data mining to extract patterns and relationships • Interpreting and assessing the discovered structures

Data Mining Techniques • Many of them originate from statistics, machine learning, or pattern recognition • General steps • Determine the nature and structure of the representation to be used • Decide how to quantify and compare how well different representations fit the data (score function) • Choose an algorithm to optimize the score function • Decide what principles of data management are required to implement the algorithm efficiently • Example: Regression analysis X = aY + b • Credit card spending vs. annual income
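The regression example can be read through the general steps: the representation is the line X = aY + b, the score function is the sum of squared errors, and the algorithm is the closed-form least-squares solution. A minimal sketch, assuming Y is annual income and X is card spending; the numbers and the names `fit_line` and `sum_squared_error` are hypothetical:

```python
def sum_squared_error(a, b, ys, xs):
    """Score function: how badly the line X = a*Y + b fits the data."""
    return sum((x - (a * y + b)) ** 2 for y, x in zip(ys, xs))


def fit_line(ys, xs):
    """Closed-form least-squares estimates of a and b in X = a*Y + b."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    cov = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs))
    var = sum((y - mean_y) ** 2 for y in ys)
    a = cov / var
    b = mean_x - a * mean_y
    return a, b


# Hypothetical values in thousands: annual income (Y) and card spending (X).
income = [30.0, 45.0, 60.0, 80.0, 100.0]
spending = [3.5, 5.0, 6.5, 8.5, 10.5]

a, b = fit_line(income, spending)
print(a, b, sum_squared_error(a, b, income, spending))
```

The toy data lie exactly on a line, so the fitted score is essentially zero; real data would leave a residual error to compare across candidate models.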

Techniques • Regression/Fitting • Clustering • Neural networks • Bayesian networks • Hidden Markov models

Example: Naïve Bayesian

Naïve Bayesian - Continued • 9 yes samples (out of 14): • 2 sunny, 3 cool, 3 high, 3 true • Prob of yes: 9/14 * 2/9 * 3/9 * 3/9 * 3/9 = 0.0053 • 5 no samples (out of 14): • 3 sunny, 1 cool, 4 high, 3 true • Prob of no: 5/14 * 3/5 * 1/5 * 4/5 * 3/5 = 0.0206 • Yes / No = 20.5% / 79.5%
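The arithmetic on this slide can be checked by multiplying each class prior by the per-attribute frequencies and then normalizing the two scores. A minimal sketch using the counts that reproduce the slide's totals of 0.0053 and 0.0206 (a count of 3 out of 9 for "true" in the yes class); the variable names are hypothetical:

```python
from fractions import Fraction as F

# Prior followed by the conditional frequencies for the query
# (sunny, cool, high, true), one list per class.
yes_terms = [F(9, 14), F(2, 9), F(3, 9), F(3, 9), F(3, 9)]
no_terms = [F(5, 14), F(3, 5), F(1, 5), F(4, 5), F(3, 5)]


def product(terms):
    """Multiply a list of fractions together."""
    result = F(1)
    for term in terms:
        result *= term
    return result


score_yes = product(yes_terms)  # unnormalized score for "yes"
score_no = product(no_terms)    # unnormalized score for "no"

# Normalize so the two class probabilities sum to 1.
p_yes = score_yes / (score_yes + score_no)
p_no = 1 - p_yes

print(round(float(score_yes), 4), round(float(score_no), 4))
print(round(float(p_yes) * 100, 1), round(float(p_no) * 100, 1))
```

The "naïve" part is the multiplication itself: it treats the attributes as independent given the class.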

Clustering • Iterative clustering • K-means • Hierarchical clustering • Agglomerative method • Probabilistic model-based clustering • EM (Expectation Maximization)
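As an illustration of the iterative approach, K-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. A minimal sketch on 2-D points, assuming fixed starting centroids and a fixed iteration count; the data and the `kmeans` function are hypothetical:

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [
                (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            (
                sum(x for x, _ in cluster) / len(cluster),
                sum(y for _, y in cluster) / len(cluster),
            )
            if cluster
            else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
```

Results depend on the starting centroids, so in practice the procedure is usually run from several initializations.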

Data Mining Applications • Interdisciplinary • statistics, databases, machine learning, pattern recognition, AI, visualization, etc. • Applications: • Marketing – sales model, Finance – loan decision • Insurance – risk analysis, Telecom – load prediction • Web/text mining, Surveillance – security • Bioinformatics …

In Bioinformatics • Analysis of Microarray Data • Mining free text • Structural genomics – protein crystallization • Predicting structure from sequence • Common theme: complex data, fast growing (outgrowing our processing power)

Hybridization of Sample to Probe

Data Collection and Preprocessing • Microarray Expression Data • Fluorescence level • Noisy

Data Representations

Microarray Experiment Result

Machine Learning Tasks • Design of Microarrays • Probes (67 features) w/ fluorescence value → learn to choose the best probes for a new gene • Biological Applications of Microarrays • Classify new examples • Predict the functional category of genes • Cluster genes based on similarity • Cluster experimental conditions • Learn a Bayesian network (that captures the joint prob distribution over the expression levels of genes)
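Clustering genes by similarity needs a similarity measure between expression profiles. The slide does not name one; a minimal sketch, assuming Pearson correlation across experimental conditions (the gene names, values, and the `pearson` helper are hypothetical):

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


# Hypothetical expression levels of three genes over four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

sim_ab = pearson(expression["geneA"], expression["geneB"])
sim_ac = pearson(expression["geneA"], expression["geneC"])
print(sim_ab, sim_ac)
```

Here geneA and geneB rise together and would fall in the same cluster, while geneC moves in the opposite direction. Transposing the table gives the same computation for clustering experimental conditions.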

A Support Vector Machine

Cluster Analysis

Bayesian Network

Machine Learning Tasks (cont’d) • Medical Applications of Microarrays • Cell disease classification • Predicting existing disease classes • Predicting the prognosis • Predicting the drug response of different patients

Disease Diagnosis Models

Factors That Affect Drug Response

Wrap It Up • Data mining has great potential • Danger: don’t over predict • S&P index = function of the previous year’s butter production, cheese production, sheep population in Bangladesh and US? • Finally - don’t expect it to answer all questions
