Data Mining and Bioinformatics April 30, 2004
What is Data Mining? • Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage. (SAS Institute) • Example: detecting suspicious transactions with credit cards
A Newer Definition • Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
The “Beers and Diapers” Story • Analyze sales records • Beers & diapers frequently occur together in customer orders • Put beers next to diapers • Sales volume increases dramatically • Explanation?
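One common way to quantify "frequently occur together" is the support and confidence of an item pair. The sketch below is only an illustration: the slide does not say which measure was used, and the orders are made-up data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical customer orders (invented for illustration).
orders = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "diapers"},
]

item_counts = Counter()
pair_counts = Counter()
for order in orders:
    item_counts.update(order)
    # Sort so that each pair is always stored in the same order.
    pair_counts.update(combinations(sorted(order), 2))

pair = ("beer", "diapers")
# Support: fraction of all orders that contain both items.
support = pair_counts[pair] / len(orders)
# Confidence of "beer -> diapers": fraction of beer orders that also contain diapers.
confidence = pair_counts[pair] / item_counts["beer"]

print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A pair with high support and high confidence is a candidate for shelf placement decisions like the one in the story; whether it explains the sales increase is a separate question.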
Why Do Data Mining • Do you know the differences between the following concepts? • Data • Information • Knowledge • Difference between data mining and data analysis • The latter is more specific
What do We Aim to Mine? • Relationships and summaries • Models (global summary of a data set) • Linear equations, clusters, graphs, tree structures • Prediction, classification, interpretation • Patterns (local, restricted regions) • Recurrent patterns, rules • Unusualness - Anomaly detection • Analogy to data compression
The Whole KDD Process • KDD: Knowledge Discovery in Databases • Selecting the target data • Preprocessing the data • Transforming it if necessary • Performing data mining to extract patterns and relationships • Interpreting and assessing the discovered structures
Data Mining Techniques • Many of them originate from statistics, machine learning, or pattern recognition • General steps • Determine the nature and structure of the representation to be used • Decide how to quantify and compare how well different representations fit the data (score function) • Choose an algorithm to optimize the score function • Decide what principles of data management are required to implement the algorithm efficiently • Example: Regression analysis X = aY + b • Credit card spending vs Annual income
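The general steps can be traced on the regression example. This is a minimal sketch under stated assumptions: the representation is the line X = aY + b with X as spending and Y as income, the score function is the sum of squared errors, and the optimization is the closed-form least-squares solution. The figures are invented, and `fit_line` and `sse` are illustrative names.

```python
def fit_line(ys, xs):
    """Least-squares fit of x = a*y + b; returns (a, b)."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    cov = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs))
    var = sum((y - mean_y) ** 2 for y in ys)
    a = cov / var
    b = mean_x - a * mean_y
    return a, b


def sse(ys, xs, a, b):
    """Score function: sum of squared errors of the line x = a*y + b."""
    return sum((x - (a * y + b)) ** 2 for y, x in zip(ys, xs))


# Hypothetical data, in thousands of dollars per year.
income = [30.0, 45.0, 60.0, 80.0, 100.0]   # Y
spending = [4.2, 5.3, 7.1, 8.8, 11.1]      # X

a, b = fit_line(income, spending)
print(f"X = {a:.3f} * Y + {b:.3f}, SSE = {sse(income, spending, a, b):.3f}")
```

With only five points the data management step is trivial; it starts to matter when the data no longer fits in memory.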
Techniques • Regression/Fitting • Clustering • Neural networks • Bayesian networks • Hidden Markov models
Naïve Bayesian - Continued • 9 yes samples (out of 14): • 2 sunny, 3 cool, 3 high, 3 true • Prob of yes: 9/14 * 2/9 * 3/9 * 3/9 * 3/9 = 0.0053 • 5 no samples (out of 14): • 3 sunny, 1 cool, 4 high, 3 true • Prob of no: 5/14 * 3/5 * 1/5 * 4/5 * 3/5 = 0.0206 • After normalization: Yes / No = 20.5% / 79.5%
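The arithmetic on this slide can be checked with a short script. This is a sketch, assuming the four counts per class belong to the attribute values of the instance being classified (sunny, cool, high, true), and that 3 of the 9 yes samples have the value true, which is the count consistent with the stated result of 0.0053. The function name `nb_score` is illustrative.

```python
from fractions import Fraction


def nb_score(class_count, total, value_counts):
    """Class prior times the per-attribute conditional frequencies."""
    score = Fraction(class_count, total)
    for count in value_counts:
        score *= Fraction(count, class_count)
    return score


# Counts of sunny, cool, high, true within each class (assumed, see above).
yes = nb_score(9, 14, [2, 3, 3, 3])
no = nb_score(5, 14, [3, 1, 4, 3])

# The two scores do not sum to 1, so normalize them into probabilities.
p_yes = yes / (yes + no)
p_no = no / (yes + no)

print(round(float(yes), 4), round(float(no), 4))   # 0.0053 0.0206
print(f"{float(p_yes):.1%} / {float(p_no):.1%}")   # 20.5% / 79.5%
```

The products are unnormalized scores rather than probabilities, which is why the last step divides each by their sum.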
Clustering • Iterative clustering • K-means • Hierarchical clustering • Agglomerative method • Probabilistic model-based clustering • EM (Expectation Maximization)
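As an illustration of iterative clustering, here is a minimal K-means sketch in plain Python. The data points and the initial centers are assumptions made for the example; practical implementations usually pick the initial centers at random and restart several times.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, init_centers, max_iters=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = list(init_centers)
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Two hypothetical, well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, clusters = kmeans(points, init_centers=[(0.0, 0.0), (6.0, 6.0)])
print(centers)
print(clusters)
```

K-means only finds a local optimum of the within-cluster squared distance, so the result depends on the initial centers.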
Data Mining Applications • Interdisciplinary • statistics, databases, machine learning, pattern recognition, AI, visualization, etc. • Applications: • Marketing – sales model, Finance – loan decision • Insurance – risk analysis, Telecom – load prediction • Web/text mining, Surveillance – security • Bioinformatics …
In Bioinformatics • Analysis of Microarray Data • Mining free text • Structural genomics – protein crystallization • Predicting structure from sequence • Common theme: complex data, fast growing (outgrowing our processing power)
Data Collection and Preprocessing • Microarray Expression Data • Fluorescence level • Noisy
Machine Learning Tasks • Design of Microarrays • Given probes (67 features) with fluorescence values, learn to choose the best probes for a new gene • Biological Applications of Microarrays • Classify new examples • Predict the functional category of genes • Cluster genes based on similarity • Cluster experimental conditions • Learn a Bayesian network (that captures the joint probability distribution over the expression levels of genes)
Machine Learning Tasks (cont’d) • Medical Applications of Microarrays • Cell disease classification • Predicting existing disease classes • Predicting the prognosis • Predicting the drug response of different patients
Wrap It Up • Data mining has great potential • Danger: don’t over predict • S&P index = function of the previous year’s butter production, cheese production, sheep population in Bangladesh and US? • Finally - don’t expect it to answer all questions