DATA MINING: Algorithms, Applications and Beyond

DATA MINING: Algorithms, Applications and Beyond • Chandan K. Reddy, Department of Computer Science, Wayne State University, Detroit, MI 48202

Organization • Introduction • Basic components • Fundamental Topics • Classification • Clustering • Association Analysis • Research Topics • Probabilistic Graphical Models • Boosting Algorithms • Active Learning • Mining under Constraints • Teaching

Lots of Data …. • Customer Transactions • Bioinformatics • Banking • Internet / Web • Biomedical Imaging

So What? • Computers have become cheaper and more powerful, so storage is not an issue • There is often information “hidden” in the data that is not readily evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all • We are drowning in data, but starving for knowledge!

Data Mining is … • “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” • “the science of extracting useful information from large data sets or databases” (Wikipedia.org) • A more appropriate term would be Knowledge Discovery in Databases (KDD)

Steps in Knowledge Discovery

Steps in the KDD Procedure • Data Cleaning (removal of noise and inconsistent records) • Data Integration (combining multiple sources) • Data Selection (only data relevant for the task are retrieved from the database) • Data Transformation (converting data into a form more appropriate for mining) • Data Mining (application of intelligent methods in order to extract data patterns) • Model Evaluation (identification of truly interesting patterns representing knowledge) • Knowledge Presentation (visualization or other knowledge presentation techniques)

What can Data Mining do? • Figure out intelligent ways of handling the data • Find valuable information hidden in large volumes of data • Analyze the data and find patterns and regularities in it • Mining analogy: in a mining operation, large amounts of low-grade material are sifted through in order to find something of value • Identify abnormal or suspicious activities • Provide guidelines to humans on what to look for in a dataset

Related CS Topics [Figure: Data Mining at the center, surrounded by related fields] • Database Systems • Pattern Recognition • Optimization • Algorithms • Artificial Intelligence • Statistics • Machine Learning • Visualization

Typical Data Mining Tasks are … • Prediction Methods (you know what to look for): use some variables to predict unknown or future values of other variables • Description Methods (you don’t know what to look for): find human-interpretable patterns that describe the data • From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996

Basic components • Data Pre-processing • Data Visualization • Model Evaluation • Classification • Clustering • Association Analysis

Different kinds of Data • Record Data (Data Matrix, Document Data, Transaction Data) • Graph Data • Ordered Data (Temporal Data, Sequence Data, Spatio-Temporal Data)

Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes

Document Data • Each document becomes a “term” vector • Each term is a component (attribute) of the vector • The value of each component is the number of times the corresponding term occurs in the document
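The term-vector idea can be sketched in a few lines of Python. This is only an illustration: the two documents, the whitespace tokenization, and the helper name `term_vector` are assumptions, not part of the slides.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Return the term counts of `document`, ordered by `vocabulary`."""
    counts = Counter(document.lower().split())
    # Counter returns 0 for terms that do not occur in the document.
    return [counts[term] for term in vocabulary]

# Hypothetical two-document corpus, for illustration only.
docs = ["the team won the game", "the game was lost"]

# The vocabulary fixes which vector component belongs to which term.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [term_vector(doc, vocabulary) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(doc, "->", dict(zip(vocabulary, vec)))
```

Every document is mapped onto the same vocabulary, so the resulting vectors have equal length and can be stacked into a data matrix.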

Transaction Data • A special type of record data, where each record (transaction) involves a set of items • The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items

Graph Data • Data with relationships among objects • Examples: (a) Generic Web Data (b) Citation Data Analysis

Ordered Data • Time Series data – series of some measurements taken over certain time frame • E.g. financial Data

Ordered Data • Sequence data – no time stamps, but order is still important. E.g. Genome data

Ordered Data • Spatio-Temporal Data • E.g. average monthly temperature of land and ocean collected for a variety of geographical locations (a total of 250,000 data points)

Data Pre-Processing • Removal of noise and outliers: improves the performance of mining • Sampling for data selection: processing the entire data set might be expensive • Dealing with high-dimensional data: curse of dimensionality • Data normalization: different features have different value ranges, e.g. human age, height, weight • Feature selection: remove unnecessary (redundant or irrelevant) features
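As one illustration of the normalization step, the sketch below rescales each feature to the range [0, 1] (min-max normalization). The slides do not say which normalization scheme is meant, so this choice, the sample records, and the name `min_max_normalize` are assumptions.

```python
def min_max_normalize(column):
    """Rescale a list of numbers linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

# Hypothetical records: (age in years, height in cm, weight in kg).
records = [(25, 160.0, 55.0), (40, 175.0, 80.0), (55, 190.0, 105.0)]

# Normalize column by column, then put the records back together.
columns = list(zip(*records))
normalized = list(zip(*(min_max_normalize(col) for col in columns)))

for row in normalized:
    print(row)
```

After rescaling, age, height, and weight contribute on the same scale to any distance computed between records.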

Data Visualization • Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported • Examples: histograms, pie charts

Scatter Plot Array of Iris Attributes

Contour Plot Example (scale in Celsius)

Parallel Coordinates Plots for Iris Data

Chernoff Faces for Iris Data Setosa Versicolour Virginica

A Sample Data Cube [Figure: three dimensions, Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum); one marked cell holds the total annual sales of TV in U.S.A.; the apex cell is (All, All, All)]

Classification [Figure: Training Phase: Existing Data → Training Algorithm → Learn Model; Testing Phase: New Data (label unknown, “???”) → Apply Model → Result]
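The two phases can be sketched with a deliberately simple classifier. The slides do not name a specific training algorithm at this point, so the nearest-centroid rule, the function names `learn_model` and `apply_model`, and the data below are all assumptions chosen for brevity.

```python
def learn_model(training_data):
    """Training phase: compute one centroid (mean vector) per class label."""
    sums, counts = {}, {}
    for features, label in training_data:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def apply_model(model, features):
    """Testing phase: predict the label whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hypothetical labeled (existing) data: ((feature_1, feature_2), class label).
existing_data = [
    ((1.0, 1.2), "A"), ((0.8, 1.0), "A"),
    ((4.0, 4.2), "B"), ((4.4, 3.8), "B"),
]

model = learn_model(existing_data)       # training phase
result = apply_model(model, (3.9, 4.0))  # testing phase on a new, unlabeled record
print(result)
```

The model is built only from existing, labeled data; the new record's label is unknown until the model is applied.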

Classification models [Figure: decision tree with root node Outlook (branches Sunny, Rainy, Overcast), internal nodes Windy (True / False) and Humidity (High / Normal), and Yes / No leaf nodes]

Metrics for Performance Evaluation Most widely-used metric:
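The formula is not reproduced in this transcript. Assuming the metric meant here is accuracy, i.e. the fraction of records whose predicted label matches the true label (equivalently (TP + TN) / (TP + TN + FP + FN) for two classes), a minimal sketch looks like this. The labels below are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one label")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical test-set labels and model predictions.
true_labels = ["yes", "yes", "no", "no", "yes"]
predicted_labels = ["yes", "no", "no", "no", "yes"]

print(accuracy(true_labels, predicted_labels))  # 4 of 5 predictions are correct
```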

Evaluating Data Mining techniques • Predictive Accuracy (ability of a model to predict the future) or Descriptive Quality (ability of a model to find meaningful descriptions of the data, e.g. clusters) • Speed (computation cost involved in generating and using the model) • Robustness (ability of a model to work well even with noisy or missing data) • Scalability (ability of a model to scale up well with large amounts of data) • Interpretability (level of understanding and insight provided by the model)

Clustering • No class labels, so no prediction • Finds groupings in the data (descriptive) • Can be used to summarize the data • Can help in removing outliers and noise • Applications: image segmentation, document clustering, gene expression data, etc.
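To make "groupings without class labels" concrete, here is a minimal sketch of k-means, one common clustering algorithm. The slides do not single out a particular algorithm, so k-means, the deterministic initialization, and the sample points are assumptions made for illustration.

```python
def k_means(points, k, iterations=10):
    """Group points into k clusters by alternating assignment and centroid update."""
    # Deterministic initialization for the example: use the first k points.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
    # `clusters` reflects the last assignment step.
    return centroids, clusters

# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

No labels are used anywhere: the groups emerge from distances alone, and the centroids serve as a compact summary of the data.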

Association Analysis • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction • Examples of association rules from market-basket transactions: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk} • Implication means co-occurrence, not causality!
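Rules of this kind are usually scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both; the transaction table from the slide is not part of this transcript, so the five transactions used here are hypothetical.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= transaction for transaction in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / support_count(antecedent, transactions)

# Hypothetical market-basket transactions, for illustration only.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

rule_support = support({"Diaper", "Beer"}, transactions)
rule_confidence = confidence({"Diaper"}, {"Beer"}, transactions)
print(rule_support, rule_confidence)
```

A high confidence only says that the items co-occur in these baskets; it does not say that buying one causes buying the other.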

Probabilistic Graphical Models • Real World Data is very complicated • We would like to understand the underlying distribution that generated the data • If it is unimodal, then it is easy to solve • But, usually the distribution is multimodal – not unimodal

Parameter Estimation • Modeling with Probabilistic Graphical Models • Mixture Models • Hidden Markov Models • Mixture-of-Experts • Bayesian Networks • Mixture of Factor Analyzers • Neural Networks • And so on … • We don’t want sub-optimal models

Example

Motivation • “Searching for a needle in a haystack”

Problems with Local Optimization • Local methods are limited to a “fine-tuning” capability, and there is a need for a method that explores a subspace in a systematic manner

TRUST-TECH Approach • Systematic tier-by-tier search

Mixture Models • Let $x = [x_1, x_2, \ldots, x_d]^T$ be the $d$-dimensional feature vector • Assumption: $k$ components in the mixture model • Let $\Theta = \{\alpha_1, \alpha_2, \ldots, \alpha_k, \theta_1, \theta_2, \ldots, \theta_k\}$ represent the collection of parameters

Maximum Likelihood Estimation • Let $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ be the set of $n$ i.i.d. samples • Goal: find the parameters $\Theta$ that maximize the likelihood function • Difficulty: (i) no closed-form solution and (ii) the likelihood surface is highly nonlinear
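For reference, the objective can be written out as follows. This is a sketch under the standard mixture-model assumptions, not a formula taken from the slide: $\alpha_i$ denotes the mixing weight of component $i$ (non-negative, summing to one), $\theta_i$ its parameters, and $p(x \mid \theta_i)$ its density.

```latex
% Mixture density with k components
p(x \mid \Theta) = \sum_{i=1}^{k} \alpha_i \, p(x \mid \theta_i),
\qquad \sum_{i=1}^{k} \alpha_i = 1, \quad \alpha_i \ge 0

% Log-likelihood of the n i.i.d. samples
\log L(\Theta) = \sum_{j=1}^{n} \log \sum_{i=1}^{k} \alpha_i \, p\bigl(x^{(j)} \mid \theta_i\bigr)

% Maximum likelihood estimate
\hat{\Theta} = \arg\max_{\Theta} \, \log L(\Theta)
```

The sum inside the logarithm is what prevents a closed-form solution: setting the gradient to zero couples all components together.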

EM Algorithm • Initialization: set the initial parameters $\Theta$ • Iteration: iterate the following until convergence • E-Step: compute the Q-function, i.e. the expectation of the log-likelihood given the current parameters • M-Step: maximize the Q-function with respect to $\Theta$
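The E-step and M-step can be sketched for the simplest case, a one-dimensional Gaussian mixture. The Gaussian components, the fixed iteration count used in place of a convergence test, the variance floor, and the sample data are all assumptions made for this illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_gmm_1d(data, weights, means, variances, iterations=50):
    """Run a fixed number of EM iterations; returns updated copies of the parameters."""
    weights, means, variances = list(weights), list(means), list(variances)
    n, k = len(data), len(weights)
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: closed-form maximizers of the Q-function for Gaussians.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / n
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var_i = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var_i, 1e-6)  # floor guards against a collapsing component
    return weights, means, variances

# Hypothetical samples drawn around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
initial = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
weights, means, variances = em_gmm_1d(data, *initial)
print(weights, means, variances)
```

EM only climbs to a nearby local maximum of the likelihood, so the result depends on the initial parameters.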

Nonlinear Transformation [ JCB ’06 ] • One-to-one correspondence of the critical points between the original function and the dynamical system • Likelihood Function ↔ Energy Function • Local Minimum ↔ Stable Equilibrium Point • Saddle Point ↔ Decomposition Point • Local Maximum ↔ Source

Experimental Results [ IEEE PAMI ’08 ]

Finding Motifs using Probabilistic Models

Results

Results Different Motifs and the average score using random starts. The first tier and second tier improvements [ BMC AMB ’06 ]

Neural Network Diagram • Inputs: $x_i$ • Output: $y$ • Weights: $w_{ij}$ • Biases: $b_i$ • Targets: $t$ • # of Input Nodes: $n$ • # of Hidden Layers: 1 • # of Hidden Nodes: $k$ • # of Output Nodes: 1
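A forward pass through a network of this shape (n inputs, one hidden layer of k nodes, one output) can be sketched as below. The slide does not state the activation functions or the error measure, so the tanh hidden units, the linear output, the squared error, and all numeric values are assumptions.

```python
import math

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute the network output y for one input vector x.

    hidden_weights[j][i] is the weight from input i to hidden node j.
    """
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)  # assumed tanh activation
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    # Assumed linear output node.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

def squared_error(y, t):
    """Half squared difference between output y and target t (assumed error measure)."""
    return 0.5 * (y - t) ** 2

# Hypothetical network with n = 2 inputs and k = 3 hidden nodes.
hidden_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [1.0, -1.0, 0.5]
output_bias = 0.2

y = forward([1.0, 2.0], hidden_weights, hidden_biases, output_weights, output_bias)
print(y, squared_error(y, t=1.0))
```

Training then amounts to searching for the weights and biases that minimize the error over all training pairs, which is itself a nonlinear optimization problem with many local minima.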

Results – Classification Error (%) [ IJCNN ’07 ]
