DATA MINING: Algorithms, Applications and Beyond. Chandan K. Reddy, Department of Computer Science, Wayne State University, Detroit, MI – 48202.
Organization • Introduction • Basic components • Fundamental Topics • Classification • Clustering • Association Analysis • Research Topics • Probabilistic Graphical Models • Boosting Algorithms • Active Learning • Mining under Constraints • Teaching
Lots of Data …. • Customer Transactions • Bioinformatics • Banking • Internet / Web • Biomedical Imaging
So What? • Computers have become cheaper and more powerful, so storage is not an issue • There is often information “hidden” in the data that is not readily evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all • We are drowning in data, but starving for knowledge!
Data Mining is … • “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” • “the science of extracting useful information from large data sets or databases” (Wikipedia.org) • A more appropriate term would be Knowledge Discovery in Databases (KDD)
Steps in the KDD Procedure • Data Cleaning (removal of noise and inconsistent records) • Data Integration (combining multiple sources) • Data Selection (only data relevant for the task are retrieved from the database) • Data Transformation (converting data into a form more appropriate for mining) • Data Mining (application of intelligent methods in order to extract data patterns) • Model Evaluation (identification of truly interesting patterns representing knowledge) • Knowledge Presentation (visualization or other knowledge presentation techniques)
What can Data Mining do? • Figure out intelligent ways of handling the data • Find valuable information hidden in large volumes of data • Analyze the data and find patterns and regularities in it • Mining analogy: in a mining operation, large amounts of low-grade material are sifted through in order to find something of value • Identify abnormal or suspicious activities • Provide guidelines to humans on what to look for in a dataset
Related CS Topics [diagram: Data Mining surrounded by Database Systems, Pattern Recognition, Optimization, Algorithms, Artificial Intelligence, Statistics, Machine Learning, Visualization]
Typical Data Mining Tasks are … • Prediction Methods (you know what to look for): use some variables to predict unknown or future values of other variables • Description Methods (you don’t know what to look for): find human-interpretable patterns that describe the data • From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Basic components • Data Pre-processing • Data Visualization • Model Evaluation • Classification • Clustering • Association Analysis
Different kinds of Data • Record Data • Data Matrix • Document Data • Transaction Data • Graph Data • Ordered • Temporal Data • Sequence Data • Spatio-Temporal Data
Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes
Document Data • Each document becomes a “term” vector • Each term is a component (attribute) of the vector • The value of each component is the number of times the corresponding term occurs in the document
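The term-vector idea can be sketched in a few lines of Python. This is a minimal illustration only: the two documents are made up, and whitespace tokenization with lowercasing is an assumption, not something the slide specifies.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Return the term counts of `document`, one component per vocabulary term."""
    counts = Counter(document.lower().split())  # naive whitespace tokenizer (assumption)
    return [counts[term] for term in vocabulary]

# Made-up example documents, not taken from the slides.
docs = ["the team won the game", "the coach praised the team"]

# The vocabulary fixes the order of the vector components.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [term_vector(doc, vocabulary) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```

Each position in a vector corresponds to one vocabulary term, so documents of different lengths become fixed-length records that the usual record-data techniques can consume.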
Transaction Data • A special type of record data, where each record (transaction) involves a set of items • The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items
Graph Data • Data with relationships among objects • Examples: (a) Generic Web Data (b) Citation Data Analysis
Ordered Data • Time series data: a series of measurements taken over a certain time frame • E.g. financial data
Ordered Data • Sequence data – no time stamps, but order is still important. E.g. Genome data
Ordered Data • Spatio-temporal data • E.g. average monthly temperature of land and ocean, collected for a variety of geographical locations (a total of 250,000 data points)
Data Pre-Processing • Removal of noise and outliers: will improve the performance of mining • Sampling is employed for data selection: processing the entire data set might be expensive • Dealing with high-dimensional data: curse of dimensionality • Data normalization: different features have different value ranges, e.g. human age, height, weight • Feature selection: remove unnecessary features, whether redundant or irrelevant
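To make the normalization bullet concrete, here is a minimal sketch of two common rescaling schemes, min-max scaling and z-scores. The slide does not say which scheme is meant, so both are shown as assumptions, and the age and height values are made up.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: nothing to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values at 0 and scale them to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up features with very different ranges.
ages = [20, 30, 40, 60]
heights_cm = [150, 160, 170, 190]

print(min_max(ages))
print(min_max(heights_cm))
print(z_score(ages))
```

After rescaling, both features live on the same scale, so a distance-based method no longer lets the feature with the larger raw range dominate.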
Data Visualization • Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported • Examples: histograms, pie charts
Contour Plot Example [figure: contour plot, scale in Celsius]
Chernoff Faces for Iris Data [figure: faces for the three species Setosa, Versicolour, Virginica]
A Sample Data Cube [figure: dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), Country (U.S.A., Canada, Mexico, sum); one cell is labeled “Total annual sales of TV in U.S.A.” and the apex cell is (All, All, All)]
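The roll-up cells of such a cube can be sketched as follows. This is a minimal illustration in which every sales figure is hypothetical; only the dimension names come from the slide. Each fact contributes to every cell obtained by replacing any subset of its coordinates with "All".

```python
from collections import defaultdict
from itertools import product

# Hypothetical facts: (quarter, product, country) -> units sold.
sales = {
    ("1Qtr", "TV", "U.S.A."): 10,
    ("2Qtr", "TV", "U.S.A."): 20,
    ("3Qtr", "TV", "U.S.A."): 30,
    ("4Qtr", "TV", "U.S.A."): 40,
    ("1Qtr", "PC", "Canada"): 5,
    ("2Qtr", "VCR", "Mexico"): 7,
}

def build_cube(facts):
    """Aggregate every fact into all of its roll-up cells."""
    cube = defaultdict(int)
    for key, amount in facts.items():
        # Each dimension is either kept or rolled up to "All".
        for mask in product([False, True], repeat=len(key)):
            cell = tuple("All" if rolled else value
                         for value, rolled in zip(key, mask))
            cube[cell] += amount
    return cube

cube = build_cube(sales)
print(cube[("All", "TV", "U.S.A.")])  # total annual sales of TV in U.S.A.
print(cube[("All", "All", "All")])    # grand total (apex cell)
```

Materializing every cell up front trades memory for lookup speed; real OLAP systems typically precompute only some of these aggregates.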
Classification • Training Phase: Existing Data → Training Algorithm → Learn Model • Testing Phase: New Data (???) → Apply Model → Result
Classification models [figure: decision tree with root node Outlook (branches Sunny, Rainy, Overcast), internal nodes Windy (True / False) and Humidity (High / Normal), and Yes / No leaves]
Metrics for Performance Evaluation Most widely-used metric:
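The formula itself did not survive on this slide. Assuming the metric meant is accuracy, the fraction of predictions that match the true class, a minimal sketch computes it from the four confusion-matrix counts. The label lists below are made up, and the function names are illustrative.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Accuracy = correct predictions / all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Made-up labels for illustration.
actual = ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"]
predicted = ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No"]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn, accuracy(tp, tn, fp, fn))
```

Accuracy can be misleading when one class is rare, which is one reason the other evaluation criteria also matter.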
Evaluating Data Mining techniques • Predictive Accuracy (ability of a model to predict the future) or • Descriptive Quality (ability of a model to find meaningful descriptions of the data, e.g. clusters) • Speed (computation cost involved in generating and using the model) • Robustness (ability of a model to work well even with noisy or missing data) • Scalability (ability of a model to scale up well with large amounts of data) • Interpretability (level of understanding and insight provided by the model)
Clustering • No class labels, so no prediction • Finds groupings in the data (descriptive) • Can be used to summarize the data • Can help in removing outliers and noise • Applications: image segmentation, document clustering, gene expression data, etc.
Association Analysis • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction • Examples of association rules over market-basket transactions: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk} • Implication means co-occurrence, not causality!
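A rule such as {Diaper} → {Beer} is usually scored with two standard measures that the slide does not name: support (how often all items of the rule occur together) and confidence (how often the consequent occurs when the antecedent does). The sketch below assumes a small, made-up list of baskets, since the slide's transaction table is not available.

```python
# Made-up market-basket transactions for illustration.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of antecedent-containing transactions that also contain the consequent."""
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)

print(support({"Diaper", "Beer"}, baskets))
print(confidence({"Diaper"}, {"Beer"}, baskets))
```

Both measures only count co-occurrence, which is why a high-confidence rule still says nothing about causality.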
Probabilistic Graphical Models • Real World Data is very complicated • We would like to understand the underlying distribution that generated the data • If it is unimodal, then it is easy to solve • But, usually the distribution is multimodal – not unimodal
Parameter Estimation • Modeling with Probabilistic Graphical Models • Mixture Models • Hidden Markov Models • Mixture-of-Experts • Bayesian Networks • Mixture of Factor Analyzers • Neural Networks • And so on … • We don’t want sub-optimal models
Motivation • “Searching for a needle in a haystack”
Problems with Local Optimization • Local methods suffer from “fine-tuning” capability, and there is a need for a method that explores a subspace in a systematic manner
TRUST-TECH Approach • Systematic tier-by-tier search
Mixture Models • Let $x = [x_1, x_2, \ldots, x_d]^T$ be the $d$-dimensional feature vector • Assumption: $k$ components in the mixture model • Let $\Theta = \{\alpha_1, \alpha_2, \ldots, \alpha_k, \theta_1, \theta_2, \ldots, \theta_k\}$ represent the collection of parameters
Maximum Likelihood Estimation • Let $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ be the set of $n$ i.i.d. samples • Goal: find the parameter set $\Theta$ that maximizes the likelihood function • Difficulty: (i) no closed-form solution and (ii) the likelihood surface is highly nonlinear
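The equations on these two slides were lost. In the usual notation they would read as follows; this is a reconstruction under the assumption that $\alpha_i$ denotes the mixing weight and $\theta_i$ the parameters of the $i$-th of $k$ components, not a copy of the original slide.

```latex
% Mixture density with k components (assumed notation)
p(x \mid \Theta) = \sum_{i=1}^{k} \alpha_i \, p(x \mid \theta_i),
\qquad \alpha_i \ge 0, \quad \sum_{i=1}^{k} \alpha_i = 1

% Log-likelihood of the n i.i.d. samples
\log p(X \mid \Theta) = \sum_{j=1}^{n} \log \sum_{i=1}^{k} \alpha_i \, p\bigl(x^{(j)} \mid \theta_i\bigr)

% Maximum likelihood estimate
\hat{\Theta}_{\mathrm{MLE}} = \arg\max_{\Theta} \, \log p(X \mid \Theta)
```

The logarithm of a sum does not decouple into per-component terms, which is why setting the gradient to zero yields no closed-form solution.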
EM Algorithm • Initialization: set the initial parameters • Iteration: iterate the following until convergence • E-Step: compute the Q-function, i.e. the expectation of the log-likelihood given the current parameters • M-Step: maximize the Q-function with respect to the parameters $\Theta$
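The two steps can be sketched for the simplest case, a one-dimensional Gaussian mixture. The Gaussian components, the toy data, the starting values and the variance floor are all assumptions made for illustration; the slides do not fix any of them.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under the current mixture parameters."""
    total = 0.0
    for x in data:
        density = sum(w * gaussian_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances))
        total += math.log(density)
    return total

def em_step(data, weights, means, variances):
    """One EM iteration: E-step (responsibilities), then M-step (re-estimation)."""
    k = len(weights)
    # E-step: posterior probability of each component for each sample.
    resp = []
    for x in data:
        joint = [weights[i] * gaussian_pdf(x, means[i], variances[i]) for i in range(k)]
        norm = sum(joint)
        resp.append([j / norm for j in joint])
    # M-step: closed-form maximizers of the Q-function for Gaussian components.
    new_weights, new_means, new_variances = [], [], []
    for i in range(k):
        n_i = sum(r[i] for r in resp)
        mean_i = sum(r[i] * x for r, x in zip(resp, data)) / n_i
        var_i = sum(r[i] * (x - mean_i) ** 2 for r, x in zip(resp, data)) / n_i
        new_weights.append(n_i / len(data))
        new_means.append(mean_i)
        new_variances.append(max(var_i, 1e-6))  # floor to avoid a collapsed component
    return new_weights, new_means, new_variances

def fit(data, weights, means, variances, tol=1e-9, max_iter=200):
    """Iterate EM until the log-likelihood improves by less than `tol`."""
    current = log_likelihood(data, weights, means, variances)
    for _ in range(max_iter):
        weights, means, variances = em_step(data, weights, means, variances)
        updated = log_likelihood(data, weights, means, variances)
        converged = updated - current < tol
        current = updated
        if converged:
            break
    return weights, means, variances, current

# Toy data: two well-separated groups around 1 and 5 (made up).
data = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
start = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
initial_ll = log_likelihood(data, *start)
weights, means, variances, final_ll = fit(data, *start)
print(weights, means, variances, final_ll)
```

EM only climbs to a nearby local maximum of the likelihood, so the result depends on the starting point.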
Nonlinear Transformation [JCB ’06] • One-to-one correspondence of the critical points between the original function and the dynamical system • Local minimum ↔ stable equilibrium point • Saddle point ↔ decomposition point • Local maximum ↔ source • Likelihood function ↔ energy function
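The correspondence can be illustrated with a one-dimensional toy function. The sketch assumes that the dynamical system is the negative gradient flow dx/dt = -f'(x), which the slide does not state, and uses the made-up function f(x) = (x^2 - 1)^2 with forward-Euler integration.

```python
def grad_f(x):
    """Derivative of the toy function f(x) = (x**2 - 1)**2."""
    return 4.0 * x * (x * x - 1.0)

def flow(x0, step=0.01, iterations=5000):
    """Follow dx/dt = -f'(x) from x0 with forward-Euler steps."""
    x = x0
    for _ in range(iterations):
        x -= step * grad_f(x)
    return x

# Trajectories started on either side of the local maximum at x = 0
# settle at the two local minima, which act as stable equilibrium points.
print(flow(0.1), flow(-0.1))

# The local maximum is itself an equilibrium, but an unstable one (a source):
# a trajectory started exactly there never moves.
print(flow(0.0))
```

In one dimension there are no saddle points; in higher dimensions the saddle points on the boundary between two basins play the role of decomposition points.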
Experimental Results [ IEEE PAMI ’08 ]
Results [BMC AMB ’06] • Different motifs and the average score using random starts • The first-tier and second-tier improvements
Neural Network Diagram • Inputs: $x_i$ • Output: $y$ • Weights: $w_{ij}$ • Biases: $b_i$ • Targets: $t$ • Number of input nodes: $n$ • Number of hidden layers: 1 • Number of hidden nodes: $k$ • Number of output nodes: 1
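A forward pass through such a network, with n inputs, one hidden layer of k nodes and a single output, can be sketched as below. The tanh hidden activation, the linear output node and the squared-error loss are assumptions; the slide only lists the symbols.

```python
import math

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute the output y of a single-hidden-layer network.

    hidden_weights[j][i] connects input i to hidden node j (k rows of n weights).
    """
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)  # assumed activation
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    # Assumed linear output node.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

def squared_error(y, t):
    """Error between the network output y and the target t (assumed loss)."""
    return 0.5 * (y - t) ** 2

# Illustrative values: n = 2 inputs, k = 3 hidden nodes.
x = [1.0, 2.0]
hidden_weights = [[0.5, -0.25], [0.0, 0.0], [1.0, 1.0]]
hidden_biases = [0.0, 0.0, -3.0]
output_weights = [1.0, 2.0, -1.0]
output_bias = 0.5

y = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
print(y, squared_error(y, t=1.0))
```

Training adjusts the weights and biases to reduce this error over all samples, and the error surface is non-convex, so a local optimizer can get stuck in a sub-optimal solution.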
Results – Classification Error (%) [ IJCNN ’07 ]