IE 483/583 Knowledge Discovery and Data Mining. Dr. Siggi Olafsson, Fall 2003. What is Data Mining? (… and should I be here?) Dilbert Replies. Some Definitions. “Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”
IE 483/583 Knowledge Discovery and Data Mining
Dr. Siggi Olafsson, Fall 2003
Data Mining
What is Data Mining? (… and should I be here?)
Dilbert Replies ...
Some Definitions
“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”
“Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.”
What can Data Mining Do?
Supervised
• Classification
• Prediction
Unsupervised
• Association discovery
• Clustering
Applications of Data Mining
• Manufacturing Process Improvement
• Sales and Marketing
• Mapping the Human Genome
• Diagnosing Breast Cancer
• Financial Crime Identification
• Portfolio Management
Technical Background
• Machine Learning
  • Data mining: business-oriented use of AI
• Statistics
  • Regression, sampling, DOE, etc.
• Decision Support
  • Data warehousing, data marts, OLAP, etc.
• Interdisciplinary tools put together to form the process of knowledge discovery in databases …
Historical Perspective
Period   Field  Developments
< 40s    Stat   Bayes theorem, regression, etc.
40s      AI     Neural networks
50s      AI     Nearest neighbor, single link, perceptron
         Stat   Resampling, bias reduction, jackknife
60s      Stat   Linear models for classification, exploratory data analysis (EDA)
         IR     Similarity measures, clustering
         DB     Relational data model
70s      IR     Smart IR systems
         AI     Genetic algorithms
         Stat   EM algorithm, k-means clustering
80s      AI     Kohonen maps, decision trees
90s      DB     Association rule algorithms, web & search engines, data warehousing, OLAP
What Changed?
• Very large databases
• Increased computational power as enabler
• Business perspective
Knowledge Discovery in Databases
[Diagram labels: Databases, Data warehouse, Prepared Data, Model/Structures, Knowledge; phases: "Data Warehouse Systems Engineering" and "Knowledge Discovery and Data Mining"]
Course Information
• We assume data is ready for mining
• Thus, we focus on:
  • models and structures, and
  • algorithms
• More information on course homepage: http://www.public.iastate.edu/~olafsson/mining.html
Course Outline
• Introduction
• Exploratory Data Mining
• Supervised Learning
• Unsupervised Learning
• Optimization Methods in Learning
• Selected Advanced Topics
  • Mining the Web
  • Customer Relationship Management (CRM)
• Course Review
Questions?
Data Mining
• Discover patterns in data
  • automatic or semi-automatic process
  • meaningful or useful pattern
  • large amounts of data
• What does such a pattern look like?
  • Black box / Transparent box
Describing Structural Patterns
• Some ways of representing knowledge:
  • Decision tables
  • Decision trees
  • Classification rules
  • Association rules
  • Regression trees
  • Clusters
The Weather Problem
A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
• These are classification rules
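A decision list is evaluated top to bottom, and the first rule that matches decides the class. The following is a minimal Python sketch of that idea; the names `RULES`, `DEFAULT`, and `classify` are illustrative and not part of the course material.

```python
# Ordered decision list from the slide: (conditions, predicted value of play).
# The order matters; the first matching rule wins.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
]
DEFAULT = "yes"  # "If none of the above then play = yes"


def classify(instance):
    """Return the play decision for a dict of attribute values."""
    for conditions, label in RULES:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return label
    return DEFAULT


# A rainy, windy day with normal humidity matches rule 2 before rule 4.
print(classify({"outlook": "rainy", "humidity": "normal", "windy": True}))
```

Reordering the rules can change the prediction for this instance, which is why a decision list cannot be read as an unordered set of rules.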
Association Rules
• Many association rules can be inferred:
if temperature = cool then humidity = normal
if humidity = normal and windy = false then play = yes
if outlook = sunny and play = no then humidity = high
Three Layers of the Process
• Inputs
• Outputs
• Algorithms
Inputs
• Three forms
• Concepts
  • concept description: what you want to learn
• Instances
  • examples: what you learn from
• Attributes
  • features of instances: variables you have values for
Concepts: Styles of Learning
• Classification (supervised) learning
• Association learning
• Clustering
• Numeric prediction
Instances: Learn from Examples
• Set of instances to be classified, or associated, or clustered
• Example of concept to be learned
• Data set: flat file (single relation)
  • denormalization
• Family tree example
  • concept: sister
  • example: family tree
Family Tree
Denormalizing Relational Data
Denormalization Problems
• Computational and storage costs
• Trivial regularities
  • example relations: customers/products, product/supplier, supplier/supplier address
• Infinite relations
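To make the denormalization step concrete, here is a small Python sketch that flattens a family relation into one row per ordered pair of people, with the concept "sister" as the class. The people, their attributes, and the function name `denormalize` are hypothetical and chosen only for illustration; the family tree from the slides is not reproduced here.

```python
from itertools import permutations

# Hypothetical family relation: one record per person.
PEOPLE = {
    "Ann": {"gender": "female", "parents": ("Pat", "Sam")},
    "Bob": {"gender": "male", "parents": ("Pat", "Sam")},
    "Cleo": {"gender": "female", "parents": ("Kim", "Lee")},
}


def denormalize(people):
    """Build one flat instance per ordered pair (first, second).

    The class attribute says whether `second` is a sister of `first`.
    """
    rows = []
    for first, second in permutations(people, 2):
        a, b = people[first], people[second]
        is_sister = b["gender"] == "female" and a["parents"] == b["parents"]
        rows.append({
            "first": first,
            "first_gender": a["gender"],
            "first_parents": a["parents"],
            "second": second,
            "second_gender": b["gender"],
            "second_parents": b["parents"],
            "sister_of": "yes" if is_sister else "no",
        })
    return rows


rows = denormalize(PEOPLE)
print(len(rows))  # n * (n - 1) rows for n people
```

With n people the flat file has n(n-1) rows, each repeating the per-person attributes, which illustrates the computational and storage cost listed above.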
Content of Instances: Attributes
• Instance characterized by values of its (predefined) set of attributes
  • Numeric (“continuous”)
  • Nominal (categorical)
  • Ordinal (rank)
  • Interval
  • Ratio
[Slide callout: Focus in this class]
Data Preparation
• Data …
  • assembly
    • set of instances/denormalizing relational data
  • integration
    • enterprise-wide database/data warehouse
  • cleaning
    • missing data
  • aggregation
• good information
ARFF Format
• Used by Java package (Weka)
• Independent, unordered instances
• No relationship between instances
Weather Data
Features
• % = comments
• @relation <name>
• @attribute <name> <type>
  • Attribute types: Nominal and numeric
• @data
  • List of instances
• Missing values represented by ?
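The ARFF features listed above can be illustrated with a short Python sketch that reads a tiny ARFF text. The embedded fragment is an illustrative example and not the data set from the slides, and `parse_arff` is a hypothetical helper that handles only the features named here (comments, @relation, @attribute with nominal or numeric type, @data, and ? for missing values); it is not a full ARFF reader.

```python
ARFF_TEXT = """\
% Illustrative fragment only
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,?,yes
rainy,70,yes
"""


def parse_arff(text):
    """Return (relation name, attribute list, list of instance dicts)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":  # missing value
                    row[name] = None
                elif kind == "numeric":
                    row[name] = float(value)
                else:  # nominal: keep the label as a string
                    row[name] = value
            rows.append(row)
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):  # nominal: list of allowed labels
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), len(rows))
```

Each row of the @data section becomes one independent instance, consistent with the flat, unordered view of instances described for ARFF.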
Other Issues
• Missing data
• Inaccurate values
• Look at the data!!!
Recall the Three Layers of the Data Mining Process
• Inputs (done)
• Outputs (structural patterns) (next)
• Algorithms
Describing Structural Patterns
• Ways of representing knowledge:
  • Decision tables
  • Decision trees
  • Classification rules
  • Association rules
  • Regression trees
  • Clusters
The Weather Problem
A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
A Decision Tree
Outlook
  Sunny: Humidity
    High: Play=No
  Overcast: Play=Yes
  Rainy: Windy
    TRUE: Play=No
[Remaining branches are not preserved in the slide text]
Concepts: Styles of Learning
• Classification (supervised) learning
• Association learning
• Clustering
• Numeric prediction
Classification Rules
• Classification easily read off decision trees
  • How?
• Other direction possible, but not as straightforward
If a and b then x
If c and d then x
Corresponding Decision Tree
a?
  y: b?
    y: x
    n: c?
      y: d?
        y: x
        n: (none)
      n: (none)
  n: c?
    y: d?
      y: x
      n: (none)
    n: (none)
Replicated Subtree Problem
x = 1?
  n: y = 1?
    n: b
    y: a
  y: y = 1?
    n: a
    y: b
If x=1 and y=0 then a
If x=0 and y=1 then a
If x=0 and y=0 then b
If x=1 and y=1 then b
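Reading rules off a tree amounts to emitting one rule per root-to-leaf path, with the tests along the path joined by "and". The Python sketch below does this for the x/y tree above; the tuple-based tree encoding and the names `TREE`, `tree_to_rules`, and `format_rule` are assumptions made for illustration.

```python
# Internal nodes are (attribute, {value: subtree}); leaves are class labels.
TREE = ("x", {
    0: ("y", {0: "b", 1: "a"}),
    1: ("y", {0: "a", 1: "b"}),
})


def tree_to_rules(node, path=()):
    """Return a list of (conditions, label), one per root-to-leaf path."""
    if not isinstance(node, tuple):  # leaf: the path so far is a complete rule
        return [(list(path), node)]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + ((attribute, value),)))
    return rules


def format_rule(conditions, label):
    tests = " and ".join(f"{attr}={value}" for attr, value in conditions)
    return f"If {tests} then {label}"


for conditions, label in tree_to_rules(TREE):
    print(format_rule(conditions, label))
```

The test on y appears in both branches of the root, which is the replicated subtree: the tree has to repeat a test that a rule set states once per rule.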
Replicated Subtree Problem
If x=1 and y=1 then a
If z=1 and w=1 then a
Otherwise b
x, y, z, w each take values 1, 2, 3
Rules with exceptions
• Account for new instances
• Exceptions from exceptions, etc.
If x and y then a EXCEPT if z then b
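One way to picture a rule with exceptions is as a rule that carries its own list of exception rules, each of which may carry further exceptions. The sketch below is an illustrative Python encoding under that assumption; the `Rule` class is hypothetical and not taken from the course or from Weka.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rule:
    condition: Callable[[dict], bool]
    label: str
    exceptions: List["Rule"] = field(default_factory=list)

    def classify(self, instance: dict) -> Optional[str]:
        """Return a label, or None if this rule does not apply."""
        if not self.condition(instance):
            return None
        # An applicable exception overrides the rule's own label;
        # exceptions may themselves have exceptions.
        for exception in self.exceptions:
            result = exception.classify(instance)
            if result is not None:
                return result
        return self.label


# If x and y then a EXCEPT if z then b
rule = Rule(
    condition=lambda i: i["x"] and i["y"],
    label="a",
    exceptions=[Rule(condition=lambda i: i["z"], label="b")],
)

print(rule.classify({"x": True, "y": True, "z": False}))
print(rule.classify({"x": True, "y": True, "z": True}))
```

A new instance that contradicts the rule can then be handled by appending an exception, leaving the original rule untouched.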
Association Rules
• Coverage (support): number of instances the rule predicts correctly
• Accuracy (confidence): coverage divided by the number of instances the rule applies to
Example: If temperature = cool then humidity = normal
• Coverage = 4
• Accuracy = 100%
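The two measures can be checked with a few lines of Python. The 14 rows below are the standard nominal weather data distributed with Weka, which is assumed here to match the table on the "Weather Data" slide (that table is not preserved in this transcript); the function names are illustrative.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]
WEATHER = [dict(zip(COLUMNS, row)) for row in ROWS]


def matches(instance, conditions):
    return all(instance[attr] == value for attr, value in conditions.items())


def coverage_and_accuracy(data, antecedent, consequent):
    """Coverage = instances predicted correctly; accuracy = coverage / instances the rule applies to."""
    applies = [r for r in data if matches(r, antecedent)]
    correct = [r for r in applies if matches(r, consequent)]
    accuracy = len(correct) / len(applies) if applies else 0.0
    return len(correct), accuracy


# If temperature = cool then humidity = normal
print(coverage_and_accuracy(WEATHER, {"temperature": "cool"}, {"humidity": "normal"}))
# prints (4, 1.0)
```

On this data the rule applies to four instances and is right on all four, which reproduces the coverage of 4 and accuracy of 100% stated above.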
Interpretation
If windy = false and play = no then outlook = sunny and humidity = high
If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high
If humidity = high and windy = false and play = no then outlook = sunny
The Shapes Problem
Shaded = standing
Unshaded = lying
Instances
Classification Rules
If width ≥ 3.5 and height < 7.0 then lying
If height ≥ 3.5 then standing
• Work well to classify these instances
• Problems?
Relational Rules
If width > height then lying
If height > width then standing
• Rules comparing attributes to constants are called propositional rules
• Structural patterns?
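The difference between the two kinds of rules shows up on a block that is larger than the training instances. The Python sketch below takes the propositional thresholds as width ≥ 3.5 and height ≥ 3.5 (an assumption about the intended comparison operators) and applies both rule sets to a hypothetical block that is not from the slides.

```python
def propositional(width, height):
    """Rules that compare each attribute to a constant (first match wins)."""
    if width >= 3.5 and height < 7.0:
        return "lying"
    if height >= 3.5:
        return "standing"
    return None  # no rule fires


def relational(width, height):
    """Rules that compare the attributes to each other."""
    if width > height:
        return "lying"
    if height > width:
        return "standing"
    return None  # width == height is not covered by either rule


# Hypothetical new block: wider than it is tall, but bigger than the thresholds.
print(propositional(10.0, 8.0))  # standing
print(relational(10.0, 8.0))     # lying
```

The constant thresholds fit the instances they were derived from but misjudge the larger block, while the relational rules capture the underlying structure: a block lies when it is wider than it is tall.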
CPU Performance Example