Introduction to Data Mining

Introduction to Data Mining Michael R. Wick Professor and Chair Department of Computer Science University of Wisconsin – Eau Claire Eau Claire, WI 54701 wickmr@uwec.edu 715-836-2526

Acknowledgements Some of the material used in this talk is drawn from: • Dr. Jiawei Han at University of Illinois at Urbana Champaign • Dr. Bhavani Thuraisingham (MITRE Corp. and UT Dallas) • Dr. Chris Clifton, Indiana Center for Database Systems, Purdue University

Road Map • Definition and Need • Applications • Process • Types • Example: The Apriori Algorithm • State of Practice • Related Techniques • Data Preprocessing

What Is Data Mining? • Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data • Data mining: a misnomer • Alternative names • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • Watch out: Is everything “data mining”? • (Deductive) query processing. • Expert systems or small learning programs

What is Data Mining?Real Example from the NBA • Play-by-play information recorded by teams • Who is on the court • Who shoots • Results • Coaches want to know what works best • Plays that work well against a given team • Good/bad player matchups • Advanced Scout (from IBM Research) is a data mining tool to answer these questions Starks+Houston+Ward playing http://www.nba.com/news_feat/beyond/0126.html

Necessity for Data Mining • Large amounts of current and historical data being stored • Only small portion (~5-10%) of collected data is analyzed • Data that may never be analyzed is collected in the fear that something that may prove important will be missed • As databases grow larger, decision-making from the data is not possible; need knowledge derived from the stored data • Data sources • Health-related services, e.g., benefits, medical analyses • Commercial, e.g., marketing and sales • Financial • Scientific, e.g., NASA, Genome • DOD and Intelligence • Desired analyses • Support for planning (historical supply and demand trends) • Yield management (scanning airline seat reservation data to maximize yield per seat) • System performance (detect abnormal behavior in a system) • Mature database analysis (clean up the data sources)

Potential Applications • Data analysis and decision support • Market analysis and management • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud detection • Finding outliers in credit card purchases • Other Applications • Text mining (news group, email, documents) and Web mining • Stream data mining • DNA and bio-data analysis

Interpretation/ Evaluation Data Mining Preprocessing Patterns Selection Preprocessed Data Data Target Data Knowledge Discovery in Databases: Process Knowledge adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Steps of a KDD Process • Learning the application domain • relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning: (may take 60% of effort!) • Data reduction and transformation • Find useful features, dimensionality/variable reduction, invariant representation. • Choosing methods of data mining • summarization, classification, regression, association, clustering. • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge

Data Mining and Business Intelligence Increasing potential to support business decisions End User Making Decisions Business Analyst Data Presentation Visualization Techniques Data Mining Information Discovery Data Analyst Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts DBA OLAP, MDA Data Sources Paper, Files, Information Providers, Database Systems, OLTP

Multiple Perspectives in Data Mining • Data to be mined • Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW • Knowledge to be mined • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. • Multiple/integrated functions and mining at multiple levels • Techniques utilized • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc. • Applications adapted • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

Ingredients of an Effective KDD Process “In order to discover anything, you must be looking for something.” Murphy’s 1st Law of Serendipity Visualization and Human Computer Interaction Plan for Learning Generate and Test Hypotheses Discover Knowledge Determine Knowledge Relevancy Evolve Knowledge/ Data Goals for Learning Knowledge Base Database(s) Background Knowledge Discovery Algorithms

What Can Data Mining Do? • Clustering • Identify previously unknown groups • Classification • Give operational definitions to categories • Association • Find Association rules • Many others…

Clustering • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Grouping a set of data objects into clusters • Clustering is unsupervised classification: no predefined classes • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms

Some Clustering Approaches • Iterative Distance-based Clustering • Specify in advance the number of desired clusters (k) • K random points chosen as cluster centers • Instances assigned to closest center • Centroid (or mean) of all points in cluster is calculated • Repeat until clusters are stable • Incremental Clustering • Uses tree to represent clusters • Nodes represent clusters (or subclusters) • Instances added one by one and tree updated • Updating can involve simple placement of instance in cluster or re-clustering • Uses category utility function to determine if instance fits with each cluster • Can result in merging or splitting of existing clusters • Category Utility • Uses quadratic loss function of conditional probabilities • Does the addition of new instance help us better predict the value of attributes for other instances?

General Applications of Clustering • Pattern Recognition • Spatial Data Analysis • create thematic maps in GIS by clustering feature spaces • detect spatial clusters and explain them in spatial data mining • Image Processing • Economic Science (especially market research) • WWW • Document classification • Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use: Identification of areas of similar land use in an earth observation database • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost • City-planning: Identifying groups of houses according to their house type, value, and geographical location • Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults

Classification (vs Prediction) • Classification: • predicts categorical class labels (discrete/nominal) • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Learns operational definition • Prediction: • models continuous-valued functions, i.e., predicts unknown or missing values • Typical Applications • credit approval • target marketing • medical diagnosis • treatment effectiveness analysis

Classification—A Two-Step Process • Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction is training set • The model is represented as classification rules, decision trees, or mathematical formula • Model usage: for classifying future or unknown objects • Estimate accuracy of the model • The known label of test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • Test set is independent of training set, otherwise over-fitting will occur • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Training Data Classifier (Model) Classification Process (1): Model Construction Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classifier Testing Data Unseen Data Classification Process (2): Use the Model in Prediction (Jeff, Professor, 4) Tenured?

Classification Approaches • Divide and Conquer • Results in decision tree • Uses “information gain” function • Covering • - Select category for which to learn rule • - Add conditions on rule until “good enough”

Association • Association rule mining: • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. • Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93] • Motivation: finding regularities in data • What products were often purchased together? — Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents?

Why Is Association Mining Important? • Foundation for many essential data mining tasks • Association, correlation, causality • Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association • Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression) • Broad applications • Basket data analysis, cross-marketing, catalog design, sale campaign analysis • Web log (click stream) analysis, DNA sequence analysis, etc.

Customer buys both Customer buys diaper Customer buys beer Basic Concepts:Association Rules • Itemset X={x1, …, xk} • Find all the rules XYwith min confidence and support • support, s, probability that a transaction contains XY • confidence, c, conditional probability that a transaction having X also contains Y. • Let min_support = 50%, min_conf = 50%: • A  C (50%, 66.7%) • C  A (50%, 100%)

For rule AC: support = support({A}{C}) = 50% confidence = support({A}{C})/support({A}) =66.6% Mining Association Rules:Example Min. support 50% Min. confidence 50%

Apriori: A Candidate Generation-and-test Approach • Any subset of a frequent itemset must be frequent • if {beer, diaper, nuts} is frequent, so is {beer, diaper} • Every transaction having {beer, diaper, nuts} also contains {beer, diaper} • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! • Method: • generate length (k+1) candidate itemsets from length k frequent itemsets, and • test the candidates against DB • Performance studies show its efficiency and scalability

The Apriori Algorithm—A Mathematical Definition • Let I = {a,b,c,…} be a set of all items in the domain • Let T = { S | S I } be a set of all transaction records of item sets • Let support(S) =  {A | A  T  S  A} | • Let L1 = { {a} | a  I  support({a})  minSupport } • k (k > 1  Lk-1  ) Let • Lk = { Si  Sj| (Si  Lk-1)  (Sj Lk-1)  • ( |Si – Sj| = 1 )  ( |Sj– Si| = 1)  • ( S[ ((S  Si  Sj)  (|S| = k-1))  S  Lk-1] )  • ( support(Si  Sj)  minSupport ) • Then, the set of all frequent item sets is given by • L = Lk • and the set of all association rules is given by • R = { A  C | A  (Lk)  (C = Lk – A)  (A  )  (C  ) • support(Lk) / support(A)  minConfidence }

The Apriori Algorithm—An Example • Example: minSupport = 2 • I= {Table Saw, Router, Kreg Jig, Sander, Drill Press} • T= { {Table Saw, Router, Drill Press}, • { Router, Sander }, • { Router, Kreg Jig }, • {Table Saw, Router, , Sander }, • {Table Saw, , Kreg Jig }, • { Router, Kreg Jig }, • {Table Saw, , Kreg Jig }, • {Table Saw, Router, Kreg Jig, , Drill Press}, • {Table Saw, Router, Kreg Jig } } • L1 = { {T}, {R}, {K}, {S}, {D} } • L2 = { {R,T}, {K,T}, {D,T}, {K,R}, {R,S}, {D,R} } • L3 = { {K,R,T}, {D,R,T} } • L4 =  • Rules = ????

The Apriori Algorithm • Pseudo-code: Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for(k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end returnkLk;

Important Details of Apriori • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • How to count supports of candidates? • Example of Candidate-generation • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4={abcd}

State of Commercial/Research Practice • Increasing use of data mining systems in financial community, marketing sectors, retailing • Still have major problems with large, dynamic sets of data (need better integration with the databases) • Off-the-shelf data mining packages perform specialized learning on small subset of data • Most research emphasizes machine learning; little emphasis on database side (especially text) • People achieving results are not likely to share knowledge

Related Techniques: OLAPOn-Line Analytical Processing • On-Line Analytical Processing tools provide the ability to pose statistical and summary queries interactively • Traditional On-Line Transaction Processing (OLTP) databases may take minutes or even hours to answer these queries • Advantages relative to data mining • Can obtain a wider variety of results • Generally faster to obtain results • Disadvantages relative to data mining • User must “ask the right question” • Generally used to determine high-level statistical summaries, rather than specific relationships among instances

Integration of Data Mining and Data Warehousing • Data mining systems, DBMS, Data warehouse systems coupling • No coupling, loose-coupling, semi-tight-coupling, tight-coupling • On-line analytical mining data • integration of mining and OLAP technologies • Interactive mining multi-level knowledge • Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc. • Integration of multiple mining functions • Characterized classification, first clustering and then association

Why Data Preprocessing? • Data in the real world is dirty • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • e.g., occupation=“” • noisy: containing errors or outliers • e.g., Salary=“-10” • inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1,2,3”, now rating “A, B, C” • e.g., discrepancy between duplicate records

Why Is Data Dirty? • Incomplete data comes from • n/a data value when collected • different consideration between the time when the data was collected and when it is analyzed. • human/hardware/software problems • Noisy data comes from the process of data • collection • entry • transmission • Inconsistent data comes from • Different data sources • Functional dependency violation

Why Is Data Preprocessing Important? • No quality data, no quality mining results! • Quality decisions must be based on quality data • e.g., duplicate or missing data may cause incorrect or even misleading statistics. • Data warehouse needs consistent integration of quality data • Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse. —Bill Inmon (father of the data warehouse)

Major Tasks in Data Preprocessing • Data cleaning • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration • Integration of multiple databases, data cubes, or files • Data transformation • Normalization and aggregation • Data reduction • Obtains reduced representation in volume but produces the same or similar analytical results • Data discretization • Part of data reduction but with particular importance, especially for numerical data

Data Cleaning • Importance • “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball • “Data cleaning is the number one problem in data warehousing”—DCI survey • Data cleaning tasks • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data • Resolve redundancy caused by data integration

Missing Data • Data is not always available • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to • equipment malfunction • inconsistent with other recorded data and thus deleted • data not entered due to misunderstanding • certain data may not be considered important at the time of entry • not register history or changes of the data • Missing data may need to be inferred.

How to Handle Missing Data? • Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably. • Fill in the missing value manually: tedious + infeasible? • Fill in it automatically with • a global constant : e.g., “unknown”, a new class?! • the attribute mean • the attribute mean for all samples belonging to the same class: smarter • the most probable value: inference-based such as Bayesian formula or decision tree

Noisy Data • Noise: random error or variance in a measured variable • Incorrect attribute values may due to • faulty data collection instruments • data entry problems • data transmission problems • technology limitation • inconsistency in naming convention • Other data problems which requires data cleaning • duplicate records • incomplete data • inconsistent data

How to Handle Noisy Data? • Binning method: • first sort data and partition into (equi-depth) bins • then one can smooth by bin means,smooth by binmedian,smooth by bin boundaries, etc. • Clustering • detect and remove outliers • Combined computer and human inspection • detect suspicious values and check by human (e.g., deal with possible outliers) • Regression • smooth by fitting the data into regression functions

Simple Discretization Methods: Binning • Equal-width (distance) partitioning: • Divides the range into N intervals of equal size: uniform grid • if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N. • The most straightforward, but outliers may dominate presentation • Skewed data is not handled well. • Equal-depth (frequency) partitioning: • Divides the range into N intervals, each containing approximately same number of samples • Good data scaling • Managing categorical attributes can be tricky.

Thank you! Michael R. Wick Professor and Chair Department of Computer Science University of Wisconsin – Eau Claire Eau Claire, WI 54701 wickmr@uwec.edu 715-836-2526

Introduction to Data Mining