
Introduction to Data Mining

Introduction to Data Mining. Donghui Zhang CCIS, Northeastern University. http://www.cs.uiuc.edu/~hanj. The current talk slide was extracted and modified from Dr. Han’s lecture slides. Motivation. Data explosion problem


Presentation Transcript

  1. Introduction to Data Mining Donghui Zhang CCIS, Northeastern University Data Mining: Concepts and Techniques

  2. The slides in this talk were extracted and modified from Dr. Han’s lecture slides: http://www.cs.uiuc.edu/~hanj

  3. Motivation • Data explosion problem • Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories • We are drowning in data, but starving for knowledge! • Solution: Data warehousing and data mining • Data warehousing and on-line analytical processing • Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

  4. Evolution of Database Technology • 1960s: • Data collection, database creation, IMS and network DBMS • 1970s: • Relational data model, relational DBMS implementation • 1980s: • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) • Application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s: • Data mining, data warehousing, multimedia databases, and Web databases • 2000s: • Stream data management and mining • Data mining with a variety of applications • Web technology and global information systems

  5. Data Mining: Confluence of Multiple Disciplines • [Figure: data mining at the center, drawing on database systems, statistics, machine learning, visualization, algorithms, and other disciplines]

  6. What Is Data Mining? • Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data • Data mining: a misnomer? • Alternative names • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • Watch out: Is everything “data mining”? • (Deductive) query processing • Expert systems or small ML/statistical programs

  7. Why Data Mining? Potential Applications • Data analysis and decision support • Market analysis and management • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud detection and detection of unusual patterns (outliers) • Other Applications • Text mining (news group, email, documents) and Web mining • Stream data mining • DNA and bio-data analysis

  8. Data Mining: A KDD Process • Data mining is the core of the knowledge discovery process • [Figure: pipeline from Databases through Data Cleaning and Data Integration to the Data Warehouse, then Selection of Task-relevant Data, Data Mining, and Pattern Evaluation, ending in Knowledge]

  9. Steps of a KDD Process • Learning the application domain • relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing: (may take 60% of effort!) • Data reduction and transformation • Find useful features, dimensionality/variable reduction, invariant representation. • Choosing functions of data mining • summarization, classification, regression, association, clustering. • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge

  10. Architecture: Typical Data Mining System • [Figure: layered architecture, listed top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, data integration, and filtering; databases and data warehouse. A knowledge base sits alongside the layers.]

  11. Data Mining: On What Kinds of Data? • Relational database • Data warehouse • Transactional database • Advanced database and information repository • Object-relational database • Spatial and temporal data • Time-series data • Stream data • Multimedia database • Heterogeneous and legacy database • Text databases & WWW

  12. Data Mining Functionalities • Concept description: Characterization and discrimination • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions • Association (correlation and causality) • Diaper → Beer [0.5%, 75%] • Classification and Prediction • Construct models (functions) that describe and distinguish classes or concepts for future prediction • E.g., classify countries based on climate, or classify cars based on gas mileage • Presentation: decision-tree, classification rule, neural network • Predict some unknown or missing numerical values

  13. Data Mining Functionalities (2) • Cluster analysis • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns • Maximizing intra-class similarity & minimizing inter-class similarity • Mining complex types of data

  14. 1. Concept Description • Descriptive vs. predictive data mining • Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, discriminative forms • Predictive mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data • Concept description: • Characterization: provides a concise and succinct summarization of the given collection of data • Comparison: provides descriptions comparing two or more collections of data

  15. Class Characterization: An Example • [Figure: an initial relation and the prime generalized relation derived from it]

  16. 2. Frequent Patterns and Association Rules • Itemset X = {x1, …, xk} • Find all the rules X → Y with minimum confidence and support • support, s: probability that a transaction contains X ∪ Y • confidence, c: conditional probability that a transaction having X also contains Y • Let min_support = 50%, min_conf = 50%: • A → C (50%, 66.7%) • C → A (50%, 100%) • [Figure: Venn diagram of customers who buy diaper, customers who buy beer, and customers who buy both]
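
The support and confidence definitions above can be sketched in a few lines of Python. The transaction table from the slide is not part of this transcript, so the four transactions below are an assumption, chosen only because they reproduce the figures quoted for A → C and C → A; the function names are likewise illustrative.

```python
# Hypothetical transactions (not from the transcript), chosen so that the
# results match the slide's figures for A -> C and C -> A.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """Estimate P(Y | X) as support(X ∪ Y) / support(X)."""
    return support(set(x) | set(y), db) / support(x, db)

for x, y in [("A", "C"), ("C", "A")]:
    s = support({x, y}, transactions)
    c = confidence({x}, {y}, transactions)
    print(f"{x} -> {y}: support={s:.1%}, confidence={c:.1%}")
# A -> C: support=50.0%, confidence=66.7%
# C -> A: support=50.0%, confidence=100.0%
```

Both rules share the same support because support depends only on X ∪ Y; the confidences differ because the denominators (transactions containing A versus C) differ.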

  17. Apriori: A Candidate Generation-and-test Approach • Any subset of a frequent itemset must be frequent • if {beer, diaper, nuts} is frequent, so is {beer, diaper} • Every transaction having {beer, diaper, nuts} also contains {beer, diaper} • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! • Method: • generate length (k+1) candidate itemsets from length k frequent itemsets, and • test the candidates against DB
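
The generate-and-test loop can be sketched as follows. This is a minimal, unoptimized illustration rather than a reference implementation: the function name `apriori`, the use of an absolute `min_count` instead of a percentage, and the four-transaction database `tdb` are all assumptions made for the example.

```python
from itertools import combinations

def apriori(db, min_count):
    """Return {itemset: count} for every itemset found in >= min_count transactions."""
    db = [frozenset(t) for t in db]
    # Level-1 candidates: every single item that occurs in the database.
    candidates = {frozenset([item]) for t in db for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Test: one pass over the database counts each candidate.
        counts = {c: sum(c <= t for t in db) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Generate: join frequent k-itemsets into (k+1)-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: a candidate with any infrequent k-subset cannot be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        k += 1
    return frequent

# Hypothetical database (an assumption, not taken from the transcript).
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(tdb, min_count=2)
print(frequent[frozenset("BCE")])  # 2
```

With this data, {A, B, C} is never counted against the database because its subset {A, B} is infrequent, which is the pruning principle at work.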

  18. The Apriori Algorithm: An Example • [Figure: database TDB; the 1st scan produces candidates C1 and frequent itemsets L1, the 2nd scan produces C2 and L2, the 3rd scan produces C3 and L3]

  19. Sequential Pattern Mining • Given a set of sequences, find the complete set of frequent subsequences • A sequence: <(ef)(ab)(df)cb> • An element may contain a set of items. Items within an element are unordered and we list them alphabetically. • <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> • Given support threshold min_sup = 2, <(ab)c> is a sequential pattern • [Table: a sequence database]
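
The subsequence relation used above can be checked with a short greedy scan. The sketch below assumes single-character items and a hypothetical helper `seq` that parses the slide's notation; it uses only the sequences quoted on the slide and does not reconstruct the missing sequence database.

```python
def is_subsequence(pattern, sequence):
    """True if each element of `pattern` is a subset of a distinct element
    of `sequence`, with the matched elements appearing in the same order."""
    pos = 0
    for element in pattern:
        # Greedily advance to the earliest remaining element that contains it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def seq(text):
    """Parse slide notation, e.g. 'a(bc)dc' -> [{'a'}, {'b', 'c'}, {'d'}, {'c'}]."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(set(text[i + 1:j]))
            i = j + 1
        else:
            elements.append({text[i]})
            i += 1
    return elements

print(is_subsequence(seq("a(bc)dc"), seq("a(abc)(ac)d(cf)")))  # True
print(is_subsequence(seq("(ab)c"), seq("(ef)(ab)(df)cb")))     # True
```

Counting the support of a pattern then amounts to counting how many sequences in the database pass this test.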

  20. 3. Classification & Prediction • Classification: • predicts categorical class labels (discrete or nominal) • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Prediction: • models continuous-valued functions, i.e., predicts unknown or missing values • Typical Applications • credit approval • target marketing • medical diagnosis • treatment effectiveness analysis

  21. Training Dataset • This follows an example from Quinlan’s ID3 • [Table: training dataset]

  22. Output: A Decision Tree for “buys_computer” • [Figure: decision tree rooted at age? with branches <=30, 30..40, and >40; internal nodes student? (branches no/yes) and credit rating? (branches fair/excellent); leaves labeled yes or no]

  23. Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left
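
The attribute-selection step can be illustrated with information gain, the measure named above as an example. This sketch covers only the choice of a test attribute, not the full recursive induction; the function names and the four toy rows are assumptions, since the training table is not in this transcript.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """Greedy choice: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical toy data (not the slide's training set).
rows = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
]
print(best_attribute(rows, ["student", "credit"], "buys"))  # student
```

Here `student` separates the classes perfectly (gain of 1 bit) while `credit` tells us nothing (gain of 0), so the greedy step picks `student` as the test attribute.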

  24. Other Classification Techniques • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining

  25. 4. Cluster Analysis • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Grouping a set of data objects into clusters • Clustering is unsupervised classification: no predefined classes • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms

  26. What Is Good Clustering? • A good clustering method will produce high quality clusters with • high intra-class similarity • low inter-class similarity • The quality of a clustering result depends on both the similarity measure used by the method and its implementation. • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
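
One rough way to make the intra-class versus inter-class criterion concrete is to compare average pairwise distances, treating a small distance as high similarity. That choice of measure, the function name `mean_distances`, and the four sample points are assumptions for illustration; the slide does not prescribe a specific measure.

```python
from itertools import combinations
from math import dist

def mean_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    intra, inter = [], []
    for (p, a), (q, b) in combinations(zip(points, labels), 2):
        (intra if a == b else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical points forming two tight groups far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = mean_distances(points, [0, 0, 1, 1])  # groups nearby points together
bad = mean_distances(points, [0, 1, 0, 1])   # splits each tight group
print(good[0] < good[1], bad[0] < bad[1])  # True False
```

The first labeling has small intra-cluster and large inter-cluster distances, which is the pattern a good clustering should show; the second labeling reverses it.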

  27. Major Clustering Approaches • Partitioning algorithms: Construct various partitions and then evaluate them by some criterion • Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion • Density-based: based on connectivity and density functions • Grid-based: based on a multiple-level granularity structure • Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model

  28. The K-Means Partitioning Algorithm • Given k, the k-means algorithm is implemented in four steps: • Partition objects into k nonempty subsets • Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) • Assign each object to the cluster with the nearest seed point • Go back to Step 2; stop when no assignment changes
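
The four steps can be sketched in plain Python as below. Several details are assumptions not stated on the slide: the initial partition is a simple round-robin split, distance is squared Euclidean, a cluster that becomes empty keeps its previous centroid, and `max_iter` caps the loop.

```python
def kmeans(points, k, max_iter=100):
    """Return (assignment, centroids) for points given as equal-length tuples."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Step 1: arbitrary initial partition into k nonempty subsets (round-robin).
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: centroid (mean point) of each cluster in the current partition.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop when no assignment changes; otherwise repeat from Step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the initial partition in general; with well-separated groups like these, the round-robin start converges after one reassignment.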

  29. 5. Mining Complex Types of Data • Mining spatial databases • Mining multimedia databases • Mining time-series and sequence data • Mining stream data • Mining text databases • Mining the World-Wide Web

  30. E.g. Mining Time-Series: two tasks • [Figure: time-series plot]

  31. Task one: Trend analysis • Predict whether increase or decrease • Long-term or trend movements (trend curve) • Cyclic movements or cycle variations, e.g., business cycles • Seasonal movements or seasonal variations • i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years • Irregular or random movements
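
A simple moving average is one common way to expose the trend movement by smoothing out seasonal and irregular components. The slide does not name a method, so the technique, the function name `moving_average`, and the synthetic series below (a linear trend plus a period-4 seasonal pattern) are all assumptions for illustration.

```python
def moving_average(series, window):
    """Simple moving average: one value for each full window of the series."""
    if not 0 < window <= len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical series: linear trend plus a seasonal pattern of period 4.
seasonal = [2, -1, -2, 1]
series = [10 + t + seasonal[t % 4] for t in range(12)]

# A window equal to the season length averages the seasonal component away.
trend = moving_average(series, window=4)
direction = "increase" if trend[-1] > trend[0] else "decrease"
print(trend[:3], direction)  # [11.5, 12.5, 13.5] increase
```

The raw series rises and falls within each season, while the smoothed values climb steadily, which is the long-term movement the first bullet asks about.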

  32. Task two: Similarity Search • Normal database query finds exact matches • Similarity search finds data sequences that differ only slightly from the given query sequence • Two categories of similarity queries • find a sequence that is similar to the query sequence • find all pairs of similar sequences
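
The two query categories can be sketched with a distance threshold. The slide does not say how "differ only slightly" is measured, so Euclidean distance between equal-length sequences, the threshold `eps`, and the sample sequences are assumptions for this example.

```python
from itertools import combinations
from math import dist

def similar_to(query, sequences, eps):
    """Names of stored sequences within distance eps of the query sequence."""
    return [name for name, s in sequences.items() if dist(query, s) <= eps]

def similar_pairs(sequences, eps):
    """All pairs of stored sequences within distance eps of each other."""
    return [(a, b)
            for (a, sa), (b, sb) in combinations(sequences.items(), 2)
            if dist(sa, sb) <= eps]

# Hypothetical equal-length sequences.
sequences = {
    "s1": [1, 2, 3, 4],
    "s2": [1, 2, 3, 5],
    "s3": [9, 8, 7, 6],
}
print(similar_to([1, 2, 3, 4.5], sequences, eps=1.5))  # ['s1', 's2']
print(similar_pairs(sequences, eps=1.5))               # [('s1', 's2')]
```

An exact-match query for `[1, 2, 3, 4.5]` would return nothing here, which is the contrast the first bullet draws.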

  33. Data Warehouse

  34. What Is a Data Warehouse? • Defined in many different ways, but not rigorously. • A decision support database that is maintained separately from the organization’s operational database • Supports information processing by providing a solid platform of consolidated, historical data for analysis. • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon) • Data warehousing: • The process of constructing and using data warehouses

  35. Conceptual Modeling of Data Warehouses • Modeling data warehouses: dimensions & measures • Star schema: A fact table in the middle connected to a set of dimension tables • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

  36. Example of Star Schema • Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales • time dimension: time_key, day, day_of_the_week, month, quarter, year • item dimension: item_key, item_name, brand, type, supplier_type • branch dimension: branch_key, branch_name, branch_type • location dimension: location_key, street, city, state_or_province, country
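
A star schema is typically queried by joining the fact table to the dimension tables it references and aggregating the measures. The sketch below uses Python's built-in `sqlite3` module with a deliberately reduced version of the schema (two dimensions, a subset of columns); the sample rows, brand names, and year are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced star schema: only two of the four dimensions, fewer columns.
CREATE TABLE "time" (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE sales (
    time_key INTEGER REFERENCES "time"(time_key),
    item_key INTEGER REFERENCES item(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
-- Hypothetical sample rows.
INSERT INTO "time" VALUES (1, 'Q1', 2004), (2, 'Q2', 2004);
INSERT INTO item VALUES (1, 'laptop', 'BrandA'), (2, 'mouse', 'BrandB');
INSERT INTO sales VALUES
    (1, 1, 2, 2000.0), (1, 2, 5, 100.0), (2, 1, 1, 1000.0), (2, 1, 3, 3000.0);
""")

# Join the fact table to its dimensions, then aggregate the measures.
rows = conn.execute("""
    SELECT t.quarter, i.brand, SUM(s.units_sold), SUM(s.dollars_sold)
    FROM sales AS s
    JOIN "time" AS t ON t.time_key = s.time_key
    JOIN item AS i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.brand
    ORDER BY t.quarter, i.brand
""").fetchall()
for row in rows:
    print(row)
# ('Q1', 'BrandA', 2, 2000.0)
# ('Q1', 'BrandB', 5, 100.0)
# ('Q2', 'BrandA', 4, 4000.0)
```

The fact table holds only keys and measures; descriptive attributes such as `brand` and `quarter` live in the dimension tables and are reached through the joins.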

  37. Example of Snowflake Schema • Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales • time dimension: time_key, day, day_of_the_week, month, quarter, year • item dimension: item_key, item_name, brand, type, supplier_key • supplier dimension: supplier_key, supplier_type • branch dimension: branch_key, branch_name, branch_type • location dimension: location_key, street, city_key • city dimension: city_key, city, state_or_province, country

  38. Example of Fact Constellation • Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales • Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped • time dimension: time_key, day, day_of_the_week, month, quarter, year • item dimension: item_key, item_name, brand, type, supplier_type • branch dimension: branch_key, branch_name, branch_type • location dimension: location_key, street, city, province_or_state, country • shipper dimension: shipper_key, shipper_name, location_key, shipper_type

  39. Multidimensional Data • Sales volume as a function of product, month, and region • Dimensions: Product, Location, Time • Hierarchical summarization paths: • Product: Industry → Category → Product • Location: Region → Country → City → Office • Time: Year → Quarter → Month / Week → Day • [Figure: data cube with axes Product, Month, and Region]

  40. Cuboids & Cube • 0-D (apex) cuboid: all • 1-D cuboids: product; month; region • 2-D cuboids: (product, month); (product, region); (month, region) • 3-D (base) cuboid: (product, month, region)
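
The lattice of cuboids is simply every subset of the dimensions, so n dimensions give 2^n cuboids. The sketch below enumerates them and materializes any one cuboid as a group-by; the helper names, the `units_sold` measure, and the three sample facts are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

dimensions = ["product", "month", "region"]

def cuboids(dims):
    """Map k -> list of k-D cuboids, from the 0-D apex to the n-D base."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

def aggregate(facts, group_by):
    """Materialize one cuboid: total units_sold per combination of group_by values."""
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[d] for d in group_by)] += fact["units_sold"]
    return dict(totals)

# Hypothetical base facts.
facts = [
    {"product": "TV", "month": "Jan", "region": "East", "units_sold": 3},
    {"product": "TV", "month": "Feb", "region": "West", "units_sold": 2},
    {"product": "PC", "month": "Jan", "region": "East", "units_sold": 5},
]

lattice = cuboids(dimensions)
print(sum(len(group) for group in lattice.values()))  # 8
print(aggregate(facts, ()))                           # {(): 10}
print(aggregate(facts, ("product",)))                 # {('TV',): 5, ('PC',): 5}
```

The apex cuboid groups by nothing and holds the grand total, while the base cuboid groups by all three dimensions and holds the finest level of detail.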

  41. OLAP Server Architectures • Relational OLAP (ROLAP) • Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces • Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services • greater scalability • Multidimensional OLAP (MOLAP) • Array-based multidimensional storage engine (sparse matrix techniques) • fast indexing to pre-computed summarized data • Hybrid OLAP (HOLAP) • User flexibility, e.g., low level: relational, high-level: array • Specialized SQL servers • specialized support for SQL queries over star/snowflake schemas

  42. Data Warehouse Back-End Tools and Utilities • Data extraction: • get data from multiple, heterogeneous, and external sources • Data cleaning: • detect errors in the data and rectify them when possible • Data transformation: • convert data from legacy or host format to warehouse format • Load: • sort, summarize, consolidate, compute views, check integrity, and build indices and partitions • Refresh: • propagate the updates from the data sources to the warehouse

  43. Summary • Data mining: discovering interesting patterns from large amounts of data • A natural evolution of database technology, in great demand, with wide applications • Data mining functionalities: characterization, association, classification, clustering, mining complex data, etc. • Data warehousing

  44. Where to Find Data Mining Papers • Data mining and KDD (SIGKDD: CD-ROM) • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc. • Journal: Data Mining and Knowledge Discovery, KDD Explorations • Database systems (SIGMOD: CD-ROM) • Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA • Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc. • AI & Machine Learning • Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc. • Journals: Machine Learning, Artificial Intelligence, etc. • Statistics • Conferences: Joint Stat. Meeting, etc. • Journals: Annals of Statistics, etc. • Visualization • Conference proceedings: CHI, ACM-SIGGRAPH, etc. • Journals: IEEE Trans. Visualization and Computer Graphics, etc.
