740 likes | 751 Views
This tutorial provides an introduction to data mining based on the book "Data Mining Concepts and Techniques" by Jiawei Han and Micheline Kamber. It covers the motivation behind data mining, data collection and database creation, components of a data mining system, different types of data for mining, data warehouses, text and multimedia databases, functionalitie and patterns that can be mined, measures of pattern interestingness, classification of data mining systems, and major issues in data mining.
E N D
tutorial based on the book: Data Mining Concepts and Tehniques by Jiawei Han and Micheline Kamber This material was developed with financial help of the WUSA fund of Austria. made by Radmilo Pesic & Branko Golubovic
1 Introduction made by Radmilo Pesic & Branko Golubovic
What motivated data mining? Necessity is the mother of invention. Data Collection and Database Creation (1960s and earlier) Database Management Systems (1970s-early 1980s) Advanced Databases Systems (mid-1980s-present) Web-based Databases Systems (1990s-present) Data Warehousing and Data Mining (mid-1980s-present) New Generation of Integrated Information Systems (2000-…) made by Radmilo Pesic & Branko Golubovic
Knowledge What Is Data Mining? Extracting or “mining” knowledge from large amounts of data. 1. Data cleaning 2. Data integration 3. Data selection 4. Data transformation 5. Data mining 6. Pattern evaluation 7. Knowledge presentation Evaluation and Presentation Data Mining Patterns Selection and Transformation Data warehouse Cleaning and Integration Databases Flat files made by Radmilo Pesic & Branko Golubovic
Components of a typical data mining system: • Database, data warehouse, or other information repository • Database or data warehouse server • Knowledge base • Data mining engine • Pattern evaluation module • Graphical user interface Graphical user interface Pattern evaluation Knowledge base Data mining engine Database or Data warehouse server Database Data warehouse made by Radmilo Pesic & Branko Golubovic
Data mining – On What Kind of Data? • Relational Databases • Data Warehouses • Transactional Databases • Advanced Database Systems and Advanced Database Applications (object-oriented, object-relational, spatial, temporal, time-series, text, multimedia, heterogeneus, legacy databases and the world wide web) made by Radmilo Pesic & Branko Golubovic
customer Relational Databases item employee branch purchases item_sold works_at made by Radmilo Pesic & Branko Golubovic
Clean Transform Integrate Load Client Data source in Chicago Data warehouse Data source in New York Query and analysis tools Data source in Toronto Client Data source in Vancouver Data Warehouses Typical architecture of a data warehouse for AllElectronics made by Radmilo Pesic & Branko Golubovic
Chicago 440 address (cities) New York 1560 Toronto 395 Vancouver <Vancouver,Q1,security> Q1 605 825 14 400 Q2 time (quarters) Q3 Q4 computer security home entertainment phone item (types) Drill-down on time data for Q1 Roll-up on address USA 2000 Chicago address (countries) Canada 1000 address (cities) New York Toronto Q1 Vancouver Jan 150 Q2 time (quarters) Feb 100 Q3 time (months) March 150 Q4 computer security computer security home entertainment phone home entertainment phone item (types) item (types) made by Radmilo Pesic & Branko Golubovic
Text Databases and Multimedia Databases • Text databases can be: highly unstructured, semistructured or well structured • Multimedia databases store image, audio, and video data • Such data require a lot of storage space; it’s continuous-media data Heterogeneus Databases and Legacy Databases The World Wide Web • mining path traversal patterns made by Radmilo Pesic & Branko Golubovic
Data Mining FunctionalitiesWhat Kinds of Patterns Can Be Mined? • Concept/Class Description: Characterization and Discrimination • Association Analysis • Classification and Prediction • Cluster Analysis • Outlier Analysis • Evolution Analysis made by Radmilo Pesic & Branko Golubovic
Are All of the Patterns Interesting? A pattern is interesiting if it is: • easily understood • valid • (potentially) useful • novel or if it • confirms user’s hypothesis Interesting pattern represents knowledge! made by Radmilo Pesic & Branko Golubovic
Objective measures of pattern interestingness: support confidence Subjective measures of pattern interestingness: data is unexpected data is actionable data is expected Can a data mining system generate all of the interesting patterns? Can a data mining system generate only interesting patterns? made by Radmilo Pesic & Branko Golubovic
Classification of Data Mining Systems Database technology Statistics Data Mining • according to kinds of databases mined (relational, data warehouse, object-oriented…) • according to kinds of knowledge mined (association, classification, clustering…; generalized, primitive-level or knowledge at multiple levels; regularities or irregularities) • according to the kinds of techniques utilized (autonomous, interactive exploratory or query-driven systems; data warehouse oriented, statistics…) • according to the applications adapted (for finance, DNA, etc.) Machine learning Information science Visualization Other disciplines made by Radmilo Pesic & Branko Golubovic
Major Issues in Data Mining Mining methodology and user interaction issues: • Mining different kinds of knowledge in databases • Interactive mining of knowledge at multiple levels of abstraction • Incorporation of background knowledge • Data mining query languages and ad hoc data mining • Presentation and visualization of data mining results • Handling noisy or incomplete data • Pattern evaluation – the interestingness problem Performance issues: • Efficiency and scalability of data mining algorithms • Parallel, distributed, and incremental mining algorithms Issues relating to the diversity of database types: • Handling of relational and complex types of data • Mining information from heterogeneous databases and global information systems made by Radmilo Pesic & Branko Golubovic
2 Data Warehouse and OLAP Technology for Data Mining made by Radmilo Pesic & Branko Golubovic
What Is a Data Warehouse? “A datawarehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.” W.H. Inmon • Subject-oriented • Integrated • Time-variant • Nonvolatile made by Radmilo Pesic & Branko Golubovic
How are organizations using the information from data warehouse? • Increasing customer focus • Repositioning products and managing product portfolios • Analyzing operations and looking for sources of profit • Managing the customer relationships, making environmental corrections, and managing the cost of corporate assets Different approach to heterogeneous database integration: • Query-driven approach (wrappers and integrators) • Update-driven approach made by Radmilo Pesic & Branko Golubovic
Differences Between Operational Database Systems and Data Warehouse • Users and system orientation • Data contents • Database design • View • Access patterns Why have a separate data warehouse? made by Radmilo Pesic & Branko Golubovic
A Multidimensional Data ModelFrom Tables and Spreadsheets to Data Cubes • A data cube is defined by dimensions and facts • Dimension table • Fact table made by Radmilo Pesic & Branko Golubovic
Chicago 440 882 89 623 location (cities) New York 1560 968 38 872 Toronto 395 746 43 591 Vancouver 698 Q1 605 825 14 400 925 682 789 Q2 680 952 31 512 1002 time (quarters) 728 870 Q3 812 1023 30 501 984 784 Q4 927 1038 38 580 computer security home entertainment phone item (types) A 2-D view of sales data for AllElectronics, and it’s 3-D data cube representation made by Radmilo Pesic & Branko Golubovic
computer security home entertainment phone item (types) Chicago location (cities) New York Toronto Vancouver Q1 605 825 14 400 Q2 time (quarters) Q3 Q4 computer security computer security home entertainment phone home entertainment phone item (types) item (types) supplier=“SUP1” supplier=“SUP2” supplier=“SUP1” A 4-D data cube representation of sales data for AllElectronics made by Radmilo Pesic & Branko Golubovic
all 0-D (apex) cuboid item location time supplier 1-D cuboid time, supplier item, supplier 2-D cuboid time, item location, supplier time, location item, location time, location, supplier time, item, location item, location, supplier 3-D cuboid time, item, supplier 4-D (base) cuboid time, item, location, supplier Lattice of cuboids, making up a 4-D data cube made by Radmilo Pesic & Branko Golubovic
branch_key branch_name branch_type time_key item_key branch_key location_key dollars_sold units_sold item_key item_name brand type supplier_type location_key street city province_or_state country time_key day day_of_week month quarter year Stars, Snowflakes, and Fact Constellations:Schemas for Multidimensional Databases Star schema: • a large central table (fact table) • a set of smaller attendant tables (dimension tables), one for each dimension item dimension table time dimension table sales fact table location dimension table branch dimension table made by Radmilo Pesic & Branko Golubovic
supplier_key supplier_type branch_key branch_name branch_type location_key street city_key time_key item_key branch_key location_key dollars_sold units_sold item_key item_name brand type supplier_key time_key day day_of_week month quarter year city_key city province_or_state country time dimension table Snowflake schema: • a variant of star schema, where some dimension tables are normalized • reduce redundancies, but reduce the effectivness of browsing sales fact table item dimension table supplier dimension table branch dimension table location dimension table city dimension table made by Radmilo Pesic & Branko Golubovic
item_key time_key shipper_key from_location to_location dollars_sold units_shipped branch_key branch_name branch_type time_key item_key branch_key location_key dollars_sold units_sold item_key item_name brand type supplier_type location_key street city province_or_state country time_key day day_of__week month quarter year shipper_key shipper_name location_key shipper_type shipping fact table sales fact table shipper dimension table time dimension table Fact constelation: • multiple fact tables share dimension tables item dimension table branch dimension table location dimension table made by Radmilo Pesic & Branko Golubovic
Defining multidimensional schema • DMQL – data mining query language • Syntax: cube definition: define cube <cube_name> [<dimension_list>]: <measure_list> dimension definition: define dimension <dimension_name> as (<atribute_or_subdimension_list>) made by Radmilo Pesic & Branko Golubovic
Example: • Constellation schema defined in DMQL: define cube sales [time, item, branch, location]: dollars_sold=sum(sales_in_dollars), units_sold=count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollars_cost=sum(cost_in_dollars), unit_shipped=count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales made by Radmilo Pesic & Branko Golubovic
Measures:Their Categorization and Computation Measures, based on the aggregate function: • Distributive • Algebraic • Holistic made by Radmilo Pesic & Branko Golubovic
Introducing Concept Hierarchies • A concept hierarchy defines a sequence of mappings from a set of low-level to higher-level concepts. location all all country Canada USA province_or_state British Columbia Ontario New York Illinois city Vancouver Victoria Toronto Ottawa New York Buffalo Chicago made by Radmilo Pesic & Branko Golubovic
Hierarchial and lattice structures of atributes in warehouse dimensions: country year province_or_state quarter city month week day street Hierarchy for location Lattice for time made by Radmilo Pesic & Branko Golubovic
OLAP Operations in the Multidimensional Data Model • Roll-up • Drill-down • Slice and dice • Pivot (rotate) • Other (drill-across, drill-through) made by Radmilo Pesic & Branko Golubovic
Chicago 440 location (cities) New York 1560 Chicago location (cities) Toronto 395 New York Vancouver Toronto location (countries) Vancouver USA 2000 Q1 605 825 14 400 Canada January 150 February Q1 1000 100 Q2 time (quarters) March 150 Q2 Q3 April time (quarters) May Q3 Q4 June time (months) July computer Q4 security home entertainment August phone computer security September item (types) home entertainment phone October item (types) November December computer security home entertainment phone item (types) drill-down on time (from quarters to months) roll-up on location (from cities to countries) made by Radmilo Pesic & Branko Golubovic
Chicago Chicago 440 location (cities) New York 1560 New York Toronto 395 location (cities) Vancouver Toronto Q1 605 825 14 400 computer security Vancouver 605 825 14 400 Q2 home entertainment phone time (quarters) item (types) Q3 home entertainment USA 395 605 Q4 location (cities) Canada Q1 605 computer computer 825 security item (types) time (quarters) home entertainment phone Q2 phone 14 item (types) computer security 400 home entertainment New York Vancouver item (types) Chicago Toronto location (cities) dice for (location=“Toronto” or “Vancouver”) and(time=“Q1”or “Q2”) and (item=“home entertainment” or “computer”) slice for time=“Q1” pivot made by Radmilo Pesic & Branko Golubovic
location customer continent group country province_or_state category city name street item day name brand category type month quarter year time A Starnet Query Model for Querying Multidimensional Databases made by Radmilo Pesic & Branko Golubovic
Data Warehouse ArchitectureSteps for the Design and Construction of Data Warehouse The Design of a Data Warehouse: A Business Analysis Framework • top-down view • data source view • data warehouse view • business query view made by Radmilo Pesic & Branko Golubovic
The Process of Data Warehouse Design • top-down approach • bottom-up approach • combined approach • waterfall method • spiral method Steps of the warehouse design: • Choosing a business proces to model; • Choosing the grain of the business proces; • Choosing the dimensions; • Choosing the measures. made by Radmilo Pesic & Branko Golubovic
A Three-Tier Data Warehouse Architecture Query/report Analysis Data mining Top tier: front-end tools Output OLAP server OLAP server Middle tier: OLAP server Monitoring Administration Data warehouse Data marts Bottom tier: data warehouse server Metadata repository Extract Clean Transform Load Refresh Data Operational databases External sources made by Radmilo Pesic & Branko Golubovic
There are three data warehouse models: • Enterprise warehouse • Data mart • Virtual warehouse made by Radmilo Pesic & Branko Golubovic
Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP Relational OLAP (ROLAP) servers: • use of relational or extended-relational DBMS • greater scalability Multidimensional OLAP (MOLAP) servers: • use of data cube – fast indexing • possible low storage utilization – use of compression Hybrid OLAP (HOLAP) servers: • scalability of ROLAP and faster computation of MOLAP • Microsoft SQL Server 7.0 OLAP Services supports HOLAP server made by Radmilo Pesic & Branko Golubovic
Data Warehouse Implementation • SQL group by Data cube computation extends SQL with compute cube • Example: • “Compute the sum of sales, grouping by item and city.” • “Compute the sum of sales, grouping by item.” • “Compute the sum of sales, grouping by city.” • The possible group by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()} made by Radmilo Pesic & Branko Golubovic
( ) 0-D (apex) cuboid (city) (item) (year) 1-D cuboids (city,item) (city,year) (item,year) 2-D cuboids 3-D (base) cuboids (city,item,year) Lattice of cuboids define cube sales [item, city, year]: sum(sales_in_dollars) compute cube sales made by Radmilo Pesic & Branko Golubovic
Number of cuboids in an n-dimensional data cube is 2n • Number of cuboids in an n-dimensional data cube where we have a concept hihierarchy (day<week<month<quarter<year) is: • Example: if the cube has 10 dimensions and each dimension has 4 levels, the total number of cuboids that can be generated will be 510 9.8 x 106 made by Radmilo Pesic & Branko Golubovic
Partial Materialization: Selected Computation of Cuboids There are three choices for data cube materialization given a base cuboid: (1) do not precompute any of the “nonbase” cuboids (no materialization) (2) precompute all of the cuboids (full materialization) (3) selectively compute a proper subset of the whole set of possible cuboids (partial materialization); the partial materialization of cuboids shoul consider three factors: • identify the subset of cuboids to materialize, • exploit the materialized cuboids during query processing, and • efficiently update the materialized cuboids during load and refresh. made by Radmilo Pesic & Branko Golubovic
Multiway Array Aggregation in the Computation of Data Cubes ROLAP: • Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples. • Grouping is performed on some subaggregates as a “partial grouping step”. These “partial groupings” may be used to speed up the computation of other subaggregates. • Aggregates may be computed from previously computed aggregates, rather than from the base fact tables. MOLAP: • Partitition the array into chunks. • Compute aggregates by visiting cube cells. made by Radmilo Pesic & Branko Golubovic
c3 61 62 63 64 c2 C 45 46 47 48 c1 29 30 31 32 60 c0 44 56 b3 28 1314 15 16 40 52 b2 9 24 B 36 b1 5 20 b0 1 2 3 4 a0 a1 a2 a3 A A 3-D array for the dimensions A, B, and C, organized into 64 chunks made by Radmilo Pesic & Branko Golubovic
Indexing OLAP Data • Bitmap indexing • Join indexing made by Radmilo Pesic & Branko Golubovic
Bitmap Indexing Base table Item bitmap index table City bitmap index table Indexing OLAP data using bitmap indices made by Radmilo Pesic & Branko Golubovic
Join Indexing Join index table for item/sales Join index table for location/sales sales T57 location item Main Street Sony-TV T238 Join index table linking two dimensions location/item/sales T459 T884 Linkages between a sales fact table and dimension tables for location and item Join index tables based on the linkages between the sales fact table and dimension tables for location and item made by Radmilo Pesic & Branko Golubovic
Efficient Processing of OLAP Queries • Determine which operations should be performed on the available cuboids • Determine to which materialized cuboid(s) the relevant operations should be applied made by Radmilo Pesic & Branko Golubovic