850 likes | 1.04k Views
Tutorial Document of ITM638 Data Warehousing and Data Mining. Dr. Chutima Beokhaimook 24 th March 2012. Data Warehouses and OLAP Technology. What is Data Warehouse?. Data warehouse have been defined in many ways
E N D
Tutorial DocumentofITM638 Data Warehousing and Data Mining Dr. Chutima Beokhaimook 24th March 2012
What is Data Warehouse? • Data warehouse have been defined in many ways • “A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process” – W.H. Inmon • The four keywords :- subject-oriented, integrated, time-variant and non-volatile
So, what is data warehousing ? • A process of constructing and using data warehouses • The utilization of a data warehouse necessitates a collection of decision support technologies • This allow knowledge workers (e.g. managers, analysts and executives) to use the data warehouse to obtain an overview of the data and make decision based on information in the warehouse • Term “warehouse DBMS” – refer to the management and utilization of data warehouse Data integration Constructing A Data Warehouse Data consolidation Data cleaning
Operational DBMS OLTP (on-line transaction processing) Day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, etc. Data warehouses OLAP (on-line analytical processing) Serve users or knowledge workers in the role of data analysis and decision making The system can organize and present data in various formats Operational Database vs. Data Warehouses
Why Have a Separate Data warehouse? • High performance for both systems • An operational database – tuned for OLTP: access methods, indexing, concurrency control, recovery • A data warehouse – tuned for OLAP: complex OLAP queries, multidimensional view, consolidation • Different functions and different data • DSS require historical data, whereas operational DB do not maintain historical data • DSS require consolidation (such as aggregation and summarization) of data from heterogeneous sources, resulting in high-quality, clean, and integrated data, whereas operational DB contain only detailed raw data, which need to be consolidate before analysis
A Multidimensional Data Model(1) • Data warehouses and OLAP tools are based on a multidimensional data model – views data in the form of a data cube • A data cube allows data to be modeled and viewed in multiple dimension • Dimensions are the perspectives of entities with respect to which an organization wants to keep records • Example • A sales data warehouse keep records of the store’s sales with respect to the dimensions time, item, branch and location. • Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. • Ex. item(item_name, brand, type) • Dimension tables can be specified by users or experts, or automatically adjusted based on data distribution
A Multidimensional Data Model (2) • A multidimensional model is organized around a central theme, for instance, sales, which is represented by a fact table • Facts are numerical measures such as quantities :-dolar_sold, unit_sold, amount_budget
Example: A 2-D view Table 2.1 A 2-D view of sales data according to the dimension time and item, where the sales are from braches located in Vancouver. The measure shown is dollar_sold (in thousands)
Example: A 3-D View Table 2.2 A 3-D view of sales data according to the dimension time and item and location. The measure shown is dollar_sold (in thousands)
Example: A 3-D data cube A 3-D data cube represent the data in table 2.2 according to the dimension time and item and location. The measure shown is dollar_sold (in thousands)
Star Schema • The most common modeling paradigm, in which the data warehouse contains • A large central table (fact table) containing the bulk of the data, no redundancy • A set of smaller attendant table (dimension table), one for each dimension
Example: star schema of a data warehouse for sales • A central fact table is sales • that contains keys to each of the four dimensions, • along with 2 measures: dollars_sold and unit_sold.
Example: fact constellation schema of a data warehouse for salesand shipping 2 fact models
A Concept Hierarchies • A concept hierarchy defines a sequence of mapping form a set of low-level concepts to higher-level, more general concepts (Example below is location)
A Concept Hierarchies(2) • Many concept hierarchies are implicit within the database schema location whichis described by attributes number, street, city, province_or_state, zipcode and country time whichis described by attributes day, week, month, quarter and year year country quarter province_or_state city week month street day Total order hierarchy Partial order hierarchy
Typical OLAP Operations for multidimensional data(1) • Roll-up (drill-up): climbing up a concept hierarchy or dimension reduction – summarize data • Drill down(roll-down): stepping down a concept hierarchy or introducing additional dimensions • reverse of roll-up • Navigate from less detailed data to more detailed data • Slice and dice: • Slice operation perform a selection on one dimension of the given cube, resulting in subcube. • Dice operation defines a subcube by performing a selection on two or more dimensions
Typical OLAP Operations for multidimensional data(2) • Pivot (rotate): • A visualization operation that rotate data axes in view in order to provide an alternative presentation of the data • Other OLAP operations: such as • drill-across – execute queries involving more than one fact table • drill-through
Location (cities) Location (cities) Chicago 440 USA 2000 New York 1560 Toronto Canada 395 1000 Vancouver 605 825 14 400 Q1 Q1 Q2 Q2 Time (Quarter) Q3 Time (Quarter) Q3 Q4 Q4 Computer Security Computer Security Home Entertainment Phone Home Entertainment Phone item (type) item (type) roll-up on location (from cities to countries)
Chicago New York Toronto Location (cities) Vancouver 150 Jan Location (cities) Chicago 100 440 New York 1560 Feb Toronto 395 150 Vancouver Time (Quarter) 605 825 14 400 Mar Q1 Apr Q2 Time (Quarter) May Q3 Jun Q4 Jul Computer Security Home Entertainment Phone Aug drill-down on time (from quarters to months) Sep item (type) Oct Nov Dec Computer Security Home Entertainment Phone item (type)
Location (cities) Chicago 440 New York 1560 Location (cities) Toronto 395 Toronto Vancouver 395 Vancouver 605 825 14 400 605 Q1 Q1 Q2 Q2 Time (Quarter) Q3 Computer Home Entertainment Q4 Computer Security item (type) Home Entertainment Phone dice for (location=“Toronto”or“Vancouver”)and (time = “Q1” or “Q2”) and (item=“home entertainment” or “computer”) item (type)
Location (cities) Chicago 440 New York 1560 Chicago Toronto 395 Vancouver 605 825 14 400 New York Q1 Toronto Q2 Time (Quarter) 605 825 14 400 Vancouver Q3 slice for (time=“Q1”) Computer Security Q4 Home Entertainment Phone Computer Security Home Entertainment item (type) Phone 440 1560 395 605 Home Entertainment item (type) 825 pivot Computer item (type) 14 Phone 400 Security Vancouver New York Toronto Chicago Location (cities)
What is Association Mining? • Association rule mining: • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories • Applications: • Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc. • Rule form: “Body Head [support, confidence]” • buys(x, “diapers”) buys(x, “beers”) [0.5%,60%] • major(x, “CS”) take (x, “DB”) grade (x, “A”) [1%,75%]
A typical example of association rule mining is market basket analysis.
The information that customers who purchase computer also tend to buy antivirus software at the same time is represented in Association Rule below: computer antivirus_software [support = 2%, confidence = 60%] • Rule support and confidence are two measures of rule interestingness • Support= 2% means that 2% of all transactions under analysis show that computer and antivirus software are purchased together • Confidence=60% means that 60% of the customers who purchased a computer also bought the software • Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold • Such threshold can be set by users of domain experts
Rule Measure: Support and Confidence • Find all the rule A B C with minimum confidence and support • Let min_sup=50%, min_conf.=50% Support: probability that a transaction contain {ABC} Confidence: Condition probability that a transaction having {AB} also contain {C} • Typically association rules are considered interesting if they satisfy both a minimum support threshold and a mininum confidence threshold • Such thresholds can be set by users or domain experts
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong • A set of items is referred to as an itemset • An itemset that contains k items is a k-itemset • The occurrence frequency of an itemset is the number of transactions that contain the itemset • An itemset satisfies minimum support if the occurrence frequency of the itemset >= min_sup * total no. of transaction • An itemset satisfies minimum support it is a frequent itemset
Two Steps in Mining Association Rules • Step1 :Find all frequent itemsets • A subset of a frequent itemset must also be a frequent itemset • i.e. if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset • Iteratively find frequent itemset with cardinality from 1 to k (k-itemset) • Step2 : generate strong association rules from the frequent itemsets
Mining Single-Dimensional Boolean Association Rules From Transaction Databases • Methods for mining the simplest form of association rules: single-dimensional, single-level, boolean association rules Apriori algorithm • The Apriori algorithm : Finding frequent itemset for boolean association rules • Lk : frequent k- itemset is used to explore Lk+1 • Consists of join and prune step • The join step: A set of candidate k-itemset (Ck) is generated by joining Lk-1 with itself • The prune step: Determine Lk as : any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
The Apriori Algorithm • Pseudo-code: Ck: Candidate itemset of size k Lk: Frequent itemset of size k L1= {frequent 1-itemsets}: for (k=1; Lk!=; k++) do begin Ck+1=candidates generated from Lk; for each transaction t in database D do increment count of all candidates in Ck+1 that are contained in t Lk+1=candidate in Ck+1 with min_support end Return kLk;
Example: Finding frequent itemsets in D • Each item is a member of the set of candidate 1-itemsets (C1), count the number of occurrences of each item • Suppose the minimum transaction support count = 2, the set of L1 = candidate 1-itemsets that satisfy minimum support • Generate C2 = L1L1 • Continue the algo. • Until C4= Transaction database D |D| = 9
Example of Generating Candidates • L3={abc, abd, acd, ace, bcd} • Self-joining: L3L3 • C4 ={abcd acde} • Pruning: • acde is remove because ade is not in L3 • C4={abcd}
Generating Association Rule from frequent Itemsets • confidence(AB)= P(B|A)=support_count(AB) support_count(A) • support_count (AB) is the no. of transaction containing the itemsets AB • support_count (A) is the no. of transaction containing the itemsets A • Association rules can be generated as • For each frequent itemset l, generate all nonempty subset of l • For every nonempty subset s of l, output the rule s(l-s) if support_count(l) support_count(s) Min_conf. is the minimum confidence threshold min_conf.
Example • Suppose the data contain the frequent itemset l={I1,I2,I5} What are the association rules that can be generated from l? • The nonempty subsets of l are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5} • The resulting association rules are 1 l1l2l5 confidence=2/4=50% 2l1l5l2 confidence=2/2=100% 3 l2l5l1 confidence=2/2=100% 4l1l2l5 confidence=2/6=33% 5 l2l1l5 confidence=2/7=29% 6 l5l1l2 confidence=2/2=100% • If minimun confidence threshold = 70% • Output are the rule no. 2,3 and 6
976-451 Data Warehousing and Data Mining Lecture 5Classification and Prediction Chutima Pisarn Faculty of Technology and Environment Prince of Songkla University
What Is Classification? • Case • A bank loans officer needs analysis of her data in order to learn which loan applicants are “safe” and which are “risky” for the bank • A marketing manager at AllElectronics needs data analysis to help guess whether a customer with a given profile will buy a new computer • A medical researcher wants to analyze breast cancer data in order to predict which one of three specific treatments a patient receive • The data analysis task is classification, where the model or classifier is constructed to predict categorical labels, such as • “safe” or “risky” for the loan application data • “yes” or “no” for the marketing data • “treatment A”, “treatment B” or “treatment C” for the medical data
What Is Prediction? • Suppose that the marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics • This data analysis task is numeric prediction, where the model constructed predicts a continuous value or ordered values, as opposed to a categorical label • This model is a predictor • Regression analysis is a statistical methodology that is most often used for numeric prediction
How does classification work? • Data classification is a two-step process • In the first step, -- learning step or training phase • a model is built describing a predetermined set of data classes or concepts • The model is constructed by analyzing database tuples described by attributes • Each tuple is assumed to belong to a predefined class, as determined by the class label attribute • Data tuples used to build the model are called training data set • The individual tuples in a training set are referred to as training samples • If the class label is provided, this step is known as supervised learning, otherwise called unsupervised learning (or clustering) • The learned model is represented in the form of classification rules, decision trees or mathematical formulae
How does classification work? • In the second step, • The model is used for classification • First, estimate the predictive accuracy of the model • The holdout method is a technique that uses a test set of class-labeled samples which are randomly selected and are independent of the training samples • The accuracy of a model on a given test set is the percentage of test set correctly classified by model • If the accuracy of the model were estimate based on the training data set -> the model tends to overfit the data • If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is unknown
How is prediction different from classification? • Data prediction is a two step process, similar to that of data classification • For prediction, the attributefor which values are being predicted is continuous-value (ordered) rather than categorical (discrete-value and unordered) • Prediction can also be viewed as a mapping or function, y=f(X)
Classification by Decision Tree Induction • A decision tree is a flow-chart-like tree structure, • each internal node denotes a test on an attribute, • each branch represents an outcome of the test • leaf node represent classes • Top-most node in a tree is the root node The decision tree represents the concept buys_computer Age? <=30 >40 31…40 Credit_rating? student? yes yes no excellent fair yes no yes no
Attribute Selection Measure • The information gain measure is used to select the test attribute at each node in the tree • Information gain measure is referred to as an attribute selection measure or measure of the goodness of split • The attribute with the highest information gainis chosen as the test attribute for the current node • Let S be a set consisting of s data samples, the class label attribute has m distinct value defining m distinct classes, Ci (for i=1,...,m) • Let si be the no of sample of S in class Ci • The expected information I(s1,s2,…sm)=- pi log2(pi), • where pi is the probability that the sample belongs to class, pi =si/s m i=1
Attribute Selection Measure(cont.) • Find an entropy of attribute A • Let A have distinct value {a1,a2,…,a} which can partition S into {S1,S2,….S} • For each Sj, sij is the number of samples Sj of class Ci • The entropy or expected information based on attribute A is given by • E(A)= s1j+…+smj • Gain(A)=I(s1,s2,…sm)-E(A) • The algorithm computes the information gain of each attribute. The attribute with the highest information gain is chosen as the test attribute for the given set S I(s1j,…,smj) j=1 s
Example the class label attribute: 2 classes I(s1,s2) = I(9,5) = -9/14 log2 (9/14)-5/14log2(5/14) =0.940
I(s1,s2) = I(9,5) = -9/14 log2(9/14)- 5/14 log2(5/14) =0.940 Compute the entropy of each attribute For attribute “age” For age=“<=30” s11=2, s21=3 For age=“31…40” s12=4, s22=0 For age=“>40” s13=3, s23=2 Gain(age) = I(s1,s2) – E(age) = 0.940 –[(5/14)I(2,3)+(4/14)I(4,0)+(5/14)I(3,2) = 0.246 For attribute “income” For income=“high” s11=2, s21=2 For income=“medium” s12=4, s22=2 For income=“low” s13=3, s23=1 Gain(income) = I(s1,s2) – E(income) = 0.940 –[(4/14)I(2,2)+(6/14)I(4,2)+(4/14)I(3,1) = 0.029