Data Mining Methodology and Functionalities Overview

Data Mining and Knowledge Acquisition - Chapter 7 - Data Mining Overview and Exam Questions - 2014/2015 Summer

Outline • Methodology - Overview • Introduction • Data Description – Preprocessing • OLAP • Clustering • Classification • Numerical Prediction - Regression • Frequent Pattern Mining • Recent BIS Exams • Unclassified Questions

Methodology and Overview • KDD Methodology • Functionalities

KDD Methodology • Methodology • Problem definition • Data set selection • Preprocessing transformations • Functionalities • Classification/numerical prediction • Clustering • Frequent Pattern Mining • Association • Sequential analysis • others

KDD Methodology (cont.) • Algorithms • For classification you can use • Decision trees: ID3, C4.5, and CHAID are algorithms • For clustering you can use • Partitioning methods: k-means, k-medoids • Hierarchical: AGNES • Probabilistic: EM is an algorithm • Presenting results • Back transformations • Reports • Taking action

Data Description • Single variables • Categorical - ordinal, nominal • Frequency plots, tables, pie charts • Continuous – interval, ratio • Five-number summary, central tendency, spread • Examine the probability distribution • For two variables • Both categorical • Cross tabulation • One categorical, the other continuous • Both continuous • Correlation coefficient, scatter plots
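As an illustration of the summaries named on this slide, the following is a minimal Python sketch of a five-number summary for one continuous variable and a Pearson correlation coefficient for two. The function names are hypothetical, and the "inclusive" quartile convention is only one of several in common use.

```python
import math
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

For two categorical variables the analogous summary is a cross tabulation of counts rather than a correlation coefficient.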

Preprocessing • Missing values • Inconsistencies • Redundant data • Outliers • Data transformations • Data reduction • Attribute elimination • Attribute combination • Sampling • Histograms

Functionalities • Styles of Data Mining • Descriptive - OLAP • Classification • Numerical Prediction • Clustering • Frequent Pattern Mining

Two basic styles of data mining • Descriptive • Cross tabulations, OLAP, attribute-oriented induction, clustering, association • Predictive • Classification, numerical prediction • Difference between classification and numerical prediction • Questions answered by these styles • Supervised vs. unsupervised

Descriptive - OLAP • Concept of data cube • Fact table • Measures – calculated measures • Keys • Dimensions • Schemas • Star, snowflake • Concept hierarchies • Set grouping, such as for price or age • Parent-child • Attributes not suitable for concept hierarchies

Classification • Methods • Decision trees • Neural networks • Bayesian • k-NN or memory-based reasoning • Advantages and disadvantages • Given a problem, which data processing techniques are required • Given a problem, which classification method or algorithm is more appropriate

Classification (cont.) • Accuracy of the model • Measures for classification/numerical prediction • How to better estimate • Holdout, cross-validation, bootstrapping • How to improve • Bagging, boosting • For unbalanced classes • What to do with models • Lift charts
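To make the cross-validation idea on this slide concrete, here is a minimal sketch of how k-fold cross-validation partitions record indices into training and test sets. The function name is hypothetical, and the sketch assumes the records have already been shuffled; a real study would shuffle (and often stratify by class) first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

Each record is used for testing exactly once, and the k accuracy estimates are averaged; holdout corresponds to using a single such split.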

Numerical Prediction • Learning is supervised • Output variable is continuous • Methods • Regression • Simple • Multiple • Most methods for classification can be used for numerical prediction as well • Accuracy • Root mean square error, mean absolute deviation
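The two accuracy measures listed on this slide can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes that "mean absolute deviation" refers to the mean of the absolute prediction errors.

```python
import math


def root_mean_square_error(actual, predicted):
    """Square root of the mean squared prediction error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_absolute_deviation(actual, predicted):
    """Mean of the absolute prediction errors."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)
```

Because errors are squared before averaging, the root mean square error penalizes a few large errors more heavily than the mean absolute deviation does.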

Clustering • Distance measures • Dissimilarity or similarity • For different types of variables • Ordinal, binary, nominal, ratio, interval • Why data needs to be transformed • Partitioning methods • k-means, k-medoids • Advantages and disadvantages • Hierarchical • Density-based • Probabilistic
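One reason data needs to be transformed before clustering is that a variable with a large numeric range dominates a distance measure. The sketch below, using made-up age and income values and hypothetical function names, shows min-max scaling changing which customer is nearest under Euclidean distance.

```python
import math


def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Illustrative (made-up) customers: (age, income).
ages = [25, 27, 60]
incomes = [30000, 32000, 31000]

raw = list(zip(ages, incomes))
scaled = list(zip(min_max_scale(ages), min_max_scale(incomes)))

# On raw data income dominates, so customer 2 looks closest to customer 0
# despite the 35-year age gap; after scaling, customer 1 is closest.
nearest_raw = min((1, 2), key=lambda j: euclidean(raw[0], raw[j]))
nearest_scaled = min((1, 2), key=lambda j: euclidean(scaled[0], scaled[j]))
```

Min-max scaling is only one choice; z-score standardization is a common alternative, and ordinal, binary and nominal variables need their own dissimilarity definitions.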

Frequent Pattern Mining • Association analysis • Apriori or FP-Growth • How to measure the strength of rules • Support and confidence • Other measures of interestingness; critique of the support-confidence framework • Multiple levels • Constraints • Sequential pattern mining
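Support and confidence, and lift as one example of another interestingness measure, can be sketched as below. The function names and the small transaction list are hypothetical and only illustrate the definitions; this is not an implementation of Apriori or FP-Growth.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Made-up market baskets for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
```

On these baskets the rule bread -> milk has a lift below 1, which illustrates the usual critique: a rule can pass support and confidence thresholds even when the two sides are negatively correlated.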

Introduction • Defining problems • Given a short description of an environment, define data mining problems fitting different functionalities and possible preprocessing problems peculiar to the environment • Basic functionalities • Given a short description of a data mining problem, with which functionality is the problem solved?

Big University Library • 1. Suppose that a data warehouse for Big-University Library consists of the following three dimensions: users, books, time, and each dimension has four levels not including the all level. There are three measures: You are asked to perform a data mining study on that warehouse (25 points) • Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem. What is the advantage of the data being organized as OLAP cubes compared to relational table organisation?

Big University Library (cont.) • In the data preprocessing stage of KDD • What are the reasons for missing values, and how do you handle them? • What are the possible data inconsistencies? • Do you perform any discretization? • Do you perform any data transformations? • Do you apply any data reduction strategies?

Big University Library (cont.) • Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer. • Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer. • Describe the association task in detail, specifying the algorithm, interestingness measures, or constraints, if any.

Data mining on MIS • A data warehouse for the MIS department consists of the following four dimensions: student, course, instructor, semester, and each dimension has five levels including the all level. There are two measures: count and average grade. At the lowest level, the average grade is the actual grade of a student. You are asked to perform a data mining study on that warehouse (25 points)

Data mining on MIS (cont.) • Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem. What is the advantage of the data being organized as OLAP cubes compared to relational table organisation? • In the data preprocessing stage of KDD • What are the reasons for missing values, and how do you handle them? • What are the possible data inconsistencies? • Do you perform any discretization? • Do you perform any data transformations? • Do you apply any data reduction strategies?

Data mining on MIS (cont.) • Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer. • Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer. • Describe the association task in detail, specifying the algorithm, interestingness measures, or constraints, if any.

Data Description • How to describe single variables – categorical and continuous • How to describe the association between two variables • Both continuous • Both categorical • One continuous, one categorical

Preprocessing • What to do as preprocessing? • Which techniques are applied? • For what reason?

MIS 542 Midterm 2011/2012 Fall PCA • 5. (10 points) Consider two continuous variables X and Y. Generate data sets • a) where PCA (principal component analysis) cannot reduce the dimensionality from two to one • b) where, although the two variables are related (a functional relationship exists between them), PCA is not able to reduce the dimensionality from two to one
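One possible way to explore case b) is sketched below, under the assumption that a symmetric X with Y = X² is an acceptable example: the relationship is functional but not linear, so the covariance is zero and neither principal component captures nearly all of the variance. The function names are hypothetical, and the eigenvalues of the 2×2 covariance matrix are computed in closed form rather than with a PCA library.

```python
import math


def covariance_matrix_2d(xs, ys):
    """Return (var_x, cov_xy, var_y) using population (divide-by-n) moments."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return var_x, cov_xy, var_y


def explained_variance_ratios(xs, ys):
    """Share of total variance along the first and second principal components."""
    a, b, c = covariance_matrix_2d(xs, ys)
    # Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]].
    centre = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    total = a + c
    return (centre + radius) / total, (centre - radius) / total


xs = [-2, -1, 0, 1, 2]
quadratic = explained_variance_ratios(xs, [x ** 2 for x in xs])  # Y = X^2
linear = explained_variance_ratios(xs, [2 * x for x in xs])      # Y = 2X
```

In the linear case the first component carries all of the variance, so one dimension suffices; in the quadratic case the second component still carries a large share, so dropping it loses information even though Y is fully determined by X.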

MIS 542 Final 2011/2012 Fall outliers • 1. (20 points) Give two examples of outliers • a) where outliers are useful and essential patterns to be mined • b) where outliers are useless, stemming from error or noise

MIS 542 Final 2011/2012 Fall transformations • 2. (20 points) Considering the classification methods we cover in class, describe two distinct reasons why continuous input variables have to be normalized for classification problems (each reason 10 points).

OLAP • Concept of data cube • Fact table • Measures – calculated measures • Keys • Dimensions • Schemas • Star, snowflake • Concept hierarchies • Set grouping, such as for price or age • Parent-child • Attributes not suitable for concept hierarchies

Data warehouse for library • A data warehouse is constructed for the library of a university to be used as a multi-purpose DSS. Suppose this warehouse consists of the following dimensions: user, books, and time (time_ID, year, quarter, month, week, academic year, semester, day). “Week” is considered not to be less than “month”. Each academic semester starts and ends at the beginning and end of a week, respectively. Hence, week < semester. • Describe concept hierarchies for the three dimensions. Construct meaningful attributes for each of the dimension tables above. Describe at least two meaningful measures in the fact table. Each dimension can be viewed at its ALL level as well. • What is the total number of cuboids for the library cube? • Describe three meaningful OLAP queries and write SQL expressions for one of them.

Big University • 2. (Han, page 100, 2.4) Suppose that the data warehouse for Big-University consists of the following dimensions: student, course, instructor, semester, and two measures, count and average_grade. At the lowest conceptual level (for a given student, instructor, course, and semester), the average_grade measure stores the actual grade of the student. At higher conceptual levels, average_grade stores the average grade for the given combination. (When student is MIS, semester is 2005 all terms, course is MIS 541, and instructor is Ahmet Ak, average_grade is the average of students' grades in that course by that instructor in all semesters of 2005.)

Big University (cont.) • a) Draw a snowflake schema diagram for that warehouse • What are the concept hierarchies for the dimensions? • b) What is the total number of cuboids?
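For counting cuboids, the standard formula multiplies, over all dimensions, the number of hierarchy levels of each dimension plus one for the all level. The sketch below uses a hypothetical function name; the level counts in the example are an assumption borrowed from the MIS warehouse question (four dimensions, five levels each including all), since this slide does not specify the hierarchies.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Number of cuboids when each entry counts levels excluding the all level."""
    # Each dimension contributes its own levels plus the virtual "all" level.
    return prod(levels + 1 for levels in levels_per_dimension)


# Assumed example: four dimensions, each with four levels excluding "all".
example = total_cuboids([4, 4, 4, 4])
```

With no hierarchies at all (one level per dimension), the formula reduces to 2 to the power of the number of dimensions.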

MIS 542 Final 2005/2006 Spring OLAP • 1. The MIS department wants to revise its academic strategies for the following ten years. Relevant questions are: What portion of the courses are required or elective? What is the full-time/part-time distribution of instructors? What is the course load of instructors? What percent of technical or managerial courses are taught by part-time instructors? How have all these things changed over the years?

MIS 542 Final S06 1 cont. • You can add similar strategic questions of your own. Do not consider the student aspects of the problem for the time being. Design an OLAP schema to be used as a strategic tool. You are free to decide the dimensions and the fact table. Describe the concept hierarchies, virtual dimensions and calculated members. Finally, show OLAP operations to answer three such strategic questions.

MIS 54 Final 2012/2013 Hospital • 2. (20 pts) Suppose that a data warehouse for a hospital consists of the following dimensions: time, doctor and patient, and the two measures count and charge, where charge is the fee a doctor charges a patient for a visit. • Design a warehouse with a star schema: • a) Fact table: Design the fact table. • b) Dimension tables: For each dimension show a reasonable concept hierarchy. • c) State two questions that can be answered by that OLAP cube. • d) Show drill-down and roll-up operations related to one of these questions.

Human Resource cube • 1. (25 points) In an organization, a data warehouse is to be designed for evaluating the performance of employees. To evaluate the performance of an employee, a survey questionnaire consisting of a set of questions on a 5-point Likert scale is answered by other employees in the same company at specified times. That is, the performance of employees is rated by other employees. • Each employee has a set of characteristics including department, education,… Each survey is conducted on a particular date and applied to some of the employees. Questions are aimed at evaluating broad categories of performance such as motivation, cooperation ability,… • Typically, a question in a survey, aiming to measure a specific attitude of an employee, is evaluated by another employee (rated from 1 to 5). Data is available at the question level.

Human resource cube (cont.) • Cube design: a star schema • Fact table: Design the fact table; it should contain one calculated member. What are the measures and keys? • Dimension tables: Employee and Time are the two essential dimensions; include Survey and Question dimensions as well. For each dimension show a concept hierarchy. • State three questions that can be answered by that OLAP cube. • Show drill-down and roll-up operations related to these questions.

MIS Midterm 2008/2009 Spring Shipment • 1. (20 points) Consider a shipment company responsible for shipping items from one location to another on predetermined due dates. Design a star schema OLAP cube for this problem to be used by managers for decision-making purposes. The dimensions are time, item to be shipped, person responsible for shipping the item, and location. For each of these dimensions determine three levels in the concept hierarchy. Design the fact table with appropriate measures and keys (include two measures and at least one calculated member in the fact table). • Show one drill-down and one roll-up operation. • Show the SQL query for one of the cuboids.

Outline • Clustering

Comparing clustering methods • Clustering methods • Partitioning, hierarchical, density-based, model-based: probabilistic EM • Compare clustering methods • Output • Interpretation • Sensitivity to outliers • Speed of computation

Clustering • Construct simple data sets showing the inadequacies of k-means clustering (20 points) • This algorithm is not suitable even for spherical clusters of different sizes • What are the advantages and disadvantages of using k-means?

Clustering • Consider a delivery center location decision problem in a city where a set of related products is to be delivered to markets located in the city. Design an algorithm for this location selection problem, extending an algorithm we covered in class. State clearly the algorithm and its extensions for this particular problem.

Clustering preferences • Consider a popular song competition. There are N competitors A1, A2,… AN. The number of voters is very large: a substantial fraction of the population of the country. Each voter is able to rank the competitors from best to worst, e.g. for voter 1 (A4>A2>A3>A1), meaning that there are four competitors and A4 is the best for voter 1, A1 being the worst. Suppose preference data is available for a sample of n voters at the beginning of the competition. • Develop a distance measure between the preferences of two voters i and j. • Suppose you have the k-means algorithm available in a package. Describe how you can use the k-means algorithm to cluster voters according to their preferences.
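One possible approach, offered only as a sketch and not as the expected exam answer, is to encode each voter's ranking as a vector of rank positions (one entry per competitor) and use the Euclidean distance between these vectors. The function names are hypothetical. Because each voter becomes a fixed-length numeric vector, the vectors could then be passed to a packaged k-means routine.

```python
import math


def rank_vector(ranking, competitors):
    """Map a best-to-worst ranking to rank positions in a fixed competitor order."""
    position = {name: i + 1 for i, name in enumerate(ranking)}
    return [position[name] for name in competitors]


def rank_distance(ranking_i, ranking_j, competitors):
    """Euclidean distance between the rank vectors of two voters."""
    vi = rank_vector(ranking_i, competitors)
    vj = rank_vector(ranking_j, competitors)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vi, vj)))


competitors = ["A1", "A2", "A3", "A4"]
voter_1 = ["A4", "A2", "A3", "A1"]  # the example ranking from the slide
voter_2 = ["A1", "A3", "A2", "A4"]  # made-up opposite ranking
```

One caveat of this design choice is that a k-means centroid of rank vectors is generally not itself a valid ranking; it can still be read as the average position each competitor holds within the cluster.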

MIS 542 Final 2005/2006 Spring • 3. a) Describe how to modify the k-means algorithm so as to handle categorical variables (binary, ordinal, nominal). • b) What is a disadvantage of the agglomerative hierarchical clustering method in the case of large data? Suggest a way of eliminating this disadvantage while retaining the advantages of agglomerative methods.

MIS 542 Midterm 2007/2008 Spring • Generate a data set of two continuous variables X and Y. Consider clustering based on density • When clustered with one variable (either X or Y) there is one cluster • When clustered with both variables there are two clusters

MIS 542 Final 2011/2012 Fall • 3.a (10 points) Generate data sets for two clustering problems with two continuous variables. There are two natural clusters under the notion of density-based clustering, but the quality of these clusters is low for a partitioning approach based on dissimilarity, such as k-means. • 3.b (10 points) Consider the advantages and disadvantages of partitioning and hierarchical agglomerative clustering approaches. Design a method for combining the two approaches to improve clustering quality. (In the end there are hierarchies of clusters.)

MIS Midterm 2011/2012 Fall • 6. (25 points) A retail company has asked for a segmentation of its customers. The following variables are available for each customer: age, income, gender, number of children, occupation, house owner, has a car or not. There are 6 categories of goods sold by the company, and total purchases from each category are available for each customer; in addition, average inter-purchase time is also included in the database.
