An Introduction to Data Mining
Hosein Rostani, Alireza Zohdi
Report 1 for the “advanced database” course
Supervisor: Dr. Masoud Rahgozar
December 2007
Outline
• Why data mining?
• Data mining applications
• Data mining functionalities
  • Concept description
  • Association analysis
  • Outlier analysis
  • Evolution analysis
  • Classification
  • Clustering
Why data mining?
• Motivation:
  • Wide availability of huge amounts of data
  • Need to turn data into useful information and knowledge
• Data mining:
  • Extracting or “mining” knowledge from large amounts of data
  • Knowledge: useful patterns
  • A semiautomatic process
  • Focus here on the automatic aspects
Data mining applications
• Prediction. Examples:
  • Credit risk
  • Customers switching to competitors
  • Fraudulent use of phone calling cards
• Associations. Examples:
  • Related books to buy
  • Related accessories to suggest, e.g. for a camera
  • Causation discovery, e.g. in medicine
• Clusters. Example:
  • Clusters of disease
Data mining functionalities
• Concept description
  • Characterization & discrimination
• Association analysis
• Outlier analysis
• Evolution analysis
• Classification and prediction
• Clustering
Concept description
• Description of concepts
  • Summarized, concise & precise
• Ways:
  • Data characterization
    • Summarizing the data of the target class in general terms
  • Data discrimination
    • Comparison of the target class with the contrasting class(es)
• Examples of output forms:
  • Pie charts, bar charts, curves & multidimensional tables
Association analysis
• Mining frequent patterns
  • For discovering interesting associations within data
• Kinds of frequent patterns:
  • Frequent itemset
    • Set of items that frequently appear together, e.g. milk and bread
  • Frequent subsequence
    • E.g. a pattern of customers’ purchases: first a PC, then a digital camera, and then a memory card
  • Frequent substructure
    • Structural forms such as graphs, trees, or lattices
• Support and confidence as measures of rule interestingness
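The support and confidence measures named on this slide can be made concrete with a small sketch. It assumes the usual definitions: support is the fraction of transactions containing an itemset, and confidence of a rule A ⇒ C is support(A ∪ C) / support(A). The function names and the basket data are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data, invented for this example.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread", "butter"},
]

# {milk, bread} appears in 2 of the 4 baskets.
milk_bread_support = support(baskets, {"milk", "bread"})
# Of the 3 baskets containing milk, 2 also contain bread.
milk_to_bread_conf = confidence(baskets, {"milk"}, {"bread"})
```

A frequent-itemset miner would keep only itemsets whose support exceeds a chosen minimum; that search step is not shown here.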
Outlier Analysis
• Outliers:
  • Data objects that do not comply with the general behavior of the data
• Approaches to outliers:
  • Discard as noise or exceptions
  • Keep for applications such as fraud detection
    • Example: detecting fraudulent usage of credit cards
• Detection methods:
  • Using statistical tests
  • Using distance measures
  • Using deviation-based methods
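As one possible instance of the statistical approach, the sketch below flags values whose z-score exceeds a threshold. The z-score rule, the threshold of 2.0, and the transaction amounts are assumptions chosen for illustration; the slides do not prescribe a particular test.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical credit card transaction amounts; 500 is the unusual one.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
flagged = zscore_outliers(amounts)
```

In a fraud-detection setting the flagged values would be kept for inspection rather than discarded as noise.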
Evolution Analysis
• Description and modeling of trends
  • For objects whose behavior changes over time
• Ways:
  • Applying other data mining tasks to time-related data
    • Association analysis, classification, prediction, clustering, …
  • Distinct ways
    • Time-series data analysis
    • Sequence or periodicity pattern matching
    • Similarity-based data analysis
• Example: stock market: predicting future trends in prices
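A very simple form of time-series analysis is smoothing a series with a moving average to expose its trend. The slides do not name this technique; it is used here only as an illustrative assumption, and the price series is invented.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values in `series`."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical daily closing prices.
prices = [10.0, 11.0, 13.0, 12.0, 15.0, 16.0]
trend = moving_average(prices, window=3)

# A crude trend signal: is the smoothed series higher at the end than at the start?
rising = trend[-1] > trend[0]
```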
Classification and Prediction
• Classification:
  • Process of finding a model that distinguishes data classes
  • Purpose: using the model to predict the class of new objects
• Deriving the model:
  • Based on the analysis of a set of training data
    • Data objects with known class labels
• Example:
  • In a credit card company
    • Classification of customers based on their payment history
    • Prediction of a new customer’s creditworthiness
Classification
• A two-step process for classification:
  • First: learning or training step
    • Building the classifier by analyzing or learning from training data
  • Second: classifying step
    • Using the classifier for classification
• Accuracy of a classifier (on a given test set)
  • Percentage of test set tuples correctly classified by the classifier
• Classification methods:
  • Decision tree, naïve Bayesian classification, neural network, k-nearest neighbor classification, …
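The two steps and the accuracy measure can be sketched with k-nearest neighbor classification, one of the methods listed above. Note that k-NN is a lazy learner, so its "training step" amounts to storing the labeled tuples. The data, the labels, and the choice of k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, point, k=3):
    """Classifying step: majority label among the k training tuples nearest to `point`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test tuples whose predicted label matches the known label."""
    correct = sum(1 for features, label in test if knn_predict(train, features, k) == label)
    return correct / len(test)


# Training step (lazy): the labeled tuples themselves serve as the model.
# Hypothetical (features, risk label) pairs.
train = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high"),
]
# Held-out tuples with known labels, used only to measure accuracy.
test = [((1.1, 1.0), "low"), ((5.1, 5.0), "high")]

acc = accuracy(train, test)
```

Keeping the test set separate from the training data matters: accuracy measured on the training tuples themselves would be optimistic.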
Decision tree
• Decision tree induction:
  • Learning of decision trees from class-labeled training tuples
• Decision tree: a flowchart-like tree structure
  • Internal nodes: tests on attributes
  • Branches: outcomes of the test
  • Leaves: class labels
• Usage in classification:
  • Prediction by tracing a path from the root to a leaf node
    • Testing the attribute values of a new tuple against the decision tree
  • Decision trees are easily converted to classification rules
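The sketch below shows how a tree is used once it exists: classification by tracing a path from the root to a leaf, and conversion of each root-to-leaf path into a rule. It does not perform induction; the tree is built by hand, and its attributes, outcomes, and labels are hypothetical.

```python
# A hand-built tree: dicts are internal nodes (attribute tests), strings are leaves (class labels).
tree = {
    "attribute": "income",
    "branches": {
        "high": "approve",
        "low": {
            "attribute": "payment_history",
            "branches": {"good": "approve", "bad": "reject"},
        },
    },
}


def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into a (conditions, class label) pair."""
    if not isinstance(node, dict):
        return [(list(conditions), node)]
    rules = []
    for outcome, child in node["branches"].items():
        rules += to_rules(child, conditions + ((node["attribute"], outcome),))
    return rules


decision = classify(tree, {"income": "low", "payment_history": "bad"})
rules = to_rules(tree)
```

Each rule reads as "IF all conditions hold THEN class", for example IF income = low AND payment_history = bad THEN reject.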
Bayesian Classification
• Bayesian classification
  • Predicting the probability that a new tuple belongs to a particular class
  • High accuracy and speed in large databases
• Based on Bayes’ theorem
  • Conditional probability
• Naïve Bayesian classifier
  • Assumption: class conditional independence
  • Simplifies computations
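Written out, Bayes’ theorem and the naïve independence assumption look as follows. The symbols are notation introduced here, not taken from the slides: X = (x_1, …, x_n) is the tuple to classify and C_i is a candidate class.

```latex
P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}
\qquad\text{and, under class conditional independence,}\qquad
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
```

Because P(X) is the same for every class, the classifier only needs to pick the class that maximizes P(X | C_i) P(C_i). The product form is what simplifies the computation: each factor P(x_k | C_i) can be estimated separately from the training tuples of class C_i.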
Clustering
• The process of grouping a set of physical or abstract objects into classes of similar objects
  • Generating class labels for objects currently without a label
• Clustering is based on this principle:
  • Maximizing the intraclass similarity and
  • Minimizing the interclass similarity
• Clustering also facilitates taxonomy formation
  • Hierarchical organization of observations
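One widely used algorithm that follows this principle is k-means, which repeatedly assigns each object to its nearest centroid and then recomputes the centroids. The slides do not name a specific algorithm, so k-means is an assumption here, as are the sample points and the naive choice of the first k points as initial centroids.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of numbers; returns (clusters, centroids)."""
    centroids = list(points[:k])  # naive deterministic initialization (assumption)
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical unlabeled 2-D objects forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1)]
clusters, centroids = kmeans(points, k=2)
```

The index of the cluster each point lands in plays the role of the generated class label.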
An example: clustering customers in a restaurant
[Figure: Restaurant database → Preprocessing → Object view for clustering → Clustering → A set of similar object clusters → Summarization → “White collar for dinner”, “Retired for lunch”, “Young at midnight”]
Steps of database clustering
• Define the object view
• Select relevant attributes
• Generate a suitable input format for the clustering tool
• Define a similarity measure
• Select parameter settings for the chosen clustering algorithm
• Run the clustering algorithm
• Characterize the computed clusters
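The first steps can be illustrated with a small sketch that builds an object view from two related tables, so that each customer becomes one object carrying a bag of values from a 1:n relationship. The table layouts, attribute names, and rows are hypothetical and only loosely follow the restaurant example.

```python
from collections import defaultdict

# Hypothetical relational rows.
customers = [(1, 34), (2, 67)]  # (customer_id, age)
visits = [(1, "dinner"), (1, "dinner"), (2, "lunch"), (1, "lunch")]  # (customer_id, meal)


def build_object_view(customers, visits):
    """Join the 1:n visits onto each customer, giving one object per customer."""
    meals = defaultdict(list)
    for customer_id, meal in visits:
        meals[customer_id].append(meal)
    return {
        customer_id: {"age": age, "meals": meals[customer_id]}
        for customer_id, age in customers
    }


object_view = build_object_view(customers, visits)
```

The resulting objects mix a single-valued attribute (age) with a bag-valued one (meals), which is why the later steps need a similarity measure defined for such objects.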
Challenge: database clustering
• Data collections come in many different formats
  • Flat files
  • Relational databases
  • Object-oriented databases
• Flat file format:
  • The simplest and most frequently used format in traditional data analysis
• Databases are more complex than flat files
Challenge: database clustering (cont.)
• Challenge: changing clustering algorithms to become more directly applicable to real-world databases
• Issues related to databases:
  • Different types of objects in the database
  • Relationships between objects: 1:1, 1:n & n:m
  • Complexity in defining object similarity
    • Due to the presence of bags of values for an object
  • Difficulty in selecting an appropriate similarity measure
    • Due to the presence of different attribute types in objects
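To make the similarity problem concrete, the sketch below combines a range-normalized distance for a numeric attribute with the Jaccard coefficient for a bag-valued attribute. Both measures, the equal weights, the assumed age range, and the attribute names are illustrative choices, not the method of the cited work.

```python
def jaccard(a, b):
    """Overlap of two collections treated as sets (multiplicities in a bag are ignored)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty collections are considered identical
    return len(a & b) / len(a | b)


def numeric_similarity(x, y, value_range):
    """1.0 for equal values, falling linearly to 0.0 at a difference of `value_range`."""
    return 1.0 - abs(x - y) / value_range


def object_similarity(o1, o2, age_range=100.0):
    """Equal-weight average of the per-attribute similarities (the weights are an assumption)."""
    return (
        0.5 * numeric_similarity(o1["age"], o2["age"], age_range)
        + 0.5 * jaccard(o1["meals"], o2["meals"])
    )


# Hypothetical customer objects with one numeric and one bag-valued attribute.
alice = {"age": 30, "meals": ["dinner", "lunch"]}
bob = {"age": 40, "meals": ["dinner"]}
similarity = object_similarity(alice, bob)
```

Different choices of per-attribute measure or weights can change which objects end up clustered together, which is exactly the difficulty the slide points to.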