Knowledge Discovery and Data Mining Evgueni Smirnov
Outline • Data Flood • Definition of Knowledge Discovery and Data Mining • Possible Tasks: • Classification Task • Regression Task • Clustering Task • Association-Rule Task
Trends Leading to Data Flood • Moore’s law • Computer speed doubles roughly every 18 months • Storage law • Total storage doubles roughly every 9 months • As a result, more data is captured: • Storage technology is faster and cheaper • DBMSs are capable of handling bigger databases
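To make the two doubling rates concrete, here is a minimal sketch that turns a doubling period into a growth factor, 2^(t / period). The function name and the 36-month horizon are illustrative assumptions; only the 18-month and 9-month periods come from the slide.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months` for a quantity that doubles every period."""
    return 2.0 ** (months / doubling_period_months)

# Doubling periods quoted on the slide.
SPEED_DOUBLING_MONTHS = 18    # Moore's law, as stated on the slide
STORAGE_DOUBLING_MONTHS = 9   # storage law, as stated on the slide

horizon_months = 36  # hypothetical horizon chosen for illustration

speed_growth = growth_factor(horizon_months, SPEED_DOUBLING_MONTHS)
storage_growth = growth_factor(horizon_months, STORAGE_DOUBLING_MONTHS)

# Over three years, storage grows 16x while speed grows 4x,
# so stored data outpaces processing capacity by a factor of 4.
print(speed_growth, storage_growth, storage_growth / speed_growth)
```

Under these stated rates, the gap between what is stored and what can be processed widens every year, which is the "data flood" the slides describe.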
Trends Leading to Data Flood • More data is generated: • Business: • Supermarket chains • Banks, • Telecoms, • E-commerce, etc. • Web • Science: • astronomy, • physics, • biology, • medicine etc.
Consequence • Very little of this data will ever be looked at by a human, so we need to automate the process of Knowledge Discovery to make sense of the data and put it to use.
Definition of Knowledge Discovery • Knowledge Discovery in Data is the non-trivial process of identifying • valid • novel • potentially useful • and ultimately understandable patterns in data. • From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996.
Related Fields • Knowledge Discovery draws on: • Machine Learning • Visualization • Statistics • Databases
Knowledge-Discovery Methodology • data --> Selection --> Target data --> Preprocessing & cleaning --> Processed data --> Transformation & feature selection --> Transformed data --> Data Mining --> Patterns --> Interpretation & Evaluation --> Knowledge • Data Mining is searching for patterns of interest in a particular representation.
Data-Mining Tasks • Classification Task • Regression Task • Clustering Task • Association-Rule Task
Classification Task • Given: a collection of instances (training set) • Each instance is represented by a set of attributes; one of the attributes is the class attribute. • Find: a classifier for the class attribute as a function of the values of the other attributes. • Goal: previously unseen instances should be assigned a class as accurately as possible.
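The Given / Find / Goal structure above can be sketched with a very simple classifier. The example below uses a 1-nearest-neighbour rule as a stand-in (the slide does not prescribe any particular method), numeric attributes only, and made-up toy data; all names are hypothetical.

```python
import math

def nearest_neighbour_classify(training_set, attributes):
    """Return the class of the training instance closest to `attributes`.

    `training_set` is a list of (attribute_tuple, class_label) pairs.
    """
    _, label = min(training_set, key=lambda pair: math.dist(pair[0], attributes))
    return label

# Toy training set (made up for illustration): two numeric attributes plus a class.
training_set = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

# Toy test set: previously unseen instances with known classes,
# used only to measure how accurately the classifier assigns classes.
test_set = [
    ((1.2, 1.4), "A"),
    ((5.5, 5.0), "B"),
    ((2.0, 1.5), "A"),
]

predictions = [nearest_neighbour_classify(training_set, x) for x, _ in test_set]
correct = sum(pred == true for pred, (_, true) in zip(predictions, test_set))
accuracy = correct / len(test_set)
print(predictions, accuracy)
```

The accuracy on the test set, not on the training set, is what reflects the stated goal of classifying unseen instances well.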
Example 1 • [Figure: a classifier is learned from a training set whose columns are marked categorical, categorical, continuous, and class; the learned classifier is then applied to a test set.]
Example 2 • Fraud Detection • Goal: Predict fraudulent cases in credit card transactions. • Approach: • Use credit card transactions and information about the account holder as attributes. • When does the customer buy, what does the customer buy, how often does the customer pay on time, etc. • Label past transactions as fraudulent or fair. This forms the class attribute. • Learn a model for the class of the transactions. • Use this model to detect fraud by observing credit card transactions on an account.
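The label / learn / use steps of this approach can be sketched as follows. This is a deliberately tiny illustration, assuming a single attribute (the transaction amount) and a one-threshold rule as the model; real fraud models use many attributes, and the data below is made up.

```python
def learn_threshold(amounts, labels):
    """Pick the amount threshold that misclassifies the fewest labelled transactions.

    The learned rule is: predict "fraud" when amount > threshold.
    """
    best_threshold, best_errors = None, len(labels) + 1
    for candidate in sorted(set(amounts)):
        errors = sum(
            (amount > candidate) != (label == "fraud")
            for amount, label in zip(amounts, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

def predict(threshold, amount):
    """Apply the learned rule to a new transaction."""
    return "fraud" if amount > threshold else "fair"

# Step 1: labelled past transactions (hypothetical amounts and labels).
past_amounts = [12.0, 30.0, 45.0, 60.0, 900.0, 1200.0]
past_labels = ["fair", "fair", "fair", "fair", "fraud", "fraud"]

# Step 2: learn a model for the class attribute.
threshold = learn_threshold(past_amounts, past_labels)

# Step 3: use the model on newly observed transactions.
new_predictions = [predict(threshold, a) for a in (40.0, 1000.0)]
print(threshold, new_predictions)
```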
Regression Task • Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. • Examples: • Predicting sales amounts of a new product based on advertising expenditure. • Predicting wind velocities as a function of temperature, humidity, air pressure, etc. • Time series prediction of stock market indices.
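For the linear case, the first example (sales as a function of advertising expenditure) can be sketched with ordinary least squares on one variable. The data points are invented for illustration and lie exactly on a line, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept makes the line pass
    # through the point of means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations: advertising expenditure vs. sales (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)

# Predict the continuous target for a new, unseen expenditure level.
predicted_sales = slope * 5.0 + intercept
print(slope, intercept, predicted_sales)
```

A nonlinear dependency would need a different model family; the fitting-then-predicting workflow stays the same.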
Clustering Task • Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that: • Data points in one cluster are more similar to one another; • Data points in separate clusters are less similar to one another. • Intra-cluster distances are minimized • Inter-cluster distances are maximized
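One common way to pursue small intra-cluster distances is k-means, sketched below. This is an assumption for illustration: the slide does not name an algorithm, Euclidean distance stands in for the similarity measure, and the points and starting centroids are made up.

```python
import math

def k_means(points, initial_centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid-update steps."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Toy data: two visually separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

The fixed starting centroids keep the run deterministic; in practice the result of k-means depends on the initialisation.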
Example • Market Segmentation: • Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. • Approach: • Collect different attributes of customers based on their geographical and lifestyle-related information. • Find clusters of similar customers. • Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Association-Rule Task • Given a set of records, each of which contains some number of items from a given collection; • Produce dependency rules which will predict the occurrence of an item based on occurrences of other items. • Rules discovered: • Milk --> Coke • Diaper, Milk --> Beer
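How strongly a rule such as Milk --> Coke holds is usually quantified with support and confidence. These two measures are standard in association-rule mining but are not named on the slide, and the transaction list below is a made-up illustration, not the data behind the slide's rules.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions)

def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical market-basket records (each record is a set of items).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

milk_coke_conf = confidence(transactions, {"Milk"}, {"Coke"})
diaper_milk_beer_conf = confidence(transactions, {"Diaper", "Milk"}, {"Beer"})
print(support(transactions, {"Milk", "Coke"}), milk_coke_conf, diaper_milk_beer_conf)
```

A rule miner would keep only rules whose support and confidence exceed user-chosen minimum thresholds.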
Example • Supermarket shelf management. • Goal: To identify items that are bought together by sufficiently many customers. • Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items. • A classic rule -- • If a customer buys diapers and milk, then he is very likely to buy beer. • So, don’t be surprised if you find six-packs stacked next to diapers!
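"Bought together by sufficiently many customers" can be read as a minimum-count threshold on item pairs. The sketch below counts co-occurring pairs in a few invented baskets; the function name, the baskets, and the threshold of 3 are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_count):
    """Return item pairs that appear together in at least `min_count` baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical order, e.g. ("Coke", "Milk").
        pair_counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Hypothetical point-of-sale baskets.
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

pairs = frequent_pairs(baskets, min_count=3)
print(pairs)
```

Pairs that pass the threshold are candidates for shelf-placement decisions; larger itemsets can be handled the same way at higher cost.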
Course Overview • Tuesday: • Decision Trees and Decision Rules (Evgueni Smirnov) • Introduction to Transfer in Supervised Learning (Haitham Bou Ammar)
Course Overview • Wednesday: • Evaluation of Learning Models (Evgueni Smirnov) • Regression Analysis (Georgi Nalbantov) • Self-Taught Learning (Haitham Bou Ammar)
Course Overview • Thursday: • Instance learning and Bayesian learning (Kurt Diriessens) • Feature Selection and Reduction; Clustering (Georgi Nalbantov)
Course Overview • Friday: • Association Rules (Kurt Diriessens) • Ensembles (Evgueni Smirnov) • Deep Transfer (Haitham Bou Ammar)