Data Mining, Lecture # 01: Introduction to Data Mining
Motivation: “Necessity is the Mother of Invention”
• Data Explosion Problem
  - Automated data collection tools (e.g. the web, sensor networks) and mature database technology have led to tremendous amounts of data being stored in databases, data warehouses and other information repositories.
  - Enterprises are currently facing a data explosion problem.
• Electronic Information: an Important Asset for Business Decisions
  - With the growth of electronic information, enterprises began to realize that the accumulated information can be an important asset in their business decisions.
  - There is potential business intelligence hidden in the large volume of data.
  - This intelligence can be the secret weapon on which the success of a business may depend.
Extracting Business Intelligence (Solution)
• It is not a simple matter to discover business intelligence in a mountain of accumulated data.
• What is required are techniques that allow the enterprise to extract the most valuable information.
• The field of Data Mining provides such techniques.
• These techniques can find novel (previously unknown) patterns that may assist an enterprise in understanding the business better and in forecasting.
Data Mining vs SQL, EIS, and OLAP
• SQL is a query language and is difficult for business people to use.
• EIS (Executive Information Systems) provide graphical interfaces that give executives a pre-programmed (and therefore limited) selection of reports, automatically generating the necessary SQL for each.
• OLAP allows views along multiple dimensions, and drill-down, thereby giving access to a vast array of analyses. However, it requires manual navigation through scores of reports, and the user has to notice interesting patterns themselves.
• Data Mining picks out interesting patterns. The user can then use visualization tools to investigate further.
An Example of OLAP Analysis and its Limits
• What is driving sales of walking sticks?
• Step 1: View some OLAP graphs, e.g. walking stick sales by city.
• Step 2: Noticing that Islamabad has high sales, you decide to investigate further.
• (Before OLAP, you would have had to write a very complex SQL query instead of simply clicking to drill down.)
• It seems that old people are responsible for most walking stick sales. You confirm this by viewing a chart of age distributions by state.
• But imagine having to do this manual investigation for all of the 10,000 products in your range! Here, OLAP gives way to Data Mining.
Data Mining vs Expert Systems
• Expert Systems = Rule-Driven Deduction
  - Top-down: from known rules (expertise) and data to decisions.
  - (Diagram: Rules and Data feed the Expert System, which produces Decisions.)
• Data Mining = Data-Driven Induction
  - Bottom-up: from data about past decisions to discovered rules (general rules induced from the data).
  - (Diagram: Data, including past decisions, feeds Data Mining, which produces Rules.)
Difference between Machine Learning and Data Mining
• Machine learning techniques were traditionally designed for the limited amounts of data found in artificial intelligence problems, whereas data mining techniques deal with the large amounts of data held in databases.
• Data Mining (Knowledge Discovery in Databases)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
• What is not Data Mining?
  - (Deductive) query processing.
  - Expert systems or small ML/statistical programs.
Data Mining (Example)
• Random Guessing vs. Potential Knowledge
• Suppose we have to forecast the probability of rain in Islamabad for any particular day.
• Without any prior knowledge, all we can do is treat rain and no rain as equally likely, which gives a probability of 50% (a pure random guess).
• If we have a lot of weather data, we can use Data Mining to extract potential rules that forecast the chance of rain better than random guessing.
• Example: the rule "if [Temperature = ‘hot’ and Humidity = ‘high’] then Rain" may hold with a 66.6% chance, i.e. it rained on two out of every three such days in the data.
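The 66.6% figure in the rule above can be read as a conditional frequency. The sketch below shows one way to compute it; the six weather records are invented for illustration and are not the lecture's data.

```python
# Toy weather log, invented for illustration: (temperature, humidity, rained)
records = [
    ("hot", "high", True),
    ("hot", "high", True),
    ("hot", "high", False),
    ("hot", "normal", False),
    ("mild", "high", True),
    ("cool", "normal", False),
]

def rule_confidence(rows, temperature, humidity):
    """Fraction of the rows matching the condition on which it rained."""
    matching = [r for r in rows if r[0] == temperature and r[1] == humidity]
    if not matching:
        return None  # the rule never fires, so there is nothing to estimate
    return sum(1 for r in matching if r[2]) / len(matching)

# Share of rainy days over the whole log, with no condition applied.
baseline = sum(1 for r in records if r[2]) / len(records)
confidence = rule_confidence(records, "hot", "high")

print(f"P(rain) overall:          {baseline:.3f}")
print(f"P(rain | hot, high):      {confidence:.3f}")
```

On this toy log the overall rain frequency is 0.5, while the rule fires on three days and is right on two of them, which is where a value such as 66.6% comes from.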
The Data Mining Process
• Step 0: Determine Business Objective
  - e.g. forecasting the probability of rain.
  - Must have relevant prior knowledge and goals of the application.
• Step 1: Prepare Data
  - Handling of noisy and missing values (Data Cleaning).
  - Data Transformation (Normalization/Discretization).
  - Attribute/Feature Selection.
• Step 2: Choosing the Function of Data Mining
  - Classification, Clustering, Association Rules.
• Step 3: Choosing the Mining Algorithm
  - Selection of the correct algorithm depending upon the quality of the data.
  - Selection of the correct algorithm depending upon the density of the data.
• Step 4: Data Mining
  - Search for patterns of interest. A typical data mining algorithm can mine millions of patterns.
• Step 5: Visualization/Knowledge Representation
  - Visualization/representation of interesting patterns, etc.
Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
• (Figure, bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge.)
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  - Time-series data and temporal data
  - Text databases
  - Multimedia databases
  - Data streams (sensor network data)
  - WWW
Data Mining Functionalities (1)
• Data Preprocessing
• Handling Missing and Noisy Data (Data Cleaning).
• Techniques we will cover:
  - Missing value imputation using Mean, Median and Mode.
  - Missing value imputation using K-Nearest Neighbor.
  - Missing value imputation using Association Rule Mining.
  - Data Binning for noisy data.
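Two of the cleaning techniques listed above are simple enough to sketch directly. The code below is a minimal illustration, assuming missing values are represented as `None` and that binning means equal-frequency bins smoothed by their means; the function names and sample values are hypothetical, and the K-Nearest Neighbor and association-rule variants are not shown.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with the mean, median, or mode of the known values."""
    known = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, and replace each
    value by the mean of its bin (one common way to damp noise)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed

ages = [25, None, 30, 30, 45, None]          # two missing entries
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "mode"))

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # noisy numeric attribute
print(smooth_by_bin_means(prices, 3))
```

Mean imputation is sensitive to extreme values, which is one reason the median or mode is sometimes preferred; which one suits a given attribute depends on its type and distribution.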
Data Mining Functionalities (1)
• Data Preprocessing
• Data Transformation (Discretization and Normalization).
• Data transformation makes the mined rules more general and compact.
• General and compact rules tend to increase the accuracy of classification.
• Example discretization of Age: Child = (0 to 20), Young = (21 to 47), Old = (48 to 120).
• Before discretization, one rule is needed per age value:
  - If attribute1 = value1 & attribute2 = value2 & Age = 08 then Buy_Computer = No.
  - If attribute1 = value1 & attribute2 = value2 & Age = 09 then Buy_Computer = No.
  - If attribute1 = value1 & attribute2 = value2 & Age = 10 then Buy_Computer = No.
• After discretization, these collapse into a single rule:
  - If attribute1 = value1 & attribute2 = value2 & Age = Child then Buy_Computer = No.
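The age ranges on this slide translate directly into a lookup. The sketch below uses exactly those three ranges for discretization; the min-max normalization function is an added illustration of the "Normalization" half of the slide, and both function names are hypothetical.

```python
# Age ranges taken from the slide: (label, lowest age, highest age)
AGE_BINS = [("Child", 0, 20), ("Young", 21, 47), ("Old", 48, 120)]

def discretize_age(age):
    """Map a numeric age onto its categorical label."""
    for label, low, high in AGE_BINS:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age} is outside the supported range 0-120")

def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # all values equal: nothing to spread out
    return [(v - low) / (high - low) for v in values]

# Ages 8, 9 and 10 all map to the same label, so three rules become one.
print([discretize_age(a) for a in (8, 9, 10, 30, 70)])
print(min_max_normalize([10, 20, 30]))
```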
Data Mining Functionalities (1)
• Data Preprocessing
• Attribute Selection/Feature Selection
  - Selection of those attributes which are most relevant to the data mining task.
  - Advantage 1: Decreases the processing time of the mining task.
  - Advantage 2: Generalizes the rules.
• Example
  - Suppose our mining goal is to find which countries have more tax cheats (Cheat) and at which levels of Taxable Income.
  - Then the date attribute will obviously not be an important factor in our mining task.
Data Mining Functionalities (1)
• Data Preprocessing
• Principal Component Analysis
• Wrapper Based
• Filter Based
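A filter-based selector scores each attribute against the class label without running the mining algorithm itself. The sketch below uses information gain as the score, which is one common choice and not necessarily the one used in the course; the `refund` and `weekday` attributes and the four records are hypothetical, with `weekday` standing in for the date attribute from the earlier example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting the rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical records: 'refund' determines 'cheat', 'weekday' is unrelated to it.
rows = [
    {"refund": "yes", "weekday": "mon", "cheat": "no"},
    {"refund": "yes", "weekday": "tue", "cheat": "no"},
    {"refund": "no", "weekday": "mon", "cheat": "yes"},
    {"refund": "no", "weekday": "tue", "cheat": "yes"},
]

scores = {a: information_gain(rows, a, "cheat") for a in ("refund", "weekday")}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print("keep first:", ranked[0])
```

One caveat with this particular score: an attribute with a distinct value on every row (a full timestamp, say) gets a high gain simply because it splits the data into singletons, so such attributes are usually discretized or excluded before scoring.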
Data Mining Functionalities (2)
• Association Rule Mining
• In the association rule mining framework, we have to find all the rules in a transactional/relational dataset whose support (frequency) is at least some minimum support threshold (min_sup) provided by the user.
• For example, with min_sup = 50%.
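To make the support threshold concrete, here is a minimal sketch that measures support and rule confidence on four made-up transactions. The items, the helper names, and the choice to treat the threshold as inclusive (support >= min_sup) are assumptions for illustration.

```python
# Four invented market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """Support of the whole rule divided by the support of its left-hand side."""
    return support(antecedent | consequent, db) / support(antecedent, db)

MIN_SUP = 0.5  # assumed inclusive: an itemset at exactly 50% counts as frequent

for items in [{"bread"}, {"butter"}, {"bread", "milk"}, {"milk", "butter"}]:
    s = support(items, transactions)
    print(sorted(items), s, "frequent" if s >= MIN_SUP else "infrequent")

# Confidence of the rule bread -> milk
print(confidence({"bread"}, {"milk"}, transactions))
```

Here {bread, milk} appears in two of four transactions and just meets the 50% threshold, while {milk, butter} appears in only one and is dropped.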
Data Mining Functionalities (2)
• Association Rule Mining
• Topics we will cover:
  - Frequent Itemset Mining Algorithms (Apriori, FP-Growth, Bit-vector).
  - Fault-Tolerant/Approximate Frequent Itemset Mining.
  - N-Most Interesting Frequent Itemset Mining.
  - Closed and Maximal Frequent Itemset Mining.
  - Incremental Frequent Itemset Mining.
  - Sequential Patterns.
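Of the algorithms listed, Apriori is the easiest to sketch. The version below is a compact, unoptimized illustration of its level-wise idea (join frequent (k-1)-itemsets, prune candidates with an infrequent subset, then count), run on invented transactions with an inclusive threshold; it is not tuned for large databases.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {itemset: support} for every itemset with support >= min_sup."""
    n = len(db)

    def frequent(candidates):
        # Count each candidate with one pass over the database.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_sup}

    items = {item for t in db for item in t}
    level = frequent({frozenset([i]) for i in items})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result

# Invented transactions for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent_itemsets = apriori(transactions, min_sup=0.5)
for itemset, sup in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what saves work: {bread, butter, milk} is never counted because its subset {butter, milk} is already known to be infrequent.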
Data Mining Functionalities (2)
• Classification and Prediction
• Finding models (functions) that describe and distinguish classes or concepts for future prediction.
• Example: classify cities as rainy/not rainy based on the Temperature, Humidity and Windy attributes.
• The previous business decisions (class labels) must be known (Supervised Learning).
• Rule
  - If Temperature = Hot & Humidity = High then Rain = Yes.
• The learned rule is then used to predict the class of an unknown record.
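As a rough illustration of learning such rules from labelled records, the sketch below builds a majority-vote lookup table: for each (temperature, humidity) combination seen in training it predicts the most frequent label. This is a deliberately simple stand-in, not a specific classifier from the course, and the training records are invented.

```python
from collections import Counter, defaultdict

def train(rows):
    """rows: list of ((temperature, humidity), label) pairs with known labels."""
    votes = defaultdict(Counter)
    for features, label in rows:
        votes[features][label] += 1
    # One rule per feature combination: predict its most frequent label.
    rules = {f: c.most_common(1)[0][0] for f, c in votes.items()}
    # Fallback for combinations never seen in training: overall majority label.
    default = Counter(label for _, label in rows).most_common(1)[0][0]
    return rules, default

def predict(model, features):
    rules, default = model
    return rules.get(features, default)

# Invented labelled history (the "past decisions" of supervised learning).
training = [
    (("hot", "high"), "yes"),
    (("hot", "high"), "yes"),
    (("hot", "high"), "no"),
    (("mild", "high"), "yes"),
    (("cool", "normal"), "no"),
    (("cool", "normal"), "no"),
    (("mild", "normal"), "no"),
]
model = train(training)
print(predict(model, ("hot", "high")))   # a combination seen in training
print(predict(model, ("cool", "high")))  # an unseen combination, uses the default
```

The learned entry for ("hot", "high") corresponds to the rule on the slide; real classifiers generalize across attribute values instead of memorizing each combination.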
Data Mining Functionalities (2)
• Cluster Analysis
• Group data to form new classes from data that has no class labels.
• Business decisions are unknown (also called Unsupervised Learning).
• Example: group cities into clusters based on the Temperature, Humidity and Windy attributes.
• (Figure: 3 clusters)
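Grouping unlabelled records can be sketched with k-means, used here as one familiar clustering method and not necessarily the one the course focuses on. The points are invented (temperature, humidity) pairs, and the starting centroids are chosen by hand so the run is deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Invented (temperature, humidity) readings for six cities; no labels are given.
points = [(35, 80), (36, 85), (20, 40), (22, 45), (10, 90), (12, 88)]
initial = [points[0], points[2], points[4]]  # hand-picked starting centroids

final_centroids, clusters = kmeans(points, initial)
for centroid, members in zip(final_centroids, clusters):
    print(centroid, members)
```

The number of clusters (three here) is an input to k-means, not something it discovers, and different starting centroids can lead to different groupings.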
Data Mining Functionalities (3)
• Outlier Analysis
• Outlier: a data object that does not comply with the general behavior of the data.
• It can be considered noise or an exception, but it is quite useful in fraud detection and rare events analysis.
• (Figure: 2 outliers)
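A simple way to flag values that do not follow the general behavior of the data is a robust score based on the median. The sketch below uses the modified z-score with the median absolute deviation (MAD); the 0.6745 constant and the 3.5 cut-off are a widely used convention, adopted here as an assumption, and the transaction amounts are invented.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

# Invented transaction amounts: most sit near 50, two clearly do not.
amounts = [52, 48, 50, 51, 49, 50, 47, 53, 400, 2]
print(mad_outliers(amounts))
```

Using the median instead of the mean matters here: a mean-and-standard-deviation score is itself pulled around by the extreme values it is supposed to detect.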
Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns, and not all of them are interesting.
  - Suggested approach: query-based, constraint-based mining.
• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
Can We Find All and Only Interesting Patterns?
• Find all the interesting patterns: Completeness
  - Can a data mining system find all the interesting patterns?
  - Remember that many of the problems in Data Mining are computationally hard (NP-complete).
  - There is no single globally best solution for any given problem.
• Search for only interesting patterns: Optimization
  - Can a data mining system find only the interesting patterns?
  - Approaches:
    - First generate all the patterns and then filter out the uninteresting ones.
    - Generate only the interesting patterns: constraint-based mining (supply threshold factors to the mining step).
Reading Assignment
• Book Chapter
  - Chapter 1 of Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”.
Data Mining: Where?
• Some Nice Resources
  - ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD): http://www.acm.org/sigs/sigkdd/
  - Knowledge Discovery Nuggets: www.kdnuggests.com
  - IEEE Transactions on Knowledge and Data Engineering: http://www.computer.org/tkde/
  - IEEE Transactions on Pattern Analysis and Machine Intelligence: http://www.computer.org/tpami/
  - Data Mining and Knowledge Discovery. Publisher: Springer Science+Business Media B.V., formerly Kluwer Academic Publishers B.V. http://www.kluweronline.com/issn/1384-5810/
  - Current and previous offerings of the Data Mining course at Stanford, CMU, MIT and Helsinki.
Text and Reference Material
• The course will be based mainly on research literature; the following texts may, however, be consulted:
  - Jiawei Han and Micheline Kamber. “Data Mining: Concepts and Techniques”.
  - David Hand, Heikki Mannila and Padhraic Smyth. “Principles of Data Mining”. Pub. Prentice Hall of India, 2004.
  - Sushmita Mitra and Tinku Acharya. “Data Mining: Multimedia, Soft Computing and Bioinformatics”. Pub. Wiley and Sons Inc., 2003.
  - Usama M. Fayyad et al. “Advances in Knowledge Discovery and Data Mining”. The MIT Press, 1996.