This chapter provides an introduction to data mining and knowledge discovery in databases, exploring the functionalities, issues, and potential applications. Topics covered include data cleaning, association analysis, classification and prediction, clustering, web mining, multimedia and spatial mining, and more.
Principles of Knowledge Discovery in Data
Fall 2004
Chapter 1: Introduction to Data Mining
Dr. Osmar R. Zaïane
University of Alberta
Summary of Last Class
• Course requirements and objectives
• Evaluation and grading
• Textbook and course notes (course web site)
• Projects and survey papers
• Course schedule
• Course content
• Questionnaire
Course Schedule (New Version, Tentative)
There are 14 weeks from Sept. 8th to Dec. 8th. First class starts September 9th and classes end December 7th.
Classes are held on Tuesdays and Thursdays.
• Week 1: Sept. 9: Introduction
• Week 2: Sept. 14: Intro DM; Sept. 16: DM operations
• Week 3: Sept. 21: Assoc. Rules; Sept. 23: Assoc. Rules
• Week 4: Sept. 28: Data Prep.; Sept. 30: Data Warehouse
• Week 5: Oct. 5: Char Rules; Oct. 7: Classification
• Week 6: Oct. 12: Clustering; Oct. 14: Clustering
• Week 7: Oct. 19: Web Mining; Oct. 21: Spatial & MM
• Week 8: Oct. 26: Papers 1&2; Oct. 31: Papers 3&4
• Week 9: Nov. 2: PPDM; Nov. 4: Advanced Topics
• Week 10: Nov. 9: Papers 5&6; Nov. 11: No class
• Week 11: Nov. 16: Papers 7&8; Nov. 18: Papers 9&10
• Week 12: Nov. 23: Papers 11&12; Nov. 25: Papers 13&14
• Week 13: Nov. 30: Papers 15&16; Dec. 2: Project Presentat.
• Week 14: Dec. 7: Final Demos
Away (out of town), to be confirmed: November 2nd and November 4th (Nov. 1-4: ICDM)
Due dates
• Midterm: week 8
• Project proposals: week 5
• Project preliminary demo: week 12
• Project reports: week 13
• Project final demo: week 14
Course Content
• Introduction to Data Mining
• Data warehousing and OLAP
• Data cleaning
• Data mining operations
• Data summarization
• Association analysis
• Classification and prediction
• Clustering
• Web Mining
• Multimedia and Spatial Mining
• Other topics if time permits
Chapter 1 Objectives
• Get a rough initial idea of what knowledge discovery in databases and data mining are.
• Get an overview of the functionalities and the issues in data mining.
We Are Data Rich but Information Poor
• Databases are too big (Terrorbytes)
• Data Mining can help discover knowledge
What Should We Do?
• We are not trying to find the needle in the haystack because DBMSs know how to do that.
• We are merely trying to understand the consequences of the presence of the needle, if it exists.
What Led Us To This?
• Necessity is the Mother of Invention
• Technology is available to help us collect data
  • Bar codes, scanners, satellites, cameras, etc.
• Technology is available to help us store data
  • Databases, data warehouses, variety of repositories…
• We are starving for knowledge (competitive edge, research, etc.)
• We are swamped by data that continuously pours on us.
  • We do not know what to do with this data
  • We need to interpret this data in search of new knowledge
Evolution of Database Technology
• 1950s: First computers, use of computers for census
• 1960s: Data collection, database creation (hierarchical and network models)
• 1970s: Relational data model, relational DBMS implementation.
• 1980s: Ubiquitous RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
• 1990s: Data mining and data warehousing, massive media digitization, multimedia databases, and Web technology.
Notice that storage prices have consistently decreased over the last decades.
What Is Our Need?
Extract interesting knowledge (rules, regularities, patterns, constraints) from data in large collections.
[Figure labels: Data, Knowledge]
A Brief History of Data Mining Research
• 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
  Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
  Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
• Journal of Data Mining and Knowledge Discovery (1997)
• 1998-2004 ACM SIGKDD conferences
Introduction - Outline
• What kind of information are we collecting?
• What are Data Mining and Knowledge Discovery?
• What kind of data can be mined?
• What can be discovered?
• Is all that is discovered interesting and useful?
• How do we categorize data mining systems?
• What are the issues in Data Mining?
• Are there application examples?
Data Collected
• Business transactions
• Scientific data (biology, physics, etc.)
• Medical and personal data
• Surveillance video and pictures
• Satellite sensing
• Games
Data Collected (Cont'd)
• Digital media
• CAD and Software engineering
• Virtual worlds
• Text reports and memos
• The World Wide Web
Knowledge Discovery
Process of nontrivial extraction of implicit, previously unknown, and potentially useful information from large collections of data.
Many Steps in KD Process
• Gather the data together
• Cleanse the data and fit it together
• Select the necessary data
• Crunch and squeeze the data to extract the essence of it
• Evaluate the output and use it
So What Is Data Mining?
• In theory, Data Mining is a step in the knowledge discovery process. It is the extraction of implicit information from a large dataset.
• In practice, data mining and knowledge discovery are becoming synonyms.
• There are other equivalent terms: KDD, knowledge extraction, discovery of regularities, pattern discovery, data archeology, data dredging, business intelligence, information harvesting…
• Notice the misnomer for data mining. Shouldn’t it be knowledge mining?
Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: KDD pipeline, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection and Transformation; Task-relevant Data; Data Mining; Pattern Evaluation; Knowledge]
Steps of a KDD Process
• Learning the application domain (relevant prior knowledge and goals of application)
• Gathering and integrating data
• Cleaning and preprocessing data (may take 60% of effort!)
• Reducing and projecting data (find useful features, dimensionality/variable reduction,…)
• Choosing functions of data mining (summarization, classification, regression, association, clustering,…)
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Evaluating results
• Interpretation: analysis of results (visualization, alteration, removing redundant patterns, …)
• Use of discovered knowledge
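The steps above can be sketched as a small pipeline. This is only an illustrative sketch: the record layout, the function names (`integrate`, `clean`, `select`, `mine`, `evaluate`), and the toy data are all assumptions made for the example, and the mining step is reduced to a simple frequency count.

```python
from collections import Counter

# Two hypothetical data sources (toy records invented for illustration).
store_a = [{"customer": "c1", "item": "bread"}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c3", "item": "bread"}, {"customer": "c4", "item": "milk"}]

def integrate(*sources):
    """Gathering and integrating data: concatenate records from all sources."""
    return [record for source in sources for record in source]

def clean(records):
    """Cleaning and preprocessing: drop records with a missing item."""
    return [r for r in records if r["item"] is not None]

def select(records):
    """Reducing and projecting: keep only the attribute relevant to the task."""
    return [r["item"] for r in records]

def mine(items):
    """Data mining step, here reduced to a frequency summary."""
    return Counter(items)

def evaluate(patterns, min_count):
    """Evaluating results: keep patterns seen at least min_count times."""
    return {item: n for item, n in patterns.items() if n >= min_count}

knowledge = evaluate(mine(select(clean(integrate(store_a, store_b)))), min_count=2)
print(knowledge)  # {'bread': 2}
```

In practice the process is iterative, so a real pipeline would loop back to earlier steps after evaluation rather than run once from start to end.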
KDD Is an Iterative Process
KDD Steps can be Merged
• Data cleaning + data integration = data pre-processing
• Data selection + data transformation = data consolidation
KDD at the Confluence of Many Disciplines
• Database Systems: DBMS, query processing, data warehousing, OLAP, …
• Artificial Intelligence: machine learning, neural networks, agents, knowledge representation, …
• Visualization: computer graphics, human-computer interaction, 3D representation, …
• Information Retrieval: indexing, inverted files, …
• High Performance Computing: parallel and distributed computing, …
• Statistics: statistical and mathematical modeling, …
• Other
Data Mining: On What Kind of Data?
• Flat Files
• Heterogeneous and legacy databases
• Relational databases and other DB: Object-oriented and object-relational databases
• Transactional databases
  Transaction(TID, Timestamp, UID, {item1, item2,…})
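A transactional record of the form Transaction(TID, Timestamp, UID, {item1, item2,…}) could be represented as follows. This is a minimal sketch: the field types and the sample values are assumptions, since the slide only names the fields.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Transaction:
    tid: int             # transaction identifier (TID)
    timestamp: datetime  # when the transaction occurred
    uid: str             # user identifier (UID); string type is an assumption
    items: frozenset     # the set of items in the transaction

# Hypothetical sample record.
t = Transaction(
    tid=1,
    timestamp=datetime(2004, 9, 14, 10, 30),
    uid="u42",
    items=frozenset({"bread", "milk"}),
)
print("milk" in t.items)  # True
```

Storing the items as a set makes the subset tests used in association analysis straightforward.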
Data Mining: On What Kind of Data?
• Data warehouses
[Figure: The Data Cube and the Sub-Space Aggregates. Dimensions: time (Q1-Q4), city (Edmonton, Calgary, Red Deer, Lethbridge), and category (Drama, Comedy, Horror). Panels: Aggregate (Sum); Group By (by category); Cross Tab in two dimensions (by time, by category); data cube in three dimensions (by time, by city, by category, by time & city, by time & category, by category & city).]
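The sub-space aggregates of a data cube are the sums over every subset of the dimensions, from the grand total up to the full three-dimensional grouping. The sketch below computes them for a toy fact table; the sales figures and the name `data_cube` are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("time", "city", "category")

# Toy sales facts (invented): (time, city, category, amount).
facts = [
    ("Q1", "Edmonton", "Drama", 10),
    ("Q1", "Calgary", "Comedy", 20),
    ("Q2", "Edmonton", "Drama", 30),
    ("Q2", "Edmonton", "Horror", 40),
]

def data_cube(rows):
    """Return {grouped dimensions: {group key: sum}} for every subset of dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[-1]  # the amount is the last field
            cube[tuple(DIMENSIONS[d] for d in dims)] = dict(totals)
    return cube

cube = data_cube(facts)
print(cube[()])          # grand total: {(): 100}
print(cube[("time",)])   # by time: {('Q1',): 30, ('Q2',): 70}
print(len(cube))         # 8 group-bys for 3 dimensions
```

With three dimensions there are 2^3 = 8 group-bys, which is why materializing a full cube becomes expensive as dimensions are added.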
Construction of Multi-dimensional Data Cube
[Figure: data cube with three dimensions. Amount: 0-20K, 20-40K, 40-60K, 60K-, sum. Province: B.C., Prairies, Ontario, sum. Discipline: Algorithms, Database, …, sum. Labeled cells: "All, All, All" and "All Amount, Algorithms, B.C."]
[Figure: data cube with dimensions Cities, Products, and Months. Slice on January. Dice on Electronics and Edmonton.]
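Slicing fixes one dimension to a single value, while dicing restricts several dimensions to chosen values. A minimal sketch over a dictionary-based cube follows; the cell values and the helper names `slice_cube` and `dice_cube` are assumptions for illustration.

```python
# Toy cube (invented): (month, city, product) -> sales.
sales = {
    ("January", "Edmonton", "Electronics"): 5,
    ("January", "Calgary", "Electronics"): 3,
    ("January", "Edmonton", "Clothing"): 2,
    ("February", "Edmonton", "Electronics"): 7,
}

def slice_cube(cells, month):
    """Slice: fix the month and keep the remaining two dimensions."""
    return {
        (city, product): value
        for (m, city, product), value in cells.items()
        if m == month
    }

def dice_cube(cells, cities, products):
    """Dice: restrict the city and product dimensions to the given values."""
    return {
        key: value
        for key, value in cells.items()
        if key[1] in cities and key[2] in products
    }

january = slice_cube(sales, "January")
edmonton_electronics = dice_cube(sales, cities={"Edmonton"}, products={"Electronics"})
print(len(january))               # 3
print(len(edmonton_electronics))  # 2
```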
Data Mining: On What Kind of Data?
• Multimedia databases
• Spatial Databases
Data Mining: On What Kind of Data?
• Time Series Data and Temporal Data
Data Mining: On What Kind of Data?
• Text Documents
• The World Wide Web
  • The content of the Web
  • The structure of the Web
  • The usage of the Web
What Can Be Discovered?
What can be discovered depends upon the data mining task employed.
• Descriptive DM tasks
  • Describe general properties
• Predictive DM tasks
  • Infer on available data
Data Mining Functionality
• Characterization: Summarization of general features of objects in a target class. (Concept description)
  Ex: Characterize grad students in Science
• Discrimination: Comparison of general features of objects between a target class and a contrasting class. (Concept comparison)
  Ex: Compare students in Science and students in Arts
Data Mining Functionality (Cont'd)
• Association: Studies the frequency of items occurring together in transactional databases.
  Ex: buys(x, bread) → buys(x, milk).
• Prediction: Predicts some unknown or missing attribute values based on other information.
  Ex: Forecast the sale value for next week based on available data.
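For a rule such as buys(x, bread) → buys(x, milk), two commonly used measures are support (the fraction of transactions containing all items of the rule) and confidence (the fraction of transactions containing the antecedent that also contain the consequent). The sketch below computes both on a toy transaction list that is invented for illustration.

```python
# Toy transactional database (invented for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
    {"eggs"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """Support of the whole rule divided by support of the antecedent."""
    return support(antecedent | consequent, db) / support(antecedent, db)

s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
print(s)            # 0.4 (2 of 5 transactions)
print(round(c, 2))  # 0.67 (2 of the 3 bread transactions)
```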
Data Mining Functionality (Cont'd)
• Classification: Organizes data in given classes based on attribute values. (supervised classification)
  Ex: classify students based on final result.
• Clustering: Organizes data in classes based on attribute values. (unsupervised classification)
  Ex: group crime locations to find distribution patterns.
  Minimize inter-class similarity and maximize intra-class similarity
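To make the clustering idea concrete, here is a very small k-means sketch for one-dimensional data. The slide does not prescribe an algorithm, so k-means is an assumption chosen for illustration, as are the data points and the starting centroids.

```python
def k_means_1d(points, centroids, iterations=10):
    """Tiny k-means for 1-D data with user-supplied starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
clusters, centroids = k_means_1d(points, centroids=[0.0, 5.0])
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
print(centroids)  # [2.0, 11.0]
```

Unlike classification, no class labels are given in advance; the groups emerge from the distances between the points.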
Data Mining Functionality (Cont'd)
• Outlier analysis: Identifies and explains exceptions (surprises)
• Time-series analysis: Analyzes trends and deviations; regression, sequential pattern, similar sequences…
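One simple way to flag exceptions is a z-score test, which marks values lying far from the mean in units of standard deviation. This is only one possible technique, chosen here as an assumption; the threshold of 2.0 and the sample values are also invented for illustration.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

weekly_sales = [100, 102, 98, 101, 99, 100, 250]
outliers = z_score_outliers(weekly_sales)
print(outliers)  # [250]
```

Identifying the exception is only half of outlier analysis; explaining it still requires domain knowledge.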
Is all that is Discovered Interesting?
• A data mining operation may generate thousands of patterns; not all of them are interesting.
  • Suggested approach: Human-centered, query-based, focused mining
• Data Mining results are sometimes so large that we may need to mine them too (Meta-Mining?)
• How to measure? Interestingness
Interestingness
• Objective vs. subjective interestingness measures:
  • Objective: based on statistics and structures of patterns, e.g., support, confidence, lift, correlation coefficient, etc.
  • Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc.
• Interestingness measures: A pattern is interesting if it is
  • easily understood by humans
  • valid on new or test data with some degree of certainty
  • potentially useful
  • novel, or validates some hypothesis that a user seeks to confirm
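As an example of an objective measure, lift compares how often two itemsets occur together with how often they would co-occur if they were independent. The sketch below uses a toy transaction list invented for illustration; the function name `lift` is likewise an assumption.

```python
# Toy transactional database (invented for illustration).
baskets = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread"},
    {"eggs"},
]

def lift(antecedent, consequent, db):
    """P(antecedent and consequent) / (P(antecedent) * P(consequent))."""
    n = len(db)
    p_a = sum(1 for t in db if antecedent <= t) / n
    p_c = sum(1 for t in db if consequent <= t) / n
    p_both = sum(1 for t in db if (antecedent | consequent) <= t) / n
    return p_both / (p_a * p_c)

value = lift({"bread"}, {"milk"}, baskets)
print(round(value, 2))  # 1.33
```

A lift above 1 suggests the items occur together more often than independence would predict; a lift near 1 suggests the rule carries little information.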
Can we Find All and Only the Interesting Patterns?
• Find all the interesting patterns: Completeness.
  • Can a data mining system find all the interesting patterns?
• Search for only interesting patterns: Optimization.
  • Can a data mining system find only the interesting patterns?
• Approaches
  • First find all the patterns and then filter out the uninteresting ones.
  • Generate only the interesting patterns (mining query optimization)
• Like the concept of precision and recall in information retrieval
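The analogy with information retrieval can be sketched as follows: recall measures how many of the interesting patterns were found (completeness), and precision measures how many of the found patterns are interesting. The pattern identifiers below are hypothetical, and in practice the full set of interesting patterns is rarely known in advance.

```python
def precision_recall(found, interesting):
    """Return (precision, recall) of the found patterns against the interesting ones."""
    found, interesting = set(found), set(interesting)
    hits = found & interesting
    precision = len(hits) / len(found) if found else 0.0
    recall = len(hits) / len(interesting) if interesting else 0.0
    return precision, recall

# Hypothetical pattern identifiers.
found_patterns = {"p1", "p2", "p3", "p4"}
interesting_patterns = {"p1", "p2", "p5"}

precision, recall = precision_recall(found_patterns, interesting_patterns)
print(precision)         # 0.5
print(round(recall, 2))  # 0.67
```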
Data Mining: Classification Schemes
• There are many data mining systems. Some are specialized and some are comprehensive.
• Different views, different classifications:
  • Kinds of knowledge to be discovered,
  • Kinds of databases to be mined, and
  • Kinds of techniques adopted.
Four Schemes in Classification
• Knowledge to be mined:
  • Summarization (characterization), comparison, association, classification, clustering, trend, deviation and pattern analysis, etc.
  • Mining knowledge at different abstraction levels: primitive level, high level, multiple-level, etc.
• Techniques adopted:
  • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
Four Schemes in Classification (Cont'd)
• Data source to be mined: (application data)
  • Transaction data, time-series data, spatial data, multimedia data, text data, legacy data, heterogeneous/distributed data, World Wide Web, etc.
• Data model on which the data to be mined is drawn:
  • Relational database, extended/object-relational database, object-oriented database, deductive database, data warehouse, flat files, etc.
Designations for Mining Complex Types of Data
• Text Mining:
  • Library database, e-mails, book stores, Web pages.
• Spatial Mining:
  • Geographic information systems, medical image database.
• Multimedia Mining:
  • Image and video/audio databases.
• Web Mining:
  • Unstructured and semi-structured data
  • Web access pattern analysis
OLAP Mining: An Integration of Data Mining and Data Warehousing
• On-line analytical mining of data warehouse data: integration of mining and OLAP technologies.
• Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
• Interactive characterization, comparison, association, classification, clustering, prediction.
• Integration of different data mining functions, e.g., characterized classification, first clustering and then association, etc.
(Source JH)
Requirements and Challenges in Data Mining
• Security and social issues
• User interface issues
• Mining methodology issues
• Performance issues
• Data source issues
Requirements/Challenges in Data Mining (Cont'd)
• Security and social issues:
  • Social impact
    • Private and sensitive data is gathered and mined without the individual’s knowledge and/or consent.
    • New implicit knowledge is disclosed (confidentiality, integrity)
    • Appropriate use and distribution of discovered knowledge (sharing)
  • Regulations
    • Need for privacy and DM policies
Requirements/Challenges in Data Mining (Cont'd)
• User Interface Issues:
  • Data visualization
    • Understandability and interpretation of results
    • Information representation and rendering
    • Screen real-estate
  • Interactivity
    • Manipulation of mined knowledge
    • Focus and refine mining tasks
    • Focus and refine mining results