UNESCO courses: Module on Knowledge Discovery and Data Mining

Prof. Ho Tu Bao, Prof. Bach Hung Khang
Institute of Information Technology
Japan Advanced Institute of Science and Technology

Outline of the presentation
• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion
This presentation summarizes the content and organization of the lectures in the module “Knowledge Discovery and Data Mining”.

Objectives
This course provides:
• fundamental techniques of knowledge discovery and data mining (KDD)
• issues in the practical use of KDD and its tools
• case studies of KDD applications

Prerequisite for the course
No special prerequisites, but the following are expected:
• experience of computer use
• basics of databases and statistics
• programming skill for advanced levels

Content of the course
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge

Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge

Lecture 1: Overview of KDD
1. What is KDD, and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

KDD: A Definition
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
• 10^6 to 10^12 bytes: we never see the whole data set or put it in the memory of computers
• What knowledge? How to represent and use it?
• Data mining algorithms?

Data, Information, Knowledge
• We often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.
• Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data.
• Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”. Knowledge can be considered data at a high level of abstraction and generalization.

From Data to Knowledge
Medical data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS
...
The records contain numerical attributes, categorical attributes, missing values, and class labels.
Discovered rule: IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87.5%] [confidence, predictive accuracy]
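The bracketed figure on the rule can be read as the fraction of covered records for which the prediction is correct. The sketch below shows one way such a rule could be checked against data. The attribute names come from the rule itself, but the records are hypothetical and are not taken from the medical data set; `rule_matches` and `rule_confidence` are illustrative helper names.

```python
# Hypothetical records: only the four attributes used by the rule are kept.
# The values are made up for illustration, not taken from the medical data.
records = [
    {"cell_poly": 100, "risk": "n", "loc_dat": "+", "nausea": 20, "label": "VIRUS"},
    {"cell_poly": 200, "risk": "n", "loc_dat": "+", "nausea": 30, "label": "VIRUS"},
    {"cell_poly": 150, "risk": "n", "loc_dat": "+", "nausea": 16, "label": "BACTERIA"},
    {"cell_poly": 900, "risk": "n", "loc_dat": "+", "nausea": 20, "label": "BACTERIA"},
    {"cell_poly": 50, "risk": "p", "loc_dat": "-", "nausea": 0, "label": "VIRUS"},
    {"cell_poly": 210, "risk": "n", "loc_dat": "+", "nausea": 18, "label": "VIRUS"},
]


def rule_matches(record):
    """Condition part of the rule: all four tests must hold."""
    return (
        record["cell_poly"] <= 220
        and record["risk"] == "n"
        and record["loc_dat"] == "+"
        and record["nausea"] > 15
    )


def rule_confidence(records, predicted="VIRUS"):
    """Return (number of covered records, fraction of them labeled `predicted`)."""
    covered = [r for r in records if rule_matches(r)]
    if not covered:
        return 0, None  # the rule covers nothing, so confidence is undefined
    correct = sum(1 for r in covered if r["label"] == predicted)
    return len(covered), correct / len(covered)


coverage, confidence = rule_confidence(records)
print(coverage, confidence)
```

On these made-up records the rule covers four cases and is right on three of them.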

Data Rich, Knowledge Poor
• People have gathered and stored so much data because they think some valuable assets are implicitly coded within it.
• Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support.
• How to acquire knowledge for knowledge-based systems (knowledge base, inference engine) remains the main difficult and crucial problem.
• Tradition: via knowledge engineers (manual data analysis, impractical)
• New trend: via automatic programs

Benefits of Knowledge Discovery
[Figure: diagram relating Value and Volume, with the labels Generate, Disseminate, Rapid Response, EDP, MIS, and DSS]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems

The KDD process
The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, Smyth, 1996)
• non-trivial process: multiple process
• valid: justified patterns/models
• novel: previously unknown
• useful: can be used
• understandable: by human and machine

The Knowledge Discovery Process
1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate discovered knowledge
5. Put the results to practical use
Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.
KDD is inherently interactive and iterative.

The KDD Process
1. Data organized by function; create/select target database; data warehousing
2. Select sampling technique and sample data; supply missing values; eliminate noisy data
3. Normalize values; transform values; create derived attributes; find important attributes & value ranges
4. Select DM task(s); select DM method(s); extract knowledge; test knowledge; refine knowledge
5. Query & report generation; aggregation & sequences; advanced methods; transform to different representation
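Two of the preparation activities named on this slide, supplying missing values and normalizing values, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the slide does not say which methods are meant, so mean imputation and min-max scaling are chosen here only as common examples, and the `temperature` column is hypothetical.

```python
def fill_missing(values):
    """Supply missing values: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Normalize values: rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric attribute with one missing entry.
temperature = [37.0, 38.5, None, 38.0]
filled = fill_missing(temperature)
normalized = min_max_normalize(filled)
print(filled)
print(normalized)
```

Imputing before normalizing keeps the scaling from failing on the missing entry; other orders and other imputation choices are equally possible.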

Main Contributing Areas of KDD
• Statistics: infer info from data (deduction & induction, mainly numeric data)
• Databases: store, access, search, update data (deduction) [data warehouses: integrated data] [OLAP: On-Line Analytical Processing]
• Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)

Potential Applications
• Business information: marketing and sales data analysis, investment analysis, loan approval, fraud detection, etc.
• Manufacturing information: controlling and scheduling, network management, experiment result analysis, etc.
• Scientific information: sky survey cataloging, biosequence databases, geosciences (Quakefinder), etc.
• Personal information

KDD: Opportunity and Challenges
[Figure: KDD at the center, surrounded by four factors]
• Competitive pressure
• Data rich, knowledge poor (the resource)
• Data mining technology mature
• Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)

KDD: A New and Fast Growing Area
• KDD workshops: 1989, 1991, 1993, 1994.
• International conferences: KDD’95, 96, 97, 98, 99 (USA); PAKDD’97, 98, 99 (Asia); PKDD’97, 98, 99 (Europe); PAKDD’00 (Kyoto, 2000.4.18-20, deadline 99.10.10)
• Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
• 80% of the Fortune 500 companies are currently involved in data mining pilot projects or using data mining systems.
• Japan: FGCS Project (logic programming and reasoning, recently more attention on knowledge acquisition and machine learning). Interest in KDD: Special Issue on KDD of JSAI, July 1997.
“Knowledge Discovery is the most desirable end-product of computing”. Wiederhold, Stanford Univ.

Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Regression: mapping a data item to a real-valued prediction variable.
• Dependency modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data.
• Summarization: finding a compact description for a subset of data.
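Taking regression as one example of these tasks, the sketch below fits a straight line by ordinary least squares and uses it to map a new data item to a real-valued prediction. The slide does not prescribe any particular regression method, so the closed-form line fit is an assumption, and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Map a data item x to a real-valued prediction."""
    return slope * x + intercept


# Hypothetical training data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(5.0, slope, intercept))
```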

Classification
“What factors determine cancerous cells?”
[Figure: examples (cancerous cell data) are fed to a classification algorithm, which produces general patterns]
Data mining algorithms:
- Rule induction
- Decision tree
- Neural network

Classification: Rule Induction
“What factors determine a cell is cancerous?”
If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
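The two rules on this slide can be applied to a cell description with a simple first-match lookup, sketched below. The rule conditions and certainties are taken from the slide; the dictionary representation, the `classify` helper, and the choice to return no label when no rule fires are assumptions made for illustration.

```python
# Each rule: (conditions that must all hold, predicted class, certainty from the slide).
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]


def classify(cell):
    """Return (label, certainty) of the first rule whose conditions all match."""
    for conditions, label, certainty in RULES:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None, 0.0  # assumption: no rule fires, so no prediction is made


print(classify({"color": "light", "tails": 1, "nuclei": 2}))
print(classify({"color": "dark", "tails": 2, "nuclei": 2}))
print(classify({"color": "dark", "tails": 1, "nuclei": 1}))
```

Because the two rules cover only part of the attribute space, a real rule set would need either more rules or a default class.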

Classification: Decision Trees
[Figure: decision tree with a root test on Color (dark / light), followed by tests on #nuclei (1 / 2) and #tails (1 / 2), with leaves labeled healthy or cancerous]
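A decision tree of this kind can be stored as nested nodes and walked from the root to a leaf. The sketch below is only an illustration: the exact branch layout of the slide's figure cannot be recovered from the transcript, so the tree here is an assumed layout chosen to use the same tests (Color, #nuclei, #tails) and to agree with the two rules on the rule induction slide.

```python
# A node is either a leaf label (str) or a pair (attribute, {value: subtree}).
# ASSUMED layout: the original figure's branch structure is not recoverable.
TAILS_SUBTREE = ("tails", {1: "healthy", 2: "cancerous"})
TREE = (
    "color",
    {
        "dark": ("nuclei", {1: "cancerous", 2: TAILS_SUBTREE}),
        "light": ("nuclei", {1: "healthy", 2: TAILS_SUBTREE}),
    },
)


def predict(tree, cell):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[cell[attribute]]
    return node


print(predict(TREE, {"color": "light", "nuclei": 2, "tails": 1}))
print(predict(TREE, {"color": "dark", "nuclei": 2, "tails": 2}))
```

Each root-to-leaf path corresponds to one If/Then rule, which is why trees and rule sets are often interchangeable representations.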

Classification: Neural Networks
“What factors determine a cell is cancerous?”
[Figure: neural network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy and Cancerous]
