Outline of the presentation

• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion

Objectives
This course provides:
• fundamental techniques of knowledge discovery and data mining (KDD)
• issues in the practical use of KDD and its tools
• case studies of KDD applications

Prerequisites for the course
Nothing special is required, but the following are expected:
• experience using computers
• basics of databases, statistics, and mathematics
• programming skills

Content of the course
• Overview of KDD
• Mining association rules
• Mining action rules
• Decision tree induction
• Distributed knowledge systems and distributed query answering
• Cluster analysis


Brief introduction to lectures: Overview of KDD

Lecture 1: Overview of KDD
1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

KDD: A Definition
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
• Large volumes of data: 10^6 to 10^12 bytes. We never see the whole data set, so we put it in the memory of computers.
• Knowledge: what is the knowledge? How do we represent and use it?
• Extraction: we then run data mining algorithms.

Data, Information, Knowledge
• Data: we often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.
• Information: data stripped of redundancy, and reduced to the minimum necessary to characterize the data.
• Knowledge: integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”. Knowledge can be considered data at a high level of abstraction and generalization.

From Data to Knowledge
Medical data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes. The records contain numerical attributes, categorical attributes, missing values, and class labels.

...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS
...

Discovered rule:
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87.5%]
The bracketed value is the rule's confidence (predictive accuracy).
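As a minimal sketch of how the bracketed figure on such a rule could be computed, the snippet below measures confidence as the fraction of records matching the IF part that also carry the predicted class. The function name `rule_confidence`, the class-label field `Diagnosis`, and the toy records are assumptions made for illustration; they are not Dr. Tsumoto's data, and the toy values were chosen only so that the result comes out at 87.5%.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def rule_confidence(
    records: List[Record],
    condition: Callable[[Record], bool],
    target_attr: str,
    target_value: object,
) -> Tuple[int, float]:
    """Return (number of records covered by the rule, confidence of the rule)."""
    covered = [r for r in records if condition(r)]
    if not covered:
        return 0, 0.0
    correct = sum(1 for r in covered if r[target_attr] == target_value)
    return len(covered), correct / len(covered)


def virus_rule(r: Record) -> bool:
    """IF part of the rule on the slide."""
    return (
        r["cell_poly"] <= 220
        and r["Risk"] == "n"
        and r["Loc_dat"] == "+"
        and r["Nausea"] > 15
    )


# Hypothetical toy records (NOT the medical data set): 8 match the IF part,
# 7 of those are labeled VIRUS; the last 2 records do not match the IF part.
toy_records: List[Record] = (
    [{"cell_poly": 150, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "Diagnosis": "VIRUS"}] * 7
    + [{"cell_poly": 200, "Risk": "n", "Loc_dat": "+", "Nausea": 18, "Diagnosis": "BACTERIA"}]
    + [
        {"cell_poly": 900, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "Diagnosis": "BACTERIA"},
        {"cell_poly": 100, "Risk": "p", "Loc_dat": "-", "Nausea": 3, "Diagnosis": "VIRUS"},
    ]
)

covered, confidence = rule_confidence(toy_records, virus_rule, "Diagnosis", "VIRUS")
print(f"covered={covered}, confidence={confidence:.1%}")
```

A rule that covers few records can show a high confidence by chance, so the coverage count is returned alongside it.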

Data Rich, Knowledge Poor
• How to acquire knowledge for knowledge-based systems (knowledge base and inference engine) remains the main difficult and crucial problem.
• People have gathered and stored so much data because they think valuable assets are implicitly coded within it.
• Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support.
• Manual data analysis is impractical.
• Tradition: via knowledge engineers
• New trend: via automatic programs

Benefits of Knowledge Discovery
[Figure: axes Value and Volume; systems EDP, MIS, DSS; labels Generate, Disseminate, Rapid Response]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems

The KDD process
“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, Smyth, 1996)
• non-trivial process: multiple process
• valid: justified patterns/models
• novel: previously unknown
• useful: can be used
• understandable: by human and machine

The Knowledge Discovery Process
1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models. This is a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.
4. Interpret and evaluate discovered knowledge
5. Put the results to practical use
KDD is inherently interactive and iterative.

The KDD Process
[Figure: the five steps of the KDD process annotated with typical activities]
Activities shown:
• Create/select target database
• Select sampling technique and sample data
• Supply missing values
• Eliminate noisy data
• Normalize values
• Transform values
• Create derived attributes
• Find important attributes & value ranges
• Transform to different representation
• Select DM task(s)
• Select DM method(s)
• Extract knowledge
• Test knowledge
• Refine knowledge
Other labels in the figure: data organized by function, data warehousing, query & report generation, aggregation & sequences, advanced methods.
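Two of the preprocessing activities named on this slide, supplying missing values and normalizing values, can be sketched as follows. This is only one possible choice for each (mean imputation and min-max scaling), picked here for illustration; the slide does not say which techniques the course uses. The function names and the use of "?" as the missing-value marker are assumptions.

```python
from typing import List, Optional

# Assumption: "?" (or an empty field) marks a missing value in the raw data.
MISSING_MARKERS = {"?", ""}


def parse_numeric(raw: str) -> Optional[float]:
    """Convert a raw field to float, or None if the value is missing."""
    raw = raw.strip()
    return None if raw in MISSING_MARKERS else float(raw)


def impute_mean(values: List[Optional[float]]) -> List[float]:
    """Supply missing values with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute a column with no known values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


raw_column = ["10", "?", "30", "20"]  # hypothetical numeric attribute
parsed = [parse_numeric(x) for x in raw_column]
filled = impute_mean(parsed)
scaled = min_max_normalize(filled)
print(filled, scaled)
```

Imputing before normalizing keeps the filled-in value inside the observed range, so it cannot shift the minimum or maximum used for scaling.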

Main Contributing Areas of KDD
• Statistics: infer info from data (deduction & induction, mainly numeric data)
• Databases: store, access, search, update data (deduction) [data warehouses: integrated data] [OLAP: On-Line Analytical Processing]
• Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)

Potential Applications
Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.
Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.
Scientific information
- Sky survey cataloging
- Biosequence Databases
- Geosciences: Quakefinder
- etc.
Personal information

KDD: Opportunity and Challenges
[Figure: four factors surrounding KDD]
• Competitive pressure
• Data rich, knowledge poor (the resource)
• Data mining technology mature
• Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)

KDD: A New and Fast Growing Area
• KDD workshops: since 1989.
• International conferences: KDD (USA), first in 1995; PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997. ECML’04/PKDD’04 (in Pisa, Italy).
• Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
• About 80% of the Fortune 500 companies are involved in data mining projects or using data mining systems.
• Japan: FGCS Project (logic programming and reasoning).
“Knowledge Discovery is the most desirable end-product of computing”. Wiederhold, Stanford Univ.

Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Dependency modeling: finding a model which describes significant dependencies between variables.
• Regression: mapping a data item to a real-valued prediction variable.
• Deviation and change detection: discovering the most significant changes in the data.
• Summarization: finding a compact description for a subset of data.

Classification
“What factors determine cancerous cells?”
[Figure: examples (cancerous cell data) are fed to a classification algorithm, a data mining algorithm that outputs general patterns]
Classification algorithms:
- Rule induction
- Decision tree
- Neural network

Classification: Rule Induction
“What factors determine a cell is cancerous?”
If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
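The two rules on this slide can be applied to a new cell with a simple first-match lookup, sketched below. The rule conditions and certainties are taken from the slide; the representation (a list of condition dictionaries), the function name `classify`, and the choice to return `None` when no rule fires are assumptions made for illustration, not the output format of any particular rule-induction tool.

```python
from typing import Dict, List, Optional, Tuple

Cell = Dict[str, object]
Rule = Tuple[Dict[str, object], str, float]  # (conditions, label, certainty)

# The two rules from the slide, encoded as attribute = value conditions.
RULES: List[Rule] = [
    ({"Color": "light", "Tails": 1, "Nuclei": 2}, "Healthy Cell", 0.92),
    ({"Color": "dark", "Tails": 2, "Nuclei": 2}, "Cancerous Cell", 0.87),
]


def classify(cell: Cell, rules: List[Rule] = RULES) -> Optional[Tuple[str, float]]:
    """Return (label, certainty) of the first rule whose conditions all hold."""
    for conditions, label, certainty in rules:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # no rule covers this cell


print(classify({"Color": "dark", "Tails": 2, "Nuclei": 2}))
print(classify({"Color": "dark", "Tails": 1, "Nuclei": 1}))
```

Two rules cover only part of the attribute space, which is why an uncovered cell is reported explicitly instead of being forced into a class.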

Classification: Decision Trees
[Figure: decision tree. The root splits on Color (dark, light); each branch then splits on #nuclei (1, 2), and some branches split further on #tails (1, 2). Leaves are labeled healthy or cancerous.]
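Classifying with a decision tree means following one branch per attribute test until a leaf is reached, as in the sketch below. The exact assignment of leaves to branches cannot be recovered from the slide, so `HYPOTHETICAL_TREE` is an assumed layout: it uses the slide's attributes and labels and was chosen to agree with the two rules on the rule-induction slide. The nested-tuple encoding and the function name `predict` are likewise illustrative.

```python
from typing import Any, Dict

# A node is either a leaf label (str) or a pair (attribute, {value: subtree}).
# Assumed layout: the real branch-to-leaf assignment is not recoverable.
HYPOTHETICAL_TREE: Any = (
    "Color",
    {
        "dark": (
            "Nuclei",
            {1: "cancerous", 2: ("Tails", {1: "healthy", 2: "cancerous"})},
        ),
        "light": (
            "Nuclei",
            {1: "healthy", 2: ("Tails", {1: "healthy", 2: "cancerous"})},
        ),
    },
)


def predict(tree: Any, cell: Dict[str, object]) -> str:
    """Walk from the root to a leaf, testing one attribute per level."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[cell[attribute]]  # follow the branch for this value
    return tree


print(predict(HYPOTHETICAL_TREE, {"Color": "dark", "Nuclei": 2, "Tails": 2}))
print(predict(HYPOTHETICAL_TREE, {"Color": "light", "Nuclei": 1}))
```

Only the attributes on the path actually taken are read, so the second call works even though it gives no value for Tails.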

Classification: Neural Networks
“What factors determine a cell is cancerous?”
[Figure: neural network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy and Cancerous]
