ITCS 6162 KDD Class Fall 2007 Transparencies made by Ho Tu Bao [JAIST]
Outline of the presentation
• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion
This presentation summarizes the content and organization of the lectures in the module “Knowledge Discovery and Data Mining”.
Objectives
This course provides:
• fundamental techniques of knowledge discovery and data mining (KDD)
• issues in the practical use of KDD and its tools
• case studies of KDD applications
Prerequisites for the course
Nothing special, but the following are expected:
• experience using computers
• basics of databases and statistics
• programming skills for the advanced levels
Content of the course
• Lecture 1: Overview of KDD
• Lecture 2: Preparing data
• Lecture 3: Decision tree induction
• Lecture 4: Mining association rules
• Lecture 5: Automatic cluster detection
• Lecture 6: Artificial neural networks
• Lecture 7: Evaluation of discovered knowledge
Brief introduction to lectures
Lecture 1: Overview of KDD
1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
KDD: A Definition
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
• 10^6 to 10^12 bytes: we never see the whole data set, nor can we put it in the memory of computers
• What knowledge? How to represent and use it?
• Data mining algorithms?
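The point that the whole data set never fits in memory can be illustrated with a one-pass computation. The sketch below is an assumption added for illustration, not something the slides prescribe: it uses Welford's running-statistics method, and the function name `running_mean_variance` is hypothetical.

```python
from typing import Iterable, Tuple


def running_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) in a single pass.

    Only three numbers are kept in memory, however long the stream is.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0
    return n, mean, variance


# Usage: a generator stands in for data read record by record from disk.
values = (v for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
count, mean, variance = running_mean_variance(values)
```

The same pattern (read a record, update a small summary, discard the record) applies to any statistic that can be updated incrementally.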
Data, Information, Knowledge
• We often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.
• Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data.
• Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”.
Knowledge can be considered data at a high level of abstraction and generalization.
From Data to Knowledge
Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS
...
Callouts on the data: numerical attribute, categorical attribute, missing values, class labels
Discovered rule:
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87.5%]
[confidence, predictive accuracy]
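How a rule of this form and its bracketed confidence could be checked against records can be sketched as follows. The antecedent is taken from the slide; everything else is an assumption: the records are invented toy data (not Dr. Tsumoto's), and the names `rule_matches`, `rule_confidence` and the `label` field are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


def rule_matches(r: Record) -> bool:
    """Antecedent of the slide's rule."""
    return (r["cell_poly"] <= 220 and r["Risk"] == "n"
            and r["Loc_dat"] == "+" and r["Nausea"] > 15)


def rule_confidence(records: List[Record],
                    matches: Callable[[Record], bool],
                    predicted: str) -> Tuple[int, float]:
    """Return (records covered by the rule, fraction of those with the predicted label)."""
    covered = [r for r in records if matches(r)]
    if not covered:
        return 0, 0.0
    correct = sum(1 for r in covered if r["label"] == predicted)
    return len(covered), correct / len(covered)


# Toy records, invented purely for illustration.
toy_records: List[Record] = [
    {"cell_poly": 100, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "label": "VIRUS"},
    {"cell_poly": 200, "Risk": "n", "Loc_dat": "+", "Nausea": 16, "label": "VIRUS"},
    {"cell_poly": 220, "Risk": "n", "Loc_dat": "+", "Nausea": 30, "label": "BACTERIA"},
    {"cell_poly": 150, "Risk": "n", "Loc_dat": "+", "Nausea": 18, "label": "VIRUS"},
    {"cell_poly": 500, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "label": "BACTERIA"},
    {"cell_poly": 100, "Risk": "y", "Loc_dat": "+", "Nausea": 20, "label": "VIRUS"},
    {"cell_poly": 100, "Risk": "n", "Loc_dat": "-", "Nausea": 20, "label": "BACTERIA"},
    {"cell_poly": 100, "Risk": "n", "Loc_dat": "+", "Nausea": 15, "label": "VIRUS"},
]

covered, confidence = rule_confidence(toy_records, rule_matches, "VIRUS")
```

On this toy data the rule covers four records and three of them are VIRUS, so its confidence is 0.75. Predictive accuracy would be measured the same way, but on records not used to discover the rule.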
Data Rich, Knowledge Poor
How to acquire knowledge for knowledge-based systems remains the main difficult and crucial problem.
[Figure: “?” feeding a knowledge base and an inference engine]
• Tradition: via knowledge engineers
• New trend: via automatic programs
People have gathered and stored so much data because they think some valuable assets are implicitly coded within it.
Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support.
Manual data analysis: impractical
Benefits of Knowledge Discovery
[Figure: chart with axes Value and Volume; labels: EDP, MIS, DSS, Generate, Disseminate, Rapid Response]
• EDP: Electronic Data Processing
• MIS: Management Information Systems
• DSS: Decision Support Systems
The KDD process
The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data - Fayyad, Piatetsky-Shapiro, Smyth (1996)
• non-trivial process: multiple process
• valid: justified patterns/models
• novel: previously unknown
• useful: can be used
• understandable: by human and machine
The Knowledge Discovery Process
1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate discovered knowledge
5. Put the results to practical use
Data mining: a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.
KDD is inherently interactive and iterative.
The KDD Process
[Figure: flow diagram whose stages are numbered 1 to 5; the recoverable labels are grouped below in reading order]
• Data organized by function; Create/select target database; Data warehousing
• Select sampling technique and sample data; Supply missing values; Eliminate noisy data
• Normalize values; Transform values; Create derived attributes; Find important attributes & value ranges
• Select DM task(s); Select DM method(s); Extract knowledge; Test knowledge; Refine knowledge
• Query & report generation; Aggregation & sequences; Advanced methods; Transform to different representation
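Two of the preprocessing operations named on this slide, supplying missing values and normalizing values, can be sketched in a few lines. The slide does not say which techniques are meant, so mean imputation and min-max scaling are assumptions chosen for illustration; the function names and the sample column are hypothetical.

```python
from typing import List, Optional


def impute_mean(values: List[Optional[float]]) -> List[float]:
    """Supply missing values (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute a column with no known values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric column with one missing entry.
temperatures = [37.0, 38.5, None, 38.0]
filled = impute_mean(temperatures)
scaled = min_max_normalize(filled)
```

Other choices (median imputation, z-score standardization, dropping incomplete records) fit the same two slots in the process.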
Main Contributing Areas of KDD
• Statistics: infer info from data (deduction & induction, mainly numeric data)
• Databases: store, access, search, update data (deduction)
• Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)
Also shown: [data warehouses: integrated data], [OLAP: On-Line Analytical Processing]
Potential Applications
Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.
Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.
Scientific information
- Sky survey cataloging
- Biosequence databases
- Geosciences: Quakefinder
- etc.
Personal information
KDD: Opportunity and Challenges
[Figure: factors surrounding KDD]
• Competitive pressure
• Data rich, knowledge poor (the resource)
• Data mining technology mature
• Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)
KDD: A New and Fast Growing Area
• KDD workshops: since 1989.
• International conferences: KDD (USA), first in 1995; PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997. ECML’04/PKDD’04 (in Pisa, Italy).
• Industry interest and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
• About 80% of the Fortune 500 companies are involved in data mining projects or use data mining systems.
• Japan: FGCS Project (logic programming and reasoning).
“Knowledge Discovery is the most desirable end-product of computing”. Wiederhold, Stanford Univ.
Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Regression: mapping a data item to a real-valued prediction variable.
• Dependency Modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data.
• Summarization: finding a compact description for a subset of data.
Classification
“What factors determine cancerous cells?”
[Figure: Cancerous Cell Data (examples) fed to a Classification Algorithm, which outputs general patterns]
Data mining algorithms:
- Rule induction
- Decision tree
- Neural network
Classification: Rule Induction
“What factors determine a cell is cancerous?”
If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
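The two rules on this slide can be written down as data and applied to a cell description. This is a minimal sketch: the rules and certainties come from the slide, while the representation, the first-match policy and the name `classify_with_rules` are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

# Each rule: (conditions that must all hold, predicted class, certainty from the slide).
RULES: List[Tuple[Dict[str, Any], str, float]] = [
    ({"Color": "light", "Tails": 1, "Nuclei": 2}, "Healthy Cell", 0.92),
    ({"Color": "dark", "Tails": 2, "Nuclei": 2}, "Cancerous Cell", 0.87),
]


def classify_with_rules(cell: Dict[str, Any]) -> Optional[Tuple[str, float]]:
    """Return (class, certainty) of the first rule that fires, or None if no rule applies."""
    for conditions, label, certainty in RULES:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None


result = classify_with_rules({"Color": "dark", "Tails": 2, "Nuclei": 2})
```

A cell that matches neither rule yields `None`, which makes explicit that an induced rule set need not cover every case.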
Classification: Decision Trees
[Figure: decision tree with tests on Color (dark / light), #nuclei (1 / 2) and #tails (1 / 2), and leaves labelled healthy or cancerous]
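How a tree over these three attributes classifies a cell can be sketched with a nested structure. The exact branch layout of the slide's tree is not recoverable from the transcript, so the tree below is HYPOTHETICAL: it was chosen only to agree with the two rules on the rule induction slide, and its other leaves are arbitrary. The name `classify_with_tree` is also an assumption.

```python
from typing import Any, Dict, Tuple, Union

# A node is either a leaf (class name) or (attribute to test, {value: subtree}).
Node = Union[str, Tuple[str, Dict[Any, Any]]]

# HYPOTHETICAL tree, not the one drawn on the slide.
TREE: Node = ("Color", {
    "dark": ("Nuclei", {
        1: "healthy",
        2: ("Tails", {1: "healthy", 2: "cancerous"}),
    }),
    "light": ("Nuclei", {
        1: "healthy",
        2: ("Tails", {1: "healthy", 2: "cancerous"}),
    }),
})


def classify_with_tree(node: Node, cell: Dict[str, Any]) -> str:
    """Follow one branch per test from the root until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[cell[attribute]]
    return node


label = classify_with_tree(TREE, {"Color": "dark", "Nuclei": 2, "Tails": 2})
```

Each root-to-leaf path reads as one If ... Then rule, which is why trees and rule sets are often interchangeable representations.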
Classification: Neural Networks
“What factors determine a cell is cancerous?”
[Figure: neural network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy and Cancerous]