UNESCO courses: Module on Knowledge Discovery and Data Mining

Presentation Transcript


  1. UNESCO courses: Module on Knowledge Discovery and Data Mining. Prof. Ho Tu Bao, Prof. Bach Hung Khang. Institute of Information Technology; Japan Advanced Institute of Science and Technology

  2. Outline of the presentation: • Objectives, Prerequisite and Content • Brief Introduction to Lectures • Discussion and Conclusion. This presentation summarizes the content and organization of the lectures in the module “Knowledge Discovery and Data Mining”.

  3. Objectives This course provides: • fundamental techniques of knowledge discovery and data mining (KDD) • issues in the practical use of KDD and its tools • case studies of KDD applications

  4. Prerequisites for the course Nothing special is required, but the following are expected: • experience with computers • basic knowledge of databases and statistics • programming skills for the advanced levels

  5. Content of the course Lecture 1: Overview of KDD Lecture 2: Preparing data Lecture 3: Decision tree induction Lecture 4: Mining association rules Lecture 5: Automatic cluster detection Lecture 6: Artificial neural networks Lecture 7: Evaluation of discovered knowledge

  6. Outline of the presentation: • Objectives, Prerequisite and Content • Brief Introduction to Lectures • Discussion and Conclusion. This presentation summarizes the content and organization of the lectures in the module “Knowledge Discovery and Data Mining”.

  7. Brief introduction to lectures Lecture 1: Overview of KDD Lecture 2: Preparing data Lecture 3: Decision tree induction Lecture 4: Mining association rules Lecture 5: Automatic cluster detection Lecture 6: Artificial neural networks Lecture 7: Evaluation of discovered knowledge

  8. Lecture 1: Overview of KDD 1. What is KDD and Why ? 2. The KDD Process 3. KDD Applications 4. Data Mining Methods 5. Challenges for KDD

  9. KDD: A Definition KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data. With 10^6 to 10^12 bytes, we never see the whole data set or put it in the memory of computers. Open questions: What knowledge? How to represent and use it? Which data mining algorithms?

  10. Data, Information, Knowledge We often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily. Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data. Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”. Knowledge can be considered data at a high level of abstraction and generalization.

  11. From Data to Knowledge
      Medical data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes:
      ...
      10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
      12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
      15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
      16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS
      ...
      Annotations on the records: numerical attribute, categorical attribute, missing values, class labels.
      Discovered rule: IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87.5%] (confidence, predictive accuracy)
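
The rule on this slide can be read as a predicate over one record. The following is a minimal sketch, not the authors' method: it assumes that "confidence" means the fraction of records matched by the rule that carry the predicted class. The toy records and the field name `Diagnosis` are invented for illustration and are not taken from Dr. Tsumoto's data.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def rule_fires(r: Record) -> bool:
    """Condition part of the rule shown on the slide."""
    return (
        r["cell_poly"] <= 220
        and r["Risk"] == "n"
        and r["Loc_dat"] == "+"
        and r["Nausea"] > 15
    )


def rule_confidence(records: List[Record], predicted: str = "VIRUS") -> Tuple[int, float]:
    """Return (records covered by the rule, fraction of those with the predicted class)."""
    covered = [r for r in records if rule_fires(r)]
    if not covered:
        return 0, 0.0
    correct = sum(1 for r in covered if r["Diagnosis"] == predicted)
    return len(covered), correct / len(covered)


# Hypothetical toy records (invented; "Diagnosis" is an assumed name for the class label).
toy_records = [
    {"cell_poly": 100, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "Diagnosis": "VIRUS"},
    {"cell_poly": 200, "Risk": "n", "Loc_dat": "+", "Nausea": 16, "Diagnosis": "VIRUS"},
    {"cell_poly": 150, "Risk": "n", "Loc_dat": "+", "Nausea": 30, "Diagnosis": "BACTERIA"},
    {"cell_poly": 900, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "Diagnosis": "BACTERIA"},
    {"cell_poly": 100, "Risk": "y", "Loc_dat": "-", "Nausea": 0, "Diagnosis": "VIRUS"},
]

covered, confidence = rule_confidence(toy_records)
print(covered, round(confidence, 3))
```

On the toy records the rule covers three records, two of which are labeled VIRUS, so the confidence is two thirds.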

  12. Data Rich, Knowledge Poor Acquiring knowledge for knowledge-based systems remains the main difficult and crucial problem. People have gathered and stored so much data because they think some valuable assets are implicitly coded within it. Raw data is rarely of direct benefit; its true value depends on the ability to extract information useful for decision support. [Diagram: knowledge base and inference engine, with a "?" marking where the knowledge comes from.] Tradition: via knowledge engineers (manual data analysis is impractical). New trend: via automatic programs.

  13. Benefits of Knowledge Discovery [Figure with axes Value and Volume, showing EDP, MIS, and DSS with the labels Generate, Disseminate, and Rapid Response.] EDP: Electronic Data Processing; MIS: Management Information Systems; DSS: Decision Support Systems

  14. Lecture 1: Overview of KDD 1. What is KDD and Why ? 2. The KDD Process 3. KDD Applications 4. Data Mining Methods 5. Challenges for KDD

  15. The KDD process “The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” - Fayyad, Piatetsky-Shapiro, Smyth (1996). Annotations: • non-trivial process: multiple process • valid: justified patterns/models • novel: previously unknown • useful: can be used • understandable: by human and machine

  16. The Knowledge Discovery Process
      (1) Understand the domain and define problems
      (2) Collect and preprocess data
      (3) Data mining: extract patterns/models
      (4) Interpret and evaluate discovered knowledge
      (5) Put the results to practical use
      Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations. KDD is inherently interactive and iterative.

  17. The KDD Process
      Step 1: Data organized by function; Create/select target database; Data warehousing
      Step 2: Select sampling technique and sample data; Supply missing values; Eliminate noisy data
      Step 3: Normalize values; Transform values; Create derived attributes; Find important attributes & value ranges
      Steps 4 and 5: Select DM task(s); Select DM method(s); Extract knowledge; Test knowledge; Refine knowledge; Query & report generation; Aggregation & sequences; Advanced methods; Transform to different representation
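
Two of the preprocessing actions named on this slide, "Supply missing values" and "Normalize values", can be sketched as follows. This is only one possible choice under stated assumptions: the slide does not say which techniques the course uses, so mean imputation and min-max scaling are picked here for illustration, and the sample values are invented.

```python
from typing import List, Optional


def fill_missing_with_mean(values: List[Optional[float]]) -> List[float]:
    """Replace None entries with the mean of the known entries (an assumed strategy)."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute a column with no known values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Invented example column with one missing value.
temperatures = [37.0, None, 39.0, 38.0]
filled = fill_missing_with_mean(temperatures)
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

The missing entry is replaced by the mean of the three known values (38.0), and the filled column is then mapped onto [0, 1].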

  18. Main Contributing Areas of KDD • Statistics: infer info from data (deduction & induction, mainly numeric data) • Databases: store, access, search, update data (deduction) [data warehouses: integrated data] [OLAP: On-Line Analytical Processing] • Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)

  19. Lecture 1: Overview of KDD 1. What is KDD and Why ? 2. The KDD Process 3. KDD Applications 4. Data Mining Methods 5. Challenges for KDD

  20. Potential Applications • Business information: marketing and sales data analysis, investment analysis, loan approval, fraud detection, etc. • Manufacturing information: controlling and scheduling, network management, experiment result analysis, etc. • Scientific information: sky survey cataloging, biosequence databases, geosciences (Quakefinder), etc. • Personal information

  21. KDD: Opportunity and Challenges [Diagram: four factors converge on KDD.] • Competitive Pressure • Data Rich, Knowledge Poor (the resource) • Data Mining Technology Mature • Enabling Technology (Interactive MIS, OLAP, parallel computing, Web, etc.)

  22. KDD: A New and Fast Growing Area • KDD workshops: 1989, 1991, 1993, 1994. • International conferences: KDD’95, 96, 97, 98, 99 (USA); PAKDD’97, 98, 99 (Asia); PKDD’97, 98, 99 (Europe); PAKDD’00 (Kyoto, 2000.4.18-20, deadline 99.10.10). • Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, … 80% of the Fortune 500 companies are currently involved in data mining pilot projects or using data mining systems. • JAPAN: FGCS Project (logic programming and reasoning, recently more attention on knowledge acquisition and machine learning). Interests in KDD: Special Issue on KDD of JSAI, July 1997. • “Knowledge Discovery is the most desirable end-product of computing”. Wiederhold, Stanford Univ.

  23. Lecture 1: Overview of KDD 1. What is KDD and Why ? 2. The KDD Process 3. KDD Applications 4. Data Mining Methods 5. Challenges for KDD

  24. Primary Tasks of Data Mining • Classification: finding the description of several predefined classes and classifying a data item into one of them. • Clustering: identifying a finite set of categories or clusters to describe the data. • Dependency Modeling: finding a model which describes significant dependencies between variables. • Regression: mapping a data item to a real-valued prediction variable. • Deviation and change detection: discovering the most significant changes in the data. • Summarization: finding a compact description for a subset of data.
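
As an illustration of the regression task above (mapping a data item to a real-valued prediction), here is a minimal sketch of a one-variable least-squares line fit. The slide does not name a regression method; ordinary least squares is an assumption made here, and the data points are invented.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)


def predict(x: float) -> float:
    """Real-valued prediction for a new data item."""
    return slope * x + intercept


print(slope, intercept, predict(4.0))
```

Because the invented points lie exactly on a line, the fit recovers slope 2 and intercept 1; real data would leave a residual error.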

  25. Classification “What factors determine cancerous cells?” [Diagram: examples (cancerous cell data) are given to a classification algorithm, i.e. a data mining algorithm such as rule induction, decision tree, or neural network, which produces general patterns.]

  26. Classification: Rule Induction “What factors determine a cell is cancerous?”
      If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
      If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
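
The two induced rules on this slide can be applied to a new cell with a simple rule list. This is a sketch under assumptions: the slide does not say how a cell matched by no rule is handled, so the hypothetical `classify` function below returns `None` in that case, and the first matching rule wins.

```python
from typing import Any, Dict, List, Optional, Tuple

# Each rule: (conditions, predicted class, certainty). Values are taken from the slide.
Rule = Tuple[Dict[str, Any], str, float]

RULES: List[Rule] = [
    ({"Color": "light", "Tails": 1, "Nuclei": 2}, "Healthy Cell", 0.92),
    ({"Color": "dark", "Tails": 2, "Nuclei": 2}, "Cancerous Cell", 0.87),
]


def classify(cell: Dict[str, Any]) -> Optional[Tuple[str, float]]:
    """Return (class, certainty) of the first rule whose conditions all hold, else None."""
    for conditions, label, certainty in RULES:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # assumption: no default class is given on the slide


print(classify({"Color": "dark", "Tails": 2, "Nuclei": 2}))
print(classify({"Color": "light", "Tails": 2, "Nuclei": 1}))
```

The first call matches the second rule; the second call matches neither rule, which shows that a rule set of this kind need not cover every possible cell.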

  27. Classification: Decision Trees [Tree diagram: the root splits on Color (dark / light); lower nodes split on #nuclei (1 / 2) and #tails (1 / 2); the leaves are labeled healthy or cancerous.]

  28. Classification: Neural Networks “What factors determine a cell is cancerous?” [Network diagram: inputs such as Color = dark, # nuclei = 1, …, # tails = 2; outputs Healthy and Cancerous.]
