
CSE 711: DATA MINING Sargur N. Srihari E-mail: srihari@cedar.buffalo.edu Phone: 645-6164, ext. 113






Presentation Transcript


  1. CSE 711: DATA MINING Sargur N. Srihari E-mail: srihari@cedar.buffalo.edu Phone: 645-6164, ext. 113

  2. CSE 711 Texts Required Text 1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000. Recommended Texts 1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.

  3. CSE 711 Texts 2. Groth, R., Data Mining: A Hands-on Approach for Business Professionals, Prentice-Hall PTR, 1997. 3. Kennedy, R., Y. Lee, et al., Solving Data Mining Problems through Pattern Recognition, Prentice-Hall PTR, 1998. 4. Weiss, S., and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann, 1998.

  4. Introduction • Challenge: How to manage ever-increasing amounts of information • Solution: Data Mining and Knowledge Discovery in Databases (KDD)

  5. Information as a Production Factor • Most international organizations produce more information in a week than many people could read in a lifetime

  6. Data Mining Motivation • Mechanical production of data creates a need for mechanical consumption of data • Large databases contain vast amounts of information • The difficulty lies in accessing it

  7. KDD and Data Mining • KDD: Extraction of knowledge from data • Official definition: “non-trivial extraction of implicit, previously unknown & potentially useful knowledge from data” • Data Mining: Discovery stage of the KDD process

  8. Data Mining • Process of discovering patterns, automatically or semi-automatically, in large quantities of data • Patterns discovered must be useful: meaningful in that they lead to some advantage, usually economic

  9. KDD and Data Mining Figure 1.1: Data mining is a multi-disciplinary field, drawing on machine learning, expert systems, databases, statistics, and visualization.

  10. Data Mining vs. Query Tools • SQL: When you know exactly what you are looking for • Data Mining: When you only vaguely know what you are looking for

  11. Practical Applications • KDD more complicated than initially thought • 80% preparing data • 20% mining data

  12. Data Mining Techniques • Not so much a single technique • More the idea that there is more knowledge hidden in the data than shows itself on the surface

  13. Data Mining Techniques • Any technique that helps to extract more out of data is useful • Query tools • Statistical techniques • Visualization • On-line analytical processing (OLAP) • Case-based learning (k-nearest neighbor)

  14. Data Mining Techniques • Decision trees • Association rules • Neural networks • Genetic algorithms

  15. Machine Learning and the Methodology of Science The empirical cycle of scientific research: Observation → Analysis → Theory → Prediction

  16. Machine Learning... Theory formation: analysis of a limited number of observations yields a theory ("All swans are white"), while reality contains an infinite number of swans.

  17. Machine Learning... Theory falsification: the theory "All swans are white" makes a prediction that a single observation can refute, even though reality contains an infinite number of swans.

  18. A Kangaroo in the Mist Panels a) through f) illustrate the complexity of search spaces.

  19. Association Rules Definition: Given a set of transactions, where each transaction is a set of items, an association rule is an expression X ⇒ Y, where X and Y are sets of items.

  20. Association Rules Intuitive meaning of such a rule: transactions in the database which contain the items in X tend also to contain the items in Y.

  21. Association Rules Example: 98% of customers that purchase tires and automotive accessories also buy some automotive services. Here, 98% is called the confidence of the rule. The support of the rule X ⇒ Y is the percentage of transactions that contain both X and Y.

  22. Association Rules Problem: The problem of mining association rules is to find all rules which satisfy a user-specified minimum support and minimum confidence. Applications include cross-marketing, attached mailing, catalog design, loss leader analysis, add-on sales, store layout and customer segmentation based on buying patterns.
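The support and confidence definitions above can be sketched directly in code. This is a minimal illustration over a hypothetical transaction set (the book's tooling is Java/Weka, but the computation is language-independent); the item names echo the tires/services example without reproducing its 98% figure.

```python
# Support and confidence for an association rule X => Y over a toy
# transaction set (hypothetical data, for illustration only).

def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, x, y):
    """support(X u Y) / support(X): how often Y appears when X does."""
    return support(transactions, x | y) / support(transactions, x)

transactions = [
    {"tires", "accessories", "services"},
    {"tires", "accessories", "services"},
    {"tires", "accessories"},
    {"battery"},
]

x, y = {"tires", "accessories"}, {"services"}
print(support(transactions, x | y))    # 0.5  (2 of 4 transactions)
print(confidence(transactions, x, y))  # 0.666... (2 of the 3 with X)
```

Mining then amounts to enumerating candidate itemsets and keeping only rules that clear the user-specified minimum support and confidence thresholds.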

  23. Example Data Sets • Contact Lens (symbolic) • Weather (symbolic) • Weather (numeric + symbolic) • Iris (numeric; outcome: symbolic) • CPU Performance (numeric; outcome: numeric) • Labor Negotiations (missing values) • Soybean

  24. Contact Lens Data

  25. Structural Patterns • Part of a structural description • The example is simplistic because all combinations of possible values are represented in the table
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft

  26. Structural Patterns • In most learning situations, the set of examples given as input is far from complete • Part of the job is to generalize to other, new examples

  27. Weather Data

  28. Weather Problem • This creates 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of examples
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
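The combination count and the top-down rule list can be checked with a short sketch. The attribute value names for temperature (hot/mild/cool) are assumed from the standard weather dataset; the slide only gives the counts.

```python
# Enumerate the 3 x 3 x 2 x 2 = 36 attribute combinations from the slide
# and apply its rule list in order (first rule that fires decides).
# Temperature values hot/mild/cool are an assumption, not on the slide.
from itertools import product

outlooks = ["sunny", "overcast", "rainy"]
temperatures = ["hot", "mild", "cool"]   # assumed value names
humidities = ["high", "normal"]
windy_values = [True, False]

def play(outlook, temperature, humidity, windy):
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # "if none of the above then play = yes"

combos = list(product(outlooks, temperatures, humidities, windy_values))
print(len(combos))  # 36
```

Note that rule order matters: a sunny, high-humidity day is classified "no" even though the catch-all rule says "yes".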

  29. Weather Data with Some Numeric Attributes

  30. Classification and Association Rules • Classification rules: rules which predict the classification of the example, in terms of whether to play or not
If outlook = sunny and humidity > 83 then play = no

  31. Classification and Association Rules • Association rules: rules which strongly associate different attribute values • Association rules derived from the weather table:
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high

  32. Rules for Contact Lens Data
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

  33. Decision Tree for Contact Lens Data A tree testing tear production rate, astigmatism, and spectacle prescription, with leaves none, soft, and hard.

  34. Iris Data

  35. Iris Rules Learned • If petal-length < 2.45 then Iris-setosa • If sepal-width < 2.10 then Iris-versicolor • If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor • ...

  36. CPU Performance Data

  37. CPU Performance • Numerical prediction: outcome expressed as a linear sum of weighted attributes • Regression equation: PRP = -55.9 + 0.049 MYCT + ... + 1.48 CHMAX • Regression can discover linear relationships, but not non-linear ones
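A least-squares fit of this form can be sketched in a few lines. This uses synthetic data built from known weights (not the actual CPU dataset) so the recovered coefficients can be checked; only two of the equation's attributes (MYCT, CHMAX) are used.

```python
# Least-squares linear regression in the spirit of the slide's
# PRP = -55.9 + 0.049*MYCT + ... + 1.48*CHMAX.
# Synthetic, noise-free data: the fit should recover the weights exactly.
import numpy as np

rng = np.random.default_rng(0)
n = 200
myct = rng.uniform(0, 100, n)
chmax = rng.uniform(0, 50, n)

# Target built from known weights, so the fit is verifiable.
prp = -55.9 + 0.049 * myct + 1.48 * chmax

# Design matrix with an intercept column; solve min ||Xw - y||^2.
X = np.column_stack([np.ones(n), myct, chmax])
w, *_ = np.linalg.lstsq(X, prp, rcond=None)
print(np.round(w, 3))  # ~ [-55.9, 0.049, 1.48]
```

With noisy real data the recovered weights would only approximate the generating ones, which is exactly the situation on the CPU performance dataset.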

  38. Linear Regression A simple linear regression (regression line) for the loan data set; axes: income vs. debt.

  39. Labor Negotiations Data

  40. Decision Trees for ...
wage increase first year
  ≤ 2.5: Bad
  > 2.5: statutory holidays
    > 10: Good
    ≤ 10: wage increase first year
      < 4: Bad
      ≥ 4: Good
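A decision tree like this one is just nested conditionals: test wage increase in the first year, then statutory holidays, then wage increase again. A minimal sketch (function and argument names are made up for illustration):

```python
# The labor-negotiations tree on this slide as nested conditionals:
# wage increase <= 2.5 -> Bad; otherwise holidays > 10 -> Good;
# otherwise wage increase < 4 -> Bad, else Good.

def contract_quality(wage_increase_year1, statutory_holidays):
    if wage_increase_year1 <= 2.5:
        return "bad"
    if statutory_holidays > 10:
        return "good"
    return "bad" if wage_increase_year1 < 4 else "good"

print(contract_quality(2.0, 12))  # bad  (low first-year raise)
print(contract_quality(4.5, 12))  # good (decent raise, many holidays)
print(contract_quality(3.0, 9))   # bad  (few holidays, raise < 4)
```

Each root-to-leaf path corresponds to one classification rule, which is why trees and rule sets express the same kinds of structural patterns.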

  41. … Labor Negotiations Data
wage increase first year
  ≤ 2.5: working hours per week
    ≤ 36: Bad
    > 36: health plan contribution
      none: Bad
      half: Good
      full: Bad
  > 2.5: statutory holidays
    > 10: Good
    ≤ 10: wage increase first year
      < 4: Bad
      ≥ 4: Good

  42. Soybean Data

  43. Two Example Rules
If [leaf condition is normal and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown] then diagnosis is rhizoctonia root rot
If [leaf malformation is absent and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown] then diagnosis is rhizoctonia root rot

  44. Classification A simple linear classification boundary for the loan data set; the shaded region denotes class "no loan" (axes: income vs. debt).

  45. Clustering A simple clustering of the loan data set into 3 clusters; note that the original labels are replaced by +'s (axes: income vs. debt).
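A clustering like the one above can be produced by k-means, which the slide does not name but is a standard choice: alternately assign each point to its nearest center and move each center to the mean of its points. A minimal sketch on toy 2-D (income, debt)-style data, not the actual loan dataset:

```python
# Minimal k-means on toy 2-D points (hypothetical data, k chosen by hand).
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centers[c][0])**2 + (y - centers[c][1])**2)
            clusters[i].append((x, y))
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

points = [(1, 1), (1.2, 0.9), (0.8, 1.1), (5, 5), (5.1, 4.9), (4.9, 5.2)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Unlike the classification slides, no labels are given here; the algorithm invents the groups, which is why the figure replaces the original labels with +'s.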

  46. Non-Linear Classification An example of classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set (axes: income vs. debt).

  47. Nearest Neighbor Classifier Classification boundaries for a nearest neighbor classifier for the loan data set (axes: income vs. debt).
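The nearest-neighbor classifier behind these boundaries is simple to state: label a query point with the class of its closest training point. A minimal 1-NN sketch on hypothetical (income, debt) numbers:

```python
# 1-nearest-neighbor classification over toy (income, debt) points.
# Training labels and coordinates are made up for illustration.

def nearest_neighbor(train, query):
    """train: list of ((income, debt), label); returns the closest point's label."""
    def dist2(p, q):
        return (p[0] - q[0])**2 + (p[1] - q[1])**2
    _, label = min(train, key=lambda item: dist2(item[0], query))
    return label

train = [((70, 10), "loan"), ((65, 15), "loan"),
         ((20, 40), "no loan"), ((25, 35), "no loan")]
print(nearest_neighbor(train, (60, 12)))  # loan
print(nearest_neighbor(train, (22, 38)))  # no loan
```

The jagged boundaries in the figure arise because every training point "owns" the region of the plane closest to it, so the decision surface follows the training data exactly.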
