
Exam Data Mining


Presentation Transcript


  1. Exam Data Mining Preparation

  2. Exam • January 4, 2019 • 14:00 – 17:00 • rooms Gorl01 and Gorl02

  3. Material • Slides of lectures • see http://datamining.liacs.nl/DaMi/ • Practicals • Handouts Association Analysis (up to 6.4 (incl.)) • Paper Maximally Informative k-Itemsets • Book • Weka/Cortana • (example exam and model answers)

  4. Exam • Mixture of • topics • level • knowledge/application • Emphasis on big picture and understanding • technical details mostly not essential • apply standard algorithms on simple example data

  5. Short Questions • Give an example (e.g. with some data) of an XOR problem with four binary attributes, of which one is the target • Give two advantages of hierarchical clustering, compared to a more standard algorithm such as k-means • Explain the difference between 10-fold cross-validation and leave-one-out. In what circumstances would you choose the second alternative?

  6. Regression • Both Regression Trees and Model Trees can be used for regression. Explain the major difference between these two methods • Explain why an RT is often bigger than an MT on the same data • The measure R² is often used to assess the quality of a regression model. Give the definition of this measure, and explain why this definition makes sense. If helpful, use a diagram to explain the intuition behind this definition

  7. R² (R squared) • R² = 1 − SS_res/SS_tot • SS_res = Σ(yᵢ − fᵢ)²: the squared differences between actual and predicted values • SS_tot = Σ(yᵢ − ȳ)²: the squared differences between the actual values and the horizontal line at the mean ȳ
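This definition can be checked with a small sketch; the numbers below are made up purely for illustration.

```python
# Small sketch of the R^2 definition above; the data are made up for illustration.
actual    = [3.0, 5.0, 7.0, 9.0]        # observed values y_i
predicted = [2.5, 5.5, 6.5, 9.5]        # model predictions f_i

mean_y = sum(actual) / len(actual)
ss_res = sum((y - f) ** 2 for y, f in zip(actual, predicted))   # sum of (y_i - f_i)^2
ss_tot = sum((y - mean_y) ** 2 for y in actual)                 # sum of (y_i - mean)^2

r_squared = 1 - ss_res / ss_tot
print(r_squared)   # 1 means a perfect fit, 0 means no better than predicting the mean
```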

  8. Subgroup Discovery Assume a dataset with multiple attributes, of which one is the (binary) target. The dataset contains 1000 examples, of which 100 are positive. Let’s assume that we have evaluated a subgroup S on the data, and it turns out that it covers 20% of the dataset. Additionally, 90 positive cases turn out to be covered by S. • Give the cross table (contingency table) of this subgroup. Put the subgroup S and its complement S’ on the left, and the values of the target (T and F) at the top. Also add the totals of the columns and rows at the bottom and right of the table.

  9. Subgroup Discovery Assume a dataset with multiple attributes, of which one is the (binary) target. The dataset contains 1000 examples, of which 100 are positive. Let's assume that we have evaluated a subgroup S on the data, and it turns out that it covers 20% of the dataset. Additionally, 90 positive cases turn out to be covered by S. The resulting cross table (subgroup on the left, target at the top):

                 T      F    total
   S            90    110      200
   S'           10    790      800
   total       100    900     1000

  15. Subgroup Discovery Assume a dataset with multiple attributes, of which one is the (binary) target. The dataset contains 1000 examples, of which 100 are positive. Let’s assume that we have evaluated a subgroup S on the data, and it turns out that it covers 20% of the dataset. Additionally, 90 positive cases turn out to be covered by S. • Draw the ROC space for this dataset, and indicate the location of S.

  16. [ROC space for this dataset: subgroup S lies at FPR = 110/900 ≈ 0.12, TPR = 90/100 = 0.9, well above the diagonal]
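As a check, a short sketch that derives the cross-table entries and the ROC coordinates of S from the counts given in the exercise:

```python
# Deriving the cross table and the ROC coordinates of subgroup S from the
# counts in the exercise: 1000 examples, 100 positives, S covers 20% of the
# data and contains 90 positives.
n_total  = 1000
n_pos    = 100
n_in_S   = int(0.20 * n_total)   # 200 examples covered by S
pos_in_S = 90

neg_in_S     = n_in_S - pos_in_S              # 110
pos_not_in_S = n_pos - pos_in_S               # 10
neg_not_in_S = (n_total - n_pos) - neg_in_S   # 790

tpr = pos_in_S / n_pos               # true positive rate  = 90/100  = 0.9
fpr = neg_in_S / (n_total - n_pos)   # false positive rate = 110/900 ≈ 0.122
print(tpr, fpr)                      # (fpr, tpr) is the location of S in ROC space
```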

  17. Subgroup Discovery • Give two examples of a quality measure for binary targets, and draw the isometrics of the measures. Additionally, indicate which isometrics represent positive values. [Figure: isometrics of Information Gain]

  18. Other Measures • Precision • Gini index • Correlation coefficient • FOIL gain

  19. Decision Trees Assume a dataset of two numeric attributes (x and y) and a binary target. The 10,000 examples are uniformly distributed over the area [0..10]x[0..10]. An example is positive if it is situated inside a circle with radius 3 and center point (3, 3). Otherwise, it is negative. • Give an example of a decision tree of depth 2 (so at most two splits per path from the root to a leaf), as would probably be produced by an algorithm based on information gain.

  20. Question: Decision Tree Assume a dataset of two numeric attributes (x and y) and a binary target. The 10,000 examples are uniformly distributed over the area [0..10]x[0..10]. An example is positive if it is situated inside a circle with radius 3 and center point (3, 3). Otherwise, it is negative. • Give a suggestion for how this dataset can be modelled more easily and precisely with a decision tree.
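A sketch of both decision-tree questions, assuming numpy and scikit-learn are available: it generates the circular dataset, fits a depth-2 tree on the raw attributes, and then fits a single-split tree on a derived distance-to-center attribute (one possible answer to the second question, not necessarily the model answer).

```python
# Sketch (not the model answer): generate the circular dataset and fit trees,
# assuming numpy and scikit-learn are available.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(10_000, 2))               # attributes x and y
y = ((X[:, 0] - 3) ** 2 + (X[:, 1] - 3) ** 2 <= 9)     # positive inside the circle

# Slide 19: a depth-2 tree on the raw attributes can only carve out an
# axis-parallel rectangle, roughly the square enclosing the circle.
tree_raw = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)

# Slide 20: adding the squared distance to (3, 3) as a derived attribute
# lets a single split (distance^2 <= 9) model the circle exactly.
dist2 = ((X[:, 0] - 3) ** 2 + (X[:, 1] - 3) ** 2).reshape(-1, 1)
tree_derived = DecisionTreeClassifier(max_depth=1).fit(dist2, y)
print(tree_raw.score(X, y), tree_derived.score(dist2, y))
```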

  21. Entropy The table below contains 4 binary attributes X Y Z T T T T T T T T T F T F F F T F F T T F F T F F F F F T T F F T F • Compute the entropy of each attribute • Compute the information gain from Z to T • Give an upper bound on the joint entropy of Z and T, and argue why this is an upper bound • Compute the joint entropy of Z and T
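The computations asked for can be expressed with a few helper functions; the columns Z and T below are hypothetical stand-ins, since the slide's table did not survive the transcript.

```python
# Generic sketch of the entropy computations above; the columns are hypothetical.
from collections import Counter
from math import log2

def entropy(values):
    """H(X) = -sum p(x) * log2 p(x) over the observed value frequencies."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def joint_entropy(a, b):
    """H(A, B): entropy of the paired values; at most H(A) + H(B)."""
    return entropy(list(zip(a, b)))

def information_gain(target, attribute):
    """IG = H(target) - H(target | attribute)."""
    n = len(target)
    h_cond = 0.0
    for v in set(attribute):
        subset = [t for t, a in zip(target, attribute) if a == v]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(target) - h_cond

Z = ['T', 'T', 'F', 'F', 'T', 'F', 'F', 'T']   # hypothetical attribute column
T = ['T', 'T', 'F', 'F', 'F', 'F', 'T', 'T']   # hypothetical target column
print(entropy(Z), information_gain(T, Z), joint_entropy(Z, T))
```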

  22. FP Mining • Draw the itemset lattice and indicate: (M) Maximal, (C) Closed, (N) frequent but not closed/maximal, (I) infrequent. Assume minsup = 0.3 • Association rules: Find a pair of itemsets a, b for which the following holds:

  23. FP Mining Items = {Bread, Cookies, Diapers, Milk}, minsup = 0.3 → support count ≥ 2. Labels in the lattice: • B: N, C: C, D: N, M: C • {B,C}: M, {B,D}: I, {B,M}: I, {C,D}: N, {C,M}: C, {D,M}: C • {B,C,D}: I, {B,C,M}: I, {B,D,M}: I, {C,D,M}: M • {B,C,D,M}: I
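A sketch of how the M/C/N/I labels can be derived; the transaction list below is hypothetical, since the slide's transaction table is not in the transcript.

```python
# Classifying itemsets as maximal / closed / frequent / infrequent for a
# hypothetical transaction list (the slide's actual transactions are not shown).
from itertools import combinations

transactions = [
    {'Bread', 'Cookies'},            # hypothetical data; minsup 0.3 -> count >= 2
    {'Cookies', 'Milk'},
    {'Bread', 'Cookies', 'Milk'},
    {'Diapers', 'Milk'},
    {'Cookies', 'Diapers', 'Milk'},
    {'Diapers', 'Milk'},
]
items = sorted(set().union(*transactions))
min_count = 2

def support(itemset):
    return sum(itemset <= t for t in transactions)

# enumerate all non-empty itemsets and keep the frequent ones
frequent = {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c)) >= min_count}

for s in sorted(frequent, key=len):
    supersets = [f for f in frequent if s < f]
    maximal = not supersets                                    # no frequent superset
    closed  = all(support(f) < support(s) for f in supersets)  # no superset with equal support
    print(sorted(s), support(s), 'M' if maximal else 'C' if closed else 'N')
```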

  24. FP Mining Items = {Bread, Cookies, Diapers, Milk} • σ(a) = σ(b) • OK: {Milk} → {Cookies} • Not OK: {Bread} → {Milk, Cookies}

  25. Clustering • Cluster the points by means of the k-Means algorithm • Initialise with random cluster centers (0,6) and (7,2) • Describe the iterations until k-Means converges

  26. k-Means [Plot: points 1=(0,0), 2=(7,8), 3=(4,8), 4=(3,0) with initial centers c1 and c2] • Cluster using C1=(0,6), C2=(7,2) • Cluster 1: {1,3} • Cluster 2: {2,4} • E.g. point 2: • d(2,c1)=sqrt(7²+2²)=sqrt(53) • d(2,c2)=sqrt(0²+6²)=sqrt(36) -> c2 • Recompute cluster centers • C1’: ((0+4)/2,(0+8)/2)=(2,4) • C2’: ((3+7)/2,(0+8)/2)=(5,4)

  27. k-Means [Plot: the four points with the updated centers c1’ and c2’] • Cluster using c1’=(2,4) and c2’=(5,4) • Cluster 1: {1,4} • Cluster 2: {2,3} • E.g., point 4: • d(4,c1’)=sqrt(1²+4²)=sqrt(17) -> c1’ • d(4,c2’)=sqrt(2²+4²)=sqrt(20) • Recompute cluster centers • C1’’: ((0+3)/2,(0+0)/2)=(1.5,0) • C2’’: ((4+7)/2,(8+8)/2)=(5.5,8) • Reassigning the points to these new centers leaves the clusters unchanged • Converged, stop
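The same run expressed as a small k-Means sketch, using the point coordinates implied by the arithmetic on slides 26 and 27:

```python
# k-Means sketch for the run above, with coordinates that follow from the
# slide's arithmetic: 1=(0,0), 2=(7,8), 3=(4,8), 4=(3,0).
from math import dist

points  = {1: (0, 0), 2: (7, 8), 3: (4, 8), 4: (3, 0)}
centers = [(0, 6), (7, 2)]          # the given initial centers c1 and c2

assignment = {}
while True:
    # assignment step: each point goes to its nearest center
    new_assignment = {p: min(range(len(centers)), key=lambda i: dist(xy, centers[i]))
                      for p, xy in points.items()}
    if new_assignment == assignment:       # no point changed cluster -> converged
        break
    assignment = new_assignment
    # update step: each center becomes the mean of its assigned points
    for i in range(len(centers)):
        members = [points[p] for p, c in assignment.items() if c == i]
        centers[i] = (sum(x for x, _ in members) / len(members),
                      sum(y for _, y in members) / len(members))

print(assignment)   # {1: 0, 2: 1, 3: 1, 4: 0} -> clusters {1,4} and {2,3}
print(centers)      # final centers (1.5, 0) and (5.5, 8)
```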
