
Presentation Transcript


  1. Last lecture summary

  2. Test-data and Cross Validation

  3. [Figure: training error and testing error as a function of model complexity]

  4. Test set method • Split the data set into training and test sets. • Common ratio – 70:30. • Train the algorithm on the training set, assess its performance on the test set. • Disadvantages • It is simple, however it wastes data. • The test-set estimate of performance has high variance. Adopted from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
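
A minimal sketch of the test-set method in Python, using scikit-learn's train_test_split; the toy data is invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(100, 2)                      # 100 instances, 2 features (toy data)
y = (X[:, 0] > 0.5).astype(int)           # toy binary labels

# 70:30 split: train on the first part, assess on the second
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```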

  5. Stratified division • the same proportion of each class in the training and test sets
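
Continuing the sketch above, stratification is one extra argument in scikit-learn:

```python
# stratify=y keeps the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```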

  6. Training error cannot be used as an indicator of the model's performance due to overfitting. • Training set – train a range of models, or a given model with a range of values for its parameters. • Compare them on independent data – the validation set. • If the model design is iterated many times, some overfitting to the validation data can occur, so it may be necessary to keep aside a third set – the test set – on which the performance of the selected model is finally evaluated.

  7. LOOCV (leave-one-out cross validation) • choose one data point • remove it from the set • fit a model to the remaining data points • note your error, using the removed data point as the test Repeat these steps for all points. When you are done, report the mean squared error (in case of regression).
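
A short LOOCV sketch using scikit-learn's LeaveOneOut; linear regression and the toy data are stand-ins for whatever model is being assessed:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(30, 1)                       # toy regression data
y = 2 * X[:, 0] + 0.1 * rng.randn(30)

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # fit the remaining points, test on the one removed point
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])[0]
    errors.append((pred - y[test_idx][0]) ** 2)

print("LOOCV mean squared error:", np.mean(errors))
```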

  8. k-fold cross validation • randomly break the data into k partitions • remove one partition from the set • fit a model to the remaining data points • note your error, using the removed partition as the test set Repeat these steps for all partitions. When you are done, report the mean squared error (in case of regression).
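
The k-fold version differs only in how the partitions are drawn (here k = 10; data and model as in the LOOCV sketch above):

```python
from sklearn.model_selection import KFold

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    # fit the remaining points, test on the removed partition
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append(np.mean((pred - y[test_idx]) ** 2))

print("10-fold CV mean squared error:", np.mean(errors))
```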

  9. Selection and testing • The complete procedure for algorithm selection and estimation of its quality: • Divide the data into Train and Test sets. • Choose the algorithm by cross validation on the Train set. • Use this algorithm to construct a classifier using the Train set. • Estimate its quality on the Test set.
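
The whole procedure in one sketch, using k-NN as the family of algorithms being selected; choosing k by 5-fold CV is an assumption here, the slide leaves the algorithm open:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # toy labels

# 1. divide the data into Train and Test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. choose the algorithm (here: the value of k) by CV on Train only
best_k = max([1, 3, 5, 7, 9],
             key=lambda k: cross_val_score(
                 KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean())

# 3. construct the classifier on the whole Train set
clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)

# 4. estimate its quality on Test, used only once at the very end
print("chosen k:", best_k, "test accuracy:", clf.score(X_te, y_te))
```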

  10. Model selection via CV: polynomial regression. Adopted from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
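
A sketch of the polynomial-degree selection the slide's figure illustrates; the noisy sine data is invented and the range of degrees is arbitrary:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.randn(40)   # noisy toy curve

for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, x[:, None], y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(degree, round(mse, 3))    # choose the degree with the lowest CV error
```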

  11. Nearest Neighbors Classification

  12. [Figure: instances in the feature space]

  13. Similarity sij is a quantity that reflects the strength of the relationship between two objects or two features. • Distance dij measures dissimilarity. • Dissimilarity measures the discrepancy between two objects based on several features. • Distance satisfies the following conditions: • distance is always positive or zero (dij ≥ 0) • distance is zero if and only if it is measured from an object to itself (dii = 0) • distance is symmetric (dij = dji) • In addition, if distance satisfies the triangle inequality dik ≤ dij + djk (the analogue of |x+y| ≤ |x|+|y|), then it is called a metric.

  14. Distances for quantitative variables • Minkowski distance (Lp norm): dij = (Σk |xik − xjk|^p)^(1/p) • p = 1 gives the Manhattan distance, p = 2 the Euclidean distance • distance matrix – a matrix with all pairwise distances
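
A small sketch with SciPy: pdist computes all pairwise distances, squareform turns them into the distance matrix; the three points are made up:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])

print(pdist(X, metric="minkowski", p=1))  # p = 1, Manhattan: [7. 2. 5.]
print(pdist(X, metric="minkowski", p=2))  # p = 2, Euclidean: ~[5. 1.41 3.61]

D = squareform(pdist(X))                  # distance matrix, all pairwise distances
print(D)
```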

  15. Manhattan distance [Figure: two points (x1, y1) and (x2, y2); the distance |x1 − x2| + |y1 − y2| is the length of the axis-parallel path between them]

  16. Euclidean distance [Figure: two points (x1, y1) and (x2, y2); the distance sqrt((x1 − x2)^2 + (y1 − y2)^2) is the length of the straight line between them]

  17. k-NN • supervised learning • the target function f may be • discrete-valued (classification) • real-valued (regression) • We assign a query point to the class of the training instances most similar to it.
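
A minimal k-NN classification sketch with scikit-learn, on invented toy data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = (X[:, 0] > 0.5).astype(int)           # toy labels

# "training" merely stores the instances (lazy learning, see the next slide)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[0.9, 0.2]]))          # majority class of the 3 nearest neighbors
```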

  18. k-NN is a lazy learner • lazy learning • generalization beyond the training data is delayed until a query is made to the system • as opposed to eager learning, where the system tries to generalize the training data before receiving queries

  19. Which k is best? Choose it by cross validation. [Figure: k-NN decision boundaries for k = 1 and k = 15. A k that is too small fits noise and outliers (overfitting); a k that is too large smooths out distinctive behavior.] Hastie et al., The Elements of Statistical Learning

  20. Real-valued target function • The algorithm calculates the mean value of the k nearest training examples. • Example (k = 3): the three nearest neighbors have values 12, 14 and 10, so the predicted value = (12 + 14 + 10) / 3 = 12.
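
The same computation with scikit-learn's KNeighborsRegressor; the 1-D training set is invented so that the three nearest neighbors of the query carry the slide's values 12, 14 and 10:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([12.0, 14.0, 10.0, 50.0])

reg = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(reg.predict([[2.0]]))   # mean of 12, 14, 10 -> [12.]
```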

  21. Distance-weighted NN • Give greater weight to closer neighbors. • Example (k = 4): two neighbors of one class at distances 1 and 2, two of the other class at distances 4 and 5. • unweighted: 2 votes vs. 2 votes (a tie) • weighted by 1/d^2: 1/1^2 + 1/2^2 = 1.25 votes vs. 1/4^2 + 1/5^2 ≈ 0.103 votes
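
A quick check of the slide's arithmetic (weight 1/d^2; for comparison, scikit-learn's weights="distance" option weights by 1/d instead):

```python
# one class has neighbors at distances 1 and 2, the other at 4 and 5
votes_a = 1 / 1**2 + 1 / 2**2   # 1.25 votes
votes_b = 1 / 4**2 + 1 / 5**2   # 0.1025 votes
print(votes_a, votes_b)         # the weighted vote breaks the 2:2 tie
```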

  22. k-NN issues • The curse of dimensionality is a problem. • Significant computation may be required to process each new query: to find the nearest neighbors, one has to compute the distance to every stored training example. • Efficient indexing of the stored training examples helps • e.g., a kd-tree
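
A kd-tree sketch with SciPy's cKDTree; the sizes are arbitrary:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
X = rng.rand(10000, 3)                    # stored training examples

tree = cKDTree(X)                         # build the index once
dist, idx = tree.query(rng.rand(3), k=5)  # 5 nearest neighbors of one query
print(idx, dist)
```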

  23. Cluster Analysis

  24. We have data, we don’t know classes. • Assign data objects into groups (called clusters) so that data objects from the same cluster are more similar to each other than objects from different clusters.


  26. [Figure: stages of the clustering process] From On Clustering Validation Techniques, M. Halkidi, Y. Batistakis, M. Vazirgiannis

  27. How would you solve the problem? • How to find clusters? • Group together most similar patterns.

  28. Single linkage (nearest-neighbor method) Based on A Tutorial on Clustering Algorithms, http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

  29. [Map: the six Italian cities – Milano, Torino, Florence, Rome, Bari, Naples – with their pairwise road distances; the closest pair, Torino–Milano (138), is merged first]

  30. Single-linkage distance to Bari: d({MI, TO}, BA) = min(877, 996) = 877

  31. Single-linkage distance to Florence: d({MI, TO}, FL) = min(295, 400) = 295

  32. Single-linkage distance to Naples: d({MI, TO}, NA) = min(754, 869) = 754

  33. Single-linkage distance to Rome: d({MI, TO}, RM) = min(564, 669) = 564

  34.–37. [Maps: the successive single-linkage merges – Torino–Milano; Rome–Naples; Bari joins Rome–Naples; Florence joins Rome–Naples–Bari]

  38. Dendrogram • merge order: Torino–Milano; Rome–Naples; Bari joins Rome–Naples; Florence joins Rome–Naples–Bari; finally Torino–Milano and Rome–Naples–Bari–Florence are joined.

  39. Dendrogram with merge heights (dissimilarities): Torino–Milano (138); Rome–Naples (219); + Bari (255); + Florence (268); final join of the two clusters (295). Leaves, left to right: BA, NA, RM, FL, MI, TO.

  40. [Animation: the dendrogram built step by step alongside the map]
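
The whole example as a SciPy sketch. The distance matrix is reconstructed from these slides and Matteucci's tutorial, so treat the values as illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([[  0, 662, 877, 255, 412, 996],
              [662,   0, 295, 468, 268, 400],
              [877, 295,   0, 754, 564, 138],
              [255, 468, 754,   0, 219, 869],
              [412, 268, 564, 219,   0, 669],
              [996, 400, 138, 869, 669,   0]], dtype=float)

# squareform converts the matrix to the condensed form linkage expects
Z = linkage(squareform(D), method="single")
print(Z[:, 2])   # merge heights: [138. 219. 255. 268. 295.]
```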

  41. Complete linkage (furthest-neighbor method)

  42. [Map: the six cities again; Torino–Milano (138) is still the first merge]

  43. Complete-linkage distance to Bari: d({MI, TO}, BA) = max(877, 996) = 996

  44. Complete-linkage distance to Florence: d({MI, TO}, FL) = max(295, 400) = 400

  45. Complete-linkage distance to Naples: d({MI, TO}, NA) = max(754, 869) = 869

  46. Complete-linkage distance to Rome: d({MI, TO}, RM) = max(564, 669) = 669

  47. [Map: the resulting complete-linkage clusters]
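
Complete linkage is a one-line change to the single-linkage sketch above:

```python
# same distance matrix D; only the cluster-to-cluster distance rule changes
Z = linkage(squareform(D), method="complete")
print(Z[:, 2])   # the first merges (138, 219) agree; later ones happen higher
```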
