Understanding Similarity and Distance in Data Mining

Learn about the importance of similarity and distance in data mining, their applications in clustering and machine learning methods, and the impact of validation and overfitting. Explore k-nearest neighbour classifiers and distance measures such as Euclidean, Manhattan, and Chebychev. Understand concepts like geographic clusters and the Jaccard coefficient.


Presentation Transcript


  1. Data Mining (and machine learning). DM Lecture 6: Similarity and Distance. David Corne and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com. These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  2. Today
     • Similarity / distance between data records is used in:
       • Clustering
       • Many machine learning methods
       • Many, many, many practical applications
     • More fundamentally:
       • Sometimes data records are entirely unstructured, e.g. free-text answers in a questionnaire, news articles, etc.
       • To do DM/ML they need to be structured somehow
       • Then we can cluster them, etc.
     • Plus: notes about validation and overfitting

  3. k-nearest neighbour. The simplest machine learning method of all! [Figure: points labelled up and down]

  4. k-nearest neighbour. A new point: should it be classed as Up or Down? [Figure: points labelled up and down]

  5. k-nearest neighbour. A 1-NN classifier says: UP. [Figure: points labelled up and down]

  6. k-nearest neighbour. A 3-NN classifier says: DOWN. [Figure: points labelled up and down]

  7. k-nearest neighbour. A 5-NN classifier says: DOWN. [Figure: points labelled up and down]

  8. k-nearest neighbour. What might 3-NN say in this case, and would it be correct? [Figure: points labelled up and down]

  9. k-nearest neighbour. What might 3-NN say in this case, and would it be correct? [Figure: points labelled up and down]

  10. K-NN
      • Extremely simple
      • Often very good performance
      • Most suitable for datasets where there are clear 'geographic' clusters
      • Even on complex data, provides a good guess
      • Like almost all DM/ML techniques, it relies exclusively on a distance measure: different ways to measure distance will give different results.
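The voting behaviour described on the k-NN slides can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the data points are hypothetical, Euclidean distance (via `math.dist`, Python 3.8+) is assumed as the distance measure, and the class is chosen by simple majority vote.

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    """train is a list of (point, label) pairs; point and query are tuples of numbers."""
    # Rank the training records by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # Majority vote among the labels of the k closest records.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: one "up" point is very close to the query,
# but most of the nearby points are "down".
train = [((1.0, 1.0), "up"), ((1.5, 1.6), "down"), ((1.6, 1.4), "down"),
         ((2.0, 2.0), "down"), ((0.0, 3.0), "up"), ((3.0, 0.2), "up")]
query = (1.1, 1.1)

print(knn_classify(train, query, 1))  # up
print(knn_classify(train, query, 3))  # down
print(knn_classify(train, query, 5))  # down
```

With this made-up data the answer changes between k = 1 and k = 3, which is the same effect the slides illustrate.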

  11. k-nearest neighbour. By 3-NN, is the red car a Ford or a Chevrolet? [Figure: Ford and Chevrolet cars plotted by cost ($) and miles per gallon]

  12. k-nearest neighbour. By 3-NN, is the red car a Ford or a Chevrolet? [Figure: Ford and Chevrolet cars plotted by cost (cents) and miles per gallon]
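Changing the cost axis from dollars to cents can change the nearest neighbours even though the cars themselves are unchanged. The sketch below uses hypothetical prices and fuel figures, assumes Euclidean distance, and uses 1-NN rather than 3-NN to keep it short.

```python
import math

def nearest_label(cars, query):
    """Label of the single closest car (1-NN) under Euclidean distance."""
    return min(cars, key=lambda rec: math.dist(rec[0], query))[1]

def cost_in_cents(point):
    """Convert a (cost in dollars, miles per gallon) point to (cost in cents, mpg)."""
    cost_dollars, mpg = point
    return (cost_dollars * 100, mpg)

# Hypothetical cars: (cost in dollars, miles per gallon).
cars_dollars = [((20000, 30), "Ford"), ((20005, 20), "Chevrolet"),
                ((19990, 32), "Ford"), ((20010, 18), "Chevrolet")]
red_car_dollars = (20004, 29)

# The same cars with cost expressed in cents.
cars_cents = [(cost_in_cents(p), label) for p, label in cars_dollars]
red_car_cents = cost_in_cents(red_car_dollars)

print(nearest_label(cars_dollars, red_car_dollars))  # Ford
print(nearest_label(cars_cents, red_car_cents))      # Chevrolet
```

In cents the cost differences are 100 times larger, so they swamp the miles-per-gallon differences and the closest car changes.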

  13. Distance measures. Euclidean distance: Point 1 is $x = (x_1, x_2, \ldots, x_n)$. Point 2 is $y = (y_1, y_2, \ldots, y_n)$. Euclidean distance is $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.
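As a small worked example of Euclidean distance, here is a sketch in Python. The function name and the two sample points are illustrative choices, not taken from the slides.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared differences between corresponding fields."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Differences are 3, 4 and 0, so the distance is sqrt(9 + 16 + 0).
print(euclidean((1, 2, 3), (4, 6, 3)))  # 5.0
```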

  14. Distance measures. Manhattan distance (aka city-block distance): Point 1 is $x = (x_1, x_2, \ldots, x_n)$. Point 2 is $y = (y_1, y_2, \ldots, y_n)$. Manhattan distance is $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$ (in case you don't know: $|x|$ is the absolute value of $x$).
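A minimal sketch of Manhattan distance, using illustrative sample points that are not from the slides:

```python
def manhattan(x, y):
    """Sum of absolute differences between corresponding fields."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Differences are 3, 4 and 0, so the distance is 3 + 4 + 0.
print(manhattan((1, 2, 3), (4, 6, 3)))  # 7
```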

  15. Distance measures. Chebychev distance: Point 1 is $x = (x_1, x_2, \ldots, x_n)$. Point 2 is $y = (y_1, y_2, \ldots, y_n)$. Chebychev distance is $d(x, y) = \max_{i} |x_i - y_i|$.
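A minimal sketch of Chebychev distance, again with illustrative sample points rather than data from the slides:

```python
def chebychev(x, y):
    """Largest absolute difference over all fields."""
    return max(abs(a - b) for a, b in zip(x, y))

# Differences are 3, 4 and 0, so the distance is the largest of them.
print(chebychev((1, 2, 3), (4, 6, 3)))  # 4
```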

  16. Distance measures. Proportion different, e.g. for (red, male, big, hot) and (green, male, small, hot): Point 1 is $x = (x_1, x_2, \ldots, x_n)$. Point 2 is $y = (y_1, y_2, \ldots, y_n)$. Proportion different is $d(x, y) = \frac{|\{i : x_i \neq y_i\}|}{n}$, i.e. the number of fields that differ divided by the number of fields.
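Applied to the two example records on this slide, the proportion-different measure can be sketched as follows. The function name is an illustrative choice.

```python
def proportion_different(x, y):
    """Fraction of fields whose values differ between the two records."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of fields")
    return sum(a != b for a, b in zip(x, y)) / len(x)

p1 = ("red", "male", "big", "hot")
p2 = ("green", "male", "small", "hot")
# Colour and size differ; sex and temperature match: 2 of 4 fields.
print(proportion_different(p1, p2))  # 0.5
```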

  17. Distance measures. Jaccard coefficient, e.g. for (bread, cheese, milk, nappies) and (batteries, cheese): Point 1 is a set $A$. Point 2 is a set $B$. Jaccard coefficient is $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$: the number of things that appear in both (1: cheese), divided by the total number of different things (5), so here $J(A, B) = 1/5$.
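The shopping-basket example can be checked with Python sets. Note that the Jaccard coefficient is a similarity (1 means identical sets), so a common way to turn it into a distance is to use 1 minus the coefficient. Treating two empty sets as identical is an assumption made here to avoid dividing by zero.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

basket_a = {"bread", "cheese", "milk", "nappies"}
basket_b = {"batteries", "cheese"}

similarity = jaccard(basket_a, basket_b)
print(similarity)      # 0.2  (1 shared item out of 5 distinct items)
print(1 - similarity)  # 0.8  (the corresponding Jaccard distance)
```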

  18. Using common sense. Data vectors are (colour, manufacturer, top-speed), e.g. (red, ford, 180), (yellow, toyota, 160), (silver, bugatti, 300). What distance measure will you use?

  19. Using common sense. Data vectors are (colour, manufacturer, top-speed), e.g. (dark, ford, high), (medium, toyota, high), (light, bugatti, very-high). What distance measure will you use?

  20. Using common sense. With different types of fields, e.g.
      p1 = (red, high, 0.5, UK, 12)
      p2 = (blue, high, 0.6, France, 15)
      you could simply define a distance measure for each field individually, and add them up. Similarly, you could divide the vectors into categorical/ordinal and numeric parts:
      p1a = (red, high, UK)         p1b = (0.5, 12)
      p2a = (blue, high, France)    p2b = (0.6, 15)
      and say that dist(p1, p2) = dist(p1a, p2a) + dist(p1b, p2b), using appropriate measures for the two kinds of vector.
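One way to realise the split-and-add idea is sketched below. The choice of measures is an assumption for illustration: proportion different for the non-numeric part and Euclidean distance for the numeric part. The slide leaves the choice open, and in practice the numeric fields would usually be normalised first.

```python
import math

def proportion_different(x, y):
    """Fraction of fields whose values differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mixed_distance(p, q, categorical_idx, numeric_idx):
    """Split each record by field type, measure each part, and add the results."""
    pa, qa = [p[i] for i in categorical_idx], [q[i] for i in categorical_idx]
    pb, qb = [p[i] for i in numeric_idx], [q[i] for i in numeric_idx]
    return proportion_different(pa, qa) + euclidean(pb, qb)

p1 = ("red", "high", 0.5, "UK", 12)
p2 = ("blue", "high", 0.6, "France", 15)

# Non-numeric part: 2 of 3 fields differ. Numeric part: sqrt(0.1**2 + 3**2).
d = mixed_distance(p1, p2, categorical_idx=(0, 1, 3), numeric_idx=(2, 4))
print(round(d, 4))
```

Here the numeric part contributes about 3.0 and the non-numeric part at most 1.0, which shows why the relative scale of the two parts needs some thought.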

  21. Notes …
      • Suppose one field varies hugely (standard deviation is 100), and one field varies a tiny amount (standard deviation 0.001): why is Euclidean distance a bad idea? What can you do? Normalising fields individually is often a good idea. When a numerical field is normalised, you scale it so that the mean is 0 and the standard deviation is 1.
      • What is the distance between these two? "Star Trek: Voyager" and "Satr Trek: Voyagger". Edit distance is useful in many applications: see http://www.merriampark.com/ld.htm
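Both notes on this slide can be sketched briefly. `standardise` scales a numeric field to mean 0 and standard deviation 1 (using the population standard deviation, which is an assumption). `levenshtein` assumes that the edit distance meant here is Levenshtein distance, which counts single-character insertions, deletions and substitutions.

```python
import statistics

def standardise(values):
    """Scale a numeric field so that its mean is 0 and its standard deviation is 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / std for v in values]

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # insert cb
                               previous[j - 1] + (ca != cb)))  # substitute or match
        previous = current
    return previous[-1]

print(standardise([100.0, 300.0]))  # [-1.0, 1.0]
# "Star" -> "Satr" costs 2 substitutions; "Voyager" -> "Voyagger" costs 1 insertion.
print(levenshtein("Star Trek: Voyager", "Satr Trek: Voyagger"))  # 3
```

A variant that counts a swap of adjacent characters as a single edit (Damerau-Levenshtein) would give 2 for the same pair of titles.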

  22. Text: a prime example of unstructured data

  23. How did I get these vectors from these two 'documents'?
      Document 1: <h1> Compilers</h1> <p> The Guardian uses several compilers for its daily cryptic crosswords. One of the most frequently used is Araucaria, and one of the most difficult is Bunthorne.</p>
      Document 2: <h1> Compilers: lecture 1 </h1> <p> This lecture will introduce the concept of lexical analysis, in which the source code is scanned to reveal the basic tokens it contains. For this, we will need the concept of regular expressions (r.e.s).</p>
      Vectors: (26, 2, 2) and (35, 2, 0)

  24. What about these two vectors?
      Document 1: <h1> Compilers</h1> <p> The Guardian uses several compilers for its daily cryptic crosswords. One of the most frequently used is Araucaria, and one of the most difficult is Bunthorne.</p>
      Document 2: <h1> Compilers: lecture 1 </h1> <p> This lecture will introduce the concept of lexical analysis, in which the source code is scanned to reveal the basic tokens it contains. For this, we will need the concept of regular expressions (r.e.s).</p>
      Vectors: (1, 1, 1, 0, 0, 0) and (0, 0, 0, 1, 1, 1)

  25. An unfair question, but I got that by using the following word vector: (Crossword, Cryptic, Difficult, Expression, Lexical, Token). If a document contains the word 'crossword', it gets a 1 in position 1 of the vector, otherwise 0. If it contains 'lexical', it gets a 1 in position 5, otherwise 0, and so on. How similar would be the vectors for two pages about crossword compilers? The key to measuring document similarity is turning documents into vectors based on specific words and their frequencies.
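The word-vector encoding just described can be sketched as follows. Matching a term when any word of the document starts with it (so that "crosswords" matches "crossword") is a crude assumption standing in for proper stemming. The two documents are the texts from the earlier slides with the HTML tags removed.

```python
import re

TERMS = ("crossword", "cryptic", "difficult", "expression", "lexical", "token")

def binary_vector(text, terms=TERMS):
    """Position i is 1 if some word of the text starts with terms[i], otherwise 0."""
    words = re.findall(r"[a-z]+", text.lower())
    return tuple(int(any(w.startswith(t) for w in words)) for t in terms)

doc1 = ("Compilers. The Guardian uses several compilers for its daily cryptic "
        "crosswords. One of the most frequently used is Araucaria, and one of "
        "the most difficult is Bunthorne.")
doc2 = ("Compilers: lecture 1. This lecture will introduce the concept of "
        "lexical analysis, in which the source code is scanned to reveal the "
        "basic tokens it contains. For this, we will need the concept of "
        "regular expressions (r.e.s).")

print(binary_vector(doc1))  # (1, 1, 1, 0, 0, 0)
print(binary_vector(doc2))  # (0, 0, 0, 1, 1, 1)
```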

  26. Turning a document into a vector. We start with a template for the vector, which needs a master list of terms. A term can be a word, or a number, or anything that appears frequently in documents. There are almost 200,000 words in English: it would take much too long to process document vectors of that length. Commonly, vectors are made from a small number (50 to 1000) of the most frequently occurring words. However, the master list usually does not include words from a stoplist, which contains words such as the, and, there, which, etc. Why?
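Building a master list from the most frequent words, while skipping a stoplist, might look like the sketch below. The stoplist here is a tiny illustrative one (the first four words come from the slide, the rest are added as assumptions), and the sample documents are made up.

```python
from collections import Counter
import re

# Tiny illustrative stoplist; real stoplists are much longer.
STOPLIST = {"the", "and", "there", "which", "a", "of", "is", "in"}

def build_master_list(documents, size):
    """Return the `size` most frequent words across all documents, excluding stop words."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in STOPLIST)
    return [term for term, _ in counts.most_common(size)]

docs = ["The cat and the dog", "the cat sat there", "which cat?"]
master = build_master_list(docs, 2)
print(master[0])  # cat
```

Without the stoplist, "the" would top the list here even though it says nothing about what any document is about.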

  27. Turning a document into a vector. Suppose our master list is: (banana, cat, dog, fish, read). Suppose documents 1, 2, 3 are:
      1. "Bananas are grown in hot countries, and cats like bananas."
      2. "It is raining cats and dogs today"
      3. "cats like seafood"
      Assuming I first do stemming, or equivalent, the vector encodings of these documents could be:
      1. (2, 1, 0, 0, 0)
      2. (0, 1, 1, 0, 0)
      3. (0, 1, 0, 0, 0)
      What distance measure would you use? Does it make any sense?
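The three encodings above can be reproduced with a short sketch. Stripping a trailing "s" is a deliberately crude assumption standing in for real stemming; it is enough for these three sentences but not for general text.

```python
import re

MASTER_LIST = ("banana", "cat", "dog", "fish", "read")

def crude_stem(word):
    """Assumption: dropping a trailing 's' stands in for a real stemmer."""
    return word[:-1] if word.endswith("s") else word

def count_vector(text, master_list=MASTER_LIST):
    """Count how often each master-list term occurs in the text after stemming."""
    stems = [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    return tuple(stems.count(term) for term in master_list)

docs = ["Bananas are grown in hot countries, and cats like bananas.",
        "It is raining cats and dogs today",
        "cats like seafood"]

for doc in docs:
    print(count_vector(doc))
# (2, 1, 0, 0, 0)
# (0, 1, 1, 0, 0)
# (0, 1, 0, 0, 0)
```

Document 3 scores 0 for "fish" even though it mentions seafood, which is one reason to question how much sense distances between such vectors make.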

  28. Text mining. Encoding documents as vectors is a hot topic, and there are many important and valuable applications, e.g.:
      • Predicting 'sentiment': if a document describes a movie, how highly does it rate that movie, on a scale of 1 to 10?
      • What document(s) are the most appropriate for a search engine to retrieve from a search query?
      • Is document A plagiarized from document B?
      And there are better standard ways to encode documents, such as TF-IDF; I cover that in the Web Intelligence module.

  29. Overfitting. Suppose we train a classifier to tell the difference between handwritten t and c, using only these examples: [Figure: handwritten examples labelled "ts" and "cs"]. The classifier will learn easily. It will probably give 100% correct prediction on these cases.

  30. Overfitting. BUT this classifier will probably generalise very poorly; it will perform very badly on a test set. E.g. here is potential (very likely) performance on certain unseen cases: [Figure: two unseen handwritten characters, one annotated "It will probably predict that this is a c" and the other "It will probably predict that this is a t"]. Why?

  31. Avoiding Overfitting
      • It can be avoided by using as much training data as possible, ensuring as much diversity as possible in the data.
        • This cuts down on the potential existence of features that might be discriminative in the training data, but are otherwise spurious.
      • It can be avoided by jittering (adding noise).
        • During training, every time an input pattern is presented, it is randomly perturbed. The idea of this is that spurious features will be 'washed out' by the noise, but valid discriminatory features will remain.
        • The problem with this approach is how to correctly choose the level of noise.
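Jittering can be sketched as a generator that perturbs each input pattern afresh every time it is presented. Gaussian noise and the `noise_level` parameter are assumptions made for illustration (the slide only says the pattern is randomly perturbed), and the actual model update is left out.

```python
import random

def jitter(pattern, noise_level, rng):
    """Return a copy of the input pattern with Gaussian noise added to each value."""
    return [x + rng.gauss(0.0, noise_level) for x in pattern]

def jittered_presentations(patterns, epochs, noise_level, rng):
    """Yield (perturbed inputs, target) pairs, re-perturbing on every presentation."""
    for _ in range(epochs):
        for inputs, target in patterns:
            yield jitter(inputs, noise_level, rng), target

# Hypothetical training patterns: (input values, class label).
patterns = [([0.2, 0.9, 0.4], "t"), ([0.7, 0.1, 0.5], "c")]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

for inputs, target in jittered_presentations(patterns, epochs=2, noise_level=0.05, rng=rng):
    print(target, [round(v, 3) for v in inputs])
```

The labels are left untouched. Only the inputs are perturbed, and `noise_level` is the quantity that the slide warns is hard to choose.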

  32. Avoiding Overfitting II. A typical curve showing performance during training. But here is performance on unseen data, not in the training set. [Figure: error plotted against time (for methods like neural networks), with curves for training data and validation data and a point marked "Starting to overfit"]

  33. Avoiding Overfitting III. Another approach is early stopping. During training, keep track of the network's performance on a separate validation set of data. At the point where error continues to improve on the training set, but starts to get worse on the validation set, training should be stopped, since the network is starting to overfit the training data. The problem here is that this point is often far from clear-cut.
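A minimal early-stopping loop is sketched below. The `patience` parameter (how many epochs without improvement to tolerate before stopping) is not from the slides; it is one common way of coping with the stopping point not being clear-cut. `train_step` and `validation_error` are hypothetical callables standing in for a real training procedure, and the validation errors are simulated.

```python
def train_with_early_stopping(train_step, validation_error, max_epochs, patience):
    """Train until validation error has not improved for `patience` epochs.

    Returns (best_epoch, best_validation_error).
    """
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_step()                # one pass over the training data
        error = validation_error()  # error on the separate validation set
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break                   # validation error has stopped improving
    return best_epoch, best_error

# Simulated validation errors: improving at first, then getting worse.
simulated = iter([0.9, 0.6, 0.4, 0.35, 0.37, 0.36, 0.40, 0.45])
result = train_with_early_stopping(train_step=lambda: None,
                                   validation_error=lambda: next(simulated),
                                   max_epochs=8, patience=3)
print(result)  # (4, 0.35)
```

In a real setting the model's weights would also be saved at the best epoch, so that the version from before overfitting began can be restored.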

  34. Coursework C. Invent two different but simple ways to use the Jaccard coefficient for the problem of finding the distance between a document and a set of documents. Explain both within about 200 words (half a page).

  35. Coursework C. Next week: feature selection
