Some of these slides are not printer friendly!! • You may want to delete them, or delete the backgrounds before you physically print them
Similarity Search for Data Mining • Two Key Issues • How to measure similarity properly • Invariances (for distance measures) • How to do it fast • Early abandoning • Lower bounding • Indexing • Hashing • etc. Some highlights today; more in the coming weeks
Quotation • “An ability to assess similarity lies close to the core of cognition. The sense of sameness is the very keel and backbone of our thinking. An understanding of problem solving, categorization, memory retrieval, inductive reasoning, and other cognitive processes require that we understand how humans assess similarity.” MIT Encyclopedia of the Cognitive Sciences, Cambridge, MA, MIT Press 2006, pp. 763-765
Similarity • Measuring the similarity between two objects allows: • High quality clustering • Classification using the nearest neighbor algorithm (the nearest neighbor is the most similar neighbor) • Outlier detection (anomaly detection) • Motif detection (repeated pattern detection) • Query by content (similarity search) • Similarity joins • …
Similarity vs. Distance We informally use Similarity and Distance interchangeably. Distance usually ranges from zero to infinity (or some large number), with smaller values implying the two items are more alike. Similarity usually ranges from zero to one, with smaller values implying the two items are less alike. We can convert by just taking the reciprocal (handling zero as a special case)
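As a sketch of this conversion, here is a hypothetical Python helper (the function name and the clamping to 1 are my own choices; the slide only specifies taking the reciprocal and handling zero specially):

```python
def distance_to_similarity(d):
    """Map a distance in [0, infinity) to a similarity in (0, 1].

    Zero distance (identical items) is handled as a special case,
    and the reciprocal is clamped so similarity never exceeds 1.
    """
    if d == 0:
        return 1.0            # identical items: maximal similarity
    return min(1.0, 1.0 / d)  # larger distance -> smaller similarity
```

With this scheme a pair at distance 2 gets similarity 0.5, while any pair closer than distance 1 is treated as maximally similar.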
Metric vs. Measure We can speak of • Distance measure • Distance metric • Similarity measure • Similarity metric • However measure and metric are not the same (but many papers confuse them)
Intuitions behind desirable distance measure properties
D(A,B) = D(B,A) (Symmetry). Otherwise you could claim “Alex looks like Bob, but Bob looks nothing like Alex.”
D(A,A) = 0 (Constancy of Self-Similarity). Otherwise you could claim “Alex looks more like Bob, than Bob does.”
D(A,B) = 0 iff A = B (Positivity, also called Separation). Otherwise there are objects in your world that are different, but you cannot tell apart.
D(A,B) ≤ D(A,C) + D(B,C) (Triangular Inequality). Otherwise you could claim “Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl.”
A measure need only satisfy the first three properties; a metric must satisfy all four.
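These properties can be checked empirically. The following sketch (Python, not part of the original slides) verifies that ordinary Euclidean distance satisfies symmetry, constancy of self-similarity and the triangular inequality on a set of random points:

```python
import math
import random

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(0)
pts = [tuple(random.random() for _ in range(2)) for _ in range(15)]

for a in pts:
    assert euclid(a, a) == 0.0                       # constancy of self-similarity
    for b in pts:
        assert euclid(a, b) == euclid(b, a)          # symmetry
        for c in pts:
            # triangular inequality (small epsilon for float round-off)
            assert euclid(a, b) <= euclid(a, c) + euclid(b, c) + 1e-12
```

Of course, passing on random samples is evidence rather than proof; for Euclidean distance all four properties can be proven analytically.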
Let’s do a quick review of what measuring similarity lets us do
Similarity: Clustering Given an unlabeled dataset, arrange the objects into groups by their mutual similarity Iguania Alligatoridae Crocodylidae Alligatorinae Amphisbaenia Chelonia Elseya dentata Xantusia vigilis Cricosaura typica Caiman crocodilus Phrynosoma taurus Glyptemys muhlenbergii Phrynosoma ditmarsi Phrynosoma douglassii Alligator mississippiensis Phrynosoma hernandesi Tomistoma schlegelii Crocodylus johnstoni Crocodylus cataphractus Phrynosoma braconnieri
Similarity: Nearest Neighbor Classification Given a labeled training set, classify future unlabeled examples. What type of arrowhead is this? Basal or Articulate? For he is well placed among the fools who does not distinguish one class from another* *Paradiso -- Canto XIII 115
Similarity Joins Given two data collections, link items occurring in each We can take two different families of butterflies, Limenitidinae and Danainae, and find the most similar shape between them Danainae Limenitidinae
Limenitidinae (subset): Adelpha iphiclus, Harma theobene, Aterica galene, Limenitis reducta, Limenitis archippus, Catuna crithea. Danainae (subset): Euploea camaralzeman, Danaus affinis, Greta morgane, Danaus plexippus, Tellervo zoilus, Placidina euryanassa.
Why would the two most similar shapes also have similar colors and patterns? That can’t be a coincidence. This is an example of Müllerian mimicry, not Batesian mimicry as commonly believed: Limenitis archippus (the Viceroy) and Danaus plexippus (the Monarch).
Similarity Annotation Given an object of interest, automatically obtain additional information about it. Friedrich Bertuch’s Bilderbuch für Kinder (Weimar, 1798–1830). This page was published in 1821. Bilderbuch is a children’s encyclopedia of natural history, published in 237 parts over nearly 40 years in Germany. Suppose we encountered this page and wanted to know more about the insect. The back of the page says “Stockinsekt”, which we might be able to parse to “Stick Insect”, but what kind? How large is it? Where do they live? Suppose we issue a query to Google search for “Stick Insect” and further filter the results by shape similarity….
Most images returned by the Google image query “stick insect” do not segment into simple shapes, but some do, including the 296th one. It looks like our insect is a Thorny Legged Stick Insect, or Eurycantha calcarata, from Southeast Asia. Note that in addition to rotation invariance, our distance measure must be invariant to other differences: the real insect has a tail that extends past its legs, asymmetric positions of its limbs, etc.
Similarity: Query by Content (Similarity Search) Given a large data collection, find the k most similar objects to an object of interest. Petroglyphs • They appear worldwide • Over a million in America alone • Surprisingly little is known about them Petroglyphs are images incised in rock, usually by prehistoric peoples. They were an important form of pre-writing symbols, used in communication from approximately 10,000 B.C.E. to modern times. (Wikipedia) who so sketched out the shapes there?* .. they would strike the subtlest minds with awe* *Purgatorio -- Canto XII 6
Similarity Search Given a database C of N objects, a distance measure D(), and a query Q, find object Ci , such that D(Q,Ci) is minimized. C Q How do we do this correctly? How do we do this fast?
In many cases, we have real-valued features, so we think about the objects in Euclidean space, and use Euclidean distance as our “similarity” measure. However, this is not always possible, and even when it is, it is not always a good idea. Q = {2.1, 8.0} (plotted with Length on one axis and Weight on the other)
The two most common similarity search variants are: • K nearest neighbor search, which returns K items • Range search, which returns between zero items and the full dataset
The two most common similarity search variants are: • K nearest neighbor search. Find me the five closest Starbucks to my office. • Range search. Find me all Starbucks within 4 miles of my office.
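The two variants can be sketched in a few lines of Python (the function names and example data are hypothetical; the slides use Matlab):

```python
import math

def euclid(q, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def knn_search(Q, database, k):
    # always returns exactly k items (assuming the database has at least k)
    return sorted(database, key=lambda C: euclid(Q, C))[:k]

def range_search(Q, database, r):
    # may return anything from zero items to the full dataset
    return [C for C in database if euclid(Q, C) <= r]
```

For example, with database = [(0, 0), (1, 1), (5, 5)] and Q = (0, 0), knn_search(Q, database, 2) returns the two closest points, while range_search(Q, database, 2.0) returns every point within distance 2.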
Brute Force Search Q = {2.1, 8.0}
Sequential_Scan(Q) Algorithm
1. best_so_far = infinity;
2. for all sequences in database
3.     true_dist = ED(Q, Ci)
4.     if true_dist < best_so_far
5.         best_so_far = true_dist;
6.         index_of_best_match = i;
7.     endif
8. endfor
ED(Q, C1) = sqrt((2.1 − 2.7)² + (8.0 − 5.5)²)
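The pseudocode above translates directly to Python (a hypothetical sketch; variable names follow the slide):

```python
import math

def ED(Q, C):
    # Euclidean distance between two equal-length sequences
    return math.sqrt(sum((q - c) ** 2 for q, c in zip(Q, C)))

def sequential_scan(Q, database):
    best_so_far = float("inf")
    index_of_best_match = None
    for i, Ci in enumerate(database):
        true_dist = ED(Q, Ci)
        if true_dist < best_so_far:
            best_so_far = true_dist
            index_of_best_match = i
    return index_of_best_match, best_so_far
```

Using the slide’s numbers, sequential_scan((2.1, 8.0), [(2.7, 5.5)]) computes ED = sqrt((2.1 − 2.7)² + (8.0 − 5.5)²) ≈ 2.57.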
Why is the Triangular Inequality so Important? Virtually all techniques to search data require the triangular inequality to hold. Suppose I am looking for the closest point to Q, in a database of 3 objects (a, b and c). Further suppose that the triangular inequality holds, and that we have precomputed a table of the distances between all the items in the database.
Why is the Triangular Inequality so Important? Virtually all techniques to index data require the triangular inequality to hold. I find a and calculate that it is 2 units from Q; it becomes my best-so-far. I find b and calculate that it is 7.81 units away from Q. I don’t have to calculate the distance from Q to c! I know D(Q,b) ≤ D(Q,c) + D(b,c), so D(Q,b) − D(b,c) ≤ D(Q,c), so 7.81 − 2.30 ≤ D(Q,c), so 5.51 ≤ D(Q,c). So I know that c is at least 5.51 units away, but my best-so-far is only 2 units away.
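The pruning logic above can be sketched as follows (hypothetical Python; the precomputed pairwise table mirrors the a/b/c scenario on the slide):

```python
def nn_with_pruning(Q, items, dist, pairwise):
    """Nearest neighbour search that uses the triangular inequality and a
    precomputed table of pairwise distances to skip distance computations."""
    best_dist, best_item = float("inf"), None
    computed = {}  # distances from Q that we actually evaluated
    calls = 0
    for x in items:
        # For any y with known D(Q, y): D(Q, x) >= D(Q, y) - D(y, x)
        lb = max((d - pairwise[y][x] for y, d in computed.items()), default=0.0)
        if lb >= best_dist:
            continue  # pruned: x cannot possibly beat the best-so-far
        d = dist(Q, x)
        calls += 1
        computed[x] = d
        if d < best_dist:
            best_dist, best_item = d, x
    return best_item, best_dist, calls

# The slide's scenario: D(Q,a) = 2, D(Q,b) = 7.81, D(b,c) = 2.30
true_dists = {"a": 2.0, "b": 7.81, "c": 5.6}
pairwise = {"a": {"a": 0.0, "b": 6.0, "c": 4.0},
            "b": {"a": 6.0, "b": 0.0, "c": 2.30},
            "c": {"a": 4.0, "b": 2.30, "c": 0.0}}
result = nn_with_pruning("Q", ["a", "b", "c"], lambda Q, x: true_dists[x], pairwise)
```

Here only two true distances are ever computed: c is pruned because 7.81 − 2.30 = 5.51 already exceeds the best-so-far of 2. The values D(Q,c) = 5.6, D(a,b) = 6.0 and D(a,c) = 4.0 are made-up numbers consistent with the triangular inequality, not from the slide.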
A Final Thought on the Triangular Inequality I Sometimes the triangular inequality requirement maps nicely onto human intuitions. Consider the similarity between a hippo, an elephant and a man. The hippo and the elephant are very similar, and both are very unlike the man.
A Final Thought on the Triangular Inequality II Sometimes the triangular inequality requirement fails to map onto human intuition. Consider the similarity between the horse, a man and the centaur… The horse and the man are very different, but both share many features with the centaur. This relationship does not obey the triangular inequality. This example is due to Remco C. Veltkamp.
Why Similarity Search • If the dataset is labeled, we can solve the nearest neighbor classification problem. • It is a sub-routine in other data mining problems (NN classification, outlier detection, motif discovery) • Allows hypothesis testing… • Allows plagiarism detection
What properties should a distance measure have? II Domain Dependent Invariances Depending on context, for images we might want invariance to scale, offset, rotation, contrast, brightness, color, handedness, occlusion, aspect ratio, noise etc. For text we might want invariance to capitalization, whitespace etc. For code…
Suppose we are walking in a cemetery in Japan. We see an interesting grave marker, and we want to learn more about it. We can take a photo of it and search a database….
Campana and Keogh (2010). A Compression Based Distance Measure for Texture. SDM 2010.
In order to do this, we must have a distance measure with the right invariances I • Color invariance (unlike, say, flags: Norway vs. Iceland) • Occlusion invariance • Size invariance
In order to do this, we must have a distance measure with the right invariances II Rotation invariance But note that rotation invariance would be a bad idea for text! p d b q
Sometimes we achieve the invariance in data preprocessing (by normalizing etc) Sometimes we achieve the invariance in the distance measure itself.
To make our discussion of similarity more concrete, we will consider just one kind of data for the next few lectures But most of these ideas apply everywhere
What are Time Series? A time series is a collection of observations made sequentially in time, e.g. 25.1750, 25.2250, 25.2500, 25.2500, 25.2750, …, 24.6750, 24.7500. Virtually all similarity measurements, indexing and dimensionality reduction techniques discussed in this class can be used with other data types
Time Series are Ubiquitous! I People measure things… • Their blood pressure • Donald Trump’s popularity rating • The annual rainfall in Seattle • The value of their Google stock …and things change over time… Thus time series occur in virtually every medical, scientific and business domain
The Ubiquity of Time Series Don’t Shoot! A gun-draw gesture captured as a time series: hand at rest, hand moving above holster, hand moving down to grasp gun, hand moving to shoulder level, shooting. Also shown: Lance Armstrong’s performance data, 2000–2002. Motion capture, meteorology, finance, handwriting, medicine, web logs, music…
We can convert shapes into a time series *Paradiso -- Canto XXX, 90.
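One common way to do this (the slide does not specify the method, so treat the details as an assumption) is to trace the shape’s boundary and record the distance from each boundary point to the shape’s centroid:

```python
import math

def shape_to_series(boundary):
    """Convert an ordered list of (x, y) boundary points of a closed
    2-D shape into a 1-D series of centroid-to-boundary distances."""
    cx = sum(x for x, _ in boundary) / len(boundary)
    cy = sum(y for _, y in boundary) / len(boundary)
    return [math.hypot(x - cx, y - cy) for x, y in boundary]
```

A circle maps to a constant series, while corners and protrusions show up as peaks. Note that the resulting series depends on where on the boundary you start tracing, which is one reason rotation invariance matters for shapes.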
Text data may best be thought of as time series… The local frequency of words in the Bible. Blue: “God” (English Bible). The x-axis spans the text from start to finish, with Genesis, Numbers, Deuteronomy, Jeremiah, Ezekiel and Revelation marked.
Text data may best be thought of as time series… The local frequency of words in the Bible. Blue: “God” (English Bible). Red: “Dios” (Spanish Bible). Gray: “El Senor” (Spanish Bible). The x-axis spans the text from start to finish, with Genesis, Numbers, Deuteronomy, 1 Chronicles, Jeremiah, Ezekiel and Revelation marked.
Video data may best be thought of as time series… Point gesture: hand at rest, hand moving to shoulder level, steady pointing. Gun-Draw gesture: hand at rest, hand moving above holster, hand moving down to grasp gun, hand moving to shoulder level, steady pointing.
Handwriting data may best be thought of as time series… George Washington, 1732-1799. George Washington Manuscript.
Euclidean Distance Metric Given two time series Q = q1…qn and C = c1…cn, the Euclidean distance is D(Q,C) = sqrt((q1 − c1)² + (q2 − c2)² + … + (qn − cn)²). Most published work in data mining uses Euclidean distance
Optimizing the Euclidean Distance Calculation Instead of using the Euclidean distance we can use the Squared Euclidean distance This optimization helps with CPU time, but most problems are I/O bound. Euclidean distance and Squared Euclidean distance are equivalent in the sense that they return the same rankings, clusterings and classifications
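The ranking equivalence is easy to demonstrate (a hypothetical Python sketch; any monotone transform of a distance preserves rankings, and squaring is monotone on non-negative values):

```python
import math

def ed(q, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def squared_ed(q, c):
    # no sqrt: cheaper per comparison, but monotonically related to ED
    return sum((a - b) ** 2 for a, b in zip(q, c))

Q = [1.0, 2.0, 3.0]
candidates = [[1.1, 2.0, 2.9], [0.0, 0.0, 0.0], [1.0, 2.5, 3.5]]
rank_ed = sorted(range(len(candidates)), key=lambda i: ed(Q, candidates[i]))
rank_sq = sorted(range(len(candidates)), key=lambda i: squared_ed(Q, candidates[i]))
assert rank_ed == rank_sq  # identical nearest-neighbour ordering
```

Because the rankings agree, any algorithm that only compares distances (nearest neighbor classification, k-NN search, most clustering) can safely use the squared form.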
Preprocessing the data before distance calculations If we naively try to measure the distance between two “raw” time series, we may get very unintuitive results This is because Euclidean distance is very sensitive to some “distortions” in the data. For most problems these distortions are not meaningful, and thus we can and should remove them In the next few slides we will discuss the 4 most common distortions, and how to remove them • Offset Translation • Amplitude Scaling • Linear Trend • Noise
Transformation I: Offset Translation Subtract the mean from each series before comparing them: Q = Q - mean(Q); C = C - mean(C); then compute D(Q,C)
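A minimal sketch of offset translation (Python rather than the Matlab used in these slides):

```python
def remove_offset(x):
    # subtract the mean so the series is centred on zero
    m = sum(x) / len(x)
    return [v - m for v in x]
```

After this transform, two series that differ only by a constant vertical shift become identical, so their Euclidean distance drops to zero.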
Transformation II: Amplitude Scaling Q = (Q - mean(Q)) / std(Q); C = (C - mean(C)) / std(C); then compute D(Q,C). The Matlab zscore function we used removes both Offset Translation and Amplitude Scaling
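A Python equivalent of Matlab’s zscore might look like this (note: Matlab’s zscore defaults to the sample standard deviation with n − 1; this sketch uses the population form with n, which is also common in the time series literature):

```python
import math

def znorm(x):
    m = sum(x) / len(x)
    s = math.sqrt(sum((v - m) ** 2 for v in x) / len(x))  # population std
    return [(v - m) / s for v in x]
```

The result has mean zero and standard deviation one, removing both offset translation and amplitude scaling in one step.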
Transformation III: Linear Trend The intuition behind removing linear trend is: fit the best fitting straight line to the time series, then subtract that line from the time series. (The figure shows a series after removing offset translation and amplitude scaling, and then after also removing linear trend.)
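Fitting and subtracting the least-squares line can be sketched as follows (hypothetical Python; the slides use Matlab):

```python
def remove_linear_trend(x):
    # least-squares fit of a straight line over t = 0, 1, ..., n-1,
    # then subtract that line from the series
    n = len(x)
    tm = (n - 1) / 2                 # mean of the time index
    xm = sum(x) / n                  # mean of the values
    num = sum((t - tm) * (v - xm) for t, v in enumerate(x))
    den = sum((t - tm) ** 2 for t in range(n))
    slope = num / den
    intercept = xm - slope * tm
    return [v - (intercept + slope * t) for t, v in enumerate(x)]
```

Applied to a perfectly linear series, the result is all zeros; applied to a trending signal, only the fluctuations around the trend remain.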
Transformation IV: Noise Q = smooth(Q); C = smooth(C); then compute D(Q,C). The intuition behind removing noise is: average each datapoint’s value with its neighbors.
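A simple moving-average smoother in the spirit of Matlab’s smooth (which defaults to a 5-point moving average; this hypothetical sketch defaults to 3 and shrinks the window at the series boundaries):

```python
def smooth(x, w=3):
    # replace each point by the mean of the points in its window,
    # truncating the window at the series boundaries
    half = w // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out
```

For example, smooth([0, 10, 0]) averages the lone spike with its neighbours, damping the noise while keeping the series length unchanged.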