Operators for Similarity Search

Operators for Similarity Search
Deepak Padmanabhan, PhD
Centre for Data Sciences and Scalable Computing, The Queen’s University of Belfast, United Kingdom
deepaksp@acm.org

Similarity Search in Action • Image Search • Web Pages • Similar Movies (Tastekid.com)

Similarity and Cognition

Similarity and Cognition “This sense of sameness is the very keel and backbone of our thinking.” “… the mind makes continual use of the notion of sameness, and if deprived of it, would have a different structure from what it has.” William James, Principles of Psychology, 1890

Similarity and Cognition “Similarity is fundamental for learning, knowledge and thought, for only our sense of similarity allows us to order things into kinds so that these can function as stimulus meanings. Reasonable expectation depends on the similarity of circumstances and on our tendency to expect that similar causes will have similar effects.” Quine, Ontological Relativity and Other Essays, 1969

Geometric Similarity Model (Figure: objects O1 and O2.)

Diagnosticity Principle • Similarity and grouping are related • Features that are used to cluster have disproportionate influence

Pairwise Similarities: Object Representation and Similarity Measures

Example Representations

Estimating Similarity Between Objects
CAR1023: Remarks: Good condition; Model: Passat V6; Year: 2002; Battery Voltage: 12.9V; …
CAR560: Remarks: Nice condition; Model: Passat; Year: 2000; Battery Voltage: 12.6V; …
Similarity measures used: Text Similarity, Domain Ontology, Numeric, Domain Knowledge + Numeric
Per-attribute similarity scores: 0.60, 0.80, 0.75, 0.90
Aggregations of the score vector:
• min = 0.60
• min2 = 0.75
• avg = 0.76
• max = 0.90
• noagg = {0.6, 0.8, 0.75, 0.9}
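
The aggregations named on this slide can be reproduced from the four per-attribute scores. The sketch below is a minimal illustration, assuming that min2 means the second-smallest score (an interpretation suggested by its value of 0.75) and that avg is rounded to two decimals. The function name `aggregate` is hypothetical.

```python
from statistics import mean

# Per-attribute similarity scores from the car example on the slide.
scores = [0.60, 0.80, 0.75, 0.90]

def aggregate(scores, how):
    """Collapse a score vector using one of the aggregations named on the slide."""
    ordered = sorted(scores)
    if how == "min":
        return ordered[0]
    if how == "min2":   # assumption: second-smallest score
        return ordered[1]
    if how == "max":
        return ordered[-1]
    if how == "avg":    # assumption: rounded to two decimals
        return round(mean(scores), 2)
    if how == "noagg":  # no aggregation: keep the whole vector
        return list(scores)
    raise ValueError(f"unknown aggregation: {how}")

for how in ("min", "min2", "avg", "max", "noagg"):
    print(how, aggregate(scores, how))
```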

Outline for the Rest • Construction-based Classification • Property-based Classification • Some Directions

Problem Overview
(Figure labels: Q = (q1 … qn), D = (d1 … dn), score vector S(Q,D) = (s1 … sn), D(Q,D), I(Q,D), I(Q,D’), Membership/Score; stages: Scoring, Aggregation, Filter; input: Query Parameters.)
• Scoring operators: Assign a score vector to each object by comparing it to the query object
• Aggregation operators: Aggregate the score vector into a smaller number of values
• Selection/Filter operators: Select a subset of objects based on whether they satisfy a criterion, e.g., Skyline, rank based or threshold based

Common Operations
• Aggregation operations
  • Weighted Sum
  • Max
  • Min
  • Distance
  • N-Match
• Filter operations
  • Skyline
  • Rank (Top-k)
  • Threshold (Bounding Box, Range query)
Different combinations lead to different operators

Weighted Sum Top-k
W(X) = 1, W(Y) = 2, Q: (3,4)
Weighted sum per object:
d((2,2)) = 1*1 + 2*2 = 5
d((1,3)) = 1*2 + 2*1 = 4
d((1,4)) = 1*2 + 2*0 = 2
d((5,1)) = 1*2 + 2*3 = 8
d((3,3)) = 1*0 + 2*1 = 2
d((6,3)) = 1*3 + 2*1 = 5
d((2,6)) = 1*1 + 2*2 = 5
d((5,6)) = 1*2 + 2*2 = 6
Top-k Filter: Sort and Choose k
d((1,4)) = 1*2 + 2*0 = 2
d((3,3)) = 1*0 + 2*1 = 2
d((1,3)) = 1*2 + 2*1 = 4
d((2,2)) = 1*1 + 2*2 = 5
d((6,3)) = 1*3 + 2*1 = 5
d((2,6)) = 1*1 + 2*2 = 5
d((5,6)) = 1*2 + 2*2 = 6
d((5,1)) = 1*2 + 2*3 = 8
(Figure: the eight data points around Q, with the locus for equal weights.)
Useful when all attributes need to be considered
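
The arithmetic on this slide can be checked with a few lines of code. This is a minimal sketch that uses the slide's points, query and weights; the function names `weighted_sum` and `top_k` are hypothetical, and ties are broken by input order, which is an assumption.

```python
QUERY = (3, 4)
POINTS = [(2, 2), (1, 3), (1, 4), (5, 1), (3, 3), (6, 3), (2, 6), (5, 6)]
WEIGHTS = (1, 2)  # W(X) = 1, W(Y) = 2

def weighted_sum(point, query=QUERY, weights=WEIGHTS):
    """Weighted sum of per-attribute absolute differences to the query."""
    return sum(w * abs(p - q) for w, p, q in zip(weights, point, query))

def top_k(points, k, score=weighted_sum):
    """Top-k filter: sort by score (smaller is closer) and keep the first k."""
    return sorted(points, key=score)[:k]

for p in sorted(POINTS, key=weighted_sum):
    print(p, weighted_sum(p))
print(top_k(POINTS, 3))
```

Python's `sorted` is stable, so objects with equal scores keep their input order.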

Max Top-k
Q: (3,4)
d((2,2)) = 2
d((1,3)) = 2
d((1,4)) = 2
d((5,1)) = 3
d((3,3)) = 1
d((6,3)) = 3
d((2,6)) = 2
d((5,6)) = 2
(Figure: the eight data points around Q, with the locus.)
Useful when maximum dissimilarity needs to be bounded

Min Top-k
Q: (3,4)
d((2,2)) = 1
d((1,3)) = 1
d((1,4)) = 0
d((5,1)) = 2
d((3,3)) = 0
d((6,3)) = 1
d((2,6)) = 1
d((5,6)) = 2
(Figure: the eight data points around Q, with the locus.)
Useful when best matching attribute is sufficient
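
The Max and Min aggregations differ only in which per-attribute difference they keep. The sketch below, with hypothetical function names, recomputes the values listed on the Max Top-k and Min Top-k slides and applies a Top-k filter to each.

```python
QUERY = (3, 4)
POINTS = [(2, 2), (1, 3), (1, 4), (5, 1), (3, 3), (6, 3), (2, 6), (5, 6)]

def attribute_distances(point, query=QUERY):
    """Per-attribute absolute differences between a point and the query."""
    return [abs(p - q) for p, q in zip(point, query)]

def max_distance(point):
    """Max aggregation: the worst-matching attribute decides the score."""
    return max(attribute_distances(point))

def min_distance(point):
    """Min aggregation: the best-matching attribute decides the score."""
    return min(attribute_distances(point))

def top_k(points, k, score):
    """Top-k filter: sort by score (smaller is closer) and keep the first k."""
    return sorted(points, key=score)[:k]

print(top_k(POINTS, 1, max_distance))
print(top_k(POINTS, 2, min_distance))
```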

Skyline
An object is said to dominate another if the latter is farther away from the query than the former on “all” dimensions (can be equal on some, but not all).
All objects that are not dominated by any other are output as results.
Data points: (2,6), (5,6), (1,4), (3,3), (1,3), (6,3), (2,2), (5,1); Q: (3,4)
(Figure: Domination’ Regions of (1,4), (2,6), (5,6) and (3,3).)
Results: (5,6), (2,6), (3,3), (1,4)
Useful when attribute scores cannot be aggregated
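
The result listed on this slide is what one obtains if domination is only checked between objects that lie on the same side of the query in every dimension, which appears to be what the primed Domination’ regions depict. That reading is an assumption, as are the function names. Under it, a minimal sketch of the skyline filter looks like this:

```python
QUERY = (3, 4)
POINTS = [(2, 2), (1, 3), (1, 4), (5, 1), (3, 3), (6, 3), (2, 6), (5, 6)]

def dominates(a, b, query=QUERY):
    """True if a dominates b: same side of the query in every dimension
    (assumption), no farther than b anywhere, strictly closer somewhere."""
    strictly_closer = False
    for ai, bi, qi in zip(a, b, query):
        da, db = ai - qi, bi - qi
        if da * db < 0:        # a and b lie on opposite sides of the query
            return False
        if abs(da) > abs(db):  # a is farther than b on this dimension
            return False
        if abs(da) < abs(db):
            strictly_closer = True
    return strictly_closer

def skyline(points, query=QUERY):
    """Keep every object that no other object dominates."""
    return [p for p in points
            if not any(dominates(o, p, query) for o in points if o != p)]

print(skyline(POINTS))
```

If domination were instead tested on absolute distances alone, (3,3) would dominate (2,6) and (5,6), and the result set would be smaller than the one on the slide.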

Range Query
L2 aggregation + Threshold filter
(Figure: radius r around Q:(3,4), with data points (2,6), (5,6), (1,4), (3,3), (1,3), (6,3), (2,2), (5,1).)
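
A range query combines an L2 (Euclidean) aggregation with a threshold filter. The slide does not give a value for r, so the radius of 2 below is purely illustrative, and `range_query` is a hypothetical name.

```python
from math import dist  # Euclidean (L2) distance, available from Python 3.8

QUERY = (3, 4)
POINTS = [(2, 2), (1, 3), (1, 4), (5, 1), (3, 3), (6, 3), (2, 6), (5, 6)]

def range_query(points, query, r):
    """Keep every point whose L2 distance to the query is at most r."""
    return [p for p in points if dist(p, query) <= r]

# r = 2 is an illustrative value, not taken from the slide.
print(range_query(POINTS, QUERY, r=2))
```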

Bounding Box
null aggregation + Threshold filter
(Figure: box with extents rx and ry around Q:(3,4), with data points (2,6), (5,6), (1,4), (3,3), (1,3), (6,3), (2,2), (5,1).)
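
With null aggregation, the threshold filter is applied to each attribute separately. The sketch below assumes illustrative extents rx = 2 and ry = 1, which are not given on the slide; `bounding_box` is a hypothetical name.

```python
QUERY = (3, 4)
POINTS = [(2, 2), (1, 3), (1, 4), (5, 1), (3, 3), (6, 3), (2, 6), (5, 6)]

def bounding_box(points, query, bounds):
    """Keep points whose per-attribute difference is within every bound."""
    return [p for p in points
            if all(abs(pi - qi) <= b for pi, qi, b in zip(p, query, bounds))]

# (rx, ry) = (2, 1) is an illustrative choice, not taken from the slide.
print(bounding_box(POINTS, QUERY, bounds=(2, 1)))
```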

K-N-Match
The K-N-Match operator ranks objects based on the match on the Nth best matching attribute.
N-match aggregation + Rank filter
Data points: (2,6), (4,5), (1,4), (1,3), (6,3), (2,2), (3,2), (5,1); Q: (3,4)
(Figure: panels for N=1 and N=2.)
Useful when at least N attributes should match
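
The N-match aggregation can be sketched by sorting the per-attribute differences and taking the Nth smallest. The function names are hypothetical, and the values of k used below are illustrative since the slide does not state them.

```python
QUERY = (3, 4)
POINTS = [(2, 6), (4, 5), (1, 4), (1, 3), (6, 3), (2, 2), (3, 2), (5, 1)]

def n_match_distance(point, query, n):
    """Nth smallest per-attribute difference to the query (n is 1-based)."""
    return sorted(abs(p - q) for p, q in zip(point, query))[n - 1]

def k_n_match(points, query, k, n):
    """Rank by the Nth best matching attribute and keep the first k."""
    return sorted(points, key=lambda p: n_match_distance(p, query, n))[:k]

print(k_n_match(POINTS, QUERY, k=2, n=1))
print(k_n_match(POINTS, QUERY, k=1, n=2))
```

With N=1 the operator behaves like Min Top-k; with N equal to the number of attributes it behaves like Max Top-k.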

Summarizing the Construction-based Classification

Property-based Classification • Ordered vs. Unordered Output • Whether there is an ordering in the output result set • Subset vs. All Attributes • Whether all attributes contribute to deciding the membership in the result set

Ordered vs. Unordered Output
Applicable to Selection/Filter operators
(Figure: Skyline and Top-k results around a query; the Top-k results are labelled by rank 1, 2 and 3.)

Subset vs. All Attributes
Applicable to Aggregation operators
(Figure: the Scoring, Aggregation, Filter pipeline from the Problem Overview slide.)
We focus on the construction of I(Q,D) for this classification.

Some Example I(Q,D)s
All Attributes Needed: • Weighted Sum • Range Query • Bounding Box/Skyline
“Some” Attributes Enough: • Max • Min • K-N-Match

Classification Overview Aggregation Selection/ Filter

“Add-on” Features for Similarity Operators • Indirection (Reverse Operators) • Multiple Queries • Diversity • Visibility • Subspaces • Typed Data (Chromaticity)

Reverse Operators
• Range Query: Get me all the restaurants within 1km of my home
  • This is common in consumer usage scenarios
  • E.g., a user searching for restaurants to dine at
• Reverse Range Query: Get me all the users for whom my restaurant is within 1km
  • This is more of a service provider question
  • E.g., finding potential consumers to whom targeted marketing may be directed
• This reversal could be applied to various operators
  • E.g., Reverse Skyline, Reverse kNN, …
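
The two directions of the range query can be contrasted in code. The sketch below uses made-up user and restaurant coordinates (in km) and hypothetical function names; only the 1km radius comes from the slide.

```python
from math import dist

# Hypothetical locations in km; none of these come from the slides.
USERS = {"u1": (0.0, 0.0), "u2": (0.5, 0.5), "u3": (3.0, 3.0)}
RESTAURANTS = {"r1": (0.2, 0.1), "r2": (2.8, 3.1)}

def range_query(home, radius_km=1.0):
    """Consumer view: restaurants within radius_km of a user's home."""
    return [r for r, loc in RESTAURANTS.items() if dist(home, loc) <= radius_km]

def reverse_range_query(restaurant, radius_km=1.0):
    """Provider view: users for whom this restaurant is within radius_km."""
    return [u for u, loc in USERS.items() if dist(loc, restaurant) <= radius_km]

print(range_query(USERS["u1"]))
print(reverse_range_query(RESTAURANTS["r1"]))
```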

Multiple Queries
“I plan to leave the office, go to the club and then get home. I need to get some dinner somewhere during this travel. Give me restaurants or pubs that are within 1km of any of these three locations.”
This corresponds to a range query using multiple query points (Office, Club, Home). The merging operator here is the OR operator, since we would be content with places that are close to any one of these query points.
(Figure: Restaurants/Pubs around Office, Club and Home.)
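
A multi-query range search with OR merging accepts a place as soon as it is close to any one of the query points. The coordinates and place names below are invented for illustration, and `multi_query_range` is a hypothetical name.

```python
from math import dist

# Hypothetical coordinates in km; none of these come from the slides.
QUERIES = {"office": (0.0, 0.0), "club": (3.0, 0.0), "home": (6.0, 0.0)}
PLACES = {
    "pub_a": (0.5, 0.5),
    "diner_b": (3.0, 0.8),
    "cafe_c": (4.5, 2.0),
    "grill_d": (6.5, -0.5),
}

def multi_query_range(places, queries, radius_km=1.0):
    """OR merge: keep a place if it is within radius_km of any query point."""
    return [name for name, loc in places.items()
            if any(dist(loc, q) <= radius_km for q in queries.values())]

print(multi_query_range(PLACES, QUERIES))
```

Replacing `any` with `all` would give the AND merge, which keeps only places close to every query point.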

Diversity
The logical 3 nearest neighbors aren’t very diverse and are very similar to each other.
The diversity constraint makes sure that the pairwise distance between any two results is lower bounded. Thus, it will return a more diverse set.
(Figure axes: Cost, Rating.)
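
One simple way to enforce a lower bound on pairwise distances is a greedy pass over candidates in nearest-first order. This greedy strategy is an assumption chosen for brevity, not necessarily the algorithm behind the slide; the data and names are hypothetical.

```python
from math import dist

def diverse_knn(points, query, k, min_separation):
    """Greedy sketch: accept candidates nearest-first, skipping any that lie
    closer than min_separation to an already accepted result."""
    chosen = []
    for p in sorted(points, key=lambda p: dist(p, query)):
        if all(dist(p, c) >= min_separation for c in chosen):
            chosen.append(p)
        if len(chosen) == k:
            break
    return chosen

# Hypothetical data: three near-duplicates close to the query, two others.
QUERY = (0, 0)
POINTS = [(1, 0), (1.1, 0), (1, 0.1), (0, 2), (-3, 0)]

plain = sorted(POINTS, key=lambda p: dist(p, QUERY))[:3]
print(plain)                                               # three near-duplicates
print(diverse_knn(POINTS, QUERY, k=3, min_separation=1.0)) # spread-out results
```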

Visibility Constraints
Return the k nearest neighbours that are visible from the query point
(Figure: query Q and objects d1 to d8.)
K = 3
KNN = {d4, d5, d6}
VkNN = {d4, d1, d2}

Subspaces: Subspace Range Search
Find objects within a threshold distance in a user-specified subset of dimensions
(Figure: query Q and objects d1 to d8 plotted on axes Expense and Rating.)
Dimensions = {Rating}: R = {d1, d2, d4, d5, d6, d8}
Dimensions = {Expense}: R = {d4, d5, d6}
Dimensions = {Expense, Rating}: R = {d4, d5, d6}
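
A subspace range search measures distance only over the chosen dimensions. The figure's coordinates are not recoverable, so the sketch below uses invented, normalised values and a hypothetical function name; it illustrates the mechanism rather than reproducing the slide's result sets.

```python
from math import sqrt

# Hypothetical normalised values in [0, 1]; not the data from the figure.
QUERY = {"expense": 0.5, "rating": 0.5}
OBJECTS = {
    "d1": {"expense": 0.9, "rating": 0.55},
    "d2": {"expense": 0.55, "rating": 0.1},
    "d3": {"expense": 0.45, "rating": 0.52},
}

def subspace_range_search(objects, query, dimensions, threshold):
    """Keep objects whose L2 distance to the query, computed only over the
    chosen dimensions, is at most threshold."""
    results = []
    for name, obj in objects.items():
        d = sqrt(sum((obj[dim] - query[dim]) ** 2 for dim in dimensions))
        if d <= threshold:
            results.append(name)
    return results

print(subspace_range_search(OBJECTS, QUERY, ["rating"], 0.1))
print(subspace_range_search(OBJECTS, QUERY, ["expense"], 0.1))
print(subspace_range_search(OBJECTS, QUERY, ["expense", "rating"], 0.1))
```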

Typed Data: Chromaticity
Find objects (of class A) that have the query object (of class B) in their kNN result set
Example: people and restaurants. Find the bi-chromatic RkNN set of a restaurant
Two classes P and R. Query is from class R, results from class P
(Figure: people p1 to p6 and restaurants r1 to r3.)
RNN(r1) = {p2, r3}
Bi-RNN(r1) = {p2, p1}
Bi-RNN(r3) = {p6}
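
A bichromatic reverse kNN query can be sketched by computing, for each person, the k nearest restaurants and checking whether the query restaurant is among them. The coordinates below are invented (the figure's layout is not recoverable), so the outputs differ from the sets on the slide; the function names are hypothetical.

```python
from math import dist

# Hypothetical coordinates; not the layout from the figure.
PEOPLE = {"p1": (0, 0), "p2": (1, 0), "p3": (5, 5)}
RESTAURANTS = {"r1": (0.5, 0), "r2": (4, 5), "r3": (2, 0)}

def knn_restaurants(person_location, k):
    """The k restaurants nearest to a person."""
    ranked = sorted(RESTAURANTS,
                    key=lambda r: dist(person_location, RESTAURANTS[r]))
    return ranked[:k]

def bichromatic_rknn(restaurant, k):
    """People (class P) who have the restaurant (class R) among their kNN."""
    return [p for p, loc in PEOPLE.items()
            if restaurant in knn_restaurants(loc, k)]

print(bichromatic_rknn("r1", k=1))
print(bichromatic_rknn("r3", k=1))
```

This brute-force version scans every person for every query, which is fine for illustration but not for large data sets.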

Summary of operators

The Road Ahead • Plethora of choices in each step leads to the large variety of similarity search operators • And keeps researchers busy • Choices in • Similarity measures • Aggregation operators • Selection/filter operators • Additional features • Algorithmic features • Are we done yet?

Let us invent some new Operators

N-Match-BB
• Bounding Box query where at least N attribute bounds are satisfied
• An adaptation of K-N-Match to Bounding Boxes
• Properties: Unordered output, Subset of Attributes
(Figure: query Q with two rectangles.) For 1-Match-BB, data points on either of these rectangles are OK.
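
The proposed N-Match-BB operator can be sketched by counting how many attribute bounds an object satisfies. The points and query are reused from the earlier slides, while the bounds (1, 1) and the function name are illustrative assumptions.

```python
QUERY = (3, 4)
POINTS = [(2, 2), (1, 3), (1, 4), (5, 1), (3, 3), (6, 3), (2, 6), (5, 6)]

def n_match_bb(points, query, bounds, n):
    """Keep points that satisfy at least n of the per-attribute bounds."""
    results = []
    for p in points:
        satisfied = sum(abs(pi - qi) <= b for pi, qi, b in zip(p, query, bounds))
        if satisfied >= n:
            results.append(p)
    return results

# Bounds (1, 1) are an illustrative choice, not taken from the slide.
print(n_match_bb(POINTS, QUERY, bounds=(1, 1), n=1))
print(n_match_bb(POINTS, QUERY, bounds=(1, 1), n=2))  # ordinary bounding box
```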

Multi-Query Bichromatic Reverse kNN • Combination of • Weighted Sum • Top-k Filter • Reverse (Indirection) • Multi-Query • Chromaticity • Example Use Case: Of the three chosen locations for Café X (all three are intended to be opened), find people who would find at least one of these locations among the k closest cafes
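
The Café X use case can be sketched by treating the three candidate sites as query points, ranking all cafes for each person with an equal-weight weighted sum, and merging with OR. All coordinates and names below are invented for illustration, and the brute-force scan is an assumption rather than a proposed algorithm.

```python
# Hypothetical coordinates; none of these come from the slides.
PEOPLE = {"a": (0, 0), "b": (10, 0), "c": (5, 5)}
EXISTING_CAFES = {"e1": (0, 1), "e2": (10, 1), "e3": (5, 4)}
CANDIDATE_SITES = {"x1": (0, 0.5), "x2": (10, 5), "x3": (20, 20)}
WEIGHTS = (1, 1)

def weighted_sum(a, b, weights=WEIGHTS):
    """Weighted sum of per-attribute absolute differences."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))

def multi_query_bichromatic_rknn(k):
    """People who find at least one candidate site among their k closest
    cafes, assuming all candidate sites are opened."""
    all_cafes = {**EXISTING_CAFES, **CANDIDATE_SITES}
    matches = []
    for person, loc in PEOPLE.items():
        nearest = sorted(all_cafes,
                         key=lambda c: weighted_sum(loc, all_cafes[c]))[:k]
        if any(c in CANDIDATE_SITES for c in nearest):  # OR across the sites
            matches.append(person)
    return matches

print(multi_query_bichromatic_rknn(k=1))
print(multi_query_bichromatic_rknn(k=2))
```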

Miscellaneous
• Revisiting algorithms on new platforms
  • Hadoop/MR
• Interpretability in Results
  • Can results of similarity search be shown in a manner that highlights the intuitive similarity between the query and the result?
• Syntactic and Semantic Features
  • Understand the dichotomy between syntactic (e.g., shape similarity) and semantic (e.g., two images being similar due to both being maps) features
  • Would modeling them differently, and learning when to weigh each highly, lead to more efficient similarity search?
• Contextual Similarity; conditioning on user history
  • On searching for “IBM Watson”, a travelling person should be shown IBM Watson Labs, whereas a technologist should be shown the IBM Watson system

“Similarity lies in the eyes of the beholder”*
Thank You! Questions/Comments?
deepaksp@acm.org deepakp7@gmail.com
* (Adapted from famous quote) from http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt
