Multi Feature Indexing Network (MUFIN): Similarity Search Platform for Many Applications. Pavel Zezula, Faculty of Informatics, Masaryk University, Brno. MUFIN: Multi Feature Indexing Network
Outline of the talk • Why similarity • Principles of metric similarity searching • The MUFIN approach • Demo applications • Future directions
Real-Life Motivation: The social psychology view • Any event in the history of an organism is, in a sense, unique. • Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity. • Similarity (proximity, resemblance, communality, representativeness, psychological distance, etc.) is fundamental to theories of perception, learning, judgment, etc.
Contemporary Networked Media: The digital data view • Almost everything that we see, read, hear, write, measure, or observe can be digital. • Users autonomously contribute to the production of global media, and the growth is exponential. • Sites like Flickr, YouTube, and Facebook host user-contributed content for a variety of events. • The elements of networked media are related by numerous multi-faceted links of similarity.
Examples with Similarity • Does the computer disk of a suspected criminal contain illegal multimedia material? • Which stocks have similar price histories? • Which companies advertise their logos in the live TV broadcast of a football match? • Is the situation on the web getting close to any of the network attacks that resulted in significant damage in the past?
Challenge • Networked media is getting close to the human “fact-bases” • the gap between physical and digital has blurred • Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections. • WHY? It is similarity that is revealing in the world.
Limitations: Data Types • We have: attributes (numbers, strings, etc.); text (documents, annotations) • We need: multimedia (image, video, audio); security (biometrics); medicine (EKG, EEG, EMG, EMR, CT, etc.); scientific data (biology, chemistry, physics, life sciences, economics); others (motion, emotion, events, etc.)
Limitations: Models of Similarity • We have: simple geometric models, typically vector spaces • We need: more complex models (non-metric models, asymmetric similarity, subjective similarity, context-aware similarity, complex similarity, etc.)
Limitations: Queries • We have: simple queries (nearest neighbor, range) • We need: more query types (reverse NN, distinct NN, similarity join); other similarity-based operations (filtering, classification, event detection, clustering, etc.); a similarity algebra, which may become the basis of a “Similarity Data Management System”
Limitations: Implementation Strategies • We have: centralized or parallel processing • We need: scalable and distributed architectures (MapReduce-like approaches, P2P architectures, cloud computing, self-organized architectures, etc.)
Search Strategy Evolution [Figure: search strategies plotted by scalability and grade of determinism (high to low), ranging from well established to cutting-edge research. Scalability: data volume (exponential), number of users (queries), variety of data types, multi-lingual, multi-feature, and multi-modal queries. Determinism: exact match ► similarity; precise ► approximate; same answer ► good answer, recommendation; fixed query ► personalized, context aware; fixed infrastructure ► dynamic mapping, mobile devices. Architectures: centralized, parallel, distributed, peer-to-peer, self-organized.]
Similarity Data Management System [Figure: keywords arranged around the system: findability, modelling, infrastructure, retrieval, stimuli, matching, extraction, similarity, effectiveness, efficiency, execution, evaluation, algebra.]
Metric Search Grows in Popularity • Hanan Samet: Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann, 2006 • P. Zezula, G. Amato, V. Dohnal, and M. Batko: Similarity Search: The Metric Space Approach, Springer, 2006
The MUFIN Approach • MUFIN: MUlti-Feature Indexing Network • Extensibility (data & queries): metric space • Scalability (index structure): P2P structure • Independence (infrastructure): infrastructure as a service
Extensibility: Metric Abstraction of Similarity • Metric space: $M = (D, d)$ • $D$ - domain • distance function $d(x,y)$, where $\forall x, y, z \in D$: • $d(x,y) \ge 0$ - non-negativity • $d(x,y) = 0 \Leftrightarrow x = y$ - identity • $d(x,y) = d(y,x)$ - symmetry • $d(x,y) \le d(x,z) + d(z,y)$ - triangle inequality
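As a rough illustration of the four postulates, the following Python sketch checks them by brute force on a small sample. The helper names (`check_metric_postulates`, `euclidean`, `squared`) and the sample points are invented for this example. Passing the check on a sample does not prove that a function is a metric, but a failure shows that it is not.

```python
from itertools import product

def euclidean(x, y):
    """L2 distance between two equal-length tuples of numbers."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def check_metric_postulates(d, sample, tol=1e-9):
    """Return the sorted names of postulates that d violates on the sample."""
    violated = set()
    for x, y, z in product(sample, repeat=3):
        if d(x, y) < 0:
            violated.add("non-negativity")
        if (d(x, y) == 0) != (x == y):
            violated.add("identity")
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
        if d(x, y) > d(x, z) + d(z, y) + tol:
            violated.add("triangle inequality")
    return sorted(violated)

# Hypothetical sample points in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]

def squared(x, y):
    """Squared Euclidean distance: a common non-metric dissimilarity."""
    return euclidean(x, y) ** 2

print(check_metric_postulates(euclidean, points))  # []
print(check_metric_postulates(squared, points))    # ['triangle inequality']
```

The squared Euclidean distance fails only the triangle inequality, which is why it cannot be plugged into a metric index without a correction.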
Examples of Distance Functions • $L_p$ Minkowski distance (for vectors) • $L_1$: city-block distance • $L_2$: Euclidean distance • $L_\infty$: maximum (infinity) distance • Edit distance (for strings) • minimal number of insertions, deletions, and substitutions • d(‘application’, ‘applet’) = 6 • Jaccard’s coefficient (for sets A, B): $d(A,B) = 1 - |A \cap B| / |A \cup B|$
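The distance functions listed above can be sketched in a few lines of Python. The function names are chosen for this example, and the Jaccard distance is taken in its usual form, 1 - |A ∩ B| / |A ∪ B|.

```python
def minkowski(x, y, p):
    """L_p distance; p = float('inf') gives the maximum (L_infinity) distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(v ** p for v in diffs) ** (1.0 / p)

def edit_distance(s, t):
    """Minimal number of insertions, deletions, and substitutions (row-wise DP)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 - |A n B| / |A u B|; taken as 0 for two empty sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(minkowski((0, 0), (3, 4), 1))             # 7.0
print(minkowski((0, 0), (3, 4), 2))             # 5.0
print(minkowski((0, 0), (3, 4), float("inf")))  # 4
print(edit_distance("application", "applet"))   # 6
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))   # 0.5
```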
Examples of Distance Functions • Mahalanobis distance • for vectors with correlated dimensions • Hausdorff distance • for sets with elements related by another distance • Earth mover's distance • primarily for histograms (sets of weighted features) • and many others
Similarity Search Problem • For $X \subseteq D$ in metric space $M$, pre-process $X$ so that similarity queries are executed efficiently. • No total ordering exists!
Similarity Queries • Range query • Nearest neighbor query • Similarity join • Combined queries • Complex queries
Similarity Range Query • range query • $R(q,r) = \{ x \in X \mid d(q,x) \le r \}$ … all museums up to 2 km from my hotel …
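A minimal sketch of the range query as a linear scan, which is the baseline that any index has to beat. The museum coordinates (in kilometres) and all names are made up for this example.

```python
def euclidean(x, y):
    """L2 distance between two equal-length tuples of numbers."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def range_query(X, d, q, r):
    """R(q, r): every x in X with d(q, x) <= r, found by scanning all of X."""
    return [x for x in X if d(q, x) <= r]

# Hypothetical museum positions relative to the hotel, in km.
museums = [(0.5, 1.0), (3.0, 0.0), (-1.0, -1.2), (1.9, 0.1)]
hotel = (0.0, 0.0)

hits = range_query(museums, euclidean, hotel, 2.0)
print(hits)
```

The scan computes one distance per object, so its cost grows linearly with the collection; the partitioning principles later in the talk aim to avoid most of these computations.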
Nearest Neighbor Query • the nearest neighbor query • $NN(q) = x$ • $x \in X, \forall y \in X: d(q,x) \le d(q,y)$ • k-nearest neighbor query • $k\text{-}NN(q,k) = A$ • $A \subseteq X, |A| = k$ • $\forall x \in A, y \in X - A: d(q,x) \le d(q,y)$ … five closest museums to my hotel … (k = 5)
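The k-nearest neighbor definition can be sketched with a bounded selection from the standard library. The one-dimensional data and the function name are assumptions for illustration; ties are broken by input order here, which the definition leaves open.

```python
import heapq

def knn_query(X, d, q, k):
    """k-NN(q, k): the k objects of X closest to q, nearest first."""
    return heapq.nsmallest(k, X, key=lambda x: d(q, x))

def absolute_difference(x, y):
    """A simple metric on numbers."""
    return abs(x - y)

# Hypothetical positions along a street; the hotel is at position 7.
positions = [1, 4, 6, 9, 12, 15, 20]

nearest = knn_query(positions, absolute_difference, 7, 1)
five_closest = knn_query(positions, absolute_difference, 7, 5)
print(nearest, five_closest)
```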
Similarity Join Queries • similarity join of two data sets $X$ and $Y$ with threshold $\mu$: all pairs $(x, y) \in X \times Y$ with $d(x,y) \le \mu$ • similarity self join: X = Y … pairs of hotels and museums which are a five-minute walk apart …
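A nested-loop sketch of the similarity join, assuming the usual definition (all pairs within a threshold `mu` of each other). The names and the toy positions are invented for this example; the self join reports each unordered pair once and skips the trivial pairs (x, x).

```python
def similarity_join(X, Y, d, mu):
    """All pairs (x, y) from X x Y with d(x, y) <= mu."""
    return [(x, y) for x in X for y in Y if d(x, y) <= mu]

def similarity_self_join(X, d, mu):
    """Pairs of distinct positions in X within mu, each pair reported once."""
    return [(X[i], X[j])
            for i in range(len(X))
            for j in range(i + 1, len(X))
            if d(X[i], X[j]) <= mu]

def absolute_difference(x, y):
    return abs(x - y)

# Hypothetical positions, in minutes of walking along one street.
hotels = [0, 10, 20]
museums = [1, 12, 30]

pairs = similarity_join(hotels, museums, absolute_difference, 5)
close_pairs = similarity_self_join([0, 1, 5, 6, 20], absolute_difference, 1)
print(pairs, close_pairs)
```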
Combined Queries • Range + Nearest neighbors • Nearest neighbor + similarity joins • by analogy
Complex Queries • Find the best matches of circular-shape objects with red color • The best match for circular shape or for red color need not be the best match for the combination • A0 algorithm • Threshold algorithm
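The threshold algorithm named above can be sketched as follows. This is a simplified illustration under stated assumptions, not the talk's implementation: each feature is a dictionary of similarity scores (higher is better), every object appears in every dictionary, and the combining function `agg` is monotone (here `min`). All names and scores are hypothetical.

```python
import heapq

def threshold_algorithm(lists, k, agg=min):
    """Top-k objects by aggregated score, stopping as early as the threshold allows."""
    # Sorted access: each feature's objects ordered by descending score.
    ranked = [sorted(scores.items(), key=lambda kv: -kv[1]) for scores in lists]
    seen = {}
    for depth in range(len(ranked[0])):
        last_seen = []
        for ranking in ranked:
            obj, score = ranking[depth]
            last_seen.append(score)
            if obj not in seen:
                # Random access: look the object up in every feature list.
                seen[obj] = agg([scores[obj] for scores in lists])
        # No unseen object can score above the aggregate of the last seen scores.
        threshold = agg(last_seen)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical similarity scores for "circular shape" and "red color".
shape = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
color = {"a": 0.2, "b": 0.7, "c": 0.95, "d": 0.6}

best = threshold_algorithm([shape, color], 1)
print(best)  # [('b', 0.7)]
```

Object `a` is the best shape match and `c` the best color match, yet `b` wins on the combination, and the algorithm stops after reading two entries per list.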
Partitioning Principles • Given a set $X \subseteq D$ in $M = (D,d)$, basic partitioning principles have been defined: • Ball partitioning • Generalized hyper-plane partitioning • Excluded middle partitioning • Clustering
Ball Partitioning • Inner set: $\{ x \in X \mid d(p,x) \le d_m \}$ • Outer set: $\{ x \in X \mid d(p,x) > d_m \}$
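A small sketch of ball partitioning around a pivot `p`. Choosing the median distance as the radius is an assumption made here to get balanced sets; the second function shows how the triangle inequality lets a range query skip one side. Names and data are invented for the example.

```python
import statistics

def ball_partition(X, d, p, dm=None):
    """Split X into (inner, outer) by distance to pivot p; dm defaults to the median."""
    dists = [d(p, x) for x in X]
    if dm is None:
        dm = statistics.median(dists)
    inner = [x for x, dist in zip(X, dists) if dist <= dm]
    outer = [x for x, dist in zip(X, dists) if dist > dm]
    return inner, outer, dm

def regions_to_visit(d, p, dm, q, r):
    """Which sets a range query R(q, r) can intersect, by the triangle inequality."""
    dq = d(p, q)
    visit_inner = dq - r <= dm
    visit_outer = dq + r > dm
    return visit_inner, visit_outer

def absolute_difference(x, y):
    return abs(x - y)

inner, outer, dm = ball_partition([1, 2, 3, 4, 5, 6, 7], absolute_difference, 0)
print(inner, outer, dm)
print(regions_to_visit(absolute_difference, 0, dm, 1, 1))  # inner only
print(regions_to_visit(absolute_difference, 0, dm, 4, 1))  # both sets
```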
Generalized Hyper-plane • $\{ x \in X \mid d(p_1,x) \le d(p_2,x) \}$ • $\{ x \in X \mid d(p_1,x) > d(p_2,x) \}$
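Generalized hyper-plane partitioning assigns each object to the closer of two pivots, with ties going to the first pivot as in the definition above. A minimal sketch with invented names and data:

```python
def hyperplane_partition(X, d, p1, p2):
    """Split X by which of the two pivots is closer; ties go to p1."""
    closer_to_p1 = [x for x in X if d(p1, x) <= d(p2, x)]
    closer_to_p2 = [x for x in X if d(p1, x) > d(p2, x)]
    return closer_to_p1, closer_to_p2

def absolute_difference(x, y):
    return abs(x - y)

left, right = hyperplane_partition([0, 1, 4, 5, 6, 9], absolute_difference, 2, 7)
print(left, right)
```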
Excluded Middle Partitioning • Inner set: $\{ x \in X \mid d(p,x) \le d_m - \rho \}$ • Outer set: $\{ x \in X \mid d(p,x) > d_m + \rho \}$ • Excluded set: otherwise (a band of width $2\rho$ around $d_m$)
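A sketch of excluded middle partitioning, taking the half-width of the excluded band as ρ (written `rho`); that symbol, the function name, and the data are assumptions for this example. The point of the band is that any inner object and any outer object are more than 2ρ apart, so a range query with radius at most ρ never needs both sets.

```python
def excluded_middle_partition(X, d, p, dm, rho):
    """Split X into (inner, outer, excluded) around the ball of radius dm at p."""
    inner, outer, excluded = [], [], []
    for x in X:
        dist = d(p, x)
        if dist <= dm - rho:
            inner.append(x)
        elif dist > dm + rho:
            outer.append(x)
        else:
            excluded.append(x)
    return inner, outer, excluded

def absolute_difference(x, y):
    return abs(x - y)

inner, outer, excluded = excluded_middle_partition(
    list(range(11)), absolute_difference, 0, 5, 1)
print(inner, outer, excluded)

# Separation property: inner and outer objects are more than 2 * rho apart.
gap = min(absolute_difference(x, y) for x in inner for y in outer)
print(gap)
```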
Clustering • Cluster data into sets • bounded by a ball region • $\{ x \in X \mid d(p_i,x) \le r_i^c \}$
Scalability: Peer-to-Peer Indexing • Local search: M-tree, D-Index, M-Index • Native metric techniques: GHT*, VPT* • Transformation techniques: M-CAN, M-Chord
The M-tree [Ciaccia, Patella, Zezula, VLDB 1997] 1) Paged organization 2) Dynamic 3) Suitable for arbitrary metric spaces 4) I/O and CPU optimization - computing d can be time-consuming
The M-tree Idea • Depending on the metric, the “shape” of index regions changes [Figure: regions A to F drawn under different metrics: $L_2$ (Euclidean), $L_1$ (city-block), $L_\infty$ (max-metric), weighted Euclidean, quadratic form]
M-tree: Example [Figure: an M-tree over objects o1 to o11. Routing entries store a covering radius and a distance to parent; leaf entries store a distance to parent.]
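The two stored values in the example, covering radius and distance to parent, are what let an M-tree skip distance computations. The sketch below is a strongly simplified, two-level illustration (one layer of routing entries over leaf entries), not the paged M-tree itself; class and function names are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class LeafEntry:
    obj: float
    dist_to_parent: float

@dataclass
class RoutingEntry:
    pivot: float
    covering_radius: float
    leaves: list

def make_routing_entry(pivot, objects, d):
    """Precompute distances to the pivot and the radius covering all objects."""
    leaves = [LeafEntry(o, d(pivot, o)) for o in objects]
    return RoutingEntry(pivot, max(l.dist_to_parent for l in leaves), leaves)

def mtree_range_search(entries, d, q, r):
    """Return (answer, number of distance computations) for R(q, r)."""
    answer, computed = [], 0
    for entry in entries:
        dq = d(q, entry.pivot)
        computed += 1
        if dq > r + entry.covering_radius:
            continue  # the whole region lies outside the query ball
        for leaf in entry.leaves:
            if abs(dq - leaf.dist_to_parent) > r:
                continue  # excluded by the triangle inequality, no distance computed
            computed += 1
            if d(q, leaf.obj) <= r:
                answer.append(leaf.obj)
    return answer, computed

def absolute_difference(x, y):
    return abs(x - y)

entries = [make_routing_entry(0.0, [0.0, 1.0, 2.0], absolute_difference),
           make_routing_entry(10.0, [9.0, 10.0, 12.0], absolute_difference)]
answer, computed = mtree_range_search(entries, absolute_difference, 1.5, 1.0)
print(answer, computed)
```

In this toy run the query is answered with four distance computations, while a scan of the six objects would need six.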
M-tree family • Bulk loading • Slim-tree • Multi-way insertion • PM-tree • M2-tree • etc.
D-Index [Dohnal, Gennaro, Zezula, MTA 2002] • 4 separable buckets at the first level • 2 separable buckets at the second level • exclusion bucket of the whole structure
D-index: Insertion
D-index: Range Search
Implementation Postulates of Distributed Indexes • dynamism - nodes can be added and removed • no hot-spots - no centralized nodes, no flooding by messages (transactions) • update independence - a network update at one site does not require immediate change propagation to all the other sites
Distributed Similarity Search Structures • Native metric structures: • GHT* (Generalized Hyperplane Tree) • VPT* (Vantage Point Tree) • Transformation approaches: • M-CAN (Metric Content Addressable Network) • M-Chord (Metric Chord)
GHT* Address Search Tree • Based on the Generalized Hyperplane Tree [Uhl91] • two pivots for binary partitioning [Figure: a data space partitioned by pivots p1 to p6 and the corresponding tree]
GHT* Address Search Tree • Inner node • two pivots (reference objects) • Leaf node • BID: pointer to a bucket if the data is stored on the current peer • NNID: pointer to a peer if the data is stored on a different peer [Figure: a tree with pivots p1 to p6 and leaves BID1, BID2, BID3, and NNID2, which points to Peer 2]
GHT* Address Search Tree [Figure: the address search tree distributed over Peer 1, Peer 2, and Peer 3]
GHT* Range Query • Range query R(q,r) • traverse the peer’s own AST • search buckets for all BIDs found • forward the query to all NNIDs found
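The traversal step can be sketched as below. The class names, the one-dimensional pivots, and the tree layout are assumptions for illustration; the branching test is the standard generalized hyper-plane condition derived from the triangle inequality. The function only collects the BIDs to search locally and the NNIDs to forward to.

```python
from dataclasses import dataclass

@dataclass
class Inner:
    p1: float
    p2: float
    left: object   # subtree for objects closer to p1
    right: object  # subtree for objects closer to p2

@dataclass
class BID:
    bucket: int    # bucket stored on the current peer

@dataclass
class NNID:
    peer: int      # peer that stores the data

def ast_range_traverse(node, d, q, r):
    """Return (local bucket ids, remote peer ids) that R(q, r) may touch."""
    if isinstance(node, BID):
        return {node.bucket}, set()
    if isinstance(node, NNID):
        return set(), {node.peer}
    bids, nnids = set(), set()
    d1, d2 = d(node.p1, q), d(node.p2, q)
    if d1 - r <= d2 + r:   # the query ball may reach the p1 side
        b, n = ast_range_traverse(node.left, d, q, r)
        bids |= b
        nnids |= n
    if d1 + r > d2 - r:    # the query ball may reach the p2 side
        b, n = ast_range_traverse(node.right, d, q, r)
        bids |= b
        nnids |= n
    return bids, nnids

def absolute_difference(x, y):
    return abs(x - y)

# Hypothetical AST with three local buckets and one remote peer.
ast = Inner(2.0, 8.0,
            Inner(1.0, 3.0, BID(1), BID(2)),
            Inner(7.0, 9.0, BID(3), NNID(2)))

print(ast_range_traverse(ast, absolute_difference, 2.5, 0.2))
print(ast_range_traverse(ast, absolute_difference, 9.5, 0.3))
```

The first query stays in one local bucket, while the second has no local work and is forwarded to the remote peer only.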
AST: Logarithmic Replication • Full AST on every peer is space consuming • replication of pivots grows in a linear way • Store only a part of the AST: • all paths to local buckets • Deleted sub-trees: • replaced by the NNID of the leftmost peer [Figure: a full AST with pivots p1 to p14, local bucket BID1, and leaves NNID2 to NNID8]
AST: Logarithmic Replication (cont.) • Resulting tree • replication of pivots grows in a logarithmic way [Figure: the reduced AST keeps pivots p1, p2, p3, p4, p7, p8 on the path to BID1, with the removed sub-trees replaced by NNID2, NNID3, and NNID5]
VPT* Structure • Similar to the GHT*, but ball partitioning is used for the AST • Based on the Vantage Point Tree [Yia93] • inner nodes have one pivot and a radius • different traversing conditions
M-Chord: The Metric Chord • Transform the metric space to a one-dimensional domain • Use M-Index, a generalized version of iDistance • Divide the domain into intervals • assign each interval to a peer • Use the Chord P2P protocol for navigation • Alternatively, the Skip Graphs distributed protocol can be used
M-Chord: Indexing the Distance • iDistance: indexing technique for vector domains • cluster analysis = centers = reference points $p_i$ • assign iDistance keys to objects • range query R(q,r): identify intervals of interest • Generalization to metric spaces • select pivots • then partition: Voronoi-style
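The key assignment and the "intervals of interest" can be sketched as follows, assuming the usual iDistance scheme: an object in the cluster of pivot $p_i$ gets the key $i \cdot c + d(p_i, o)$, where the constant $c$ exceeds every distance in the space. The names, the pivots, and the value of `c` are hypothetical, and the extra cluster filtering used in practice is left out.

```python
def idistance_key(o, pivots, d, c):
    """Key of o: offset of its closest pivot's cluster plus the distance to that pivot."""
    i = min(range(len(pivots)), key=lambda j: d(pivots[j], o))
    return i * c + d(pivots[i], o)

def query_intervals(q, r, pivots, d, c):
    """One key interval per cluster that can contain answers to R(q, r)."""
    intervals = []
    for i, p in enumerate(pivots):
        dq = d(p, q)
        lo = i * c + max(0.0, dq - r)
        hi = i * c + min(c, dq + r)
        if lo <= hi:
            intervals.append((lo, hi))
    return intervals

def absolute_difference(x, y):
    return abs(x - y)

pivots = [0.0, 10.0]   # hypothetical pivots in a 1-D toy space
c = 100.0              # hypothetical constant larger than any distance
objects = [1.0, 2.0, 3.0, 8.0, 9.0, 12.0]

keys = {o: idistance_key(o, pivots, absolute_difference, c) for o in objects}
intervals = query_intervals(2.5, 1.0, pivots, absolute_difference, c)
# Candidates are objects whose key falls into some interval; they still need
# a final distance check against the query.
candidates = [o for o in objects
              if any(lo <= keys[o] <= hi for lo, hi in intervals)]
print(keys, intervals, candidates)
```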
M-Chord: Chord Protocol • Peer-to-Peer navigation protocol • Peers are responsible for intervals of keys • $O(\log n)$ hops (for n peers) to localize a node storing a key • M-Chord • set the iDistance domain • make it uniform: function h • Use Chord on this domain
M-Chord: Range Query • Node Nq initiates the search • Determine intervals • generalized iDistance • Forward requests to peers on intervals • Search in the nodes • using local organization • Merge the received partial answers
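The steps above can be simulated in a single process as a rough sketch. Everything here is an assumption for illustration: the pivots, the constant `C`, the interval boundaries, and the class and function names. The loop over all peers stands in for Chord routing, and the uniformity function h is not modelled.

```python
import bisect

PIVOTS = [0.0, 10.0]   # hypothetical pivots in a 1-D toy space
C = 100.0              # hypothetical constant separating the clusters' key ranges

def d(x, y):
    return abs(x - y)

def key(o):
    """iDistance-style key: cluster offset plus distance to the closest pivot."""
    i = min(range(len(PIVOTS)), key=lambda j: d(PIVOTS[j], o))
    return i * C + d(PIVOTS[i], o)

class Peer:
    """Holds the objects whose keys fall into the interval [lo, hi)."""
    def __init__(self, lo, hi):
        self.lo, self.hi, self.store = lo, hi, []

    def local_search(self, q, r, lo, hi):
        return [o for k, o in self.store if lo <= k <= hi and d(q, o) <= r]

def build_network(boundaries, objects):
    peers = [Peer(lo, hi) for lo, hi in zip(boundaries, boundaries[1:])]
    for o in objects:
        k = key(o)
        peers[bisect.bisect_right(boundaries, k) - 1].store.append((k, o))
    return peers

def mchord_range_query(peers, q, r):
    """Determine intervals, contact overlapping peers, merge partial answers."""
    answer, contacted = [], set()
    for i, p in enumerate(PIVOTS):
        dq = d(p, q)
        lo, hi = i * C + max(0.0, dq - r), i * C + min(C, dq + r)
        for n, peer in enumerate(peers):
            if peer.lo <= hi and lo < peer.hi:
                contacted.add(n)
                answer.extend(peer.local_search(q, r, lo, hi))
    return sorted(answer), sorted(contacted)

peers = build_network([0.0, 5.0, 100.0, 105.0, 200.0],
                      [1.0, 2.0, 3.0, 8.0, 9.0, 12.0])
print(mchord_range_query(peers, 2.5, 1.0))
print(mchord_range_query(peers, 9.5, 1.0))
```

In both toy queries only two of the four peers are contacted, and one of them turns out to hold no qualifying object, which hints at why cluster-level filtering matters in practice.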