Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU www.cs.cmu.edu/~christos
Outline Goal: ‘Find similar / interesting things’ • Problem - Applications • Indexing - similarity search • New tools for Data Mining: Fractals • Conclusions • Resources C. Faloutsos
Problem Given a large collection of (multimedia) records, find similar/interesting things, i.e.: • Allow fast, approximate queries, and • Find rules/patterns
Sample queries • Similarity search • Find pairs of branches with similar sales patterns • Find medical cases similar to Smith's • Find pairs of sensor series that move in sync
Sample queries – cont’d • Rule discovery • Clusters (of patients; of customers; ...) • Forecasting (total sales for next year?) • Outliers (e.g., fraud detection)
Outline Goal: ‘Find similar / interesting things’ • Problem - Applications • Indexing - similarity search • New tools for Data Mining: Fractals • Conclusions • Resources
Indexing - Multimedia Problem: • given a set of (multimedia) objects, • find the ones similar to a desired query object (quickly!)
[Figure: three sample time sequences, $price vs. day (1 to 365)] distance function: by expert
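‘GEMINI’ - Pictorially [Figure: each sequence S1 ... Sn (plotted over days 1 to 365) is mapped to a point F(S1) ... F(Sn) in a feature space with axes such as avg and std; the points are indexed with off-the-shelf S.A.M.s (Spatial Access Methods)]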
‘GEMINI’ • fast; ‘correct’ (= no false dismissals) • used for: images (e.g., QBIC) (2x, 10x faster); shapes (27x faster); video (e.g., InforMedia); time sequences ([Rafiei+Mendelzon], ++)
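The filter-and-refine idea behind GEMINI can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: the function names are hypothetical, the features are the avg and std from the figure, Euclidean distance stands in for the expert's distance function, and a linear scan over the feature points stands in for the spatial access method. Scaling the feature distance by sqrt(n) makes it a lower bound of the Euclidean distance between two length-n sequences, which is what rules out false dismissals.

```python
import numpy as np

def extract_features(seqs):
    """Map each length-n sequence to the 2-d feature point (avg, std)."""
    seqs = np.asarray(seqs, dtype=float)
    # np.std defaults to the population std, which the lower bound relies on.
    return np.column_stack([seqs.mean(axis=1), seqs.std(axis=1)])

def gemini_range_query(seqs, query, eps):
    """Return (indices of sequences within eps of query, number of candidates)."""
    seqs = np.asarray(seqs, dtype=float)
    query = np.asarray(query, dtype=float)
    n = seqs.shape[1]
    feats = extract_features(seqs)
    qfeat = extract_features(query[None, :])[0]
    # Filter step: cheap distance in feature space, scaled so that it never
    # exceeds the true distance (assumption: Euclidean distance on sequences).
    lower_bound = np.sqrt(n) * np.linalg.norm(feats - qfeat, axis=1)
    # Small tolerance guards against rounding when the bound is tight.
    candidates = np.flatnonzero(lower_bound <= eps + 1e-9)
    # Refine step: compute the true distance only for the surviving candidates.
    true_dist = np.linalg.norm(seqs[candidates] - query, axis=1)
    return candidates[true_dist <= eps], len(candidates)
```

In a real system the linear scan in the filter step would be replaced by a range query on an R-tree or similar index over the feature points.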
Remaining issues • how to extract features automatically? • how to merge similarity scores from different media?
Outline Goal: ‘Find similar / interesting things’ • Problem - Applications • Indexing - similarity search • Visualization: FastMap • Relevance feedback: FALCON • Data Mining / Fractals • Conclusions
[Figure: FastMap illustration with labels ‘~100’, ‘~1’ and ‘??’]
FastMap • Multi-dimensional scaling (MDS) can do that, but in O(N^2) time • We want a linear algorithm: FastMap [SIGMOD95]
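The slide does not spell out the algorithm, so the following is a sketch of the FastMap projection as commonly described: pick two far-apart pivot objects, place every object on the line through them with the cosine law, then repeat on the residual distances. The function name, the pivot heuristic settings (`n_pivot_iters`, `seed`) and the brute-force pivot search are assumptions made for illustration.

```python
import numpy as np

def fastmap(objects, dist, k, n_pivot_iters=5, seed=0):
    """Embed objects into k dimensions using only the distance function dist."""
    rng = np.random.default_rng(seed)
    n = len(objects)
    coords = np.zeros((n, k))

    def d2(i, j, col):
        # Squared distance left over after the first `col` coordinates.
        base = dist(objects[i], objects[j]) ** 2
        diff = coords[i, :col] - coords[j, :col]
        return max(base - float(diff @ diff), 0.0)

    for col in range(k):
        # Pivot heuristic: start anywhere, then hop to the farthest object
        # a few times to find a pair (a, b) that is far apart.
        b = int(rng.integers(n))
        a = b
        for _ in range(n_pivot_iters):
            a = max(range(n), key=lambda i: d2(i, b, col))
            b = max(range(n), key=lambda i: d2(i, a, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # all residual distances are zero; nothing left to embed
        dab = np.sqrt(dab2)
        for i in range(n):
            # Cosine law: position of object i along the line a -> b.
            coords[i, col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return coords
```

Each coordinate needs a number of distance computations proportional to N, never the full N x N distance matrix, which is where the linear behaviour comes from.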
Applications: time sequences • given n co-evolving time sequences • visualize them + find rules [ICDE00] [Figure: rate vs. time for DEM, JPY, HKD]
Applications - financial • currency exchange rates [ICDE00] [Figure: scatter plot with points labeled FRF, GBP, JPY, HKD, USD(t), USD(t-5)]
Applications - financial • currency exchange rates [ICDE00] [Figure: scatter plot with points labeled FRF, DEM, HKD, JPY, USD, GBP, USD(t), USD(t-5)]
Outline Goal: ‘Find similar / interesting things’ • Problem - Applications • Indexing - similarity search • Visualization: FastMap • Relevance feedback: FALCON • Data Mining / Fractals • Conclusions
Merging similarity scores • e.g., video: text, color, motion, audio • weights change with the query! • solution 1: user specifies weights • solution 2: user gives examples • and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader) • but: how about disjunctive queries?
DEMO demo server
‘FALCON’ [Figure: stock price patterns labeled ‘Vs’ and ‘Inverted Vs’] Trader wants only ‘unstable’ stocks
‘FALCON’ [Figure: stock price patterns labeled ‘Vs’ and ‘Inverted Vs’] average: is flat!
“Single query point” methods [Figure: Rocchio; ‘+’ examples and a single query point ‘x’ plotted on avg vs. std]
“Single query point” methods [Figure: three scatter plots of ‘+’ examples, each with a single query point ‘x’, for Rocchio, MindReader and MARS] The averaging effect in action...
Main idea: FALCON Contours [Wu+, vldb2000] [Figure: ‘+’ examples plotted on feature1 (e.g., avg) vs. feature2 (e.g., std)]
A: Aggregate Dissimilarity • α: parameter (~ -5 ~ ‘soft OR’) [Figure: ‘+’ examples including good points g1 and g2, and a candidate point x]
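The slide names the parameter but not the formula. A minimal sketch, assuming the aggregate dissimilarity is the generalized mean of the distances from a candidate x to the good points g1 ... gk, that is ((1/k) * sum_i d(x, g_i)^alpha)^(1/alpha); the function name and the use of Euclidean distance are illustrative assumptions. With a negative exponent, the smallest distance dominates, so being close to any one good point is enough to score well (the ‘soft OR’).

```python
import numpy as np

def aggregate_dissimilarity(x, good_points, alpha=-5.0):
    """Generalized-mean distance from x to a set of 'good' example points."""
    if alpha == 0:
        raise ValueError("alpha must be nonzero")
    x = np.asarray(x, dtype=float)
    good = np.asarray(good_points, dtype=float)
    d = np.linalg.norm(good - x, axis=1)  # assumption: Euclidean distance
    if alpha < 0 and np.any(d == 0):
        # x coincides with a good point; the negative power would diverge.
        return 0.0
    return float(np.mean(d ** alpha) ** (1.0 / alpha))
```

With two good points far apart, a candidate next to either one gets a small score, while the midpoint between them (the place a single-query-point method would favour) does not.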
FALCON • converges quickly (~5 iterations) • good precision/recall • is fast (can use off-the-shelf ‘spatial/metric access methods’)
Conclusions for indexing + visualization • GEMINI: fast indexing, exploiting off-the-shelf SAMs • FastMap: automatic feature extraction in O(N) time • FALCON: relevance feedback for disjunctive queries
Outline Goal: ‘Find similar / interesting things’ • Problem - Applications • Indexing - similarity search • New tools for Data Mining: Fractals • Conclusions • Resources
Data mining & fractals – Road map • Motivation – problems / case study • Definition of fractals and power laws • Solutions to posed problems • More examples
Problem #1 - spatial data mining (d.m.) Galaxies (Sloan Digital Sky Survey w/ B. Nichol) • ‘spiral’ and ‘elliptical’ galaxies (stores & households; healthy & ill subjects) • patterns? (not Gaussian; not uniform) • attraction/repulsion? • separability??
Problem #2: dim. reduction • given attributes x1, ... xn • possibly, non-linearly correlated • drop the useless ones (Q: why? A: to avoid the ‘dimensionality curse’) [Figure: mpg vs. engine size]
Answer: • Fractals / self-similarities / power laws
What is a fractal? = self-similar point set, e.g., Sierpinski triangle: zero area; infinite length! ...
Definitions (cont’d) • Paradox: Infinite perimeter; Zero area! • ‘dimensionality’: between 1 and 2 • actually: log(3)/log(2) = 1.58… (long story)
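The ‘long story’ has a short version in the standard self-similarity argument, sketched here as a worked equation. The symbols N, s and D are introduced only for this illustration and do not appear on the slides.

```latex
% A set made of N copies of itself, each shrunk by a factor s,
% has similarity dimension D defined by N = s^D.
% Sierpinski triangle: N = 3 copies, each scaled down by s = 2.
\[
  3 = 2^{D}
  \quad\Longrightarrow\quad
  D = \frac{\log 3}{\log 2} \approx 1.585
\]
% Sanity checks: a line segment (N = 2, s = 2) gives D = 1,
% and a filled square (N = 4, s = 2) gives D = 2.
```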
Intrinsic (‘fractal’) dimension Q: fractal dimension of a line? [Figure: e.g., #cylinders vs. miles / gallon]
Intrinsic (‘fractal’) dimension Q: fractal dimension of a line? A: nn(<= r) ~ r^1 (nn = number of neighbors within distance r)
Intrinsic (‘fractal’) dimension Q: fractal dimension of a line? A: nn(<= r) ~ r^1 Q: fd of a plane? A: nn(<= r) ~ r^2 fd == slope of (log(nn) vs log(r))
Sierpinski triangle [Figure: log(#pairs within <= r) vs. log(r), a straight line of slope 1.58] == ‘correlation integral’
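The slope in this plot can be estimated numerically. The sketch below samples points on the Sierpinski triangle with the ‘chaos game’, counts the pairs within each radius by brute force, and fits a line in log-log space. The function names, the sample size and the range of radii are assumptions chosen for illustration; the estimate is noisy and depends on them.

```python
import numpy as np

def sierpinski_points(n, seed=0):
    """Sample n points on the Sierpinski triangle with the 'chaos game'."""
    rng = np.random.default_rng(seed)
    vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    p = np.array([0.25, 0.25])
    pts = np.empty((n, 2))
    burn_in = 20  # discard the first few steps, which may lie off the set
    for i in range(-burn_in, n):
        # Jump halfway towards a randomly chosen vertex.
        p = (p + vertices[rng.integers(3)]) / 2.0
        if i >= 0:
            pts[i] = p
    return pts

def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r), fitted by least squares."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    pair_dist = dist[np.triu_indices(len(pts), k=1)]  # each pair counted once
    counts = np.array([(pair_dist <= r).sum() for r in radii])
    # Clip to 1 so that an empty radius does not produce log(0).
    slope, _ = np.polyfit(np.log(radii), np.log(np.maximum(counts, 1)), 1)
    return slope
```

For example, `correlation_dimension(sierpinski_points(1500), np.logspace(-2, -1, 8))` should land near log(3)/log(2), while points spread along a line should give a slope near 1.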
Observations: self-similarity • <=> fractals • <=> scale-free • <=> power-laws (y=x^a, F=C*r^(-2)) [Figure: log(#pairs within <= r) vs. log(r), slope 1.58]
Road map • Motivation – problems / case studies • Definition of fractals and power laws • Solutions to posed problems • More examples • Conclusions
Solution #1: spatial d.m. Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000]) • clusters? • separable? • attraction/repulsion? • data ‘scrubbing’ – duplicates?
Solution #1: spatial d.m. [Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi and spi-ell pairs] • 1.8 slope • plateau! • repulsion!
Solution #1: spatial d.m. [w/ Seeger, Traina, Traina, SIGMOD00] [Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi and spi-ell pairs] • 1.8 slope • plateau! • repulsion!
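The plot counts pairs within distance r, both inside one class (spi-spi, ell-ell) and across two classes (spi-ell). A minimal sketch of the cross-class count follows; the function name is hypothetical, and the brute-force distance computation is a stand-in for illustration, not the method of the cited paper.

```python
import numpy as np

def cross_pair_counts(a, b, radii):
    """For each r, count pairs (p in a, q in b) with distance(p, q) <= r."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # All |a| x |b| Euclidean distances, computed by brute force.
    diff = a[:, None, :] - b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2)).ravel()
    return np.array([(dist <= r).sum() for r in radii])
```

Plotting the logarithm of these counts against log(r), next to the same counts for each class paired with itself, gives the three curves of the figure.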
spatial d.m. Heuristic on choosing # of clusters [Figure: point clusters and a plot, both annotated with the radii r1 and r2]
Solution #1: spatial d.m. [Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi and spi-ell pairs] • 1.8 slope • plateau! • repulsion!! • duplicates
Problem #2: Dim. reduction
Solution: • drop the attributes that don’t increase the ‘partial f.d.’ (PFD) • dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
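The rule above can be sketched as a greedy pass over the attributes. This is an illustrative assumption, not the procedure of the cited paper: the fractal dimension is estimated as the slope of the log-log pair-count plot, attributes are tried in column order, and `min_gain` is a hypothetical threshold for deciding that the PFD has increased.

```python
import numpy as np

def fractal_dimension(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r) for a cloud of points."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    pair_dist = dist[np.triu_indices(len(pts), k=1)]
    counts = np.array([(pair_dist <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(np.maximum(counts, 1)), 1)
    return slope

def select_attributes(data, radii, min_gain=0.3):
    """Keep an attribute only if adding it raises the partial f.d. by min_gain."""
    data = np.asarray(data, dtype=float)
    kept, current_fd = [], 0.0
    for col in range(data.shape[1]):
        # PFD of the cloud projected onto the kept attributes plus this one.
        pfd = fractal_dimension(data[:, kept + [col]], radii)
        if pfd - current_fd > min_gain:
            kept.append(col)
            current_fd = pfd
    return kept
```

An attribute that is a function of the ones already kept, even a non-linear one, leaves the points on the same curve or surface, so the PFD barely moves and the attribute is dropped. The result depends on the order in which attributes are tried.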
Problem #2: dim. reduction [Figure: example point clouds annotated with global FD=1 and partial fractal dimensions PFD=1, PFD~1, PFD=0, PFD=1, PFD~1]