Multimedia DBs
Multimedia dbs
• A multimedia database stores text, strings, and images
• Similarity queries (content-based retrieval)
• Given an image, find the images in the database that are similar (or you can "describe" the query image)
• Extract features, index in feature space, answer similarity queries using GEMINI (a filter-and-refine sketch follows below)
• Again, average values help!
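A minimal sketch of the GEMINI filter-and-refine idea, assuming hypothetical helpers feature, dist_feature, and dist_true, where the feature-space distance lower-bounds the true distance:

```python
# Minimal GEMINI filter-and-refine sketch (helper names are hypothetical).
# Assumption: dist_feature(feature(a), feature(b)) <= dist_true(a, b)
# (the lower-bounding property), so the filter step never drops a true answer.

def gemini_range_query(db, query, eps, feature, dist_feature, dist_true):
    q_feat = feature(query)
    # Filter: cheap test in the low-dimensional feature space
    candidates = [obj for obj in db if dist_feature(feature(obj), q_feat) <= eps]
    # Refine: verify each candidate with the expensive, true distance
    return [obj for obj in candidates if dist_true(obj, query) <= eps]
```

In a real system the filter step would probe a spatial access method (SAM) built over pre-extracted features rather than scanning the whole database.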
Image Features
• Features extracted from an image are based on:
  • Color distribution
  • Shapes and structure
  • ...
Images - color
Q: what is an image?
A: a 2-d RGB array
Images - color
Color histograms, and a distance function between them
Images - color
Mathematically, the distance function between a vector x and a query q is:
D(x, q) = (x - q)^T A (x - q) = Σ_{i,j} a_{ij} (x_i - q_i)(x_j - q_j)
Q: can we just use A = I (plain Euclidean distance)?
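A small illustration of the quadratic-form distance (histogram values are made up); with A = I it collapses to the squared Euclidean distance:

```python
import numpy as np

def quad_form_dist(x, q, A):
    # D(x, q) = (x - q)^T A (x - q); A captures cross-talk between color bins
    d = np.asarray(x, float) - np.asarray(q, float)
    return float(d @ A @ d)

x = np.array([0.2, 0.5, 0.3])   # toy 3-bin color histograms
q = np.array([0.1, 0.6, 0.3])
print(quad_form_dist(x, q, np.eye(3)))   # A = I: squared Euclidean distance
print(np.sum((x - q) ** 2))              # same value
```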
Images - color
Problem: 'cross-talk': features are not orthogonal -> SAMs will not work properly
Q: what to do?
A: it is a feature-extraction question
Images - color
Possible answer: avg red, avg green, avg blue
It turns out that this lower-bounds the histogram distance -> no cross-talk, and SAMs are applicable (see the sketch below)
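A sketch of the 3-d average-color feature on toy random images; the exact scaling constant that makes the avg-RGB distance a lower bound on the histogram distance is derived in the QBIC work and omitted here:

```python
import numpy as np

def avg_rgb(image):
    # image: H x W x 3 array of RGB values -> 3-d feature (avg R, avg G, avg B)
    return np.asarray(image, float).reshape(-1, 3).mean(axis=0)

rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, size=(64, 64, 3))
img_b = rng.integers(0, 256, size=(64, 64, 3))
# Cheap filter distance in the 3-d feature space (what the SAM would index)
print(np.linalg.norm(avg_rgb(img_a) - avg_rgb(img_b)))
```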
Images - color
Performance: [plot of query time vs. selectivity, comparing sequential scan against the avg-RGB method]
Images - shapes
• distance function: Euclidean, on the area, perimeter, and 20 'moments'
• Q: how to normalize them? A: divide each feature by its standard deviation
• Q: other 'features' / distance functions? A1: turning angle A2: dilations/erosions A3: ...
• Q: how to do dim. reduction? A: the Karhunen-Loève transform (= centered PCA/SVD), see the sketch below
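A minimal sketch of the normalization plus Karhunen-Loève (centered PCA via SVD) step, on a hypothetical feature matrix with one row per shape (area, perimeter, 20 moments):

```python
import numpy as np

def kl_reduce(F, k):
    # F: N x 22 matrix of shape features (area, perimeter, 20 moments)
    F = np.asarray(F, float)
    Z = (F - F.mean(axis=0)) / F.std(axis=0)   # normalize: divide by std deviation
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:k].T                        # keep the k strongest directions

F = np.random.rand(100, 22)                    # toy data: 100 shapes
print(kl_reduce(F, 2).shape)                   # -> (100, 2)
```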
Images - shapes
Performance: ~10x faster
[Plot: log(# of I/Os) vs. # of features kept, with the 'all kept' baseline for comparison]
Dimensionality Reduction
• Many problems (like time-series and image similarity) can be expressed as proximity problems in a high-dimensional space
• Given a query point, we try to find the points that are close...
• But in high-dimensional spaces things are different!
Effects of High-dimensionality
• Assume a uniformly distributed set of points in [0,1]^d, for high dimensionality d
• Take a query box with side length 0.1 in each dimension: its selectivity in 100-d is 0.1^100 = 10^-100
• If we want a constant selectivity of 0.1, the side length must be 0.1^(1/d), which is ~1 for d = 100
(see the calculation below)
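A quick check of the selectivity arithmetic:

```python
# Selectivity of a query box with side 0.1 per dimension, uniform data in [0,1]^d
for d in (2, 10, 100):
    print(d, 0.1 ** d)            # d=100 -> 1e-100: essentially no points qualify

# Side length needed to keep selectivity at 0.1
for d in (2, 10, 100):
    print(d, 0.1 ** (1 / d))      # d=100 -> ~0.977: almost the full range
```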
Effects of High-dimensionality
• Surface is everything!
• Probability that a uniform point is closer than 0.1 to a (d-1)-dimensional face of the cube: 1 - 0.8^d
  • d = 2: 0.36
  • d = 10: ~0.9
  • d = 100: ~1
(a numeric check follows below)
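The same effect, checked numerically (a point is within 0.1 of some face iff at least one coordinate falls outside [0.1, 0.9]):

```python
# P(point within 0.1 of a (d-1)-dim face of [0,1]^d) = 1 - 0.8**d
for d in (2, 10, 100):
    print(d, 1 - 0.8 ** d)        # 0.36, ~0.89, ~1.0
```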
Effects of High-dimensionality
• The number of grid cells and surfaces explodes with d
• Number of k-dimensional surfaces (faces) in a d-dimensional hypercube: C(d, k) * 2^(d-k)
• Binary partitioning (one split per dimension) already yields 2^d cells
• Indexing in high dimensions is extremely difficult: the "curse of dimensionality"
(the arithmetic is spelled out below)
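And the combinatorial blow-up, using the standard face-count formula for the hypercube (values for d = 100):

```python
from math import comb

d = 100
# Number of k-dimensional faces of a d-dimensional hypercube: C(d, k) * 2**(d - k)
print(comb(d, d - 1) * 2)         # 200 facets of dimension d-1
print(comb(d, 1) * 2 ** (d - 1))  # ~6.3e31 edges (1-dimensional faces)
print(2 ** d)                     # binary partitioning: 2^100 ~ 1.3e30 grid cells
```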
Dimensionality Reduction
• The main idea: reduce the dimensionality of the space
• Project the d-dimensional points into a k-dimensional space so that:
  • k << d
  • distances are preserved as well as possible
• Solve the problem in low dimensions (the GEMINI idea, of course...)
DR requirements
• The ideal mapping should:
  • Be fast to compute: O(N) or O(N log N), but not O(N^2)
  • Preserve distances, leading to small discrepancies
  • Provide a fast algorithm to map a new query (why?)
MDS (multidimensional scaling)
• Input: a set of N items, the pair-wise (dis)similarities, and the target dimensionality k
• Optimization criterion:
  stress = ( Σ_{i,j} (D(S_i, S_j) - D(S^k_i, S^k_j))^2 / Σ_{i,j} D(S_i, S_j)^2 )^{1/2}
  where D(S_i, S_j) is the distance between time series S_i, S_j, and D(S^k_i, S^k_j) is the Euclidean distance between their k-dim representations
• Steepest-descent algorithm:
  • start with an assignment (time series to k-dim point)
  • minimize stress by moving points (the criterion is sketched in code below)
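A small sketch of the stress computation; the steepest-descent loop that perturbs the embedded points to reduce it is omitted, and summing each pair once is just one reasonable reading of the formula:

```python
import numpy as np

def stress(D, Y):
    # D: N x N matrix of original (dis)similarities
    # Y: N x k matrix of the k-dimensional embedding
    Dk = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)   # embedded distances
    iu = np.triu_indices(len(D), k=1)                             # each pair once
    return np.sqrt(np.sum((D[iu] - Dk[iu]) ** 2) / np.sum(D[iu] ** 2))
```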
MDS
• Disadvantages:
  • Running time is O(N^2), because of slow convergence
  • Also, it requires O(N) time to insert a new point, which is not practical for queries
FastMap [Faloutsos and Lin, 1995]
• Maps objects to k-dimensional points so that distances are preserved well
• It is an approximation of Multidimensional Scaling
• Works even when only the distances are known
• Is efficient, and allows efficient query transformation
FastMap
• Find two objects (pivots) that are far apart
• Project all points onto the line the two pivots define, to get the first coordinate (see the sketch below)
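A sketch of that first projection step, assuming only a distance function dist(i, j) over object indices; the pivot choice uses the usual "farthest from an arbitrary object" heuristic:

```python
def choose_pivots(n, dist):
    # Heuristic: start from an arbitrary object, jump to the farthest one, repeat once
    a = max(range(n), key=lambda j: dist(0, j))
    b = max(range(n), key=lambda j: dist(a, j))
    return a, b

def first_coordinate(n, dist):
    a, b = choose_pivots(n, dist)
    d_ab = dist(a, b)
    # Cosine-law projection of each object onto the line through the two pivots
    return [(dist(a, i) ** 2 + d_ab ** 2 - dist(b, i) ** 2) / (2 * d_ab)
            for i in range(n)]
```

FastMap then recurses: it works with the residual distances in the hyperplane perpendicular to the pivot line and repeats the step k times, one coordinate per iteration.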
Results
Documents / cosine similarity -> Euclidean distance (how? one way is sketched below)
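One standard way to get this mapping: normalize each document vector to unit length; then squared Euclidean distance and cosine similarity are monotonically related, so cosine retrieval becomes a Euclidean nearest-neighbor problem:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(1000); x /= np.linalg.norm(x)   # toy unit-length "document" vectors
y = rng.random(1000); y /= np.linalg.norm(y)

# For unit vectors: ||x - y||^2 = 2 * (1 - cos(x, y))
print(np.linalg.norm(x - y) ** 2)
print(2 * (1 - x @ y))                          # same value, up to rounding
```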