Multimedia DBs
PAA and APCA
• Another approach: segment the time series into equal parts and store the average value of each part.
• Use an index to store the averages and the segment end points.
Feature Spaces
[Figure: a time series X and its approximation X' in three feature spaces, plotted over time instants 0 to 140: SVD with basis functions eigenwave 0 to eigenwave 7 (Korn, Jagadish, Faloutsos 1997), DFT with components 0 to 7 (Agrawal, Faloutsos, Swami 1993), and DWT with basis functions Haar 0 to Haar 7 (Chan & Fu 1999).]
Piecewise Aggregate Approximation (PAA)
• Original time series (n-dimensional vector): S = {s1, s2, …, sn}
• n’-segment PAA representation (n’-d vector): S = {sv1, sv2, …, svn’}
• The PAA representation satisfies the lower bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos 2000)
[Figure: a time series and its 8-segment PAA approximation with segment averages sv1 to sv8; horizontal axis: time, vertical axis: value.]
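The PAA definition can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited papers: the names `paa` and `paa_lower_bound` are made up here, and the sketch assumes that n is divisible by n’. The lower bound used is sqrt(n/n’) times the Euclidean distance between the two PAA vectors.

```python
import math

def paa(series, n_segments):
    """Return the mean of each of n_segments equal-length parts of series."""
    n = len(series)
    if n % n_segments != 0:
        # Assumption of this sketch: equal parts only.
        raise ValueError("len(series) must be divisible by n_segments")
    w = n // n_segments
    return [sum(series[i:i + w]) / w for i in range(0, n, w)]

def euclidean(a, b):
    """Exact Euclidean distance D between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def paa_lower_bound(paa_a, paa_b, n):
    """sqrt(n / n') * Euclidean distance between two PAA vectors."""
    return math.sqrt(n / len(paa_a)) * euclidean(paa_a, paa_b)

s = [1.0, 3.0, 2.0, 2.0, 7.0, 9.0, 4.0, 6.0]
q = [2.0, 2.0, 0.0, 4.0, 6.0, 8.0, 5.0, 5.0]
s_paa = paa(s, 4)   # [2.0, 2.0, 8.0, 5.0]
q_paa = paa(q, 4)   # [2.0, 2.0, 7.0, 5.0]
lb = paa_lower_bound(q_paa, s_paa, len(s))
exact = euclidean(q, s)
print(lb <= exact)  # the bound never exceeds the exact distance
```

In this example the bound is sqrt(2) while the exact distance is sqrt(14), so a range query with a radius below sqrt(2) could discard S without reading the raw series.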
Adaptive Piecewise Constant Approximation (APCA)
• Can we improve upon PAA?
• n’-segment PAA representation (n’-d vector): S = {sv1, sv2, …, svn’}
• n’/2-segment APCA representation (n’-d vector): S = {sv1, sr1, sv2, sr2, …, svM, srM}, where M = n’/2 is the number of segments
[Figure: a time series approximated by PAA with 8 equal-length segments (sv1 to sv8) and by APCA with 4 variable-length segments, each stored as an average value (sv1 to sv4) and a segment end point (sr1 to sr4).]
APCA approximates the original signal better than PAA
• Improvement factor = (reconstruction error of PAA) / (reconstruction error of APCA)
[Figure: improvement factors for the six signals shown: 3.77, 1.69, 1.21, 1.03, 3.02, 1.75.]
APCA representation can be computed efficiently
• A near-optimal representation can be computed in O(n log n) time
• The optimal representation can be computed in O(n²M) time (Koudas et al.)
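Distance Measure
• Exact (Euclidean) distance D(Q,S)
• Lower bounding distance DLB(Q,S)
[Figure: a query Q and a time series S compared with the exact distance D(Q,S), and the transformed query Q’ compared with the APCA representation of S using DLB(Q’,S).]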
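One way to compute a lower bounding distance on APCA data is sketched below. It assumes, as in the published APCA work, that Q’ is obtained by averaging the raw query over the segments of S, and that each squared difference is weighted by the segment length. The function names are hypothetical, and the bound is only guaranteed when every stored sv is the true mean of its segment.

```python
import math

def project_query(q, right_ends):
    """Q': mean of q over each segment of S (1-based right end points)."""
    means, left = [], 0
    for right in right_ends:
        means.append(sum(q[left:right]) / (right - left))
        left = right
    return means

def d_lb(q, apca_s):
    """Lower bound on the Euclidean distance between raw q and the series behind apca_s."""
    q_proj = project_query(q, [r for _, r in apca_s])
    total, left = 0.0, 0
    for qv, (sv, right) in zip(q_proj, apca_s):
        total += (right - left) * (qv - sv) ** 2   # segment length * squared gap
        left = right
    return math.sqrt(total)

s = [1.0, 1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 2.0]
apca_s = [(1.0, 5), (9.0, 7), (2.0, 8)]      # (sv, sr) pairs for s
q = [2.0, 0.0, 3.0, 1.0, 4.0, 7.0, 9.0, 5.0]

exact = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)))
print(d_lb(q, apca_s), exact)  # 4.0 and about 5.29
```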
Index on 2M-dimensional APCA space
• Any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree)
[Figure: APCA points S1 to S9 in the 2M-dimensional APCA space, grouped into MBRs R1 to R4, next to the corresponding index tree.]
k-nearest neighbor Algorithm
• For any node U of the index structure with MBR R, MINDIST(Q,R) ≤ D(Q,S) for any data item S under U
[Figure: a query Q among data items S1 to S9 in MBRs R1 to R4, with MINDIST(Q,R2), MINDIST(Q,R3) and MINDIST(Q,R4) marked.]
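The MINDIST property is what makes best-first k-NN search correct: nodes are visited in order of MINDIST, so once a data item reaches the front of the queue nothing unseen can be closer. The sketch below shows this on a hand-built two-level tree in an ordinary Euclidean feature space. The tuple layout of nodes and items is hypothetical, and items hold raw points so that the exact distance can be used directly; in an APCA index the leaves would hold APCA points and candidates would be refined with the exact distance.

```python
import heapq
import math

def mindist(q, low, high):
    """Distance from point q to the box [low, high]; 0 if q is inside."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, low, high)))

def dist(q, p):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, p)))

def knn(q, root, k):
    """Best-first k-NN. Entries are ("node", low, high, children) or ("item", point)."""
    heap = [(0.0, 0, root)]          # (priority, tie-breaker, entry)
    counter, result = 1, []
    while heap and len(result) < k:
        d, _, entry = heapq.heappop(heap)
        if entry[0] == "item":
            result.append((d, entry[1]))   # nothing left in the heap is closer
            continue
        for child in entry[3]:
            if child[0] == "item":
                prio = dist(q, child[1])
            else:
                prio = mindist(q, child[1], child[2])
            heapq.heappush(heap, (prio, counter, child))
            counter += 1
    return result

leaf_a = ("node", (0.0, 0.0), (2.0, 2.0), [("item", (0.0, 0.0)), ("item", (2.0, 2.0))])
leaf_b = ("node", (5.0, 5.0), (9.0, 9.0), [("item", (5.0, 5.0)), ("item", (9.0, 9.0))])
root = ("node", (0.0, 0.0), (9.0, 9.0), [leaf_a, leaf_b])

neighbours = [p for _, p in knn((3.0, 3.0), root, 2)]
print(neighbours)  # [(2.0, 2.0), (5.0, 5.0)]
```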
Index Modification for MINDIST Computation
• APCA point: S = {sv1, sr1, sv2, sr2, …, svM, srM}
• APCA rectangle: S = (L,H), where L = {smin1, sr1, smin2, sr2, …, sminM, srM} and H = {smax1, sr1, smax2, sr2, …, smaxM, srM}
[Figure: a time series with segment values sv1 to sv4, end points sr1 to sr4, and per-segment bounds smin1 to smin4 and smax1 to smax4, next to index MBRs R1 to R4 containing S1 to S9.]
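A short Python sketch of the APCA rectangle follows. It assumes that smin_i and smax_i are the minimum and maximum raw values inside segment i, which is how the names read; the function name `apca_rectangle` is hypothetical.

```python
def apca_rectangle(series, right_ends):
    """Return (L, H) for one series, given the 1-based right end points sr_1..sr_M."""
    low, high, left = [], [], 0
    for right in right_ends:
        segment = series[left:right]
        low += [min(segment), right]    # smin_i, sr_i
        high += [max(segment), right]   # smax_i, sr_i
        left = right
    return low, high

s = [1.0, 2.0, 3.0, 8.0, 9.0, 4.0]
L, H = apca_rectangle(s, [3, 5, 6])
print(L)  # [1.0, 3, 8.0, 5, 4.0, 6]
print(H)  # [3.0, 3, 9.0, 5, 4.0, 6]
```

The end-point coordinates are identical in L and H, so the rectangle has extent only along the value dimensions.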
MBR Representation in time-value space
• We can view the MBR R = (L,H) of any node U as two APCA representations: L = {l1, l2, …, l(N-1), lN} and H = {h1, h2, …, h(N-1), hN}
[Figure: an MBR with L = {l1, l2, l3, l4, l5, l6} and H = {h1, h2, h3, h4, h5, h6} drawn as REGION 1, REGION 2 and REGION 3; horizontal axis: time, vertical axis: value.]
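Regions
• M regions are associated with each MBR; the boundaries of the ith region are:
  • value axis: from l(2i-1) to h(2i-1)
  • time axis: from l(2i-2)+1 to h2i
[Figure: REGION i with its four boundaries, and REGION 1, REGION 2 and REGION 3 of an MBR with bounds l1 to l6 and h1 to h6; horizontal axis: time, vertical axis: value.]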
Regions
• The ith region is active at time instant t if it spans across t
• The value st of any time series S under node U at time instant t must lie in one of the regions active at t (Lemma 2)
[Figure: REGION 1, REGION 2 and REGION 3 of an MBR with bounds l1 to l6 and h1 to h6, and two time instants t1 and t2 marked; horizontal axis: time, vertical axis: value.]
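MINDIST Computation
• MINDIST(Q,R) = sqrt(Σt MINDIST(Q,R,t)), summed over all time instants t
• For time instant t, MINDIST(Q,R,t) = min over the regions G active at t of MINDIST(Q,G,t)
• Example at time instant t1:
  MINDIST(Q,R,t1) = min(MINDIST(Q,Region1,t1), MINDIST(Q,Region2,t1))
                  = min((qt1 - h1)², (qt1 - h3)²)
                  = (qt1 - h1)²
• Lemma 3: MINDIST(Q,R) ≤ D(Q,C) for any time series C under node U
[Figure: REGION 1, REGION 2 and REGION 3 of an MBR with bounds l1 to l6 and h1 to h6, with the query value at time instant t1 lying above h1 and h3.]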
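The region-based computation can be sketched in Python as follows. The sketch makes several assumptions that should be checked against the original paper: odd positions of L and H hold values and even positions hold end points, l0 is taken as 0, time instants are 1-based, the per-instant term is the squared gap between qt and the value range of a region (0 if qt lies inside it), and the overall MINDIST is the square root of the sum over all time instants. The function names are hypothetical.

```python
import math

def regions(low, high):
    """Split MBR (L, H) into M regions: (value_lo, value_hi, t_start, t_end)."""
    out = []
    for i in range(0, len(low), 2):
        t_start = (low[i - 1] if i > 0 else 0) + 1   # l(2i-2) + 1, with l0 = 0
        out.append((low[i], high[i], t_start, high[i + 1]))
    return out

def mindist(q, low, high):
    """Lower bound on the distance from raw query q to any series under the MBR."""
    regs = regions(low, high)
    total = 0.0
    for t, qt in enumerate(q, start=1):
        best = None
        for v_lo, v_hi, t_start, t_end in regs:
            if t_start <= t <= t_end:                  # region is active at t
                gap = max(v_lo - qt, 0.0, qt - v_hi)   # 0 if qt is inside the range
                best = gap * gap if best is None else min(best, gap * gap)
        if best is not None:
            total += best
    return math.sqrt(total)

# MBR of two series of length 6 with M = 2 segments each (made-up numbers).
L = [1.0, 3, 6.0, 6]
H = [4.0, 4, 9.0, 6]
q = [0.0, 2.0, 5.0, 5.0, 10.0, 7.0]
print(regions(L, H))     # [(1.0, 4.0, 1, 4), (6.0, 9.0, 4, 6)]
print(mindist(q, L, H))  # 2.0
```

At t = 4 both regions are active, so the smaller of the two gaps is used, which mirrors the min over active regions in the example at t1.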
Approximate Search
• A simpler definition of the distance in the feature space is the following:
• But there is one problem… what?
[Figure: DLB(Q’,S).]
Multimedia dbs
• A multimedia database also stores images
• Again, similarity queries (content-based retrieval)
• Extract features, index in feature space, answer similarity queries using GEMINI
• Again, average values help!
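The extract-features, filter, then refine pipeline can be sketched as a range query in Python. This is only an illustration of the idea: the one-number feature sqrt(n) * mean is a stand-in chosen because it provably lower-bounds the Euclidean distance, a plain list scan stands in for the index, and the names `feature` and `range_query` are hypothetical.

```python
import math

def feature(x):
    """sqrt(n) * mean, so that |feature(a) - feature(b)| <= Euclidean distance(a, b)."""
    return math.sqrt(len(x)) * sum(x) / len(x)

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def range_query(query, database, eps):
    fq = feature(query)
    # Filter step: cheap test in feature space; may keep false alarms, never drops a hit.
    candidates = [x for x in database if abs(feature(x) - fq) <= eps]
    # Refine step: exact distance on the surviving candidates only.
    return [x for x in candidates if dist(x, query) <= eps]

db = [
    [1.0, 1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0, 5.0],
    [0.0, 2.0, 0.0, 2.0],
    [3.0, -1.0, 3.0, -1.0],
]
q = [1.0, 1.0, 1.0, 2.0]
hits = range_query(q, db, 1.5)
print(hits)  # [[1.0, 1.0, 1.0, 1.0]]
```

Here the filter discards the second object outright, and the refine step removes the two false alarms that share the query's average but not its shape.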
Images - color
• Q: what is an image?
• A: 2-d array
Images - color
• Color histograms, and distance function
Images - color
• Mathematically, the distance function is:
Images - color
• Problem: ‘cross-talk’: features are not orthogonal -> SAMs (spatial access methods) will not work properly
• Q: what to do?
• A: feature-extraction question
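Cross-talk can be illustrated with a quadratic-form distance between histograms, d(x,y) = sqrt((x-y)ᵀ A (x-y)), where A[i][j] expresses how similar color bins i and j are. Whether this is the exact distance function intended on these slides is an assumption; the three-bin setup and the matrix A below are made up for illustration.

```python
import math

def quadratic_form_distance(x, y, a):
    """sqrt((x - y)^T A (x - y)) for histograms x, y and bin-similarity matrix a."""
    diff = [u - v for u, v in zip(x, y)]
    n = len(diff)
    total = sum(diff[i] * a[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(max(total, 0.0))   # guard against tiny negative rounding error

# Hypothetical bins: red, orange, blue. Red and orange are similar (0.8).
A = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
red = [1.0, 0.0, 0.0]
orange = [0.0, 1.0, 0.0]
blue = [0.0, 0.0, 1.0]

d_red_orange = quadratic_form_distance(red, orange, A)
d_red_blue = quadratic_form_distance(red, blue, A)
print(d_red_orange < d_red_blue)  # True
```

A plain Euclidean distance on the bins would rate both pairs as equally far apart; the off-diagonal entries of A are the cross-talk that breaks that assumption.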
Images - color
• Possible answers: avg red, avg green, avg blue
• It turns out that this lower-bounds the histogram distance -> no cross-talk; SAMs are applicable
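Extracting the three average-color features is straightforward. The sketch below assumes an image is a 2-d array (list of rows) of (r, g, b) pixels; the names are hypothetical. The scaling needed to make the average-color distance an actual lower bound of the histogram distance is not reproduced here.

```python
import math

def avg_rgb(image):
    """Return (avg red, avg green, avg blue) over all pixels of a 2-d image."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def avg_rgb_distance(f1, f2):
    """Euclidean distance between two 3-d average-color feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

image = [
    [(255, 0, 0), (255, 0, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
print(avg_rgb(image))  # (191.25, 63.75, 127.5)
```

The three features are orthogonal axes, so any spatial access method can index them directly.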
Images - color
• Performance
[Figure: time plotted against selectivity, for sequential scan and for search with avg RGB.]
Images - shapes
• Distance function: Euclidean, on the area, perimeter, and 20 ‘moments’
• Q: how to normalize them?
• A: divide by standard deviation
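Dividing each feature by its standard deviation can be sketched as follows. The choice of the population standard deviation (`statistics.pstdev`) and the handling of constant features are assumptions of this sketch, and the two-feature shape data is made up.

```python
import statistics

def normalize(features):
    """Divide every feature (column) by its standard deviation across all objects."""
    columns = list(zip(*features))
    stds = [statistics.pstdev(col) for col in columns]
    # A constant feature (std = 0) is left unchanged; it cannot affect any distance.
    return [[v / s if s > 0 else v for v, s in zip(row, stds)] for row in features]

# Hypothetical shapes described by (area, perimeter).
shapes = [[100.0, 40.0], [400.0, 80.0], [100.0, 80.0], [400.0, 40.0]]
scaled = normalize(shapes)
print(scaled[0])
```

Without this step the area, with a spread of 150, would dominate the Euclidean distance over the perimeter, with a spread of 20.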
Images - shapes
• Distance function: Euclidean, on the area, perimeter, and 20 ‘moments’
• Q: other ‘features’ / distance functions?
• A1: turning angle
• A2: dilations/erosions
• A3: ...
Images - shapes
• Distance function: Euclidean, on the area, perimeter, and 20 ‘moments’
• Q: how to do dim. reduction?
• A: Karhunen-Loeve (= centered PCA/SVD)
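A minimal NumPy sketch of the Karhunen-Loeve transform as centered SVD is given below. The function name `karhunen_loeve` and the toy data are hypothetical. Because the retained directions are orthonormal, distances between projected points never exceed the original Euclidean distances, so the reduced features still lower-bound the full ones.

```python
import numpy as np

def karhunen_loeve(features, k):
    """Project feature vectors onto their k leading principal directions."""
    x = np.asarray(features, dtype=float)
    mean = x.mean(axis=0)
    centered = x - mean                      # centering is what makes this KL/PCA
    # Rows of vt are orthonormal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    return centered @ components.T, components, mean

# Toy data: four 3-d feature vectors that all lie on one line.
data = [[1.0, 2.0, 2.0], [2.0, 4.0, 4.0], [4.0, 8.0, 8.0], [5.0, 10.0, 10.0]]
reduced, components, mean = karhunen_loeve(data, 1)
print(reduced.shape)  # (4, 1)
```

Since the toy data is one-dimensional in disguise, a single component preserves all pairwise distances; on real shape features some distance is lost, which is the price of the smaller index.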
Images - shapes
• Performance: ~10x faster
[Figure: log(# of I/Os) plotted against the # of features kept, with the ‘all kept’ case marked.]