Multimedia DBs
Time Series Data A time series is a collection of observations made sequentially in time. [Figure: a sample series plotted with a time axis (0–500) and a value axis (23–29); the raw readings 25.1750, 25.1750, 25.2250, 25.2500, …, 24.7500 are listed alongside.]
PAA and APCA • Feature extraction for GEMINI: • Fourier • Wavelets • Another approach: segment the time series into equal parts and store the average value of each part (see the sketch below) • Use an index to store the averages and the segment end points
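For instance, here is a minimal sketch of Fourier-based feature extraction for GEMINI, keeping the first k DFT coefficients; the function name and scaling choices are mine, not from the lecture:

```python
import numpy as np

def dft_features(series, k=4):
    """Keep the first k DFT coefficients as a 2k-dimensional feature vector.

    With orthonormal scaling, Parseval's theorem makes the Euclidean
    distance between truncated coefficient vectors lower-bound the
    distance between the original series (the GEMINI requirement).
    """
    coeffs = np.fft.fft(series) / np.sqrt(len(series))  # orthonormal scaling
    kept = coeffs[:k]
    # Interleave real and imaginary parts into a real-valued feature vector.
    return np.column_stack([kept.real, kept.imag]).ravel()

q = dft_features(25 + np.sin(np.linspace(0, 10, 512)))  # 8-d feature vector
```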
Feature Spaces [Figure: a sample time series X (plotted over 0–140) and its reconstructions X′ in three feature spaces: SVD (eigenwaves 0–7), DFT, and DWT (Haar wavelets 0–7).] SVD: Korn, Jagadish & Faloutsos 1997; DFT: Agrawal, Faloutsos & Swami 1993; DWT: Chan & Fu 1999.
Piecewise Aggregate Approximation (PAA) Original time series (n-dimensional vector): S = {s1, s2, …, sn}. n′-segment PAA representation (n′-d vector): S = {sv1, sv2, …, svn′}, where svi is the mean value of the ith segment. [Figure: a series approximated by eight segment means sv1–sv8 over the time axis.] The PAA representation satisfies the lower-bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos, 2000). A short sketch follows.
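A minimal PAA sketch, assuming equal-length segments (np.array_split handles lengths that are not an exact multiple):

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: split the series into n_segments
    equal-length pieces and represent each piece by its mean value."""
    series = np.asarray(series, dtype=float)
    return np.array([chunk.mean() for chunk in np.array_split(series, n_segments)])

s = 25 + 0.5 * np.sin(np.linspace(0, 6, 512))
print(paa(s, 8))  # 8 segment means approximating the series
```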
Adaptive Piecewise Constant Approximation (APCA) Can we improve upon PAA? The n′-segment PAA representation (n′-d vector) S = {sv1, sv2, …, svn′} uses equal-length segments. The M-segment APCA representation (also an n′-d vector, with M = n′/2 segments of varying length) stores a value and a right endpoint per segment: S = {sv1, sr1, sv2, sr2, …, svM, srM}. [Figure: the same series approximated by four variable-length segments with values sv1–sv4 and right endpoints sr1–sr4.]
APCA approximates the original signal better than PAA. [Figure: reconstruction error of PAA vs. APCA on sample signals; measured improvement factors range from 1.03 to 3.77.]
APCA Representation can be computed efficiently • A near-optimal representation can be computed in O(n log n) time • The optimal representation can be computed in O(n²M) time (Koudas et al.). A greedy sketch follows.
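For illustration, a greedy bottom-up merging sketch; this is my own simple approximation of the idea, not the Haar-wavelet-based O(n log n) construction from the APCA paper, and it is left unoptimized (a heap over merge costs would speed it up considerably):

```python
import numpy as np

def apca_greedy(series, M):
    """Bottom-up APCA sketch: start with every point as its own segment and
    repeatedly merge the adjacent pair whose merge increases the squared
    reconstruction error the least, until M segments remain."""
    x = np.asarray(series, dtype=float)
    segs = [(i, i + 1) for i in range(len(x))]  # (start, end_exclusive)

    def sse(a, b):
        # Squared reconstruction error of approximating x[a:b] by its mean.
        chunk = x[a:b]
        return ((chunk - chunk.mean()) ** 2).sum()

    while len(segs) > M:
        costs = [sse(segs[i][0], segs[i + 1][1]) - sse(*segs[i]) - sse(*segs[i + 1])
                 for i in range(len(segs) - 1)]
        i = int(np.argmin(costs))
        segs[i:i + 2] = [(segs[i][0], segs[i + 1][1])]  # merge the cheapest pair

    # APCA representation: (mean value svi, right endpoint sri) per segment.
    return [(x[a:b].mean(), b - 1) for a, b in segs]
```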
Distance Measures [Figure: query Q and series S compared point by point.] The exact (Euclidean) distance D(Q,S) between query Q and series S, and a lower-bounding distance DLB(Q′,S) computed from the query and the APCA representation of S, with DLB(Q′,S) ≤ D(Q,S). A sketch follows.
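A sketch of one standard lower-bounding distance for APCA, in which the query is projected onto S's segments, as in Keogh et al.'s APCA work; the endpoint conventions here are my assumptions:

```python
import numpy as np

def d_lb(query, apca):
    """Lower-bounding distance DLB(Q', S) between a raw query and an APCA
    representation S = [(sv1, sr1), ..., (svM, srM)], sri = right endpoint.

    Each segment contributes its length times the squared gap between the
    query's mean over that segment and the segment's stored value.
    """
    q = np.asarray(query, dtype=float)
    total, prev_end = 0.0, -1
    for sv, sr in apca:
        qv = q[prev_end + 1: sr + 1].mean()        # query mean over the segment
        total += (sr - prev_end) * (qv - sv) ** 2  # segment length * gap^2
        prev_end = sr
    return np.sqrt(total)
```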
Index on 2M-dimensional APCA space Each time series maps to a point in the 2M-dimensional APCA space. [Figure: APCA points S1–S9 grouped into MBRs R1–R4 of the index.] Any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree).
k-nearest-neighbor Algorithm [Figure: MINDIST from query Q to the MBRs R2, R3, R4.] • For any node U of the index structure with MBR R, MINDIST(Q,R) ≤ D(Q,S) for any data item S under U. This property makes the standard best-first traversal exact; see the sketch below.
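A minimal best-first k-NN sketch over such an index; the node interface (.is_leaf, .children, .items) and the two distance callbacks are illustrative assumptions, not a specific library's API:

```python
import heapq

def knn_search(root, query, k, mindist, dist):
    """Best-first k-NN over a hierarchical index (e.g., an R-tree).

    `mindist(query, node)` must lower-bound `dist(query, item)` for every
    item under the node -- exactly the MINDIST property on the slide.
    """
    heap = [(0.0, 0, root)]        # frontier: (lower bound, tiebreak, node)
    results, counter = [], 1       # results: max-heap of (-distance, _, item)
    while heap:
        d, _, node = heapq.heappop(heap)
        if len(results) == k and d >= -results[0][0]:
            break                  # no closer item can exist below this node
        if node.is_leaf:
            for item in node.items:
                heapq.heappush(results, (-dist(query, item), counter, item))
                counter += 1
                if len(results) > k:
                    heapq.heappop(results)  # drop the current farthest
        else:
            for child in node.children:
                heapq.heappush(heap, (mindist(query, child), counter, child))
                counter += 1
    return sorted(((-d, item) for d, _, item in results), key=lambda t: t[0])
```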
Index Modification for MINDIST Computation An APCA point S = {sv1, sr1, sv2, sr2, …, svM, srM} is replaced by an APCA rectangle S = (L, H), where L = {smin1, sr1, smin2, sr2, …, sminM, srM} and H = {smax1, sr1, smax2, sr2, …, smaxM, srM}; smini and smaxi are the minimum and maximum values of the original series within the ith segment. [Figure: segment values sv1–sv4 with per-segment bounds smin1–smin4 and smax1–smax4.]
MBR Representation in time-value space We can view the MBR R = (L, H) of any node U as two sequences L = {l1, l2, …, l(N−1), lN} and H = {h1, h2, …, h(N−1), hN}, with N = 2M. [Figure: the MBR drawn in the time-value plane, e.g., L = {l1, …, l6} and H = {h1, …, h6}, defining regions 1–3.]
Regions M regions are associated with each MBR; the boundaries of the ith region are value range [l(2i−1), h(2i−1)] and time range [l(2i−2)+1, h(2i)] (taking l0 = 0). [Figure: the three regions of an MBR in the time-value plane.]
Regions (contd.) • The ith region is active at time instant t if it spans across t • The value st of any time series S under node U at time instant t must lie in one of the regions active at t (Lemma 2). [Figure: the regions active at two time instants t1 and t2.]
MINDIST Computation For time instant t, MINDIST(Q, R, t) = min over regions G active at t of MINDIST(Q, G, t). Example: MINDIST(Q, R, t1) = min(MINDIST(Q, Region 1, t1), MINDIST(Q, Region 2, t1)) = min((qt1 − h1)², (qt1 − h3)²) = (qt1 − h1)². Summing over all time instants, MINDIST(Q, R) = sqrt(Σt MINDIST(Q, R, t)). Lemma 3: MINDIST(Q, R) ≤ D(Q, C) for any time series C under node U. A sketch follows.
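A sketch of this computation; the 0-based index arithmetic for the region boundaries is my reading of the slides, so treat it as an assumption:

```python
import numpy as np

def mindist(query, L, H):
    """MINDIST(Q, R) for an APCA-space MBR R = (L, H): region i has value
    bounds [l(2i-1), h(2i-1)] and time bounds [l(2i-2)+1, h(2i)] in the
    slides' 1-based notation, translated to 0-based arrays below."""
    q = np.asarray(query, dtype=float)
    M = len(L) // 2
    regions = []
    for i in range(M):
        lo, hi = L[2 * i], H[2 * i]                   # value bounds
        t_start = (L[2 * i - 1] + 1) if i > 0 else 0  # l(2i-2)+1, l0 = 0
        t_end = H[2 * i + 1]                          # h(2i)
        regions.append((lo, hi, t_start, t_end))

    total = 0.0
    for t, qt in enumerate(q):
        best = np.inf  # regions cover every t by construction (Lemma 2)
        for lo, hi, t_start, t_end in regions:
            if t_start <= t <= t_end:                 # region active at t
                if qt < lo:
                    best = min(best, (lo - qt) ** 2)
                elif qt > hi:
                    best = min(best, (qt - hi) ** 2)
                else:
                    best = 0.0                        # qt inside the region
        total += best
    return np.sqrt(total)
```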
Approximate Search • A simpler alternative is to treat the APCA representations themselves as points in the feature space and measure the distance DLB(Q′, S) between them directly • But there is one problem… what? Such a distance no longer lower-bounds the true distance D(Q,S), so false dismissals become possible and the search is only approximate.
Multimedia DBs • A multimedia database also stores images • Again, similarity queries (content-based retrieval) • Extract features, index in feature space, answer similarity queries using GEMINI • Again, average values help!
Images - color Q: what is an image? A: a 2-d array of pixel values.
Images - color Represent each image by a color histogram, and define a distance function between histograms.
Images - color Mathematically, the distance function is the quadratic form d²(x, y) = (x − y)ᵀA(x − y), where x and y are the color histograms and the entry aij of A measures the similarity between colors i and j. A sketch follows.
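A small sketch of this quadratic-form distance; the similarity matrix aij = 1 − dij/dmax built from RGB bin distances is one common choice and may differ from QBIC's exact matrix:

```python
import numpy as np

def quadratic_histogram_distance(x, y, A):
    """Quadratic-form histogram distance d^2 = (x - y)^T A (x - y).

    A[i, j] encodes the perceptual similarity of color bins i and j,
    i.e., the cross-talk between bins."""
    z = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(z @ A @ z)

# Toy 3-bin example: bins at the red, green, blue corners of RGB space.
bins = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]])
d = np.linalg.norm(bins[:, None, :] - bins[None, :, :], axis=-1)
A = 1.0 - d / d.max()            # a_ij = 1 - d_ij / d_max
x = np.array([0.7, 0.2, 0.1])    # histograms sum to 1
y = np.array([0.1, 0.2, 0.7])
print(quadratic_histogram_distance(x, y, A))
```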
Images - color Problem: 'cross-talk': the features (color bins) are not orthogonal -> SAMs (spatial access methods) will not work properly. Q: what to do? A: it is a feature-extraction question.
Images - color Possible answer: use the average red, average green, and average blue values as a 3-d feature vector. It turns out that this (suitably scaled, per the QBIC filtering slides below) lower-bounds the histogram distance -> no cross-talk, and SAMs are applicable. A filter-and-refine sketch follows.
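A filter-and-refine sketch in the GEMINI style; the scale parameter standing in for sqrt(λ1) and the callback names are my assumptions:

```python
import numpy as np

def avg_rgb(image):
    """3-d feature: mean red, green, blue over an H x W x 3 image array."""
    return np.asarray(image, dtype=float).reshape(-1, 3).mean(axis=0)

def filter_and_refine(query_img, database, full_distance, threshold, scale=1.0):
    """Range query: cheap avg-RGB filter first, exact refine second.

    With the correct scale factor (sqrt(lambda_1) from the QBIC bound,
    derived on the filtering slides below), the filter never causes
    false dismissals, only false alarms that the refine step removes."""
    q = avg_rgb(query_img)
    candidates = [img for img in database
                  if scale * np.linalg.norm(avg_rgb(img) - q) <= threshold]
    return [img for img in candidates
            if full_distance(query_img, img) <= threshold]
```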
Images - color [Figure: time performance of the avg-RGB filter vs. sequential scan, plotted against query selectivity.]
Images - shapes Distance function: Euclidean, on the area, perimeter, and 20 'moments'. (Q: how to normalize them? A: divide each feature by its standard deviation.)
Images - shapes Distance function: Euclidean, on the area, perimeter, and 20 'moments'. (Q: other 'features' / distance functions? A1: turning angle. A2: dilations/erosions. A3: …)
Images - shapes Distance function: Euclidean, on the area, perimeter, and 20 'moments'. Q: how to do dimensionality reduction? A: Karhunen-Loève (= centered PCA/SVD). A sketch follows.
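A minimal Karhunen-Loève sketch via SVD; the 22-d feature layout comes from the slide, while the function name and toy data are illustrative:

```python
import numpy as np

def kl_transform(features, k):
    """Karhunen-Loeve / centered PCA via SVD: project 22-d shape feature
    vectors (area, perimeter, 20 moments, each already divided by its
    standard deviation) onto the top-k principal directions.

    Returns the reduced vectors, the mean, and the projection matrix,
    so new queries can be mapped into the same k-d space."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                   # centering: KL = centered PCA
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                    # top-k principal directions
    return Xc @ W, mean, W

X = np.random.default_rng(0).normal(size=(100, 22))  # toy feature matrix
Z, mean, W = kl_transform(X, k=4)
query_reduced = (X[0] - mean) @ W   # map a query the same way
```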
Images - shapes [Figure: performance with dimensionality reduction: ~10x faster; log(# of I/Os) vs. # of features kept, compared with keeping all features.]
Is d(u,v) = sqrt((u−v)ᵀA(u−v)) a metric? • Write x in the eigenbasis of the symmetric matrix A: xᵀAx = Σij xixjAij = Σi λixi², where λi is the ith eigenvalue and xi is the projection of x along the ith eigenvector • Then d(u,v) = sqrt((u−v)ᵀA(u−v)) = sqrt(Σi λi(ui−vi)²) • For λi ≥ 0, non-negativity and symmetry are immediate: d(u,v) ≥ 0, d(u,u) = 0, d(u,v) = d(v,u) • The triangle inequality d(u,w) ≤ d(u,v) + d(v,w) holds provided sqrt(Σ λi(ui−wi)²) ≤ sqrt(Σ λi(ui−vi)²) + sqrt(Σ λi(vi−wi)²), i.e., sqrt(Σ(√λi ui − √λi wi)²) ≤ sqrt(Σ(√λi ui − √λi vi)²) + sqrt(Σ(√λi vi − √λi wi)²) • This is exactly the triangle inequality (Minkowski inequality) for the L2 norm applied to the rescaled vectors (√λi ui), so d is a metric whenever A is positive semi-definite.
Filtering in QBIC • Histogram column vectors x, y of length n, with Σ xi = 1 and Σ yi = 1 • Difference z = x − y, so Σ zi = 0 • Contribution of each color bin to a smaller set of colors: Vᵀ = (c1, c2, …, cn), where each ci is a column vector of length 3 • xavg = Vᵀx and yavg = Vᵀy are column vectors of length 3 (the average colors)
Filtering in QBIC • Distances: davg² = (xavg − yavg)ᵀ(xavg − yavg) = (Vᵀz)ᵀ(Vᵀz) = zᵀVVᵀz = zᵀWz, where W = VVᵀ • dhist² = zᵀAz • Claim: dhist² ≥ λ1 davg², where λ1 is the smallest (finite) eigenvalue of the generalized eigenproblem A′z′ = λW′z′ defined on the next slide. A numerical sanity check follows.
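Before the proof, a numerical sanity check of the bound, assuming SciPy is available; the matrices A and V here are random stand-ins, not QBIC's actual perceptual color data:

```python
import numpy as np
from scipy.linalg import eigvals

rng = np.random.default_rng(0)
n = 8
B = rng.normal(size=(n, n))
A = B @ B.T                       # stand-in positive-definite similarity matrix
V = rng.normal(size=(n, 3))       # stand-in 3-d color coordinates of the bins
W = V @ V.T                       # so that davg^2 = z^T W z

# Fold in the constraint sum(z) = 0: write z = P z' with z' of length n-1.
P = np.vstack([np.eye(n - 1), -np.ones((1, n - 1))])
Ap, Wp = P.T @ A @ P, P.T @ W @ P

# Smallest finite generalized eigenvalue of A' z' = lambda W' z'.
# W' has rank <= 3, so most eigenvalues are infinite and get filtered out.
vals = eigvals(Ap, Wp)
vals = vals[np.isfinite(vals)]
lam1 = vals.real[(np.abs(vals.imag) < 1e-9) & (vals.real > 1e-9)].min()

# Verify dhist^2 >= lambda_1 * davg^2 on random histogram pairs.
for _ in range(1000):
    x, y = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))  # sum to 1
    z = x - y
    assert z @ A @ z >= lam1 * (z @ W @ z) - 1e-9
print("bound holds; lambda_1 =", lam1)
```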
Filtering in QBIC • Rewrite z to remove the extra condition Σ zi = 0: since zn = −(z1 + … + z(n−1)), z is determined by an (n−1)-dimensional column vector z′ • Then zᵀAz = z′ᵀA′z′ and zᵀWz = z′ᵀW′z′, where A′ and W′ are the corresponding (n−1)×(n−1) matrices • It remains to show that z′ᵀA′z′ ≥ λ1 z′ᵀW′z′
Proof of z′ᵀA′z′ ≥ λ1 z′ᵀW′z′ • Minimize z′ᵀA′z′ over z′, subject to the constraint z′ᵀW′z′ = C • By Lagrange multipliers, this is the same as minimizing z′ᵀA′z′ − λ(z′ᵀW′z′ − C) over z′ • Differentiate with respect to z′ and set to 0: A′z′ = λW′z′ • So λ and z′ must be an eigenvalue and eigenvector, respectively, of the generalized eigenproblem A′z′ = λW′z′
Proof of z′ᵀA′z′ ≥ λ1 z′ᵀW′z′ (contd.) • At such a point, z′ᵀA′z′ = λz′ᵀW′z′ = λC • To minimize z′ᵀA′z′ we must choose the smallest eigenvalue λ1, so the minimum of z′ᵀA′z′ subject to z′ᵀW′z′ = C equals λ1C • If z′ᵀW′z′ = C > 0, then z′ᵀA′z′ ≥ λ1C • If z′ᵀW′z′ = 0, then z′ᵀA′z′ ≥ 0 because A′ is positive semi-definite • Either way dhist² ≥ λ1 davg², so filtering on average color causes no false dismissals