Special Topics in Data Engineering Panagiotis Karras CS6234 Lecture, March 4th, 2009
Outline • Summarizing Data Streams • Efficient Array Partitioning: 1D Case, 2D Case • Hierarchical Synopses with Optimal Error Guarantees
Summarizing Data Streams • Approximate a sequence [d1, d2, …, dn] with B buckets si = [bi, ei, vi], so that an error metric is minimized. • Data arrive as a stream: seen only once, cannot be stored. • Objective functions, where $\hat{d}_i$ is the value of the bucket containing position i: maximum absolute error $\max_i |d_i - \hat{d}_i|$; Euclidean error $\sqrt{\sum_i (d_i - \hat{d}_i)^2}$.
Histograms [KSM 2007] • Solve the error-bounded problem. Maximum absolute error bound ε = 2: 2 4 5 6 2 15 17 3 6 9 12 … → [ 4 ] [ 16 ] [ 4.5 ] [… • Generalizes to any weighted maximum-error metric. • Each value di defines a tolerance interval [di − ε, di + ε]; a bucket is closed when the running intersection of these intervals becomes empty. • Complexity: O(n), in a single pass. (A code sketch follows.)
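A minimal sketch of the tolerance-interval greedy described above; function and variable names are illustrative, not from the paper. The bucket representative is taken as the midpoint of the final intersection (the mid-range of the bucket), which reproduces the slide's example:

```python
def error_bounded_histogram(data, eps):
    """Greedy one-pass bucketing under a max-absolute-error bound eps.

    Each value d defines a tolerance interval [d - eps, d + eps]; the
    current bucket stays open while the running intersection of these
    intervals is non-empty, and its representative value is the midpoint
    of the final intersection."""
    buckets = []          # list of (start, end, value) triples
    lo, hi = None, None   # running intersection of tolerance intervals
    start = 0
    for i, d in enumerate(data):
        if lo is None:                              # open a fresh bucket
            lo, hi, start = d - eps, d + eps, i
        elif max(lo, d - eps) <= min(hi, d + eps):  # still intersecting
            lo, hi = max(lo, d - eps), min(hi, d + eps)
        else:                                       # intersection empty: close
            buckets.append((start, i - 1, (lo + hi) / 2))
            lo, hi, start = d - eps, d + eps, i
    if lo is not None:
        buckets.append((start, len(data) - 1, (lo + hi) / 2))
    return buckets

print(error_bounded_histogram([2, 4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 2))
# [(0, 4, 4.0), (5, 6, 16.0), (7, 8, 4.5), (9, 10, 10.5)]
```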
Histograms • Apply to the space-bounded problem: perform binary search in the domain of the error bound ε. • For an error value requiring space B′ ≤ B with actual error ε′, run an optimality test: the error-bounded algorithm under the constraint "error below ε′" instead of ε; if it requires more than B buckets, the optimal solution has been reached. • Complexity: independent of the number of buckets B. (A sketch of the binary search follows.) • What about the streaming case?
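A sketch of the duality idea under simplifying assumptions: for the max-absolute-error metric the optimal error is the mid-range half-width of some interval, so one can binary-search that finite candidate set. For clarity the set is materialized here in O(n³) time; the paper's search avoids this and runs independently of B. All names are illustrative.

```python
def bucket_count(data, eps):
    """Buckets used by the greedy error-bounded algorithm for bound eps."""
    count, lo, hi = 0, None, None
    for d in data:
        if lo is None or max(lo, d - eps) > min(hi, d + eps):
            count += 1                     # close (if any) and open a bucket
            lo, hi = d - eps, d + eps
        else:
            lo, hi = max(lo, d - eps), min(hi, d + eps)
    return count

def space_bounded_histogram(data, B):
    """Dual (space-bounded) problem: least max-error achievable with at
    most B buckets, by binary search on the error bound. Candidates have
    the form (max - min) / 2 over some interval of the data."""
    candidates = sorted({(max(data[i:j + 1]) - min(data[i:j + 1])) / 2
                         for i in range(len(data))
                         for j in range(i, len(data))})
    lo, hi = 0, len(candidates) - 1
    while lo < hi:                 # least candidate eps needing <= B buckets
        mid = (lo + hi) // 2
        if bucket_count(data, candidates[mid]) <= B:
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]

print(space_bounded_histogram([2, 4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 4))  # -> 2.0
```

The binary search is valid because the greedy bucket count is monotonically non-increasing in ε, and the optimal error is always among the candidates.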
Streamstrapping [Guha 2009] • The error metric satisfies a relaxed triangle inequality: summarizing a summary adds at most the errors of the two summarization steps. • Run multiple algorithm copies: 1. Read the first B items; keep reading until the first nonzero error ε1 (> 1/M). 2. Start versions for the guesses ε1·(1+δ), ε1·(1+δ)², … 3. When the version for some guess ε fails: (a) terminate all versions for guesses up to ε; (b) start new versions for larger guesses, using the summary of the failed version as their first input. 4. Repeat until the end of the input. (A single-copy sketch follows.)
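A single-copy sketch of the raise-and-restart mechanics only; the actual algorithm keeps several guesses alive concurrently. When the current guess would need more than B buckets, the guess is raised by a (1 + delta) factor and the existing summary is replayed as the new run's prefix. The tiny initial guess stands in for the slide's "first error > 1/M"; all names are illustrative.

```python
def streamstrap(stream, B, delta=0.5):
    """Sketch of StreamStrap for max-error bucket summaries. `closed`
    holds representatives of closed buckets; (lo, hi) is the running
    tolerance-interval intersection of the open bucket."""
    closed, lo, hi, eps = [], None, None, None

    def push(d, e):
        nonlocal closed, lo, hi
        if lo is None or max(lo, d - e) > min(hi, d + e):
            if lo is not None:              # intersection empty: close bucket
                closed.append((lo + hi) / 2)
            lo, hi = d - e, d + e
        else:
            lo, hi = max(lo, d - e), min(hi, d + e)

    for d in stream:
        if eps is None:
            eps = 1e-9          # stand-in for the first nonzero error > 1/M
        push(d, eps)
        while len(closed) + 1 > B:          # current guess failed
            eps *= 1 + delta                # raise the error estimate
            reps = closed + [(lo + hi) / 2]
            closed, lo, hi = [], None, None
            for v in reps:                  # replay the old summary as
                push(v, eps)                # the new run's prefix
    if lo is not None:
        closed.append((lo + hi) / 2)
    return eps, closed

print(streamstrap([2, 4, 5, 6, 2, 15, 17, 3, 6, 9, 12], B=3))
```

Replaying the summary is what the relaxed triangle inequality licenses: the error of the new run adds to the error already embedded in the representatives, and the geometric growth of the guess makes those additions telescope.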
Streamstrapping [Guha 2009] • Theorem: For any δ > 0, the StreamStrap algorithm achieves an approximation factor depending only on δ, running a bounded number of concurrent copies, with a number of initializations logarithmic in the ratio between the final and the initial error estimate. • Proof: Consider the lowest value of ε for which an algorithm still runs, and suppose the error estimate was raised j times before reaching it. Xi: prefix of the input just before the error estimate was raised for the i-th time. Yi: suffix between the (i−1)-th and i-th raising of the error estimate. Hi: summary built for Xi. Then Hi is built by summarizing Hi−1 followed by Yi, so err(Hi) ≤ εi + err(Hi−1): the target error of phase i plus the error already added, a recursion. Furthermore, the error estimate is raised by a (1+δ) factor every time: εi = (1+δ)·εi−1.
Streamstrapping [Guha 2009] • Proof (cont'd): Putting it all together and telescoping the recursion, the total error is at most Σi≤j εi = εj · Σk≥0 (1+δ)^(−k) < εj · (1+δ)/δ. Moreover, the guess εj/(1+δ) required more space than allowed (the algorithm failed for it), so the optimal error under the given space exceeds εj/(1+δ). Thus εj < (1+δ) times the optimal error, and the total error (added error plus optimal error) is within a factor (1+δ)²/δ of optimal. The bound on the number of initializations follows, since the estimate grows geometrically from its initial value above 1/M.
Streamstrapping [Guha 2009] • Theorem: The algorithm runs in space proportional to the number of live copies and in near-constant amortized time per item. • Proof: The space bound follows from the number of copies. Batch the input values in groups of t. Define a binary tree over the t values and compute min & max at every tree node; using the tree, the max & min of any interval are computed in O(log t). Every copy has to check violation of its bucket bound over the t items: non-violation is decided in O(1) at the root, and a violation is located in O(log t). Summing over all buckets, and then over all algorithm copies, yields the total running time. (A sketch of the min/max tree follows.)
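A sketch of the batching device, assuming the summaries are max-error buckets: a min/max tree over a batch of t values lets each copy decide non-violation of its open bucket's interval in O(1) at the root, and locate the first violating item in O(log t). Names are illustrative.

```python
class MinMaxTree:
    """Binary tree over a batch of t values storing (min, max) per node."""

    def __init__(self, values):
        self.n = len(values)
        size = 1
        while size < self.n:
            size *= 2
        self.size = size
        INF = float("inf")
        self.mn = [INF] * (2 * size)    # padding leaves are (+inf, -inf),
        self.mx = [-INF] * (2 * size)   # so they never register a violation
        for i, v in enumerate(values):
            self.mn[size + i] = self.mx[size + i] = v
        for i in range(size - 1, 0, -1):
            self.mn[i] = min(self.mn[2 * i], self.mn[2 * i + 1])
            self.mx[i] = max(self.mx[2 * i], self.mx[2 * i + 1])

    def violates(self, lo, hi):
        """O(1): does some value in the batch fall outside [lo, hi]?"""
        return self.mn[1] < lo or self.mx[1] > hi

    def first_violation(self, lo, hi):
        """O(log t): index of the first value outside [lo, hi], or None."""
        if not self.violates(lo, hi):
            return None
        i = 1
        while i < self.size:            # descend toward the leftmost violation
            left = 2 * i
            i = left if (self.mn[left] < lo or self.mx[left] > hi) else left + 1
        return i - self.size

t = MinMaxTree([2, 4, 5, 6, 2])
print(t.violates(2, 6), t.first_violation(3, 6))   # False 0
```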
1D Array Partitioning [KMS 1997] • Problem: Partition an array of n items into p intervals so that the maximum weight of the intervals is minimized. Arises in load balancing in pipelined, parallel environments.
1D Array Partitioning [KMS 1997] • Idea: Perform binary search over the O(n²) possible interval weights, one of which is responsible for the maximum-weight result (the bottleneck). • Obstacle: an approximate median of these candidates has to be calculated in O(n) time, without materializing all O(n²) of them.
1D Array Partitioning [KMS 1997] • Solution: Exploit the internal structure of the O(n²) interval weights: n columns, with column c consisting of the weights of the intervals ending at position c, which are monotonically non-increasing as the start index grows.
1D Array Partitioning [KMS 1997] • Calls to F(...) need O(1). (why? F is a prefix-sum oracle: after O(n) preprocessing, any interval weight is a difference of two prefix sums.) • The median of any subcolumn is determined with one call to the F oracle. (how? a subcolumn is sorted, so its median is its middle element; sketched below.) Splitter-finding Algorithm: • Find the median weight in each active subcolumn. • Find the median of medians m in O(n) (standard selection). • Cl (Cr): set of columns with median < (>) m.
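A sketch of the prefix-sum oracle and the implicit column structure, assuming nonnegative weights; names are illustrative.

```python
from itertools import accumulate

def make_prefix_oracle(weights):
    """O(n) preprocessing; afterwards any interval weight is one O(1) call."""
    prefix = [0] + list(accumulate(weights))
    def F(i, j):                      # weight of interval [i, j], 0-indexed
        return prefix[j + 1] - prefix[i]
    return F

weights = [3, 1, 4, 1, 5, 9, 2, 6]
F = make_prefix_oracle(weights)
print(F(2, 4))                        # 4 + 1 + 5 = 10

# Column c implicitly holds F(0, c) >= F(1, c) >= ... >= F(c, c): the
# intervals ending at c, non-increasing as the start index grows (weights
# are nonnegative). The median of a contiguous subcolumn is therefore its
# middle element -- one oracle call, never materializing O(n^2) weights.
def subcolumn_median(c, top, bottom):
    return F((top + bottom) // 2, c)
```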
1D Array Partitioning [KMS 1997] • The median of medians m is not always a splitter.
1D Array Partitioning [KMS 1997] • If the median of medians m is not a splitter, recurse on the set of active subcolumns (Cl or Cr) containing more elements (the ignored elements are still counted in future set-size calculations). • Otherwise, return m as a good splitter (approximate median). End of splitter-finding algorithm.
1D Array Partitioning [KMS 1997] Overall Algorithm: • Arrange the interval weights in subcolumns. • Find a splitter weight m of the active subcolumns. • Check whether the array is partitionable into p intervals of maximum weight m (how? a greedy left-to-right scan; sketched below). • If so, m is an upper bound on the optimal maximum weight: eliminate half of the elements of each subcolumn in Cl; otherwise, eliminate in Cr. • Recur until convergence to the optimal m. Complexity: O(n log n)
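A simplified sketch of the overall scheme: the greedy feasibility test of the third step, plus a binary search over the materialized O(n²) candidate weights. This costs O(n² log n) just to build and sort the candidates; the splitter machinery above is what brings the whole search down to O(n log n). Nonnegative weights assumed; names are illustrative.

```python
from itertools import accumulate

def partitionable(weights, p, m):
    """Can the array be cut into at most p contiguous intervals, each of
    weight at most m? Greedy: extend the current interval while it fits."""
    used, running = 1, 0
    for w in weights:
        if w > m:                        # a single item already exceeds m
            return False
        if running + w > m:              # close the interval, open a new one
            used, running = used + 1, w
        else:
            running += w
    return used <= p

def min_bottleneck(weights, p):
    """Binary-search the least feasible m over all interval weights."""
    prefix = [0] + list(accumulate(weights))
    candidates = sorted({prefix[j] - prefix[i]
                         for i in range(len(weights))
                         for j in range(i + 1, len(weights) + 1)})
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if partitionable(weights, p, candidates[mid]):
            hi = mid                     # feasible: try a smaller bottleneck
        else:
            lo = mid + 1
    return candidates[lo]

print(min_bottleneck([3, 1, 4, 1, 5, 9, 2, 6], 3))   # -> 14
```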
2D Array Partitioning [KMS 1997] • Problem: Partition a 2D array of n × n items into a p × p partition (inducing p² blocks) so that the maximum weight of the blocks is minimized. Arises in particle-in-cell computations, sparse matrix computations, etc. • NP-hard [GM 1996] • APX-hard [CCM 1996]
2D Array Partitioning [KMS 1997] • Definition: Two axis-parallel rectangles are independent if their projections are disjoint along both the x-axis and the y-axis. • Observation 1: If an array has a p × p partition of maximum block weight W, then it may contain at most 2(p − 1) independent rectangles of weight strictly greater than W. (why? see below)
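A sketch of the independence test, with the counting argument behind Observation 1 spelled out as a comment; coordinates and names are illustrative.

```python
def independent(r1, r2):
    """Rectangles as (x1, x2, y1, y2) with inclusive bounds: independent
    iff their x-projections AND y-projections are both disjoint."""
    x_disjoint = r1[1] < r2[0] or r2[1] < r1[0]
    y_disjoint = r1[3] < r2[2] or r2[3] < r1[2]
    return x_disjoint and y_disjoint

print(independent((0, 1, 0, 1), (2, 3, 2, 3)))   # True

# Counting argument: a rectangle of weight > W cannot lie inside a block
# of weight <= W, so some partition line must stab it. A horizontal line
# stabs only rectangles whose y-projection contains it, and an independent
# family has pairwise disjoint y-projections, so each line stabs at most
# one member (vertical lines likewise). A p x p partition has 2(p - 1)
# lines, hence the bound of Observation 1.
```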
2D Array Partitioning [KMS 1997] • At least one partition line is needed to stab each of the independent rectangles, and each line can stab at most one of them, since their projections are disjoint. • Best case: 2(p − 1) independent rectangles, one per partition line.
2D Array Partitioning [KMS 1997] The Algorithm: Assume we know the optimal W. Step 1: (define P) Given W, obtain a partition P such that each row/column within any block has weight at most 2W. (how? independent horizontal/vertical scans, keeping track of the running sum of the weights of each row/column in the current block; sketched below.) (why does P exist? a single cell weighs at most W, since it lies within some block of the optimal partition, so a fresh one-slice block never violates the 2W bound and the scans always make progress.)
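One reading of Step 1's scans, as a sketch: cut just before the slice that would push some row (respectively column) in the current block past 2W. Assumes an n × n array of nonnegative weights with every cell at most W, which holds whenever a p × p partition of maximum block weight W exists; names are illustrative.

```python
def step1_lines(A, W):
    """Greedy scans placing vertical cuts so each row's weight within any
    block stays <= 2W, and horizontal cuts doing the same for columns."""
    n = len(A)

    def cuts(slice_at):                       # one scan direction
        lines, running = [], [0] * n
        for idx in range(n):
            s = slice_at(idx)                 # next column (or row)
            if any(running[k] + s[k] > 2 * W for k in range(n)):
                lines.append(idx)             # cut just before this slice
                running = s[:]                # a fresh block starts here
            else:
                running = [running[k] + s[k] for k in range(n)]
        return lines

    vertical = cuts(lambda c: [A[r][c] for r in range(n)])    # scan columns
    horizontal = cuts(lambda r: A[r][:])                      # scan rows
    return horizontal, vertical
```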
2D Array Partitioning [KMS 1997] Step 2: (from P to S) Construct the set S of all minimal rectangles of weight more than W that are entirely contained in blocks of P. (how? start from each location within a block and consider all possible rectangles in order of increasing sides, until W is exceeded; keep the minimal ones. A brute-force sketch follows.) Property of S: each rectangle has weight at most 3W. (why? hint: rows/columns in blocks of P weigh at most 2W, and removing the last row/column added brings a minimal rectangle down to weight at most W.)
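A brute-force reading of Step 2, assuming nonnegative weights, so that weight is monotone under containment and checking the four one-step shrinks suffices for minimality. The paper instead grows rectangles in order of increasing sides; this sketch only fixes the definitions.

```python
def minimal_heavy_rectangles(block, W):
    """All minimal rectangles of weight > W inside a block: rectangles
    exceeding W whose every one-step shrink drops the weight to <= W."""
    n, m = len(block), len(block[0])
    P = [[0] * (m + 1) for _ in range(n + 1)]        # 2D prefix sums
    for i in range(n):
        for j in range(m):
            P[i + 1][j + 1] = (block[i][j] + P[i][j + 1]
                               + P[i + 1][j] - P[i][j])

    def w(r1, r2, c1, c2):                           # weight, inclusive bounds
        return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]

    out = []
    for r1 in range(n):
        for r2 in range(r1, n):
            for c1 in range(m):
                for c2 in range(c1, m):
                    if w(r1, r2, c1, c2) <= W:
                        continue
                    shrunk = [(r1 + 1, r2, c1, c2), (r1, r2 - 1, c1, c2),
                              (r1, r2, c1 + 1, c2), (r1, r2, c1, c2 - 1)]
                    if all(a > b or c > d or w(a, b, c, d) <= W
                           for a, b, c, d in shrunk):   # empty or light
                        out.append((r1, r2, c1, c2))
    return out

print(minimal_heavy_rectangles([[1, 2], [3, 4]], 4))  # e.g. [(0,1,0,0), (1,1,0,1)]
```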
2D Array Partitioning [KMS 1997] Step 3: (from S to M) Determine a locally 3-optimal set M of independent rectangles from S. 3-optimality: there is no set of 3 independent rectangles in S that, added to M after removing the at most 2 rectangles of M they conflict with, leaves the independence condition intact while enlarging M. Polynomial-time construction (how? with swaps: apply improving 3-for-2 swaps until none exists; local optimality is easy to reach, as each swap enlarges M).
2D Array Partitioning [KMS 1997] Step 4: (from M to a new partition) For each rectangle in M, add the two straddling horizontal and the two straddling vertical lines that induce it. At most 2|M| horizontal and 2|M| vertical lines are derived. New partition: P from Step 1 together with these lines.
2D Array Partitioning [KMS 1997] Step 5: (final) Retain every c-th horizontal line and every c-th vertical line, for a suitable constant c, so that at most p − 1 lines remain in each direction. The maximum block weight increases at most by the constant factor c², since each resulting block is a union of at most c × c blocks of the Step-4 partition.
2D Array Partitioning [KMS 1997] Analysis: We have to show that: • Given a large enough W, such that a p × p partition of maximum block weight W exists, the maximum block weight in the constructed partition is O(W). • The minimum W for which the analysis holds (found by binary search) does not exceed the optimum W.
2D Array Partitioning [KMS 1997] Lemma 1: (at Step 1) Let block b be contained in partition P. If the weight of b exceeds 27W, then b can be partitioned into 3 independent rectangles of weight > W. Proof: Vertical scan in b; cut as soon as the slab weight seen exceeds 7W (hence each closed slab weighs < 9W). (why? each column within b weighs at most 2W, so the final column pushes a slab from at most 7W to below 9W.) Then, a horizontal scan within each slab; cut as soon as the piece weight seen exceeds W.
2D Array Partitioning [KMS 1997] Proof (cont'd): A piece whose weight exceeds W does not exceed 3W. (why? each row within b weighs at most 2W, so the final row pushes a piece from at most W to below 3W.) Eventually, we obtain 3 independent rectangles weighing > W each.
2D Array Partitioning [KMS 1997] Lemma 2: (at Step 4) The weight of any block b of the Step-4 partition is O(W). Proof: Case 1: b lies within a rectangle of M, so its weight is O(W) (recall that every rectangle in S weighs at most 3W). Case 2: the weight of b is < 27W: if it exceeded 27W, then by Lemma 1 b would be partitionable into 3 independent rectangles, which could substitute the at most 2 rectangles in M that are non-independent of b; this violates the 3-optimality of M.
2D Array Partitioning [KMS 1997] Lemma 3: (at Step 3) If a p × p partition of maximum block weight W exists, then |M| ≤ 2(p − 1). Proof: The weight of every rectangle in M is > W, and the rectangles of M are independent; by Observation 1, at most 2(p − 1) such rectangles can be contained in M.
2D Array Partitioning [KMS 1997] Lemma 4: (at Step 5) If |M| ≤ 2(p − 1), the weight of any block in the final solution is O(W). Proof: At Step 5, the maximum weight increases at most by a constant factor; by Lemma 2, the maximum weight before Step 5 is O(W); hence the final weight is O(W). (a) The least W for which Step 1 and Step 3 succeed does not exceed the optimum W; it is found by binary search. (b) Hence the final partition is a constant-factor approximation of the optimum.
Compact Hierarchical Histograms • Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized. • Heuristic solutions: Reiss et al., VLDB 2006. [Figure: CHH coefficient tree c0–c6 over data d0–d3, with time/space annotations.] "The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node." [Reiss et al. VLDB 2006]
Compact Hierarchical Histograms [Figure: node ci with children c2i and c2i+1.] • Solve the error-bounded problem: the next-to-bottom-level case.
Compact Hierarchical Histograms • Solve the error-bounded problem: the general, recursive case, via a recursion over the coefficient tree (time- and space-efficient). • Apply to the space-bounded problem via binary search on the error bound, as in the flat-histogram case. Complexity: polynomial; in contrast to the heuristics, the problem is polynomially tractable. (A brute-force reference sketch follows.)
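The papers' algorithms are more involved; as a self-contained reference for the problem statement only, here is a brute force over occupancy patterns for a tiny input. The zero default for leaves with no occupied ancestor and the mid-range value choice are modeling assumptions, not taken from the papers; n is assumed a power of two.

```python
from itertools import combinations

def min_chh_buckets(data, eps):
    """Brute-force reference for the error-bounded CHH problem: fewest
    occupied tree nodes such that, with each occupied node set to the
    mid-range of the leaves it serves, every leaf is within eps of the
    value of its lowest occupied ancestor. Heap indexing: node i has
    children 2i+1 and 2i+2."""
    n = len(data)                        # power of two
    num_nodes = 2 * n - 1                # complete binary tree over the data
    paths = []                           # paths[j]: root-to-leaf node indices
    for j in range(n):
        node, path, lo, hi = 0, [0], 0, n
        while hi - lo > 1:
            mid = (lo + hi) // 2
            node = 2 * node + (1 if j >= mid else 0) + 1
            path.append(node)
            lo, hi = (mid, hi) if j >= mid else (lo, mid)
        paths.append(path)

    for k in range(num_nodes + 1):       # try increasing bucket budgets
        for occ in combinations(range(num_nodes), k):
            occ = set(occ)
            served = {}                  # occupied node -> leaves it serves
            for j in range(n):
                anc = next((v for v in reversed(paths[j]) if v in occ), None)
                served.setdefault(anc, []).append(j)
            ok = True
            for v, leaves in served.items():
                vals = [data[j] for j in leaves]
                est = 0 if v is None else (max(vals) + min(vals)) / 2
                if max(abs(x - est) for x in vals) > eps:
                    ok = False
                    break
            if ok:
                return k
    return num_nodes

print(min_chh_buckets([2, 4, 5, 6], 1.5))   # -> 2
```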
References • P. Karras, D. Sacharidis, N. Mamoulis: Exploiting duality in summarization with deterministic guarantees. KDD 2007. • S. Guha: Tight results for clustering and summarizing data streams. ICDT 2009. • S. Khanna, S. Muthukrishnan, S. Skiena: Efficient array partitioning. ICALP 1997. • F. Reiss, M. Garofalakis, J. M. Hellerstein: Compact histograms for hierarchical identifiers. VLDB 2006. • P. Karras, N. Mamoulis: Hierarchical synopses with optimal error guarantees. ACM TODS 33(3), 2008.