160 likes | 309 Views
Speeding Up Warehouse Physical Design Using A Randomized Algorithm. Minsoo Lee Joachim Hammer Dept. of Computer & Information Science & Engineering University of Florida. View Selection Problem. What’s a Data Warehouse?
E N D
Speeding Up Warehouse Physical DesignUsing A Randomized Algorithm Minsoo Lee Joachim Hammer Dept. of Computer & Information Science & Engineering University of Florida
View Selection Problem • What’s a Data Warehouse? • stores info. collected from multiple, heterogeneous info. sources to support complex querying and analysis • Materialized Views in a DW • pre-computed portions of frequently asked queries • maintenance : incremental, periodic refresh • View Selection Problem • decide which views to materialize in DW • considers query response time, maintenance cost(?), and storage cost
Overview of Our Problem • Maintenance-cost View Selection Problem [GM99] • decide which views to materialize in DW • minimize query response time, given an upper bound on maintenance cost (storage space is not considered) • DW Configuration based on OR view graphs • Any view can be computed from any of its related views
Problems with existing Solutions • Existing Solutions to the View Selection Problem • Heuristics-based Search • Greedy Heuristics [Gup97,GM99] • A* [Rou82, LQA97, GM99] • Exhaustive Search [TS97] • Problems • Does not scale up well for more than 20 views • Time complexity is polynomial • DW evolution requires efficient re-computation of a configuration
Outline of Our Approach • Use Randomized Algorithms • Randomized algorithms provide good solution within a small amount of time (time/quality tradeoff) • Specifically, use Genetic Algorithms (GA) • Advantages of Our Approach • Near linear scaleup with a solution within 90% of optimal • Support DW evolution with fast reconfiguration
Genetic Algorithms Loop until termination condition = true t=t+1 Select P(t) from P(t-1) Recombine P(t) Evaluate P(t)
GA : Representation of Solution • Genome • Candidate Solution of the problem to be solved • Represented as a String • Ordering problems : Alphanumeric String Selection problems : Binary String • Binary String Representation • ex) v1 v2 v3 v4 v5 0 1 0 0 1 views v2 and v5 are selected
GA : Initialization of Population • Initial Population in our experiments • Pool of randomly generated bit strings • population size is 300 • Future experiments • generate more favorable initial population • use external knowledge of problem
GA : Selection, Crossover, Mutation, Termination • Selection • Select superior genomes among previous population • Roulette Wheel Method [Mic94] • Crossover • applied to two genomes by exchanging information • Mutation • applied to a single genome : ex) flip a bit in the genome • Termination • termination condition : 400 generations
GA : Evaluation Process • Fitness Function • measures how good a genome is as a solution • high : close to optimal, low : further from optimal • Use Penalty Function in Fitness Function • similar solution to 0/1 knapsack solution[Mic94] • Evaluate query benefit. If maintenance limit is exceeded, apply penalty.
GA : Evaluation Process • Penalty Functions • Logarithmic (LG) • Linear (LN) • Exponential (EX) • Penalty Application Methods • Subtract (S) • Divide (D) • Subtract&Divide (SD)
Evaluation of the Algorithm • Environment • Pentium II 450 MHz PC, Windows NT 4.0 • OR-view graphs • number of base tables : 10 tables • number of views : 5-20 views • edge density of graph : 15%, 30%, 50%, 75% • parameters for node (view) & parameters for edge RC : 100 - 10000 for base tables QC : 10 - 80% of RC of QF : 0.1- 0.9 source view UF : 0.1 - 0.9 MC : 10 - 150% of QC
Prototype Development • Used version 2.4.3 of Galib from MIT • Microsoft Visual C++ • Encoded our own Fitness Function • strategy for penalty is controlled by a control variable • Encoded OR-view graph cost evaluation functions • total query cost, total maintenance cost • OR-view graph costs • Node: Read Cost, Query Frequency, Update Frequency • Edge: Query Cost, Maintenance Cost
Conclusion • Use of Genetic Algorithm for Maintenance-cost View Selection Problem • yields a solution within 10% of optimal solution • linear scale up for execution time w.r.t number of views • EX-D and EX-SD strategy produce best results • Suitable for use in DW evolution • Future work • experiments with better initial population • various crossover and mutation operators, termination condition • AND-OR views, indexes • parallel version of GA