450 likes | 572 Views
Adaptive Data Structures. Towards Declarative Queries on. Simon Zeltser. Based on the article by Nicolas Bruno and Pablo Castro. Contents. 1. Introduction. 2. LINQ on Rich Data Structures. 3. LINQ Query Optimization. 4. Conclusions and Discussion. Introduction. THE PROBLEM
E N D
Adaptive Data Structures Towards Declarative Queries on Simon Zeltser Based on the article by Nicolas Bruno and Pablo Castro
Contents 1 Introduction 2 LINQ on Rich Data Structures 3 LINQ Query Optimization 4 Conclusions and Discussion Technion
Introduction • THE PROBLEM • There is an increasing number of applications that need to manage data outside the DBMS • Need for a solution to simplify the interaction between objects and data sources • Current solutions lack rich declarative query mechanism • THE NEED • Unified way to query various data sources • THE SOLUTION • LINQ (Language Integrated Query) Technion
Introduction • LINQ : Microsoft.NET 3.5 Solution • Accessing multiple data sources via the same API • Technology integrated into the programming language • Supports operations: • Traversal – grouping, joins • Filter – which rows • Projection –which columns vargraduates =fromstudent in students wherestudent.Degree = “Graduate” orderbystudent.Name, student.Gender, student.Age selectstudent; • BUT… The default implementation is simplistic • Appropriate for small ad-hoc structures in memory Technion
Introduction • THE GOAL OF THIS SESSION • Introduce LINQ key principles • Show model of customization of LINQ’s Execution Model on Rich Data Structures • Evaluate the results Technion
LINQ – High Level Architecture C# 3.0 Visual Basic Other Languages… .NET Language Integrated Query (LINQ) LINQ – Enabled Data Sources LINQ To Objects LINQ To Entities LINQ To XML LINQ To Datasets LINQ To SQL <xml > Objects Databases XML Technion
Compare two approaches • Iteration List<String>matches = new List<String>(); // Find the matches foreach(stringitemindata) { if (item.StartsWith("Eric")){ matches.Add(item); } } // Sort the matches matches.Sort(); // Print out the matches foreach(stringiteminmatches) } Console.WriteLine(item); { • LINQ // Find and sort matches varmatches = fromn in data wheren.StartsWith("Eric") orderbyn selectn; // Print out the matches foreach(varmatchinmatches) } Console.WriteLine(match); { Technion
Language Integration Lambda Expressions Function intStringLength(String s) { returns.Length(); { Lambda Expression s => s.Length(); Query Syntax var matches = from n in data wheren.StartsWith("Eric") orderby n select n; var matches = data .Where(n => n.StartsWith("Eric")) .OrderBy(n => n) .Select(n => n) Extension Methods publicstaticIEnumerable<TSource> Where<TSource> (thisIEnumerable<TSource> source, Func<TSource, bool> predicate) var name = "Eric"; var age = 43; var person = new { Name = "Eric", Age = 43 }; var names = new [] {"Eric", "Ryan", "Paul" }; foreach (var item in names) Anonymous Types Technion
LINQ - Example Lambda Expressions // Retrieve all CS students with more // than 105 points var query = from stud in students where ( stud.Faculty == “CS” && stud.Points > 105) orderbystud.Pointsdescending selectnew { Details = stud.Name + “:” + stud.Phone }; // Iterate over results foreach(var student in query) { Console.WriteLine(student.Details); } Query Syntax Extension Methods Anonymous Types Technion
Customizing LINQ Execution Model EXPRESSION TREES • LINQ represents queries as in-memory abstract syntax tree • Query description and implementation are not tied together THE PROBLEM • The default implementation of the operations uses fixed, general purpose algorithms SUGGESTED SOLUTION • Change how the query is executed without changing how it’s expressed • Analyze alternative implementations of a given query and dynamically choose the most appropriate version depending on the context. * + 1 5 7 Technion
Customizing LINQ Execution Model (2) PROBLEM EXAMPLE WHERE operator is implemented by performing a sequential scan over the input and evaluating the selection predicateon each tuple! 1 int[] A = {1, 2, 3, 10, 20, 30}; var q = from x in A where x < 5 select 2*x; foreach(intiin q) Console.WriteLine(i); var q = A.Where(x=>x<5) .Select(x=>2*x); 2 IEnumerable<int> q = Enumerable.Project( Enumerable.Where(A, AF1), AF2); bool AF1(int x) { return x<5; } int AF2(int x) { return 2*x; } 3 Query Implementation: IEnumerable<int> res = new List<int>(); foreach(int a in A) if (AF1(a)) res.Add(AF2(a)); return res; Technion
ForeignKeyConstraint Rich Data Structures - DataSet • In-memory cache of data • Typically populated from a database • Supports indexing of DataColumns via DataViews DataSet object DataTable object DataTable object DataRow Unique Constraint Data Column Unique Constraint • We will use LINQ on DataSet for demonstrating query optimization techniques Technion
LINQ on Rich Data Structures • Enable LINQ to work over DataSets. EXAMPLE • Given R and S – two DataTables from r in R.AsEnumerable() join s in S.AsEnumerable() on r.Field<int>(“x”) equals s.Field<int>(“y”) selectnew { a = r.Field<int>(“a”), b = s.Field<int>(“b”) }; Compile and run-time phases on an implementation of our prototype LINQ on DataSet Standard C# Code Interm. Language Expression Tree Optimized Expression Tree Interm. Language Compile Time Run Time Self-tuning State DataSet Technion
Expression Tree Optimizer Our solution will be built according to the following architecture Self Tuning Organizer Query Analyzer Index Reorganizer Oscillation Manager Cost Model Query Cost Estimator Statistics Manager Technion
Query Cost Estimator Self Tuning Organizer Query Analyzer Index Reorganizer Oscillation Manager Cost Model Query Cost Estimator Statistics Manager Technion
Query Estimation - Cost Model • Follow traditional database approach: • COST: {execution plans} -> [expected execution time] • Relies on: • a set of statistics maintained in DataTablesfor some of its columns • formulas to estimate selectivity of predicates and cardinality of sub-plans • formulas to estimate the expected costs of query execution for every operator Technion
Cardinality Estimation • Returns an approximate number of rows that each operator in a query plan would output • To reduce the overhead, we will use only these statistical estimators: • maxVal – maximum number of distinct values • minVal– minimum number of distinct values • dVal – number of distinct values in a column • If statistics are unavailable, rely on “magic numbers” until automatically creation of statistics Technion
Predicate Selectivity Estimation • Let: σp(T ) be an arbitrary expression. • The cardinality of T is defined: Card(σp(T )) =sel(p)· • Under this definition we define: • COSTT(Execution Plan) = Σ (COST(p)) • EXAMPLE:Consider full table scan of table T): COST(T) = Card(T) * MEM_ACCESS_COST For each p in {operators of T} Average Cost Of Memory Access Technion
Predicate Selectivity Estimation Intuition: We model sel(co<=c<=c1) as the probability to get a “c” value in interval [c0, c1] among all possible “c” values c1 maxVal(c) c • Let: σp(T ) be an arbitrary expression. • The cardinality of T is defined: Card(σp(T )) =sel(p)· • Under this definition we define: • COSTT(Execution Plan) = Σ (COST(p)) • EXAMPLE:Consider full table scan of table T: COST(T) = Card(T) * MEM_ACCESS_COST c0 minVal(c) For each p in {operators of T} Average Cost Of Memory Access Technion
Predicate Selectivity Estimation Consider now a join predicate: T1 c1=c2T2 • Card(T1 c1=c2 T2)= • Let: σp(T ) be an arbitrary expression. • The cardinality of T is defined: Card(σp(T )) =sel(p)· • Under this definition we define: • COSTT(Execution Plan) = Σ (COST(p)) • EXAMPLE:Consider full table scan of table T): COST(T) = Card(T) * MEM_ACCESS_COST For each p in {operators of T} Average Cost Of Memory Access Technion
Query Analyzer Self Tuning Organizer Query Analyzer Index Reorganizer Oscillation Manager Cost Model Query Cost Estimator Statistics Manager Technion
Execution Alternatives • Rely on indexes on DataColumns when possible • Example: σa=7∧(b+c)<20 5 Index on “a” column 3 7 Alternative 1: Alternative 2: 2 4 Full Table Scan a=7 b+c < 20 Technion
Analyzing Execution Plans • Global vs. Local Execution Plan – EXAMPLE: • Local Execution Plan • HashJoin? • IndexJoin? • MergeJoin? Global Execution Plan Technion
Enumeration Architecture • Two phases: • First phase: Join reordering based on estimated cardinalities • Second phase: Choose the best physical implementation for each operator • EXAMPLE:Suppose we analyze JOIN operator. • We evaluate the following JOIN implementations: • Hash Join • Merge Join (inputs must be sorted in the join columns) • Index Join (index on the inner join column must be available) • Other possible calculation options • Choose the alternative with the smallest cost Technion
Query Analysis Self Tuning Organizer Query Analyzer Index Reorganizer Oscillation Manager Cost Model Query Cost Estimator Statistics Manager Technion
Self Tuning Organization • We want to reach the smallest query execution time. • Indexes can be used to speedup query execution • PROBLEM: • It might become problematic to forecast in advance what indexes to build for optimum performance • SOLUTION: • Continuous monitoring/tuning component that addresses the challenge of choosing and building adequate indexes and statistics automatically Technion
Self Tuning Organization - Example • Consider the following execution plan: • The selection predicate Name=“Pam” over CustomersDataTable can be improved if an index on Customers(Name) is built • Both hash joins can be improved if indexes I2 and I3 are available, since we can transform hash join into index join * The three sub-plans enclosed in dotted lines might be improved if suitable indexes were present Technion
Algorithm for automatic index tuning Technion
Index Tuning • High-Level Description: • Identify a good set of candidate indexesthat would improve performance if they were available. • Later, when the optimized queries are evaluated, we aggregate the relative benefits of both candidate and existing indexes. • Based on this information, we periodically trigger index creations or deletions, taking into account storage constraints, overall utility of the resulting indexes, and the cost to creating and maintaining them. Technion
Algorithm for automatic index tuning Technion
Index tuning algorithm • Notation: • H – a set of candidate indexes to materialize • T – task set for query qi • Ii – either a candidate or an existing index • δIi – amount that I would speed up query q H (initially empty) Task Set I1, δI1 I2, δI2 . . … In, δIn Technion
Algorithm for automatic index tuning Technion
Index tuning algorithm • Notation: • ΔI – value maintained for each index I • Materialized index – already created one • SELECT query: ΔI = ΔI + δI • UPDATE query: ΔI = ΔI – δI H I1 Task Set I1, δI1 I1, δI1 I2, δI2 . . … In, δIn Technion
Index Tuning Algorithm • The purpose of ΔI: • We maintain ΔI on every query evaluation • If the potential aggregated benefit of materializing a candidate index exceeds its creation cost, we should create it, since we gathered enough evidence that the index is useful Technion
Algorithm for automatic index tuning Technion
Index tuning algorithm • Remove “bad” indexes phase • Notation: • Δmin– minimum Δ value for index I • Δmax– maximum Δ value for index I • BI – the cost of creating index I • Residual(I)= BI – (Δmax – Δ) (the “slack” an index has before being deemed “droppable”) IF (Residual(I)) <= 0) THEN Drop(I) • Net-Benefit(I) = (Δ-Δmin)-BI (the benefit from creating the index) IF (Net-Benefit(I) >= 0) THEN Add(I) Technion
Algorithm for automatic index tuning Technion
Index tuning algorithm • Notation: • ITM – all the indexes from H which creation is cost effective • ITD – subset of existing indexes such that: • ITD fits in existing memory • It’s still cost effective to create new index I after possibly dropping members from ITD • If creating index I is more effective than maintaining existing indexes in ITD, DROP(ITD) && CREATE(I) • Remove I from H (set of candidate indexes to materialize) Technion
Experimental Evaluation Consider the following schema: checkCarts($1) = from p inProducts.AsEnumerable() join cart inCarts.AsEnumerable() onp.Field<int>("id") equalscart.Field<int>("p_id") join c inCustomers.AsEnumerable() oncart.Field<int>("cu_id") equalsc.Field<int>("id") where c.name = $1 selectnew { cart, p } browseProducts($1) = from p inProducts.AsEnumerable() join c inCategories.AsEnumerable() onp.Field<int>("ca_id") equals c.Field<int>("id") where c.par id = $1 select p • Generated: • 200,000 products • 50,000 customers • 1,000 categories • 5,000 items in the shopping carts Possible Indexes I1 Categories(par_id) I2 Products(c_id) I3 Carts(cu_id) I4 Products(ca_id) I5 Customers(name) Technion
Experimental Evaluation – Cont. Generated schedule when tuning was disabled Technion
Experimental Evaluation – Cont. Generated schedule when tuning was enabled Technion
Summary • We’ve discussed: • LINQ – for declarative query formulation • DataSet - a uniform way of representing in-memory data. • A lightweight optimizer for automatically adjusting query execution strategies • Article’s main contribution: • NOT a new query processing technique • BUT: careful engineering of traditional database concepts in a new context Technion
Thank You ! Simon Zeltser
LINQ Execution Model • Compiler merges LINQ extension methods • Query syntax is converted to function calls and lambda expressions • Lambda expressions are converted to expression trees • Adds query operations to IEnumerable<T> • Expressions are evaluated at run-time • Parsed and type checked at compile-time • At compile time • Compiler finds a query pattern • Query is executed lazily • Compiler infers types produced by queries • Datasets are strongly typed • Operations on • data sets are strongly typed • Specialized or base • Can optimize and • re-write query • Expressions and operations • can execute remotely • At run-time, when results are used • We can force evaluations (ToArray()) Technion