220 likes | 404 Views
Query optimisation. Example - hospital database. Doctors. 100 tuples. Patients. 2500 tuples. Query. Get the name of the doctors who treat patients suffering from the prk11 disease. SELECT Doctors.name FROM Doctors, Patients WHERE Disease = ‘prk11’ AND D_name = Doctors.Name.
E N D
Example - hospital database Doctors 100 tuples Patients 2500 tuples
Query Get the name of the doctors who treat patients suffering from the prk11 disease SELECT Doctors.name FROM Doctors, Patients WHERE Disease = ‘prk11’ AND D_name = Doctors.Name
Evaluation #1 • restrict Patients to those who suffer from prk11 • read: 2500 tuples; result: estimated 50 tuples; no need to write intermediate result - sufficiently small • join above result with Doctors • read: 100 tuples (Doctors); result 50 tuples; no need to write to disk intermediate result • project result over Doctors.name • the desired result is in the memory • estimated cost (read and write) 2600
Evaluation #2 • suppose the internal memory allows only some 350 tuples • join Patients with Doctors • read Patients in batches of 250 tuples; therefore read Doctors 10 times; read: 2500 + 1000 = 3500; write intermediate result (too big) to disk: 2500; • restrict above result • read 2500; result: estimated 50 tuples; • project • cost: 8500 (read and write)
Intermediate conclusions • the evaluation strategy (procedural aspect) can lead to very big differences in computation time, for the same query • computation time: read from and write to disk (quintessential) • processor time • the actual evaluation procedures are far more complex than in the previous introductory example
Optimisation - what • deciding upon the best strategy of evaluating a query • it is performed automatically by the optimiser of the DBMS • not just for data retrieval operations, but for updating operations as well (e.g. UPDATE) • not guaranteed to give the best result
Optimisation - how • based on statistical information about the specific database (not necessarily, though) perform • expression transformation (cast query in some internal form and convert to respective canonical form • candidate low level procedures selection • query plans generation and selection • statistical information - could you think of examples? • cardinality of base relations, indexes, ...
Cast (transform) query in some internal form • internal format • more suitable for automatic processing • trees (syntax tree or query tree) • from a conceptual point of view is is easier to assume that the internal format is relational algebra
Convert to canonical form • the initial expression is transformed into an equivalent but more efficient form • “efficient form” = efficient when executed • these transformation are performed independently from actual data values and access paths
Expression transformation • examples (A WHERE condition#1) WHERE condition#2 (A WHERE condition#1 AND condition#2) (A [projection#1] ) [projection#2] A [projection#2] (A [projection]) WHERE condition (A WHERE condition) [restriction]
Expression transformation • distributivity • commutativity and associativity • idempotence • scalar expressions • conditional expressions • semantic transformation
Set level operations • the operators of relational algebra are set level • i.e. they manipulate sets (relations) and not individual tuples • however, these operators are implemented by internal (DBMS) procedures • these procedures, inherently, need tuple-access (in fact, they need access to scalar values)
Choose candidate low-level procedures • the optimiser decides how to execute the query (expressed in canonical form) • access paths are relevant at this stage • in the main, each basic operation (join , restriction, …) has a set of procedures that implement it • e.g. RESTRICTION - (1) on candidate key; (2) on indexed key; (3) on other attributes … • each procedure has associated a cost function (usually based on the required I/O disk operations); these functions are used in the next stage
Implementing JOIN - examples • R and P - two relations to be joined • J - the attribute on which the (natural) join is performed • R[i] and P[j] mean the i-th tuple of R and the j-th tuple of P, respectively • R[i].J means the value of the attribute J for the i-th tuple of the relation R • R has M and P has N tuples, respectively
Implementing JOIN - brute force for i:=1 to M for j := 1 to N do if R[i].J = P[j].J then add joined tuple R[i]*P[j] to result end end end
Index lookup index X on Patients.D_name
Implementing JOIN - index lookup /* index X on P.J */ for i:=1 to M for j := 1 to K[i] do add joined tuple (R[i] * PK[j]) to result /* PK[j] represents the tuple of P that K[j] points to */ end end
Choose the cheapest query plan • construct query plans (query evaluation plan) • combine candidate low level procedures • choose the cheapest • total cost = the sum of individual costs • individual costs depend on the actual data values; estimates are used instead, based on statistical data • usually not all possible evaluation procedures are generated; the search space is reduced by applying heuristics
Database statistics - in the data dictionary • for each base table • cardinality • space occupied • etc. • for each column of each base table • no of distinct values • maximum, minimum and average values • histogram of values • … • ...
An optimiser is never perfect • the following example is a real life example • suppose a Postgres definition for • base relation: Treatment(Patient, Drug, Disease, …) • the query • get all the drugs that are taken by patients that suffer from prk11 • (all the drugs, not only those for prk11) SELECT DISTINCT Drug FROM Treatment WHERE Patient IN (SELECT Patient FROM Treatment WHERE Disease = ‘prk11’) ; • the query is far slower that the equivalent one (next) ...
An optimiser is never perfect /* this query is faster than the previous one, even though it seems to be performing more computations - Patient is not unique! */ CREATE VIEW V_Treatment AS SELECT * FROM Treatment SELECT DISTINCT Treatment.Drug FROM Treatment, V_Treatment WHERE Treatment.Patient = V_Treatment.Patient AND Disease = ‘prk11’ ;