480 likes | 605 Views
Embedded Databases. Krithi Ramamritham IIT Bombay. Small Devices are Proliferating. Handhelds, Cellphones, Sensors D evices are resource constrained Applications Personal Info Management E-diary Enterprise Applications Health-care, Micro-banking Sensor (Network)s
E N D
Embedded Databases Krithi Ramamritham IIT Bombay
Small Devices are Proliferating • Handhelds, Cellphones, Sensors • Devices are resource constrained • Applications • Personal Info Management • E-diary • Enterprise Applications • Health-care, Micro-banking • Sensor (Network)s • Tracking, Monitoring, aggregation
Handhelds in Healthcare • Medical professionals store current data in the handheld • Sensors may be attached to devices • Update local copy and sync with backend • Need for information anytime and anywhere • Decision support, need to extract data from history • Need to aggregate information from data Reduce communication costs -- a lot of (sub)query processing / aggregation must be done on the device itself.
Sample healthcare database schema Doctor (Docid, name, specialty,…) Visit (Visitid, Docid, date, diagnostics,..) Drug (Drugid, name, type) Prescription (Visitid, Drugid, qty,..) Q1: Who prescribed Antibiotics (to this patient) in 2004? • Multi-table Join Q4: Number of prescriptions per doctor and per type of drug? • Multi-table Join with aggregations
Sensor Networks • Collect data from the environment • Subject data to various queries • health monitoring, habitat monitoring, defence applications • Continuous Aggregate Queries • Examples • Current: Mica Mote – 4MHz cpu, 4kB RAM, 128kB code space, 512 kB EEPROM, 50kbps connectivity (future: match box sized, processor running at several MHz, data memory of several tens of Mbytes) • Reduce communication costs -> a lot of (sub)query processing / aggregation must be done on the device itself.
Embedded Database SW DB Software that is directly in contact with, or is significantly affected by, the hardware that it executes on, or can directly influence the behavior of that hardware.
Need for Small Footprint DBMSs Data management is important • Increasing number of applications • They deal with a fair amount of data • Complex queries involving joins and aggregates • Atomicity and Durability for data consistency • Need temporal consistency / Synchronization • Ease of application development • Require Data Privacy A device resident DBMS is needed
Challenges Tightly constrained • cost, size, performance, power, etc. • limited computing power and main memory • limited stable storage DB is Flash / main memory based • Reads – similar overheads • Writes – flash memory orders of magnitude slower
Lightweight versions of Popular DBMSs Deploy reduced footprint codebase by stripping down db features • e.g. , flat storage, nested loop join • limited support for consistency preserving updates • subset of SQL • DB support not resource cognizant
But, resources are not uniform across devices • Handhelds • Tungsten - 64MB + upto 1GB additional storage • Simputer - 32MB + 24 MB Flash memory, Smart card • Cell phones • 400 – 1600 kB + • Sensors • Mica Mote – 4kB RAM, 128kB code space, 512 kB EEPROM Need tailored DBMSs given • devicecharacteristics and • application requirements
DELite: Overall Philosophy Storage Management • Reduce storage cost to a minimum • Aim at compactness in representation of data • Limited storage could preclude any additional index • But, to speed up query processing, storage model should try to incorporate some index information Query Processing • Memory constrained query processing • existing systems use Minimum memory algorithms -- do not work well for complex joins and aggregates • Minimize writes to secondary storage • Efficient usage of limited main memory • Optimal memory allocation among operators
Visit relation p Car11 10 20 q Obg2 r Car11 30 s 40 Car11 Relation Domain 10 p Car11 20 q Obg2 30 r 40 s Storage Management • Existing storage models • Flat Storage • Tuples are stored sequentially. • Duplicates not eliminated • Pointer-based Domain Storage • Values partitioned into domains which are sets of unique values • Tuples reference the attribute value by means of pointers • One domain shared among multiple attributes Can we further reduce the storage cost while retaining simplicity?
Relation R ID Values Domain Values 0 1 v 2 v1 1 n vn 0 n Positional Indexing Projection Index Storage Management: ID Storage An identifier for each of the domain values • Identifier can be the ordinal value in the domain table • Store the (smaller) identifier instead of the pointer • Use the identifier as an offset into the domain table • D domain values can be distinguished by identifiers of length log2D /8 bytes. • ID based Storage wins over Domain Storage when pointer size > log2 D/8 • Above condition almost always true
ID Storage … • Extendable IDs are used. Length of the identifier grows and shrinks depending on the number of domain values • To reduce reorganization of data, ID values are projected out from the rest of the relation and stored separately maintaining Positional Indexing. Why not bit identifiers? • Storage is byte addressable. • Packing bit identifiers in bytes increases the storage management complexity.
Choice of Storage Schemes Size of relation Selectivity and length of attributes Frequency of updates • Flat storage – 31K • ID storage – 28.5 K • ID Join Index – 18 K Doctors – 91, Drugs – 77 Visits–830, Prescriptions – 2155
Storage -> Query Processing Storage Management • Reduce storage cost to a minimum • Aim at compactness in representation of data • Limited storage could preclude any additional index • But, to speed up query processing, storage model should try to incorporate some index information Query Processing • Memory constrained query processing • existing systems use Minimum memory algorithms -- do not work well for complex joins and aggregates • Minimize writes to secondary storage • Efficient usage of limited main memory • Optimal memory allocation among operators
Basic Steps in Query Processing 1. Parsing and translation 2. Optimization 3. Evaluation
Query exec plans for Q1Who prescribed Antibiotics in 2003? -- selection Doctors who prescribed antibiotics during visits in 2003 59 -- join Prescriptions given by doctors visited in 2003 77 691 Doctors visited in 2003 Drug (77) 270 Pres (2155) Visits in 2003 270 Doctor (91) Doctor (Docid, name, specialty,…) Visit (Visitid, Docid, date, diagnostics,..) Drug (Drugid, name, type) Prescription (Visitid, Drugid, qty,..) Visit (830)
Choosing right form of query expressionand right sequence of transformations can lead to memory and time optimization 59 59 59 691 59 77 Doctor (91) 691 77 Drug (77) 196105 151 270 Drug (77) 830 Pres (2155) Visit (830) Visit (830) Pres (2155) Doctor (91) 7 Pres (2155) 270 Doctor (91) Drug (77) Visit (830) QEP 2 Cost = 830+ 270*91 + 270*2155 + 691*77 = 660457 QEP 3 Cost = 91*2155 + 196105*830 + 691*77 = 163016462 QEP 1 Cost = 77 + 7*2155 + 155*830 + 59*91 =149181 The best plan
Doctor Visit Pres Operator Schemes Schemes for Join • Nested Loop Join • Indexed Nested Loop Join • Hash Join • … Schemes for aggregation • Nested Loop aggregation • Buffered aggregation Operator schemes implemented using the demand-driven model Drug
Visit DocID Doctor 0 1 v0 2 v1 1 n vn 0 n Foreign Key - Primary Key Join Index Joins using ID based join indices
Evaluation of Query Expressions An evaluation plan defines exactly what scheme is used for each operation • different schemes for an operator – (indexed) nested loop join, hash join, … • have different memory usage and cost and how the execution of the operations is coordinated. Alternatives Materialization: generate results of an expression whose inputs are relations or are already computed, materialize (store) it on disk. Repeat. Pipelining: pass on tuples to parent operations even as an operation is being executed 59 59 Doctor (91) 151 830 Visit (830) 7 Pres (2155) Drug (77)
Need for informed memory allocation If nested loop algorithms are used for every operator, minimum amount of memory is needed to execute the plan • Nested loop algorithms are inefficient • Should memory usage be reduced to a minimum at the cost of performance? • Different devices come with different memory sizes • Query plans should make efficient use of memory • Memory must be optimally allocated among all operators Need to generate the best query execution plan depending on the available memory
Join approach affects memory requirements In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.
Doctor Visit Pres Choice of query evaluation plans Need for Left-deep Query Plan • Reduce materialization, if absolutely necessary use main memory • Bushy trees and right-deep trees are ruled out • Left deep tree is most suited for pipelined evaluation • Right operand in a left-deep tree is always a stored relation Other Considerations • Minimize writes to secondary storage • Efficient usage of limited main memory • Main memory as write buffer • If read:write ratio very high, flash memory can be used as write buffer Drug
Doctor Visit Pres Major Issues in choice of query evaluation plan • Order of execution of operators and operands for these • Scheme (algorithm) to use for each operator • Amount of memory to allocate for each scheme Drug
Memory Allocation: using traditional optimizer Phase 1 • Query is first optimized to get a query plan • Scheme for every operator is determined Enough Memory is assumed to be available for all the schemes --- may not be true for a resource constrained device Phase 2 • Division of memory on the basis of cost functions of the schemes => Traditional optimization cannot be used for DELite
Optimal Memory Allocation: 1-Phase Approach Query optimizer is made memory cognizant Optimizer takes into account division of memory among operators while choosing between plans Ideally, 1-phase optimization should be done --- but the optimizer is complex.
Memory Cognizant 2-Phase Approach Optimal use of memory => selecting the best scheme for every operator Phase 1: Determine the optimal left-deep join order (using a dynamic programming approach) using base and intermediate relation cardinalities Phase 2: Divide memory among the operators • based on the (memory, execution time) profiles of operator schemes • such that execution time is optimized
Query exec plans for Q1Who prescribed Antibiotics in 2003? Antibiotics given by doctors visited in 2003 59 Prescriptions given by doctors visited in 2003 77 691 Doctors visited in 2003 Drug (77) 270 Pres (2155) -- selection Visits in 2003 -- join 270 Doctor (91) Visit (830)
Query exec plans for Q1Who prescribed Antibiotics in 2003? 59 59 59 691 59 77 Doctor (91) 691 77 Drug (77) 196105 151 270 Drug (77) 830 Pres (2155) Visit (830) Visit (830) Pres (2155) Doctor (91) 7 Pres (2155) 270 Doctor (91) Drug (77) Visit (830) QEP 2 Cost = 830+ 270*91 + 270*2155 + 691*277 = 660457 QEP 3 Cost = 91*2155 + 196105*830 + 691*77 = 163016462 QEP 1 Cost = 77 + 7*2155 + 155*830 + 59*91 =149181 The best plan
Query exec plans for Q4Number of prescriptions per doctor and per type of drug? GrpBy (Doctor.DocID, Drug. Type) GrpBy (Doctor.DocID, Drug. Type) GrpBy (Doctor.DocID, Drug. Type) 2155 2155 2155 2155 2155 2155 Doctor (91) Drug (77) Drug (77) 2155 830 196105 Visit (830) Pres (2155) Visit (830) Doctor (91) Drug (77) Pres (2155) Visit (830) Doctor (91) Pres (2155) QEP 1 Cost = 77*2155 + 2155*830 + 2155* 91 = 2150690 QEP 2 Cost = 830*91 + 830* 2155 + 2155 * 77 = 2030115 The best plan QEP3 Cost = 91*2155 + 196105*830 +2155*77 = 163129190
Phase 2 Memory Division • Various operator schemes • Schemes conform to left-deep evaluation plan • Schemes use different amounts of memory and have different execution times • Evaluate the benefit of a scheme per unit of additional memory allocation • Divide memory among the operators based on the benefit of their schemes
(s3,b3) (0,c1) (s2,b2) (m2,c2) (m3,c3) (s1,b1) Size (0,0) Benefit Memory Benefit/Size of a scheme Assume n schemes s1, s2,…sn to implement an operator o Minimum scheme for an operator is the scheme that has max. cost and needs min. memory smin -- the minimum scheme for operator o i, 1≤i≤n : Cost(si) ≤ Cost(smin) , Memory(si) ≥ Memory(smin) Every scheme is characterized by a benefit/size ratio which represents its benefit per unit memory allocation Benefit(si) = Cost(smin) – Cost(si) Size(si) = Memory(si) – Memory(smin) Cost
(0,c1) (m2,c2) (m3,c3) (0,0) Memory Phase 2 Memory Division Phase 2 memory division among the operators based on the cost functions of the operators Two approaches • Exact memory allocation • Heuristic memory allocation Cost
(s3,b3) (s2,b2) (s1,b1) Size Benefit Heuristic memory allocation • Determine which operator gains the most per unit memory allocation and allocate memory to that operator • Gain of every operator is determined by its best feasible scheme • Repeat the process till memory allocation is done Heuristic: Select the scheme that has the maximum benefit/size ratio
Summary • Selection of best query execution plan depending on memory available in a device • Resources in the handheld affects the response time in a big way • Response times highest with minimum memory and least with maximum memory • Heuristic memory allocation differed from exact algorithm in a few points only • Response times more for ID Storage due to extra cost in projection • Join Index reduces the query execution time considerably. …. Towards a toolset for synthesizing small footprint DBMSs
Other Challenges… Update management and synchronization have to consider disconnections, mobility and communication cost Compression, operations on compressed data Embedded Operating System provides lesser facilities • e.g. no multi-threading support Better security measures are required as devices are easily stolen, damaged and lost
Value X 0 1 2 3 4 5 Time Updating embedded DBs maintaining Temporal Consistency • Air traffic control • aircraft position, speed, direction, altitude, etc. • 20,000 data entities • validity intervals of 1 ~ 10 seconds • Network services databases • network traffic management data e.g., bandwidth / channel utilization, buffer / space usage
Assign Periods & Deadlines-- Problem & Goals • Problem domain: • maintaining temporal validity of real-time data by periodic update transactions • Goals: assigning periods and deadlines s.t. • update transactions can be guaranteed to complete by their deadlines • the imposed workload is minimized
V t’+V t’ Problem : Maintaining Temporal Validity of Real-Time Data V V: Validity length t t+V • Real-time data refreshed by periodic update sensor transactions • X has to be refreshed before its validity interval expires • validity duration updated upon refresh • How to maintain the validity of data while minimizing the workloads due to update transactions ?
D P t t t+V/2 t +V V Traditional Approach: Half-Half -- Sample at twice the rate of change: P = D= V/2 Definition: • X : Real-Time Data • V : Validity Interval Length • T : Trans Updating X • P: Period of T • D: Deadline of T • C: Computation Time of T P=D P=D t t t +V t+V/2 P more than V/2 & D less than V/2 ! Observation : Data validity can be guaranteed if Period + Deadline <= Validity Length Workload : C/ P= 2C/ V Problem : Imposes unnecessarily high workload
Deriving Deadlines and Periods:Intuition of More-Less Principle • Data validity can be guaranteed if • Period + Relative Deadline <= Validity Length(1) • To reduce the workload (C/P) imposed by T without violating (1) : • Increase periodto bemore than half of validity length • Decrease relative deadlineto be lessthan half of validity length • If relative deadline <= period, • deadline monotonic scheduling is an optimal fixed priority scheduling alg
For a set of transactions {Ti} 1 <= i <= m Validity Constraint (to ensure data validity) : Period + Deadline <= Validity Length More-Less Principle: Definition • Deadline Constraint (to reduce workload) : Computation Time <= Deadline <= Period • Schedulability Constraint (by deadline monotonic) : • Response time of the 1st instance<= Deadline • Note: 1st instance response time is the longest response time if all transactions start at same time Question: Is More-Less always better than Half-Half ?
More-Less (priority order: T1 > T2) Half-Half Parameters D P D P C V 1 2 T1 1.5 1.5 T1 1 3 T1 4 16 10 10 T2 T2 2 20 T2 < Utilization : 1/2 + 2/16 = 0.625 Utilization : 1/1.5 + 2/10 = 0.867 More-Less Principle: P & D Determining deadline and period of a transaction in More-Less: Deadline: D = Response time of the 1st instance; Period : P = Validity Length - Deadline; Does the priority order T2 > T1 produce same P and D? Is more-less always better than half-half ?
Theorem: {Ti} can be scheduled by (Half-Half , any fixed priority scheduling alg) {Ti} can be scheduled by (More-Less, Deadline Monotonic scheduling) The reverse is not true More-Less Better than Half-Half • Question: How to determine transaction priority order s.t. load is minimized under More-Less ?
Shortest Validity First • Shortest Validity First (SVF) • assign orders to transactions • in the inverse order of validity interval length • resolve ties in favor of a transaction with less slack (V- C) • is optimal under certain restrictions
Shortest Validity First: Summary • Restrictions: • S Ci <= min (Vj /2) • C2 - C1 <= 2(V2 - V1) (i.e., the increase of computation time is less than twice the increase in validity interval length), • If restrictions (1) & (2) hold, SVF is optimal • If only restriction (1) holds, SVF is near optimal • In general, SVF is a good heuristic (shown in experiments)