Sensor Data Management and XML Data Management
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 19, 2008
Administrivia
• By next Tuesday, please email me with a status report on your project
  • … We are well under a month from the deadline!
• For next time:
  • Please read & review the TurboXPath paper
Sensor Networks: Target Platform
• Most sensor network research argues for the Berkeley mote as a target platform
• Mote: 4MHz, 8-bit CPU
  • 128B RAM (original)
  • 512B Flash memory (original)
  • 40kbps radio, 100 ft range
• Sensors:
  • Light, temperature, microphone
  • Accelerometer
  • Magnetometer
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
Sensor Net Data Acquisition
• First: build routing tree
• Second: begin sensing and aggregation
Sensor Net Data Acquisition (Sum)
[figure: routing tree in which each mote reads 5; partial sums (10, 15, 20, 25, …) accumulate hop by hop toward the base station]
• First: build routing tree
• Second: begin sensing and aggregation (e.g., sum)
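To make the aggregation step concrete, here is a minimal Python sketch (not TinyDB's code) of partial sums flowing up a routing tree; the tree shape, node names, and the uniform reading of 5 are invented to mirror the figure:

    # Hypothetical sketch of in-network SUM aggregation up a routing tree.
    # Each node adds its own reading to the partial sums received from its
    # children, so the base station sees a single total per epoch.

    def aggregate_sum(node, children, reading):
        """Return the partial sum this node forwards to its parent."""
        partial = reading(node)                      # local sensor sample
        for child in children.get(node, []):
            partial += aggregate_sum(child, children, reading)
        return partial

    # Example tree rooted at the base station; every node reads 5.
    children = {"root": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}
    print(aggregate_sum("root", children, lambda n: 5))  # 25 (5 nodes x 5)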
Sensor Network Research
• Routing: need to aggregate and consolidate data in a power-efficient way
  • Ad hoc routing – generate routing tree to base station
  • Generally need to merge computation with routing
• Robustness: need to combine info from many sensors to account for individual errors
  • What aggregation functions make sense?
• Languages: how do we express what we want to do with sensor networks?
  • Many proposals here
A First Try: TinyOS and nesC
• TinyOS: a custom OS for sensor nets, written in nesC
  • Assumes low-power CPU
  • Very limited concurrency support: events (signaled asynchronously) and tasks (cooperatively scheduled)
• Applications built from “components”
  • Basically, small objects without any local state
  • Various features in libraries that may or may not be included

    interface Timer {
      command result_t start(char type, uint32_t interval);
      command result_t stop();
      event result_t fired();
    }
Drawbacks of this Approach
• Need to write very low-level code for sensor net behavior
• Only simple routing policies are built into TinyOS – some of the routing algorithms may have to be implemented by hand
• Has required many follow-up papers to fill in some of the missing pieces, e.g., Hood (object tracking and state sharing), …
An Alternative
• “Much” of the computation being done in sensor nets looks like what we were discussing with STREAM
• Today’s sensor networks look a lot like databases, pre-Codd
  • Custom “access paths” to get to data
  • One-off custom code
• So why not look at mapping sensor network computation to SQL?
  • Not very many joins here, but significant aggregation
  • Now the challenge is in picking a distribution and routing strategy that provides appropriate guarantees and minimizes power usage
TinyDB and TinySQL
• Treat the entire sensor network as a universal relation
  • Each type of sensor data is a column in a global table
  • Tuples are created according to a sample interval (separated by epochs)
  • (Implications of this model?)

    SELECT nodeid, light, temp
    FROM sensors
    SAMPLE INTERVAL 1s FOR 10s
Storage Points and Windows
• As in Aurora and STREAM, we can materialize portions of the data:

    CREATE STORAGE POINT recentlight SIZE 8
    AS (SELECT nodeid, light FROM sensors
        SAMPLE INTERVAL 10s)

• and we can use windowed aggregates:

    SELECT WINAVG(volume, 30s, 5s)
    FROM sensors
    SAMPLE INTERVAL 1s
Events

    ON EVENT bird-detect(loc):
      SELECT AVG(light), AVG(temp), event.loc
      FROM sensors AS s
      WHERE dist(s.loc, event.loc) < 10m
      SAMPLE INTERVAL 2s FOR 30s
Power and TinyDB
• Cost-based optimizer tries to find the query plan with the lowest overall power consumption
  • Different sensors have different power usage
  • Try to order sampling according to selectivity (sound familiar?) – see the sketch after this list
  • Assumes a uniform distribution of values over each sensor’s range
• Batching of queries (multi-query optimization)
  • Convert a series of events into a stream join with a table
• Also need to consider where the query is processed…
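A hedged Python sketch of the acquisition-ordering idea; the sensor names, per-sample energy costs, selectivities, and the cost / (1 - selectivity) ranking heuristic are illustrative assumptions, not TinyDB's actual cost model:

    # Hypothetical sketch: rank acquisitions by a cost/benefit score so
    # cheap, selective predicates run early and expensive sensors are
    # rarely powered on at all.

    sensors = [
        # (name, energy per sample in uJ, fraction of tuples passing)
        ("light",   90, 0.5),
        ("temp",   350, 0.1),
        ("accel", 1500, 0.9),
    ]

    # Classic expensive-predicate ordering: ascending cost / (1 - selectivity).
    plan = sorted(sensors, key=lambda s: s[1] / (1 - s[2]))
    print([name for name, _, _ in plan])  # ['light', 'temp', 'accel']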
Dissemination of Queries
• Based on the semantic routing tree (SRT) idea
• SRT build request is flooded first
  • Node n gets to choose its parent p, based on radio range from the root
• Parent knows its children
  • Maintains an interval on values for each child
  • Forwards requests to children as appropriate (a sketch of this test follows)
• Maintenance:
  • If its interval changes, a child notifies its parent
  • If a node disappears, the parent learns of this when it fails to get a response to a query
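Below is a small Python sketch of the forwarding test an SRT parent might apply; the child names and value intervals are invented:

    # Hypothetical sketch of SRT query dissemination: a parent keeps a
    # value interval per child and forwards a range query only to the
    # children whose interval overlaps the query's predicate.

    child_intervals = {          # child -> (min, max) of reported values
        "n1": (0, 40),
        "n2": (35, 90),
        "n3": (80, 120),
    }

    def forward_query(lo, hi):
        """Return the children that must receive a query over [lo, hi]."""
        return [c for c, (cmin, cmax) in child_intervals.items()
                if cmax >= lo and cmin <= hi]

    print(forward_query(50, 70))  # ['n2'] – n1 and n3 can keep sleeping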
Query Processing
• Mostly consists of sleeping!
  • Wake briefly, sample, compute operators, then route onwards
  • Nodes are time-synchronized
  • Awake time is proportional to the neighborhood size (why?)
• Computation is based on partial state records
  • Basically, each operation carries a partial aggregate value, plus the reading from the sensor (sketched below)
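A minimal sketch of the partial-state-record idea for AVG, which cannot be merged directly: each node ships a (sum, count) pair up the tree, and only the root finalizes it. The function names and readings are illustrative, not TinyDB's API:

    # Hypothetical sketch of partial state records for AVG.

    def init_psr(reading):
        return (reading, 1)                 # local (sum, count)

    def merge_psr(a, b):
        return (a[0] + b[0], a[1] + b[1])   # combine two partial records

    def finalize_psr(psr):
        s, c = psr
        return s / c                        # only the root does this

    # A node reading 7 merges the records arriving from two children:
    psr = init_psr(7)
    for child_psr in [(10, 2), (3, 1)]:
        psr = merge_psr(psr, child_psr)
    print(finalize_psr(psr))  # (7 + 10 + 3) / (1 + 2 + 1) = 5.0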
Load Shedding & Approximation
• What if the router queue is overflowing?
  • Need to prioritize tuples and drop the ones we don’t want
  • FIFO vs. averaging the head of the queue vs. delta-proportional weighting (the second option is sketched below)
• Later work considers using approximation for more power efficiency
  • If sensors in one region change less frequently, we can sample less frequently (or fewer times) in that region
  • If sensors change less frequently, we can sample readings that take less power but are correlated (e.g., battery voltage vs. temperature)
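The head-averaging policy might look like the following sketch (the queue size and readings are invented, and this is not the paper's code):

    # Hypothetical sketch: when the outbound queue is full, collapse the
    # two oldest readings into their average, so the queue loses
    # resolution rather than whole samples.

    from collections import deque

    MAX_QUEUE = 4
    queue = deque()

    def enqueue(value):
        if len(queue) >= MAX_QUEUE:
            a, b = queue.popleft(), queue.popleft()
            queue.appendleft((a + b) / 2)
        queue.append(value)

    for v in [10, 12, 11, 13, 9, 8]:
        enqueue(v)
    print(list(queue))  # [11.0, 13, 9, 8] after two head merges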
The Future of Sensor Nets?
• TinySQL is a nice way of formulating the problem of query processing with motes
  • View the sensor net as a universal relation
  • Can define views to abstract some concepts, e.g., an object being monitored
• But:
  • What about when we have multiple instances of an object to be tracked? Correlations between objects? (Joins)
  • What if we have more complex data? More CPU power?
  • What if we want to reason about accuracy?
XML: A Format of Many Uses
• XML has become the standard for data interchange, and for many document representations
• Sometimes we’d like to store it…
• Collections of text documents, e.g., the Web, doc DBs
  • … How would we want to query those? IR/text queries, path queries, XQueries?
• Interchanging data
  • SOAP messages, RSS, XML streams
  • Perhaps subsets of data from RDBMSs
• Storing native, database-like XML data
  • Caching
  • Logging of XML messages
XML: Hierarchical Data and Its Challenges
• It’s not normalized…
• It conceptually centers around some origin, meaning that navigation becomes central to querying and visualizing
  • Contrast with E-R diagrams
  • How to store the hierarchy?
  • Complex navigation may include going up or sideways in the tree
  • Updates, locking
  • Optimization
• Also, it’s ordered
  • May restrict order of evaluation (or at least presentation)
  • Makes updates more complex
• Many of these issues aren’t unique to XML
  • Semistructured databases, especially those with ordered collections, were similar
  • But our efforts in that area basically failed…
Two Ways of Thinking of XML Processing
• XML databases (today)
  • Hierarchical storage + locking (Natix, TIMBER, BerkeleyDB, Tamino, …)
  • Query optimization
• “Streaming XML” (next time)
  • RDBMS XML export
  • Partitioning of computation between source and mediator
  • “Streaming XPath” engines
• The difference is in storage (or the lack thereof)
XML in a Database
• Use a legacy RDBMS
  • Shredding [Shanmugasundaram+99] and many others
  • Path-based encodings [Cooper+01]
  • Region-based encodings [Bruno+02][Chen+04]
  • Order preservation in updates [Tatarinov+02], …
  • What’s novel here? How does this relate to materialized views and warehousing?
• Native XML databases
  • Hierarchical storage (Natix, TIMBER, BerkeleyDB, Tamino, …)
  • Updates and locking
  • Query optimization (e.g., that of Galax)
Query Processing for XML
• Why is optimization harder?
  • Hierarchy means many more joins (conceptually): a “traverse”, “tree-match”, “x-scan”, “unnest”, “path”, … operator
    • Though typically parent-child relationships
    • Often don’t have a good measure of “fan-out”
    • More ways of optimizing this
  • Order preservation limits processing in many ways
  • Nested content ~ left outer join
    • Except that we need to cluster a collection with its parent
    • Relationship with the NF2 approach
  • Tags (don’t really add much complexity, except in trying to encode them efficiently)
  • Complex functions and recursion
    • Few real DB systems implement these fully
• Why is storage harder?
  • That’s the focus of Natix, really
The Natix System
• In contrast to many pieces of work on XML, focuses on the bottom layers, equivalent to System R’s RSS:
  • Physical layout
  • Indexing
  • Locking/concurrency control
  • Logging/recovery
Physical Layout
• What are our options in storing XML trees?
  • At some level, it’s all smoke and mirrors: we need to map to “flat” byte sequences on disk
• But several options:
  • Shred completely, as in many RDBMS mappings
  • Each path may get its own contiguous set of pages
    • e.g., vectorized XML [Buneman et al.]
  • An element may get its 1:1 children
    • e.g., shared inlining [Shanmugasundaram+] and [Chen+]
  • All content may be in one table
    • e.g., [Florescu/Kossmann] and most interval-encoded XML (sketched below)
  • We may embed a few items on the same page and “overflow” the rest
    • How collections are often stored in ORDBMSs
  • We may try to cluster XML trees on the same page, as “interpreted BLOBs”
    • This is Natix’s approach (and also IBM DB2’s)
• Pros and cons of these approaches?
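As a concrete illustration of the interval/region encodings mentioned above, here is a small Python sketch; the (begin, end) containment test is the standard trick, but the document and its numbering are invented:

    # Hypothetical sketch of interval encoding: a preorder scan assigns
    # each element a (begin, end) pair, and ancestry reduces to interval
    # containment -- no tree traversal needed.

    # <a><b><c/></b><d/></a>, numbered by one left-to-right scan:
    nodes = {
        "a": (1, 8),
        "b": (2, 5),
        "c": (3, 4),
        "d": (6, 7),
    }

    def is_ancestor(x, y):
        """x is an ancestor of y iff x's interval contains y's."""
        xb, xe = nodes[x]
        yb, ye = nodes[y]
        return xb < yb and ye < xe

    print(is_ancestor("a", "c"))  # True
    print(is_ancestor("b", "d"))  # False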
Challenges of the Page-per-Tree Approach
• How big of a tree? What happens if the XML overflows the page?
• Natix claims an adaptive approach to choosing the tree’s granularity
  • Primarily based on balancing the tree, plus constraints on children that must appear with a parent
  • What other possibilities make sense?
• Natix uses a B+ Tree-like scheme for achieving balance and splitting a tree across pages (a toy split is sketched below)
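A toy sketch of the split-with-proxy idea (the page ids, node labels, and split choice are invented; Natix’s real algorithm also balances the tree and honors parent-child clustering constraints):

    # Hypothetical sketch: when a tree outgrows its page, move a subtree
    # to a fresh page and leave a "proxy" node behind that records where
    # the subtree went.

    pages = {0: ["a", "b", "c", "d", "e"]}   # page 0 has overflowed

    def split_subtree(page_id, subtree, new_page):
        """Move `subtree` nodes to `new_page`; leave a proxy behind."""
        pages[new_page] = [n for n in pages[page_id] if n in subtree]
        pages[page_id] = [n for n in pages[page_id] if n not in subtree]
        pages[page_id].append(("proxy", new_page))

    split_subtree(0, {"d", "e"}, 1)
    print(pages)  # {0: ['a', 'b', 'c', ('proxy', 1)], 1: ['d', 'e']}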
Example
[figure: a tree split across pages, showing the split point in the parent page; note the “proxy” nodes]
That Was Simple – But What about Updates?
• Clearly, insertions and deletions can affect things
  • Deletion may ultimately require us to rebalance
  • Ditto with insertion
• But insertion also may make us run out of space – what to do?
  • Their approach: add another page; ultimately may need to split at multiple levels, as in a B+ Tree
• Others have studied this problem and used integer encoding schemes (plus B+ Trees) for the order (see the sketch below)
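A sketch of the gap-based integer order-key idea behind some of that follow-on work (the key values and renumbering policy are illustrative):

    # Hypothetical sketch: assign sibling order keys with gaps so most
    # insertions need no renumbering; only an exhausted gap forces a
    # local renumber of the affected range.

    keys = [100, 200, 300]          # order keys of three existing siblings

    def insert_between(left, right):
        """Pick a key between two neighbors; renumber if no gap is left."""
        if right - left > 1:
            return (left + right) // 2
        raise RuntimeError("gap exhausted: renumber this sibling range")

    keys.insert(1, insert_between(keys[0], keys[1]))
    print(keys)  # [100, 150, 200, 300] – no existing key had to move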
Does this Help?
• According to general lore, yes
• The Natix experiments in this paper were limited in their query and adaptivity loads
• But the IBM people say their approach, which is similar, works significantly better than Oracle’s shredded approach
There’s More to Updates than the Pages
• What about concurrency control and recovery?
• We already have a notion of hierarchical locks, but they claim:
  • If we want to support IDREF traversal, and indexing directly to nodes, we need more
• What’s the idea behind SPP locking?
Logging
• They claim ARIES needs some modifications – why?
• Their changes:
  • Need to make subtree updates more efficient – don’t want to write a log entry for each subtree insertion
    • Use (a copy of) the page itself as a means of tracking what was inserted, then batch-apply to the WAL
  • “Annihilators”: if we undo a tree creation, then we probably don’t need to worry about undoing later changes to that tree
  • A few minor tweaks to minimize undo/redo when only one transaction touches a page
Assessment
• Native XML storage isn’t really all that different from other means of storage
• There are probably some good reasons to make a few tweaks to locking
• Optimization remains harder
• A real solution to materialized view creation would probably let RDBMSs come close to delivering the same performance, modulo locking