Document Value Model: Value-oriented XML processing for the internet Fritz Henglein DIKU, University of Copenhagen henglein@diku.dk
Abstract • XML is all the rage. How do we store and process XML documents, however? In this talk we present XML Value Store, a persistent distributed (peer-to-peer) storage manager with a value-oriented interface, the Document Value Model (DVM), for XML documents whose parts may be distributed around the net and even moving around (such as on a cell phone at 140 km/h on a motorway). We compare DVM with existing XML processing languages and specifically with the W3C's Document Object Model (DOM). We argue that, apart from a series of technical advantages, the central benefit of DVM is a simplified programming model that lets the programmer focus on application logic, and the XML middleware on persistence management, caching, replication, coalescing, encryption, distribution, lookup, routing and internet data transport. We finally sketch a simple extension of XML Value Store with remote execution. Together with storing code in the XML Value Store, this lets users send queries to a remote XML Value Store for execution and promises highly scalable grid computing functionality with a simple, problem-oriented programming model.
Abstract (long) • XML (eXtensible Markup Language) is emerging as the universal language for representing semi-structured data for distributed storage and information interchange on the internet and as such is destined to be the universal tissue -- the lingua franca -- for interoperable web services and databases interconnecting the internet. This makes XML processing an undisputed growth industry. But how is it done? We give examples of processing XML documents using the domain-specific languages XSLT and XQuery, and the general-purpose interfaces SAX and DOM for manipulating the structure and contents of XML documents. The latter, the Document Object Model, is based on object-oriented programming principles in which tree nodes are mutable objects, with associated methods for imperatively updating their state. Furthermore, each tree node in DOM is equipped with a (single) parent reference and a (single) document root reference, which means DOM nodes cannot be shared and cannot be moved to other documents. In practice, a DOM program also starts by parsing an XML document completely into memory before performing any processing on it, however small a part of the document is actually required, and it finally pretty-prints its tree structure and writes the whole document out to disk, however little of it has actually changed since it was read. In this talk we present DVM (Document Value Model), a value-oriented interface for processing XML documents, and XML Value Store, a distributed (peer-to-peer) storage manager for storing XML documents in parsed form. We illustrate XML documents and document nodes in DVM, which treats nodes as values -- immutable objects. The immutability of nodes in DVM allows aggressive and safe use of sharing through value references -- universal pointers to immutable objects stored anywhere -- in the XML Value Store.
This has a number of technical advantages over DOM: document nodes are sharable, also across multiple documents; loading and saving of nodes into/from memory is done by need, that is, only those nodes needed by a computation are loaded and only those not already saved on disk are actually saved; node pointers point to nodes wherever they are stored, even if they move around frequently; parsing and unparsing for persisting (storing on disk) are eliminated since the XML tree, not its linearized form, is stored; nodes can be cached and replicated aggressively for performance without concern for (cache) incoherence; identical nodes (document parts) are saved only once on disk as opposed to multiple times, even if different users happen to store the same data; no parent and root references are stored, yet navigation to parent and root is still possible. The main advantage, we argue, is that these 'generic' (computer sciency) data management concerns can be and are handled in the XML Value Store, not in the programmer's application logic. A planned extension of XML Value Store is the addition of a 'higher-order' interface, which allows remote execution. This allows sending scripts (queries) to an XML Store for remote execution and promises to provide scalable grid computing functionality with a simple, problem-oriented programming model.
Abstract (for functional programmers) • This talk is basically about programming with (disk and network) I/O in a functional, high-level fashion.
Overview (buzzy version) • OOP • VOP • XML • DOM • DVM • X
Overview • OOP: object-oriented programming and distribution and mobility • VOP: value-oriented programming • XML: XML processing models and languages • DOM: Document Object Model • DVM: Document Value Model • X: The Unknown (future work)
Overview • OOP • VOP • XML • DOM • DVM • X General theme: programming with values (immutable objects) ... and objects (and carefully distinguishing between them).
OOP – or rather: imperative programming • Basic model of programming: • primitive in-place update operations: obj.field := obj2ref • compound update operations: controlled sequential execution of updates; e.g. for (int i = 0; i < arr.size; i++) arr[i] := newVal(i);
Imperative programming theme • Goal: Global state transition from State0 to Staten; State0 is destroyed. • Implementation (ephemeral state updates): a sequence State0 -> ... -> Statei -> ... -> Staten of primitive state transitions, where • each primitive update destroys the previous state
Consequence 1 • software component interfaces are state-oriented and stateful: • which operations are available depends on the history of operations executed in the past • responses from components depend on the history of operations executed • Example: Unix file I/O • NB: Operations on such components are not necessarily atomic (or even recoverable)
Copy-and-update programming • [Figure: input(f), process(s), output(f)] • Note: • data get copied • they are not always coherent • they get copied again
Why (and when) it works (well) • no concurrent access to file • sequential and synchronous programming (control over sequence of state changes) • no partial failures: atomic abort due to single point of failure (single-process execution on single processor) • no replication of stateful data • ‘random’ access to location of data (rapid access no matter where they are stored)
Consequence 2 • Software/hardware component APIs are copy-oriented: data referenced by a pointer get copied before being manipulated to ensure integrity • Example: Modern operating systems are based on separation of address spaces; require copying of data or delegation of tasks (ask the other process to do something for me)
Imperative programming: Problem areas • caching and replication require heavy coherence protocols; otherwise different states are ‘observable’ by clients and users • e.g. file save under NFS (wait for 30 seconds!!) • atomic (commit of compound) update is difficult to achieve in the presence of partial failures • rollback is not ‘naturally’ supported, but normally required in situations where (atomic) updates can fail • coalescing identical data (storing data only once) cannot be done (easily)
Imperative programming: Problem areas... • programming is mostly synchronous to control degree of nondeterminism due to concurrency • access to storage locations is not ‘random’ (no modern file system does what’s shown before) • access to updatable objects is typically ‘location’-based; mobile objects are not ‘naturally’ supported • lots of data stored multiple times
...but, of course • Updatable objects are excellent for propagating information to an arbitrary number of clients (to any caller of the object; the object need not even know or keep track of the number or identity of its callers)
Properties of distributed (mobile) systems • Partial failures • can’t even distinguish network failures from computing node failures • Concurrency • Difficult (exact) synchronization of processes • Widely varying access latency: • an RPC may block for an arbitrarily long time
Techniques for battling these problems • Caching, replication, memoization • (buffered) asynchronous message passing • relaxed or indeterminate semantics • time-outs • observational differences between processes running on same machine or on different machines Not good for mobile code!
Central problem • ...not reading (loading) • ...not writing (saving, allocating) • These ops commute! • but updating (overwriting) • Breaks commuting • Note: The more updating, the fewer operations commute and the more their execution needs to be controlled (synchronized).
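The commuting claim can be illustrated with a minimal Python sketch. It assumes a hypothetical content-addressed store modelled as a dict keyed by SHA-256 digest; the names `save` and `update` are illustrative and not taken from the talk.

```python
import hashlib

def save(store, value):
    """Return a new store with value added under its content hash (hypothetical store model)."""
    key = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {**store, key: value}

def update(cell, value):
    """Overwrite: the new cell state simply replaces the old one."""
    return value

# Saves commute: either order yields the same store contents.
a_then_b = save(save({}, "a"), "b")
b_then_a = save(save({}, "b"), "a")
saves_commute = a_then_b == b_then_a

# Updates do not commute: the last writer wins, so order matters.
x_then_y = update(update(None, "x"), "y")
y_then_x = update(update(None, "y"), "x")
updates_commute = x_then_y == y_then_x
```

Because saves only ever add entries, two clients may save in any order without coordination; only the overwrite needs to be synchronized.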
VOP: Value-oriented programming • Programming with: • arbitrarily “large” values (immutable objects), stored not only in RAM, but also on disk and on the net • location-independent value references (short, probabilistically unique identifiers of values, wherever they are stored) – can be thought of as light-weight proxies for actual (big) values • plus “small” stateful cells (mutable objects) and cell references, incl. • wait-free registers with consensus number infinity (e.g., compare-and-swap registers)
Benefits/goals • Value references: • efficient sharing of immutable data • efficient message passing • Arbitrarily large values: • programmatic support of efficient atomic update: build new (global) state as value, then perform update atomically by assigning value reference of new value to register holding present state. • Small registers: • guaranteeing atomic update, with no (or minimal) locking • wait-freeness: ensure ‘progress’ (doesn’t get blocked forever or for too long) of each client, even in the face of partial failures elsewhere
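The "build the new state as a value, then assign its reference to a register" recipe can be sketched in Python as follows. This is an illustration under stated assumptions: the `Register` class and `atomic_update` helper are hypothetical names, compare-and-swap is simulated with a lock rather than provided as a primitive, and the retry loop is lock-free in spirit but not wait-free.

```python
import threading

class Register:
    """Small mutable cell holding a reference to the current immutable state.

    Assumption: compare-and-swap is simulated with a lock; a real register
    would offer it as an atomic primitive.
    """
    def __init__(self, initial):
        self._value = initial
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

def atomic_update(register, transform):
    """Build the new state as a value, then install it in one step; retry on conflict."""
    while True:
        old = register.get()
        new = transform(old)  # pure function: the old state is left untouched
        if register.compare_and_swap(old, new):
            return new

# The state is an immutable tuple of (name, number) pairs (made-up sample data).
phone_book = Register(())
atomic_update(phone_book, lambda s: s + (("Ada", "555-0100"),))
atomic_update(phone_book, lambda s: s + (("Bob", "555-0101"),))
```

A reader that called `get()` before an update keeps a consistent old state, since no value is ever modified in place.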
XML • XML info set (“(Minimal) XML tree”): labeled ordered tree, with • character data at the leaves • key/value pairs (attributes) at the internal nodes • XML document: • linearized representation of XML tree based on pre/post-order traversal of XML tree
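As a rough illustration of the tree-versus-document distinction, the following Python sketch linearizes a labeled ordered tree by emitting the start tag before the children and the end tag after them. The `Element` class and the `lang` attribute are assumptions made for the example; `escape` and `quoteattr` come from Python's standard `xml.sax.saxutils` module.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr

@dataclass(frozen=True)
class Element:
    tag: str
    attributes: tuple = ()  # tuple of (key, value) pairs
    children: tuple = ()    # each child is an Element or a str (character data)

def linearize(node):
    """Start tag on the way down (pre-order), end tag on the way up (post-order)."""
    if isinstance(node, str):
        return escape(node)
    attrs = "".join(f" {key}={quoteattr(value)}" for key, value in node.attributes)
    body = "".join(linearize(child) for child in node.children)
    return f"<{node.tag}{attrs}>{body}</{node.tag}>"

# The "lang" attribute is made up for illustration.
book = Element("book", (("lang", "da"),), (
    Element("author", (), ("Susanne Staun",)),
    Element("title", (), ("Mit smukke lig",)),
))
xml_text = linearize(book)
```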
XML example • Document: <?xml ...?> <book> <author> Susanne Staun </author> <title> Mit smukke lig </title> </book> • Tree (figure): book, with children author (“Susanne Staun”) and title (“Mit smukke lig”)
Document Object Model (DOM) • [Figure: the book tree (author “Susanne Staun”, title “Mit smukke lig”) drawn with additional pointers between nodes] • Why the extra pointers?
DOM characteristics • object-oriented: nodes are objects, have methods that, amongst others, update their properties (children, attributes, parent pointer) • purely tree oriented: each node has at most one predecessor, no node sharing • cloning: is used to copy a node into another place of a document
DOM specification • Specified by W3C, see www.w3.org/DOM • Specification has 3 levels (specifying more and more functionality for document objects)
Programming with DOM • Typical scenario: • Read linearized XML document from file or network ‘pipe’ (socket). • Parse XML document into an in-memory tree data structure corresponding to DOM • Traverse and manipulate in-memory structure • Unparse in-memory structure to linearized XML document • Write out XML document through file or network pipe interface.
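The scenario can be reproduced with Python's standard `xml.dom.minidom` module, which implements a subset of the W3C DOM. The sketch works on an in-memory string rather than a file or socket, and the replacement title is only sample data.

```python
from xml.dom import minidom

source = "<book><author>Susanne Staun</author><title>Mit smukke lig</title></book>"

# Steps 1-2: read and parse the whole document into an in-memory tree.
document = minidom.parseString(source)

# Step 3: traverse, then update a node in place (the previous state is destroyed).
title = document.getElementsByTagName("title")[0]
title.firstChild.data = "Blå hav"

# Every DOM node carries a parent reference and an owner-document reference.
parent_tag = title.parentNode.tagName
same_document = title.ownerDocument is document

# Steps 4-5: unparse the whole tree again, however little of it changed.
result = document.documentElement.toxml()
```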
Document Value Model (DVM) • [Figure: two book trees, one with title “Blå hav”, the other with author “Susanne Staun” and title “Mit smukke lig”] • Sharing!! • Isn’t that just a picture of the XML tree model?
Navigation • How do we navigate in an XML tree without parent and root pointers? • DOM: current node contains complete navigation state, including parent and root pointers • DVM: navigation state is characterized by the path [n0, ..., nk], where n0 is the root and nk is the ‘current’ node • allows navigation to parent and root, just as in DOM • does not require any storage in nodes, unlike DOM • works also for shared nodes (“bread crumbs” method for finding one’s way back in a labyrinth [DAG])
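A minimal Python sketch of the bread-crumb idea, assuming the navigation state is a tuple of nodes from the root to the current node. The helper names `down`, `up` and `root` are hypothetical, not part of the DVM interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Element:
    tag: str
    children: tuple = ()

def down(path, index):
    """Move to the index-th child of the current node by extending the path."""
    return path + (path[-1].children[index],)

def up(path):
    """Move to the parent by dropping the last bread crumb."""
    return path[:-1]

def root(path):
    """Move back to the root, which is always the first entry of the path."""
    return path[:1]

# One author node shared by two different books (a DAG, not a tree).
shared_author = Element("author")
book1 = Element("book", (shared_author, Element("title")))
book2 = Element("book", (shared_author,))

in_book1 = down((book1,), 0)
in_book2 = down((book2,), 0)
```

Both paths end at the very same node object, yet `up` leads back to different parents, which a single stored parent pointer could not express.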
DVM: basic interface • The type of XML trees is an inductive datatype • Basic constructors (”factory” methods): • Combine attributes, child list, tag into new element node • Make chardata node from string • Basic deconstructors (projections): • Get attributes, child list, tag, chardata • Cells (updatable nodes): • setState, getState: atomic operations
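One way to picture this interface is the Python sketch below. All names (`make_element`, `make_chardata`, `Cell`, `get_state`, `set_state`) are assumptions that mirror the slide rather than the actual DVM API; frozen dataclasses stand in for immutable nodes, and a lock stands in for whatever makes cell operations atomic.

```python
import threading
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class CharData:
    text: str

@dataclass(frozen=True)
class Element:
    tag: str
    attributes: tuple  # tuple of (key, value) pairs
    children: tuple    # tuple of Element or CharData

def make_element(tag, attributes, children):
    """Constructor: combine attributes, child list and tag into a new element node."""
    return Element(tag, tuple(sorted(attributes.items())), tuple(children))

def make_chardata(text):
    """Constructor: make a character data node from a string."""
    return CharData(text)

class Cell:
    """Updatable node: the only place where state changes."""
    def __init__(self, state):
        self._state = state
        self._lock = threading.Lock()

    def get_state(self):
        with self._lock:
            return self._state

    def set_state(self, state):
        with self._lock:
            self._state = state

title = make_element("title", {}, [make_chardata("Mit smukke lig")])
try:
    title.tag = "heading"  # values cannot be updated in place
    mutated = True
except FrozenInstanceError:
    mutated = False

current = Cell(title)
current.set_state(make_element("title", {}, [make_chardata("Blå hav")]))
```

The deconstructors are plain field accesses here (`title.tag`, `title.children`, `title.children[0].text`).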
DVM: general interface • Equip nodes with the ability to receive any function and apply it to themselves, or to receive a function that is applied to each of their subnodes • Called the Visitor pattern in OO design • Corresponds to the unique homomorphism/type elimination rule (“fold”) known from algebraic datatypes/type theory • Lets nodes receive not only single “commands” for execution, but whole programs.
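The fold reading of the visitor interface can be sketched in a few lines of Python. The simplified `Element` type (no attributes, plain strings as character data) and the `fold` signature are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Element:
    tag: str
    children: tuple = ()  # each child is an Element or a str (character data)

def fold(node, on_chardata, on_element):
    """Replace each constructor by a function: leaves by on_chardata, elements by on_element."""
    if isinstance(node, str):
        return on_chardata(node)
    results = [fold(child, on_chardata, on_element) for child in node.children]
    return on_element(node.tag, results)

book = Element("book", (
    Element("author", ("Susanne Staun",)),
    Element("title", ("Mit smukke lig",)),
))

# Two different "programs" sent to the same tree.
text_length = fold(book, len, lambda tag, results: sum(results))
tags = fold(book, lambda text: [], lambda tag, results: [tag] + [t for r in results for t in r])
```

Any bottom-up computation over the tree (counting, searching, serializing) is an instance of `fold` with suitable functions.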
(Functional) update operation • Share-and-create style updating • [Figure: two book trees, one with title “Blå hav”, the other with author “Susanne Staun” and title “Mit smukke lig”]
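Share-and-create updating can be sketched in Python as follows: the update builds a new parent node and reuses every unchanged child. The `Element` type and the `with_child` helper are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Element:
    tag: str
    children: tuple = ()  # each child is an Element or a str (character data)

def with_child(node, index, new_child):
    """Return a new node whose index-th child is new_child; all other children are shared."""
    children = node.children[:index] + (new_child,) + node.children[index + 1:]
    return replace(node, children=children)

author = Element("author", ("Susanne Staun",))
old_book = Element("book", (author, Element("title", ("Mit smukke lig",))))

# "Update" the title: the old book survives, the author node is shared.
new_book = with_child(old_book, 1, Element("title", ("Blå hav",)))
```

Only the nodes on the path from the root to the change are created anew; everything else is referenced from both versions.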
Universal references • [Figure: the two book trees (title “Blå hav”; author “Susanne Staun”, title “Mit smukke lig”) spread across RAM and disk storage] • Never loaded from disk!
Universal references • Value references are location independent: • always designate value, not where value is stored • require routing service to be resolved! • Value references can point from any place to any place: • from RAM to disk, from disk to disk, from disk to network, from disk to RAM (!)...
XML Value Store • Distributed persistence manager for XML elements • Peer-to-peer architecture • Global name server for binding and rebinding value references to human-readable names • Rebinding: bindings can be updated atomically.
XML Store: Basic interface • Load value: Value load(ValueRef vr) • Save value: ValueRef save(Value v) • (That’s it) • Security/authentication not addressed yet: • extended access control based interface • encrypted storage
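A toy version of the two-operation interface, in Python. It assumes that a value reference is the SHA-256 digest of a JSON serialization of the value; the talk only says references are short, probabilistically unique identifiers, so the hashing scheme, the `InMemoryStore` class and the `{"ref": ...}` child encoding are all illustrative choices.

```python
import hashlib
import json

class InMemoryStore:
    """Hypothetical single-process stand-in for an XML store peer."""
    def __init__(self):
        self._blobs = {}

    def save(self, value):
        """ValueRef save(Value v): identical values get identical references."""
        data = json.dumps(value, sort_keys=True).encode("utf-8")
        ref = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(ref, data)  # coalescing: stored only once
        return ref

    def load(self, ref):
        """Value load(ValueRef vr)."""
        return json.loads(self._blobs[ref].decode("utf-8"))

    def __len__(self):
        return len(self._blobs)

store = InMemoryStore()
author_ref = store.save({"tag": "author", "children": ["Susanne Staun"]})
book_ref = store.save({"tag": "book", "children": [{"ref": author_ref}]})

# A second user saving the same author node gets the same reference back.
same_ref = store.save({"tag": "author", "children": ["Susanne Staun"]})
```

Children are held by reference, so loading the book does not load the author until someone follows the reference.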
XML Store: General interface • The visitor interface allows nodes to receive any function and apply it to their state. • Let’s do the same with the XML value store interface: Extending it with a visitor interface allows XML value stores to receive arbitrary code and execute it. • Allows implementation of: • query languages • general remote processing (e.g. for ‘grid’ computing)
Code as values • Program code = value: Code can be stored in the XML store. • Remote execution then involves passing a value reference to the code to the receiver. If the receiver already has the corresponding value (code), e.g. due to caching in the XML value store, no further communication is necessary; otherwise the value is requested (pulled in) by the receiver.
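The pull-on-demand behaviour described here can be mimicked with a small Python sketch. Everything in it is an assumption for illustration: peers share a dict that stands in for the network, code is stored as Python source defining a function named `run`, and `exec` runs it without any sandboxing, which a real store would have to address.

```python
import hashlib

class Peer:
    """Hypothetical peer with a local cache in front of a shared 'network' dict."""
    def __init__(self, network):
        self.network = network
        self.cache = {}
        self.fetches = 0

    def save(self, text):
        ref = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self.cache[ref] = text
        self.network[ref] = text
        return ref

    def load(self, ref):
        if ref not in self.cache:  # pull the value in only when it is missing
            self.fetches += 1
            self.cache[ref] = self.network[ref]
        return self.cache[ref]

    def execute(self, code_ref, argument):
        namespace = {}
        exec(self.load(code_ref), namespace)  # trusts the code; no sandbox in this sketch
        return namespace["run"](argument)

network = {}
sender, receiver = Peer(network), Peer(network)

# The sender stores a query as a value and ships only its reference.
query_ref = sender.save("def run(titles):\n    return [t for t in titles if 'hav' in t]\n")
first = receiver.execute(query_ref, ["Mit smukke lig", "Blå hav"])
second = receiver.execute(query_ref, ["Blå hav"])
```

The second execution finds the code in the receiver's cache, so the code is fetched only once.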
XML Value Store architecture • Base configuration: each peer is a single component made up of: • ”raw” disk manager • network proxy for group of remote XML-store peers • group communication presently based on: • IP-multicast (Pedersen/Tejlgaard 2002), or • Chord-routing protocol (Baumann/Fennestad/Thorn 2002)
Configurable XML Stores • Goal: Clients can build XML Stores by composing them from: • primitive XML stores (disk manager, in-RAM manager, adapters to databases, file managers etc.), and • XML store constructors (“decorators”): • caching reads and writes • asynchronous load/save • buffered load/save requests • encryption/decryption • Target date: August 2003
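The decorator idea can be sketched in Python with one primitive store and one store constructor that adds read caching. The class names `DictStore` and `CachingStore` and the SHA-256 references are assumptions for the example, not the planned API.

```python
import hashlib

class DictStore:
    """Primitive store: an in-RAM manager that counts how often it is read."""
    def __init__(self):
        self.blobs = {}
        self.loads = 0

    def save(self, data):
        ref = hashlib.sha256(data).hexdigest()
        self.blobs[ref] = data
        return ref

    def load(self, ref):
        self.loads += 1
        return self.blobs[ref]

class CachingStore:
    """Decorator: same load/save interface, with a cache in front of the inner store."""
    def __init__(self, inner):
        self.inner = inner
        self.cache = {}

    def save(self, data):
        ref = self.inner.save(data)
        self.cache[ref] = data
        return ref

    def load(self, ref):
        if ref not in self.cache:
            self.cache[ref] = self.inner.load(ref)
        return self.cache[ref]

backing = DictStore()
ref = backing.save(b"<title>Mit smukke lig</title>")

store = CachingStore(backing)
first = store.load(ref)
second = store.load(ref)  # served from the cache
```

Since values are immutable, the cache never needs invalidation, which is what makes such decorators cheap to stack.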
A simple challenge • Write a little program that implements a dictionary, e.g. for looking up phone numbers, and inserting and updating records. • It should work on the net (concurrent access). • It should work for a while (also after the machine has been taken down and restarted). • Surprisingly more complex to program than the routines you learned in algorithm class...
Summary • Value-oriented model for manipulating semistructured data: • supports light-weight caching, replication, asynchronous computing in the “XML middleware” • Configurable XML middleware (client can order the properties one wants from the XML store) • Separation of program logic (in the client code) from generic data management concerns (in the XML store) • Encourages clients to write transaction-safe code programmatically
More info • Website: www.plan-x.org • Presently contains material from seminar on “distributed and mobile data and software” (including lots of references not mentioned here) • Email: henglein@diku.dk