This paper discusses encoding, efficiency, and classification techniques for high-performance content-based event routing with XML data binding in the context of the Programming Systems Lab at Columbia University.
XML Data Binding: Encoding for High-Performance Content-Based Event Routing • Gail Kaiser • Phil Gross • Columbia University Programming Systems Lab
Overview • PSL Intro • MEET Project • Encoding Conversion Efficiency • Encoding Size Efficiency • Encoding Classification Efficiency
Programming Systems Lab • “PSL conducts research on Web technologies, collaborative work, virtual worlds, process/workflow, extended transaction models, software development environments and tools, software engineering, information management, and distributed programming systems” • Lately, lots of XML stuff
PSL XML-related Research • FlexML: Flexible XML • Open-ended XML streams that may include “new” tags • Dynamic schema and semantics discovery and composition • XUES: XML-based Universal Event Service • Event Packager: Data mining over XML structured data • Event Distiller: XML event poset pattern matching • Learning new application-domain events to recognize • DISCUS: Decentralized Information Spaces for Composition and Unification of Services • Rapid and secure application composition using Web Services • Trust Evolution: PGP Trust + KeyNote + real-world business
MEET • Multiply Extensible Event Transport • Content-based multicast routing • Must be efficient enough for embedded and high-performance applications
MEET Motivations • Personal Life Recorder (sensor oriented) • GroupWork Recorder (computer/DB oriented) • Parallel/Grid computing • Distributed simulation • Battlefield C4I • Last, but not least: • Dissertation submission
Relationship to Other Work • Generally modeling communication like [diagram: Machine A, Relational; Machine B, XML] • What actually goes over the line is an afterthought • But with N-Way Internet-scale communication • Millions of publishers and subscribers • We can (must!) do better than ASCII text… • Line speed => ≈250 assembly instructions per packet
MEET Extensibility • Want to scale up, to millions of pubs and subs • Want to scale down, to embedded and wireless • No single solution satisfactory at all scales • Composed of hot-swappable subsystems • Router, transports, clock/causality, types, etc.
Why Types • Event data is not just an opaque bag of bits • Subscriptions are Boolean functions over events • Type safety would be nice • What type system to use?
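The idea that subscriptions are Boolean functions over events can be sketched as follows. This is only an illustration, not MEET's API: the `Event` dictionary, the `route` function, and the subscriber names are all hypothetical.

```python
from typing import Any, Callable, Dict, List

# Hypothetical representation: an event is a dict of named fields,
# and a subscription is any function from an event to True/False.
Event = Dict[str, Any]
Subscription = Callable[[Event], bool]


def route(event: Event, subscriptions: Dict[str, Subscription]) -> List[str]:
    """Return the names of the subscribers whose predicate accepts the event."""
    return [name for name, accepts in subscriptions.items() if accepts(event)]


subscriptions: Dict[str, Subscription] = {
    # Content-based: selects on field values, not on a channel name.
    "alerts": lambda e: e["type"] == "temperature" and e["value"] > 30.0,
    "audit": lambda e: True,
}

matched = route({"type": "temperature", "value": 35.5}, subscriptions)
```

Without a type system, nothing stops a publisher from sending `"value": "hot"`, which would make the `alerts` predicate fail at routing time. That is the gap "type safety would be nice" points at.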
Initial MEET Type Design • Initial design calls for supporting Java, C#, and XML Schema defined objects “out of the box” • XML Schema used as Ur-language/Esperanto for conversions • Subscriptions are arbitrary Boolean functions on datatypes • XML Schema is not an ideal Ur-type • Excessively complex, verbose, etc.
Encodings for Efficiency • Java, C#, XML, ASN.1 have well-defined but proprietary encodings for instances • Would be nice to have an independent encoding scheme with some desirable properties missing from the above • Fast serialization/deserialization • Elimination of redundant information from message sequences • Data organized for rapid classification/routing
Conversion Efficiency • Need to get to and from wire format as fast as possible • Leverage homogeneity to eliminate unnecessary conversions, e.g., network byte order • ECho system from Eisenhauer et al., Georgia Tech • Using “native data” for ultra-low latency • Necessary for HPC
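A minimal sketch of the "native data" idea, assuming a made-up one-byte order flag: the sender writes integers in its own byte order and tags the message, so two machines with the same byte order never swap. This is not ECho's actual wire format; `encode_native`, `decode`, and the flag values are hypothetical.

```python
import struct
import sys

# '<' is little-endian and '>' is big-endian in the struct module.
NATIVE = "<" if sys.byteorder == "little" else ">"


def encode_native(values):
    """Write 32-bit ints in the sender's own byte order, prefixed by a flag."""
    flag = b"L" if NATIVE == "<" else b"B"
    return flag + struct.pack(f"{NATIVE}{len(values)}i", *values)


def decode(message):
    """Read the flag, then interpret the payload in the sender's byte order."""
    order = "<" if message[:1] == b"L" else ">"
    count = (len(message) - 1) // 4
    return list(struct.unpack(f"{order}{count}i", message[1:]))


round_trip = decode(encode_native([1, -2, 300]))
```

When sender and receiver share a byte order, the receiver reads the data as written. Only a heterogeneous pair pays for a conversion, in contrast to always converting to network byte order on both ends.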
Size Efficiency • Ideal for single message is self-describing data • With multiple messages of same type, one can pull out redundant type info, e.g., schema • Goal is to go further: If 90% of content of messages is the same, generate a new subtype with fixed values • From self-describing to all-schema is a continuum
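The step from self-describing messages toward an all-schema encoding might look like the sketch below. It assumes every message of a type carries the same field names; `derive_subtype`, `compress`, and `expand` are hypothetical helpers, not part of MEET.

```python
def derive_subtype(messages):
    """Split fields into those constant across all messages and those that vary."""
    first = messages[0]
    fixed = {k: v for k, v in first.items()
             if all(m.get(k) == v for m in messages)}
    varying = [k for k in first if k not in fixed]
    return fixed, varying


def compress(message, varying):
    """Send only the varying values; the fixed ones live in the subtype."""
    return [message[k] for k in varying]


def expand(values, fixed, varying):
    """Rebuild the full message from the subtype plus the transmitted values."""
    message = dict(fixed)
    message.update(zip(varying, values))
    return message


messages = [
    {"sensor": "t1", "unit": "C", "value": 20.5},
    {"sensor": "t1", "unit": "C", "value": 21.0},
]
fixed, varying = derive_subtype(messages)
wire = compress(messages[1], varying)
```

Here `sensor` and `unit` move into the derived subtype once, and each later message carries only `value`.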
Classification Efficiency • When bits start arriving serially at the router, would like to begin cut-through routing as soon as possible • Avoid the curse of IP/IPv6: source address first • Want key routing bits as close to the front as possible • Want data in fixed locations
Fast Classifying: First Things First • In the packet, type info first (after magic) • Would like to represent type codes as bit string with “most significant” info e.g. parent type first, followed by subtype identifier, sub-subtype, etc. • Need access to type hierarchy • Popular classification fields at the front • Need to tag with popularity metadata • “subscribers will want to select on me”
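One way to picture "parent type first" is a type code whose leading bytes identify the ancestors, so a router can match a subscription on a prefix. The layout below (one magic byte, then one byte per hierarchy level, zero-padded to a fixed depth) is an assumption made for illustration, as are the names `MAGIC`, `TYPE_DEPTH`, `make_packet`, and `matches`.

```python
MAGIC = b"M"      # hypothetical one-byte magic number
TYPE_DEPTH = 4    # hypothetical fixed number of hierarchy levels


def make_packet(type_path, payload):
    """Put the type code right after the magic, most significant level first."""
    code = bytes(type_path).ljust(TYPE_DEPTH, b"\x00")
    return MAGIC + code + payload


def matches(packet, subscribed_path):
    """True if the packet's type is the subscribed type or one of its subtypes."""
    prefix = bytes(subscribed_path)
    return packet[:1] == MAGIC and packet[1:1 + len(prefix)] == prefix


# Hypothetical hierarchy: 1 = Sensor, (1, 2) = Sensor/Temperature,
# (1, 3) = Sensor/Pressure, 2 = Log.
packet = make_packet((1, 2), b"21.5")
```

A subscription to all Sensor events is decided after two bytes have arrived, which is what makes early cut-through routing possible.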
Fast Classifying: Fixed Positions • Would like to avoid scanning through long or variable-length fields • Long/Variable data needs to be in a separate channel/section • Primitives and fixed-length references at the front • References point into data section • Classifier can jump large, uninteresting data quickly
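The fixed-position layout can be sketched with a fixed-size header holding primitives plus an (offset, length) reference into a trailing data section. The field choice and the names `HEADER`, `encode`, `classify`, and `read_note` are assumptions for illustration only.

```python
import struct

# Hypothetical header: 16-bit type code, 64-bit float reading,
# then a 32-bit offset and 32-bit length pointing into the data section.
HEADER = struct.Struct(">HdII")


def encode(type_code, reading, note):
    """Primitives and the fixed-length reference go first, variable data last."""
    return HEADER.pack(type_code, reading, HEADER.size, len(note)) + note


def classify(message, threshold):
    """Decide using only the first 10 bytes; the note is never scanned."""
    _type_code, reading = struct.unpack_from(">Hd", message, 0)
    return reading > threshold


def read_note(message):
    """Follow the reference into the data section when the note is needed."""
    _type_code, _reading, offset, length = HEADER.unpack_from(message, 0)
    return message[offset:offset + length]


msg = encode(7, 35.5, b"long free-text note")
```

Because every classification field sits at a known offset, the classifier does constant work per message regardless of how large the data section is.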
Plus: Schema Format • We’d like the schema format to be amenable to programmatic manipulation and analysis • For instance, when negotiating formats, we’d like to be able to compute how our original format offer differs from the counter-offer • XML Schema is pretty good for this
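As a rough illustration of computing how an offer differs from a counter-offer, the sketch below treats a format as a plain mapping from field name to type name. That representation stands in for a real schema and is an assumption, as is the `diff_formats` helper.

```python
def diff_formats(offer, counter):
    """Describe how a counter-offered format differs from the original offer."""
    added = {k: counter[k] for k in counter.keys() - offer.keys()}
    removed = {k: offer[k] for k in offer.keys() - counter.keys()}
    changed = {k: (offer[k], counter[k])
               for k in offer.keys() & counter.keys()
               if offer[k] != counter[k]}
    return {"added": added, "removed": removed, "changed": changed}


offer = {"sensor": "string", "value": "float64", "note": "string"}
counter = {"sensor": "string", "value": "float32", "timestamp": "int64"}
delta = diff_formats(offer, counter)
```

A schema language that can be parsed into a structure like this makes the comparison mechanical, which is the property the slide asks for.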
Conclusions • Efficient instance transfer is an interesting case for data-binding • Special needs for efficiency • But we can negotiate our own format among the communicating parties • Some explicit support for this in a general data-binding solution could help acceptance