Efficient XML Interchange

XML • Why is XML good? • A widely accepted standard for data representation • Fairly simple format • Flexible • It’s not used by everyone, but it’s used by enough people to make for a rich tools environment • It’s flexible enough to be used in lots of contexts • It’s text based and human readable, which makes it a good archival format

XML • XML in 10 points • http://www.w3.org/XML/1999/XML-in-10-Points • Includes (3) “XML is text, but isn’t meant to be read”, and (4) “XML is verbose by design” • XML can be read by humans but is not meant to be, and it is not very compact

XML • These design principles also make it very difficult to use XML in some environments • Wireless military links: low bandwidth • Mobile devices: battery life limitations • Processing efficiency: parsing XML costs CPU cycles • Data binding: text values must be converted to typed values

Limitations • Many ships have 64 kbit/s links at best, and sending verbose XML across them is problematic • CPUs are on the Moore’s law curve, but battery power is limited by the state of chemistry, so we can’t assume that faster processors will save us. Many applications target hand-held devices with limited battery power (cell phones, etc.) • Cell phones don’t necessarily have strong CPUs, so parsing XML can be expensive relative to other tasks

Data Binding • This is a more subtle problem. • <Point x="1.0" y="2.0"/> • How do you convert this to an object? You need to parse the string "1.0", then convert it to a binary representation • It’s the difference between • string x; • And • float x;

Data Binding • Typically something comes in from the wire, and you have to do the Java equivalent of • Float.parseFloat("1.0"); • This is expensive when working with numeric-heavy documents • It is much more efficient to keep the value x in a binary representation in the document, then simply read it on the receiving side
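
The two receiving paths can be sketched as follows. This is a minimal illustration, assuming a hypothetical wire format that stores the value as a 4-byte IEEE 754 float; it is not the EXI encoding, and the class and method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class DataBindingSketch {
    // Text path: the value arrives as characters and must be parsed.
    static float fromText(byte[] wire) {
        String text = new String(wire, StandardCharsets.UTF_8);
        return Float.parseFloat(text);
    }

    // Binary path (assumed 4-byte float layout): the value is read directly.
    static float fromBinary(byte[] wire) {
        return ByteBuffer.wrap(wire).getFloat();
    }

    public static void main(String[] args) {
        byte[] textWire = "1.0".getBytes(StandardCharsets.UTF_8);
        byte[] binaryWire = ByteBuffer.allocate(4).putFloat(1.0f).array();
        // Both paths yield the same float; only the work done differs.
        System.out.println(fromText(textWire) == fromBinary(binaryWire));
    }
}
```

The text path allocates a String and runs a decimal parser for every value, while the binary path is a fixed-size read, which is where the savings on numeric-heavy documents would come from.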

Efficient XML Interchange • EXI relaxes some of the requirements of XML in order to be more compact, faster to parse, and have better data binding characteristics • Relax the “human readable” requirement • Allow binary data • What you get is an alternate encoding of the XML infoset that can be deployed in environments where text XML previously could not

EXI • EXI is being developed by a W3C working group and is on a standards track. The hope is that this will become a W3C-blessed encoding of the XML infoset • The working group draft is now working its way to approval • Approval needs multiple implementations, the blessing of the W3C Technical Architecture Group, and sign-off from other W3C working groups (encryption, processors, etc.)

EXI • Represents the same data as an XML document, only in a more efficient encoding • Minimal impact on other XML technologies, such as encryption • More efficient to parse, better data binding performance

EXI • http://www.w3.org/XML/EXI • Includes file format specification, primer on EXI, best practices • Note that one thing that is NOT specified is an API for accessing the data. This is an important and significant omission • Lack of a standardized typed API means we still have to go through string representations

Typed API • What is meant by a typed API? • DOM and SAX return string values: • Attr anAttribute; • … • // DOM returns a String attribute value here • String val = anAttribute.getValue(); • And then we need to convert val into a float via • Float aFloat = Float.parseFloat(val);
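
For reference, the string round trip above can be written as a complete program using the standard JAXP DOM API. The class name and the Point document are illustrative assumptions; only the DOM calls are real API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;

final class DomStringValueSketch {
    // Reads attribute "x" of the root element and converts it to a float.
    static float readX(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        Attr anAttribute = doc.getDocumentElement().getAttributeNode("x");
        String val = anAttribute.getValue(); // DOM only hands back a String
        return Float.parseFloat(val);        // the caller does the conversion
    }

    public static void main(String[] args) {
        System.out.println(readX("<Point x=\"1.0\" y=\"2.0\"/>"));
    }
}
```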

Typed API • But what we often want is the value specified in the schema: • Float aFloat = anAttribute.getFloat(); • There are proposals for a generalized typed API, but it is not part of this standard
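
One way to picture such an API is the sketch below. The TypedAttribute interface and both factory methods are hypothetical; nothing like them is defined by DOM, SAX, or the EXI specification. The point is only that a typed accessor backed by binary data can skip the string step, while one layered over a string API cannot.

```java
import java.nio.ByteBuffer;

// Hypothetical typed accessor, named after the getFloat() call on the slide.
interface TypedAttribute {
    float getFloat();
}

final class TypedApiSketch {
    // Backed by a binary value already present in the stream: no string step.
    // Assumes the float is stored as 4 bytes at the given offset.
    static TypedAttribute binaryBacked(ByteBuffer encoded, int offset) {
        return () -> encoded.getFloat(offset);
    }

    // Layered over a string-only API: the parse still happens, just hidden.
    static TypedAttribute stringBacked(String lexical) {
        return () -> Float.parseFloat(lexical);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(4).putFloat(1.0f);
        System.out.println(binaryBacked(buf, 0).getFloat());
        System.out.println(stringBacked("1.0").getFloat());
    }
}
```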

EXI • EXI has several options to handle different situations. • You have an XML document and a schema • You have an XML document but no schema • You have an XML document, and a schema that almost, but not quite, matches the document

Element and Attribute Names • Tag names take up a lot of space, and can be somewhat expensive to parse • <Name first="James" last="Madison"> • <State>Virginia</State> • </Name> • Count up the characters used for markup here: 31/55, so roughly 50-60% of the file size goes to markup tags • If we replace the character tags with numeric stand-ins, the document becomes much more compact and faster to parse
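
The idea of numeric stand-ins can be sketched with a simple name table. This is an illustration only: the class is hypothetical, and the way EXI actually assigns codes to names is defined by its format specification, not by this first-seen numbering.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NameTableSketch {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    // Returns the numeric stand-in for a name, assigning the next free id on first use.
    int idFor(String name) {
        Integer id = ids.get(name);
        if (id == null) {
            id = names.size();
            ids.put(name, id);
            names.add(name);
        }
        return id;
    }

    // Reverse lookup, used when converting back to text XML.
    String nameFor(int id) {
        return names.get(id);
    }

    public static void main(String[] args) {
        NameTableSketch table = new NameTableSketch();
        for (String n : new String[] {"Name", "first", "last", "State"}) {
            System.out.println(n + " -> " + table.idFor(n));
        }
    }
}
```

With four distinct names, each occurrence could in principle be written in two bits instead of four to five characters, and matching a small integer is cheaper than comparing strings.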

Schema-Informed • If you have a schema, that gives you type information about the XML document. You know that in <foo x="1.0"/> the x is a float value rather than a string, because the schema tells you that. • That means you can store the "1.0" value in a binary format, which is generally more compact and has the potential to have better data binding with a typed API

Schemaless • What if you don’t have a schema? This means you can’t exploit type information. But EXI should support this situation, because it should be a general solution • EXI handles this by replacing repeating strings with a compact identifier

Schemaless • <Address town="Monterey" zip="93943"/> • The strings "Monterey" and the zip code are likely to be repeated many times in an XML document. We can create a table of these values, and then use the table ID rather than the whole string
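
A rough sketch of the table idea: the first time a value is seen it is written out in full and added to the table, and every later occurrence is written as its table ID. The "literal:" and "ref:" output tokens are made up for readability and are not the EXI wire format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StringTableSketch {
    // Replaces repeated values with references to their first occurrence.
    static List<String> encode(List<String> values) {
        Map<String, Integer> table = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String v : values) {
            Integer id = table.get(v);
            if (id == null) {
                table.put(v, table.size()); // next free id
                out.add("literal:" + v);
            } else {
                out.add("ref:" + id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two Address elements with the same town and zip.
        System.out.println(encode(Arrays.asList("Monterey", "93943", "Monterey", "93943")));
    }
}
```

A decoder can rebuild the same table as it reads, so the table itself never has to be transmitted.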

“Almost” Schemas • If you have a document that doesn’t quite match the schema, EXI can take a forgiving attitude. It uses the schema to encode the types it knows about, and uses strings and string table identifiers to handle the ones not described by the schema
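
The forgiving behavior can be sketched as a typed encoding with a string fallback. Everything here is an assumption for illustration: the schema is reduced to a hypothetical map from attribute name to declared type, and a float is assumed to occupy 4 bytes, which is not how EXI lays out its values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class SchemaFallbackSketch {
    // schemaTypes is a hypothetical stand-in for a schema: attribute name -> type name.
    static byte[] encode(Map<String, String> schemaTypes, String attrName, String lexical) {
        if ("float".equals(schemaTypes.get(attrName))) {
            try {
                // Known to the schema and well-formed: store as binary.
                return ByteBuffer.allocate(4).putFloat(Float.parseFloat(lexical)).array();
            } catch (NumberFormatException e) {
                // Value does not match the declared type: fall through to string.
            }
        }
        // Not described by the schema, or a mismatch: keep the string form.
        return lexical.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> schema = new HashMap<>();
        schema.put("x", "float");
        System.out.println(encode(schema, "x", "1.0").length);       // typed
        System.out.println(encode(schema, "label", "north").length); // string fallback
    }
}
```

A real encoder would also have to record which representation it chose so the decoder can tell them apart; that bookkeeping is omitted here.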

Implementations • As of now there is one implementation of the draft spec, Efficient XML from AgileDelta (http://www.agiledelta.com) • Other open source projects are underway, along with some commercial projects • The standards process requires that multiple independent implementations be available before the standard is approved

Results • Example: Distributed Interactive Simulation (DIS) is an IEEE standard for modeling and simulation. It is a binary standard that carries (x, y, z) position, velocity, acceleration, and other numeric-heavy data • We created an XML representation of the binary DIS standard

Results • Somewhat smaller than the original binary format. The exact EXI size varies somewhat depending on the numeric data, while the original binary format is always the same size. EXI seems to be consistently smaller, though • AND it is marked up in a way that makes it equivalent to an XML file. This means we can easily access all the tools of the XML ecosystem by simply converting it to a text XML representation

Conclusions • Replace all text XML with EXI? No! EXI is intended to expand XML into use cases that text XML could not serve. XML mostly does fine in its existing environment • EXI can be used to XML-ify existing binary protocols and get slightly better performance with greatly increased interoperability (no one knows DIS binary, everyone knows XML) • Next great frontier: typed XML APIs
