Performing impossible feats of XML processing with pipelining

XML Open 2004. Sean McGrath, Propylon. http://www.propylon.com http://seanmcgrath.blogspot.com

Contents • The pipelining philosophy • Major functional elements of pipelines • Some examples • Pipelining and Grids • Pipelining and Web Services/SOAs • Some anticipated objections (and answers) • Some musings • Some technology pointers

What is XML pipelining? • It is an architectural framework for developing robust, scalable, manageable XML processing systems. • Based on proven mechanical manufacturing patterns. Specifically: • Assembly lines (divide and conquer) • Component assembly and component re-use

What is XML pipelining and why is it useful? • A way of thinking about systems that focuses on XML dataflows rather than object APIs. (This is a critical and non-trivial shift of focus for many programmers!) • Why? Because pipelining provides a mechanical, inspiration-free, genius-free way of handling the mind-boggling complexity of large XML transformation projects.

Pipelining Philosophy XML is all about complex hierarchical data structures…

Pipelining Philosophy Cars are complex, hierarchical structures Henry Ford’s Model T Ford Assembly Line – 1914

Pipelining Philosophy Lunch is a complex, hierarchical structure Lunch Assembly Line. NY, 2004

Pipelining Philosophy We are complex, hierarchical structures

Pipelining philosophy • What have these scenes got in common? • Complex construction of cars, tuna melts and tendons made possible and efficient through • the assembly line manufacturing pattern of divide and conquer • re-usable component processes and component materials • Why not apply this approach to XML “manufacturing”?

Pipeline philosophy • Why does the assembly line approach work? • Transformation task decomposition • Re-usable transformation components • Transformation decomposition is the key to complexity management. Just ask: • Henry Ford • Herbert Simon (The Two Watchmakers – “The Architecture of Complexity”) • George Miller (7+/-2) • Adam Smith (An Inquiry into the Nature And Causes of the Wealth of Nations,1776) • Any electrical or chemical engineer.

Pipeline philosophy • Component re-use is the key to productivity • Ask any form of engineer (electrical, chemical etc.) apart from software engineers… • Component re-use remains a holy grail in software engineering • Pipelining is yet another attempt based on data transformation and data flow rather than algorithms

Pipeline philosophy • A lot of data processing for the foreseeable future will consist of XML to XML transformation • A lot of non-XML data processing can consist of XML to XML transformations with the addition of top and tail transformations to and from non-XML formats • An XML pipeliner’s mantra: • Get data into XML as quickly as possible • Keep it in XML until the last possible minute • Bring all your XML tools to bear on solving the data processing problem

Pipeline philosophy [Diagram: Non-XML Input → Top Transformation → Input XML → … → Output XML → Tail Transformation → Non-XML Output]
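A minimal sketch of the top/tail idea, assuming CSV as the non-XML input and plain text as the non-XML output. The function names, the `rows`/`row` element names and the sample data are illustrative assumptions, not from the talk; the CSV headers are assumed to be valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def top_csv_to_xml(csv_text: str) -> str:
    """Top transformation: non-XML (CSV) in, XML out."""
    root = ET.Element("rows")
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = ET.SubElement(root, "row")
        for field, value in record.items():
            # Assumes each CSV header is usable as an element name.
            ET.SubElement(row, field).text = value
    return ET.tostring(root, encoding="unicode")


def upper_case_names(xml_text: str) -> str:
    """Core stage: XML in, XML out."""
    root = ET.fromstring(xml_text)
    for name in root.iter("name"):
        name.text = (name.text or "").upper()
    return ET.tostring(root, encoding="unicode")


def tail_xml_to_text(xml_text: str) -> str:
    """Tail transformation: XML in, non-XML (plain text) out."""
    root = ET.fromstring(xml_text)
    return "\n".join(
        f"{row.findtext('name')}: {row.findtext('qty')}"
        for row in root.iter("row")
    )


if __name__ == "__main__":
    doc = top_csv_to_xml("name,qty\nwidget,3\ngadget,5\n")
    print(tail_xml_to_text(upper_case_names(doc)))
```

Everything between the top and the tail sees only XML, so any XML-in, XML-out component can be slotted into the middle.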

Pipeline philosophy • The philosophy hinges on the fact that every complex XML transformation can be broken down into a series of smaller ones that can be chained together
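One way to picture the chaining, as a sketch only: treat every stage as a function from an XML string to an XML string and fold the document through the list. The `Stage` type, `pipeline`, and the two example stages are assumptions made for illustration.

```python
from functools import reduce
from typing import Callable
import xml.etree.ElementTree as ET

# A stage is anything that takes XML text and returns XML text.
Stage = Callable[[str], str]


def pipeline(*stages: Stage) -> Stage:
    """Chain stages left to right; the result is itself a stage."""
    def run(xml_text: str) -> str:
        return reduce(lambda doc, stage: stage(doc), stages, xml_text)
    return run


def sort_children(xml_text: str) -> str:
    """Example stage: order the root's children by tag name."""
    root = ET.fromstring(xml_text)
    root[:] = sorted(root, key=lambda child: child.tag)
    return ET.tostring(root, encoding="unicode")


def set_attribute(name: str, value: str) -> Stage:
    """Example stage factory: stamp an attribute on the root element."""
    def stage(xml_text: str) -> str:
        root = ET.fromstring(xml_text)
        root.set(name, value)
        return ET.tostring(root, encoding="unicode")
    return stage


if __name__ == "__main__":
    run = pipeline(sort_children, set_attribute("status", "sorted"))
    print(run("<r><b/><a/></r>"))
```

Because a pipeline is itself a stage, pipelines nest: a finished pipeline can be dropped into a larger one as a single component.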

Pipeline philosophy • Only so many ways to re-arrange an XML tree structure • A finite number of fundamental transformations, from which all transformations can be derived

Pipeline philosophy • Starting point: data at time T conforming to “spec” A. Goal: data at time T2 conforming to “spec” B. • Transformation Analysis/Decomposition – decompose the problem of getting from A to B into independent XML in, XML out stages • Decide what transformation components you already have. • Implement the ones you don’t – make them re-usable for the next transformation project.

Pipeline philosophy • Transformation analysis & decomposition leads to • a series of small, manageable, “stand alone” problems with an XML input “spec” and an XML output “spec”. “Spec” = schemas + structure rules + narrative. • Can build, test, use and then re-use these transformation components • Very team development friendly – parallel development of loosely coupled components • Very debugging friendly – log2(n) “chops” to find any given problem.
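The log2(n) claim can be sketched as a binary search over the intermediate documents. This is an illustration under stated assumptions: the helper name `first_bad_stage` is hypothetical, the input document is assumed to pass the check, and a document that has gone wrong is assumed to stay wrong in every later stage.

```python
from typing import Callable, List, Optional

Stage = Callable[[str], str]


def first_bad_stage(stages: List[Stage], doc: str,
                    is_ok: Callable[[str], bool]) -> Optional[int]:
    """Return the index of the stage that first breaks the document."""
    # snapshots[i] is the document after the first i stages have run.
    snapshots = [doc]
    for stage in stages:
        snapshots.append(stage(snapshots[-1]))

    if is_ok(snapshots[-1]):
        return None  # final output is fine, nothing to hunt for

    # Invariant: snapshots[good] passes the check, snapshots[bad] fails it.
    good, bad = 0, len(stages)
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_ok(snapshots[mid]):
            good = mid
        else:
            bad = mid
    return bad - 1  # the stage that turned a good snapshot into a bad one


if __name__ == "__main__":
    stages = [lambda d: d] * 8
    stages[5] = lambda d: d.replace("<ok/>", "<BUG/>")
    print(first_bad_stage(stages, "<doc><ok/></doc>",
                          lambda d: "BUG" not in d))
```

In practice `is_ok` would be a person eyeballing an intermediate file or a schema check against that stage's output “spec”; either way only about log2(n) snapshots need inspecting.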

Pipeline debugging Schema Delta 1 … Schema Delta N Schema A Schema B Input XML Output XML XML Delta 1 XML Delta N Top Transformation Tail Transformation Non-XML Input Non-XML Output

Pipeline philosophy • The answer to the SAX/DOM question is “mu”. (More on this later) • No such thing as “the” correct abstraction for processing XML • Pipeline approach means you can mix ‘n’ match black-box components that internally use whatever paradigm best suits the problem • Lexical • SAX, STAX, DOM, XOM • COmega, XSLT, XQuery • XDuce, Pyxie, Java, C#, Groovy, Ruby, Haskell, WebIt! etc.

Sample Pipeline DB /CMS Character Set Mods Add Doctype + validate + strip doctype Lexical Re-arrange Elements Validation Lexical DOM Stats + FTP Schematron/ RelaxNG/ Rhino SQL Replace Jython XHTML Generate Java XSLT

Pipeline philosophy • Many XML transformations end up monolithic • Assertion : developers would use a more component based approach to XML processing if they did not have to write the plumbing (orchestration, exception handling) themselves • “Gee, this problem is complex. Maybe I’ll do it in multiple stages! Gee, now I have to orchestrate the stages somehow. Batch files/shell scripts/driver program – all ugly and error prone. Maybe I’ll just write a single program after all. Besides, it will run faster...”

Pipeline philosophy • “Professional developers spend 50 percent of their time writing plumbing” – Adam Bosworth • Pipelining promotes the creation of a reusable plumbing “layer” letting developers concentrate on the application in hand.

Philosophy Summary • Think flow - data processing == data transformation w.r.t. time – Michael Jackson • XML is the current runaway winner in the self-descriptive data stakes and a very good IDDL (Intermediate Data Description Language) for all types of data that are not natively XML based

Philosophy Summary • Inside every complex XML transformation is a sequence of simpler XML transformations trying to get out – a pipeline • Decomposed transformation: • new transformations + • already componentized transformations • -> Component Reuse Nirvana

Pipeline Philosophy Out Level 2 – Rudimentary orchestration In Out Level 1 - pipeline In Out Level 0 – transformation component In Out

Simple pipeline transformation component examples • Fundamental Operation – Rename Element • Rename • Input: <foo>baz</foo> • Output: <bar>baz</bar> [Diagram: element tree before and after]
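A possible Rename component, sketched with Python's standard `xml.etree.ElementTree`. The function name and signature are assumptions; the input and output match the slide.

```python
import xml.etree.ElementTree as ET


def rename_element(xml_text: str, old: str, new: str) -> str:
    """XML in, XML out: every <old> element becomes <new>."""
    root = ET.fromstring(xml_text)
    # iter() visits the root as well as all of its descendants.
    for element in root.iter(old):
        element.tag = new
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(rename_element("<foo>baz</foo>", "foo", "bar"))
```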

Simple pipeline transformation component examples • Fundamental Operation - Peel • Input: <foo><bar>baz</bar></foo> • Output: <foo>baz</foo> [Diagram: element tree before and after]
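Peel is slightly fiddlier than Rename because the peeled element's text and children have to be spliced into its parent in the right order. The following is a sketch, not the author's implementation; `peel` and `_add_text` are hypothetical names, and the root element itself is never peeled.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def _add_text(parent: ET.Element, index: int, text: Optional[str]) -> None:
    """Append text immediately before child position `index` of parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def peel(xml_text: str, tag: str) -> str:
    """Remove every <tag> wrapper below the root, keeping its content."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        i = 0
        while i < len(parent):
            child = parent[i]
            if child.tag != tag:
                i += 1
                continue
            kids = list(child)
            parent[i:i + 1] = kids  # splice the children into the gap
            _add_text(parent, i, child.text)              # text before first kid
            _add_text(parent, i + len(kids), child.tail)  # tail after last kid
            # i is not advanced: the spliced-in children are examined next.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(peel("<foo><bar>baz</bar></foo>", "bar"))
```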

Simple pipeline transformation component examples • Compound Operation - Matryoshka • Input: • <foo><bar>baz</bar></foo> • Output: • <foo></foo><bar></bar>baz [Diagram: element tree before and after]

Simple pipeline transformation component examples • KlingonCloak • Input: • <foo><bar>baz</bar></foo> • Output: • <tag name="foo"><tag name="bar">baz</tag></tag> [Diagram: element tree before and after; the diagram labels the attribute type rather than name]
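A sketch of KlingonCloak and its inverse, using the `name` attribute from the slide's text. The function names are assumptions, and so is the simplification that the input has no existing `name` attributes (they would be overwritten).

```python
import xml.etree.ElementTree as ET


def klingon_cloak(xml_text: str) -> str:
    """Hide every element name inside a generic <tag name="..."> element."""
    root = ET.fromstring(xml_text)
    for element in list(root.iter()):
        element.set("name", element.tag)  # assumes no existing name attribute
        element.tag = "tag"
    return ET.tostring(root, encoding="unicode")


def klingon_uncloak(xml_text: str) -> str:
    """Inverse stage: restore the element names from the name attribute."""
    root = ET.fromstring(xml_text)
    for element in list(root.iter("tag")):
        element.tag = element.attrib.pop("name")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    cloaked = klingon_cloak("<foo><bar>baz</bar></foo>")
    print(cloaked)
    print(klingon_uncloak(cloaked))
```

While cloaked, the document can pass through components that only know about a single generic element, then be restored at the far end.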

Simple pipeline transformation component examples • Reading a file is an XML to XML transformation • <file>lewisscarrol.xml</file> • <poem><line>Twas brillig, and the slithy toves did gyre and gimble in the wabe</line>…</poem>

Simple pipeline transformation component examples • Arithmetic is an XML to XML transformation • <expr>1 + 2</expr> • <res>3</res>
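The arithmetic slide can be sketched as a component that parses the `<expr>` text and emits a `<res>` element. The function names are assumptions, and the evaluator deliberately supports only numbers and the four basic operators rather than calling `eval` on document content.

```python
import ast
import operator
import xml.etree.ElementTree as ET

# Only these binary operators are accepted; anything else is rejected.
_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST):
    """Evaluate a tiny arithmetic subset of Python's expression grammar."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
        return _OPERATORS[type(node.op)](_evaluate(node.left),
                                         _evaluate(node.right))
    raise ValueError("unsupported expression")


def evaluate_expr(xml_text: str) -> str:
    """XML in (<expr>1 + 2</expr>), XML out (<res>3</res>)."""
    expr = ET.fromstring(xml_text)
    if expr.tag != "expr":
        raise ValueError("expected an <expr> element")
    tree = ast.parse((expr.text or "").strip(), mode="eval")
    res = ET.Element("res")
    res.text = str(_evaluate(tree.body))
    return ET.tostring(res, encoding="unicode")


if __name__ == "__main__":
    print(evaluate_expr("<expr>1 + 2</expr>"))
```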

Simple pipeline transformation component examples • Unix pipe utilities e.g. tr • hello world • HELLO WORLD

A little orchestration in a transformation component • Conditionals are XML to XML transformation “tee junctions” triggered by XPaths if XPath TRUE branch In if XPath if XPath FALSE branch
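A tee junction can be sketched as a stage that wraps two other stages and picks one based on whether an XPath matches. The names `tee` and `mark` are hypothetical, and the XPath is limited to the subset that `xml.etree.ElementTree` supports.

```python
from typing import Callable
import xml.etree.ElementTree as ET

Stage = Callable[[str], str]


def tee(xpath: str, if_true: Stage, if_false: Stage) -> Stage:
    """Route the document to one of two branches depending on an XPath."""
    def stage(xml_text: str) -> str:
        root = ET.fromstring(xml_text)
        # find() returns None when nothing below the root matches the path.
        branch = if_true if root.find(xpath) is not None else if_false
        return branch(xml_text)
    return stage


def mark(route: str) -> Stage:
    """Example branch: record on the root which way the document went."""
    def stage(xml_text: str) -> str:
        root = ET.fromstring(xml_text)
        root.set("route", route)
        return ET.tostring(root, encoding="unicode")
    return stage


if __name__ == "__main__":
    junction = tee(".//urgent", mark("fast"), mark("slow"))
    print(junction("<order><urgent/></order>"))
    print(junction("<order><item/></order>"))
```

The junction is still XML in, XML out, so it composes with other stages like any other component.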

Validation as a transformation component XML A XML A’ RelaxNG Schematron Jython/Java/JACL XComponent Input Output Validation Log Error
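Validation fits the same mould: the document passes through unchanged when it is valid, a log entry is written either way, and an error is raised otherwise. The sketch below stands in for a real RelaxNG or Schematron engine with a list of "this XPath must match" rules; `validation_component`, `ValidationError` and the rule format are assumptions for illustration.

```python
from typing import Callable, List, Tuple
import xml.etree.ElementTree as ET

Stage = Callable[[str], str]


class ValidationError(Exception):
    """Raised when a document fails one or more rules."""


def validation_component(rules: List[Tuple[str, str]],
                         log: List[str]) -> Stage:
    """Build an identity stage that enforces (xpath, message) rules.

    Each XPath must match at least one node below the root; otherwise
    its message is logged and the stage raises ValidationError.
    """
    def stage(xml_text: str) -> str:
        root = ET.fromstring(xml_text)
        failures = [message for xpath, message in rules
                    if root.find(xpath) is None]
        if failures:
            log.append("FAIL: " + "; ".join(failures))
            raise ValidationError("; ".join(failures))
        log.append("PASS")
        return xml_text  # valid documents flow through untouched
    return stage


if __name__ == "__main__":
    log: List[str] = []
    check = validation_component([(".//title", "missing title")], log)
    print(check("<doc><title>t</title></doc>"))
    print(log)
```

Dropping such a stage between any two components turns that point of the pipeline into a checked interface.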

Sample Transformation Component Examples • Once you start thinking in terms of pipes – components appear everywhere: • Regular fragmentations • Doctype changer • Namespace normalizer • Character set transcoder • Hash generator • Architectural form processing • RelaxNG/Schematron etc

First objection • “It will be dog slow” or (stronger form): • “Re-usable tree transforming components won’t work in my shop – my XML files are too big to schlep around in strings, never mind DOMs!”

Document fulcra and the scatter/gather pattern • For any given transformation t to be performed on documents conforming to schema s, there is a fragment expression that can be used to chop each document into n pieces, on which t can be performed. • I call these points fulcra; they are a function of (t, s)

Identifying Fulcra • For data-oriented XML, the fulcra often coincide with the “record” iteration in the XML schema and may be independent of t. • For document-oriented XML, the fulcra are much more dependent on t.

Document fulcra and scatter/gather pattern • Having identified the fulcra:- • Chop the input document into fragments – scatter phase • Perform t • Join all the processed fragments together to constitute the output document – gather phase • Three stage pipeline – scatter & gather either side of the core component
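The three phases can be sketched for the simple data-oriented case, where the fulcrum is a repeated record element directly under the root. The function names are assumptions, and so are the simplifications: root attributes and any content other than the records are dropped.

```python
from typing import Callable, List, Tuple
import xml.etree.ElementTree as ET

Stage = Callable[[str], str]


def scatter(xml_text: str, record_tag: str) -> Tuple[str, List[str]]:
    """Chop the document into one standalone fragment per record."""
    root = ET.fromstring(xml_text)
    fragments = []
    for record in root.findall(record_tag):
        record.tail = None  # keep inter-record whitespace out of the fragment
        fragments.append(ET.tostring(record, encoding="unicode"))
    return root.tag, fragments


def gather(root_tag: str, fragments: List[str]) -> str:
    """Join processed fragments back under a single root element."""
    root = ET.Element(root_tag)
    for fragment in fragments:
        root.append(ET.fromstring(fragment))
    return ET.tostring(root, encoding="unicode")


def scatter_gather(xml_text: str, record_tag: str, t: Stage) -> str:
    """Three stage pipeline: scatter, apply t to each fragment, gather."""
    root_tag, fragments = scatter(xml_text, record_tag)
    return gather(root_tag, [t(fragment) for fragment in fragments])


def double_amount(fragment: str) -> str:
    """Example t: works on one small <order> at a time."""
    order = ET.fromstring(fragment)
    order.text = str(int(order.text) * 2)
    return ET.tostring(order, encoding="unicode")


if __name__ == "__main__":
    doc = "<orders><order>1</order><order>2</order></orders>"
    print(scatter_gather(doc, "order", double_amount))
```

The component t only ever sees one small fragment, so it can use a tree-based approach internally without holding the whole document in memory.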

Input Doc Document Fulcra Scatter n fragments TIME Invoke t t t t t t n fragments Gather Output Doc

Document Fulcra • Note the data domain de-composition – SETI@Home meets XML markup. • Trivially parallelizable
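Since the fragments are independent, the "invoke t" step can be fanned out. The sketch below uses a thread pool from Python's standard library purely as a stand-in; for CPU-bound transformations a process pool or a grid of machines would be the realistic choice. `parallel_transform` and `shout` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List
import xml.etree.ElementTree as ET

Stage = Callable[[str], str]


def parallel_transform(fragments: List[str], t: Stage,
                       max_workers: int = 4) -> List[str]:
    """Apply t to every fragment concurrently, preserving fragment order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # so the gather phase can simply concatenate them.
        return list(pool.map(t, fragments))


def shout(fragment: str) -> str:
    """Example t: upper-case the text of a single fragment."""
    element = ET.fromstring(fragment)
    element.text = (element.text or "").upper()
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    print(parallel_transform(["<l>one</l>", "<l>two</l>", "<l>three</l>"],
                             shout))
```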

Document Fulcra • A good fulcra-based scatter/gather will make performance head north faster, cheaper and with a higher upper limit than any amount of hand-crafted, genius level XML coding of your transformations in horrid SAX or lexical parse mode. • Massive parallelism will kill all von Neumann throughput arguments • Documents per second, not seconds per document – throughput is the true measure of XML processing speed • Document fulcra – Locality of reference (Denning) applies to XML processing (more on this later)

More objections (with more answers) • It will be slow • No it won’t - Premature optimization is the root of all evil! • Speed is a three-headed monster. I’m old enough to have left the X axis and am currently heading for Y through Z [Diagram: The 3 Axes to Speed]

Some objections (with some answers) • Component based software? Harumph! We have heard that one before… • Pipelines are data flow based not API based (COM, VBX, CORBA) • Two pin interfaces and minimal “verbs” • The XML “payload” is what is important – not the API - RESTian

Revisiting the XSLT/DOM -> SAX non sequitur • XSLT and DOM are memory bound – trade off between ease of use and resource usage – ease of use favoured • SAX is not memory bound – trade off between ease of use and resource usage – low resource usage favoured • On xml-dev users are often advised to rewrite their apps using SAX! Ugh!

XSLT/DOM -> pipeline • Pipelines and scatter/gather allow you to keep the ease of use of XSLT/DOM with the finite resource utilization of SAX • As long as you can identify a good fulcrum function • They exist more often than not • If they exist, they are very easily found and “drop out” of document analysis – eg: xpath expressions in XSLT stylesheet templates

Pipelining and Grids • Grid Technologies – computational power “on tap” (http://www.gridforum.org) • A match made in heaven (bandwidth permitting)

An XML Processing Grid – on demand Out In Out DMZ

Grids - caveats • For large data volumes it is simply not feasible to shunt the data over the wire – Jim Gray • Organizations are sensitive about their data going beyond firewalls • Pay-per-use “racks” in your back-office are a better bet. – Rent a grid the way you would rent a chainsaw.

A Service Oriented Architecture “service” = XML transformation with optional side effects
