Performing impossible feats of XML processing with pipelining. XML Open 2004 Sean McGrath Propylon http://www.propylon.com http://seanmcgrath.blogspot.com. Contents. The pipelining philosophy Major functional elements of pipelines Some examples Pipelining and Grids
Performing impossible feats of XML processing with pipelining XML Open 2004 Sean McGrath Propylon http://www.propylon.com http://seanmcgrath.blogspot.com
Contents • The pipelining philosophy • Major functional elements of pipelines • Some examples • Pipelining and Grids • Pipelining and Web Services/SOAs • Some anticipated objections (and answers) • Some musings • Some technology pointers
What is XML pipelining? • It is an architectural framework for developing robust, scalable, manageable XML processing systems. • Based on proven mechanical manufacturing patterns. Specifically: • Assembly lines (divide and conquer) • Component assembly and component re-use
What is XML pipelining and why is it useful? • A way of thinking about systems that focuses on XML dataflows rather than object APIs. (This is a critical and non-trivial shift of focus for many programmers!) • Why? Because pipelining provides a mechanical, inspiration-free, genius-free way of handling the mind-boggling complexity of large XML transformation projects.
Pipelining Philosophy XML is all about complex hierarchical data structures…
Pipelining Philosophy Cars are complex, hierarchical structures Henry Ford’s Model T Ford Assembly Line – 1914
Pipelining Philosophy Lunch is a complex, hierarchical structure Lunch Assembly Line. NY, 2004
Pipelining Philosophy We are complex, hierarchical structures
Pipelining philosophy • What have these scenes got in common? • Complex construction of cars, tuna melts and tendons is made possible and efficient through • the assembly line manufacturing pattern of divide and conquer • re-usable component processes and component materials • Why not apply this approach to XML “manufacturing”?
Pipeline philosophy • Why does the assembly line approach work? • Transformation task decomposition • Re-usable transformation components • Transformation decomposition is the key to complexity management. Just ask: • Henry Ford • Herbert Simon (The Two Watchmakers – “The Architecture of Complexity”) • George Miller (7+/-2) • Adam Smith (An Inquiry into the Nature And Causes of the Wealth of Nations,1776) • Any electrical or chemical engineer.
Pipeline philosophy • Component re-use is the key to productivity • Ask any form of engineer (electrical, chemical etc.) apart from software engineers… • Component re-use remains a holy grail in software engineering • Pipelining is yet another attempt based on data transformation and data flow rather than algorithms
Pipeline philosophy • A lot of data processing for the foreseeable future will consist of XML to XML transformation • A lot of non-XML data processing can consist of XML to XML transformations with the addition of top and tail transformations from and to non-XML formats • An XML pipeliner’s mantra: • Get data into XML as quickly as possible • Keep it in XML until the last possible minute • Bring all your XML tools to bear on solving the data processing problem
Pipeline philosophy [Diagram: Non-XML Input → Top Transformation → Input XML → (XML pipeline) → Output XML → Tail Transformation → Non-XML Output]
Pipeline philosophy • The philosophy hinges on the fact that every complex XML transformation can be broken down into a series of smaller ones that can be chained together
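The chaining idea can be sketched in a few lines of Python. This is only an illustration, not the speaker's tooling: the stage names `strip_attributes` and `uppercase_text` and the helper `pipeline` are hypothetical, and each stage is assumed to be a black box that takes XML text and returns XML text.

```python
import xml.etree.ElementTree as ET


def strip_attributes(xml_text):
    """Hypothetical stage: remove every attribute, keep elements and text."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        el.attrib.clear()
    return ET.tostring(root, encoding="unicode")


def uppercase_text(xml_text):
    """Hypothetical stage: upper-case the leading text of every element."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        if el.text:
            el.text = el.text.upper()
    return ET.tostring(root, encoding="unicode")


def pipeline(*stages):
    """Chain XML-in, XML-out stages into one larger XML-in, XML-out stage."""
    def run(xml_text):
        for stage in stages:
            xml_text = stage(xml_text)
        return xml_text
    return run


convert = pipeline(strip_attributes, uppercase_text)
result = convert('<doc id="1"><foo lang="en">baz</foo></doc>')
print(result)
```

Because `pipeline` itself returns an XML-in, XML-out function, a pipeline can be dropped into another pipeline as a single stage.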
Pipeline philosophy • Only so many ways to re-arrange an XML tree structure • A finite number of fundamental transformations, from which all transformations can be derived
Pipeline philosophy • Starting point: data at time T1 conforming to “spec” A. End point: data at time T2 conforming to “spec” B. • Transformation Analysis/Decomposition – decompose the problem of getting from A to B into independent XML in, XML out stages • Decide what transformation components you already have. • Implement the ones you don’t – make them re-usable for the next transformation project.
Pipeline philosophy • Transformation analysis & decomposition leads to • a series of small, manageable, “stand alone” problems with an XML input “spec” and an XML output “spec”. “Spec” = schemas + structure rules + narrative. • Can build, test, use and then re-use these transformation components • Very team development friendly – parallel development of loosely coupled components • Very debugging friendly – log2(n) “chops” to find any given problem.
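The log2(n) debugging claim can be illustrated with a binary search over the intermediate outputs of a pipeline. The sketch below rests on two assumptions that are not in the slides: every stage output has been saved as a snapshot, and once a defect appears it stays visible in all later snapshots. The names `make_stage`, `is_ok` and `first_bad_stage` are made up for the example.

```python
import xml.etree.ElementTree as ET


def make_stage(index, buggy=False):
    """Hypothetical stage: stamps the root; the buggy one also breaks the root tag."""
    def stage(xml_text):
        root = ET.fromstring(xml_text)
        root.set(f"s{index}", "done")
        if buggy:
            root.tag = "oops"
        return ET.tostring(root, encoding="unicode")
    return stage


def is_ok(xml_text):
    """Stand-in for validating a snapshot against its intermediate 'spec'."""
    return ET.fromstring(xml_text).tag == "doc"


def first_bad_stage(snapshots, check):
    """Binary search for the first snapshot that fails the check.

    Returns (index or None, number of snapshots inspected).
    """
    lo, hi, inspected = 0, len(snapshots), 0
    while lo < hi:
        mid = (lo + hi) // 2
        inspected += 1
        if check(snapshots[mid]):
            lo = mid + 1   # defect, if any, was introduced later
        else:
            hi = mid       # defect was introduced here or earlier
    return (lo if lo < len(snapshots) else None), inspected


stages = [make_stage(i, buggy=(i == 5)) for i in range(8)]
snapshots, text = [], "<doc/>"
for stage in stages:
    text = stage(text)
    snapshots.append(text)

bad_index, inspected = first_bad_stage(snapshots, is_ok)
print(bad_index, inspected)
```

With eight stages the faulty one is located after inspecting three snapshots rather than all eight.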
Pipeline debugging [Diagram: Non-XML Input → Top Transformation → Input XML (Schema A) → XML Delta 1 (Schema Delta 1) → … → XML Delta N (Schema Delta N) → Output XML (Schema B) → Tail Transformation → Non-XML Output]
Pipeline philosophy • The answer to the SAX/DOM question is “mu”. (More on this later) • No such thing as “the” correct abstraction for processing XML • Pipeline approach means you can mix ‘n’ match black-box components that internally use whatever paradigm best suits the problem • Lexical • SAX, StAX, DOM, XOM • COmega, XSLT, XQuery • XDuce, Pyxie, Java, C#, Groovy, Ruby, Haskell, WebIt! etc.
Sample Pipeline DB /CMS Character Set Mods Add Doctype + validate + strip doctype Lexical Re-arrange Elements Validation Lexical DOM Stats + FTP Schematron/ RelaxNG/ Rhino SQL Replace Jython XHTML Generate Java XSLT
Pipeline philosophy • Many XML transformations end up monolithic • Assertion: developers would use a more component-based approach to XML processing if they did not have to write the plumbing (orchestration, exception handling) themselves • “Gee, this problem is complex. Maybe I’ll do it in multiple stages! Gee, now I have to orchestrate the stages somehow. Batch files/shell scripts/driver program – all ugly and error prone. Maybe I’ll just write a single program after all. Besides, it will run faster...”
Pipeline philosophy • “Professional developers spend 50 percent of their time writing plumbing” – Adam Bosworth • Pipelining promotes the creation of a reusable plumbing “layer” letting developers concentrate on the application in hand.
Philosophy Summary • Think flow - data processing == data transformation w.r.t. time – Michael Jackson • XML is the current runaway winner in the self-descriptive data stakes and a very good IDDL (Intermediate Data Description Language) for all types of data that are not natively XML based
Philosophy Summary • Inside every complex XML transformation is a sequence of simpler XML transformations trying to get out – a pipeline • Decomposed transformation: • new transformations + • already componentized transformations • -> Component Reuse Nirvana
Pipeline Philosophy [Diagram: three nested levels, each with an In and an Out. Level 0 – transformation component. Level 1 – pipeline. Level 2 – rudimentary orchestration.]
Simple pipeline transformation component examples • Fundamental Operation – Rename Element • Input: <foo>baz</foo> • Output: <bar>baz</bar>
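A rename component could look roughly like this in Python. The function name `rename_element` is an assumption made for illustration; the input and output are the ones from the slide.

```python
import xml.etree.ElementTree as ET


def rename_element(xml_text, old, new):
    """Rename every <old> element to <new>, leaving content untouched."""
    root = ET.fromstring(xml_text)
    for el in root.iter(old):
        el.tag = new
    return ET.tostring(root, encoding="unicode")


renamed = rename_element("<foo>baz</foo>", "foo", "bar")
print(renamed)
```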
Simple pipeline transformation component examples • Fundamental Operation – Peel • Input: <foo><bar>baz</bar></foo> • Output: <foo>baz</foo>
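Peel is slightly more involved than rename because the removed element's text and children have to be spliced into its parent. The following is one possible sketch, assuming the root element itself is never peeled; `peel` and its helpers are hypothetical names.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, text):
    """Attach text after the parent's current last child, or as its leading text."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _peel_children(parent, tag):
    """Rebuild the parent's child list, splicing in the content of <tag> children."""
    children = list(parent)
    for child in children:
        parent.remove(child)
    for child in children:
        _peel_children(child, tag)      # handle nested occurrences first
        tail, child.tail = child.tail, None
        if child.tag == tag:
            _append_text(parent, child.text)
            for grandchild in list(child):
                parent.append(grandchild)
        else:
            parent.append(child)
        _append_text(parent, tail)


def peel(xml_text, tag):
    """Remove <tag> wrappers below the root while keeping what they contain."""
    root = ET.fromstring(xml_text)
    _peel_children(root, tag)
    return ET.tostring(root, encoding="unicode")


peeled = peel("<foo><bar>baz</bar></foo>", "bar")
print(peeled)
```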
Simple pipeline transformation component examples • Compound Operation – Matryoshka • Input: <foo><bar>baz</bar></foo> • Output: <foo></foo><bar></bar>baz
Simple pipeline transformation component examples • KlingonCloak • Input: <foo><bar>baz</bar></foo> • Output: <tag name="foo"><tag name="bar">baz</tag></tag>
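A possible sketch of KlingonCloak and its inverse, following the `name` attribute used in the slide's output. It assumes the input elements carry no `name` attribute of their own (it would be overwritten); `cloak` and `uncloak` are illustrative names.

```python
import xml.etree.ElementTree as ET


def cloak(xml_text):
    """Hide every element name inside a generic <tag name="..."> element."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        el.set("name", el.tag)   # assumption: no pre-existing name attribute
        el.tag = "tag"
    return ET.tostring(root, encoding="unicode")


def uncloak(xml_text):
    """Restore the original element names from the name attributes."""
    root = ET.fromstring(xml_text)
    for el in root.iter("tag"):
        el.tag = el.attrib.pop("name")
    return ET.tostring(root, encoding="unicode")


cloaked = cloak("<foo><bar>baz</bar></foo>")
restored = uncloak(cloaked)
print(cloaked)
print(restored)
```

A cloaked document lets generic components, written against a single element name, operate on any vocabulary before the names are restored.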
Simple pipeline transformation component examples • Reading a file is an XML to XML transformation • Input: <file>lewisscarrol.xml</file> • Output: <poem><line>’Twas brillig, and the slithy toves did gyre and gimble in the wabe</line>…</poem>
Simple pipeline transformation component examples • Arithmetic is an XML to XML transformation • <expr>1 + 2</expr> • <res>3</res>
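The arithmetic example can be written as an XML-in, XML-out component as well. This sketch assumes only the four basic operators on numeric literals are allowed and uses Python's `ast` module instead of `eval`; the function names are hypothetical.

```python
import ast
import operator
import xml.etree.ElementTree as ET

# Operators this sketch is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node):
    """Evaluate a parsed expression made of numbers and the operators above."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def arithmetic(xml_text):
    """Turn <expr>...</expr> into <res>...</res>."""
    expr = ET.fromstring(xml_text)
    value = _evaluate(ast.parse(expr.text.strip(), mode="eval").body)
    res = ET.Element("res")
    res.text = str(value)
    return ET.tostring(res, encoding="unicode")


answer = arithmetic("<expr>1 + 2</expr>")
print(answer)
```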
Simple pipeline transformation component examples • Unix pipe utilities e.g. tr • hello world • HELLO WORLD
A little orchestration in a transformation component • Conditionals are XML to XML transformation “tee junctions” triggered by XPaths [Diagram: In → if XPath → TRUE branch or FALSE branch]
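A tee junction might be sketched as a component that picks one of two downstream stages. The example relies on the limited XPath subset supported by Python's `xml.etree.ElementTree`; `tee` and `set_route` are made-up names, and "the XPath matches at least one node" is taken here as the meaning of TRUE.

```python
import xml.etree.ElementTree as ET


def set_route(value):
    """Hypothetical downstream stage: record which branch handled the document."""
    def stage(xml_text):
        root = ET.fromstring(xml_text)
        root.set("route", value)
        return ET.tostring(root, encoding="unicode")
    return stage


def tee(xpath, true_branch, false_branch):
    """Send the document down true_branch if the XPath matches, else false_branch."""
    def stage(xml_text):
        matched = ET.fromstring(xml_text).find(xpath) is not None
        return (true_branch if matched else false_branch)(xml_text)
    return stage


router = tee(".//urgent", set_route("express"), set_route("standard"))
express = router("<order><urgent/></order>")
standard = router("<order><item>tea</item></order>")
print(express)
print(standard)
```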
Validation as a transformation component XML A XML A’ RelaxNG Schematron Jython/Java/JACL XComponent Input Output Validation Log Error
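Validation fits the same mould: the document passes through unchanged when it is valid, and otherwise the failure is logged and signalled. The sketch below does not use RelaxNG or Schematron themselves; it assumes a much simpler, Schematron-like rule format of (XPath that must match, message) pairs, and `validator` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def validator(assertions, log):
    """Build a pass-through stage that rejects documents failing any assertion."""
    def stage(xml_text):
        root = ET.fromstring(xml_text)
        failures = [message for xpath, message in assertions
                    if root.find(xpath) is None]
        if failures:
            log.extend(failures)                      # the validation log
            raise ValueError("; ".join(failures))     # the error output
        return xml_text                               # XML A' is XML A, unchanged
    return stage


log = []
check = validator(
    [("title", "missing <title>"),
     ("body/para", "body needs at least one <para>")],
    log,
)

good = "<doc><title>T</title><body><para>p</para></body></doc>"
passed = check(good)

try:
    check("<doc><title>T</title><body/></doc>")
    rejected = False
except ValueError:
    rejected = True

print(passed == good, rejected, log)
```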
Sample Transformation Component Examples • Once you start thinking in terms of pipes – components appear everywhere: • Regular fragmentations • Doctype changer • Namespace normalizer • Character set transcoder • Hash generator • Architectural form processing • RelaxNG/Schematron etc
First objection • “It will be dog slow” or (stronger form): • “Re-usable tree transforming components won’t work in my shop – my XML files are too big to schlep around in strings, never mind DOMs!”
Document fulcra and the scatter/gather pattern • For any given transformation t to be performed on documents conforming to schema s, there is a fragment expression that can be used to chop each document into n pieces, on which t can be performed. • I call these points fulcra; they are a function of (t, s)
Identifying Fulcra • For data-oriented XML, the fulcra often coincide with the “record” iteration in the XML schema and may be independent of t. • For document-oriented XML, the fulcra are much more dependent on t.
Document fulcra and scatter/gather pattern • Having identified the fulcra:- • Chop the input document into fragments – scatter phase • Perform t • Join all the processed fragments together to constitute the output document – gather phase • Three stage pipeline – scatter & gather either side of the core component
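The three-stage scatter, t, gather pattern can be sketched as follows. The assumptions are mine: the fulcrum is a simple element name whose occurrences are direct children of the root, and t is the hypothetical per-fragment transformation `add_total`.

```python
import xml.etree.ElementTree as ET


def scatter(xml_text, fulcrum):
    """Chop the document at the fulcrum: return the empty shell and the fragments."""
    root = ET.fromstring(xml_text)
    fragments = []
    for record in root.findall(fulcrum):   # direct children only
        root.remove(record)
        record.tail = None                 # keep trailing whitespace out of the fragment
        fragments.append(ET.tostring(record, encoding="unicode"))
    return root, fragments


def add_total(fragment):
    """Hypothetical t: works on one small fragment using an ordinary tree API."""
    order = ET.fromstring(fragment)
    total = ET.SubElement(order, "total")
    total.text = str(int(order.findtext("qty")) * int(order.findtext("price")))
    return ET.tostring(order, encoding="unicode")


def gather(shell, fragments):
    """Join the processed fragments back into the output document."""
    for fragment in fragments:
        shell.append(ET.fromstring(fragment))
    return ET.tostring(shell, encoding="unicode")


source = ("<orders>"
          "<order><qty>2</qty><price>5</price></order>"
          "<order><qty>3</qty><price>4</price></order>"
          "</orders>")
shell, fragments = scatter(source, "order")
output = gather(shell, [add_total(f) for f in fragments])
print(output)
```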
Document Fulcra [Diagram, along a time axis: Input Doc → Scatter → n fragments → invoke t on each fragment → n fragments → Gather → Output Doc]
Document Fulcra • Note the data domain de-composition – SETI@Home meets XML markup. • Trivially parallelizable
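Since each fragment is independent, the t step can be handed to a pool of workers. The sketch uses a thread pool from Python's standard library purely to show the shape; in CPython, threads will not speed up CPU-bound work, so a process pool or remote workers would stand in for them in practice. `shout` is a hypothetical t.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def shout(fragment):
    """Hypothetical t: upper-case the text of a single <line> fragment."""
    line = ET.fromstring(fragment)
    line.text = (line.text or "").upper()
    return ET.tostring(line, encoding="unicode")


def parallel_apply(t, fragments, workers=4):
    """Apply t to every fragment concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(t, fragments))


fragments = [f"<line>fragment {i}</line>" for i in range(6)]
processed = parallel_apply(shout, fragments)
print(processed)
```

`pool.map` returns results in input order, so the gather phase can simply concatenate them.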
Document Fulcra • A good fulcra-based scatter/gather will make performance head north faster, more cheaply and with a higher upper limit than any amount of hand-crafted, genius-level XML coding of your transformations in horrid SAX or lexical parse mode. • Massive parallelism will kill all von Neumann throughput arguments • Documents per second, not seconds per document – throughput is the true measure of XML processing speed • Document fulcra – Locality of reference (Denning) applies to XML processing (more on this later)
More objections (with more answers) • It will be slow • No it won’t - Premature optimization is the root of all evil! • Speed is a three-headed monster. I’m old enough to have left the X axis and am currently heading for Y through Z [Diagram: The 3 Axes to Speed]
Some objections (with some answers) • Component based software? Harumph! We have heard that one before… • Pipelines are data flow based not API based (COM, VBX, CORBA) • Two pin interfaces and minimal “verbs” • The XML “payload” is what is important – not the API - RESTian
Revisiting the XSLT/DOM -> SAX non sequitur • XSLT and DOM are memory bound – trade off between ease of use and resource usage – ease of use favoured • SAX is not memory bound – trade off between ease of use and resource usage – low resource usage favoured • On xml-dev, users are often advised to rewrite their apps using SAX! Ugh!
XSLT/DOM -> pipeline • Pipelines and scatter/gather allow you to keep the ease of use of XSLT/DOM with the finite resource utilization of SAX • As long as you can identify a good fulcrum function • They exist more often than not • If they exist, they are very easily found and “drop out” of document analysis, e.g. XPath expressions in XSLT stylesheet templates
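One way to get tree-style convenience per fragment with roughly bounded memory is to scatter while streaming. The sketch below uses `ElementTree.iterparse` from Python's standard library and assumes the fulcrum elements are not nested inside one another; `stream_fragments`, `process_stream` and `shout` are hypothetical names, and the root's attributes are not carried over.

```python
import io
import xml.etree.ElementTree as ET


def stream_fragments(source, fulcrum):
    """Yield each fulcrum element as XML text, then discard its content."""
    for _event, el in ET.iterparse(source, events=("end",)):
        if el.tag == fulcrum:
            el.tail = None
            yield ET.tostring(el, encoding="unicode")
            el.clear()   # only an emptied shell stays attached to the root


def shout(fragment):
    """Hypothetical t: a small tree-based transformation on one fragment."""
    entry = ET.fromstring(fragment)
    entry.text = (entry.text or "").upper()
    return ET.tostring(entry, encoding="unicode")


def process_stream(source, fulcrum, t, out, wrapper):
    """Scatter from a stream, apply t, and gather straight into the output."""
    out.write(f"<{wrapper}>")
    for fragment in stream_fragments(source, fulcrum):
        out.write(t(fragment))
    out.write(f"</{wrapper}>")


source = io.BytesIO(b"<log><entry>a</entry><entry>b</entry></log>")
out = io.StringIO()
process_stream(source, "entry", shout, out, wrapper="log")
streamed = out.getvalue()
print(streamed)
```

Only one fragment is held as a full tree at a time, while `shout` is still written against an ordinary tree API.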
Pipelining and Grids • Grid Technologies – computational power “on tap” (http://www.gridforum.org) • A match made in heaven (bandwidth permitting)
An XML Processing Grid – on demand Out In Out DMZ
Grids - caveats • For large data volumes it is simply not feasible to shunt the data over the wire – Jim Gray • Organizations are sensitive about their data going beyond firewalls • Pay-per-use “racks” in your back-office are a better bet. – Rent a grid the way you would rent a chainsaw.
A Service Oriented Architecture “service” = XML transformation with optional side effects