LT PyXML: A fast validating XML parser embedded in Python

Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh



  1. LT PyXML: A fast validating XML parser embedded in Python Henry S. Thompson HCRC Language Technology Group University of Edinburgh

  2. Acknowledgements • This work was carried out in the Language Technology Group of the Human Communication Research Centre, whose baseline funding comes from the UK Economic and Social Research Council • The UK Engineering and Physical Sciences Research Council funded project NSCOPE, which stimulated some of the work discussed here today • This work was also helped by grants to our group from Sun Microsystems and Microsoft

  3. How we use SGML/XML • We use SGML and XML in the context of collecting, standardising, distributing, annotating and using large text collections (corpora) for computational linguistics research and development • These corpora are: • Large: 10-100 million words • Densely annotated: often every word has associated markup • DTDs and validation are very important to us

  4. An aside about validation • A DTD or schema is a contract between producers and consumers • It provides a guaranteed interface • Producers validate to ensure they are providing what they promised • Consumers validate to check up on producers • and to protect their applications • Application authors validate to simplify their task • Leave error detection and analysis to the validating parser

  5. How we use XML (2) • Like any other SME, we produce documents • Being a university-embedded SME, we produce lots of documents • Lots of those documents are trivial variations on one another, based on target medium and/or audience • Overhead slides for teaching • Web pages for publicity/teaching backup • Presentation slides for conferences • Research papers for monographs and journals

  6. Our application needs • Batch applications to automatically add linguistic annotation • Modular, pipelined programs supporting data parallelism • Specialised interactive editors to hand-correct markup • Authoring tools and publication tools which make content-sharing easy

  7. We built software: RXP & LT XML • because of the following issues: • Price • Efficiency • C-language interface • Documentation • Contrast with EXPAT • 50 to 100% slower • but still 90% faster than Java implementations • Thoroughly documented • Validates • Coverage nine nines identical

  8. LT XML: Basic Architecture • Pipelines of ‘fat’ streams • cf. Unix ‘thin’ streams • API provides primitives for XML-appropriate input and output • Two alternative views: • micro-sequence: start-tag, comment, char-data, end-tag, processing instruction • tree-structure: sequence of sub-trees, level ad lib.

  9. Flat view • provides GetNextBit which reads the next bit of XML: • Start/empty tags (including attributes and all values) • Text==PCDATA • End tags • Processing instructions • PrintBit will write one of these to an output stream
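As a rough illustration of the flat view (this is not the LT PyXML API), the sketch below uses Python's standard xml.parsers.expat module to turn a document into a sequence of "bits". The function name iter_bits and the tuple layout are assumptions made for this example. Unlike LT XML, expat does not validate, and it reports an empty tag as a start bit followed by an end bit.

```python
import xml.parsers.expat


def iter_bits(xml_text):
    """Return the document as a flat list of (kind, payload) 'bits'.

    Hypothetical helper for illustration only; LT PyXML's GetNextBit
    hands back NSL_Bit objects rather than tuples.
    """
    bits = []
    parser = xml.parsers.expat.ParserCreate()
    # Merge adjacent character-data callbacks into one text bit.
    parser.buffer_text = True
    parser.StartElementHandler = (
        lambda name, attrs: bits.append(("start", (name, dict(attrs)))))
    parser.EndElementHandler = lambda name: bits.append(("end", name))
    parser.CharacterDataHandler = lambda data: bits.append(("text", data))
    parser.ProcessingInstructionHandler = (
        lambda target, data: bits.append(("pi", (target, data))))
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return bits


if __name__ == "__main__":
    for kind, payload in iter_bits('<p type="std">Hi<?x y?></p>'):
        print(kind, payload)
```

A filter in a pipeline would read bits like these one at a time, act on the ones it cares about, and write the rest through unchanged.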

  10. Tree-structured view • Items are subtrees of the SGML structure • Reading • GetNextItem • GetNextQueryItem • Writing • PrintItem • The two views (flat or tree-structured) can be mixed to suit the needs of the application
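The tree-structured view can be approximated with the standard library as well. The sketch below, assuming a hypothetical helper named iter_items, uses xml.etree.ElementTree.iterparse to hand back one complete subtree at a time and then release it, which is the general idea behind reading items from a stream. It is an analogy, not the LT PyXML GetNextItem call.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(xml_bytes, tag):
    """Yield each complete <tag> subtree in document order.

    Hypothetical helper for illustration. Each subtree is cleared after
    the caller has seen it, so memory use stays bounded on large inputs;
    callers must take what they need before asking for the next item.
    """
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()


if __name__ == "__main__":
    doc = b"<doc><s>first</s><note/><s>second</s></doc>"
    for item in iter_items(doc, "s"):
        print(item.tag, item.text)
```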

  11. Query language • LT XML defines a query language which allows the specification of elements from an XML document • Queries are tree based, using element names, attribute values and textual data • Similar path-style syntax to XPath • Regular expressions are allowed for attribute values.

  12. Query language, continued • The LT XML query language is not a complete relational query language, although that can be built on top • For efficiency reasons, LT XML doesn't allow queries which require back-tracking or an unbounded amount of left context • The query language allows programmers to quickly find the sub-structure they are interested in, while ignoring the rest

  13. Query example .*/TEXT/./P[TYPE=STD]/S[1]
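For readers who know XPath better than the LT XML query syntax, the sketch below shows a loosely comparable selection using the limited XPath subset in Python's xml.etree.ElementTree. The mapping is an assumption: it reads ".*" as "any number of levels", "." as "any single element", and "[1]" as a position. XPath counts positions from 1, and the slide does not say which convention LT XML uses, so the two queries may not select the same S element. The sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample document; element and attribute names follow the slide.
SAMPLE = """
<DOC>
  <TEXT>
    <BODY>
      <P TYPE="STD"><S>one</S><S>two</S></P>
      <P TYPE="HEAD"><S>skip</S></P>
    </BODY>
  </TEXT>
</DOC>
"""

# Approximate ElementTree rendering of .*/TEXT/./P[TYPE=STD]/S[1]
# (an assumed translation, not an official equivalence).
APPROX_QUERY = ".//TEXT/*/P[@TYPE='STD']/S[1]"


def select_sentences(xml_text, path=APPROX_QUERY):
    """Return the text of every element matched by the path."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall(path)]


if __name__ == "__main__":
    print(select_sentences(SAMPLE))
```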

  14. Simple Tools are Simple to Build • Less than one page of C code to produce simple application • Pipelines mean you can compose simple tools for complex applications

  15. Pre-constructed Tools • Extract text content: textonly • Select fragments based on tags, attributes and text content: sggrep • Count tags: sgcount • Production-system style transformation: sgmltrans • Simple pattern-based information extraction: sgrpg • Indexing for fast access: mkindex

  16. Availability • Free to all for research use • Executables and libraries for Unix (Solaris, SunOS, Linux, FreeBSD) and Win32 • Sources for Unix • Packaged executable for Mac • http://www.ltg.ed.ac.uk/software/xml/

  17. What about user interaction? • C is not the world's easiest or most portable GUI-building environment • We have in-house clients who are happy with scripting languages • So we've embedded LT XML inside a number of other contexts • Common Lisp • Perl • Python • It's the Python embedding that's the main topic for today

  18. LT PyXML Basics • A C-implemented Python module • Integrates the LT XML API into Python • Architecture • Both views (bits and tree fragments) • Objects • including garbage collection • Functions • A modest subset • We've used the Tkinter module for all our GUI work, but Python has other GUI options

  19. LT PyXML functions • Files • Open, OpenString, Fopen, Close • Bits • GetNextBit, ItemParse • Attributes • GetAttrVal, ItemActualAttributes, PutAttrVal • Queries • ParseQuery, GetNextQueryItem • Printing • Print, PrintEndTag, PrintStartTag, PrintTextLiteral

  20. LT PyXML Objects • Use native Python lists and dictionaries where we can • New primitive Objects, often lazy wrt pullthrough • Files • NSL_File • Doctypes • NSL_Doctype, NSL_ElementType, NSL_AttrDefn, NSL_ContentParticle • Instances • NSL_Bit, NSL_Item, NSL_ERef , NSL_OOB • Queries • NSL_Query

  21. LT PyXML limitations • 8-bit character inventory (Python/Tk limitation) • I haven't delivered on the promise in the abstract, but • The binary is in the XED distributions • A proper release will appear shortly

  22. Three applications • XED • instance access minimal • doctype access minimal • Schema workbench • instance access paradigmatic • depends heavily on validation • XML DTD Normaliser • instance access non-existent • doctype access paradigmatic

  23. XED • A text editor for XML document instances • Implemented in Python using LT PyXML and Tkinter • Optimised for hand-authoring small- to medium-sized documents • Cross-platform • Free of charge • Sources not yet available

  24. XED features • Single-window WYSIWYG presentation • Add, remove and rename balanced start/end tag pairs and empty elements • Add, remove and rename attribute name/value pairs • Add or remove comments, CDATA sections and processing instructions • Context-sensitive tag and attribute menus

  25. XED features, cont'd • Filling of text content, indenting of element-only content • Structure-sensitive point-and-sweep selection paradigm • Structure-preserving cut and paste • Multiple undo • Key bindings based on xxxPad under WIN32; based on Emacs under Unix

  26. XED demo • See http://www.ltg.ed.ac.uk/ht/xed.html • The vast bulk of XED is Python/Tk, but it's made possible by LT PyXML • Control of text segments • Control of OOB processing • Context-sensitive menus are initialised from the DTD • Really helps newcomers to XML get started • Cannot produce ill-formed XML

  27. Schema Workbench demo • Not publicly available yet • Built to facilitate development of the XML Schema spec • When I started writing large schemata which exploited the refinement aspects of the public WD • I needed to see the type hierarchy • I needed to produce a normalised DTD to compare with the originals

  28. Schema Workbench features • The schema document to schema structures part of this took less than a day to write • Two main reasons • Validation on the way in meant • I could depend on the presence of required components • I didn't need to check for misplaced bits • Python's object-creation and evaluation facilities • Turned most NSL_Items directly into Python objects with object type == GI • Once I had the structures, implementing refinement was easy
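The idea of turning each item into a Python object whose type is its GI (element name) can be sketched with Python's built-in type() constructor. This is a minimal sketch under stated assumptions: it reads the document with xml.etree.ElementTree rather than LT PyXML, and the names Node, class_for and build are hypothetical. It also does no validation, so unlike the workbench it cannot rely on required components being present.

```python
import xml.etree.ElementTree as ET


class Node:
    """Common base for the dynamically created per-element classes."""

    def __init__(self, attrs, children, text):
        self.attrs = attrs
        self.children = children
        self.text = text


_classes = {}


def class_for(gi):
    """Return the class named after the element name, creating it once."""
    if gi not in _classes:
        _classes[gi] = type(gi, (Node,), {})
    return _classes[gi]


def build(elem):
    """Recursively convert an Element into an instance of its GI class."""
    cls = class_for(elem.tag)
    children = [build(child) for child in elem]
    return cls(dict(elem.attrib), children, (elem.text or "").strip())


if __name__ == "__main__":
    root = build(ET.fromstring('<schema><element name="a"/></schema>'))
    print(type(root).__name__, type(root.children[0]).__name__)
```

Because every element name gets its own class, later processing can dispatch on object type instead of comparing tag strings.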

  29. DTD normaliser • This was a two hour, 1.5 page job: • Find the DTD • Construct a string file which uses it • Open that string • Sort the doctype • Print the declarations, sorting disjunctions
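The last two steps above (sort the declarations, then print them with sorted disjunctions) can be sketched in a few lines. This is not the actual normaliser: it assumes the doctype has already been read into a plain dictionary mapping each element name to a flat list of alternatives, and it prints every content model as a starred choice group, which is a deliberate simplification. The function name normalise is hypothetical.

```python
def normalise(decls):
    """Render element declarations in a canonical order.

    decls maps an element name to a list of alternative child names.
    Declarations are sorted by element name and the alternatives inside
    each disjunction are sorted too, so two DTDs that differ only in
    ordering produce identical text and can be compared with diff.
    """
    lines = []
    for name in sorted(decls):
        alternatives = " | ".join(sorted(decls[name]))
        lines.append(f"<!ELEMENT {name} ({alternatives})*>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(normalise({"p": ["em", "b"], "body": ["p", "h1"]}))
```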

  30. I can't resist :-) • Once I got the tools built, I could diff the normalised XHTML draft DTD and the DTD produced from my XHTML schema • I found one error • in the DTD!

  31. When it's time to railroad, everybody railroads • The next big challenge for XML, and for Schemas particularly, is • Managing the mapping between document infoset and application infoset • LT PyXML has proved to be a useful laboratory for exploring this issue
