High-Performance XML Parsing and Validation with Permutation Phrase Grammar Parsers
Wei Zhang & Robert van Engelen
Department of Computer Science, Florida State University
IEEE ICWS 2008

Presentation Overview
• Schema-specific Parsers
• Related Work
• PTDX: Table-Driven XML Parser with Permutation Phrase Grammar
• Performance
• Conclusion

Schema-specific Parsers
• Compile-time vs. run-time parsers
  • Compile-time parsing and validation approaches use specialized compilation techniques to generate customized parsers from schemas
  • Run-time approaches use generic drivers (or engines) and a grammar-like representation of schemas
• Blocking vs. non-blocking parsers
  • Blocking parsers may suspend the entire program until sufficient XML content has been received, e.g. recursion-based parsers
  • Non-blocking parsers always return control to the program, and buffered data can be supplied incrementally
• Time-efficient vs. space-efficient parsers
  • Time-efficient parsers encode many states
  • Space-efficient parsers rely on backtracking

Related Work
• [Van Engelen, 2001]
  • The earliest work on a schema-specific LL(1) recursive descent parser with namespace support and validation
• [Van Engelen, 2004]
  • A two-level DFA integrating parsing and validation
• [Chiu et al., 2004]
  • Uses nondeterministic generalized automata to merge all aspects of low-level parsing and validation
• [Reuter, 2003]
  • Uses a Cardinality-Constraint Automaton (CCA) to perform schema-aware validation

Related Work (Cont'd)
• [Kostoulas et al., 2006]
  • An efficient parser generator that translates an XML schema into a parser in either C or Java
• [Matsa, 2007]
  • A schema-directed interpretive XML parser using special-purpose byte codes
• [Zhang et al., 2006]
  • A table-driven approach that parses and validates in a single pass
  • A generator that translates schemas into C

PTDX: Table-Driven XML Parser with Permutation Phrase Grammar
• Table-driven grammar-based parser
  • Extended LL(1) grammar with permutation phrase support
  • The parsing table is constructed from the extended LL(1) permutation grammar
• Run-time parser
  • Generic parsing engine (2-stack PDA)
• Both time and space efficient
  • Predictive parsing
  • Integrates parsing and validation into a single pass
  • No buffering
  • Operates on tokens
  • Main stack size grows with the nesting depth of the XML data
  • Auxiliary stack size grows with the number of elements of <xs:all> and <xs:attribute>
• Non-blocking parser

Constructing PTDX Tables
[Diagram labels: XML Schemas; Mapping Rules; Extended LL(1) Permutation Phrase Grammar; LL(1) Parsing Table; Token Table; Action Table]
Note: actions are generated from schemas to perform type checking, although some validation constraints are incorporated into the grammar productions.

Mapping Rules
• Define the translation from schema components to LL(1) grammar productions
• Preserve structural constraints
• Map free-ordered schema components (<xs:all>, <xs:attribute>) to the permutation grammar

Mapping Example
  <complexType name="T">
    <all>
      <element name="a" type="string" minOccurs="0"/>
      <element name="b" type="string"/>
      <element name="c" type="string"/>
    </all>
  </complexType>

  T → << A || B || C >>
  A → bA CD eA
  A → ε
  B → bB CD eB
  C → bC CD eC
Note: bA and eA represent the tokens for the start tag and end tag of element "a", respectively; CD represents the CDATA token.
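
The mapping in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PTDX generator: the function names `map_all_to_productions` and `format_production` are hypothetical, every child element is assumed to be a simple string-typed element (hence the single CD token), and non-terminal names are assumed to be the upper-cased element names.

```python
import xml.etree.ElementTree as ET

# The schema fragment from the mapping example (no namespace prefix, as on the slide).
SCHEMA = """
<complexType name="T">
  <all>
    <element name="a" type="string" minOccurs="0"/>
    <element name="b" type="string"/>
    <element name="c" type="string"/>
  </all>
</complexType>
"""

def map_all_to_productions(schema_xml):
    """Translate a complexType holding an <all> group into grammar productions.

    Hypothetical helper: returns a list of (lhs, rhs) pairs, where rhs is a
    list of symbols and an empty list stands for epsilon.
    """
    ctype = ET.fromstring(schema_xml)
    constituents = []
    productions = []
    for elem in ctype.find("all").findall("element"):
        nt = elem.get("name").upper()          # assumed naming: element "a" -> A
        constituents.append(nt)
        # start-tag token, character data, end-tag token
        productions.append((nt, [f"b{nt}", "CD", f"e{nt}"]))
        if elem.get("minOccurs") == "0":
            productions.append((nt, []))       # optional element: add an epsilon rule
    phrase = "<< " + " || ".join(constituents) + " >>"
    return [(ctype.get("name"), [phrase])] + productions

def format_production(lhs, rhs):
    """Render one production in the notation used on the slide."""
    return f"{lhs} → {' '.join(rhs) if rhs else 'ε'}"

for lhs, rhs in map_all_to_productions(SCHEMA):
    print(format_production(lhs, rhs))
```

Running the sketch prints the five productions listed in the example, with the epsilon rule emitted only for the element that has minOccurs="0".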

Permutation Phrase
A permutation phrase is a grammatical phrase that specifies a syntactic construct as any permutation of a set of constituent elements. E.g., the permutation phrase << a || b || c >> recognizes the language {abc, acb, bac, bca, cab, cba}.
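
As a quick check of the definition, the language of a permutation phrase can be enumerated directly. The helper name `permutation_language` is hypothetical, and the sketch assumes single-symbol constituents with none of them optional.

```python
from itertools import permutations

def permutation_language(constituents):
    """Return every string obtained by ordering all constituents exactly once."""
    return {"".join(order) for order in permutations(constituents)}

language = permutation_language(["a", "b", "c"])
print(sorted(language))  # ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
```

A phrase with n constituents denotes n! strings, so writing each ordering out as a separate grammar alternative grows factorially with n.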

Two-stack PDA for Parsing Permutation Phrase << a || b || c >>
[Figure: snapshots 1 to 3 of the main stack and auxiliary stack while parsing the input b c a]

Two-stack PDA for Parsing Permutation Phrase (Cont'd) << a || b || c >>
[Figure: snapshots 4 to 6 of the main stack and auxiliary stack while parsing the input b c a]
Note: All optional constituent elements are left on the auxiliary stack once all non-empty elements have been parsed.
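
The two-stack idea can be sketched as a small recognizer. This is a simplified illustration, not the PTDX engine: it hard-codes the grammar from the mapping example instead of using generated parsing tables, and the names `PERM`, `PRODUCTIONS`, `NULLABLE`, `PHRASES`, `parse` and `element_tokens` are all hypothetical. The main stack holds grammar symbols still to be matched; the auxiliary stack holds the set of permutation constituents not yet seen.

```python
PERM = "<<perm>>"                       # marker: a permutation phrase is in progress
PHRASES = {"T": ["A", "B", "C"]}        # T -> << A || B || C >>
PRODUCTIONS = {                          # the non-empty production of each constituent
    "A": ["bA", "CD", "eA"],
    "B": ["bB", "CD", "eB"],
    "C": ["bC", "CD", "eC"],
}
NULLABLE = {"A"}                        # A -> epsilon (minOccurs="0")
FIRST = {rhs[0]: nt for nt, rhs in PRODUCTIONS.items()}  # start token -> constituent

def parse(tokens, start="T"):
    """Return True if the token list is a valid instance of the start symbol."""
    main, aux, pos = [start], [], 0
    while main:
        top = main[-1]
        look = tokens[pos] if pos < len(tokens) else None
        if top in PHRASES:
            # Enter the phrase: remember on the aux stack what is still pending.
            main[-1] = PERM
            aux.append(set(PHRASES[top]))
        elif top == PERM:
            pending = aux[-1]
            nt = FIRST.get(look)
            if nt in pending:
                pending.remove(nt)                        # seen once, never again
                main.extend(reversed(PRODUCTIONS[nt]))    # predict its production
            elif pending <= NULLABLE:
                aux.pop()                                 # only optional ones remain
                main.pop()
            else:
                return False                              # missing, duplicate or unknown
        else:
            if look != top:                               # terminal must match input
                return False
            main.pop()
            pos += 1
    return pos == len(tokens)

def element_tokens(order):
    """Token stream for elements given by their constituent letters, e.g. 'BCA'."""
    return [tok for nt in order for tok in (f"b{nt}", "CD", f"e{nt}")]

print(parse(element_tokens("BCA")))   # True: the order b c a is accepted
print(parse(element_tokens("BC")))    # True: optional element a is omitted
print(parse(element_tokens("CA")))    # False: required element b is missing
```

In this sketch the auxiliary stack entry shrinks as constituents are matched, and whatever remains when the phrase ends must be optional, which mirrors the note on the slide.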

PTDX Architecture
[Architecture diagram; label: Hot-swappable]

Schema-directed Scanner
• Optimized by schema
  • E.g., scanning for a specific tag name is more efficient than scanning a generic string and then comparing it
• Tokenizer
  • Breaks the XML message into a token stream
• Token
  • Defined by element names, attribute names, and enumeration values
  • Classified as starting tags and closing tags
• Normalized namespace binding
  • <namespace, tag_name>
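
The token stream and the normalized <namespace, tag_name> binding can be illustrated with a short sketch. It is not the schema-directed scanner itself: it relies on Python's generic `xml.etree.ElementTree.iterparse` rather than schema-optimized scanning, and the namespace URI `urn:example`, the `TOKEN_TABLE` contents, and the function names are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

NS = "urn:example"                       # hypothetical target namespace
TOKEN_TABLE = {                          # (namespace, tag_name) -> (start, end) tokens
    (NS, "a"): ("bA", "eA"),
    (NS, "b"): ("bB", "eB"),
    (NS, "c"): ("bC", "eC"),
}

def split_tag(tag):
    """Turn ElementTree's '{uri}local' form into a (namespace, tag_name) pair."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

def tokenize(xml_bytes):
    """Break a message into tokens; elements not in the table are skipped here,
    whereas a validating scanner would report them."""
    tokens = []
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        key = split_tag(elem.tag)
        if key not in TOKEN_TABLE:
            continue
        start_tok, end_tok = TOKEN_TABLE[key]
        if event == "start":
            tokens.append(start_tok)
        else:
            # Text is only guaranteed complete at the end event.
            if elem.text and elem.text.strip():
                tokens.append("CD")
            tokens.append(end_tok)
    return tokens

default_ns = b'<t xmlns="urn:example"><b>x</b><c>y</c><a>z</a></t>'
prefixed = b'<p:t xmlns:p="urn:example"><p:b>x</p:b><p:c>y</p:c><p:a>z</p:a></p:t>'
print(tokenize(default_ns))
print(tokenize(prefixed) == tokenize(default_ns))
```

Both messages yield the same tokens because the lookup key is the namespace URI plus the local name, not the prefix used in the document.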

Experiment Settings
• Test environment
  • 3.0 GHz, 2 GB RAM, Linux 2.6.20-1.2320, GCC 4.1.1 with option -O2
  • Memory-resident message
  • Randomly arranged free-ordered elements
• Compared with
  • Validating parsers
    • gSOAP 2.7
    • Xerces 2.7.0
    • PTDX flex-based parser
  • Non-validating parsers
    • Expat 2.0.1
    • DFA-based parser

Test Cases

Performance: comparison of validating and non-validating parsers
[Chart; annotation: "Better performance"]

Performance: effect of the number of elements in <xs:all> on the PTDX parser
[Chart; annotation: "Better performance"]

Performance: run-time and compile-time memory usage comparison (32 <xs:all> elements)

Conclusion
• Free-ordered constraints can be parsed and validated efficiently using a 2-stack PDA
• The table-driven permutation phrase grammar parsing technique is time and space optimal
• The table-driven approach offers a flexible framework for dealing with schema evolution