Processing of structured documents

Processing of structured documents Spring 2002, Part 1 Helena Ahonen-Myka

Course organization • 581290-5 laudatur course, 3 cu • lectures (in Finnish) • 22.1.-21.2. Tue 12-14, Thu 10-12 • not obligatory • exercise sessions • 29.1.-27.2. • course assistants: Olli Lahti and Miro Lehtonen (new group Wed 12-14 A318) • not obligatory

Requirements • Exam (Wed 6.3. at 16-20): 45 points • Project: 15 points • Exercises: 5 extra points • Maximum: 60 points

Outline (preliminary) • 1. Descriptions of structure • context-free grammars • namespaces, information sets • (XML DTD,) XML Schema • 2. Programming interfaces • SAX, DOM • SOAP • 3. Traversing documents • XPath

Outline... • 4. Querying structured documents • XML Query • 5. XML Linking • 6. XML databases • 7. Metadata: RDF • 8. Compressing XML data • 9. ...

Prerequisites • You should know the basics of XML • DTD, elements, attributes, syntax • XSLT (basics), formatting • some programming experience is needed

Group project • Group of 4-5 students • groups are formed in the exercise sessions in the 2nd week • Task: construct a toy B2B e-commerce application • a travel agency which sells packages containing hotel nights and concerts • a hotel (or several) • a concert ticket office

Group project • Task continues • a customer can reserve packages using a web page • a reservation causes a query to the hotels and the ticket offices for the availability of rooms and tickets • for all the communication and for the storage of all the documents you should use XML

Group project • Try to get some simple implementation working • may depend on the support we can offer • you don't have to consider all the real-life problems, like consistency of reservations • concentrate on playing with XML • the state of the work is presented in the last exercise sessions (this also applies to students who don't normally attend exercises)

Requirements for project • More instructions follow later... • return a report by 22.3. (as a URL) • The report should include • (short) requirements analysis • descriptions of the structure (DTD, Schema) • other designs, architecture, ... • Some kind of a working prototype • not necessarily the whole system

1. Structure descriptions • Regular expressions, context-free grammars -> What is XML? • (XML Document type definitions) • namespaces, information sets • XML Schema

Regular expressions • A way to describe a set of strings over an alphabet (of chars, events, elements…) • many uses: • text searching (e.g. emacs, grep, perl) • in grammatical formalisms (e.g. XML DTDs) • relevant for document structures: what kind of structural content is allowed for different document components

Regular expressions • A regular expression over alphabet Σ is either • ∅ (an empty set) • ε (epsilon; sometimes lambda λ) • a, where a ∈ Σ • R | S (choice; sometimes R ∪ S) • R S (catenation) or • R* (Kleene closure) • where R and S are regular expressions

Regular expressions • Regular expression E denotes a language (a set of strings) L(E): • L(∅) = ∅ (empty set) • L(ε) = {ε} (singleton set of empty string) • L(a) = {a} (singleton set of a ∈ Σ) • L(R|S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)} • L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)} • L(R*) = L(R)* = {x1…xn | xk ∈ L(R), k=1,…,n; n ≥ 0}
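The definition of L(E) can be tried out directly. The following is a minimal sketch, not part of the original slides: regular expressions are encoded as nested tuples (an encoding assumed here for illustration), and because L(R*) is infinite in general, only strings of at most `max_len` symbols are enumerated.

```python
# Regular expressions as nested tuples (an illustrative encoding):
# ("empty",), ("eps",), ("sym", a), ("alt", R, S), ("cat", R, S), ("star", R)

def language(expr, max_len):
    """Return the strings of L(expr) with at most max_len symbols, as tuples."""
    kind = expr[0]
    if kind == "empty":
        return set()                       # L(empty set) is empty
    if kind == "eps":
        return {()}                        # only the empty string
    if kind == "sym":
        return {(expr[1],)} if max_len >= 1 else set()
    if kind == "alt":
        return language(expr[1], max_len) | language(expr[2], max_len)
    if kind == "cat":
        left = language(expr[1], max_len)
        right = language(expr[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":
        # Drop the empty string from the base so every step adds a symbol.
        base = language(expr[1], max_len) - {()}
        result = {()}
        frontier = {()}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")


# (a | b)* restricted to strings of length <= 2: 1 + 2 + 4 = 7 strings
a_or_b_star = ("star", ("alt", ("sym", "a"), ("sym", "b")))
print(len(language(a_or_b_star, 2)))  # prints 7
```

Strings are represented as tuples of symbols so that the alphabet can consist of element names as well as single characters.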

Example • top-level structure of a document: • Σ = {title, author, date, sect} • title followed by an optional list of authors, followed by an optional date, followed by one or more sections: • title author* (date | ε) sect sect* • common abbreviations: • E? = (E | ε); E+ = E E* • -> title author* date? sect+
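As a rough illustration, this content model can be checked with an ordinary regular expression engine. The sketch below assumes the element names title, author, date and sect, and encodes a sequence of child elements as a string of space-terminated names; that encoding is a choice made for this example only.

```python
import re

# Content model "title, zero or more authors, optional date, one or more sects"
# as a regex over space-terminated element names (illustrative encoding).
CONTENT_MODEL = re.compile(r"(title )(author )*(date )?(sect )+")


def matches(children):
    """Check whether a list of child element names fits the content model."""
    encoded = "".join(name + " " for name in children)
    return CONTENT_MODEL.fullmatch(encoded) is not None


print(matches(["title", "author", "author", "date", "sect", "sect"]))  # True
print(matches(["title", "date"]))                                      # False
```

The second call fails because at least one section is required.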

Context-free grammars • Used widely for syntax specification (programming languages) • G = (V, Σ, P, S) • V: the alphabet of the grammar G; V = Σ ∪ N • Σ: the set of terminal symbols; N = V − Σ: the set of nonterminal symbols • P: set of productions • S ∈ N: the start symbol

Productions and derivations • Productions: A -> ω, where A ∈ N, ω ∈ V* • e.g. A -> aBa (1) • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> ω is a production of the grammar • e.g. AA => AaBa (assuming prod. 1 above)

Language generated by a context-free grammar • γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ • The language generated by a CFG G: • L(G) = {w ∈ Σ* | S =>* w} • L(G) is a set of strings: to model structural elements, we consider parse trees
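Direct derivation and the generated language can be sketched in a few lines. The code below is an illustration under stated assumptions: sentential forms are tuples of symbols, a grammar is a dictionary from nonterminals to right-hand sides, and the grammar S -> a S b | ε is a hypothetical example that does not appear in the slides. The search prunes long sentential forms to stay finite, which is enough for this small grammar but is not a complete procedure in general.

```python
def derive_directly(form, productions):
    """Yield every form obtained from `form` by rewriting one nonterminal once."""
    for i, symbol in enumerate(form):
        for rhs in productions.get(symbol, ()):
            yield form[:i] + rhs + form[i + 1:]


def generated_strings(start, productions, max_len):
    """Collect terminal strings w with start =>* w and at most max_len symbols."""
    nonterminals = set(productions)
    seen = {(start,)}
    todo = [(start,)]
    words = set()
    while todo:
        form = todo.pop()
        if not any(symbol in nonterminals for symbol in form):
            words.add(form)          # no nonterminal left: form is in L(G)
            continue
        for successor in derive_directly(form, productions):
            terminals = sum(symbol not in nonterminals for symbol in successor)
            # Pruning keeps the search finite; it is a heuristic bound.
            if (terminals <= max_len and len(successor) <= max_len + 2
                    and successor not in seen):
                seen.add(successor)
                todo.append(successor)
    return words


# Production (1) from the slide: A -> aBa, so AA => AaBa (and AA => aBaA).
PROD_1 = {"A": [("a", "B", "a")]}
print(list(derive_directly(("A", "A"), PROD_1)))

# Hypothetical grammar S -> a S b | epsilon.
BALANCED = {"S": [("a", "S", "b"), ()]}
print(sorted(generated_strings("S", BALANCED, 4)))
```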

Parse trees of a CFG • Aka syntax trees or derivation trees • nodes labelled by symbols of V (or by ε): • internal nodes by nonterminals, root by start symbol • leaves using terminal symbols (or ε) • parent with label A can have children labelled by X1,…,Xk only if A -> X1…Xk is a production

CFGs for document structures • Nonterminals represent document structures • e.g. Ref -> AuthorList Title PublData; AuthorList -> Author AuthorList; AuthorList -> ε • problem: • obscures the relation of elements (the last Author is several hierarchical levels away from Ref) -> solution: extended CFGs

Extended CFGs (ECFGs) • Like CFGs, but right-hand sides of productions are regular expressions over V, e.g. Ref -> Author* Title PublData • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> E is a production such that ω ∈ L(E) • e.g. Ref => Author Author Author Title PublData

Language generated by an ECFG • Defined similarly to CFGs • Theorem: The languages generated by extended and ordinary CFGs are the same

Parse trees of an ECFG • Similar to parse trees of an ordinary CFG, except that… • parent with label A can have children labelled by X1,…,Xk when A -> E is a production such that X1…Xk ∈ L(E) • -> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
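The parse tree condition for ECFGs can be checked mechanically. In the sketch below, a tree node is a pair (label, children) and each production's right-hand side is a regex over space-terminated child labels. Only the production Ref -> Author* Title PublData comes from the slides; treating Author, Title and PublData as leaves is an assumption made to keep the example small.

```python
import re

# ECFG productions: nonterminal -> regex over space-terminated child labels.
PRODUCTIONS = {"Ref": re.compile(r"(Author )*Title PublData ")}


def is_parse_tree(node):
    """Check that every internal node's children spell a string in L(E)."""
    label, children = node
    if label not in PRODUCTIONS:
        return not children          # symbols without a production must be leaves
    child_labels = "".join(child[0] + " " for child in children)
    if PRODUCTIONS[label].fullmatch(child_labels) is None:
        return False
    return all(is_parse_tree(child) for child in children)


three_authors = ("Ref", [("Author", []), ("Author", []), ("Author", []),
                         ("Title", []), ("PublData", [])])
print(is_parse_tree(three_authors))  # True
```

All three Author nodes are direct children of Ref, which is exactly the flat structure that an ordinary CFG with a recursive AuthorList cannot express.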

What is XML? • a metalanguage that can be used to define markup languages • gives a syntax for defining extended context-free grammars • XML documents that adhere to an ECFG are strings in the language generated by that grammar • document types (grammars) vs. document instances (strings in the language)

XML encoding of structure • an XML document is essentially a parenthesized linear encoding of a parse tree • corresponds to a preorder walk • start of inner node (element) A denoted by a start tag <A>, end denoted by end tag </A> • leaves are strings (or empty elements) • + certain extensions (especially attributes)
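The encoding described above can be sketched as a recursive preorder walk. This is an illustration only: the tree representation (a string for a text leaf, a (label, children) pair for an element) is assumed here, and attributes and escaping of characters such as < and & are deliberately left out.

```python
def encode(node):
    """Serialize a parse tree into XML text by a preorder walk."""
    if isinstance(node, str):
        return node                      # leaf: character data
    label, children = node
    if not children:
        return f"<{label}/>"             # leaf: empty element
    inner = "".join(encode(child) for child in children)
    return f"<{label}>{inner}</{label}>" # start tag, children in order, end tag


tree = ("Ref", [("Author", ["A. Author"]), ("Title", ["A title"]), ("PublData", [])])
print(encode(tree))
# <Ref><Author>A. Author</Author><Title>A title</Title><PublData/></Ref>
```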

Terminal symbols in practice • Leaves of parse trees are labelled by single characters (symbols of Σ) • too granular in practice: instead, use terminal symbols that stand for all values of a type • e.g. #PCDATA in XML for variable-length content of data characters • richer data types in XML schema formalisms

An example DTD
<!DOCTYPE invoice [
  <!ELEMENT invoice (orderDate, shipDate, billingAddress, voice*, fax?)>
  <!ELEMENT orderDate (#PCDATA)>
  <!ELEMENT shipDate (#PCDATA)>
  <!ELEMENT billingAddress (name, street, city, state, zip)>
  <!ELEMENT voice (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT street (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT state (#PCDATA)>
  <!ELEMENT zip (#PCDATA)>
]>

And a document:
<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>
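To connect the DTD and the document, the sketch below parses the invoice with Python's standard `xml.etree.ElementTree` and checks each element's children against a regex. The content models in `CONTENT_MODELS` are hand-translated from the example DTD for this illustration (reading the invoice model as the sequence orderDate, shipDate, billingAddress, voice*, fax?); this is not a DTD validator, and elements declared as #PCDATA are only checked for having no child elements.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models over space-terminated child element names.
CONTENT_MODELS = {
    "invoice": r"orderDate shipDate billingAddress (voice )*(fax )?",
    "billingAddress": r"name street city state zip ",
}

DOCUMENT = """<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>"""


def conforms(element):
    """Check an element and its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return len(element) == 0     # treated as #PCDATA: no child elements
    names = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, names) is None:
        return False
    return all(conforms(child) for child in element)


print(conforms(ET.fromstring(DOCUMENT)))  # True
```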

XML processing model • A processor (parser) • reads XML documents • passes data to an application • XML Specification tells how to read, what to pass

Well-formed XML documents • documents that adhere to the formal requirements (syntax) of the XML specification • if a document is not well-formed, it is not an XML document (and the XML tools do not have to process it)
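A quick way to see the difference between well-formed and not well-formed input is to hand it to a parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which is non-validating: it rejects input that is not well-formed by raising `ParseError`, but it does not check a document against a DTD. The helper name `is_well_formed` is introduced here for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a><b/></a>"))      # True
print(is_well_formed("<a><b></a>"))       # False: <b> is never closed
print(is_well_formed("<a></a><b></b>"))   # False: two root elements
```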

Valid documents • a document is a valid XML document if it is well-formed and adheres to the structure defined in the given DTD • an XML processor can be validating or non-validating • sometimes validity is important, sometimes not
