Processing of structured documents Spring 2002, Part 1 Helena Ahonen-Myka
Course organization • 581290-5 laudatur course, 3 cu • lectures (in Finnish) • 22.1.-21.2. Tue 12-14, Thu 10-12 • not obligatory • exercise sessions • 29.1.-27.2. • course assistants: Olli Lahti and Miro Lehtonen (new group Wed 12-14 A318) • not obligatory
Requirements • Exam (Wed 6.3. at 16-20): 45 points • Project: 15 points • Exercises: 5 extra points • Maximum of points: 60
Outline (preliminary) • 1. Descriptions of structure • context-free grammars • namespaces, information sets • (XML DTD,) XML Schema • 2. Programming interfaces • SAX, DOM • SOAP • 3. Traversing documents • XPath
Outline... • 4. Querying structured documents • XML Query • 5. XML Linking • 6. XML databases • 7. Metadata: RDF • 8. Compressing XML data • 9. ...
Prerequisites • You should know the basics of XML • DTD, elements, attributes, syntax • XSLT (basics), formatting • some programming experience is needed
Group project • Group of 4-5 students • groups are formed in the exercise sessions in the 2nd week • Task: construct a toy B2B e-commerce application • a travel agency which sells packages containing hotel nights and concerts • a hotel (or several) • a concert ticket office
Group project • Task continues • a customer can reserve packages using a web page • a reservation causes a query to the hotels and the ticket offices for the availability of rooms and tickets • for all the communication and for the storage of all the documents you should use XML
Group project • Try to get some simple implementation working • how much may depend on the support we can offer • you don't have to consider all the real-life problems, like consistency of reservations • concentrate on playing with XML • the state of the work is presented in the last exercise sessions (this also applies to students who don't normally attend the exercises)
Requirements for project • More instructions follow later... • return a report by 22.3. (as a URL) • The report should include • (short) requirements analysis • descriptions of the structure (DTD, Schema) • other designs, architecture, ... • Some kind of a working prototype • not necessarily the whole system
1. Structure descriptions • Regular expressions, context-free grammars -> What is XML? • (XML Document type definitions) • namespaces, information sets • XML Schema
Regular expressions • A way to describe sets of strings over an alphabet (of chars, events, elements…) • many uses: • text searching (e.g. emacs, grep, perl) • in grammatical formalisms (e.g. XML DTDs) • relevant for document structures: what kind of structural content is allowed for different document components
Regular expressions • A regular expression over alphabet Σ is either • ∅ (an empty set) • ε (epsilon; sometimes lambda λ) • a, where a ∈ Σ • R | S (choice; sometimes R ∪ S) • R S (catenation) or • R* (Kleene closure) • where R and S are regular expressions
Regular expressions • Regular expression E denotes a language (a set of strings) L(E): • L(∅) = ∅ (empty set) • L(ε) = {ε} (singleton set of empty string) • L(a) = {a} (singleton set of a, for a ∈ Σ) • L(R|S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)} • L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)} • L(R*) = L(R)* = {x1…xn | xk ∈ L(R), k=1,…,n; n ≥ 0}
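The definition of L(E) can be followed case by case in code. The block below is a minimal sketch, not part of the original slides: the tuple encoding of expressions and the function name language are assumptions made for illustration, and because L(R*) is usually infinite the sketch only enumerates strings up to a given length.

```python
# Regular expressions as nested tuples (a hypothetical encoding for this sketch):
#   ("empty",)  ("eps",)  ("sym", a)  ("alt", R, S)  ("cat", R, S)  ("star", R)

def language(expr, max_len):
    """Return the strings of L(expr) with at most max_len symbols, as tuples."""
    kind = expr[0]
    if kind == "empty":                      # L(∅) = ∅
        return set()
    if kind == "eps":                        # L(ε) = {ε}
        return {()}
    if kind == "sym":                        # L(a) = {a}
        return {(expr[1],)} if max_len >= 1 else set()
    if kind == "alt":                        # L(R|S) = L(R) ∪ L(S)
        return language(expr[1], max_len) | language(expr[2], max_len)
    if kind == "cat":                        # L(RS) = {xy | x ∈ L(R), y ∈ L(S)}
        left = language(expr[1], max_len)
        right = language(expr[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                       # L(R*) = {x1…xn | xk ∈ L(R), n ≥ 0}
        # Dropping ε from the base guarantees every step makes the string longer,
        # so the loop stops once max_len is exceeded.
        base = language(expr[1], max_len) - {()}
        result = {()}
        frontier = {()}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")

a, b = ("sym", "a"), ("sym", "b")
a_then_bs = ("cat", a, ("star", b))          # the expression a b*
print(sorted(language(a_then_bs, 3), key=len))
```

Strings are represented as tuples of symbols rather than Python strings so that the alphabet can consist of element names as well as single characters.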
Example • top-level structure of a document: • Σ = {title, author, date, sect} • title followed by an optional list of authors, followed by an optional date, followed by one or more sections: • title author* (date | ε) sect sect* • common abbreviations: • E? = (E | ε); E+ = E E* • -> title author* date? sect+
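The abbreviated expression can be tried out with an ordinary regex engine. The sketch below assumes the element names of the alphabet {title, author, date, sect}; the trick of encoding each name as a space-terminated token and the name matches_model are illustrative choices, not something the slides prescribe.

```python
import re

# The content model "title author* date? sect+", written over space-terminated
# tokens so that one regex character class does not have to stand for a whole
# element name.
CONTENT_MODEL = re.compile(r"title (author )*(date )?(sect )+")

def matches_model(children):
    """Check whether a sequence of element names is in the language of the model."""
    encoded = "".join(name + " " for name in children)
    return CONTENT_MODEL.fullmatch(encoded) is not None

print(matches_model(["title", "author", "author", "date", "sect", "sect"]))
print(matches_model(["title", "date"]))   # no section, so not in the language
```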
Context-free grammars • Used widely for syntax specification (programming languages) • G = (V, Σ, P, S) • V: the alphabet of the grammar G; V = Σ ∪ N • Σ: the set of terminal symbols; N = V − Σ: the set of nonterminal symbols • P: set of productions • S ∈ N: the start symbol
Productions and derivations • Productions: A -> α, where A ∈ N, α ∈ V* • e.g. A -> aBa (1) • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> ω is a production of the grammar • e.g. AA => AaBa (assuming prod. 1 above)
Language generated by a context-free grammar • γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ • The language generated by a CFG G: • L(G) = {w ∈ Σ* | S =>* w} • L(G) is a set of strings: to model structural elements, we consider parse trees
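Direct derivation and the generated language can be sketched in a few lines. The grammar used here (S -> a S b | ε) is a hypothetical toy example, not one from the slides, and the function names are assumptions; the search is bounded by word length because L(G) is generally infinite.

```python
from collections import deque

# Hypothetical toy grammar: S -> a S b | ε, which generates the strings a^n b^n.
NONTERMINALS = {"S"}
PRODUCTIONS = {"S": [("a", "S", "b"), ()]}

def derive_directly(gamma):
    """All strings delta with gamma => delta (one nonterminal occurrence rewritten)."""
    results = set()
    for i, symbol in enumerate(gamma):
        if symbol in NONTERMINALS:
            for rhs in PRODUCTIONS.get(symbol, []):
                results.add(gamma[:i] + rhs + gamma[i + 1:])
    return results

def generated_language(start, max_len):
    """Terminal strings w with start =>* w and len(w) <= max_len (breadth-first)."""
    seen = {(start,)}
    queue = deque([(start,)])
    words = set()
    while queue:
        gamma = queue.popleft()
        if all(s not in NONTERMINALS for s in gamma):
            words.add("".join(gamma))
            continue
        for delta in derive_directly(gamma):
            # Terminals never disappear in a derivation, so a sentential form with
            # more than max_len terminals cannot lead to a short enough word.
            terminals = sum(1 for s in delta if s not in NONTERMINALS)
            if terminals <= max_len and delta not in seen:
                seen.add(delta)
                queue.append(delta)
    return words

print(sorted(generated_language("S", 4), key=len))
```

The pruning step is enough to make the search terminate for this toy grammar; a grammar whose sentential forms can grow without adding terminals would need an additional bound.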
Parse trees of a CFG • Aka syntax trees or derivation trees • nodes labeled by symbols of V (or by ε): • internal nodes by nonterminals, root by start symbol • leaves by terminal symbols (or by ε) • a parent with label A can have children labeled by X1,…,Xk only if A -> X1…Xk is a production
CFGs for document structures • Nonterminals represent document structures • e.g. Ref -> AuthorList Title PublData; AuthorList -> Author AuthorList; AuthorList -> ε • problem: • obscures the relation of elements (the last Author is several hierarchical levels away from Ref) -> solution: extended CFGs
Extended CFGs (ECFGs) • Like CFGs, but right-hand sides of productions are regular expressions over V, e.g. Ref -> Author* Title PublData • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> E is a production such that ω ∈ L(E) • e.g. Ref => Author Author Author Title PublData
Language generated by an ECFG • Defined similarly to CFGs • Theorem: Languages generated by extended and ordinary CFGs are the same
Parse trees of an ECFG • Similar to parse trees of an ordinary CFG, except that… • a parent with label A can have children labeled by X1,…,Xk when A -> E is a production such that X1…Xk ∈ L(E) • -> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
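The parse tree condition for ECFGs can be checked recursively, node by node. The following is a rough sketch under stated assumptions: the productions for Author, Title and PublData, the "#text" label for character-data leaves, and the names ECFG, label_of and is_parse_tree are all made up for illustration; only Ref -> Author* Title PublData comes from the slides.

```python
import re

# Hypothetical ECFG, written as regular expressions over child labels.
# Each label is encoded as a space-terminated token so that Python's re module
# can match a whole child sequence; "#text" stands for a character-data leaf.
ECFG = {
    "Ref": r"(Author )*Title PublData ",
    "Author": r"#text ",
    "Title": r"#text ",
    "PublData": r"#text ",
}

def label_of(node):
    """A node is either a string (leaf) or a (label, children) pair."""
    return "#text" if isinstance(node, str) else node[0]

def is_parse_tree(node):
    """Check that every internal node's children spell a string in L(E) for A -> E."""
    if isinstance(node, str):
        return True
    label, children = node
    if label not in ECFG:
        return False
    encoded = "".join(label_of(child) + " " for child in children)
    if re.fullmatch(ECFG[label], encoded) is None:
        return False
    return all(is_parse_tree(child) for child in children)

ref = ("Ref", [("Author", ["A"]), ("Author", ["B"]), ("Author", ["C"]),
               ("Title", ["T"]), ("PublData", ["2002"])])
print(is_parse_tree(ref))
```

Note that all three Author nodes are direct children of Ref, which is exactly what the plain CFG with AuthorList could not express.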
What is XML? • a metalanguage that can be used to define markup languages • gives a syntax for defining extended context-free grammars • XML documents that adhere to an ECFG are strings in the language it generates • document types (grammars) vs. document instances (strings in the language)
XML encoding of structure • XML document essentially a parenthesized linear encoding of a parse tree • corresponds to a preorder walk • start of inner node (element) A denoted by a start tag <A>, end denoted by end tag </A> • leaves are strings (or empty elements) • + certain extensions (especially attributes)
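The encoding can be made concrete with a short preorder walk. This is a minimal sketch that ignores attributes and the other extensions; the tree representation (strings for leaves, (label, children) pairs for elements), the function name to_xml and the sample tree are assumptions for illustration.

```python
from xml.sax.saxutils import escape

def to_xml(node):
    """Encode a parse tree as XML by a preorder walk.

    A node is either a string (a leaf) or a (label, children) pair (an element).
    """
    if isinstance(node, str):
        return escape(node)               # leaf: character data, with &, <, > escaped
    label, children = node
    if not children:
        return f"<{label}/>"              # leaf: empty element
    inner = "".join(to_xml(child) for child in children)
    return f"<{label}>{inner}</{label}>"  # start tag, subtrees in order, end tag

tree = ("Ref", [("Author", ["A. Author"]), ("Title", ["A title"]), ("PublData", [])])
print(to_xml(tree))
```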
Terminal symbols in practice • Leaves of parse trees are labeled by single characters (symbols of Σ) • too granular in practice: instead terminal symbols which stand for all values of a type • e.g. #PCDATA in XML for variable length content of data characters • richer data types in XML schema formalisms
An example DTD
<!DOCTYPE invoice [
  <!ELEMENT invoice (orderDate, shipDate, billingAddress, voice*, fax?)>
  <!ELEMENT orderDate (#PCDATA)>
  <!ELEMENT shipDate (#PCDATA)>
  <!ELEMENT billingAddress (name, street, city, state, zip)>
  <!ELEMENT voice (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT street (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT state (#PCDATA)>
  <!ELEMENT zip (#PCDATA)>
]>
And a document:
<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>
XML processing model • A processor (parser) • reads XML documents • passes data to an application • XML Specification tells how to read, what to pass
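The division of labour between processor and application can be illustrated with the SAX interface from the Python standard library. This is only a sketch of the idea: the handler class EventLog and the helper parse_events are names invented here, and the sample input is a fragment of the invoice document.

```python
import xml.sax

class EventLog(xml.sax.ContentHandler):
    """Application side: records what the parser passes on."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # The parser may deliver one text node in several chunks;
        # whitespace-only chunks are skipped in this sketch.
        if content.strip():
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("end", name))

def parse_events(xml_text):
    """Processor side: the SAX parser reads the document and calls the handler."""
    handler = EventLog()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events

for event in parse_events("<invoice><voice>555-1234</voice></invoice>"):
    print(event)
```

The application never sees tags or angle brackets, only the events the parser chooses to report.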
Well-formed XML documents • documents that adhere to the formal requirements (syntax) of the XML specification • if a document is not well-formed, it is not an XML document (and the XML tools do not have to process it)
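A well-formedness check amounts to asking a parser whether it accepts the input at all. The sketch below uses the non-validating parser in Python's standard library; the function name is_well_formed is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as an XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser stops at the first violation of the XML syntax rules.
        return False
    return True

print(is_well_formed("<a><b/></a>"))   # properly nested
print(is_well_formed("<a><b></a>"))    # <b> is never closed
```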
Valid documents • a document is a valid XML document if it is well-formed and adheres to the structure defined in the given DTD • an XML processor can be validating or non-validating • sometimes validity is important, sometimes not