Structured-Document Processing Languages (5 cp), Spring 2011
Pekka Kilpeläinen
University of Eastern Finland, School of Computing
Pekka.T.Kilpelainen@uef.fi
Notes 1: Introduction

Introduction
First: Overview and Arrangements
• What is this course about?
1.1 Review of Structured Documents
2.1 XML & XML Documents

Goals of the Course
• To get familiar with central models and languages for
  • manipulating,
  • transforming, and
  • querying
  structured documents (or XML)
• “Generic XML processing technology”
  • very little about specific XML applications or commercial systems

NOT an Exhaustive Survey
• Bias in selecting course topics:
  • estimated usefulness/value
  • centrality (→ longer-lasting value)
  • maturity: Standards, or stable specifications? Robust implementations?
  • Lecturer up-to-date?
• Emphasis on programmatic access and manipulation of data in the form of documents, rather than describing/modelling it

Motivation?
“Programming for XML is very different from programming for other data models” - D. Florescu & D. Kossmann at SIGMOD’06
• Interest in models of information processing
• Practical relevance of documents and XML
[Figure: XML documents (e.g. an order, e.g. an invoice) exchanged over the Internet / intranet]

Course Outline
1 Introduction
  Overview and Arrangements
  1.1 Structured Documents
2 XML Basics
  - XML documents, DTDs, Namespaces
3 Programmatic Manipulation of Structured Documents (XML APIs)
  - SAX, DOM, StAX, JAXP

Course Outline (2)
4 Transforming Structured Documents
  4.1 Addressing: XPath 1.0
  4.2 XSLT 1.0
5 Querying Structured Documents
  - W3C XML Query Language XQuery
6 Review of the Course

Methodological Goals
• Central professional skills
  • consulting technical specifications
  • experimenting with SW implementations
• Ability to think…?
  • to find out relationships, reason, explain, ...
  • to apply knowledge in new situations
• ("Pidgin English" for scientific communication)

Administration
• An elective Master-level special course
  • suitable for all specialisation lines (esp. media technology and SWE)
• 5 cp (133 h 20 min of work!)
• Lectures March 9 – May 5, class MT3
  • Video broadcast to TB180 in Joensuu
• Lecturer: Pekka.T.Kilpelainen@uef.fi NB!

Administration: Exercises
• Exercises, March 14 – May 9
  • essential for learning the technology
  • normal homework assignments, hands-on practice; solutions discussed in class
• Grading: one prepared solution -> one activity point
• Plan:
  • some solutions are to be submitted to Moodle, for potential feedback and for scaling of activity points by solution quality
  • these will be announced on a case-by-case basis

Administration: Grading
• Final exam on Friday, May 13
  • ≥ 50% of exam points required to pass
• Grade = round(6*Exam/MaxExam + 2*HomeWork/MaxHomeWork - 2.5)
• Retake exam on Friday, June 10
  • again 50% to pass
  • better of grades with/without homework credits

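The grading formula can be tried out with a minimal Python sketch. The function name `course_grade` and its parameter names are hypothetical. The slide does not say how ties are rounded or what is reported for a failed exam, so the sketch assumes that ties round upwards and that an exam below the 50% threshold yields `None`. It also does not cover the retake rule.

```python
import math

def course_grade(exam, max_exam, homework, max_homework):
    """Apply round(6*Exam/MaxExam + 2*HomeWork/MaxHomeWork - 2.5).

    Assumptions (not stated on the slide): ties round upwards, and an
    exam result below 50% of the maximum is reported as None (fail).
    """
    if exam < 0.5 * max_exam:
        return None  # at least 50% of the exam points are required to pass
    raw = 6 * exam / max_exam + 2 * homework / max_homework - 2.5
    # Python's built-in round() rounds ties to the nearest even integer,
    # so the assumed round-half-up is spelled out explicitly here.
    return math.floor(raw + 0.5)

print(course_grade(24, 30, 5, 10))   # 6*0.8 + 2*0.5 - 2.5 = 3.3, i.e. grade 3
print(course_grade(14, 30, 10, 10))  # below the 50% exam threshold: None
```

Under these assumptions, exactly half of the exam points with no homework gives 3 - 2.5 = 0.5, which rounds up to 1. With the built-in `round()` the same input would give 0.
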
Material
• No single textbook
  • Reports, specifications, articles
• Course home page
  • http://www.cs.uku.fi/~kilpelai/RDK11/
  • slides, exercises, references, announcements
• Possible background texts (see home page):
  • Deitel et al.; Møller & Schwartzbach; Key; Walmsley

Background Check
• Basic knowledge of structured documents and document standards
  • Course "Introduction to Document standards"?
• Programming languages and concepts
  • Java? OO programming?
  • Functional programming? SQL?
  • Unix/Linux vs. Windows?
• Formal language theory
  • Theory of Computation
  • regular expressions, context-free grammars, parse trees?

Students' Expectations?

1.1. Structured Documents
• Document:
  • a structured representation of information on some medium ( message)
  • normally for a human reader
    • memos, manuals, articles, books, …
  • also application-to-application messages
    • e.g., between client and server in Web Services
    • "prose-oriented XML" vs "data-oriented XML"
  • can be treated as a unit
    • e.g., a web page vs a web site?

Text-Based Documents
• We concentrate on textual or text-based documents
  • character data is the major constituent of the information content
  • as opposed to, say, multimedia documents
• Next: Presentation vs Structure

Presentation vs Structure
• Presentation informs the human reader about the meaning of text and the role of its parts
• Markup (Finnish: merkkaus) indicates the presentation or the meaning of different parts of text
  • originally hand-written annotations for the typesetter
  • nowadays primarily codes embedded in digital documents; <Tags>

Markup
• Procedural markup
  • formatting commands (start boldface, produce an empty line, indent 5 mm, ...)
  • proprietary word processor formats, nroff, TeX, ...
• Descriptive or generic markup
  • indicating the logical structure of text using chosen names
  • LaTeX: \begin{abstract} ... \end{abstract}
  • HTML: <TITLE> .... </TITLE>
• Markup language (Finnish: merkkauskieli)
  • a fixed set of markup notations (e.g. nroff, TeX, HTML, SVG, …)

Structured Documents?
• Most liberally, any document is structured (punctuation, words, sentences, fields, …)
• but especially descriptively marked-up documents ... (e.g. well-formed XML)
• especially if they adhere to a rigorous specification of structure (e.g. XML+DTD)

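As a rough illustration of the "well-formed XML" level mentioned above, the following Python sketch checks whether a string parses as XML. The helper name `is_well_formed` and the sample memo documents are made up for this example. The standard-library parser used here is non-validating, so the stricter XML+DTD level is not checked.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested tags: well-formed.
print(is_well_formed("<memo><to>All</to><body>Meeting at 10.</body></memo>"))  # True
# The <to> element is closed by </body>: not well-formed.
print(is_well_formed("<memo><to>All</body></memo>"))  # False
```
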
Structure in Documents
• Hierarchy or nesting is ubiquitous
  • chapters of books, warnings in maintenance manuals, ...
• Linear order essential in prose documents
  • less important in representations of data objects
• Hypertext and cross-references
• We'll be mainly dealing with manipulation of hierarchical, or tree-like, document structures
Next: How are these modelled?

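To make the tree-like view concrete, here is a small Python sketch that parses a made-up XML fragment and lists its elements with their nesting depth. The sample document and the function name `outline` are illustrative assumptions. The XML APIs covered later in the course (SAX, DOM, StAX) are Java-oriented, so this is only an informal preview.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: a book with nested chapters.
SAMPLE = (
    "<book>"
    "<chapter><title>Intro</title><warning>Hot surface</warning></chapter>"
    "<chapter><title>Usage</title></chapter>"
    "</book>"
)

def outline(element, depth=0):
    """Return one (depth, tag) pair per element, in document order."""
    rows = [(depth, element.tag)]
    for child in element:  # iterating an Element yields its child elements
        rows.extend(outline(child, depth + 1))
    return rows

# Print the element hierarchy, indenting each tag by its depth.
for depth, tag in outline(ET.fromstring(SAMPLE)):
    print("  " * depth + tag)
```

The printed outline shows `book` at the root, two `chapter` children beneath it, and the `title` and `warning` elements one level deeper. Document order is preserved.
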