COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2007 COMS E6125
Today’s Topic: XML
• XML = eXtensible Markup Language
• XML Namespaces
• Introduction to XML Processing
History Review (or: from thick …
• SGML (Standard Generalized Markup Language)
• ISO Standard 1986 for document storage & exchange
• Metalanguage for defining markup languages (e.g., HTML)
• Separation of content and display
• Used by U.S. govt. & contractors, large manufacturing companies, CERN, …
• SGML reference is 600 pages long!
History Review … to thin …
• XML (eXtensible Markup Language)
• W3C (World Wide Web Consortium) recommendation in February 1998 (http://www.w3.org/XML/)
• Simple subset of SGML: “ASCII of the Web”, 80/20 rule
• XML 1.0 specification is 36 pages for 1st edition, 50 pages for 4th edition (August/September 2006)
• XML 1.1 is about the same length, requires namespaces and addresses Unicode evolution (also Aug/Sep 2006)
Pareto's Principle: The 80-20 Rule
• Vilfredo Pareto was an Italian economist who, in 1906, observed that 20% of the Italian people owned 80% of their country's accumulated wealth
• Practical (non-economics) application: a small number of causes is responsible for a large percentage of the effect, in a ratio of about 20:80
Example Applications
• 80% of benefit comes from the first 20% of effort (or last 20%)
• 80% of the decisions made in meetings come from 20% of the meeting time
• 80% of what we produce is generated during 20% of our working hours
• 80% of the outfits we wear come from 20% of the clothes in our closets and drawers
• 80% of our personal telephone calls are to 20% of the people in our address book
• 80% of an instructor's time is taken up by 20% of the students
History Review … and back!)
• XML Namespaces 1.0/1.1 (January 1999/February 2004, revised August 2006)
• XML Schema (May 2001, revised October 2004, 1.1 in progress)
• Replaces DTDs (Document Type Definitions)
• Not so simple: Part 0 (Primer), Part I (Structures), Part II (Data Types)
• Xoo
• Let’s peek at www.w3.org
XML is…
• The universal format for structured documents and data on the Web
• For data exchange (messages) and persistent data (storage)
• Syntax
• Data Modeling
• Data Processing
Why is XML Popular?
• XML is a general data representation format
• XML is human readable
• XML is machine readable
• XML is internationalized (UNICODE)
• XML is platform independent
• XML is vendor independent
• XML is endorsed by the W3C
• XML is not a new technology
• XML is not only a data representation format
Back to the Future
• Someone in the far future sends a message in a virtual bottle, containing parts of the universal library of human and post-human literature, back into the 1970s when ...
• … the Web, XML, P2P, Java were unheard of
• ... computer manufacturers talked about mips and kilobytes
• … music was played by rotating vinyl discs under a diamond-tip stylus or on cassette tapes
… and Microsoft looked like [image omitted]
The Message in the Bottle, 1st try ÐÏ^Qࡱ^Zá^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>^@^C^@þÿ^@^F^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@#^@^@^@^@^@^@^@^@^P^@^@%^@^@^@^A^@^@^@þÿÿÿ^@^@^@^@"^@^@^@ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á^@q^@^D^@^@^@^R¿^@^@^@^@^@^@^P^@^@^@^@^@^D^@^@Ç^G^@^@^N^@bjbjt+t+^@^@^@ ^@Some Quotations from the Universal Library^M1 Famous Quotes^M1.1 By William I^M[2, Sonnet XVIII]^MShall I compare thee to a summer's day?^MThou art more lovely and more temperate.^MRough winds do shake the darling buds of May,^MAnd summer's lease hath all too short a date.^MSometime too hot the eye of heaven shines,^MAnd often is his gold complexion dimmed.^MAnd every fair from fair some declines,^MBy chance or nature's changing course untrimmed.^MBut thy eternal summer shall not fade,^MNor lose possession of that fair thou owest,^MNor shall Death brag thou wander'st in his shade^MWhile in eternal lines to time thou growest.^MSo long as men can breathe, or eyes can see,^MSo long live this, and this gives life to thee.^M1.2 ^M[2] W. Shakespeare. The Sonnets of Shakespeare.609.^M^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ COMS E6125
The Message in the Bottle, 2nd try
\documentclass{article}
\begin{document}
\title{Some Quotations from the Universal Library}
...
\section{Famous Quotes}
\subsection{By William I}
\textbf{\cite[Sonnet XVIII]{shakespeare-sonnets-1609}}
\begin{verse}
Shall I compare thee to a summer's day?\\
Thou art more lovely and more temperate. \\
Rough winds do shake the darling buds of May, \\
…
\end{verse}
\bibliographystyle{abbrv}
\bibliography{msg}
\end{document}
The Message in the Bottle, finally
<?xml version="1234.56"?>
<universal_library>
  <books>
    <book>
      <title>Some Quotations from the Universal Library</title>
      <section>
        <title>Famous Quotes</title>
        <subsection>
          <title>By William I</title>
          <quote bibref="shakespeare-sonnets-1609">
            <title>Sonnet XVIII</title>
            <verse>
              <line>Shall I compare thee to a summer's day?</line>
              <line>Thou art more lovely and more temperate. </line>
              <line>Rough winds do shake the darling buds of May, </line>
              …
            </verse>
          </quote>
        </subsection>
      </section>
    </book>
    …
  </books>
</universal_library>
XML as a Self-Describing Data Exchange Format
• Someone from the 1970s receives the message in the virtual bottle, and it …
• … can be easily “understood” (even using CP/M & edlin)
• … can be parsed easily
• … allows the application programmer to rediscover schema and semantics (sort of…)
• … may include an explicit schema description
• … allows separation of marked-up content from presentation
XML is Based on Markup
• Root tag (here <bibliography>) encloses everything else
• Start-tags <tag> have matching end-tags </tag>
• Tags are properly nested
• <tag></tag> can be abbreviated to <tag/> for empty elements
• Tags are case sensitive
<bibliography>
  <paper ID="goto">
    <authors>
      <author>Edsger W. Dijkstra </author>
    </authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>
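XML Anatomy
<bibliography>
  <paper ID="goto">
    <authors>
      <author>Edsger W. Dijkstra </author>
    </authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>
• element name: e.g., paper
• element: start-tag through matching end-tag, e.g., <author>Edsger W. Dijkstra </author>
• attribute name: ID
• attribute value: "goto" (attributes cannot contain elements)
• element content: e.g., the <author> child inside <authors>
• character content: e.g., the text inside <title>
• number content: 1968 inside <year>
• empty element: <fullPaper source="harmful"/>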
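The parts named on this slide can be picked out programmatically. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree as the parser (the slides do not prescribe a language or library); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The bibliography example from the slide, written with straight quotes.
BIBLIOGRAPHY = """\
<bibliography>
  <paper ID="goto">
    <authors>
      <author>Edsger W. Dijkstra</author>
    </authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>
"""

root = ET.fromstring(BIBLIOGRAPHY)
paper = root.find("paper")

element_name = paper.tag               # element name
attribute_value = paper.get("ID")      # attribute value, looked up by attribute name
title_text = paper.findtext("title")   # character content
year = int(paper.findtext("year"))     # "number content" arrives as text; convert explicitly

# An empty element has neither text nor child elements.
full_paper = paper.find("fullPaper")
is_empty = full_paper.text is None and len(full_paper) == 0

print(element_name, attribute_value, title_text, year, is_empty)
```

Note that the parser hands back the year as a string: XML itself carries no data types, which is one reason a schema language is layered on top.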
Perspectives on XML
• Document (SGML) Community
• data = linear text documents
• mark up (annotate) text to describe context, structure, semantics
• Database Community
• XML as a prominent example of the semi-structured data model
• captures the whole spectrum from highly structured, regular data to unstructured data
More Perspectives on XML
• "XML is the cure for your data exchange, information integration, e-commerce, … problems" (also cures baldness, lose 28 pounds in 14 days, get rich quick, …)
• "XML is just another syntax (for Lisp, trees, …)"
(books
  (book
    (author "Shakespeare")
    (title "Sonnets")
    (verse
      (line "Shall I compare thee…")
      (line …)
      …)))
Pure XML - Instance Model
• XML 1.0 implicit data model:
• nested containers ("boxes within boxes")
• labeled ordered trees (= semistructured data model)
• relational and object-oriented data are easy to encode
<A>
  <B>foo</B>
  <C>bar</C>
  <C>psl</C>
</A>
Tree view: root A with children B: "foo", C: "bar", C: "psl" (children are ordered)
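The "labeled ordered tree" view can be made concrete with a few lines of code. This is a minimal sketch, assuming Python's xml.etree.ElementTree; the helper to_labeled_tree is a hypothetical name introduced here, and it deliberately ignores attributes and mixed content.

```python
import xml.etree.ElementTree as ET

def to_labeled_tree(element):
    """Return (label, children) for an inner node, or (label, text) for a leaf.

    Children are kept in a list, so document order is preserved.
    """
    children = [to_labeled_tree(child) for child in element]
    if children:
        return (element.tag, children)
    return (element.tag, element.text)

tree = to_labeled_tree(ET.fromstring("<A><B>foo</B><C>bar</C><C>psl</C></A>"))
print(tree)  # prints: ('A', [('B', 'foo'), ('C', 'bar'), ('C', 'psl')])
```

The two C children stay distinct and in order, which is what separates this model from an unordered set of labeled fields.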
Identifying Vocabularies
• My element may not be your element:
• geometry context: <element>line</element>
• chemistry context: <element>oxygen</element>
• An XML Schema (with XML 1.1) defines a vocabulary of names of type definitions, element and attribute declarations
• Use XML Namespaces (with XML 1.1) to identify which vocabulary
XML Namespaces
• Simple method for qualifying element and attribute names used in XML documents by associating them with namespaces identified by URI references
• Useful when a single XML document contains elements and attributes that are defined for and used by multiple software modules
Namespace Scoping
XML namespaces are declared with an xmlns attribute, which can associate a prefix with the namespace. The declaration is in scope for the element containing the attribute and all its descendants.
<html:html xmlns:html='http://www.w3.org/1999/xhtml'>
  <html:head>
    <html:title>Frobnostication</html:title>
  </html:head>
  <html:body>
    <html:p>Moved to <html:a href='http://frob.example.com'>here.</html:a></html:p>
  </html:body>
</html:html>
Namespace Defaulting
<?xml version="1.1"?>
<!-- elements are in the HTML namespace, in this case by default -->
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <title>Frobnostication</title>
  </head>
  <body>
    <p>Moved to <a href='http://frob.example.com'>here</a>.</p>
  </body>
</html>
Multiple Namespaces
<bk:book xmlns:bk='urn:loc.gov:books'
         xmlns:isbn='urn:ISBN:0-395-36341-6'
         xmlns:money='urn:Finance:AllAboutMoney'>
  <bk:title>Cheaper by the Dozen</bk:title>
  <isbn:number>1568491379</isbn:number>
  <bk:price money:currencySymbol="$">99.99</bk:price>
</bk:book>
Namespace Defaulting with Multiple Namespaces
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
</book>
Unprefixed element types are from books
Nested Scoping
<?xml version="1.1"?>
<!-- initially, the default namespace is "books" -->
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <!-- make HTML the default namespace for some commentary -->
    <p xmlns='urn:w3-org-ns:HTML'>
      This is a <i>funny</i> book!
    </p>
  </notes>
</book>
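To see how a namespace-aware parser resolves default, prefixed, and nested declarations, here is a minimal sketch assuming Python's xml.etree.ElementTree, which reports every name as "{namespace-URI}local-name". The XML declaration is left out of the sample string because the underlying parser may not accept version 1.1; the prefix bk in the lookup is an arbitrary choice made here, not something the document declares.

```python
import xml.etree.ElementTree as ET

DOC = """\
<book xmlns='urn:loc.gov:books' xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <p xmlns='urn:w3-org-ns:HTML'>This is a <i>funny</i> book!</p>
  </notes>
</book>
"""

root = ET.fromstring(DOC)

# Walk the tree in document order and collect the fully qualified names.
names = [element.tag for element in root.iter()]
for name in names:
    print(name)

# Lookups use the namespace URI; the prefix is only a local shorthand.
NS = {"bk": "urn:loc.gov:books"}
title = root.find("bk:title", NS).text
```

The <p> and <i> elements land in the HTML namespace because the inner xmlns declaration overrides the default for that subtree, while <notes> stays in the books namespace.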
How to Define the Actual Namespace
• The W3C namespace specification doesn’t say
• A namespace doesn’t actually have to exist as a physical or conceptual entity
• All that is needed is a qualifier (the XML namespace URI or IRI) that, in combination with an element type or attribute name, creates a universal (and universally unique) name
XML Namespaces
• Allows mixing of different tag vocabularies
• Only identifies the vocabulary
• Additional mechanisms required for structure and meaning of tags
Processing XML
• Non-validating parser:
• checks that the XML document is syntactically well-formed
• Validating parser:
• checks that the XML document is also valid with respect to a given XML Schema
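A well-formedness check is what any non-validating parser gives you for free. The sketch below assumes Python's xml.etree.ElementTree, which is non-validating; is_well_formed is a hypothetical helper name. Schema validation is not shown, since Python's standard library does not include an XML Schema validator and a third-party package would be needed for that step.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

good = is_well_formed("<paper><year>1968</year></paper>")
bad_nesting = is_well_formed("<paper><year>1968</paper></year>")  # tags overlap
bad_case = is_well_formed("<Paper></paper>")                      # tags are case sensitive

print(good, bad_nesting, bad_case)
```

A well-formed document may still be invalid: the first sample would pass this check even if a schema required every paper to carry a title.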
Processing XML
• Tree representation:
• Document Object Model (DOM) API
• Cursor APIs, e.g., .NET’s XPathNavigator
• Stream of events representation:
• Push Model, e.g., Simple API for XML (SAX)
• Pull Model, e.g., Common API for XML Pull Parsing (XmlPull)
• Others
Document Object Model
• Object-oriented approach to traversing the XML document
• Hierarchy of Node objects mapping to XML concepts: document, element, attribute, processing instruction, comment, …
• Typically loads the entire XML document into memory (random access)
• Provides mechanisms for loading, saving, accessing, querying, modifying, and deleting nodes from an XML document
Document Object Model
• W3C DOM offers fairly limited functionality, primarily because it was designed to be a generic API that could be implemented in a variety of programming languages
• So people who utilize the DOM API often use helper methods - extensions specific to particular implementations
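The DOM operations listed above (load, access, query, modify, delete, save) can be sketched with Python's xml.dom.minidom, one DOM implementation among many; the sample document is a shortened version of the bibliography example, and the variable names are illustrative.

```python
from xml.dom import minidom

# Load: the whole document is parsed into an in-memory tree of Node objects.
doc = minidom.parseString(
    "<bibliography><paper ID='goto'>"
    "<title>Go To Statement Considered Harmful</title>"
    "<year>1968</year>"
    "</paper></bibliography>"
)

# Access and query: any node can be reached at any time (random access).
paper = doc.getElementsByTagName("paper")[0]
paper_id = paper.getAttribute("ID")
title = paper.getElementsByTagName("title")[0].firstChild.data

# Modify: create a new element with a text child and attach it.
booktitle = doc.createElement("booktitle")
booktitle.appendChild(doc.createTextNode("Communications of the ACM"))
paper.appendChild(booktitle)

# Delete: detach the <year> element from its parent.
year = paper.getElementsByTagName("year")[0]
paper.removeChild(year)

# Save: serialize the modified tree back to text.
serialized = doc.documentElement.toxml()
print(serialized)
```

Even reading one title takes several generic calls (getElementsByTagName, firstChild, data), which illustrates why implementations tend to add their own convenience helpers.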
Push Model
• XML producer (typically an XML parser) controls the pace of the application and informs the XML consumer when certain events occur (e.g., reports events when encountering begin/end tags)
• XML consumer registers callbacks with the producer, which invokes the callbacks as various parts of the XML document are seen (as events are reported)
• Does not build a parse tree
Push Model Pros
• The entire XML document does not need to be stored in memory, only the information about the node currently being processed is needed
• This makes it possible to process large XML documents without incurring massive memory costs
• Can also process XML streams whose contents arrive over time
• Allows consumer to ignore less interesting data
Push Model Cons
• Certain context and state information such as the parents of the current node or its depth in the XML tree must be tracked by the programmer
• Limited expressive power (query/update) when working on streams
• To register callbacks one needs to create a class devoted to handling events from the producer
• Many developers find callbacks to be an unintuitive way to control program flow
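Both the strengths and the bookkeeping burden of the push model show up in even a small SAX consumer. This is a minimal sketch assuming Python's xml.sax module; DepthTrackingHandler is a hypothetical class written for this example, and the depth counter and in_title flag are exactly the kind of state the programmer has to maintain by hand.

```python
import xml.sax

class DepthTrackingHandler(xml.sax.ContentHandler):
    """Hypothetical consumer: records (depth, name) per start tag and collects title text."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # context the parser does not track for us
        self.seen = []
        self.in_title = False   # boolean flag standing in for "where am I?"
        self.title_parts = []

    def startElement(self, name, attrs):
        self.seen.append((self.depth, name))
        self.depth += 1
        if name == "title":
            self.in_title = True

    def endElement(self, name):
        self.depth -= 1
        if name == "title":
            self.in_title = False

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate rather than assign.
        if self.in_title:
            self.title_parts.append(content)

handler = DepthTrackingHandler()
# The producer (parser) drives: it calls back into the handler as it reads.
xml.sax.parseString(
    b"<bibliography><paper ID='goto'>"
    b"<title>Go To Statement Considered Harmful</title>"
    b"<year>1968</year>"
    b"</paper></bibliography>",
    handler,
)
title = "".join(handler.title_parts)
print(handler.seen, title)
```

No tree is ever built: once endElement returns, nothing about that element remains unless the handler chose to keep it.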
Pull Model
• XML Consumer controls the program flow by requesting events from the XML producer as needed
• Operates in a forward-only, streaming fashion while only showing information about a single node at any given time
• Programmer creates a loop that continually reads from the XML document until the end of the document is reached, but acts solely upon items of interest as they are seen
Pull Model Comparison
• As memory efficient as push model processing but with a more familiar programming model
• Does not require a specialized class for handling XML processing to implement specific interfaces or subclass certain classes to register callbacks
• The need to explicitly track application states using boolean flags and similar variables is significantly reduced
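For comparison with the callback style, the same kind of task can be written as an ordinary loop. This is a minimal sketch assuming Python's xml.dom.pulldom as the pull API (the slides name XmlPull, a Java API, as their example); the second paper in the sample data is made up for illustration.

```python
from xml.dom import pulldom

# Sample input; the second <paper> is a made-up entry for illustration.
DOC = (
    "<bibliography>"
    "<paper ID='goto'><title>Go To Statement Considered Harmful</title><year>1968</year></paper>"
    "<paper ID='made-up'><title>Placeholder Title</title><year>2007</year></paper>"
    "</bibliography>"
)

years_by_id = {}
current_id = None

# The consumer drives: each loop iteration pulls the next event from the producer.
stream = pulldom.parseString(DOC)
for event, node in stream:
    if event != pulldom.START_ELEMENT:
        continue  # skip everything that is not of interest
    if node.tagName == "paper":
        current_id = node.getAttribute("ID")
    elif node.tagName == "year":
        # Pull in just this element's subtree so its text can be read.
        stream.expandNode(node)
        text = "".join(
            child.data for child in node.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        years_by_id[current_id] = int(text)

print(years_by_id)
```

No handler class is needed, and the only state carried between iterations is current_id; the titles are skipped without any code written to ignore them.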
XML Cursors
• Cursor acts like a lens that focuses on one XML node at a time, but, unlike pull-based or push-based APIs, the cursor can be positioned anywhere along the XML document at any given time
• Allows one to navigate, query, and manipulate an XML document loaded in memory
• Does not require the heavyweight interface of a traditional tree model API, where every significant token in the underlying XML must map to an object
• Can create XML views of non-XML data
Other Alternatives
• Object to XML Mapping APIs
• Represent nodes and text as classes and programming language primitives
• Cannot represent all XML information with full fidelity, e.g., lose processing instructions and comments, element ordering
• Impedance mismatches between XML Schema and object-oriented concepts
• XML-specific languages: XPath, XQuery, XSLT, …
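As a small taste of the XML-specific languages, path expressions can replace hand-written traversal code. This is a minimal sketch assuming Python's xml.etree.ElementTree, which supports only a limited subset of XPath; the second paper in the sample data is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample input; the second <paper> is a made-up entry for illustration.
root = ET.fromstring(
    "<bibliography>"
    "<paper ID='goto'><title>Go To Statement Considered Harmful</title><year>1968</year></paper>"
    "<paper ID='made-up'><title>Placeholder Title</title><year>2007</year></paper>"
    "</bibliography>"
)

# Select the title of the paper whose ID attribute equals 'goto'.
titles = [t.text for t in root.findall("./paper[@ID='goto']/title")]

# Select every <year> element at any depth, in document order.
all_years = [y.text for y in root.findall(".//year")]

print(titles, all_years)
```

The path states what to select rather than how to walk the tree, which is the main appeal of a declarative query language over DOM or event code.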
Reminders
• Detailed paper proposal due February 12th
• Full paper due February 26th
• Paper must be individual - ok to discuss topic with others but not ok to get help with writing the paper from others
• Preliminary project proposal due March 5th
• Projects may be done by 1-5 students in the class (no one outside the class!)
Next Assignment: Detailed Paper Proposal
• Due Monday February 12th at 5pm
• Post on CourseWorks
• Full paper due February 26th
• Paper counts 45% of final course grade
• Everyone should have already received feedback on their preliminary proposals; if not, contact instructor immediately at kaiser+4156@cs.columbia.edu
Detailed Paper Proposal
• Assume a reader who is taking the class but may not know anything at all about your specific topic
• You must relate your topic in some way to the World Wide Web (Internet != WWW, networking != WWW, mobile phones != WWW, computer use != WWW, computerized record keeping != WWW, Java != WWW, email != WWW, instant messaging != WWW, chat != WWW, multimedia != WWW, …)
Detailed Paper Proposal
• Plan and outline your paper (which should be ~15 pages)
• Each full paper should have title, author, abstract (~200 words), introduction, body sections, conclusions, bibliography
• The point of this assignment is to determine what will be in those sections (and, optionally, subsections) and motivate your further reading
A Note about Bibliographic References
• References should be cited as warranted in the text like this [1] or this [Kai07]
• Bibliography entry should appear something like this:
[Kai07] Gail Kaiser, COMS E6125 Web-enHanced Information Management, Columbia University Department of Computer Science, 2007, http://york.cs.columbia.edu/classes/cs6125/.