COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2012 Kaiser: COMS E6125
Today’s Topics
• History of markup languages
• HTML = HyperText Markup Language
• XML = eXtensible Markup Language
• Document Structure Definition
• XML Schema (XSD)
• Processing XML
What is Markup?
• Special text (“mark”) that is added to the regular content text of a document in order to convey some information about it
• A markup language is a formalized way of providing markup, and specifies:
  • what markup is allowed (the lexicon)
  • what markup is required vs. optional
  • how markup is distinguished from content text
  • what the markup “means”
Specific Coding
• Historically, electronic manuscripts contained procedural control codes (markup) that caused the text to be formatted in a particular way
  • tj6
  • troff
  • TeX
Procedural Markup
• Advantages:
  • Instructs agent how to process text
  • Generally concerned with formatting and presentation
  • Is “efficient” because it requires little further interpretation
• Disadvantages:
  • Often specific to one proprietary processing system
  • Usually ties a document to a single purpose
    • printing on paper
    • viewing on a screen
  • Provides no information on “meaning”
Example Specific Coding
  .SK 1
  Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called "markup", serves two purposes:
  .TB 4
  .OF 4
  .SK 1
  1.#Separating the logical elements of the document; and
  .OF 4
  .SK 1
  2.#Specifying the processing functions to be performed on those elements.
  .OF 0
  .SK 1
Control codes: .TB = TaB stop, .OF = OFfset, .SK = SKipping vertical space
Generic Coding
• In contrast, generic (or generalized, or descriptive) coding uses descriptive tags (e.g., “heading”)
  • Scribe
  • LaTeX
  • HTML
Descriptive Markup
• Advantages:
  • Is (usually) human and machine readable
  • Identifies information content, the logical components of a document
  • Is not directed towards a particular purpose or rendition of the document, therefore can be non-proprietary
• Disadvantages:
  • Generally concerned with what text is
  • Does not specify what procedures are to be applied to text
  • Therefore requires that other process(es) supply formatting and presentation
Example Generic Coding
  <p>
  Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called <em>markup</em>, serves two purposes:
  <ol>
  <li>Separating the logical elements of the document; and
  <li>Specifying the processing functions to be performed on those elements.
  </ol>
Who Invented Markup?
• Specialized markup: ???
• Generalized markup:
  • Many credit William Tunnicliffe, chairman of the Graphic Communications Association Composition Committee, who presented a talk on the separation of information content of documents from their format during a meeting at the Canadian Government Printing Office, September 1967
  • Others credit Stanley Rice, a New York book designer, who proposed the idea of a universal catalog of parameterized editorial structure macros in several articles, e.g., "Editorial Text Structures," Memorandum to Standards Planning and Requirements Committee, ANSI, March 17, 1970
An Early Implementation
• At IBM in 1969, Charles Goldfarb, Ed Mosher and Ray Lorie invented Generalized Markup Language (GML) as part of an office automation project integrating text editing with information retrieval and page composition
• Instead of a simple tagging scheme, GML introduced the concept of a formally defined document type (DTD = Document Type Definition) with an explicit nested element structure
• By 1971 they had developed the first DTD, for the manuals for IBM's “Telecommunications Access Method”, which enabled all the headings of a given header level to be automatically formatted identically
• Productized in 1973 in IBM’s Document Composition Facility (DCF)
Example GML
  :h1.Chapter 1: Introduction
  :p.GML supported hierarchical containers, such as
  :ol
  :li.Ordered lists (like this one),
  :li.Unordered lists, and
  :li.Definition lists
  :eol.
  as well as simple structures.
  :p.Markup minimization (later generalized and formalized in SGML), allowed the end-tags to be omitted for the "h1" and "p" elements.
SGML = Standard GML
• Standardization effort started in 1978, when ANSI (American National Standards Institute) created the Computer Languages for the Processing of Text Committee
• Series of draft standards 1980-1986 (1983 version adopted by IRS and DoD)
• ISO (International Organization for Standardization) joined the ANSI effort in 1984
• International SGML standard in 1986, based in part on an SGML system developed by Anders Berglund, then of the European Particle Physics Laboratory (CERN)
• Hmm… isn’t CERN where Tim Berners-Lee invented the “World Wide Web” in 1989?
SGML
• A metalanguage (grammar)
  • How to write tags, how to define the document structure
• Structural paradigm is that of
  • an inverted tree structure, a root component branching out into leaves
  • or a series of nested containers
• Defines three kinds of objects
  • Elements are the basic structural components
  • Attributes are qualities of elements
  • Entities are a short representation of special characters
SGML Pro and Con
• Advantages:
  • Documents held in a standards-based, non-proprietary, platform-independent storage format
  • Scope for document re-use and re-presentation, enhancement of retrieval possibilities
  • Easy to process
  • Can (optionally) validate against DTDs
• Disadvantages:
  • Remained a niche market in the 1980s
  • Not well supported by the major document processing vendors, tools expensive
Then Came the Web…
• HyperText Markup Language (HTML) is derived from SGML
• As an SGML-compliant language, it has a DTD with a fixed set of tags
• Initially, the number of tags was very limited (~10), so they were easy to remember and to use
HTML Example
• From original IETF Internet Draft (1993) for HTML
  See <A HREF="http://info.cern.ch/">CERN</A>'s information for more details.
  A <A NAME=serious>serious</A> crime is one which is associated with imprisonment.
  The Organization may refuse employment to anyone convicted of a <a href="#serious">serious</A> crime.
  Warning: <IMG SRC="triangle.gif" ALT="Warning:"> This must be done by a qualified technician.
  <A HREF="Go"><IMG SRC="Button"> Press to start</A>
HTML Pro and Con
• Advantages:
  • Simple to learn and to use
  • Easy to create from scratch or by converting legacy text files
  • Easy to parse and render
• Drawbacks:
  • Syntaxless
  • Much more a presentation language than a structural language
  • Too limited, not a good substitute for a word processor
HTML History
• 1990: First implementation by Tim Berners-Lee (TBL) on a NeXT computer at CERN, using SGML tools to create the original HTML language (DTD, parser)
• 1991-1992: Various text-only and graphical browsers developed, the latter usually platform-specific
• 1993: NCSA Mosaic
  • First widely available graphical WWW browser (Unix X-Windows and Mac)
  • Developed primarily by UIUC undergraduate Marc Andreessen
  • The killer app of the Internet is born and the number of Web servers explodes
HTML History
• 1994: Competition
  • Mosaic team leaves NCSA to found Netscape
  • Microsoft adopts the Web (Internet Explorer bundled with Windows 95)
  • Divergence of supported HTML tags between Internet Explorer and Netscape -> browser wars
• 1994-1995: HTML 2.0 adds image maps, forms
HTML History
• 1995 and beyond: Commercial websites
  • Java development started (as “Oak”) for programming set-top boxes in 1991, BIG FAILURE - but launched on Web in March 1995 (in Sun’s HotJava browser) and May 1995 (in Netscape), BIG SUCCESS
  • Amazon.com opens in July 1995
  • “dot com” era 1995-2001
• 1997: HTML 3.2 and HTML 4.0 add tables, applets, text flow around images, superscripts and subscripts, frames, cascading style sheets, more multimedia options, scripting languages, web accessibility conventions, internationalization, … (minor updates in HTML 4.01, 1999)
• Still in progress since ~2006: HTML 5
XHTML = eXtensible HyperText Markup Language
• Jan 2000: XHTML 1.0 W3C Recommendation (XHTML 1.1 followed in May 2001, with a second edition in Nov 2010; XHTML 2.0 working group expired in Dec 2010)
• Conforms to XML = eXtensible Markup Language
  • Made element and attribute names case-sensitive (in particular, use lowercase)
  • Must include end tags, e.g., <p> … </p>
  • Empty tags must also be closed: add “/” to empty elements, e.g., <br/> and <hr/>
  • Tags must be nested properly, e.g., <b><i>This is bold and italic</b></i> is invalid
  • Quote all attribute values, e.g., <img src="duck.jpg" alt="A Duck"/>
• Enables using XML parsers on HTML documents
• XHTML can be used in conjunction with other XML vocabularies (modules) dedicated to specific applications
• Most browsers still work with older HTML, but may handle invalid markup in incompatible ways
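The XHTML rules above are what allow a generic XML parser to read a page. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module as the XML parser; the helper name `parses_as_xml` and the sample strings are illustrative, not part of the lecture.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(markup: str) -> bool:
    """Return True if a generic XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# XHTML style: lowercase names, closed empty elements, quoted attribute values.
xhtml = '<p>A duck:<br/><img src="duck.jpg" alt="A Duck"/></p>'

# Older HTML habits that an XML parser rejects.
old_html = {
    "unclosed empty element": '<p>A duck:<br></p>',
    "unquoted attribute value": '<p><img src=duck.jpg alt="A Duck"/></p>',
    "improper nesting": '<p><b><i>This is bold and italic</b></i></p>',
}

print("xhtml:", parses_as_xml(xhtml))
for label, markup in old_html.items():
    print(label + ":", parses_as_xml(markup))
```

A browser's HTML parser would typically recover from all three of the older forms, which is why the same page can render differently across browsers yet fail outright in an XML toolchain.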
What is XML for?
• The “universal” markup format for structured documents and data on the Web
• For data exchange (messages) and persistent data
• Syntax, Data Modeling, Data Processing
• Conceptually an SGML descendant
• Unlike SGML, it quickly became widespread
SGML->XML
• Like SGML, XML is a grammar (or a metalanguage), NOT a specific language
• Relatively simple specification
• Parsing made simpler through two-level mechanism
  • Well-formed
  • Valid
Well-Formed
• Contains only properly encoded legal Unicode characters.
• None of the special syntax characters such as "<" and "&" appear except when performing their markup-delineation roles.
• The begin, end, and empty-element tags that delimit the elements are correctly nested, with none missing and none overlapping.
• The element tags are case-sensitive; the beginning and end tags must match exactly. Tag names cannot contain any of the characters !"#$%&'()*+,/;<=>?@[\]^`{|}~, nor a space character, and cannot start with -, ., or a numeric digit.
• There is a single "root" element that contains all the other elements.
• An XML processor that encounters a violation is required to report such errors and to cease normal processing.
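The well-formedness rules can be probed with any conforming parser. This is a small sketch, assuming Python's standard `xml.etree.ElementTree` (backed by expat); the function name `well_formedness_error` and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(document: str):
    """Return None if the document is well-formed, else the parser's message."""
    try:
        ET.fromstring(document)
        return None
    except ET.ParseError as err:
        # The parser reports the violation and stops; no partial tree is returned.
        return str(err)

samples = {
    "ok": "<note><to>Jon</to></note>",
    "bare ampersand": "<note>R&D</note>",
    "escaped ampersand": "<note>R&amp;D</note>",
    "case mismatch": "<Note></note>",
    "overlapping tags": "<a><b></a></b>",
    "two roots": "<a/><b/>",
}

for label, doc in samples.items():
    print(label, "->", well_formedness_error(doc) or "well-formed")
```

The exact wording of each error message depends on the underlying parser, so code should branch on whether an error occurred rather than on the message text.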
Valid
• Well-formed, plus
• Conforms to a document type
  • tags and attributes are all declared
  • tags and attributes are used only in the proper context
• XML browsers and other applications usually require validity
• Other tools might not (e.g., search engines)
XML Goes Beyond Document Processing
• XML is more oriented to distributed computing than to document markup
• Thus it complements rather than replaces HTML (or XHTML)
  • DOM = Document Object Model
  • SAX = Simple API for XML
  • SOAP = Simple Object Access Protocol
  • Web Services
XML Anatomy
  <bibliography>
    <paper ID="goto">
      <authors>
        <author>Edsger W. Dijkstra</author>
      </authors>
      <title>Go To Statement Considered Harmful</title>
      <booktitle>Communications of the ACM</booktitle>
      <year>1968</year>
      <fullPaper source="harmful"/>
    </paper>
  </bibliography>
Callouts: element; element name; element content; attribute name (ID, source); attribute value ("goto", "harmful"; attributes cannot contain elements); character content; number content (1968); empty element (fullPaper)
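The parts labeled on this slide map directly onto what a parser exposes. A brief sketch, assuming Python's standard `xml.etree.ElementTree`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = """<bibliography>
  <paper ID="goto">
    <authors><author>Edsger W. Dijkstra</author></authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>"""

root = ET.fromstring(doc)
paper = root.find("paper")

print(paper.tag)                 # element name
print(paper.attrib)              # attribute name/value pairs
print(paper.findtext("title"))   # character content

# "Number content" still arrives as text; the application converts it.
year = int(paper.findtext("year"))
print(year + 1)

# An empty element has no text and no child elements, only attributes.
full = paper.find("fullPaper")
print(full.text, len(full), full.get("source"))
```

Without a schema the parser has no way to know that `year` is numeric, which is one motivation for the typed schemas discussed later in the lecture.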
Perspectives on XML
• Document (SGML) Community
  • data = linear text documents
  • markup (annotate) text to describe context, structure, semantics
• Database Community
  • prominent example of the semi-structured data model
  • captures the whole spectrum from highly structured, regular data to unstructured data
• “XML is the cure for your data exchange, information integration, e-commerce, … problems” (also cures baldness, lose 28 pounds in 14 days, get rich quick, …)
Identifying Vocabularies
• My element may not be your element:
  • geometry context: <element>line</element>
  • chemistry context: <element>oxygen</element>
Identifying Vocabularies
• An XML Schema defines a vocabulary of names of type definitions, element and attribute declarations
• Use XML Namespaces to identify which vocabulary
  • Simple method for qualifying element and attribute names used in XML documents
  • Useful when a single XML document contains elements and attributes that are defined for and used by multiple software modules
Namespace Scoping
• XML namespaces are declared with an xmlns attribute, which can associate a prefix with the namespace
• The declaration is in scope for the element containing the attribute and all its descendants
  <html:html xmlns:html='http://www.w3.org/1999/xhtml'>
    <html:head>
      <html:title>Frobnostication</html:title>
    </html:head>
    <html:body>
      <html:p>Moved to <html:a href='http://frob.example.com'>here.</html:a></html:p>
    </html:body>
  </html:html>
Namespace Defaulting
  <?xml version="1.1"?>
  <!-- elements are in the HTML namespace, in this case by default -->
  <html xmlns='http://www.w3.org/1999/xhtml'>
    <head>
      <title>Frobnostication</title>
    </head>
    <body>
      <p>Moved to <a href='http://frob.example.com'>here</a>.</p>
    </body>
  </html>
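A prefix and a default declaration are two spellings of the same thing: what matters to a namespace-aware parser is the pair (namespace URI, local name). A minimal sketch, assuming Python's standard `xml.etree.ElementTree`, which writes that pair as `{uri}local`; the documents are shortened versions of the two slides and `expanded_names` is an illustrative helper.

```python
import xml.etree.ElementTree as ET

prefixed = (
    "<html:html xmlns:html='http://www.w3.org/1999/xhtml'>"
    "<html:head><html:title>Frobnostication</html:title></html:head>"
    "</html:html>"
)
defaulted = (
    "<html xmlns='http://www.w3.org/1999/xhtml'>"
    "<head><title>Frobnostication</title></head>"
    "</html>"
)

def expanded_names(document: str):
    """List every element's expanded name in document order."""
    return [el.tag for el in ET.fromstring(document).iter()]

print(expanded_names(prefixed))
print(expanded_names(prefixed) == expanded_names(defaulted))
```

Because the prefix itself is discarded after parsing, applications should compare expanded names and never the literal prefix text.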
Multiple Namespaces
• All element types are prefixed
  <bk:book xmlns:bk='urn:loc.gov:books'
           xmlns:isbn='urn:ISBN:0-395-36341-6'
           xmlns:money='urn:Finance:AllAboutMoney'>
    <bk:title>Cheaper by the Dozen</bk:title>
    <isbn:number>1568491379</isbn:number>
    <bk:price money:currencySymbol="$">99.99</bk:price>
  </bk:book>
Nested Scoping
  <?xml version="1.1"?>
  <!-- initially, the default namespace is "books" -->
  <book xmlns='urn:loc.gov:books'
        xmlns:isbn='urn:ISBN:0-395-36341-6'>
    <title>Cheaper by the Dozen</title>
    <isbn:number>1568491379</isbn:number>
    <notes>
      <!-- make HTML the default namespace for some commentary -->
      <p xmlns='urn:w3-org-ns:HTML'>
        This is a <i>funny</i> book!
      </p>
    </notes>
  </book>
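To see which namespace each element of the nested example ends up in, one can parse it and print the resolved names. This is a sketch, assuming Python's standard `xml.etree.ElementTree`; the XML declaration and comments are omitted to keep it short, and `split_name` and `scopes` are illustrative names.

```python
import xml.etree.ElementTree as ET

doc = """<book xmlns='urn:loc.gov:books' xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <p xmlns='urn:w3-org-ns:HTML'>This is a <i>funny</i> book!</p>
  </notes>
</book>"""

def split_name(tag: str):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    uri, _, local = tag[1:].partition("}")
    return uri, local

# Map each local name to the namespace that was in scope where it appeared.
scopes = {}
for el in ET.fromstring(doc).iter():
    uri, local = split_name(el.tag)
    scopes[local] = uri
    print(f"{local:8} {uri}")
```

The inner default declaration on `p` applies to `p` and its descendant `i`, while `notes`, which encloses it, stays in the outer default namespace.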
How to Define the Actual Namespace
• W3C namespace specification doesn’t say (!)
• A namespace doesn’t actually have to exist as a physical or conceptual entity
• All that is needed is a qualifier (the XML namespace URI) that, in combination with an element type or attribute name, creates a universal (and universally unique) name
• In other words, there doesn’t actually have to be a definition or anything else at that URI
Pure XML - Instance Model
• XML 1.0 implicit data model:
  • nested containers ("boxes within boxes")
  • labeled ordered trees (= semistructured data model)
• Relational or object-oriented easy to encode
  <A>
    <B>foo</B>
    <C>bar</C>
    <C>psl</C>
  </A>
Tree view: root A with children B: "foo", C: "bar", C: "psl" (children are ordered)
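Both claims on this slide, that children are ordered and that relational data is easy to encode, can be shown in a few lines. A sketch assuming Python's standard `xml.etree.ElementTree`; the `encode_table` helper, the `row` element name, and the sample row are hypothetical choices, one of many possible encodings.

```python
import xml.etree.ElementTree as ET

# Labeled ordered tree: iterating an element yields children in document order.
root = ET.fromstring("<A><B>foo</B><C>bar</C><C>psl</C></A>")
children = [(child.tag, child.text) for child in root]
print(children)

def encode_table(table_name, rows):
    """Encode relational rows as XML: one <row> per tuple, one child per column."""
    table = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(table, "row")
        for column, value in row.items():
            ET.SubElement(row_el, column).text = str(value)
    return table

people = encode_table("people", [{"name": "Gail Kaiser", "city": "New York"}])
encoded = ET.tostring(people, encoding="unicode")
print(encoded)
```

Note that the encoding goes one way easily; recovering column types from the XML would again need a schema.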
XML + Namespaces
• Allows mixing of different tag vocabularies
• Only identifies the vocabulary (lexicon)
• Additional mechanisms required for structure and meaning (or at least metadata) of tags: explicit data model
From Documents to Data
  <memo importance='high' date='2012-01-30'>
    <from>Gail Kaiser</from>
    <to>Jon Bell</to>
    <subject>whim this week</subject>
    <body>Bring blue books for a surprise quiz!</body>
  </memo>

  <invoice>
    <orderDate>2011-12-01</orderDate>
    <shipDate>2011-12-26</shipDate>
    <billingAddress>
      <name>Gail Kaiser</name>
      <street>500 West 120th Street</street>
      <city>New York</city>
      <state>NY</state>
      <zip>10027</zip>
    </billingAddress>
    <voice>212-555-1234</voice>
    <fax>212-555-4321</fax>
  </invoice>
• We want to be able to
  • Extract the element structure of a document
  • Re-use this structure for other similar documents
  • Share structure and metadata with others
  • Automate processing of this structure and metadata
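As a rough illustration of "extracting the element structure", the memo can be reduced to an outline of which attributes and children each element carries. This is only a sketch, assuming Python's standard `xml.etree.ElementTree`; the `structure` function is hypothetical and is not how a real schema is derived.

```python
import xml.etree.ElementTree as ET

memo = """<memo importance='high' date='2012-01-30'>
  <from>Gail Kaiser</from>
  <to>Jon Bell</to>
  <subject>whim this week</subject>
  <body>Bring blue books for a surprise quiz!</body>
</memo>"""

def structure(element):
    """Map each element name to (sorted attribute names, child names in order).

    If a name occurs more than once, the last occurrence wins; that is
    acceptable for a sketch but too crude for real schema inference.
    """
    outline = {}
    for el in element.iter():
        outline[el.tag] = (sorted(el.attrib), [child.tag for child in el])
    return outline

outline = structure(ET.fromstring(memo))
for name, (attrs, children) in outline.items():
    print(name, attrs, children)
```

An outline like this is what a document structure description makes explicit and shareable, so that other memos can be checked against it.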
Adding Structure and Semantics
• A Document Structure Description (DSD) defines the syntax of XML documents for a particular application domain
• Defines the grammar for an XML-based markup language
Processing XML
• Non-validating parser:
  • checks that the XML document is syntactically well-formed, e.g., all open-tags have matching close-tags and they are properly nested, each attribute appears only once in an element, etc.
• Validating parser:
  • checks that the XML document is also valid with respect to a given DSD (usually XML Schema)
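The difference between the two kinds of parser can be sketched with a toy check layered on top of a non-validating parser. This assumes Python's standard `xml.etree.ElementTree` for the well-formedness step; `MEMO_MODEL` and `violations` are hypothetical stand-ins for a real DSD and validator, and they check far less than XML Schema does.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> allowed child element names.
MEMO_MODEL = {
    "memo": {"from", "to", "subject", "body"},
    "from": set(), "to": set(), "subject": set(), "body": set(),
}

def violations(document: str, model) -> list:
    """Parse (well-formedness check), then list departures from the model."""
    root = ET.fromstring(document)  # raises ParseError if not well-formed
    problems = []
    for el in root.iter():
        if el.tag not in model:
            problems.append(f"undeclared element <{el.tag}>")
            continue
        for child in el:
            if child.tag not in model[el.tag]:
                problems.append(f"<{child.tag}> not allowed inside <{el.tag}>")
    return problems

good = "<memo><from>Gail Kaiser</from><to>Jon Bell</to></memo>"
odd = "<memo><from><to>Jon Bell</to></from><zip>10027</zip></memo>"

print(violations(good, MEMO_MODEL))
print(violations(odd, MEMO_MODEL))
```

Both documents are well-formed, so a non-validating parser accepts them equally; only the second step distinguishes them.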
Using DSD Validators
• A DSD processor can be useful both on the server side (when writing XML documents) and on the client side (when processing XML documents):
  • Checking validity (conformance) of XML documents
  • Performing default insertion (inserts missing fragments)
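Default insertion can be pictured as filling in whatever the structure description says is implied but the document left out. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the `DEFAULTS` table, the `insert_defaults` function, and the sample document are all hypothetical and cover attribute defaults only.

```python
import xml.etree.ElementTree as ET

# Hypothetical defaults: element name -> {attribute name: default value}.
DEFAULTS = {"shipTo": {"country": "US"}}

def insert_defaults(root, defaults):
    """Add each missing attribute that has a declared default; keep explicit values."""
    for el in root.iter():
        for name, value in defaults.get(el.tag, {}).items():
            el.attrib.setdefault(name, value)
    return root

orders = ET.fromstring(
    "<orders>"
    "<shipTo><city>Mill Valley</city></shipTo>"
    "<shipTo country='CA'><city>Old Town</city></shipTo>"
    "</orders>"
)
insert_defaults(orders, DEFAULTS)
countries = [el.get("country") for el in orders.iter("shipTo")]
print(countries)
```

After this step, downstream code can rely on the attribute being present without repeating the default in every consumer.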
DSD Processing
Several Proposed DSDs
• XML Document Type Definitions (DTDs):
  • Define the structure of “allowed” documents
  • Database schema
  • Non-XML syntax
• XML Schemas (XSDs):
  • Define structure and data types
  • Allow developers to build their own libraries of interchangeable data types
  • Written in an XML vocabulary
• Others (e.g., RELAX NG, DSD)
XML Schema Design Principles
• More expressive than DTDs (from SGML)
• Notation is itself an XML vocabulary
• Self-describing
• Usable by a wide variety of applications that employ XML
• Straightforwardly usable on the Internet
• Optimized for interoperability
• Simple enough to be implemented with modest design and runtime resources
• Coordinated with relevant W3C specs
Purpose of an XML Schema
• Defines a class of XML instances
• Neither instances nor schemas need exist as documents per se; they may exist as:
  • Byte stream sent between applications
  • Fields in a database record
  • Collection of XML “infoset” information items
What is an XML “infoset”?
• XML Information Set, 2nd edition, W3C Recommendation February 2004
• For use by other specs that need to refer to the information in a well-formed XML document [or PSVI = post schema validated infoset]
• Defines the abstract data set generated by a parser or by other means; conceptually a tree of items, each with several properties
(Some) Information Items
• Document (root of infoset): properties include base URI, XML version, character encoding, etc.
• One root element, and its children
• Attributes of elements
• Namespace scoping for elements
• Processing instructions
• Unexpanded entities (processor may or may not expand all entities)
Example Instance Document
File: po.xml
  <?xml version="1.0"?>
  <purchaseOrder orderDate="2008-08-20">
    <shipTo country="US">
      <name>Robert Smith</name>
      <street>123 Maple Street</street>
      <city>Mill Valley</city>
      <state>CA</state>
      <zip>90952</zip>
    </shipTo>
    <billTo country="US">
      <name>Alice Smith</name>
      <street>8 Oak Avenue</street>
      <city>Old Town</city>
      <state>PA</state>
      <zip>95819</zip>
    </billTo>
    <comment>Hurry, my lawn is going wild!</comment>
    <items>
      <item partNum="872-AA">
        <productName>Lawnmower</productName>
        <quantity>1</quantity>
        <USPrice>148.95</USPrice>
        <comment>Confirm this is electric</comment>
      </item>
      <item partNum="926-AA">
        . . .
      </item>
    </items>
  </purchaseOrder>
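Once a purchase order is available as XML, an application can treat it as data. The following sketch assumes Python's standard `xml.etree.ElementTree` and `decimal` modules. It uses a trimmed copy of the instance containing only the item that is fully shown on the slide; the second item is elided in the source and is left out here.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

po = """<purchaseOrder orderDate="2008-08-20">
  <shipTo country="US"><name>Robert Smith</name></shipTo>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
    </item>
  </items>
</purchaseOrder>"""

root = ET.fromstring(po)
total = Decimal("0")
for item in root.iter("item"):
    # Without a schema, the application must know which fields are numeric.
    quantity = int(item.findtext("quantity"))
    price = Decimal(item.findtext("USPrice"))  # Decimal avoids float rounding on money
    total += quantity * price
    print(item.get("partNum"), item.findtext("productName"), quantity, price)

print("order date:", root.get("orderDate"))
print("total:", total)
```

The explicit `int` and `Decimal` conversions are exactly the kind of knowledge a schema with simple types would record once for every consumer of the document.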
Where is the Schema?
• The instance document may reference a schema explicitly, or a processor may obtain a schema separately without reference from the instance
• Schema defines elements and attributes, and their complex and simple types
• Determines the appearance of elements and their content in instance documents