COMS E6125 Web-enHanced Information Management (WHIM). Prof. Gail Kaiser Spring 2008. Today’s Topic: Markup Languages. History of markup languages SGML = Standard Generalized Markup Language HTML = HyperText Markup Language XML = eXtensible Markup Language. What is Markup?.
COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2008 Kaiser: COMS E6125
Today’s Topic: Markup Languages • History of markup languages • SGML = Standard Generalized Markup Language • HTML = HyperText Markup Language • XML = eXtensible Markup Language
What is Markup? • Special text (“mark”) that is added to the regular text of a document in order to convey some information about it • A markup language is a formalized way of providing markup, and specifies: • what markup is allowed (the lexicon) • what markup is required • how markup is distinguished from content text • what the markup “means”
Specific Coding • Historically, electronic manuscripts contained procedural control codes (markup) that caused the text to be formatted in a particular way • tj6 • troff • TeX
Procedural Markup • Advantages: • Instructs agent how to process text • Generally concerned with formatting and presentation • Is “efficient” because it requires little further interpretation • Disadvantages: • Often specific to one proprietary processing system • Usually ties a document to a single purpose • printing on paper • viewing on a screen • Provides no information on “meaning”
Markup Steps • Author first analyzes the information structure and other attributes of the document; that is, s/he identifies each meaningful separate element, and characterizes it as a paragraph, heading, ordered list, footnote, or some other element type • Author then determines, from memory or a style book, the processing instructions (“marks”) that will produce the format desired for that type of element • Finally, s/he inserts the chosen marks into the text
Example Specific Coding
.SK 1
Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called "markup", serves two purposes:
.TB 4
.OF 4
.SK 1
1.#Separating the logical elements of the document; and
.OF 4
.SK 1
2.#Specifying the processing functions to be performed on those elements.
.OF 0
.SK 1
Key to the control codes: .TB = TaB stop • .OF = OFfset • .SK = SKipping vertical space
Generic Coding • In contrast, generic (or generalized, or descriptive) coding uses descriptive tags (e.g., “heading”) • Scribe • LaTeX • HTML
Descriptive Markup • Advantages: • Identifies the logical components of a document • Generally concerned with what text is • Does not specify what procedures are to be applied to text • Therefore requires that other process(es) supply formatting and presentation
Descriptive Markup • Advantages (continued): • Is (usually) human and machine readable • Identifies information content • Is not directed towards a particular purpose or rendition of the document • Therefore can be non-proprietary
Markup Steps • Author first analyzes the information structure and other attributes of the document; that is, s/he identifies each meaningful separate element, and characterizes it as a paragraph, heading, ordered list, footnote, or some other element type (same first step as for specific coding) • Author then associates each significant element with the mnemonic tag (“mark”) that s/he feels best characterizes it
Example Generic Coding
<p>
Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called <em>markup</em>, serves two purposes:
<ol>
<li>Separating the logical elements of the document; and
<li>Specifying the processing functions to be performed on those elements.
</ol>
The Case for Generalized Markup • Markup should describe a document's structure and other attributes rather than specify processing to be performed on it, so markup need be done only once and will suffice for all future processing • Markup should be rigorous so that the techniques available for rigorously-defined objects like programs and data bases can be used for processing documents as well
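The "mark up once, process many times" argument can be illustrated with a minimal Python sketch. One descriptively tagged document is handed to two independent renderers. The tag names (`doc`, `heading`, `para`, `list`, `item`), the sample text, and both renderer functions are assumptions made for illustration; they are not part of the lecture.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptively marked-up document: tags say what text *is*,
# not how it should be formatted.
DOC = """<doc>
  <heading>Markup</heading>
  <para>Markup serves two purposes:</para>
  <list>
    <item>Separating the logical elements</item>
    <item>Specifying the processing functions</item>
  </list>
</doc>"""


def render_plain(root):
    """One possible rendition: plain text for a terminal."""
    lines = []
    for el in root:
        if el.tag == "heading":
            lines.append(el.text.upper())
        elif el.tag == "para":
            lines.append(el.text)
        elif el.tag == "list":
            for n, item in enumerate(el, start=1):
                lines.append(f"  {n}. {item.text}")
    return "\n".join(lines)


def render_html(root):
    """A second rendition produced from the same, unchanged source."""
    out = []
    for el in root:
        if el.tag == "heading":
            out.append(f"<h1>{el.text}</h1>")
        elif el.tag == "para":
            out.append(f"<p>{el.text}</p>")
        elif el.tag == "list":
            items = "".join(f"<li>{i.text}</li>" for i in el)
            out.append(f"<ol>{items}</ol>")
    return "\n".join(out)


root = ET.fromstring(DOC)
print(render_plain(root))
print(render_html(root))
```

Adding a third output format would mean writing a third renderer; the marked-up document itself would not need to change.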
Who Invented Markup? • Specialized markup: ??? • Generalized markup: • Many credit William Tunnicliffe, chairman of the Graphic Communications Association Composition Committee, who presented a talk on the separation of information content of documents from their format during a meeting at the Canadian Government Printing Office, September 1967 • Others credit Stanley Rice, a New York book designer, who proposed the idea of a universal catalog of parameterized editorial structure macros in several articles, e.g., "Editorial Text Structures," Memorandum to Standards Planning and Requirements Committee, ANSI, March 17, 1970
An Early Implementation • At IBM in 1969, Charles Goldfarb, Ed Mosher and Ray Lorie invented Generalized Markup Language (GML) as part of a law office project integrating text editing with information retrieval and page composition • Instead of a simple tagging scheme, GML introduced the concept of a formally-defined document type (DTD = Document Type Definition) with an explicit nested element structure • By 1971, they had developed the first DTD, for the manuals for IBM's “Telecommunications Access Method”, which enabled all the headings of a given head-level to be automatically formatted identically • Productized in 1973 in IBM’s Document Composition Facility (DCF)
Example GML
:h1.Chapter 1: Introduction
:p.GML supported hierarchical containers, such as
:ol
:li.Ordered lists (like this one),
:li.Unordered lists, and
:li.Definition lists
:eol.
as well as simple structures.
:p.Markup minimization (later generalized and formalized in SGML), allowed the end-tags to be omitted for the "h1" and "p" elements.
SGML = Standard GML • Standardization effort started in 1978, when ANSI (American National Standards Institute) created the Computer Languages for the Processing of Text Committee • Series of draft standards 1980-1986 (1983 version adopted by IRS and DoD); ISO (International Organization for Standardization) joined the ANSI effort in 1984 • Final international standard in 1986, based in part on an SGML system developed by Anders Berglund, then of the European Particle Physics Laboratory (CERN) • Hmm… isn’t CERN where Tim Berners-Lee invented the “World Wide Web” in 1989?
SGML • A metalanguage (grammar) • How to write tags, how to define the document structure • Structural paradigm is that of • an inverted tree structure, a root component branching out into leaves • or a series of nested containers • Defines three kinds of objects • Elements are the basic structural components • Attributes are qualities of elements • Entities are a short representation of special characters
SGML Pro and Con • Advantages: • Documents held in a standards-based, non-proprietary, platform-independent storage format • Scope for document re-use and re-presentation, enhancement of retrieval possibilities • Easy to process • Can (optionally) validate against DTDs • Disadvantages: • Remained a niche market in the 1980s, unknown to the masses • Not well supported by the major document processing vendors, tools expensive
Then Came the Web… • HyperText Markup Language (HTML) is derived from SGML • As an SGML-compliant language, it has a DTD with a fixed set of tags • Initially, the number of tags was very limited (~10), and they were very easy to remember and to use
HTML Example
<html>
  <head>
    <title> My title </title>
  </head>
  <body>
    <h1> A huge heading </h1>
    <h2> A smaller one </h2>
    <ul>
      <li> a list item in <b>bold</b> </li>
      <li> a list item in <i>italics</i> </li>
    </ul>
    <p> A paragraph </p>
  </body>
</html>
Another HTML Example • From original IETF Internet Draft for HTML
See <A HREF="http://info.cern.ch/">CERN</A>'s information for more details.
A <A NAME=serious>serious</A> crime is one which is associated with imprisonment.
The Organization may refuse employment to anyone convicted of a <a href="#serious">serious</A> crime.
Warning: <IMG SRC="triangle.gif" ALT="Warning:"> This must be done by a qualified technician.
<A HREF="Go"><IMG SRC="Button"> Press to start</A>
HTML Pro and Con • Advantages • Simple to learn and to use • Easy to create from scratch or by converting legacy text files • Easy to parse and render • Drawbacks • Syntaxless • Much more a presentation language than a structural language • Too limited, not a good substitute for a word processor
HTML History • 1990: First implementation by Tim Berners-Lee (TBL) on a NeXT computer at CERN • Used SGML tools to create original HTML language (DTD, parser) • Scalability and simplicity of HTML (and HTTP), compared to OHS or Gopher, were part of the basis for the WWW's success • 1991-1992: Various text-only and graphical browsers developed, the latter usually platform-specific
HTML History • 1993: NCSA Mosaic • First widely available graphical WWW browser (Unix X-Windows and Mac) • Developed primarily by UIUC undergraduate Marc Andreessen • The killer application of the Internet is born and the number of Web servers explodes • 1994: Competition • Mosaic team leaves NCSA to found Netscape • Microsoft adopts the Web (Internet Explorer bundled with Windows 95) • Divergence of supported HTML tags between Internet Explorer and Netscape -> browser wars • HTTP traffic becomes more common than telnet and ftp
HTML History • 1994-1995: HTML 2.0 adds image maps, forms • 1995 and beyond: Commercial websites • Java development started (as “Oak”) for programming set-top boxes in 1991, BIG FAILURE; but launched on the Web in March 1995 (in HotJava) and May 1995 (in Netscape), BIG SUCCESS • Amazon.com opens in July 1995 • “dot com” era begins (and soon ends)
HTML History • Jan 1997: HTML 3.2 adds tables, applets, text flow around images, superscripts and subscripts • Dec 1997: HTML 4.0 adds frames, cascading style sheets, more multimedia options, scripting languages, web accessibility conventions, internationalization
XHTML = eXtensible HyperText Markup Language • XHTML 1.0 W3C Recommendation January 2000, revised August 2002 (XHTML 1.1 Recommendation May 2001; its second edition still a working draft) • Makes element and attribute names case-sensitive (in particular, use lowercase) • Include end tags, e.g., <p> … </p> • Add a “/” to empty elements, e.g., <br/> and <hr/> • Quote all attribute values, e.g., <img src="duck.jpg" alt="A Duck"/> • Most browsers still work fine with older HTML
Where did the “X” come from? • XML = eXtensible Markup Language • XHTML is a reformulation of HTML 4.x in XML • XHTML can be used in conjunction with other XML vocabularies • SMIL (Synchronized Multimedia Integration Language) • SVG (Scalable Vector Graphics) • MathML (Mathematical Markup Language) • Plus hundreds dedicated to specific applications (the extensible part)
What is XML for? • The universal markup format for structured documents and data on the Web • For data exchange (messages) and persistent data • Syntax • Data Modeling • Data Processing
XML History • XML 1.0 became a W3C Recommendation in February 1998, revised several times, most recently September 2006 • XML 1.1 draft released Nov 2003, recommendation last revised September 2006 (addresses various issues with respect to Unicode and mainframe compatibility) • Conceptually an SGML descendant • Unlike SGML, it quickly became widespread
SGML->XML • Like SGML, XML is a grammar (or a metalanguage), NOT a specific language • Specification simplified • SGML spec ~600 pages • XML spec 36 pages (initial 1.0) -> 54 pages (1.1 2nd edition) • Parsing made simpler through two-level mechanism • Well-formed • Valid
Well-Formed • (Optionally) starts with XML declaration <?xml version="1.0"?> • Rest of document inside the root element <myroot>…</myroot> • All text contained in some element <someelement>text text text</someelement> • Explicit empty elements <anotherelement></anotherelement> <anotherelement/>
Well-Formed • Element tags must be properly nested (no crossing tags) • NO: <i><b>blah blah blah</i></b> • Start and end tags must match exactly (same case) • Quotes placed around all attribute values <a href="stuff.html">stuff</a>
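The well-formedness rules can be tried out with any conforming XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions. It checks well-formedness only, not validity against a DTD or Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# (sample, expected result) pairs, one per rule from the slides.
SAMPLES = [
    ('<?xml version="1.0"?><myroot>text text text</myroot>', True),
    ("<myroot><i><b>blah</b></i></myroot>", True),    # properly nested
    ("<myroot><i><b>blah</i></b></myroot>", False),   # crossing tags
    ("<myroot><P>text</p></myroot>", False),          # case mismatch
    ('<a href="stuff.html">stuff</a>', True),         # quoted attribute
    ("<a href=stuff.html>stuff</a>", False),          # unquoted attribute
    ("<anotherelement/>", True),                      # explicit empty element
    ("<a/><b/>", False),                              # more than one root
    ("text text text", False),                        # text outside any element
]

for text, expected in SAMPLES:
    print(is_well_formed(text), expected, text)
```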
Valid • Well-formed, plus • Conforms to a DTD or Schema • tags and attributes are all declared • tags and attributes are used correctly • XML browsers and editors usually require validity • Other tools might not (e.g., search engines)
XML Goes Beyond Document Processing • XML is more oriented to distributed computing than to document markup • Thus complements rather than replaces HTML (or XHTML) • DOM = Document Object Model • SAX = Simple API for XML • SOAP = Simple Object Access Protocol • Web Services
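To make two of these acronyms concrete, here is a small sketch contrasting the two processing styles with Python's standard library: DOM builds the whole document as an in-memory tree that can be navigated freely, while SAX reports a stream of events as the parser reads. The sample document and the `TagCollector` class are assumptions for illustration only.

```python
import xml.dom.minidom
import xml.sax

XML = "<A><B>foo</B><C>bar</C><C>psl</C></A>"

# DOM style: parse everything first, then walk the resulting tree.
dom = xml.dom.minidom.parseString(XML)
dom_names = [
    node.tagName
    for node in dom.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]


# SAX style: the parser calls back into a handler for each start tag.
class TagCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


collector = TagCollector()
xml.sax.parseString(XML.encode("utf-8"), collector)

print(dom_names)        # children of the root element, in document order
print(collector.names)  # every start tag, in the order encountered
```

The DOM style suits documents small enough to hold in memory; the event style avoids building the tree at all, which matters for large messages.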
Let’s Reinvent XML • Someone in the far future sends a message in a virtual bottle, containing parts of the universal library of human and post-human literature, back into the 1970s when ... • … the Web, XML, P2P, Java were unheard of • ... computer manufacturers talked about MIPS and kilobytes • … music was played by rotating vinyl discs under a diamond-tip stylus or on cassette tapes
… and Microsoft looked like [image not preserved in this transcript]
The Message in the Bottle, 1st try ÐÏ^Qࡱ^Zá^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>^@^C^@þÿ^@^F^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@#^@^@^@^@^@^@^@^@^P^@^@%^@^@^@^A^@^@^@þÿÿÿ^@^@^@^@"^@^@^@ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á^@q^@^D^@^@^@^R¿^@^@^@^@^@^@^P^@^@^@^@^@^D^@^@Ç^G^@^@^N^@bjbjt+t+^@^@^@ ^@Some Quotations from the Universal Library^M1 Famous Quotes^M1.1 By William I^M[2, Sonnet XVIII]^MShall I compare thee to a summer's day?^MThou art more lovely and more temperate.^MRough winds do shake the darling buds of May,^MAnd summer's lease hath all too short a date.^MSometime too hot the eye of heaven shines,^MAnd often is his gold complexion dimmed.^MAnd every fair from fair some declines,^MBy chance or nature's changing course untrimmed.^MBut thy eternal summer shall not fade,^MNor lose possession of that fair thou owest,^MNor shall Death brag thou wander'st in his shade^MWhile in eternal lines to time thou growest.^MSo long as men can breathe, or eyes can see,^MSo long live this, and this gives life to thee.^M1.2 ^M[2] W. Shakespeare. The Sonnets of Shakespeare.609.^M^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ Kaiser: COMS E6125
The Message in the Bottle, 2nd try
\documentclass{article}
\begin{document}
\title{Some Quotations from the Universal Library}
...
\section{Famous Quotes}
\subsection{By William I}
\textbf{\cite[Sonnet XVIII]{shakespeare-sonnets-1609}}
\begin{verse}
Shall I compare thee to a summer's day?\\
Thou art more lovely and more temperate. \\
Rough winds do shake the darling buds of May, \\
…
\end{verse}
\bibliographystyle{abbrv}
\bibliography{msg}
\end{document}
The Message in the Bottle, finally
<?xml version="1234.56"?>
<universal_library>
  <books>
    <book>
      <title>Some Quotations from the Universal Library</title>
      <section>
        <title>Famous Quotes</title>
        <subsection>
          <title>By William I</title>
          <quote bibref="shakespeare-sonnets-1609">
            <title>Sonnet XVIII</title>
            <verse>
              <line>Shall I compare thee to a summer's day?</line>
              <line>Thou art more lovely and more temperate. </line>
              <line>Rough winds do shake the darling buds of May, </line>
              …
            </verse>
          </quote>
        </subsection>
      </section>
    </book>
    …
  </books>
</universal_library>
XML as a Self-Describing Data Exchange Format • Someone from the 1970s receives the message in the virtual bottle, and it … • … can be easily “understood” (even using CP/M & edlin) • … can be parsed easily • … allows the application programmer to rediscover schema and semantics (sort of…) • … may include an explicit schema description • … allows separation of marked-up content from presentation
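The claim that a recipient can "rediscover schema" (sort of) can be sketched as follows: without any DTD, a program walks the message and records which child elements and attributes each element name was seen with. The `MESSAGE` text is an abridged copy of the bottle message with the elided parts dropped and every element closed explicitly; `infer_structure` is a hypothetical helper. The result describes only this one instance, which is why the slide says "sort of".

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

MESSAGE = """<universal_library>
  <books>
    <book>
      <title>Some Quotations from the Universal Library</title>
      <section>
        <title>Famous Quotes</title>
        <subsection>
          <title>By William I</title>
          <quote bibref="shakespeare-sonnets-1609">
            <title>Sonnet XVIII</title>
            <verse>
              <line>Shall I compare thee to a summer's day?</line>
              <line>Thou art more lovely and more temperate.</line>
            </verse>
          </quote>
        </subsection>
      </section>
    </book>
  </books>
</universal_library>"""


def infer_structure(root):
    """Map each element name to the child names and attribute names seen."""
    children = defaultdict(set)
    attributes = defaultdict(set)
    for el in root.iter():
        attributes[el.tag].update(el.attrib)  # adds the attribute names
        for child in el:
            children[el.tag].add(child.tag)
    return children, attributes


children, attributes = infer_structure(ET.fromstring(MESSAGE))
for name in sorted(children):
    print(name, "->", sorted(children[name]))
```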
XML Anatomy
<bibliography>
  <paper ID="goto">
    <authors>
      <author>Edsger W. Dijkstra </author>
    </authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>
Callout labels on the slide: element name • element • attribute name • attribute value (attributes cannot contain elements) • element content • character content • number content • empty element
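The parts named on this slide map directly onto what a parser exposes. A minimal sketch with Python's `xml.etree.ElementTree`, using the bibliography example with straight quotes; the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

BIB = """<bibliography>
  <paper ID="goto">
    <authors>
      <author>Edsger W. Dijkstra</author>
    </authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>"""

root = ET.fromstring(BIB)
paper = root.find("paper")

element_name = paper.tag                      # element name
attribute_value = paper.get("ID")             # value of the ID attribute
title_text = paper.findtext("title")          # character content
author = paper.findtext("authors/author")     # content of a nested element

# "Number content" arrives as text; the application converts it.
year = int(paper.findtext("year"))

# An empty element has no child elements and no text.
full_paper = paper.find("fullPaper")
is_empty = len(full_paper) == 0 and full_paper.text is None

print(element_name, attribute_value, title_text, author, year, is_empty)
```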
Perspectives on XML • Document (SGML) Community • data = linear text documents • markup (annotate) text to describe context, structure, semantics • Database Community • XML as a prominent example of the semi-structured data model • captures the whole spectrum from highly structured, regular data to unstructured data • “XML is the cure for your data exchange, information integration, e-commerce, … problems” (also cures baldness, lose 28 pounds in 14 days, get rich quick, …)
Pure XML - Instance Model • XML 1.0 implicit data model (infoset): • nested containers ("boxes within boxes") • labeled ordered trees (= semistructured data model) • relational and object-oriented data are easy to encode
<A> <B>foo</B> <C>bar</C> <C>psl</C> </A>
[Figure: the same document drawn as nested boxes and as a labeled tree with root A and children B: "foo", C: "bar", C: "psl"; children are ordered]
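A short sketch of the instance model in Python: the document is read into a labeled ordered tree of nested tuples, and, going the other way, rows of relational data are encoded as XML. The helpers `to_tree` and `rows_to_xml`, and the `row` wrapper element, are assumptions chosen for illustration; other encodings of relational data are equally possible.

```python
import xml.etree.ElementTree as ET


def to_tree(el):
    """Return (label, children) for inner nodes or (label, text) for leaves.

    Children are kept in a list, so document order is preserved.
    """
    kids = [to_tree(child) for child in el]
    return (el.tag, kids if kids else (el.text or ""))


def rows_to_xml(table, rows):
    """Encode a list of dict rows as <table><row><col>value</col>...</row></table>."""
    root = ET.Element(table)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for column, value in row.items():
            ET.SubElement(row_el, column).text = str(value)
    return root


tree = to_tree(ET.fromstring("<A><B>foo</B><C>bar</C><C>psl</C></A>"))
print(tree)

encoded = ET.tostring(
    rows_to_xml("paper", [{"title": "X", "year": 1968}]), encoding="unicode"
)
print(encoded)
```

Note that the two `C` children stay distinct and ordered, which a plain name-to-value dictionary could not represent.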
Identifying Vocabularies • My element may not be your element: • geometry context: <element>line</element> • chemistry context: <element>oxygen</element>
Identifying Vocabularies • An XML Schema (with XML 1.1) defines a vocabulary of names of type definitions, element and attribute declarations [Schema ~= new improved DTD] • Use XML Namespaces (with XML 1.1) to identify which vocabulary • Simple method for qualifying element and attribute names used in XML documents • Useful when a single XML document contains elements and attributes that are defined for and used by multiple software modules
Namespace Scoping • XML namespaces are declared with an xmlns attribute, which can associate a prefix with the namespace • The declaration is in scope for the element containing the attribute and all its descendants
<html:html xmlns:html='http://www.w3.org/1999/xhtml'>
  <html:head>
    <html:title>Frobnostication</html:title>
  </html:head>
  <html:body>
    <html:p>Moved to <html:a href='http://frob.example.com'>here.</html:a></html:p>
  </html:body>
</html:html>
Namespace Defaulting
<?xml version="1.1"?>
<!-- elements are in the HTML namespace, in this case by default -->
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <title>Frobnostication</title>
  </head>
  <body>
    <p>Moved to <a href='http://frob.example.com'>here</a>.</p>
  </body>
</html>
Multiple Namespaces • All element types are prefixed
<bk:book xmlns:bk='urn:loc.gov:books'
         xmlns:isbn='urn:ISBN:0-395-36341-6'
         xmlns:money='urn:Finance:AllAboutMoney'>
  <bk:title>Cheaper by the Dozen</bk:title>
  <isbn:number>1568491379</isbn:number>
  <bk:price money:currencySymbol="$">99.99</bk:price>
</bk:book>
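How a namespace-aware parser sees these documents can be sketched with Python's `xml.etree.ElementTree`, which expands every qualified name to the form `{namespace-URI}localname`. The prefixes in the `NS` lookup table (`b`, `i`) are deliberately different from those in the document, to show that only the URI identifies the vocabulary; that table and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

BOOK = """<bk:book xmlns:bk='urn:loc.gov:books'
         xmlns:isbn='urn:ISBN:0-395-36341-6'
         xmlns:money='urn:Finance:AllAboutMoney'>
  <bk:title>Cheaper by the Dozen</bk:title>
  <isbn:number>1568491379</isbn:number>
  <bk:price money:currencySymbol="$">99.99</bk:price>
</bk:book>"""

book = ET.fromstring(BOOK)

# Prefixes are local shorthand; lookups go by namespace URI.
NS = {"b": "urn:loc.gov:books", "i": "urn:ISBN:0-395-36341-6"}
title = book.find("b:title", NS).text
number = book.find("i:number", NS).text
price = book.find("b:price", NS)
symbol = price.get("{urn:Finance:AllAboutMoney}currencySymbol")

# Defaulting: an unprefixed element takes the default namespace in scope.
DEFAULTED = (
    "<html xmlns='http://www.w3.org/1999/xhtml'>"
    "<head><title>Frobnostication</title></head></html>"
)
html = ET.fromstring(DEFAULTED)

print(book.tag)
print(title, number, symbol, price.text)
print(html.tag, html[0].tag)
```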