XML Tutorial
Outline • Today’s web: Created by hand for-eyes-only • Can HTML become smarter? • SGML -> XML • The next generation web: XML and component-based commerce • Prologue: XML and EDI
A Web Created by Hand for Eyes • Much of the web is “hand-crafted” • HTML often exploited and extended to achieve specific layout and formatting • HTML has too low an “Information IQ” to enable many desirable applications
The Limits of “Hand-crafting” • Time to convert a word processing document and apply HTML markup, by markup speed (minutes/page)
Number of Pages    1 min/page     10 min/page    60 min/page
10                 10 minutes     100 minutes    10 hours
100                100 minutes    16.67 hours    12.5 days
1000               16.67 hours    20.83 days     4.17 months
10000              20.83 days     6.94 months    3.47 years
100000             6.94 months    5.79 years     34.72 years
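The figures in this table can be reproduced with simple arithmetic. The sketch below assumes 8-hour working days, 30-day months, and 12-month years, which is what the slide's numbers imply (100 hours is shown as 12.5 days). The thresholds for switching units are a guess made only to match the table's formatting.

```python
MINUTES_PER_HOUR = 60
HOURS_PER_WORKDAY = 8   # assumption inferred from the table: 100 hours = 12.5 days
DAYS_PER_MONTH = 30     # assumption inferred from the table: 125 days = 4.17 months
MONTHS_PER_YEAR = 12


def _fmt(value, unit):
    """Round to two decimals and drop trailing zeros, e.g. 12.50 -> '12.5'."""
    text = f"{value:.2f}".rstrip("0").rstrip(".")
    return f"{text} {unit}"


def conversion_time(pages, minutes_per_page):
    """Total hand-markup time, expressed in the largest convenient unit."""
    minutes = pages * minutes_per_page
    if minutes < 2 * MINUTES_PER_HOUR:
        return _fmt(minutes, "minutes")
    hours = minutes / MINUTES_PER_HOUR
    if hours < 24:
        return _fmt(hours, "hours")
    days = hours / HOURS_PER_WORKDAY
    if days < DAYS_PER_MONTH:
        return _fmt(days, "days")
    months = days / DAYS_PER_MONTH
    if months < MONTHS_PER_YEAR:
        return _fmt(months, "months")
    return _fmt(months / MONTHS_PER_YEAR, "years")


if __name__ == "__main__":
    for pages in (10, 100, 1000, 10000, 100000):
        print(pages, [conversion_time(pages, rate) for rate in (1, 10, 60)])
```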
Low vs. High “IQ” Encoding • What information can be encoded? • How adaptable or flexible is the format for encoding style, structure, or markup? • Can the format tell you what it encodes? • ASCII is very low IQ: only character info • SGML is highest IQ: encodes anything and completely specifies the encoding rules • PDF? HTML?
HTML is too low in IQ • HTML was designed as a simple markup language • simple structures: headings, lists, links • strong emphasis on formatting • weak for encoding content • HTML wasn’t designed to encode the structure and semantics needed for complex applications
Web Applications That Need “Smarter” Data • Data interchange between Web clients • Moving processing from server to client • Multiple client-side views w/o new data • “Information push” from personalized applications
Can HTML be made smarter? • Create new tags used by your application, or use <META>, DIV, and CLASS (and hope they don’t interfere elsewhere) • Use a “standard” metadata model (but which one? Dublin Core, PICS, P3, OPS,…) • Hide applet code in comments (platform dependent?) • Hack, hack, hack...
Inherent Limitations of HTML • Not extensible • Limited capability to encode structure • No validation • Lossy interchange
XML • Extensible Markup Language - a standard way of creating markup languages for the Web • a file format for data representation • a schema for describing data or message structures • a mechanism for extending and annotating HTML with semantic information • XML is a simplification of SGML, the Standard Generalized Markup Language • easier to understand and implement
HTML Apartment Listing
<HTML>
<HEAD>
<TITLE>An Apartment For Rent</TITLE>
</HEAD>
<BODY>
<H1>Apartment</H1>
<P>1800 square feet, 3 bedrooms, 7 baths.
<H2>No pets, smoking forbidden!</H2>
<H3>Amenities:</H3>
<P> Sunny location, good view, has air-conditioner.
<H3>Location</H3>
<P>2008 South E. Avenue, Eureka, CA
<H3>Cost, Etc.</H3>
<P>Price: $3600 a month
<P>Contact: (415) 123-4567
<P>Available immediately
<P>This offer posted 1 August 1997 in the Eureka Daily Times
</BODY>
</HTML>
An XML Apartment Listing
<?xml version="1.0"?>
<!DOCTYPE LISTING SYSTEM "APTLISTING.DTD">
<LISTING>
  <ADINFO>
    <POSTED>March 26, 1997</POSTED>
    <WHERE_POSTED>Belmont Courier</WHERE_POSTED>
    <CONTACT>(650) 111-2222</CONTACT>
  </ADINFO>
  <DESCRIPTION>
    <AREA>1400 SQUARE FEET</AREA>
    <AMENITIES>1 bedroom, 1 bathroom</AMENITIES>
    <COMMENT>Small cottage in a big forest</COMMENT>
  </DESCRIPTION>
  <POLICIES>
    <PETS>Not allowed</PETS>
    <BOZOS>Not allowed</BOZOS>
  </POLICIES>
  <COST>$875</COST>
</LISTING>
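Because each fact in the XML listing sits in a named element, a program can read it directly instead of guessing from headings and paragraphs. The sketch below uses Python's standard `xml.etree.ElementTree`. The `read_listing` helper and its dictionary keys are illustrative names, and the XML declaration and DOCTYPE are left out because the DTD file is not available here.

```python
import xml.etree.ElementTree as ET

# The listing from the slide, without the declaration and DOCTYPE lines.
LISTING_XML = """<LISTING>
  <ADINFO>
    <POSTED>March 26, 1997</POSTED>
    <WHERE_POSTED>Belmont Courier</WHERE_POSTED>
    <CONTACT>(650) 111-2222</CONTACT>
  </ADINFO>
  <DESCRIPTION>
    <AREA>1400 SQUARE FEET</AREA>
    <AMENITIES>1 bedroom, 1 bathroom</AMENITIES>
    <COMMENT>Small cottage in a big forest</COMMENT>
  </DESCRIPTION>
  <POLICIES>
    <PETS>Not allowed</PETS>
    <BOZOS>Not allowed</BOZOS>
  </POLICIES>
  <COST>$875</COST>
</LISTING>"""


def read_listing(xml_text):
    """Pull a few fields out of a listing by element name (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        "posted": root.findtext("ADINFO/POSTED"),
        "contact": root.findtext("ADINFO/CONTACT"),
        "pets": root.findtext("POLICIES/PETS"),
        # Strip the currency sign so the cost can be compared or sorted.
        "monthly_cost": int(root.findtext("COST").lstrip("$")),
    }


if __name__ == "__main__":
    print(read_listing(LISTING_XML))
```

Getting the same price out of the HTML version would mean searching the text of a `<P>` for something that looks like a dollar amount.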
But First: One Minute SGML • Standard Generalized Markup Language, ISO 8879 • SGML defines the “markup language” that specifies the logical rules for a given type of document • Markup transforms a flat stream of text into a set of objects or elements that can be manipulated by other applications • Since there is no “universal tag set” that can describe all documents, SGML provides the means for defining the tag set that meets your needs
SGML’s Big Idea: Document Types • Idea of document type easy to understand • The Document Type Definition or DTD defines: • the class of documents that shares a common information model • permissible elements and attributes, their contents, the order in which they occur • The DTD is the “document schema” that makes an instance “self-describing” • From a DTD a parser can be generated to test any document for conformance
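The idea of testing a document against its type can be sketched in a few lines. This is not a DTD parser: `CONTENT_MODEL` is a hypothetical, much-simplified stand-in in which every element lists the exact ordered sequence of children it must contain. A real DTD also expresses optional, repeated, and alternative content, as well as attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: element name -> required child sequence.
# An empty list means the element holds text only.
CONTENT_MODEL = {
    "POLICIES": ["PETS", "BOZOS"],
    "PETS": [],
    "BOZOS": [],
}


def conformance_errors(element):
    """Return a list of problems found in `element` and its descendants."""
    if element.tag not in CONTENT_MODEL:
        return [f"undeclared element <{element.tag}>"]
    errors = []
    expected = CONTENT_MODEL[element.tag]
    actual = [child.tag for child in element]
    if actual != expected:
        errors.append(f"<{element.tag}> contains {actual}, expected {expected}")
    for child in element:
        errors.extend(conformance_errors(child))
    return errors


GOOD = "<POLICIES><PETS>Not allowed</PETS><BOZOS>Not allowed</BOZOS></POLICIES>"
BAD = "<POLICIES><BOZOS>Not allowed</BOZOS><SMOKING>No</SMOKING></POLICIES>"

if __name__ == "__main__":
    print(conformance_errors(ET.fromstring(GOOD)))
    for problem in conformance_errors(ET.fromstring(BAD)):
        print(problem)
```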
Examples of Document Types • User manuals • Reference manuals • Directories • Newsletters • Brochures • Catalogs • Datasheets • Proposals • Dictionaries • Technical reports • Contracts • Regulations • Policies and procedures • Journal articles • Textbooks • Purchase orders • Invoices • Recipes
HTML as a Document Type • HTML can be described as an application of SGML - the HTML document type • Simple structures: headings, lists, links • Strong emphasis on formatting, weak for encoding content • Not designed to encode the content distinctions for any particular industry or application • But most HTML doesn’t conform to the HTML DTD
Designing a DTD • Determine information requirements, purposes, uses (and their priorities) • deliver in one or more print and online formats • create new information products • interchange with other authors or publishers • integrate information into equipment • meet company, industry, customer standards
Designing a DTD • Determine process, tool, external constraints or standards • Identify and name information components and component containers • Create categories to organize the components • Determine when, where, how often components appear
Designing a DTD • Identify “meta-information” to augment the information components • bibliographic information • process and workflow-related information • Describe the component hierarchy in a graphic notation to visualize it • Transcribe the graphic notation into formal syntax • Test the analysis on sample documents • Document the process and the results
SGML: Close, but no Cigar • SGML has been successful in niches, but hasn’t been adopted by rank-and-file Web publishers • “the quiet revolution” • “the million dollar secret” • Perceived as too complex (because of features dating from keystroke-minimizing origins) • Small vendors didn’t have the clout to legitimize SGML in the mass market (but some of them cleverly “dumbed-down” their tools for HTML)
XML: Right Place, Right Time • Looks like HTML++, but acts like SGML-- • Backed by: • World Wide Web Consortium (W3C) • Sun - “give Java something to do” • Microsoft - with great enthusiasm • Netscape - with less enthusiasm • SGML tool vendors and consultants • Innovators in EDI community
Specific XML Proposals to Simplify SGML • All elements have start and end tags • All attributes are: name="value" • Changed syntax for EMPTY elements • <toc> => <toc/> • <graphic file="x.gif"> => <graphic file="x.gif"/> • No & connector in content models • No inclusions and exclusions • DTD not necessary because it can be inferred if instance is “well-formed”
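These rules are what let a parser check a document without any DTD. A minimal sketch, using the non-validating parser in Python's standard library; `is_well_formed` is an illustrative helper name:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        '<doc><toc/><graphic file="x.gif"/></doc>',  # XML empty-element syntax
        '<doc><graphic file="x.gif"></doc>',         # SGML-style EMPTY tag, never closed
        '<doc><graphic file=x.gif/></doc>',          # attribute value not quoted
    ]
    for sample in samples:
        print(is_well_formed(sample), sample)
```

Only the first sample passes. The other two use shortcuts that SGML permits when a DTD is present but XML rules out.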
XML Adoption Scenarios • The transition from the “Web for eyes” to the “automated Web” • 1st generation: XML leaves HTML alone • 2nd generation: HTML as output format created from XML instance • 3rd generation: XML repositories
1st Generation XML • No disruption of existing HTML production processes • XML production process may have nothing to do with HTML production process • XML for processes, HTML for eyes, but XML and HTML can be linked together
1st Generation XML Leaves HTML as is • [Diagram, creation to delivery: data source -> conversion to XML -> XML; data source -> conversion to HTML -> HTML “for eyes”]
2nd Generation XML • Creation of XML is primary process • Replace “hand-crafted” HTML with automated down-translation • Alternatively, use XML style sheet to create HTML-like presentation(s) • “instance at a time” retargeting
Up & Down Translation • Content/structure-based text objects: SGML, XML, databases • Formatted electronic text: HTML, word processing files • Unstructured electronic text: ASCII • Printed text • [Diagram arrows: moving up the list takes more structure (energy); each level is easier to translate to the ones below it]
2nd Generation XML Restores Order • [Diagram: data source -> conversion to XML -> XML source; XML source -> down translate -> HTML; XML source -> down translate -> HDML; XML source -> XML style sheet(s) -> “HTML-like” presentation]
HTML as an Output Format • Treating HTML as an output format generated from an SGML source repository insulates you from ongoing changes to HTML and the latest proprietary extensions • HTML created by “down translation” can be richer in structure and more consistent than HTML created by hand at many times the cost
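A down-translation can be as small as a function that reads named elements and writes presentation markup. The sketch below is one possible shape, not a prescribed tool: `down_translate` and the page layout are assumptions, and the input is a trimmed copy of the apartment listing shown earlier.

```python
import xml.etree.ElementTree as ET
from html import escape

SOURCE_XML = """<LISTING>
  <ADINFO><CONTACT>(650) 111-2222</CONTACT></ADINFO>
  <DESCRIPTION>
    <AREA>1400 SQUARE FEET</AREA>
    <AMENITIES>1 bedroom, 1 bathroom</AMENITIES>
  </DESCRIPTION>
  <COST>$875</COST>
</LISTING>"""


def down_translate(listing_xml):
    """Generate an HTML page from one XML listing (hypothetical layout)."""
    listing = ET.fromstring(listing_xml)

    def text(path):
        # Escape content so characters such as & or < stay valid in HTML.
        return escape(listing.findtext(path, default=""))

    lines = [
        "<HTML>",
        "<HEAD><TITLE>An Apartment For Rent</TITLE></HEAD>",
        "<BODY>",
        "<H1>Apartment</H1>",
        f"<P>{text('DESCRIPTION/AREA')}, {text('DESCRIPTION/AMENITIES')}</P>",
        f"<P>Price: {text('COST')} a month</P>",
        f"<P>Contact: {text('ADINFO/CONTACT')}</P>",
        "</BODY>",
        "</HTML>",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(down_translate(SOURCE_XML))
```

If the HTML dialect changes, only the template inside `down_translate` changes; the XML source is untouched.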
3rd Generation XML • reuse, not just retargeting • XML a first-class citizen from the start • content-oriented DTD • native authoring, or enhanced markup by editorial or production staff • no longer file at a time, create db and work on it • support for custom applications
3rd Generation XML Repository • [Diagram: Inputs 1-4 -> “up-translation” or decomposition -> XML repository -> “down-translation” or assembly -> Outputs 1-4]
Retargeting and Reuse Requirements • different delivery channels • Web • CD-ROM, CD-ROM + Web hybrids • Braille, large print, voice synthesis (ICADD) • different “dialects” of HTML for different browsers or bandwidths or as HTML changes • different applications (“slice and dice”) • reference manual vs help vs tutorial
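Retargeting and “slice and dice” reuse can be pictured as two independent choices applied to one repository: which elements to select, and which channel to render them for. In the sketch below the `<LISTINGS>` wrapper, the flat element layout, the channel names, and the renderers are all assumptions for illustration. The values are taken from the two apartment listings shown earlier.

```python
import xml.etree.ElementTree as ET
from html import escape

# A tiny stand-in for a repository holding several listings.
REPOSITORY_XML = """<LISTINGS>
  <LISTING>
    <COMMENT>Small cottage in a big forest</COMMENT>
    <COST>$875</COST>
    <CONTACT>(650) 111-2222</CONTACT>
  </LISTING>
  <LISTING>
    <COMMENT>Sunny location, good view</COMMENT>
    <COST>$3600</COST>
    <CONTACT>(415) 123-4567</CONTACT>
  </LISTING>
</LISTINGS>"""


def render_html(fields):
    """Web channel: one paragraph per listing."""
    return "<P>" + ", ".join(escape(value) for _, value in fields) + "</P>"


def render_text(fields):
    """Plain-text channel, e.g. as input to large print or voice synthesis."""
    return "; ".join(f"{name}: {value}" for name, value in fields)


CHANNELS = {"web": render_html, "text": render_text}


def assemble(repository_xml, wanted, channel):
    """Select the wanted elements from every listing and render them."""
    render = CHANNELS[channel]
    root = ET.fromstring(repository_xml)
    return [
        render([(tag, listing.findtext(tag, default="")) for tag in wanted])
        for listing in root.iter("LISTING")
    ]


if __name__ == "__main__":
    print(assemble(REPOSITORY_XML, ["COMMENT", "COST"], "web"))
    print(assemble(REPOSITORY_XML, ["COST", "CONTACT"], "text"))
```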
XML for the Web’s “Little Languages” • CDF -- “channel definition format”, eliminates need for proprietary “push” plug-in • OSD -- “open software description”, for describing configurations for automated distribution of software • PICS -- for content ratings • RDF -- “resource description framework”, merging Netscape and Microsoft metadata initiatives • CBL -- common business language in eCo framework
The Next-Generation Web • Problem: The Web is eyeballs-only -> Solution: Metadata and Object APIs -- “self-describing smart Web” • Problem: No content encoding -> Solution: Web catalogs and documents in their “native schema” • Problem: Things can’t be found -> Solution: Distributed registries and structure-based retrieval • Problem: No automation of tasks -> Solution: Agent-based run-time environment
The Internet Today • [Diagram of Internet resources: two databases, an FTP server, two applications, and three Web servers with documents]
A Commerce Type Definition (CTD)
<!DOCTYPE Taxonomy PUBLIC "-//CommerceNet//DTD Taxonomy V1.0//EN">
<Taxonomy>
  <Head>
    <Label>United Airlines</Label>
    <Version>1.0</Version>
    <Base>World Airline Registry:1.1:2.3.7</Base>
    <Registry>toe.commerce.net:2111</Registry>
  </Head>
  <Body>
    <Services>
      <Passenger_Flight_Information>
        <Flight_Number>UA #200</Flight_Number>
        <Flight_Price_US>$168.50</Flight_Price_US>
        <Flight_Dest>Honolulu, Hawaii</Flight_Dest>
      </Passenger_Flight_Information>
      <Cargo_Flight_Information>
      </Cargo_Flight_Information>
    </Services>
  </Body>
</Taxonomy>
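A CTD like this one is meant to be read by software, not people. The sketch below shows what a consumer might extract from it with Python's standard library. Two assumptions are made so that the sample parses as XML: the DOCTYPE line is omitted, and the price element is spelled `Flight_Price_US`. The `describe` helper and its dictionary keys are illustrative names.

```python
import xml.etree.ElementTree as ET

CTD_XML = """<Taxonomy>
  <Head>
    <Label>United Airlines</Label>
    <Version>1.0</Version>
    <Registry>toe.commerce.net:2111</Registry>
  </Head>
  <Body>
    <Services>
      <Passenger_Flight_Information>
        <Flight_Number>UA #200</Flight_Number>
        <Flight_Price_US>$168.50</Flight_Price_US>
        <Flight_Dest>Honolulu, Hawaii</Flight_Dest>
      </Passenger_Flight_Information>
      <Cargo_Flight_Information/>
    </Services>
  </Body>
</Taxonomy>"""


def describe(ctd_xml):
    """Summarize who published the CTD and which services it offers."""
    root = ET.fromstring(ctd_xml)
    services = root.find("Body/Services")
    return {
        "provider": root.findtext("Head/Label"),
        "registry": root.findtext("Head/Registry"),
        # Each child of <Services> names one kind of service.
        "services": [service.tag for service in services],
        "flights": [
            {
                "number": flight.findtext("Flight_Number"),
                "price_usd": float(flight.findtext("Flight_Price_US").lstrip("$")),
                "destination": flight.findtext("Flight_Dest"),
            }
            for flight in services.iter("Passenger_Flight_Information")
        ],
    }


if __name__ == "__main__":
    print(describe(CTD_XML))
```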
Step 1: XML Metadata • [Diagram: each database, FTP server, application, and Web server with documents now carries a CTD]
Step 2: Registries • [Diagram: the CTD-tagged databases, FTP server, applications, and Web servers, now with registries added]
Common Business Language (CBL) • Who am I? • Company name, contact, public key certificates • What am I? • Agent/object (API), document (DTD), database (schema) • Available data • Product list, price list, terms and conditions, catalog, order form • Available services • Buy, sell, RFQ, search catalog
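The four CBL questions map naturally onto a small record that a registry can index. The sketch below is purely illustrative: `Description`, `Registry`, their fields, and the second company are hypothetical and do not reflect the actual CBL or eCo specifications.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Description:
    """Hypothetical self-description answering the four CBL questions."""
    who: str                                            # Who am I? (company name)
    what: str                                           # What am I? agent, document, or database
    data: List[str] = field(default_factory=list)       # Available data
    services: List[str] = field(default_factory=list)   # Available services


class Registry:
    """Hypothetical registry that lets others search descriptions by service."""

    def __init__(self):
        self._entries = []

    def register(self, description):
        self._entries.append(description)

    def find(self, service):
        """Names of all registered parties offering the given service."""
        return [entry.who for entry in self._entries if service in entry.services]


def build_example_registry():
    registry = Registry()
    registry.register(Description(
        who="United Airlines", what="database",
        data=["price list", "catalog"], services=["sell", "search catalog"]))
    registry.register(Description(
        who="Example Freight Co.",  # invented name for illustration
        what="agent", data=["order form"], services=["buy", "RFQ"]))
    return registry


if __name__ == "__main__":
    registry = build_example_registry()
    print(registry.find("sell"))
    print(registry.find("RFQ"))
```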
Step 3: CBL Components • [Diagram: the CTD-tagged databases, FTP server, applications, and Web servers, with registries]
Step 4: Agents • [Diagram: the CTD-tagged resources and registries, now with agents added]
Step 5: Business Services • [Diagram: the CTD-tagged resources, registries, and agents, now with Matchmaking Services and Trust Intermediaries]
Wrapping Up • HTML will continue to exist, but most serious publishers will produce HTML and XML versions of their content from the same “smarter” source • XML unifies document and database perspectives and tools for Web publishing and lets them be automated in the same way
Prologue: XML and EDI • XML appeals to the EDI community because: • it reinforces the move to Internet EDI • it suggests a way to make transaction sets easier to define and “self-describing” • But which kind of XML/EDI? • incremental strategy of wrapping existing EDI transactions in XML syntax • radical re-thinking of EDI to create XML “fragments” for transaction components that are dynamically combined as needed
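The incremental strategy can be sketched as a mechanical rewrite: keep the EDI segments and data elements exactly as they are, and express them in XML syntax. The sketch assumes `~` ends a segment and `*` separates data elements, as is common in ANSI X12-style interchanges. The sample transaction is invented, and the `Transaction` wrapper and the naming of elements by segment and position are hypothetical conventions.

```python
import xml.etree.ElementTree as ET

# Invented, X12-like sample: a purchase order header and one line item.
SAMPLE_EDI = "BEG*00*NE*PO-1001~PO1*1*10*EA*9.95~"


def wrap_edi(edi_text, segment_end="~", element_sep="*"):
    """Wrap each EDI segment and data element in an XML element."""
    root = ET.Element("Transaction")
    for raw in edi_text.split(segment_end):
        raw = raw.strip()
        if not raw:
            continue  # skip the empty piece after the final terminator
        segment_id, *values = raw.split(element_sep)
        segment = ET.SubElement(root, segment_id)
        for position, value in enumerate(values, start=1):
            if value:  # empty positions are simply omitted
                child = ET.SubElement(segment, f"{segment_id}{position:02d}")
                child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(wrap_edi(SAMPLE_EDI))
```

The result is well-formed but no more “self-describing” than the original: a name like `PO104` still needs the EDI standard to explain it, which is the motivation for the more radical rethinking on the slide.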
Learning More • The “mother of all information” about XML is the “SGML Home Page” - www.sil.org/sgml/xml.html • Best overall book for managers to get started with SGML and XML is ABCD…SGML by Liora Alschuler • Best overall book for HTML-savvy types is SGML on the Web by Yuri Rubinsky & Murray Maloney