Introduction to the eXtensible Markup Language (XML) Instructor: Joseph DiVerdi, Ph.D., M.B.A.
Background & Context • HTML follows the rules of formal electronic document-markup design & implementation • Born out of the need to • Assemble text, graphics, & other digital content • For transmission over the Internet • HTML v4.01 standard is defined using • Standard Generalized Markup Language • SGML • Adequate for formalizing HTML • Too complex for extending HTML
Background & Context • eXtensible Markup Language • Based on simpler features of SGML • Kinder, gentler, & more flexible • Well-suited for orderly development of new markup languages • HTML is even being reborn as XHTML
Background & Context • With XML there exists a standardized means for defining markup languages • That are customized for different needs • Rather than relying upon HTML extensions • Mathematicians express mathematical notations • Musicians present musical scores • Physicians exchange medical records • Accountants share financial information • All groups need an acceptable, resilient way to express these different kinds of information, so software can be developed to process & display these diverse data
Background & Context • XML provides a solution • Each content sector • business group, trade association, consortium, etc. • can define a markup language • for information exchange & processing over the Web • Programmers can develop parsers • XML-compliant processes • that read new language definitions & • permit a server to process documents in those languages • permit a client to retrieve & display those documents
Background on SGML • Standard Generalized Markup Language • SGML • International standard (ISO 8879) • Published in 1986 • SGML prescribes a standard format for embedding descriptive markup in a document • SGML also specifies a standard method for describing the structure of a document • More important & crucial to its power
SGML Background • SGML allows an author to set up hierarchical models for each type of document produced • SGML forces each element in the structure • Labeled with descriptive markup such as chapter, title & paragraph • To fit in the logical, predictable structure of the document
SGML Background • SGML supports an unlimited variety of document structures • Users typically design a different document structure for each category of information they produce: • information bulletins • technical manuals • parts catalogs • design specifications • reports • letters & memos
SGML Background (cont'd) • SGML allows authors to create documents that are independent of any specific hardware or software • Since SGML documents conform to an international standard • They are portable • They can be exchanged seamlessly with users who have different systems
How does SGML work? • A document can be broken into three layers: structure, content, & style • SGML separates these three aspects • Deals mainly with the relationship between structure & content
SGML & Structure • File called the DTD (Document Type Definition) • DTD describes the structure of a document • Describes types of information handled & relationships among fields • Like a database schema • DTD provides a framework for the elements • Chapters, chapter headings, sections, and topics • That together constitute a document
SGML & Structure • DTD also specifies rules for the relationships between elements • A chapter heading must be the first element after the start of a chapter • Each list must contain at least two items. • These rules ensure that documents have a consistent, logical structure • A DTD accompanies a document everywhere • A document instance is a document whose content has been tagged in conformance with a particular DTD
SGML & Content • Content is the information itself • Titles, paragraphs, lists, tables, graphics, & audio • The method for identifying the content's position within the DTD structure is called tagging • Creating an SGML document involves inserting tags around content • These tags mark the beginning and end of each part of the structure
SGML & Content • <PAR> indicates the start of a paragraph & </PAR> indicates the end <PAR>Content is the information itself.</PAR> • Elements can be nested in other elements • The paragraph (<PAR>) is an element within the topic (<TOPIC>) <TOPIC> <PAR> Content is the information itself. </PAR> </TOPIC>
SGML & Content • The structure of a particular document is revealed by the nesting of tags: <section> <subhead> Content </subhead> <par> Content is the information itself. </par> </section>
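The nesting idea can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration under one assumption: the slide's SGML fragment is written here as well-formed XML, because the standard library has no SGML parser. The helper name `outline` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, written as well-formed XML (an assumption: the
# original is SGML, which the standard library cannot parse directly).
DOC = """
<section>
  <subhead>Content</subhead>
  <par>Content is the information itself.</par>
</section>
"""

def outline(element, depth=0):
    """Return one indented line per element, showing how the tags nest."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

structure = outline(ET.fromstring(DOC))
print("\n".join(structure))
```

Each level of indentation in the printed outline corresponds to one level of tag nesting in the document.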
SGML & Content • Some SGML-based authoring software programs rely on a software module called a parser that verifies that the document follows the rules of the DTD • The parser also verifies that the DTD itself is structurally correct
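A parser's most basic job can be sketched as follows. Note the limits of this sketch: Python's xml.etree.ElementTree is a non-validating XML parser, so it only checks that tags are properly opened, closed, & nested; it does not check a document against a DTD as the SGML parsers described above do. The function name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested: PAR opens & closes inside TOPIC.
good = "<TOPIC><PAR>Content is the information itself.</PAR></TOPIC>"
# Improperly nested: TOPIC is closed before PAR.
bad = "<TOPIC><PAR>Content is the information itself.</TOPIC></PAR>"

print(is_well_formed(good))
print(is_well_formed(bad))
```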
SGML & Style • SGML itself has nothing to do with setting standards for style • Most systems still rely on proprietary methods :( • Two efforts to develop standards-based style sheets have resulted in the mature Output Specification (OS) & the newly released DSSSL • Document Style Semantics & Specification Language • Complex formatting language • Difficult to learn & implement • XSL inherits & simplifies many formatting concepts • eXtensible Stylesheet Language
SGML & HTML • When the creators of the WWW needed a markup language to instruct browsers how to display WWW content they used SGML guidelines to create HTML • Hyper Text Markup Language • HTML was designed specifically for displaying content in a browser • But isn't much good for anything else
Progress Marches On • The WWW has matured & is being used for more than just viewing text and images • More versatile markup languages are needed
Limitations of HTML • HTML was designed so that tags would be used to mark up information according to its meaning • Without regard to how this info would be rendered in a browser • The title, main header, emphasized text, and contact information of the author are placed inside the elements TITLE, H1, EM, & ADDRESS • Remember SGML structure & content
Limitations of HTML • Each browser should decide how to display marked up text because it knows about the user's preferences & environment and can make decisions based on that information • Without this information, the author cannot do this as well • People who are blind • People who run non-graphical browsers • People who have weak eyesight • Need larger font sizes
Limitations of HTML • Using FONT, I, or other elements to control layout optimizes presentation for a limited number of environments & reduces the content's portability • Problems for those readers who operate in a non-standard environment
Limitations of HTML • Browsers have their own elements and attributes whose only purpose is to specify the layout, like FONT, CENTER, BGCOLOR, etc. • Browser vendors have ignored standards, like CSS, that tried to segregate information about layout from the HTML documents • HTML editors produce HTML where the markup is presentational rather than semantic
Limitations of HTML • The result is that many pages on the web now contain tags written for a specific version of a specific browser & a specific screen resolution with default preferences • These pages are often more or less unreadable to those who use anything besides that configuration • HTML has gradually been turned into a presentational language for Netscape & Explorer by the vendors & their users
Limitations of HTML • HTML offers only a limited number of tags for specialized uses • Chemistry • elements for chemical formulas • for measurement data • Airplane manufacturer • engines, parts & models • Stock Broker • opening price, closing price, daily high, etc.
Limitations of HTML • HTML has limited internal structure • It's easy to write valid HTML with semantic nonsense • H2->H1->H3->/H3->/H1->/H2 • Consider the English language equivalent • book title->part title->chapter title • Processing HTML information automatically also becomes difficult or even impossible because of its limited intrinsic structure
Solution: Just Extend HTML • HTML is already overburdened with dozens of interesting but incompatible inventions from different manufacturers, because it provides only one way of describing your information • HTML is at the limit of its usefulness as a way of describing information, and while it will continue to play an important role for the content it currently represents, many new applications require a more robust and flexible infrastructure
Solution: Just Use "Word" • Information on a network which connects many different types of computer has to be usable on all of them • It is also helpful for such information to be in a form that can be reused in many different ways • Minimize wasted time & effort
Solution: Just Use Word • Public information cannot afford to be restricted to one make or model or manufacturer, or to cede control of its data format to private hands • Proprietary data formats, no matter how well documented or publicized, are simply not an option • Their control still resides in private hands & • They can be changed or withdrawn • arbitrarily & without notice
Solution: Go Back to SGML • SGML is the international standard for defining this kind of application • Those who need an alternative based on different software for other purposes are entirely free to implement similar services using such a system, especially if they are for private use
XML Defined • XML is a portable, WWW-specific SGML • Powerful enough to describe data • Light enough to travel across the Web • SGML with a reduced feature set • Extensible because it is not a fixed format • Not a single, predefined markup language • It's a meta-language • A language for describing other languages
XML Defined • XML documents can reside on a server & be converted to HTML for viewing by browsers if required • Browsers can be XML compliant and access XML documents directly if required
Role of XML Development • It removes two constraints which are holding back Web development: • Dependence on a single, inflexible document type (HTML) • The complexity of full SGML, whose syntax allows many powerful but hard-to-program options. • XML simplifies the levels of optionality in SGML, and allows the development of user-defined document types on the Web.
A Reminder • C, C++, Fortran, Pascal, Basic, Java • programming languages with which calculations are specified, actions are taken, & decisions are made • SGML, XML, HTML • markup specification languages with which ways of describing information can be designed, usually for storage, transmission, or processing by a program • Markup languages don't do anything alone • a program must be run to do something with them
XML Defined (Again) • The main point of XML is that the author, by defining a markup language, can encode the information of documents much more insightfully than is possible with HTML • This means that programs processing these documents can understand them much better and therefore process the information in ways that are impossible with HTML (or ordinary text processor documents)
Example: Recipe Manager • Marked up recipes (for, say, soups and seafood dishes etc) according to a definition tailored for recipes • Contain the ingredients, amounts of each and alternatives for some • A program that, with a list of your fridge contents, goes through the recipes and makes a list of the possible recipes
Example: Recipe Manager • With nutritional information about the ingredients another program could sort the dishes by the number of calories • Or by how long they'd take to prepare • Or the price of the ingredients • The possibilities are many, because the information is encoded in a way that the computer can more easily "understand"
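The recipe manager can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the element & attribute names (`recipes`, `recipe`, `ingredient`, `calories`), the calorie figures, & the function name `possible_recipes` are invented, not taken from any real recipe markup language.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe markup; names and calorie values are made up.
RECIPES = """
<recipes>
  <recipe name="Fish soup" calories="320">
    <ingredient>fish</ingredient>
    <ingredient>onion</ingredient>
    <ingredient>potato</ingredient>
  </recipe>
  <recipe name="Onion soup" calories="250">
    <ingredient>onion</ingredient>
    <ingredient>butter</ingredient>
  </recipe>
  <recipe name="Shrimp salad" calories="180">
    <ingredient>shrimp</ingredient>
    <ingredient>lettuce</ingredient>
  </recipe>
</recipes>
"""

def possible_recipes(xml_text, fridge):
    """Return (name, calories) for every recipe whose ingredients are all
    in the fridge, sorted from fewest to most calories."""
    root = ET.fromstring(xml_text)
    matches = []
    for recipe in root.findall("recipe"):
        needed = {item.text for item in recipe.findall("ingredient")}
        if needed <= fridge:  # every needed ingredient is available
            matches.append((recipe.get("name"), int(recipe.get("calories"))))
    return sorted(matches, key=lambda pair: pair[1])

fridge = {"fish", "onion", "potato", "butter"}
print(possible_recipes(RECIPES, fridge))
# prints [('Onion soup', 250), ('Fish soup', 320)]
```

Sorting by preparation time or ingredient price would only need another attribute & a different sort key, which is the point of encoding the information explicitly.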
Example: Tax forms in XML • How to "automate" tax processing systems? • Tax laws are complex • Tax laws change frequently • Tax forms also change frequently • Form user interface code would have to change frequently • Validating and processing applications would have to change frequently
Example: Tax forms in XML • Express the form itself as an XML document • describing all the fields • the text in the form • the relationships between the fields • The user interface code for web submission could then use this information in a Java applet to set up the user interface correctly • The validation application could use it to validate received information
Example: Tax forms in XML • Some of the constraints that can be expressed in an XML document are: • that field X is the sum of fields W, Y and Z • that field X should contain Y percent of the amount in field Z • that the value of field X should be between Y and Z • that fields X and Y should contain the same value • that if the value in field X is Y, then fields W-Z should not be filled in
Example: Tax forms in XML • These should all be easily expressible in XML, and the resulting documents should be simple enough that non-programmers can modify them when needed. • Changes to the forms could then be effected by modifying the XML document, without changing any of the application code
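One way this could look is sketched below. The form vocabulary (`form`, `field`, `constraint`, & the `type`, `target`, `of`, `min`, `max` attributes) is invented for this example & does not come from any real tax schema; only two of the constraint kinds listed earlier (sum & range) are handled, & the function name `check` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical form description; element and attribute names are invented.
FORM = """
<form name="example">
  <field id="W"/>
  <field id="Y"/>
  <field id="Z"/>
  <field id="X"/>
  <constraint type="sum" target="X" of="W Y Z"/>
  <constraint type="range" target="W" min="0" max="1000"/>
</form>
"""

def check(form_xml, values):
    """Return one message for every constraint that the values violate."""
    errors = []
    for rule in ET.fromstring(form_xml).findall("constraint"):
        target = rule.get("target")
        kind = rule.get("type")
        if kind == "sum":
            expected = sum(values[name] for name in rule.get("of").split())
            if values[target] != expected:
                errors.append(f"{target} should be {expected}")
        elif kind == "range":
            low, high = float(rule.get("min")), float(rule.get("max"))
            if not low <= values[target] <= high:
                errors.append(f"{target} should be between {low} and {high}")
    return errors

print(check(FORM, {"W": 100, "Y": 200, "Z": 50, "X": 350}))  # prints []
print(check(FORM, {"W": 100, "Y": 200, "Z": 50, "X": 300}))  # prints ['X should be 350']
```

Changing a rule, for example raising the upper bound on field W, means editing the `constraint` element in the XML; the Python code stays the same.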
Example: FAQ Maintenance • Using an XML structure an FAQ-maintainer could also be rid of the problems with maintaining the FAQ in HTML, TEXT, and PDF versions • Instead the maintainer can make one or more stylesheets to be run each time the original has been updated to create new versions of the distribution files
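The single-source idea can be sketched as follows. Two assumptions apply: the FAQ vocabulary (`faq`, `entry`, `question`, `answer`) is invented for this example, & plain Python functions stand in for the stylesheets the slide mentions, so this is not an XSL or CSS example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical FAQ markup; the element names are invented for this sketch.
FAQ = """
<faq>
  <entry>
    <question>What is XML?</question>
    <answer>A meta-language for defining markup languages.</answer>
  </entry>
  <entry>
    <question>Is HTML going away?</question>
    <answer>No; HTML is being reborn as XHTML.</answer>
  </entry>
</faq>
"""

def entries(xml_text):
    """Return a list of (question, answer) pairs from the FAQ source."""
    root = ET.fromstring(xml_text)
    return [(e.findtext("question"), e.findtext("answer"))
            for e in root.findall("entry")]

def to_text(xml_text):
    """Render the FAQ as plain text."""
    return "\n\n".join(f"Q: {q}\nA: {a}" for q, a in entries(xml_text))

def to_html(xml_text):
    """Render the FAQ as an HTML definition list."""
    items = "".join(f"<dt>{escape(q)}</dt><dd>{escape(a)}</dd>"
                    for q, a in entries(xml_text))
    return f"<dl>{items}</dl>"

print(to_text(FAQ))
print(to_html(FAQ))
```

Both outputs are regenerated from the same source, so an edit to one `answer` element shows up in every distribution format.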
Example XML File
<?xml version="1.0" standalone="yes"?>
<!-- file name: inventory.xml -->
<INVENTORY>
  <BOOK>
    . .
  </BOOK>
  <BOOK>
    . .
  </BOOK>
</INVENTORY>
Example XML File
<BOOK>
  <TITLE>The Legend of Sleepy Hollow</TITLE>
  <AUTHOR>Washington Irving</AUTHOR>
  <BINDING>mass market paperback</BINDING>
  <PRICE>$2.95</PRICE>
</BOOK>
<BOOK>
  <TITLE>Leaves of Grass</TITLE>
  <AUTHOR BORN="1819">Walt Whitman</AUTHOR>
  <BINDING>hardcover</BINDING>
  <PRICE>$7.75</PRICE>
</BOOK>
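A program can read these records directly. The sketch below uses Python's standard xml.etree.ElementTree; the two BOOK elements are copied from the slide, while the function name `load_books` & the dictionary keys are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

# The two BOOK elements from the slide, wrapped in the INVENTORY root.
INVENTORY = """<?xml version="1.0" standalone="yes"?>
<INVENTORY>
  <BOOK>
    <TITLE>The Legend of Sleepy Hollow</TITLE>
    <AUTHOR>Washington Irving</AUTHOR>
    <BINDING>mass market paperback</BINDING>
    <PRICE>$2.95</PRICE>
  </BOOK>
  <BOOK>
    <TITLE>Leaves of Grass</TITLE>
    <AUTHOR BORN="1819">Walt Whitman</AUTHOR>
    <BINDING>hardcover</BINDING>
    <PRICE>$7.75</PRICE>
  </BOOK>
</INVENTORY>
"""

def load_books(xml_text):
    """Turn each BOOK element into a plain dictionary."""
    books = []
    for book in ET.fromstring(xml_text).findall("BOOK"):
        author = book.find("AUTHOR")
        books.append({
            "title": book.findtext("TITLE"),
            "author": author.text,
            "born": author.get("BORN"),  # None when the attribute is absent
            "price": float(book.findtext("PRICE").lstrip("$")),
        })
    return books

books = load_books(INVENTORY)
total = sum(book["price"] for book in books)
for book in books:
    print(book["title"], "by", book["author"])
```

Element content (TITLE, PRICE) & attribute values (BORN) are read through different calls, which mirrors the distinction the markup itself makes.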
Example XML File w/ CSS
<?xml version="1.0" standalone="yes"?>
<!-- file name: inventory.xml -->
<?xml-stylesheet type="text/css" href="inventory.css"?>
<INVENTORY>
  <BOOK>
    . .
  </BOOK>
  <BOOK>
    . .
  </BOOK>
</INVENTORY>
Example CSS
/* file name: inventory.css */
BOOK { display: block; margin-top: 12pt; font-size: 10pt }
TITLE { display: block; font-size: 10pt; font-weight: bold; font-style: italic }
AUTHOR { display: block; margin-left: 15pt; font-weight: bold }
BINDING { display: block; margin-left: 15pt }
PAGES { display: none }
PRICE { display: block; margin-left: 15pt }
Example XML File w/ DTD
<?xml version="1.0" standalone="no"?>
<!-- file name: inventory.xml -->
<?xml-stylesheet type="text/css" href="inventory.css"?>
<!-- the DOCTYPE name must match the root element -->
<!DOCTYPE INVENTORY SYSTEM "inventory.dtd">
<INVENTORY>
  <BOOK>
    . .
  </BOOK>
</INVENTORY>
Example DTD
<!-- file name: inventory.dtd -->
<!-- INVENTORY holds one or more BOOK elements -->
<!ELEMENT INVENTORY (BOOK+)>
<!-- each BOOK holds these five children, in this order -->
<!ELEMENT BOOK (TITLE, AUTHOR, BINDING, PAGES, PRICE)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR (#PCDATA)>
<!ELEMENT BINDING (#PCDATA)>
<!ELEMENT PAGES (#PCDATA)>
<!ELEMENT PRICE (#PCDATA)>
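What the two main declarations require can be sketched as a hand-written check. This is not DTD validation: Python's xml.etree.ElementTree does not validate against DTDs, so the content model (one or more BOOK elements, each holding TITLE, AUTHOR, BINDING, PAGES, & PRICE in that order) is restated by hand in the hypothetical names `BOOK_MODEL` & `validate_inventory`. The sample documents use placeholder text, & the check ignores stray non-BOOK children of INVENTORY.

```python
import xml.etree.ElementTree as ET

# Child order required of every BOOK, restating the slide's content model.
BOOK_MODEL = ["TITLE", "AUTHOR", "BINDING", "PAGES", "PRICE"]

def validate_inventory(xml_text):
    """Return a list of problems; an empty list means the structure matches."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "INVENTORY":
        problems.append(f"root is {root.tag}, expected INVENTORY")
    books = root.findall("BOOK")
    if not books:
        problems.append("INVENTORY needs at least one BOOK")  # BOOK+
    for number, book in enumerate(books, start=1):
        children = [child.tag for child in book]
        if children != BOOK_MODEL:
            problems.append(f"BOOK {number} has children {children}")
    return problems

# Placeholder documents: one complete BOOK, one without PAGES.
complete = ("<INVENTORY><BOOK><TITLE>T</TITLE><AUTHOR>A</AUTHOR>"
            "<BINDING>B</BINDING><PAGES>1</PAGES><PRICE>P</PRICE></BOOK></INVENTORY>")
missing_pages = ("<INVENTORY><BOOK><TITLE>T</TITLE><AUTHOR>A</AUTHOR>"
                 "<BINDING>B</BINDING><PRICE>P</PRICE></BOOK></INVENTORY>")

print(validate_inventory(complete))
print(validate_inventory(missing_pages))
```

A validating parser derives these checks from the DTD file itself, so a new document type needs a new DTD rather than new code.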
XML Browser Issues • The XML specification is still relatively new • Much XML is experimental • There won't be just one browser, but many • Because the potential number of different XML applications is not limited, no single browser can be expected to handle 100% of everything
XML Browser Issues • IE5.5 handles XML but currently still renders it via the CSS model even when using an XSL stylesheet • Not all the stylesheet options work • Microsoft was also one of the architects of an invalid hybrid solution in which one could embed fragments of XML in HTML files • This has now been superseded by XHTML • Current HTML-only browsers simply ignore element markup which they don't recognize