Learn the basics of XML, including tagging, element relationships, and document structure. Discover why XML is important and how it differs from databases. See examples and best practices for creating well-formed and valid XML documents.
XML in Libraries Roy Tennant eScholarship California Digital Library escholarship.cdlib.org
Outline • Introduction to XML • XML vs. Databases • Displaying and Transforming XML • XSLT Primer • Serving XML to Web Users • Case Studies • Tips & Advice • Resources
Introduction to XML • Extensible Markup Language • A method of creating and using tags (elements) to identify the structure and contents of a document — not how it should be displayed • The tags can be arbitrary or can come from a specification (a Document Type Definition or schema)
What it Looks Like
<?xml version="1.0" encoding="UTF-8" ?>
<book>
  <author>
    <lastname>Tennant</lastname>
    <firstname>Roy</firstname>
  </author>
  <title>The Great American Novel</title>
  <chapter number="1">
    <chaptitle>It Was Dark and Stormy</chaptitle>
    <p>It was a dark and stormy night.</p>
  </chapter>
</book>
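To show what "identifying structure and contents" buys you, here is a minimal sketch of software reading the sample document. It assumes Python and its standard-library `xml.etree.ElementTree` module; neither the language nor the variable names come from the slides.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide (real XML needs straight quotes).
BOOK_XML = b"""<?xml version="1.0" encoding="UTF-8" ?>
<book>
  <author>
    <lastname>Tennant</lastname>
    <firstname>Roy</firstname>
  </author>
  <title>The Great American Novel</title>
  <chapter number="1">
    <chaptitle>It Was Dark and Stormy</chaptitle>
    <p>It was a dark and stormy night.</p>
  </chapter>
</book>"""

root = ET.fromstring(BOOK_XML)

# Content is addressed by element name and position in the tree,
# not by how it happens to be displayed.
title = root.findtext("title")
author = f'{root.findtext("author/firstname")} {root.findtext("author/lastname")}'
chapter_number = root.find("chapter").get("number")

print(title, "by", author, "- chapter", chapter_number)
```

Because the tags say what each piece of text is, the same few lines would work on any document that uses this tag set.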
Why is XML Important? • It is a standard, easily extensible way to encode loosely-structured as well as highly-structured information • Due to its easy parseability, software can transform it in countless ways, thereby allowing: • Easy migration paths • Alternative displays • On-the-fly response to user needs
Ways to Use XML • Behind the scenes as a standard and easily transformed format for information • As a transfer syntax, to exchange information in a machine-parseable form • As a method of delivery direct to the user (not recommended)
Documents • XML is expressed as “documents”, whether an entire book or a database record • Must haves: • At least one element • Only one “root” element • Should haves: • An XML declaration; e.g., <?xml version="1.0"?> • Namespace declarations • Can haves: • One or more properly nested elements • Comments • Processing instructions
Elements • Must have a name; e.g., <title> • Names must follow rules: no spaces or most punctuation (hyphens, underscores, and periods are allowed), must start with a letter or underscore, are case sensitive • Must have a beginning and end; <title></title> or <title/> • May wrap text data; e.g., <title>Hamlet</title> • May have attributes, whose values must be quoted; e.g., <title level="main">Hamlet</title> • May contain other “child” elements; e.g., <title level="main">Hamlet <subtitle>Prince of Denmark</subtitle></title>
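The element rules above can be tried out by building the Hamlet example programmatically. This is a sketch that assumes Python's standard-library `xml.etree.ElementTree`; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Target: <title level="main">Hamlet <subtitle>Prince of Denmark</subtitle></title>
title = ET.Element("title", {"level": "main"})  # a name plus one attribute
title.text = "Hamlet "                           # text wrapped by the element
subtitle = ET.SubElement(title, "subtitle")      # a child element
subtitle.text = "Prince of Denmark"

empty = ET.Element("title")                      # an element with no content

# The serializer quotes attribute values and minimizes empty elements for us.
serialized = ET.tostring(title, encoding="unicode")
serialized_empty = ET.tostring(empty, encoding="unicode")

print(serialized)
print(serialized_empty)
```

Generating XML through a library rather than by string concatenation is one way to guarantee that the begin/end and quoting rules are always met.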
Element Relationships • Every XML document must have exactly one “root” element • All other elements must be contained within the root • An element contained within another element is called a “child” of the container element • An element that contains another element is called the “parent” of the contained element • Two elements that share the same parent are called “siblings”
The Tree
<?xml version="1.0"?>
<book>
  <author>
    <lastname>Tennant</lastname>
    <firstname>Roy</firstname>
  </author>
  <title>The Great American Novel</title>
  <chapter number="1">
    <chaptitle>It Was Dark and Stormy</chaptitle>
    <p>It was a dark and stormy night.</p>
  </chapter>
</book>
Callouts on the slide: • Root element (<book>) • Parent of <lastname> (<author>) • Siblings (e.g., <lastname> and <firstname>) • Child of <author> (e.g., <firstname>)
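The root, parent, child, and sibling relationships can be read straight off a parsed tree. The sketch below assumes Python's standard-library `xml.etree.ElementTree`; because its elements do not record their parent, a child-to-parent map is built first (the map and variable names are illustrative).

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
  <author>
    <lastname>Tennant</lastname>
    <firstname>Roy</firstname>
  </author>
  <title>The Great American Novel</title>
  <chapter number="1">
    <chaptitle>It Was Dark and Stormy</chaptitle>
    <p>It was a dark and stormy night.</p>
  </chapter>
</book>"""

root = ET.fromstring(BOOK_XML)

# Map every child element to the element that directly contains it.
parent_of = {child: parent for parent in root.iter() for child in parent}

lastname = root.find("author/lastname")
parent = parent_of[lastname]
siblings = [el.tag for el in parent if el is not lastname]
children_of_root = [el.tag for el in root]

print("root element:", root.tag)
print("parent of <lastname>:", parent.tag)
print("siblings of <lastname>:", siblings)
print("children of the root:", children_of_root)
```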
Character Entities • There are 5 characters that are reserved for special purposes; therefore, to use these characters when not part of XML tags, you must use an entity reference: • & (ampersand) becomes: &amp; • < (less than) becomes: &lt; • > (greater than) becomes: &gt; • ' (apostrophe) becomes: &apos; • " (quote) becomes: &quot;
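In practice the escaping is usually left to a library. A minimal sketch, assuming Python's standard-library `xml.sax.saxutils`; the sample string and the `QUOTES` name are made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; the two quote characters are
# supplied explicitly so that all five reserved characters are covered.
QUOTES = {"'": "&apos;", '"': "&quot;"}

raw = 'Barnes & Noble <says> "it\'s fine"'
escaped = escape(raw, QUOTES)

# unescape() reverses the process when given the inverse mapping.
restored = unescape(escaped, {entity: char for char, entity in QUOTES.items()})

print(escaped)
print(restored)
```

The ampersand is replaced first, so the `&` inside the other entity references is not escaped a second time.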
Two Types of XML • Well-Formed • Valid
Well-Formed XML • Follows general tagging rules: • All tags begin and end • But can be minimized if empty: <br/> instead of <br></br> • All tags are case sensitive • All tags must be properly nested: • <author> <firstname>Mark</firstname><lastname>Twain</lastname> </author> • All attribute values are quoted: • <subject scheme="LCSH">Music</subject> • Should begin with an XML declaration (recommended, though not strictly required for well-formedness) • Software can make sure a document follows these rules
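Any XML parser will reject a document that breaks these rules, so a well-formedness check can be as small as "try to parse it". The sketch below assumes Python's standard-library `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)

samples = {
    "nested": "<author><firstname>Mark</firstname><lastname>Twain</lastname></author>",
    "empty_minimized": "<p>line<br/></p>",
    "bad_nesting": "<author><firstname>Mark</author></firstname>",
    "case_mismatch": "<Title>Hamlet</title>",
    "unquoted_attribute": "<subject scheme=LCSH>Music</subject>",
    "two_roots": "<a/><b/>",
}

results = {name: is_well_formed(xml)[0] for name, xml in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'NOT well-formed'}")
```

The first two samples pass; the other four each break one rule from the list (nesting, case, quoting, single root).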
Valid XML • Uses only specific tags and rules as codified by one of: • A document type definition (DTD) • A schema definition • Only the tags listed by the schema or DTD can be used • Software can take a DTD or schema and verify that a document adheres to the rules • Editing software can prevent an author from using anything except allowed tags
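Real validation is done by a validating parser working from a DTD or schema, which Python's standard library does not provide. As a deliberately simplified stand-in, the sketch below checks only one thing a DTD would also check: that each element is declared and appears inside an allowed parent. The `ALLOWED_CHILDREN` table and `find_violations` function are hypothetical, invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list standing in for the element declarations of a DTD:
# each declared element maps to the child elements it may contain.
ALLOWED_CHILDREN = {
    "book": {"author", "title", "chapter"},
    "author": {"lastname", "firstname"},
    "chapter": {"chaptitle", "p"},
    "lastname": set(), "firstname": set(), "title": set(),
    "chaptitle": set(), "p": set(),
}

def find_violations(root):
    """List undeclared elements and declared elements found in the wrong parent."""
    violations = []
    for parent in root.iter():
        if parent.tag not in ALLOWED_CHILDREN:
            violations.append(f"<{parent.tag}> is not a declared element")
            continue
        for child in parent:
            declared = child.tag in ALLOWED_CHILDREN
            if declared and child.tag not in ALLOWED_CHILDREN[parent.tag]:
                violations.append(f"<{child.tag}> is not allowed inside <{parent.tag}>")
    return violations

good = ET.fromstring(
    "<book><author><lastname>Tennant</lastname><firstname>Roy</firstname></author>"
    "<title>The Great American Novel</title></book>"
)
bad = ET.fromstring(
    "<book><title>X</title><blink>hi</blink><author><title>Dr</title></author></book>"
)

good_violations = find_violations(good)
bad_violations = find_violations(bad)
print(good_violations)
print(bad_violations)
```

Both sample documents are well-formed; only the first would count as "valid" under this toy rule set. A real DTD or schema can also constrain order, repetition, attributes, and data types, none of which this sketch attempts.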
XML vs. Databases (a simplistic formula) • If your information is… • Tightly structured • Fixed field length • Massive numbers of individual items • You need a database • If your information is… • Loosely structured • Variable field length • Massive record size • You need XML
Displaying XML: CSS • A modern web browser (MSIE, Mozilla) and a cascading style sheet (CSS) may be used to view XML as if it were HTML • A style must be defined for every XML tag • All display characteristics of each element must be explicitly defined • Elements are displayed in the order they are encountered in the XML • No reordering of elements or other processing is possible
Transforming XML: XSLT • Extensible Stylesheet Language Transformations (XSLT) • A markup language and programming syntax for processing XML • Is most often used to: • Transform XML to HTML for delivery to standard web clients • Transform XML from one set of XML tags to another • Transform XML into another syntax/system
XSLT Primer • XSLT is based on the process of matching templates to nodes of the XML tree • Working down from the top, XSLT tries to match segments of code to: • The root element • Any child node • And on down through the document • You can specify different processing for each element if you wish
XSLT Primer: Beginning Syntax • Start with the XSLT namespace declaration:
<xsl:stylesheet version="1.1"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
</xsl:stylesheet>
• Then add, inside <xsl:stylesheet>, a template that matches the root node:
<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>
XSLT Primer: Templates • Then add, for each element you wish to process:
<xsl:template match="ELEMENTNAME">
  XSLT INSTRUCTIONS AND/OR HTML HERE
</xsl:template>
• If you want to process nodes below this point (children), use the <xsl:apply-templates/> instruction
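The template-matching model can be mimicked in a few lines of ordinary code, which may help if XSLT's syntax is unfamiliar. The sketch below is an analogy written in Python, not XSLT and not how any XSLT processor is implemented; the `template` decorator, the `apply_templates` function, and the HTML tags chosen are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

TEMPLATES = {}  # element name -> function that renders that element

def template(match):
    """Register a function as the 'template' for one element name."""
    def register(func):
        TEMPLATES[match] = func
        return func
    return register

def apply_templates(node):
    """Render a node's text and children in document order.

    Children with no registered template fall back to this function,
    so their text is still emitted and their own children are processed.
    """
    parts = [escape(node.text or "")]
    for child in node:
        handler = TEMPLATES.get(child.tag, apply_templates)
        parts.append(handler(child))
        parts.append(escape(child.tail or ""))
    return "".join(parts)

@template("chaptitle")
def chaptitle(node):
    return f"<h2>{apply_templates(node)}</h2>"

@template("p")
def paragraph(node):
    return f"<p>{apply_templates(node)}</p>"

chapter = ET.fromstring(
    '<chapter number="1"><chaptitle>It Was Dark and Stormy</chaptitle>'
    "<p>It was a dark &amp; stormy night.</p></chapter>"
)
html_out = apply_templates(chapter)
print(html_out)
```

Each decorated function plays the role of one `<xsl:template match="...">`, and the call to `apply_templates(node)` inside it corresponds to `<xsl:apply-templates/>`: without that call, processing would stop at that element.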
XSLT Primer: Doing HTML • Typical way to begin:
<xsl:template match="/">
  <html>
    <head>
      <title><xsl:value-of select="title"/></title>
      <link type="text/css" rel="stylesheet" href="xslt.css" />
    </head>
    <body>
      <xsl:apply-templates/>
    </body>
  </html>
</xsl:template>
• Then, templates for each element appear below
Serving XML to Web Users • Basic requirements: an XML doc and a web server • Additional requirements for simple method: • A CSS stylesheet • Additional requirements for complex, powerful method: • An XSLT stylesheet • An XSLT processor (we’ll use xsltproc) • A CGI program to take user input and call xsltproc • A CSS stylesheet (optional) to control how it looks in a browser
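The "CGI program that calls xsltproc" step might look roughly like the sketch below. It assumes Python, an installed `xsltproc`, and hypothetical file names (`book.xsl`, `book.xml`) and a hypothetical `chapter` parameter; the slides do not say what language the original CGI program was written in.

```python
import subprocess

def build_xsltproc_command(stylesheet, xml_doc, params=None):
    """Return the argument list: xsltproc [--stringparam NAME VALUE]... STYLESHEET DOC."""
    command = ["xsltproc"]
    for name, value in (params or {}).items():
        # --stringparam passes a value into the stylesheet as a string parameter.
        command += ["--stringparam", name, value]
    return command + [stylesheet, xml_doc]

def transform(stylesheet, xml_doc, params=None):
    """Run xsltproc and return the transformed document as text.

    Requires xsltproc on the PATH; raises CalledProcessError if it fails.
    """
    result = subprocess.run(
        build_xsltproc_command(stylesheet, xml_doc, params),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Hypothetical file names and parameter, for illustration only.
command = build_xsltproc_command("book.xsl", "book.xml", {"chapter": "1"})
print(" ".join(command))
```

Passing the arguments as a list, rather than building a shell string, keeps user input from being interpreted by a shell. User-supplied values should still be checked before they are used to pick a file.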
XML Web Publishing Software • Software used to add XML serving capability to a web server • A couple of examples, both free: • Cocoon (xml.apache.org/cocoon/) • Requires a Java servlet container such as Tomcat (free) or Resin (commercial) • AxKit (axkit.org) • Requires Perl
[Diagram: AxKit running under mod_perl on a web server]
[Diagram: Cocoon running under Tomcat on a web server]
[Diagram, built up over several slides: a user asks a web server for an XML doc ("I want this XML doc…") • A Java servlet running under Resin applies an XSLT stylesheet (transformation) to the XML doc (information) • The result is a dynamically generated XHTML document with no display markup • An HTML stylesheet (CSS) supplies the presentation]
Case Study: Publishing Books @ the California Digital Library • Goals: • To create highly usable online versions of books • To create versions that will migrate easily as technology changes • To create an infrastructure that will support dynamic presentations of the same content
[Diagram, built up over several slides. Components: books encoded in TEI XML and stored in the file system • Full-text search index • Structure search index • Selected fields extracted, records created and stored in a METS repository • Project profile • MODS record (library catalog) • UC Press record (UC Press database)]
[Flow 1: user queries → search results returned]
[Flow 2: user requests a book segment → Java servlet, METS record in XML, XSLT → book segment returned]
[Flow 3: user wants to see search words in context → Java servlet, XSLT → book segment returned with terms highlighted]