This lecture introduces core XML concepts including syntax, well-formedness, namespaces, DOM, XPath, and XML Schema. It also explores the motivations for XML and its role in the HTML world.
e-Science e-Business e-Government and their Technologies: Core XML Bryan Carpenter, Geoffrey Fox, Marlon Pierce Pervasive Technology Laboratories Indiana University Bloomington IN 47404 January 12 2004 dbcarpen@indiana.edu gcf@indiana.edu mpierce@cs.indiana.edu http://www.grid2004.org/spring2004
What are we doing • This is a semester-long course on Grids (viewed as technologies and infrastructure) and their application, mainly to science but also to business and government • We will assume a basic knowledge of the Java language and then interweave 6 topic areas; the first four cover technologies that will be used by students • 1) Advanced Java: including networking, Java Server Pages and perhaps servlets • 2) XML: Specification, Tools, Linkage to Java • 3) Web Services: Basic Ideas, WSDL, Axis and Tomcat • 4) Grid Systems: GT3/Cogkit, Gateway, XSOAP, Portlet • 5) Advanced Technology Discussions: CORBA as history, OGSA-DAI, security, Semantic Grid, Workflow • 6) Applications: Bioinformatics, Particle Physics, Engineering, Crises, Computing-on-demand Grid, Earth Science
Contents of this Lecture Set • Intro: HTML and XML and Unicode • Core XML: • XML syntax and well-formedness, DTDs and validity. • XML namespaces. • The XML DOM with linkage to Java. • XPath basics. • XML Schema. • Validation for data-centric applications. • Later lectures may include additional information on: • XHTML, SVG, RDF. • XML style languages: XSLT and CSS. • XML Databases (Xindice, Sleepycat). • Search: advanced XPath, XQuery.
Motivations for XML: a Better HTML? • Limitations of HTML: • Extensibility: HTML does not allow users to specify their own tags or attributes in order to parameterize or otherwise semantically qualify their data. • Structure: HTML does not support the specification of deep structures needed to represent database schema or object-oriented hierarchies. • Validation: HTML does not support the kind of language specification that allows applications to check data for structural validity when it is imported.
XML in the HTML world • XML = eXtensible Markup Language. • XML is a subset of SGML (Standard Generalized Markup Language), but XML is specifically designed for the web. • Specification by W3C: http://www.w3.org/XML and lots of links like http://www.xml.org • XML 1.0 in February 1998. • XML 1.1 became a W3C recommendation 4 Feb, 2004! • How XML fits into the new HTML world: • XML describes the logical structure of the document. • CSS (Cascading Style Sheets) and/or XSL describes the visual presentation of the document. • DOM (Document Object Model) allows scripting languages like JavaScript to access and dynamically change document objects.
Logical vs. Visual Design • Logical design of a document (content) should be separate from its visual design (presentation). • Promotes sound typography. • Encourages better writing. • Is more flexible. • Allows the same “knowledge/information” (defined in XML) to be displayed on PCs, PDAs, Braille devices etc. • XML is used to define the logical design, with XSL (Extensible Style Language) or another mechanism used to define the visual layout (e.g. by mapping XML into HTML).
XML Design Goals • XML shall be usable over the Internet. • XML shall support a variety of applications. • XML shall be compatible with SGML. • It shall be easy to write programs that process XML documents. • Optional features in XML shall be kept to the absolute minimum, ideally zero. • XML documents should be human-legible and reasonably clear. • Design of XML should be prepared quickly. • Design of XML shall be formal and concise. • XML documents shall be easy to create. • Terseness in XML markup is of minimal importance.
Document-Centric or Data-Centric? • Roots of XML in document markup (HTML-like). • In practice use of XML as a data format has become at least as pervasive. Examples: • Use of XML format in configuration and deployment files of EJB, Tomcat, … • Uses of XML as a format for message exchange (e.g. SOAP, BEEP). • There is also an important intermediate case—XML as program text for machine interpretation. E.g.: • XSLT declarative transformation language. • WSDL interface definition language for Web services. • BPEL Web services workflow language.
Features of XML • Documents are stored in plain text and thus can be transferred and processed anywhere. • Unifying principles make it easily acceptable: • “Everything is a tree” (DOM). • Unicode for different languages.
XML and Unicode • All XML documents must be written using the Unicode character set. • Unicode is also the character set used by Java, C#, ECMAScript, …, so we should know something about it.
Unicode • Unicode (http://www.unicode.org) is an international standard character set that covers alphabets of all the World’s common written languages. • Eventually it should cover all languages, living and dead. • Unicode helps make the Web truly “worldwide”?! • Unlike, say, ASCII, which allows for 128 characters, Unicode has space for over 1,000,000, of which around 96,000 are currently allocated. • Unicode itself assigns a unique sequence number (code point) to any character, regardless of its alphabet. • Three Unicode encoding forms map these code points to sequences of fixed-size units: UTF-8, UTF-16, UTF-32.
Unicode Code Points • A Unicode code point is a numeric value between 0 and 10FFFF (hexadecimal), commonly denoted in one of the formats: U+XXXX U+XXXXX U+10XXXX where X is a hexadecimal digit. • There are a total of 1,114,112 (= 17 · 16^4) code points, but most of the World’s common characters are encoded in the first 65,536 points, the Basic Multilingual Plane (BMP). • 2048 code points in BMP are disallowed because their values have a special role in UTF-16 encoding. • For each assigned character code, the Unicode standard defines a name, and “semantic” properties like case, directionality, ...
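The arithmetic and notation on this slide can be checked from Java, which exposes code points through `java.lang.Character`. This is only a small sketch: the class name `CodePointDemo`, the helper `notation`, and the three sample code points are illustrative choices, not part of the lecture.

```java
public class CodePointDemo {
    // Formats a code point in the conventional U+XXXX notation
    // (at least four hexadecimal digits, more when needed).
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // 17 planes of 16^4 code points each.
        int total = 17 * 16 * 16 * 16 * 16;
        System.out.println("Total code points: " + total);
        System.out.println("Largest code point: " + notation(Character.MAX_CODE_POINT));

        // Sample code points chosen for illustration: two in the BMP, one in Plane 1.
        int[] samples = { 0x41, 0x20AC, 0x1D11E };
        for (int cp : samples) {
            System.out.println(notation(cp)
                    + " name=" + Character.getName(cp)
                    + " inBMP=" + Character.isBmpCodePoint(cp)
                    + " letter=" + Character.isLetter(cp));
        }
    }
}
```

`Character.getName` and `Character.isLetter` give a glimpse of the per-character names and properties that the standard defines.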
Planes • The space of 17 · 2^16 Unicode code points is conventionally divided into 17 planes of 2^16 points each. • Currently used planes include: • Note early versions of Unicode used a strict 16-bit encoding, and essentially contained just the BMP.
Unicode Allocation • Layout of planes:
Blocks • Planes are subdivided into blocks. • Blocks have variable size. Each block contains the characters of one alphabet or a group of related alphabets. • The following slides are a random sampling of the blocks in BMP. • I have put 128 code points on each slide, but this is just what would fit… no general significance to pages of size 128. • For all blocks in the current Unicode standard see: http://www.unicode.org/charts/
Unicode Allocation • Layout of Basic Multilingual Plane:
Unicode Allocation • Layout of Plane 1:
Encoding Forms • In electronic documents or computer programs the space of Unicode code points is normally broken down into a sequence of units, each unit having a convenient, fixed number of bits. • The Unicode standard defines 3 encoding forms. • The most straightforward is UTF-32, in which the units have size 32 bits. • This unit is easily large enough to hold the integer value of a single code point, so UTF-32 encoding is “obvious”. • But for nearly all documents, UTF-32 wastes at least half the available storage space. • Also, most programming languages work with 8 bit or 16 bit character units.
UTF-16 • The UTF-16 encoding form breaks Unicode characters into 16 bit units. • Java, for example, uses UTF-16 for chars and Strings. • One 16 bit unit is not large enough to represent all possible Unicode code points. • Code points higher than 2^16 - 1 are split over two consecutive units. • These are called surrogate pairs. The leading unit is a high-surrogate unit; trailing is a low-surrogate unit. • There are 1024 code points reserved in the BMP for high surrogates, and 1024 more reserved for low surrogates. • This allows for 1024 · 1024 surrogate pairs representing code points higher than 2^16 - 1, while ensuring a legal BMP code point can always be represented in a single unit, and such a unit can never be confused with a surrogate unit.
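The surrogate-pair split described above can be worked through by hand and compared with what Java produces. The sketch below assumes the standard UTF-16 ranges, with high surrogates starting at D800 and low surrogates at DC00 (hexadecimal); the class name `SurrogateDemo` and the sample character are illustrative.

```java
public class SurrogateDemo {
    // Splits a code point above U+FFFF into a UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits: 1024 possible values
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits: 1024 possible values
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // MUSICAL SYMBOL G CLEF, a Plane 1 character
        char[] manual = toSurrogates(clef);
        char[] library = Character.toChars(clef);
        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04X %04X%n", (int) library[0], (int) library[1]);

        // One character, but two 16-bit units in a Java String.
        String s = new String(library);
        System.out.println("length()       = " + s.length());
        System.out.println("codePointCount = " + s.codePointCount(0, s.length()));
    }
}
```

The practical consequence for Java programmers is that `String.length()` counts 16-bit units, not characters, so the two numbers printed at the end differ for any text outside the BMP.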
UTF-8 • The UTF-8 encoding form breaks Unicode characters into 8 bit units (i.e., individual bytes). • UTF-8 is a variable-width encoding with the following properties: • Any Unicode code point maps to 1, 2, 3, or 4 bytes. • Byte sub-sequences for individual characters can always be recognized by local search in the encoded string. • The Basic Latin block code points (U+0000..U+007F) map to one byte, identical to their ASCII value. • All code points in the BMP map to at most 3 bytes. • For European texts UTF-8 will normally use 8 or 16 bits per character (vs 16 bits for UTF-16). • For East Asian texts UTF-8 will normally use 24 bits per character (vs 16 bits for UTF-16).
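The size comparison above is easy to reproduce with the charsets that ship with Java. In this sketch the class name `Utf8LengthDemo` and the four sample characters are illustrative; `UTF_16BE` is chosen so that no byte order mark is added to the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Number of bytes one code point occupies in the given encoding.
    static int encodedBytes(int codePoint, Charset charset) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Basic Latin, an accented Latin letter, a CJK ideograph, a Plane 1 symbol.
        int[] samples = { 0x41, 0xE9, 0x4E2D, 0x1D11E };
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes%n",
                    cp,
                    encodedBytes(cp, StandardCharsets.UTF_8),
                    encodedBytes(cp, StandardCharsets.UTF_16BE));
        }
    }
}
```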
Encoding Schemes • The 3 encoding forms don’t quite complete the encoding schemes of Unicode, because they don’t address the endianness with which the UTF-32, UTF-16 numeric unit values are rendered to bytes (byte-serialization). • To allow applications to distinguish the endianness of a given document instance, Unicode allows a Byte Order Mark (BOM) as the first character of a document. • BOM is a code point (U+FEFF) for which the byte-reversed unit value doesn’t correspond to a legal code point, so serves to determine the actual byte order.
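As a rough illustration of how an application might use the BOM, the sketch below inspects the first bytes of a document and guesses the byte order of a UTF-16 or UTF-32 stream. The class name `BomDemo` and its helpers are hypothetical, and treating FF FE 00 00 as UTF-32 rather than as a UTF-16 BOM followed by U+0000 is a common convention assumed here, not a rule taken from the lecture.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Returns true if data begins with the given signature bytes.
    static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    // Guesses the byte serialization from a leading U+FEFF, or returns "unknown".
    static String detect(byte[] data) {
        // The UTF-32 checks come first because FF FE 00 00 also begins with FF FE.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return "unknown";
    }

    public static void main(String[] args) {
        // A document whose first character is the BOM.
        String text = "\uFEFF<?xml version=\"1.0\"?>";
        System.out.println(detect(text.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(detect(text.getBytes(StandardCharsets.UTF_16LE)));
        // No BOM: the byte order cannot be determined this way.
        System.out.println(detect("<a/>".getBytes(StandardCharsets.UTF_16BE)));
    }
}
```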
Unicode Summary • Unicode is a large and important standard that is a foundation for XML, HTML, etc. • Although you are unlikely to manipulate the encodings yourself, you should be aware of the pros and cons of UTF-16, UTF-8. • UTF-8 is backwards compatible with ASCII—Basic Latin texts can be read by legacy applications. • UTF-16 is better-suited for internationalization. It is the internal representation used by Java, C#, ECMAScript, …
Introduction • In this section we will describe core XML, as defined by the XML specification document from W3C. • XML is a format for documents—originally documents for the Web—but its scope is wider than that. • XML is a subset of SGML—Standard Generalized Markup Language. Some features of XML exist simply for compatibility with SGML. • XML can also be viewed as a kind of generalization of HTML—presumably familiar from the Web.
XML Parsers and Applications • For purposes of this section an application is any program that reads data from an XML document. • Applications normally do not (and probably should not) read the text of XML documents directly. • The XML specification assumes that this text is initially processed by a piece of software called an XML processor. We will also refer to this as an XML parser. • The parser exhaustively checks that the text is in a legal XML form, then extracts the essential data from the document, and hands that data to the application.
Reading XML Data • XML source:
<?xml version="1.0"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN" "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg width="500" height="500">
  <g transform='rotate(45)'>
    <circle cx='150' cy='50' r='25'/>
    <text x='125' y='100'>A Circle</text>
  </g>
</svg>
• [Figure: the XML source passes through the XML parser, which hands the parsed XML data to the application as a tree: svg (width 500, height 500), containing g (transform rotate(45)), containing circle (cx 150, cy 50, r 25) and text (x 125, y 100, "A Circle").]
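The parser-to-application hand-off can be sketched in Java with the JAXP API bundled with the JDK. This is an assumption made to keep the example self-contained; the lecture itself uses Xerces, which is introduced later. The class name `TreeWalkDemo` is hypothetical, and the DOCTYPE line is left out of the embedded document so that the parser does not try to fetch the SVG DTD over the network.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TreeWalkDemo {
    // The SVG document from the slide, without its DOCTYPE declaration.
    static final String SVG =
          "<svg width='500' height='500'>"
        + "<g transform='rotate(45)'>"
        + "<circle cx='150' cy='50' r='25'/>"
        + "<text x='125' y='100'>A Circle</text>"
        + "</g></svg>";

    // Appends one line per element (with its attributes) or non-blank text node.
    static void describe(Node node, String indent, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                out.append(' ').append(a.getNodeName()).append('=').append(a.getNodeValue());
            }
            out.append('\n');
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) out.append(indent).append('"').append(text).append("\"\n");
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            describe(c, indent + "  ", out);
        }
    }

    // The "application": it only ever sees the parsed tree, never the raw text.
    static String outline(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        describe(doc.getDocumentElement(), "", out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(outline(SVG));
    }
}
```

The order in which attributes are listed is up to the parser, so it may differ from the order in the source text.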
Well-formed Documents • An XML document follows a strict syntax. For example: • An XML document contains regions of text called elements, delimited by matching start-tags and end-tags. Elements must be correctly nested. • Start-tags may include attribute specifications, where attribute values are strings delimited by matching quote marks. • A document that obeys the full set of these rules is called well-formed. • Every legal XML document must be well-formed, otherwise it cannot be parsed.
Examples • Well-formed: <html> <body style="font-style: italic"> This is a well-formed document. </body> </html> • Not well-formed: <html> <body style=font-style: italic> This is not a well-formed document. </html> </body> • The style attribute value is not in quote marks, and the html and body tags don’t nest correctly.
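The two documents above can be checked programmatically. The sketch below uses the JAXP parser bundled with the JDK, an assumption made so the example runs without any installation, and reports whether a string parses without a fatal error. The class name `WellFormedDemo` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedDemo {
    static final String WELL_FORMED =
        "<html> <body style=\"font-style: italic\"> This is a well-formed document. </body> </html>";
    static final String ILL_FORMED =
        "<html> <body style=font-style: italic> This is not a well-formed document. </html> </body>";

    // Returns true if the text parses as XML; a well-formedness violation is a fatal error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("first example:  " + isWellFormed(WELL_FORMED));
        System.out.println("second example: " + isWellFormed(ILL_FORMED));
    }
}
```

Note that the parser stops at the first fatal error, so for the second document it reports the unquoted attribute value and never reaches the mis-nested end-tags.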
Install Xerces • The Xerces parser is a product of the Apache XML project, http://xml.apache.org. • Follow the “Xerces Java 2” project link and go to the download area, then to the master distribution directory or a mirror directory. • Download Xerces-J-bin.2.6.2.zip, and extract it to a suitable place, e.g. C:\ • When extracting, remember to select “Use folder names”!! • This should create a folder called, e.g., C:\xerces-2_6_2\.
Put Xerces on your Class Path • Using the menu at Control Panel→System→Advanced→Environment Variables add the jar files xercesSamples.jar, xercesImpl.jar, and xml-apis.jar to your class path. • E.g. append …;C:\xerces-2_6_2\xercesSamples.jar;C:\xerces-2_6_2\xercesImpl.jar;C:\xerces-2_6_2\xml-apis.jar to the value of your CLASSPATH variable.
Example Using Xerces • Copy the two HTML examples given above to files called, say, wellformed.html and illformed.html. Then, in a new Command Prompt window, try running the commands: > java dom.Writer wellformed.html … > java dom.Writer illformed.html … • The first command should just echo the document. The second should print a syntax error message. • dom.Writer is one of the sample applications in the Xerces release. It simply uses the Xerces parser to convert the source file to a tree data structure (DOM), then converts the tree back to nicely formatted XML, which it prints.
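To give a feel for the parse-then-print round trip that dom.Writer performs, here is a rough sketch written against the standard JAXP API. It is not the source of the Xerces sample: the class name `RoundTripDemo` is hypothetical, and the serialization step uses an identity `Transformer`, which is a swapped-in technique rather than whatever dom.Writer does internally.

```java
import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class RoundTripDemo {
    // Parses the input into a DOM tree, then serializes the tree back to text.
    static String roundTrip(InputSource input) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(input);
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        // Request XML output explicitly; a root element named "html" could
        // otherwise select HTML serialization.
        identity.setOutputProperty(OutputKeys.METHOD, "xml");
        identity.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // With a file name argument, echo that file; otherwise echo a built-in sample.
        InputSource input = args.length > 0
                ? new InputSource(new File(args[0]).toURI().toString())
                : new InputSource(new StringReader(
                        "<html><body style=\"font-style: italic\">"
                        + "This is a well-formed document.</body></html>"));
        System.out.print(roundTrip(input));
    }
}
```

Run with `java RoundTripDemo wellformed.html` it should echo the document; given the ill-formed file, the parse call throws an exception describing the syntax error instead.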
“Rolling Your Own” Parser? • People approaching XML sometimes decide they can write their own “lightweight” parser that handles just the bit of XML their application needs. • In general this is a bad idea! • We will see later that even basic XML is a moderately complex specification; unless you are going to invest a lot of effort it is unlikely you can parse the full specification more efficiently than existing parsers. • If you subset the specification you may be compromising the most crucial advantage that XML brings to your application—interoperability. • Later in these lectures we will see how to use the Xerces parser from your own Java programs, to read XML input.
Valid Documents • An XML document may optionally include a Document Type Definition (DTD). • This declares the names of all elements and attributes appearing in the document, and how they may nest. • The DTD also declares and defines entities that may be referenced from within document content. • A well-formed XML document that includes a DTD—and accurately follows the declarations in that DTD—is called valid.
Invalid Documents • It is quite possible to parse invalid (but well-formed) documents, by using a non-validating parser. • Many applications accept XML files without DTDs, which are therefore technically invalid. • Applications may exploit “validation” mechanisms other than DTDs. An important one is XML Schema which we will discuss later. • A document validated against an XML Schema usually does not have a DTD, so technically is invalid as far as the base XML specification is concerned. • But of course it is valid relative to the XML Schema specification!
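The distinction between well-formed and valid can be seen by switching on DTD validation in a parser. The sketch below again assumes the JAXP API bundled with the JDK; the class name `ValidationDemo` and the small `note` vocabulary are made up for illustration. Validity problems arrive through the `error` callback, while well-formedness violations remain fatal.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class ValidationDemo {
    // A hypothetical DTD: a note must contain a "to" element followed by a "body" element.
    static final String DTD =
          "<!DOCTYPE note ["
        + "<!ELEMENT note (to, body)>"
        + "<!ELEMENT to (#PCDATA)>"
        + "<!ELEMENT body (#PCDATA)>"
        + "]>";
    static final String VALID = DTD + "<note><to>Class</to><body>Hello</body></note>";
    static final String INVALID = DTD + "<note><body>Hello</body></note>";
    static final String NO_DTD = "<note><to>Class</to><body>Hello</body></note>";

    // Parses with DTD validation switched on and returns the validity errors reported.
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            @Override public void error(SAXParseException e) { problems.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("valid:   " + validate(VALID));
        System.out.println("invalid: " + validate(INVALID));
        System.out.println("no DTD:  " + validate(NO_DTD));
    }
}
```

All three documents are well-formed, so a non-validating parser would accept each of them; only the validating parse distinguishes them, and the document without a DTD is reported as invalid just as the slide describes.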
Validation Side Effects • The use of a validating parser certainly affects what documents are treated as legal. • In some cases “switching on” validation may also alter the exact data passed from the parser to application. These effects will be considered when we discuss DTDs.
Physical Entities • An XML document is represented by one or more “storage units” (typically files), called “entities”. • We can enumerate five kinds: • Document entities—root XML documents. • Parsed external entities, which contain fragmentary XML content. • External DTD subsets, which contain some or all of the DTD declarations needed by a document. • External parameter entities, which also contain fragmentary DTD content. • Unparsed external entities, which are usually complete “binary” files in some native format (not XML).
Physical Structure • The structure of a non-trivial XML document is illustrated in the following figure. • Every XML document must have exactly one document entity. • It may also involve zero or more external entities: • The document entity may reference any number of external general entities. These can be parsed external entities or unparsed external entities. A parsed external entity may in turn reference other external general entities. • The document may have at most one external DTD subset. • A DTD subset in the document entity, or an external DTD subset, may reference any number of external parameter entities (which may in turn reference other external parameter entities).
A Complex XML Document • [Figure: boxes labelled Document Entity (containing DTD and Content), External DTD Subset, External Parameter Entity (two), Parsed External Entity (three), and Unparsed External Entity.]
Syntactic Features • The following two tables summarize the “top-level” syntax of all the constructs in XML. Full details will be given in later slides, as needed. • The first columns give an abbreviated example of the syntax, the second columns (“what?”) describe the construct, and the third columns (“where?”) specify the places in an XML document where the construct may appear. • In a “where?” column, Document means at the top-level of the document entity, and Content means in the kind of content allowed in an element—also called Parsed Character Data. • A Literal is character data in quotes—exactly what can appear in a literal depends strongly on its context. • XML Names will be discussed shortly.