680 likes | 843 Views
XML 101: Hype or Hoax ?. IST 597 Nov. 10, 2005 (Guest lecture in James Wang’s class) Dongwon Lee, Ph.D. Overview. HTML Model XML Model RDBMS vs. XML XML Schema XML Tools …. www.w3c.org. www.oasis-open.org. History of HTML. HTML: Hyper-Text Markup Language
E N D
XML 101: Hype or Hoax ? IST 597 Nov. 10, 2005 (Guest lecture in James Wang’s class) Dongwon Lee, Ph.D.
Overview • HTML Model • XML Model • RDBMS vs. XML • XML Schema • XML Tools • … Nov. 10, 2005
www.w3c.org Nov. 10, 2005
www.oasis-open.org Nov. 10, 2005
History of HTML • HTML: Hyper-Text Markup Language • Invented by Tim Berners-Lee and Robert Caillau at CERN in 1991 • What is hyper-text? • A document that contains links to other documents (and text, sound, images...) • Invented around 1945 by Vannevar Bush • What is a markup language? • A notation for writing text with markup tags (<tag>) • Tags indicate the structure of the text • Tags have names and attributes • Tags may enclose a part of the text • Invented around 1970 by Charles F. Goldfarb (SGML) Nov. 10, 2005
HTML • HTML was designed to display data and to focus on how data looks. Nov. 10, 2005
XML • What does XML stand for (GRE question)? • X-rated Modular 3 Language • eXtensible Standard ML Programming Language • eXtensible UML (Unified Modeling Language) for the Web • eXtensible Meta Language for the Web • eXtensible Markup Language originating from SGML • None of the above Nov. 10, 2005
XML • XML is a framework for defining markup languages: • XML was designed to describe data, not format. • There is no fixed collection of markup tags. One must define your own tags, tailored for our kind of information • Allow tailor-made markup for any imaginable application domain • XML uses a schema language (eg, DTD, XML-Schema) to formally describe the data. • XML separates syntax from semantics to provide a common framework for structuring information • Web browser rendering semantics is separately defined by stylesheets Nov. 10, 2005
XML • XML is not a replacement for HTML: • HTML should ideally be just another XML language • in fact, XHTML is just that • XHTML is a (very popular) XML language for hypertext markup • HTML is about displaying information, but XML is about describing information. Nov. 10, 2005
<center> <h1>SIGMOD</h1> <p><b> <u>ACM</u> <a href=“sigmod02.org”>SIGMOD Conference</a>, Madison, WI, 2002 </b></p> </center> <event eID=“sigmod02”> <acronym>SIGMOD</acronym> <society>ACM</society> <url>www.sigmod02.org</url> <loc> <city>Madison</city> <state>WI</state> </loc> <year>2002</year> </event> HTML vs. XML XML HTML Nov. 10, 2005
HTML vs. XML • Need a stylesheet to define browser presentation semantics HTML XML CSS/XSL Browser Browser Nov. 10, 2005
HTML vs. XML • Need a stylesheet to define browser presentation semantics Data/Format Data Format Browser Browser Nov. 10, 2005
Database Perspective • DB must support • Capture • Storage • Retrieval • Exchange • XML originally as the language to exchange data over Web • Replacing EDI (Electronic Data Interchange) Nov. 10, 2005
RDBMS vs. XML Example • AFV receives 100+ videos every week • [IST210 HW#2, 2004] Build a DB to be able to answer the following queries: • Who sent which videos? • Show me all videos about Cat category • How many videos in a database since Jan 1, 2003? • Which is the video with the best rating for the 1st week of Jan? • Where does the sender James live? Phone? Gender? • How many videos does James send so far? • Show me all the ghost videos (ones without sender information) Nov. 10, 2005
ER Model Category Videos VID Date Rating Sends Phone Name Owners Gender Address Nov. 10, 2005
ER => RDBMS Category Videos Videos VID Date Rating Sends Sends Owners Phone Name Owners Gender Address Nov. 10, 2005
RDBMS Video Sends Owners Nov. 10, 2005
Changes #1 • VHS video => VHS, CD, DVD • 100+ videos => 1 million videos Nov. 10, 2005
RDBMS Video Sends Owners Nov. 10, 2005
Changes #2 • Arbitrary name formats for owners • Eg, J. Doe vs. Dr. “Jonny” John Jay Doe Jr • 100+ different ways to capture owners’ information • “100 E. Foster Ave #200, State College, PA, 16801” vs. • adr1=“100 E. Foster #210”, adr2=“State College, PA, 16801” • street=“100 E. Foster, #200”, city=“State College”, state=“PA”, zip=“16801” • 100+ different video formats with varying properties => 1000+ attributes for Videos Nov. 10, 2005
RDBMS: Finest Granularity Video Sends Owners Nov. 10, 2005
RDBMS: Coarsest Granularity Video Sends Owners Nov. 10, 2005
Violation Of 1NF RDBMS: Ideal Case Video Sends Owners Nov. 10, 2005
XML Video <VideoTable> <Video vid=“100” category=“comedy” date=“2005/1/1” rating=“5” att2=“10” att1000=“T1” /> <Video vid=“200” category=“action” date=“2005/1/10” rating=“4” att1=“20” /> <Video vid=“300” category=“SF” date=“2004/12/31” rating=“5” att1000=“S20” /> </VideoTable> Nov. 10, 2005
Address Adr1 Adr2 State Address Street City State Zip XML Owners <OwnerTable> <Owner phone=“564-3456” gender=“F”> <Name FN=“Jenny” /> <Address> <Street>310 N. Atherton</Street><City>State College</City> <State>PA</State><Zip>16801</Zip> </Address> </Owner> <Owner phone=“123-4567” gender=“M”> <Name Prefix=“Dr.” NN=“Jonny” FN=“John” MN=“Jay” LN=“Doe” Suffix=“Jr.” /> <Address> <Adr1>10 S, Beaver</Adr1> <Adr2><State>PA</State></Adr2> </Addres> </Owner> </OwnerTable> Nov. 10, 2005
Structured model Large scale data Limited semantics Focus: how to handle large size data efficiently? Unstructured or Semi-structured model Small to Medium scale data Document vs. data Flexible and rich semantics Focus: how can one handle large number of small size data with various formats efficiently? RDBMS vs. XML RDBMS XML Nov. 10, 2005
A conceptual view of XML • An XML document is an ordered, labeled tree • Character data leaf nodes contain the actual data (text strings) • Elements nodes are each labeled with • a name (often called the element type), and • a set of attributes, each consisting of a name and a value, • and can have child nodes Nov. 10, 2005
A concrete view of XML • An XML document is a text with markup tags and other meta-information. • Markup tags denote elements: ..<foo attr="val" ...>bar</foo>... | | | | | | | a matching element end tag | | the contents “bar” of the element | an attribute with name attr and value val, enclosed by ' or "an element start tag with name foo • An XML document must be well-formed: • start and end tags must match • element tags must be properly nested Nov. 10, 2005
Example of XML document <?xml version="1.0"?> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> • The XML declaration (1st line) should always be included; root element is <note> • In XML all elements must have a closing tag: • <p>foo<p>bar => <p>foo</p><p>bar</p> • XML tags are case sensitive: • <Message>This is incorrect</message> • Must be properly nested within each other: • <b><i>This is incorrect</b></i> Nov. 10, 2005
Element vs. Attribute • The same information can be captured by either Element or Attribute in XML <event eID=“sigmod02”> <acronym>SIGMOD</acronym> <society>ACM</society> <url>www.sigmod02.org</url> <loc> <city>Madison</city> <state>WI</state> </loc> <year>2002</year> </event> <event eID=“sigmod02” acronym=“SIGMOD” society=“ACM url=“www.sigmod02.org” city=“Madison” state=“WI” year=“2002”/> Nov. 10, 2005
Applications of XML • XML is a meta-language to create another languages; the main application of XML is making new languages • XHTML: W3C's XMLization of HTML 4.0. <?xml version="1.0" encoding="UTF-8"?> <html xmlns=http://www.w3.org/1999/xhtml xml:lang="en"> <head><title>Hello world!</title></head> <body><p>foobar</p></body> </html> Nov. 10, 2005
Applications of XML (cont.) • CML: Chemical Markup Language • There are +1000 new markup languages made by XML (eg, www.schema.net) <molecule id="METHANOL"> <atomArray> <stringArray builtin="elementType">C O H H H H</stringArray> <floatArray builtin="x3" units="pm"> -0.748 0.558 -1.293 -1.263 -0.699 0.716 </floatArray> </atomArray> </molecule> Nov. 10, 2005
Why is XML NOT important? • Some people believe that XML is NOT important • The best answer that I have gotten so far is (from SIGMOD 2001 Panel Talk): XML is NOT important because … • XML is a wire language • But, world is going for wireless • What do you think? Nov. 10, 2005
Why is XML important? • Technically, … Nothing; Just old simple tree model… • Non-technically, … • Hot ($$$) • The standard for representation of Web information • The real force of XML is generic languagesandtools! • By building on XML, you get a massive (standard) infrastructure for free Nov. 10, 2005
XML Usages… Nov. 10, 2005
XML Offers • Common extensions to the core XML specification: a namespace mechanism, document inclusion, etc. • Schemas: grammars to define classes of documents • Linking between documents: a generalization of HTML anchors and links • Transformation: conversion from one document class to another • Querying: extraction of information, generalizing relational databases • More… Nov. 10, 2005
XML Schema Languages • A schema is a formal description of structures and constraints • Relational schema consists of table, column, constraints, etc. • A schema language is a formal language for expressing schemas. • Schema processing: Given an XML document and a schema, a schema processor checks for validity, i.e. that the document conforms to the schema requirements Nov. 10, 2005
XML Schema Languages (cont) Nov. 10, 2005
Why Bother? • Why bother formalizing the syntax with a schema? • A formal definition provides a precise but human-readable reference • Schema processing can be done with existing implementations • Your own tools for your language can benefit: by piping input documents through a schema processor, you can assume that the input is valid and defaults have been inserted Nov. 10, 2005
Which Schema Language? • Many proposals competing for acceptance • W3C Proposals: DTD, XML Data, DCD, DDML, SOX, XML-Schema, … • Non-W3c Proposals: Assertion Grammars, Schematron, DSD, TREX, RELAX, XDuce, RELAX-NG, … • Different applications have different needs from a schema language Nov. 10, 2005
Expressive Power (content model) DTD XML-Schema XDuce, RELAX-NG Nov. 10, 2005
Closure (content model) DTD XML-Schema XDuce, RELAX-NG Closed under INTERSECT Closed under INTERSECT, UNION, DIFFERENCE Nov. 10, 2005
DTD: Document Type Definition • XML DTD is a subset of SGML DTD • XML DTD is the standard XML Schema Language of the past (and present maybe…) • It is one of the simplest and least expressive schema languages proposed for XML model • It does not use XML tag notation, but use its own weird notation • It cannot express relatively complex constraint (eg, key) well • It is being replaced by XML-Schema of W3C and RELAX-NG of OASIS Nov. 10, 2005
DTD: Elements • <!ELEMENT element-namecontent-model>associates a content model to all elements of the given name content models: • EMPTY: no content is allowed • ANY: any content is allowed • (#PCDATA|element-name|...): "mixed content", arbitrary sequence of character data and listed elements Nov. 10, 2005
DTD: Elements • Regular Expression over element names: sequence of elements matching the expression • choice: (...|...|...) • sequence: (...,...,...) • optional: ...? • zero or more: ...* • one or more: ...+ Nov. 10, 2005
DTD: Elements • Eg: “Name” element consists of an • optional FirstName, followed by • mandatory LastName elements, where • Both are text string <!ELEMENT Name (FirstName? , LastName) <!ELEMENT FirstName (#PCDATA)> <!ELEMENT LastName (#PCDATA) Nov. 10, 2005
DTD: Attributes • <!ATTLIST element-nameattr-nameattr-typeattr-default ...> declares which attributes are allowed or required in which elements attribute types: • CDATA: any value is allowed (the default) • (value|...): enumeration of allowed values • ID, IDREF, IDREFS: ID attribute values must be unique (contain "element identity"), IDREF attribute values must match some ID (reference to an element) • ENTITY, ENTITIES, NMTOKEN, NMTOKENS, NOTATION: just forget these... (consider them obsolete) Nov. 10, 2005
DTD: Attributes • Attribute defaults: • #REQUIRED: the attribute must be explicitly provided • #IMPLIED: attribute is optional, no default provided • "value": if not explicitly provided, this value inserted by default • #FIXED "value": as above, but only this value is allowed Nov. 10, 2005