
Understanding XML: Documents vs. Databases

Explore the differences and similarities between document-oriented and database-oriented data management using XML. Learn about XML basics, query languages, and additions like XLink and RDF. Discover use cases and paradigms for managing documents and databases.

josenjones




Presentation Transcript


  1. CIS 550, Fall 2001. Handout 6: XML

  2. Outline (ambitious) • Background: documents (SGML/HTML) and databases (structured and semistructured data) • XML basics and Document Type Definitions (DTDs) • XML query languages: XML-QL and XSL • XML additions: XLink, XPointer, RDF, SOX, XML-Data • Document Object Model (XML APIs)

  3. Some Useful Articles
     • XML, Java, and the future of the web: http://webreview.com/wr/pub/97/12/19/xml/index.html
     • XML and the Second-Generation Web: http://www.sciam.com/1999/0599issue/0599bosak.html
     • Articles/standards for XML, XSL, XML-QL: http://www.w3c.org/ and http://www.w3.org/TR/REC-xml

  4. Part I: Background. What is the difference between the world of documents and information retrieval and the world of databases and query interfaces?

  5. Documents vs Databases
     Document world: • plenty of small documents • usually static • implicit structure (section, paragraph, toc) • tagging • human friendly • content: form/layout, annotation • paradigms: “Save as”, WYSIWYG • meta-data: author name, date, subject
     Database world: • a few large databases • usually dynamic • explicit structure (schema) • records • machine friendly • content: schema, data, methods • paradigms: Atomicity, Consistency, Isolation, Durability • meta-data: schema description

  6. What to do with them
     Documents: editing, printing, spell-checking, counting words, retrieving (IR), searching
     Databases: updating, cleaning, querying, composing/transforming

  7. HTML • Lingua franca for publishing hypertext on the World Wide Web. • Designed to describe how a Web browser should arrange text, images and push-buttons on a page. • Easy to learn, but does not convey structure. • Fixed tag set.
     <HTML>
       <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
       <BODY>
         <H1>Introduction</H1>
         <IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
       </BODY>
     </HTML>
     Callouts on the slide: opening tag, closing tag, text (PCDATA), “bachelor” tag, attribute name, attribute value.

  8. Thin red line • The line between the document world and the database world is not clear. • In some cases, both approaches are legitimate. • An interesting middle ground is data formats, of which XML is an example. • Examples: personal address book, Swissprot, ASN.1.

  9. Personal address book over 20 years
     1977
       N Achison, Malcolm
       F Dr. M.P. Achison
       A Dept. of Computer Science
       A University of Edinburgh
       A Kings Buildings
       A Edinburgh E12 8QQ
       A Scotland
       T 031-123-8855 ext. 4359 (work)
       T 031-345-7570 (home)
       N Albani, Paolo
       F Prof. Paolo Albani
       A Dip. Informatica e Sistemistica
       A Universita di Roma La Sapienza
       ...
     1980
       N Achison, Malcolm
       F Dr. M.P. Achison
       A Dept. of Computer Science
       ....
       T 031-667-7570 (home)
       C mpa@uk.ac.ed.cs
     1990
       N Achison, Malcolm
       F Prof. M.P. Achison
       A Dept. of Computing Science
       A University of Glasgow
       A Lilybank Gardens
       A Glasgow G12 8QQ
       A Scotland
       T 041-339-8855 ext. 4359
       T 041-357-3787 (private)
       T 031-667-7570 (home)
       X 041-339-0090
       C mpa@uk.ac.gla.cs
       N Achison, Malcolm
       F Prof. M.P. Achison
       A 34 Inverness Place
       A Edinburgh, EH3 8UV
     1997
       N Achison, Malcolm
       F Prof. M.P. Achison
       A Department of Computing Science
       ...
       T 031-667-7570 (home)
       X 041-339-0090
       C mpa@dcs.gla.ac.uk
       W http://www.dcs.gla.ac.uk/mpa
     2000
       ?

  10. Swissprot
      ID 11SB_CUCMA STANDARD; PRT; 480 AA.
      AC P13744;
      DT 01-JAN-1990 (REL. 13, CREATED)
      DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
      DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
      DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.
      OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
      OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
      OC VIOLALES; CUCURBITACEAE.
      RN [1]
      RP SEQUENCE FROM N.A.
      RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;
      RX MEDLINE; 88166744.
      RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARANISHIMURA I.;
      RL EUR. J. BIOCHEM. 172:627-632(1988).
      RN [2]
      RP SEQUENCE OF 22-30 AND 297-302.
      RA OHMIYA M., HARA I., MASTUBARA H.;
      RL PLANT CELL PHYSIOL. 21:157-167(1980).

  11. Swissprot (cont’d)
      CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
      CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
      CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
      CC DISULFIDE BOND.
      CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
      DR EMBL; M36407; G167492; -.
      DR PIR; S00366; FWPU1B.
      DR PROSITE; PS00305; 11S_SEED_STORAGE; 1.
      KW SEED STORAGE PROTEIN; SIGNAL.
      FT SIGNAL 1 21
      FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.
      FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).
      FT CHAIN 297 480 DELTA CHAIN (BASIC).
      FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.
      FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
      FT CONFLICT 27 27 S -> E (IN REF. 2).
      FT CONFLICT 30 30 E -> S (IN REF. 2).
      SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32;
         MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
         RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
         IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
         ...

  12. The Structure of XML • XML consists of tags and text. • Tags come in pairs: <date> ... </date> • They must be properly nested:
      <date> <day> ... </day> ... </date>   --- good
      <date> <day> ... </date> ... </day>   --- bad
      (You can’t do <i> ... <b> ... </i> ... </b> in HTML)

  13. XML text. XML has only one “basic” type: text. It is bounded by tags, e.g. <title> The Big Sleep </title> <year> 1935 </year> --- 1935 is still text. XML text is called PCDATA (for parsed character data). Its character set is Unicode, and any character can be written as a numeric character reference, e.g. &#x05DE; for the Hebrew letter Mem. Later we shall see how new types are specified by XML-Data.
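To make the "everything is text" point concrete, here is a minimal sketch using the JDK's built-in JAXP parser (an assumption; the slides do not name a parser). The class and method names are hypothetical. It shows that `1935` comes back as a string, spaces included, and that the character reference `&#x05DE;` is resolved to the single Unicode character U+05DE.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class PcdataDemo {
    /** Parses a small document and returns the text inside its root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The content of <year> is a String; the surrounding spaces are part of it.
        String year = rootText("<year> 1935 </year>");
        System.out.println("[" + year + "]");

        // A numeric character reference is replaced by the character it names.
        String mem = rootText("<letter>&#x05DE;</letter>");
        System.out.println(Integer.toHexString(mem.codePointAt(0)));
    }
}
```

Turning the text into a number is left to the application, for example with `Integer.parseInt(year.trim())`.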

  14. XML structure. Nesting tags can be used to express various structures. E.g. a tuple (record):
      <person>
        <name>Malcolm Atchison</name>
        <tel>(215) 898 4321</tel>
        <email>mp@dcs.gla.ac.sc</email>
      </person>

  15. XML structure (cont.) • We can represent a list by using the same tag repeatedly:
      <addresses>
        <person> ... </person>
        <person> ... </person>
        <person> ... </person>
        ...
      </addresses>

  16. Terminology. The segment of an XML document between an opening and a corresponding closing tag is called an element.
      <person>
        <name> Malcolm Atchison </name>
        <tel> (215) 898 4321 </tel>
        <tel> (215) 898 4321 </tel>
        <email> mp@dcs.gla.ac.sc </email>
      </person>
      Callouts on the slide mark three spans: an element; an element that is a sub-element of another; and a span that is not an element.

  17. XML is tree-like. [Figure: a tree with root person and children name, tel, tel, email, whose leaves are Malcolm Atchison, (215) 898 4321, (215) 898 4321 and mp@dcs.gla.ac.sc.] Semistructured data models typically put the labels on the edges.

  18. Mixed Content. An element may contain a mixture of sub-elements and PCDATA:
      <airline>
        <name> British Airways </name>
        <motto> World’s <dubious> favorite</dubious> airline </motto>
      </airline>
      Data of this form is not typically generated from databases. It is needed for consistency with HTML.
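As a rough illustration of what mixed content looks like to a program, the sketch below parses a motto-like element with the JDK's JAXP parser (an assumed choice) and lists its children in order. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class MixedContentDemo {
    /** Lists the children of the root element, in document order. */
    static String describeChildren(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        root.normalize(); // merge adjacent text nodes, if the parser produced any
        List<String> parts = new ArrayList<>();
        NodeList kids = root.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() == Node.ELEMENT_NODE) {
                parts.add("element[" + kid.getNodeName() + "]");
            } else if (kid.getNodeType() == Node.TEXT_NODE) {
                parts.add("text[" + kid.getNodeValue() + "]");
            }
        }
        return String.join(" | ", parts);
    }

    public static void main(String[] args) {
        System.out.println(describeChildren(
                "<motto>World's <dubious>favorite</dubious> airline</motto>"));
    }
}
```

Text and element children are interleaved, so code that assumes every child is an element breaks on data like this.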

  19. A Complete XML Document
      <?xml version="1.0"?>
      <person>
        <name> Malcolm Atchison </name>
        <tel> (215) 898 4321 </tel>
        <email> mp@dcs.gla.ac.sc </email>
      </person>

  20. Two ways of representing a DB
      projects: title, budget, managedBy
      employees: name, ssn, age

  21. Project and Employee relations in XML. Projects and employees are intermixed:
      <db>
        <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project>
        <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee>
        <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee>
        <project> <title> Auto guided vehicle </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project>
        ...
      </db>

  22. Project and Employee relations in XML (cont’d). Employees follow projects:
      <db>
        <projects>
          <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project>
          <project> <title> Auto guided vehicles </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project>
          ...
        </projects>
        <employees>
          <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee>
          <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee>
          ...
        </employees>
      </db>

  23. Project and Employee relations in XML (cont’d). Or without “separator” tags:
      <db>
        <projects>
          <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy>
          <title> Auto guided vehicles </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy>
          ...
        </projects>
        <employees>
          <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age>
          <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age>
          ...
        </employees>
      </db>

  24. Attributes. An (opening) tag may contain attributes. These are typically used to describe the content of an element:
      <entry>
        <word language="en"> cheese </word>
        <word language="fr"> fromage </word>
        <word language="ro"> branza </word>
        <meaning> A food made … </meaning>
      </entry>

  25. Attributes (cont’d). Another common use for attributes is to express dimension or type:
      <picture>
        <height dim="cm"> 2400 </height>
        <width dim="in"> 96 </width>
        <data encoding="gif" compression="zip"> M05-.+C$@02!G96YE<FEC ... </data>
      </picture>
      A document that obeys the “nested tags” rule and does not repeat an attribute within a tag is said to be well-formed.
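The two well-formedness rules named above (proper nesting, no repeated attribute within a tag) can be checked by handing text to any XML parser. The sketch below uses the JDK's JAXP parser, which is an assumption since the slides do not prescribe one; `WellFormedCheck` and `isWellFormed` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true if a non-validating parser accepts the text as XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        // Properly nested tags.
        System.out.println(isWellFormed("<date><day>1</day></date>"));
        // Overlapping tags.
        System.out.println(isWellFormed("<date><day>1</date></day>"));
        // The same attribute twice in one tag.
        System.out.println(isWellFormed(
                "<word language=\"en\" language=\"fr\">cheese</word>"));
    }
}
```

Well-formedness says nothing about whether the tags make sense for an application; that is the job of the type descriptions discussed later in the course.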

  26. When to use attributes. It’s not always clear when to use attributes:
      <person ssno="123 45 6789">
        <name> F. MacNiel </name>
        <email> fmacn@dcs.barra.ac.sc </email>
        ...
      </person>
      versus
      <person>
        <ssno>123 45 6789</ssno>
        <name> F. MacNiel </name>
        <email> fmacn@dcs.barra.ac.sc </email>
        ...
      </person>

  27. Using IDs
      <family>
        <person id="jane" mother="mary" father="john"> <name> Jane Doe </name> </person>
        <person id="john" children="jane jack"> <name> John Doe </name> <mother/> </person>
        <person id="mary" children="jane jack"> <name> Mary Doe </name> </person>
        <person id="jack" mother="mary" father="john"> <name> Jack Doe </name> </person>
      </family>
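One way to see how ID-style attributes turn a tree into a graph is to follow them in code. The sketch below is an illustration only: without a DTD the parser does not know that `id` is an ID attribute, so it builds its own lookup table. The names `FamilyIds` and `follow` are hypothetical, and JAXP is an assumed parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class FamilyIds {
    static final String FAMILY =
        "<family>"
      + "<person id=\"jane\" mother=\"mary\" father=\"john\"><name> Jane Doe </name></person>"
      + "<person id=\"john\" children=\"jane jack\"><name> John Doe </name><mother/></person>"
      + "<person id=\"mary\" children=\"jane jack\"><name> Mary Doe </name></person>"
      + "<person id=\"jack\" mother=\"mary\" father=\"john\"><name> Jack Doe </name></person>"
      + "</family>";

    /** Returns the names of the persons referenced by one attribute of one person. */
    static List<String> follow(String xml, String personId, String attribute) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        // Index every person by its id attribute.
        Map<String, Element> byId = new HashMap<>();
        NodeList persons = doc.getElementsByTagName("person");
        for (int i = 0; i < persons.getLength(); i++) {
            Element p = (Element) persons.item(i);
            byId.put(p.getAttribute("id"), p);
        }
        List<String> names = new ArrayList<>();
        Element start = byId.get(personId);
        if (start == null) {
            return names;
        }
        // The attribute value is a space-separated list of ids.
        for (String ref : start.getAttribute(attribute).trim().split("\\s+")) {
            Element target = byId.get(ref);
            if (target != null) {
                names.add(target.getElementsByTagName("name").item(0)
                        .getTextContent().trim());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(follow(FAMILY, "jane", "mother"));
        System.out.println(follow(FAMILY, "john", "children"));
    }
}
```

Nothing in plain XML guarantees that a reference such as `mother="mary"` points at an existing person; dangling references are silently skipped here.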

  28. ODL schema
      class Movie ( extent Movies, key title ) {
        attribute string title;
        attribute string director;
        relationship set<Actor> casts inverse Actor::acted_In;
        attribute int budget;
      };
      class Actor ( extent Actors, key name ) {
        attribute string name;
        relationship set<Movie> acted_In inverse Movie::casts;
        attribute int age;
        attribute set<string> directed;
      };

  29. An example
      <db>
        <movie id="m1">
          <title>Waking Ned Divine</title> <director>Kirk Jones III</director>
          <cast idrefs="a1 a3"></cast> <budget>100,000</budget>
        </movie>
        <movie id="m2">
          <title>Dragonheart</title> <director>Rob Cohen</director>
          <cast idrefs="a2 a9 a21"></cast> <budget>110,000</budget>
        </movie>
        <movie id="m3">
          <title>Moondance</title> <director>Dagmar Hirtz</director>
          <cast idrefs="a1 a8"></cast> <budget>90,000</budget>
        </movie>
        ...
        <actor id="a1">
          <name>David Kelly</name> <acted_In idrefs="m1 m3 m78"></acted_In>
        </actor>
        <actor id="a2">
          <name>Sean Connery</name> <acted_In idrefs="m2 m9 m11"></acted_In> <age>68</age>
        </actor>
        <actor id="a3">
          <name>Ian Bannen</name> <acted_In idrefs="m1 m35"></acted_In>
        </actor>
        ...
      </db>

  30. Part II: The Document Object Model (DOM). Programming with XML

  31. XML Parsers • Traditional: build a data structure (DOM). • Event based: SAX (Simple API for XML), http://www.megginson.com/SAX; write a handler for start tags and for end tags.

  32. Programming in XML -- why we need database technology • Let’s examine some elements of the Document Object Model. It provides an interface to parsed data.

  33. DOM -- overview • Interface to parsed XML • “… language-neutral...” interface (IDL) • Level 1: functionality for XML document navigation and manipulation • Level 2: stylesheets and namespaces • Level 3 (future): document loading and saving, DTDs and schemas • http://www.w3.org/DOM/

  34. DOM representation -- a tree of nodes
      <partdb>
        <part id="a123" units="mks">
          <name> widget </name>
          <weight> 0.454 </weight>
        </part>
      </partdb>
      [Figure: the node tree. A Document node; an Element with getTagName() = "part"; its two Attr nodes, with getName() = "id", getValue() = "a123" and getName() = "units", getValue() = "mks"; child Elements with getTagName() = "name" and getTagName() = "weight"; their CharacterData children with getData() = "widget" and getData() = "0.454".]

  35. Node interface
      public interface Node {
        ...
        public String getNodeName();
        ...
        public NodeList getChildNodes();
        ...
        public NamedNodeMap getAttributes();
        ...
      }

  36. Sub-elements -- an “array”
      public interface NodeList {
        public Node item(int index);
        public int getLength();
      }
      Attributes -- a “dictionary”
      public interface NamedNodeMap {
        public Node getNamedItem(String name);
        ...
      }
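The interfaces above can be exercised on the parts document from the DOM tree slide. This is a sketch under the assumption that the JDK's JAXP parser is used to obtain the `Document`; `PartNavigator` and `describeFirstPart` are hypothetical names. Attributes are looked up by name through `NamedNodeMap`, and sub-elements are visited by index through `NodeList`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class PartNavigator {
    static final String PARTDB =
        "<partdb><part id=\"a123\" units=\"mks\">"
      + "<name> widget </name><weight> 0.454 </weight>"
      + "</part></partdb>";

    /** Summarises the first part element; assumes the document has one. */
    static String describeFirstPart(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        Element part = (Element) doc.getElementsByTagName("part").item(0);
        StringBuilder out = new StringBuilder(part.getTagName());

        // Attributes behave like a dictionary keyed by attribute name.
        NamedNodeMap attrs = part.getAttributes();
        for (String key : new String[] {"id", "units"}) {
            Node attr = attrs.getNamedItem(key);
            if (attr != null) {
                out.append(' ').append(key).append('=').append(attr.getNodeValue());
            }
        }

        // Children behave like an array; skip anything that is not an element.
        NodeList kids = part.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Node first = kid.getFirstChild();
            if (first instanceof CharacterData) {
                out.append(' ').append(kid.getNodeName()).append('=')
                   .append(((CharacterData) first).getData().trim());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeFirstPart(PARTDB));
    }
}
```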

  37. A common form of data extraction: find the names and telephones of all employees in Math
      <doc1>
        <employee>
          <name> John Doe </name>
          <contact-info>
            <address> … </address>
            <tel> 123 7456 </tel>
            <email> jd@abc.edu</email>
          </contact-info>
          <dept> Math </dept>
        </employee>
        <employee> … </employee>
        ...
      </doc1>
      Result:
      John Doe   123 7456
      Jane Dee   234 5678
      …          ...

  38. Top-level traversal in DOM
      public class Test {
        public static void main(String args[]) throws Exception {
          Parser parser = new Parser(args[0]);   // the parser API is not specified by DOM
          Document doc = parser.readStream(new FileInputStream(args[0]));
          NodeList nodes = doc.getDocumentElement().getChildNodes();
          for (int i = 0; i < nodes.getLength(); i++) {
            Element n = (Element) nodes.item(i);   // coercion
            // select Math depts
          }
        }
      }

  39. Selecting the math departments
      NodeList ndl = n.getChildNodes();
      for (int idl = 0; idl < ndl.getLength(); idl++) {
        Element c = (Element) ndl.item(idl);   // coercion
        if (c.getTagName().equals("dept")
            && ((CharacterData) c.getFirstChild()).getData().trim().equals("Math")) {   // coercion
          // inner code
          break;
        }
      }

  40. The inner code
      NodeList nnl = n.getChildNodes();
      for (int ii = 0; ii < nnl.getLength(); ii++) {
        Element nn = (Element) nnl.item(ii);   // coercion
        if (nn.getTagName().equals("name")) {
          // getElementsByTagName: preorder search of the subtree. Convenient, but is it what we want?
          NodeList ncl = n.getElementsByTagName("tel");
          for (int iii = 0; iii < ncl.getLength(); iii++) {
            Node nc = ncl.item(iii);
            System.out.print(((CharacterData) nn.getFirstChild()).getData());     // coercion
            System.out.println(((CharacterData) nc.getFirstChild()).getData());   // coercion
          }
        }
      }

  41. Comments on our code • Already quite cumbersome. Compare with an “equivalent” semistructured database query:
      select {name: $N, tel: $T} where {name: $N, dept: "Math", contact-info.tel: $T} <- DB
      • Code may fail wherever there is a coercion, or give “empty” results. • Code is already inefficient (double iterations over the same set of nodes). • Need for types!
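For comparison, here is one possible defensive version of the same extraction, offered as a sketch and not as the course's solution. It assumes the JDK's JAXP parser; the helper names and the third, non-Math employee in the sample data are hypothetical. Instead of coercing every child to `Element`, it filters children by type and tag, so whitespace text nodes between tags do not cause a `ClassCastException`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class MathContacts {
    static final String SAMPLE =
        "<doc1>\n"
      + "  <employee><name> John Doe </name>\n"
      + "    <contact-info><tel> 123 7456 </tel><email> jd@abc.edu</email></contact-info>\n"
      + "    <dept> Math </dept></employee>\n"
      + "  <employee><name> Jane Dee </name>\n"
      + "    <contact-info><tel> 234 5678 </tel></contact-info>\n"
      + "    <dept> Math </dept></employee>\n"
      // Hypothetical employee outside Math, added to show the filter working.
      + "  <employee><name> Sam Poe </name>\n"
      + "    <contact-info><tel> 555 0100 </tel></contact-info>\n"
      + "    <dept> Physics </dept></employee>\n"
      + "</doc1>";

    /** Direct child elements of parent with the given tag; text nodes are skipped. */
    static List<Element> children(Element parent, String tag) {
        List<Element> out = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid instanceof Element && kid.getNodeName().equals(tag)) {
                out.add((Element) kid);
            }
        }
        return out;
    }

    static String text(Element e) {
        return e.getTextContent().trim();
    }

    /** Returns "name: tel" for every telephone of every employee in Math. */
    static List<String> extract(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        List<String> rows = new ArrayList<>();
        for (Element emp : children(root, "employee")) {
            boolean inMath = false;
            for (Element dept : children(emp, "dept")) {
                inMath |= text(dept).equals("Math");
            }
            if (!inMath) {
                continue;
            }
            for (Element name : children(emp, "name")) {
                for (Element info : children(emp, "contact-info")) {
                    for (Element tel : children(info, "tel")) {
                        rows.add(text(name) + ": " + text(tel));
                    }
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        extract(SAMPLE).forEach(System.out::println);
    }
}
```

Even in this form the program is several times longer than the one-line query, and nothing checks statically that `contact-info` or `tel` exist in the data.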

  42. Constructing data
      From:
      <doc1>
        <employee>
          <name> John Doe </name>
          <contact-info>
            <address> … </address>
            <tel> 123 7456 </tel>
            <email> jd@abc.edu</email>
          </contact-info>
          <dept> Math </dept>
        </employee>
        <employee> … </employee>
        ...
      </doc1>
      To:
      <doc2>
        <employee> <name> John Doe </name> <tel> 123 7456 </tel> </employee>
        <employee> <name> Jane Dee </name> <tel> 234 5678 </tel> </employee>
        ...
      </doc2>

  43. Constructing Data using the DOM
      Document d = new DocumentImplementation …
      Element root = d.createElement("doc2");   // set root of document
      // top-level loop
      {
        Element emp = d.createElement("employee");
        root.appendChild(emp);
        // innermost loop
        {
          ...
          Element name = d.createElement("name");
          // set s to appropriate character string
          name.appendChild(d.createCDATASection(s));
          emp.appendChild(name);
          ...
        }
        …
      }
      All node constructors are methods of the document implementation. Could also use an existing node or a “cloned” node.
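A runnable variant of this construction might look as follows. It is a sketch with several assumptions: the empty `Document` comes from JAXP's `DocumentBuilder.newDocument()`, plain text nodes (`createTextNode`) are used in place of CDATA sections, the rows are passed in as a hypothetical array of name/telephone pairs, and serialisation uses the JDK's identity `Transformer`.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BuildDoc2 {
    /** Builds a doc2 tree with one employee per {name, tel} pair. */
    static Document build(String[][] rows) {
        Document d;
        try {
            d = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("could not create a document", e);
        }
        Element root = d.createElement("doc2");
        d.appendChild(root); // make it the root of the document
        for (String[] row : rows) {
            Element emp = d.createElement("employee");
            root.appendChild(emp);

            Element name = d.createElement("name");
            name.appendChild(d.createTextNode(row[0]));
            emp.appendChild(name);

            Element tel = d.createElement("tel");
            tel.appendChild(d.createTextNode(row[1]));
            emp.appendChild(tel);
        }
        return d;
    }

    public static void main(String[] args) throws Exception {
        Document d = build(new String[][] {
            {"John Doe", "123 7456"},
            {"Jane Dee", "234 5678"},
        });
        // Serialise the tree; an identity transform copies DOM to text.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(d), new StreamResult(out));
        System.out.println(out);
    }
}
```

Every node is created through the owning document (`d.createElement`, `d.createTextNode`), which matches the slide's remark that node constructors are methods of the document.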

  44. A join: the names of employees and their department buildings
      <doc1>
        <employee> <name> John Doe </name> <contact-info>...</contact-info> <dept> Math </dept> </employee>
        ...
      </doc1>
      <doc2>
        <department> <dname> Math </dname> <building> A123 </building> </department>
        ...
      </doc2>
      Result:
      <doc3>
        <row> <name> John Doe </name> <building> A123 </building> </row>
        <row> <name> Jane Dee </name> <building> B456 </building> </row>
        ...
      </doc3>

  45. Implementing a Join in the DOM??
      nl1 = r1.getChildNodes();   // r1 is root of doc1
      for (int i1 = 0; i1 < nl1.getLength(); i1++) {
        nl2 = r2.getChildNodes();   // r2 is root of doc2
        for (int i2 = 0; i2 < nl2.getLength(); i2++) …
      }
      • Even if we can get both documents in core, this is not the most efficient method. • If not? • This is a typical database query!
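To illustrate why the nested loop is not the most efficient method, the sketch below performs the same join with a hash table: one pass over the departments to build an index, then one pass over the employees. This is the textbook hash-join idea applied to DOM trees, not something the slides prescribe. The class name, the helper names, and the "Physics" department for Jane Dee are hypothetical; JAXP is the assumed parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomHashJoin {
    static final String DOC1 =
        "<doc1>"
      + "<employee><name> John Doe </name><dept> Math </dept></employee>"
      + "<employee><name> Jane Dee </name><dept> Physics </dept></employee>"
      + "</doc1>";
    static final String DOC2 =
        "<doc2>"
      + "<department><dname> Math </dname><building> A123 </building></department>"
      + "<department><dname> Physics </dname><building> B456 </building></department>"
      + "</doc2>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Trimmed text of the first descendant with the given tag, or null. */
    static String childText(Element parent, String tag) {
        NodeList found = parent.getElementsByTagName(tag);
        return found.getLength() == 0 ? null : found.item(0).getTextContent().trim();
    }

    /** Joins employees with departments on the department name. */
    static List<String> join(String doc1, String doc2) {
        // Build phase: department name -> building, one pass over doc2.
        Map<String, String> buildingByDept = new HashMap<>();
        NodeList depts = parse(doc2).getElementsByTagName("department");
        for (int i = 0; i < depts.getLength(); i++) {
            Element dept = (Element) depts.item(i);
            String dname = childText(dept, "dname");
            if (dname != null) {
                buildingByDept.put(dname, childText(dept, "building"));
            }
        }
        // Probe phase: one pass over doc1, one hash lookup per employee.
        List<String> rows = new ArrayList<>();
        NodeList emps = parse(doc1).getElementsByTagName("employee");
        for (int i = 0; i < emps.getLength(); i++) {
            Element emp = (Element) emps.item(i);
            String building = buildingByDept.get(childText(emp, "dept"));
            if (building != null) {
                rows.add(childText(emp, "name") + " -> " + building);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        join(DOC1, DOC2).forEach(System.out::println);
    }
}
```

This still requires both documents to fit in memory, which is the case the slide's "If not?" asks about; a database system chooses the join algorithm and handles larger-than-memory inputs itself.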

  46. Conclusions so far • We need types for our XML documents, and we’d like to be able to use them as a static check of our program correctness. • We need database technology for large “documents” and -- perhaps -- for simplicity of code.

  47. Before we leave APIs… SAX, a low-level alternative to DOM • SAX = Simple API for XML • Supported by most XML parsers • Event-driven • Instead of reading the entire file into memory and building a tree, SAX reads a stream of tokens and triggers events, e.g. startDocument, startElement, endElement, endDocument • The programmer has to write a document handler that captures these events and does something with the tokens • Simpler than DOM, but the programmer must do more storage management • Less likely than DOM to fail on large documents!
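As a sketch of what such a document handler can look like, the example below redoes the "names and telephones of employees in Math" extraction with the JDK's SAX parser (an assumed choice). The class names and the non-Math employee in the sample are hypothetical. Because `dept` arrives after `name` and `tel` in the stream, the handler has to buffer them itself, which is the extra storage management mentioned above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxMathContacts {
    static final String SAMPLE =
        "<doc1>\n"
      + "  <employee><name> John Doe </name>\n"
      + "    <contact-info><tel> 123 7456 </tel></contact-info>\n"
      + "    <dept> Math </dept></employee>\n"
      // Hypothetical employee outside Math.
      + "  <employee><name> Sam Poe </name>\n"
      + "    <contact-info><tel> 555 0100 </tel></contact-info>\n"
      + "    <dept> Physics </dept></employee>\n"
      + "</doc1>";

    /** Collects "name: tel" rows for employees whose dept is Math. */
    static class Handler extends DefaultHandler {
        final List<String> rows = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private final List<String> tels = new ArrayList<>();
        private String name;
        private String dept;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            text.setLength(0);
            if (qName.equals("employee")) { // start a fresh record
                name = null;
                dept = null;
                tels.clear();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may arrive in several chunks
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            switch (qName) {
                case "name": name = value; break;
                case "tel": tels.add(value); break;
                case "dept": dept = value; break;
                case "employee":
                    // Only now is the whole record known.
                    if ("Math".equals(dept)) {
                        for (String tel : tels) {
                            rows.add(name + ": " + tel);
                        }
                    }
                    break;
                default: break;
            }
            text.setLength(0);
        }
    }

    static List<String> extract(String xml) {
        Handler handler = new Handler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.rows;
    }

    public static void main(String[] args) {
        extract(SAMPLE).forEach(System.out::println);
    }
}
```

Only one employee's fields are held in memory at a time, so the document itself can be much larger than available memory.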

  48. Why Databases Can Help • Efficiency: querying and storage. • Simplicity (a consequence of efficiency?): query languages have a simple “algebraic” structure. Query transformation = optimization. • (Maybe) suggestions for type systems/constraints.
      select name, bldg from emps, depts where emps.dname = depts.dname
      (Efficiently executed in SQL through the use of indexing and clever join algorithms.)

  49. Other things that databases might do for XML • Concurrency (allow a granularity finer than that of a file) • Proposals for updates? • Integrity. Keys and referential integrity (XML proposals still weak) • Data mining • Compression

  50. The Down-Side • Query languages are typically rather weak. You can’t always express a computation in a QL. • We may have to tamper with the semantics of XML (sets vs. lists) • We may have to simplify XML or its attendant “type systems” (DTDs, XML-schema) -- No bad thing :)
