XML SNU OOPSLA Lab. October 2005
Contents • Semistructured Data • Introduction • History • XML Application • DTD & XML Schema • DOM & SAX • Summary • Online Resources
Semistructured Data(1/3) • Semistructured Data and XML • Integration of heterogeneous sources • Data sources with non-rigid structure • Biological data • Web data • Characteristics of Semistructured Data • Missing or additional attributes • Multiple attributes • Different types in different objects • Heterogeneous collections • Self-describing, irregular data, no a priori structure
Semistructured Data(2/3) • Data Model: Object Exchange Model (OEM) [Figure: OEM graph. The complex object Bib (&o1) has paper and book edges to &o12, &o24, and &o29; edges labeled references, author, title, year, page, publisher, and http lead to further objects; atomic objects hold values such as “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122, and 133.]
Semistructured Data(3/3) • Syntax for Semistructured Data Bib: &o1 { paper: &o12 { … }, book: &o24 { … }, paper: &o29 { author: &o52 “Abiteboul”, author: &o96 { firstname: &o243 “Victor”, lastname: &o206 “Vianu” }, title: &o93 “Regular path queries with constraints”, references: &o12, references: &o24, pages: &o25 { first: &o64 122, last: &o92 133 } } }
Introduction(1/4) • XML • An acronym for ‘eXtensible Markup Language’ • A meta-language that describes other languages • A data format for storing structured and semi-structured text for dissemination and ultimate publication, perhaps on a variety of media
Introduction(2/4) • Properties • Tags enclose identifiable parts of the document • Self-describing • Physical/logical structure • Physical structure : allows components of the document, called entities, to be named and stored separately • Logical structure : allows a document to be divided into named units and sub-units, called elements
Introduction(3/4) [Figure: physical structure (a document built from internal and separate entities) next to logical structure (units and sub-units, i.e. elements)]
Introduction(4/4) XML markup <warning> <para>This substance is hazardous to health.</para> <para>See procedure 12A.7 for information on protective clothing required.</para> <logo …/> </warning> <transaction> <time date="19980509"/> <amount>123</amount> <currency type="pounds"/> <from id="x98765">J. Smith</from> <to id="x56565">M. Jones</to> </transaction> XML document
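To show how self-describing markup is consumed by a program, here is a minimal sketch that reads the transaction fragment with Python's standard `xml.etree.ElementTree` module. The fragment is written with straight quotes and a closed `id` attribute on `<to>`, which a parser requires; the `record` dictionary and its keys are illustrative choices, not something the slides define.

```python
import xml.etree.ElementTree as ET

TRANSACTION = """
<transaction>
  <time date="19980509"/>
  <amount>123</amount>
  <currency type="pounds"/>
  <from id="x98765">J. Smith</from>
  <to id="x56565">M. Jones</to>
</transaction>
"""

root = ET.fromstring(TRANSACTION)

# Element content and attributes are addressed by name, not by position.
record = {
    "date": root.find("time").get("date"),
    "amount": int(root.findtext("amount")),
    "currency": root.find("currency").get("type"),
    "from": (root.find("from").get("id"), root.findtext("from").strip()),
    "to": (root.find("to").get("id"), root.findtext("to").strip()),
}
```

Because every value is labeled by its tag or attribute name, the receiving application does not need a separate description of field order.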
History(1/2) [Figure: timeline from Generalized Markup (GM, 1960) through SGML (1986) and HTML with the WWW (1992) to XML (1997), set against the growth of the Internet]
History(2/2) • 1960s: IBM GML (Generalized Markup Language) • 1980s: ISO 8879, SGML (Standard Generalized Markup Language) • Early 1990s: HTML (HyperText Markup Language) • 1996: W3C’s XML • 1998: XML 1.0 • 1999: RDF (Resource Description Framework)
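Application • Data exchange applications [Figure: XML produced from a DBMS is checked against a DTD and read by a parser, which either emits SAX events or builds a DOM tree accessed through the DOM API; applications written in ASP, Java, or VB consume the result, and an XSL processor turns it into HTML for the browser] • DOM (Document Object Model) • SAX (Simple API for XML) • XSL (eXtensible Stylesheet Language) • ASP (Active Server Pages)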
An XML Document <?xml version="1.0"?> <!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd"> <sigmodRecord><issue> <volume>1</volume> <number>1</number> <articles><article> <title> XML Research Issues</title> <initPage> 1 </initPage> <endPage> 5 </endPage> <authors> <author AuthorPosition="00"> Tom Hanks </author> … </authors></article></articles></issue> </sigmodRecord>
DTD(1/2) • DTD (Document Type Definition) • An optional but powerful feature of XML • Comprises a set of declarations that define a document structure tree • Some XML processors read the DTD and use it to build the document model in memory • Establishes formal document structure rules • It defines the elements and dictates where they may be applied in relation to each other
DTD(2/2) • Declare vs. Define • Declare: “This document is a concert poster” • Define: “A concert poster must have the following features” • A DTD defines • Element types + Attributes + Entities • Valid vs. Invalid • Valid: conforms to the DTD • Invalid: fails to conform to the DTD [Figure: valid XML documents form a subset of well-formed XML documents]
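The distinction between well-formed and valid can be made concrete. The sketch below only tests well-formedness, because Python's standard `xml.etree.ElementTree` parser is non-validating; checking validity against a DTD would need a validating parser, which is outside this example. The function name `is_well_formed` and the sample poster markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as XML (says nothing about DTD validity)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested tags: well formed.
nested = "<poster><band>Example Band</band></poster>"
# The <band> element is never closed: not well formed.
unclosed = "<poster><band>Example Band</poster>"
```

A document that fails this check cannot be valid either; a document that passes may still violate the rules a DTD defines.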
XML Schema • Schema • W3C standard : specifies the structure of XML documents • Data types for elements/attributes • String, int, float • Unordered sets are also allowed • Derivation of types is allowed • Intended as a replacement for DTDs • A schema is itself written in XML, which removes the syntactic distinction between DTD and XML • Richer types compared to DTD
XML Schema Example <xsd:element name="article" minOccurs="0" maxOccurs="unbounded"> <xsd:complexType><xsd:sequence> <xsd:element name="title" type="xsd:string"/> <xsd:element name="initPage" type="xsd:string"/> <xsd:element name="endPage" type="xsd:string"/> <xsd:element name="author" type="xsd:string"/> </xsd:sequence></xsd:complexType> </xsd:element> DTD <!ELEMENT article (title,initPage,endPage,author)> <!ELEMENT title (#PCDATA)> <!ELEMENT initPage (#PCDATA)> <!ELEMENT endPage (#PCDATA)> <!ELEMENT author (#PCDATA)>
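Both the schema and the DTD above require an `article` to contain `title`, `initPage`, `endPage`, and `author` in that order. The following sketch hand-codes that one rule to show what such a constraint means; it is a toy stand-in, not schema or DTD validation, and `ARTICLE_MODEL` and `matches_article_model` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Child sequence required by <!ELEMENT article (title,initPage,endPage,author)>.
ARTICLE_MODEL = ["title", "initPage", "endPage", "author"]


def matches_article_model(article):
    """True if the element is an article whose children match the sequence."""
    child_tags = [child.tag for child in article]
    return article.tag == "article" and child_tags == ARTICLE_MODEL


good = ET.fromstring(
    "<article><title>XML Research Issues</title>"
    "<initPage>1</initPage><endPage>5</endPage>"
    "<author>Tom Hanks</author></article>"
)
# Well formed, but initPage and endPage are missing.
bad = ET.fromstring(
    "<article><title>XML Research Issues</title>"
    "<author>Tom Hanks</author></article>"
)
```

A real validator generalizes this idea to every declared element and, in the case of XML Schema, also checks the declared data types.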
DOM(1) • Characteristics • Hierarchical (tree) object model for XML documents • Associates a list of children with every node • Preserves the sequence of the elements in the XML document [Figure: an XML document shown as a DOM tree, with sigmodRecord at the root, issue below it, then volume, number, and articles, and title, initPage, and endPage at the bottom]
DOM(2) • DOM interfaces • Node : The base data type of the DOM. • Element : The vast majority of the objects you’ll deal with are Elements. • Attr : Represents an attribute of an element. • Text : The actual content of an Element or Attr. • Document : Represents the entire XML document
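The interfaces listed above map directly onto node objects in most DOM implementations. As a minimal sketch, the following uses Python's standard `xml.dom.minidom`; the input string is a shortened fragment of the sigmodRecord example, and the variable names are illustrative.

```python
from xml.dom import Node, minidom

doc = minidom.parseString(
    '<authors><author AuthorPosition="00">Tom Hanks</author></authors>'
)

root = doc.documentElement                       # Element: <authors>
author = root.getElementsByTagName("author")[0]  # Element: <author>
attr = author.getAttributeNode("AuthorPosition") # Attr
text = author.firstChild                         # Text

# Every object is a Node; nodeType tells the concrete interfaces apart.
node_types = {
    "document": doc.nodeType == Node.DOCUMENT_NODE,
    "element": author.nodeType == Node.ELEMENT_NODE,
    "attr": attr.nodeType == Node.ATTRIBUTE_NODE,
    "text": text.nodeType == Node.TEXT_NODE,
}
```

Here `attr.value` holds the attribute's string and `text.data` holds the character content of the element.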
SAX(1) • DOM : expensive to materialize for a large XML collection • Characteristics • Event-driven : fires an event for every start tag and end tag • Does not require building the full document tree in memory • Enables custom object model building [Figure: the application creates a parser and a document handler; during event-driven parsing the parser calls startDocument(), startElement(), characters(), endElement(), and endDocument() on the handler, which feeds results back to the application]
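The event flow described on this slide can be sketched with Python's standard `xml.sax` module. Python follows the SAX2 naming, so the handler base class is `ContentHandler` rather than `DocumentHandler`; the `TitleCollector` class, the `events` log, and the sample input are assumptions made for illustration.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Logs every event and keeps only the text of <title> elements."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, name, attrs):
        self.events.append("startElement:" + name)
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is buffered.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        self.events.append("endElement:" + name)
        if name == "title":
            self.titles.append("".join(self._buffer).strip())
            self._in_title = False

    def endDocument(self):
        self.events.append("endDocument")


SOURCE = (
    b"<articles><article><title> XML Research Issues</title>"
    b"<initPage>1</initPage></article></articles>"
)

handler = TitleCollector()
xml.sax.parseString(SOURCE, handler)
```

The handler never holds more than the current title in memory, which is the point of the event-driven model: the application builds only the object model it needs.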
SAX(2) • The SAX API defines four interfaces for handling events • EntityResolver • DTDHandler • DocumentHandler • ErrorHandler • All of these interfaces are implemented by HandlerBase.
DOM vs SAX(1/3) • Why use DOM? • Need to know a lot about the structure of a document • Need to move parts of the document around • Need to use the information in the document more than once • Why use SAX? • Only need to extract a few elements from an XML document
DOM vs SAX(2/3) <book id="1"><verse> Sing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans. Many a brave soul did it send hurrying down to Hades, and many a hero did it yield a prey to dogs and vultures, for so were the counsels of Jove fulfilled from the day on which the son of Atreus, king of men, and great Achilles, first fell out with one another.</verse><verse> And which of the gods was it that set them on to quarrel? It was the son of Jove and Leto; for he was angry with the king and sent a pestilence upon ... • Loading a text of this size into a DOM tree would take a lot of memory • Reading it through the SAX API would be much more efficient
DOM vs SAX(3/3) ... <address><name> <first-name>Mary</first-name> <last-name>McGoon</last-name> </name><street>1401 Main Street</street> <city>Anytown</city> <state>NC</state> <zip>34829</zip> </address> <address> <name>….. <street> ….. </address> <address> <name>….. <street> ….. </address> • Suppose we are parsing an XML document containing 10,000 addresses and want to sort them by last name • DOM would automatically store all of the data • We could use DOM functions to move the nodes in the DOM tree
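The sorting scenario can be sketched with Python's standard `xml.dom.minidom`. The slides show only one complete address, so the wrapping `<addresses>` root and the second entry are invented for illustration, as are the names `last_name` and `sorted_names`.

```python
from xml.dom import minidom

# Hypothetical input: the root element and the second address are made up.
SOURCE = """<addresses>
<address><name><first-name>Mary</first-name><last-name>McGoon</last-name></name>
<city>Anytown</city></address>
<address><name><first-name>Al</first-name><last-name>Abbot</last-name></name>
<city>Othertown</city></address>
</addresses>"""


def last_name(address):
    """Return the text of the first <last-name> under an <address> element."""
    node = address.getElementsByTagName("last-name")[0]
    return node.firstChild.data


doc = minidom.parseString(SOURCE)
root = doc.documentElement

# Skip whitespace text nodes and keep only the <address> elements.
addresses = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]

# appendChild on a node that is already in the tree moves it, so appending
# in sorted order reorders the addresses in place.
for address in sorted(addresses, key=last_name):
    root.appendChild(address)

sorted_names = [last_name(a) for a in root.getElementsByTagName("address")]
```

With SAX the application would have to collect and store every address itself before sorting, which is why the tree that DOM already holds is convenient here.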
Summary • XML • eXtensible Markup Language • A data format for storing structured and semi-structured text • Physical/logical structure • DTD & XML Schema • Establish formal document structure rules • DOM & SAX API • DOM: use when you need to know a lot about the structure of a document • SAX: use when you need to extract only a few elements from an XML document
Online Resources • XML tutorial • http://www.xml.com • http://www.w3c.org • http://www.w3schools.com/ • http://www.xmltraining.com/course-search-xml+online+tutorials • http://xmlfiles.com/