Learn about XML, a document syntax standard for electronic data exchange and storage. Understand its history, design goals, and the rules for creating well-formed XML documents.
MSc IT
UFCE8K-15-M Data Management
Prakash Chatterjee
Room 3P16
prakash.chatterjee@uwe.ac.uk
http://www.cems.uwe.ac.uk/~pchatter/2010/dm
Lecture 10: XML & XML Databases
Definition
"Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By construction, XML documents are conforming SGML documents."
Source: Extensible Markup Language (XML) 1.0 (Third Edition), W3C Recommendation, 04 February 2004
UFIE8K-15-M Data Management 2008
So what is it really?
• A document syntax (markup) standard for text documents that is simple and open (non-proprietary), used for electronic data exchange and storage.
• It is flexible and extensible (the X in XML) because it allows users to create their own vocabularies (new markup languages): there is no fixed set of tags as in HTML or XHTML.
• XML documents contain only data delimited by tags, with no formatting instructions or style.
A little history
• Developed by an XML Working Group formed under the auspices of the World Wide Web Consortium (W3C) in 1996.
• A subset of SGML (Standard Generalized Markup Language), originally designed to meet the challenges of large-scale electronic publishing.
• XML is now adopted in fields as diverse as law, healthcare, insurance, multimedia, web publishing, EDI, telecommunications, aeronautics, engineering, software, hospitality, tourism, retail and stock trading, among others.
Design goals
The original design goals for XML were:
- that it should be straightforwardly usable over the Internet.
- that it should support a wide variety of applications.
- that it be compatible with SGML.
- that it should be easy to write programs which process XML documents.
- that the number of optional features in XML were to be kept to the absolute minimum, ideally zero.
- that XML documents should be human-legible and reasonably clear.
- that the XML design would be prepared quickly.
- that the design of XML would be formal and concise.
- that XML documents would be easy to create.
- that terseness in XML markup was to be of minimal importance.
Example XML document

    <?xml version="1.0" encoding="UTF-8"?>
    <patient nhs-no="7503557856">
      <name>
        <first>Joseph</first>
        <middle>Michael</middle>
        <last>Bloggs</last>
        <previous />
        <preferred>Joe</preferred>
      </name>
      <title>Mr</title>
      <address>
        <street1>2 Gloucester Road</street1>
        <street2 />
        <street3 />
        <city>Bristol</city>
        <county>Avon</county>
        <postcode>BS2 4QS</postcode>
      </address>
      <tel>
        <home>0117 9541054</home>
        <mobile>07710 234674</mobile>
      </tel>
      <email>joe.bloggs@email.com</email>
      <fax />
    </patient>
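As a rough illustration of how a program might read a record like this, the sketch below parses a shortened copy of the patient document with Python's standard `xml.etree.ElementTree` module. The choice of Python and the variable names are assumptions made for this example; they are not part of the lecture material.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the patient record (only some elements kept).
PATIENT_XML = """<?xml version="1.0" encoding="UTF-8"?>
<patient nhs-no="7503557856">
  <name>
    <first>Joseph</first>
    <last>Bloggs</last>
    <previous />
  </name>
  <address>
    <city>Bristol</city>
    <postcode>BS2 4QS</postcode>
  </address>
  <fax />
</patient>"""

# Passing bytes lets the parser honour the encoding declaration itself.
root = ET.fromstring(PATIENT_XML.encode("utf-8"))

nhs_no = root.get("nhs-no")            # attribute of the root element
first = root.findtext("name/first")    # text of a nested element
city = root.findtext("address/city")
fax = root.find("fax").text            # an empty element has no text (None)

print(root.tag, nhs_no, first, city, fax)
```

Attributes are read with `get`, element content with `findtext`, and an empty element such as `<fax />` is still present in the tree even though it carries no text.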
Other formats

Pipe-delimited:
    nhs-no|first|middle|last|previous|preferred|………………. |email|fax
    7503557856|Joseph|Michael|Bloggs||Joe|………………….|joe.bloggs@email.com|

Relational table:
    Patient
    nhs-no      first   middle
    7503557856  Joseph  Michael
Tree view of example XML document (all XML documents are hierarchical in structure)
[Figure: tree with patient as the root element, nhs-no (7503557856) as its attribute, and child elements including name, title, address, tel and fax, each branching down to its own child elements and content. Key: element, attribute, content.]
Well-formed XML documents (1)
Every XML document must be well-formed and must therefore adhere to the following rules (among others):
• Every start-tag must have a matching end-tag.
• Elements may nest but must not overlap.
    <name>Anna<em>Coffey</em></name> - √
    <name><em>Anna</name>Coffey</em> - ×
• There must be exactly one root element.
• Attribute values must be quoted.
• An element must not have two attributes with the same name.
• Comments and processing instructions may not appear inside tags.
• No unescaped < or & signs may occur in the character data of an element.
Note: An XML document may be well-formed but not valid. A valid document requires a declaration that identifies a Document Type Definition (DTD) or Schema that the document conforms to. This ensures that the document meets the grammar rules for each of its elements and attributes, their order and the values that are allowed. A validating parser can check the document to ensure these rules are met. We will look at XML Schemas in some detail in the next lecture.
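A minimal sketch of checking well-formedness in practice, assuming Python's standard `xml.etree.ElementTree` parser. This parser is non-validating: it rejects documents that break the well-formedness rules but does not check them against a DTD or Schema. The helper name `is_well_formed` and the sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One sample per rule; only the first one obeys all of them.
samples = {
    "nested": "<name>Anna<em>Coffey</em></name>",
    "overlapping": "<name><em>Anna</name>Coffey</em>",
    "unquoted attribute": "<patient nhs-no=7503557856></patient>",
    "two roots": "<a></a><b></b>",
    "unescaped ampersand": "<firm>Bloggs & Co</firm>",
}

results = {label: is_well_formed(xml) for label, xml in samples.items()}

for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```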
Well-formed XML documents (2)
• Element names are case sensitive - <NAME>, <name>, <Name> & <NaMe> are four different element types.
• No white space in element names - <First Name> not allowed; <First_Name> OK.
• Element names cannot start with the letters "XML" or "xml" (in any combination of case) – these are reserved.
• Element names must start with a letter or an underscore.
• Element names cannot start with a number, but numbers may be embedded within an element name - <2you> not allowed; <me2you> is OK.
• Attribute names are constrained by the above rules for element names.
Entity references are used to substitute specific characters. There are five predefined entities built into XML:

    Entity   Char  Notes
    &amp;    &     Use in place of a literal & in character data and attribute values. Entity references are not recognised inside processing instructions.
    &lt;     <     Use in place of a literal < in character data and attribute values.
    &gt;     >     Use after ]] in normal text.
    &quot;   "     Use inside attribute values quoted with ".
    &apos;   '     Use inside attribute values quoted with '.
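The predefined entities are usually inserted by a library rather than by hand. The sketch below uses `escape` and `quoteattr` from Python's standard `xml.sax.saxutils` module; the sample strings and the `note`/`label` names are made up for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw_text = "Bloggs & Sons <Bristol>"
raw_attr = 'say "hi"'

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped_text = escape(raw_text)

# quoteattr() escapes the value and wraps it in whichever quote
# character avoids a clash with quotes inside the value.
quoted_attr = quoteattr(raw_attr)

xml_text = f"<note label={quoted_attr}>{escaped_text}</note>"
print(xml_text)

# Parsing the result turns the entity references back into characters.
parsed = ET.fromstring(xml_text)
print(parsed.text, parsed.get("label"))
```

Because the parser resolves the references again on reading, the application sees the original characters and never the entity names.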
XML Namespaces
Namespaces serve two functions in the XML specification:
• To distinguish between elements and attributes from two different vocabularies with different meanings that might share the same name, and hence avoid naming collisions.
• To group all the related elements and attributes from a single XML application together so that software can easily recognise them.
Consider the following fragments from two different documents:
    <name>Bernadette Coffey</name>
    <name>Hegel in a Nutshell</name>
The first <name> element refers to the name of a person and the second to the name of a book. If we were to build a merged document (say Bernadette's reading list) we would have a collision, since there are two <name> elements with different meanings. Namespaces can distinguish between the two by using prefixes:
    <student:name>Bernadette Coffey</student:name>
    <book:name>Hegel in a Nutshell</book:name>
Each prefix is bound to a uniform resource identifier (URI) that uniquely identifies the namespace. The binding is made with an xmlns attribute on the element or one of its ancestors, e.g.
    xmlns:student="http://www.uwe.ac.uk/CEMS/Students"
    xmlns:book="http://www.uwe.ac.uk/Library/Books"
BUT – don't confuse URIs with URLs. URLs are the subset of URIs that locate a resource on a network: a URL is a path to a file or resource on the Web. A URI used as a namespace is simply a unique name; nothing needs to exist at that address.
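A small sketch of how the two name elements stay distinct once they are in namespaces, using Python's standard `xml.etree.ElementTree`. The wrapping `list` root element and the short prefixes `s` and `b` in the lookup dictionary are assumptions for the example; only the two namespace URIs come from the slide.

```python
import xml.etree.ElementTree as ET

READING_LIST = """
<list xmlns:student="http://www.uwe.ac.uk/CEMS/Students"
      xmlns:book="http://www.uwe.ac.uk/Library/Books">
  <student:name>Bernadette Coffey</student:name>
  <book:name>Hegel in a Nutshell</book:name>
</list>"""

root = ET.fromstring(READING_LIST)

# The prefixes used for lookup are local to this code; only the URIs
# have to match the ones declared in the document.
ns = {
    "s": "http://www.uwe.ac.uk/CEMS/Students",
    "b": "http://www.uwe.ac.uk/Library/Books",
}

student_name = root.findtext("s:name", namespaces=ns)
book_name = root.findtext("b:name", namespaces=ns)

# Internally each tag is stored as {namespace-URI}local-name.
tags = [child.tag for child in root]

print(student_name, "|", book_name)
print(tags)
```

The parser identifies each element by its URI plus local name, so the prefix text itself (student, book, s, b) carries no meaning beyond pointing at the URI.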
XML Applications (1)
XSLT – Extensible Stylesheet Language Transformations is an XML application for specifying rules which transform one XML document into another document. It uses template rules in the stylesheet to match patterns in the input document; when a match is found, it writes the template from the rule to the output tree.
XML Applications (2)
XLink – the XML Linking Language. It defines how one document links to another. The linking work is divided into two parts: XLink and XPointer (which identifies a particular part of a document, comparable to anchors in HTML).
XPath – a non-XML language for identifying particular parts of an XML document. It is designed to be used in conjunction with the Extensible Stylesheet Language Transformations (XSLT) and XPointer.
XForms – the W3C's name for a specification of Web forms that can be used with a wide variety of platforms including desktop computers, handhelds, information appliances and even paper.
XQuery – a query language for XML that extracts data from real or virtual documents, providing the needed interaction between the Web and databases.
SVG – Scalable Vector Graphics. An XML application which describes vector graphics for distribution and display over the web, as an alternative to raster image formats such as JPEG, GIF and PNG.
Other applications (and the list is growing rapidly) include XML Signature, XML Encryption, Web Services (SOAP, WSDL & UDDI), XML Key Management and Synchronized Multimedia Integration Language (SMIL).
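To give a feel for XPath, the sketch below runs a few path expressions through Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The `patients` wrapper element and the second patient record are hypothetical data added so that the queries have something to select between.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection; the second record is invented for the example.
PATIENTS = """
<patients>
  <patient nhs-no="7503557856">
    <name><first>Joseph</first><last>Bloggs</last></name>
    <tel><home>0117 9541054</home><mobile>07710 234674</mobile></tel>
  </patient>
  <patient nhs-no="0000000000">
    <name><first>Anna</first><last>Coffey</last></name>
    <tel><home>0117 0000000</home></tel>
  </patient>
</patients>"""

root = ET.fromstring(PATIENTS)

# Child steps: every <last> under patient/name.
last_names = [e.text for e in root.findall("./patient/name/last")]

# Descendant step: every <mobile> element anywhere below the root.
mobiles = [e.text for e in root.findall(".//mobile")]

# Attribute predicate: the patient with a given nhs-no.
joe = root.find("./patient[@nhs-no='7503557856']")
joe_first = joe.findtext("name/first")

print(last_names, mobiles, joe_first)
```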
XML Vocabularies
XHTML – the Extensible HyperText Markup Language, which reproduces and extends HTML. An XHTML document conforms to all the rules required of a well-formed XML document and drops many of the weak features of HTML, e.g. the <font> tag.
WML – the Wireless Markup Language is a strict HTML-type vocabulary for use with wireless-enabled devices such as mobile phones, PDAs & pagers.
InkML – for representing digital ink data that is input with a pen.
MathML – for the inclusion of mathematical formulas in web pages and machine-to-machine communications.
CML – Chemical Markup Language is an XML vocabulary for representing molecular and chemical information. A formula can be transformed into a graphic representation for display on a web page.
Other standardized vocabularies include the Banking Industry Technology Secretariat (BITS); Interactive Financial Exchange (IFX); Bank Internet Payment System (BIPS); Telecommunications Interchange Markup (TIM); Common Business Library (xCBL); Electronic Business XML Initiative (ebXML); Product Data Markup Language (PDML); Financial Information eXchange protocol (FIX); the Text Encoding Initiative (TEI) and hundreds of others.
Relational v. XML approach to data
Approaches to structuring XML (1)
• Storing XML in VARCHAR or BLOB columns
    - offers XPath/XQuery but not much else
• Storing XML in shredded form
    - the XML document is decomposed according to specified rules into one or more relational tables and reconstructed on retrieval
Pros and when to use shredding:
    - the XML schema is stable
    - XML is only used as a transfer format and document structure is not relevant
    - incoming XML data must be integrated with existing relational data
    - the structure of the XML documents is simple enough to allow easy mapping
    - query performance is more important than insert performance
Cons and when to avoid it:
    - document structure is too complex to be mapped into tables
    - insert performance is important
    - document structure needs to be preserved
    - full retrieval of documents is frequent
    - the XML schema changes frequently or does not exist
    - data in the XML document is sparse
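A minimal sketch of the shredding idea, assuming Python's standard `sqlite3` and `xml.etree.ElementTree` modules. The table layout (`patient`, `tel`), the column names and the `shred`/`rebuild` helpers are invented for this example and stand in for whatever mapping rules a real system would specify.

```python
import sqlite3
import xml.etree.ElementTree as ET

PATIENT_XML = """
<patient nhs-no="7503557856">
  <name><first>Joseph</first><last>Bloggs</last></name>
  <tel><home>0117 9541054</home><mobile>07710 234674</mobile></tel>
</patient>"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (nhs_no TEXT PRIMARY KEY, first TEXT, last TEXT);
    CREATE TABLE tel (nhs_no TEXT REFERENCES patient(nhs_no),
                      kind TEXT, number TEXT);
""")


def shred(xml_text: str) -> None:
    """Decompose one patient document into rows of the two tables."""
    root = ET.fromstring(xml_text)
    nhs_no = root.get("nhs-no")
    conn.execute("INSERT INTO patient VALUES (?, ?, ?)",
                 (nhs_no, root.findtext("name/first"),
                  root.findtext("name/last")))
    for number in root.find("tel"):
        conn.execute("INSERT INTO tel VALUES (?, ?, ?)",
                     (nhs_no, number.tag, number.text))


def rebuild(nhs_no: str) -> str:
    """Reconstruct an XML document for one patient from the tables."""
    first, last = conn.execute(
        "SELECT first, last FROM patient WHERE nhs_no = ?",
        (nhs_no,)).fetchone()
    root = ET.Element("patient", {"nhs-no": nhs_no})
    name = ET.SubElement(root, "name")
    ET.SubElement(name, "first").text = first
    ET.SubElement(name, "last").text = last
    tel = ET.SubElement(root, "tel")
    rows = conn.execute(
        "SELECT kind, number FROM tel WHERE nhs_no = ? ORDER BY rowid",
        (nhs_no,))
    for kind, number in rows:
        ET.SubElement(tel, kind).text = number
    return ET.tostring(root, encoding="unicode")


shred(PATIENT_XML)
tel_rows = conn.execute(
    "SELECT kind, number FROM tel ORDER BY kind").fetchall()
rebuilt = rebuild("7503557856")
print(tel_rows)
print(rebuilt)
```

Once shredded, the data can be queried with ordinary SQL alongside other relational data, but the rebuilt document only contains what the mapping kept: the original whitespace and anything not mapped to a column is lost, which is the trade-off listed under the cons.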
Approaches to structuring XML (2)
• Native XML database
    - a native XML database is a system which processes and stores XML data using the XML data model. A true native XML database system uses trees of nodes as its fundamental storage and processing model.
Pros:
    - no mapping between data models
    - document structure and order are preserved
    - XQuery & XPath can be processed without translation to SQL
    - no parsing required at query or update time
    - documents with or without schemas can be stored in the native store without the need to adjust for complex mappings
    - sub-document update is fast
When not to use it:
    - the document collection isn't order centered
    - applications must run XQueries that can easily be expressed in SQL
    - there is no need to construct XML documents that are different from the ones that were inserted into the database