620 likes | 633 Views
Learn about XML, a self-describing language for specifying the contents and formatting of web pages and transferring information between different programs and computers.
E N D
XML:Extensible Markup Language 2019/05/07 Garcia
XML: Extensible Markup Language HTML • Write the documents that specify the contents and formatting of Web pages. • Static Web pages XML • Provide information about the structure and meaning of the data in the Web pages. • Transfer information in self-describing textual files among various programs on different computers when needed. Self-describing documents: Provide descriptive information.
Outline 13.1 Structured, Semistructured, and Unstructured Data 13.2 XML Hierarchical (Tree) Data Model 13.3 XML Documents, DTD, and XML Schema 13.4 Storing and Extracting XML Documents from Databases 13.5 XML Languages 13.6 Extracting XML Documents from Relational Databases 13.7 XML/SQL: SQL Functions for Creating XML Data 13.8 Summary
13.1 Structured, Semistructured, and Unstructured Data All data follows the structures and constraints specified in the schema, e.g. Figure5.6. Structured data:
13.1 Structured, Semistructured, and Unstructured Data This data may have a certain structure, but not all the information collected will have the identical structure. Semistructured data: Author names Books or technical reports Title Bibliographic References Citation Research articles in journals or conference proceedings Proceedings Page numbers Complete journal issues or conference proceedings ………… Research Project
13.1 Structured, Semistructured, and Unstructured Data ◎The difference between structured and semistructured data ※ How the schema constructs (e.g. the names of attributes, relationships, and entity types) are handled. • Design the database schema carefully in order to define the database structure. • Each record follows the same format as the other records. Structured data Semistructured data • No prodefined schema. • each data object can have different attributes not known in advance. • ⇒self-describing data.
13.1 Structured, Semistructured, and Unstructured Data Labels(tags): The schema names: the names of attributes, object types (or entity types or classes), and relationships. Data from Figure5.6
13.1 Structured, Semistructured, and Unstructured Data The internal nodes: Individual objects or composite attributes. The leaf nodes represent actual data values of simple.
13.1 Structured, Semistructured, and Unstructured Data The leaf nodes: Actual data values of simple.
13.1 Structured, Semistructured, and Unstructured Data • Similarity: Ability to represent complex objects and nested structures. • Differences: • The schema information is intermixed with the objects and their data values in the same data structure. • There is no requirement for a predefined schema to which the data objects must conform.
13.1 Structured, Semistructured, and Unstructured Data There is very limited indication of the type of data. e.g. Web pages in HTML that contain some data. Unstructured data: <HTML> <HEAD> ... </HEAD> </HTML> <...> HTML tag <…> start tag matches to </…> end tag ⃝ Format the documents ✕ Specify the meaning of the various data elements.
The boundaries of the document <HTML> <HEAD> ... </HEAD> <BODY> <H1>List of company projects and the employees in each project</H1> <H2>The ProductX project:</H2> <TABLE width=“100%” border=0 cellpadding=0 cellspacing=0> <TR> <TD width=“50%”><FONT size=“2” face=“Arial”>John Smith:</FONT></TD> <TD>32.5 hours per week</TD> </TR> <TR> <TD width=“50%”><FONT size=“2” face=“Arial”>Joyce English:</FONT></TD> <TD>20.0 hours per week</TD> </TR> </TABLE> <H2>The ProductYproject:</H2> <TABLE width=“100%” border=0 cellpadding=0 cellspacing=0> <TR> <TD width=“50%”><FONT size=“2” face=“Arial”>John Smith:</FONT></TD> <TD>7.5 hours per week</TD> </TR> <TR> <TD width=“50%”><FONT size=“2” face=“Arial”>Joyce English:</FONT></TD> <TD>20.0 hours per week</TD> </TR> <TR> <TD width=“50%”><FONT size=“2” face=“Arial”>Franklin Wong:</FONT></TD> <TD>10.0 hours per week</TD> </TR> </TABLE> ... </BODY> </HTML> Various commands that will be used elsewhere in the document e.g. script functions, formatting styles The following text is to be displayed as a table Be displayed as a level 2 heading The individual table data elements Each table row The document text, the markup tags and references to other objects
13.2 XML Hierarchical (Tree) Data Model Basic object XML XML document construct Attributes Elements ………… Complex elements are constructed from other elements hierarchically Simple elements contain data values The term attribute in XML is notused in the same manner as is customary in database terminology.
13.2 XML Hierarchical (Tree) Data Model The leaf nodes⇒ The simple elements:<Name>, <Number>, <Location>, <Dept_no>, <Ssn>, <Last_name>, <First_name>, and <Hours>. The internal nodes⇒The complex elements: <Projects>, <Project>, and <Worker>.
13.2 XML Hierarchical (Tree) Data Model Three main types of XML documents • Data-centric XML documents Have many small data items that follow a specific structure. ⇒semistructured data or structured data. • Document-centric XML documents Documents with large amounts of text and few or no structured data elements. • Hybrid XML documents May have parts that contain structured data and other parts that are predominantly textual or unstructured.
13.2 XML Hierarchical (Tree) Data Model • Schemaless XML documents:Do not follow a predefined schema of element names and corresponding tree structure. • If an XML document conforms to a predefined XML schema or DTD (13.3) • ⇒structured data • XML allows documents that do not conform to any schema • ⇒semistructured data and are schemaless XML documents • When the value of the standalone attribute in an XML document is yes • ⇒standalone and schemaless
13.3 XML Documents, DTD, and XML Schema 13.3.1 Well-Formed and Valid XML Documents and XML DTD Conditions for the well-formed XML document • Must start with an XML declaration to indicate the version of XML being used • Must follow the syntactic guidelines of the tree data model. • ⇒A single root element, and every element must include a matching • pair of start and end tags within the start and end tags of the parent • element.
13.3.1 Well-Formed and Valid XML Documents and XML DTD DOM (Document Object Model) Allows programs to manipulate the resulting tree representation. However, the whole document must be parsed beforehand. SAX (Simple API for XML) /Event-based processing Allows processing of XML documents on the fly by notifying the processing program through callbacks whenever a start or end tag is encountered. Easier to process large documents and streaming XML documents.
13.3.1 Well-Formed and Valid XML Documents and XML DTD A valid XML document • The element names used in the start and end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file. • Any valid documents conforming to this DTD should follow the specified structure.
13.3.1 Well-Formed and Valid XML Documents and XML DTD A special syntax for specifying DTD files A name is given to the root tag of the document +required multival- ued (repeating) element: Can be repeated one or more times. *optional multivalued (repeating) element: Can be repeated zero or more times. ? optional single-valued (nonrepeating) element: Can be repeated zero or one times. Required single-valued (nonrepeating) element: An element appearing without any of the preceding three symbols must appear exactly once in the document.
13.3.1 Well-Formed and Valid XML Documents and XML DTD The parentheses include names of other elements ⇒the children of the element Specify attributes of the element The parentheses include #PCDATA or one of the other data types ⇒a leaf node • Attributes can also be used to hold the values of simple data elements of type #PCDATA. • Parentheses can be nested when specifying elements. • A bar symbol ( e1| e2) specifies that either e1or e2can appear in the document.
13.3.1 Well-Formed and Valid XML Documents and XML DTD Check whether an XML conform to a DTD <?xml version = “1.0” standalone = “no”?> <!DOCTYPE Projects SYSTEM “proj.dtd”> DTD file’s name include the DTD document text at the beginning of the XML document itself.
13.3.1 Well-Formed and Valid XML Documents and XML DTD Similar to a foreign key This attribute is optional and provide a default value. Similar to a collection or list of foreign key
13.3.1 Well-Formed and Valid XML Documents and XML DTD The limitations of XML DTD • The data types in DTD are not very general. • DTD requires specialized processors for special syntax. • All DTD elements must follow the specified ordering, unordered elements are not permitted.
13.3.2 XML Schema XML instance document or XML document A regular XML document that contains both tag names and data values. XML schema document A document that specifies an XML schema. No data values.
13.3.2 XML Schema Schema descriptions and XML namespaces XML namespace: Defines the set of commands (names) that can be used. The file name is assigned to the variable xsd (XML schema description) using the attribute xmlns (XML namespace), and this variable is used as a prefix to all XML schema commands (tag names).
13.3.2 XML Schema Annotations, documentation, and language used Provide comments and other descriptions in the XML document. Specify the language being used.
13.3.2 XML Schema Elements and types • The structure of the root element(company) can be specified in xsd:complexType. • This is further specified to be a sequence of departments, employees, and projects using the xsd:sequence structure of XML schema. • Another way to specify an XML schema in Section 13.6.
13.3.2 XML Schema First-level elements in the COMPANY database • First-level elements: department, employee and project. • Empty elements: A tag has only attributes and no further subelements or data within it. • Empty elements are ended with “/>” directly.
13.3.2 XML Schema Specifying element type and minimum and maximum occurrences. If no type attribute is specified ⇒ the element structure can be defined directly following the tag. • Specify the type and multiplicity of each element in any document that conforms to the schema specifications. onforms to the schema specifications. • If specify a type attribute in an xsd:element • ⇒the structure of the element must be described separately.
13.3.2 XML Schema Specifying element type and minimum and maximum occurrences. • Specifying lower and upper bounds on the number of occurrences of an element in any XML document. • If they are not specified • ⇒ the default is exactly one occurrence.
13.3.2 XML Schema Specifying keys Specifies elements that correspond to unique attributes in a relational database. Must specify xsd:selectorand xsd:fieldtags for it to identify the element type that contains the unique element and the element name within it that is unique via the xpathattribute.
13.3.2 XML Schema Specifying keys Primary keys
13.3.2 XML Schema Specifying keys Foreign keys The referenced primary key
13.3.2 XML Schema Specifying the structures of complex elements via complex types (optional) Database attributes where null is allowed ⇒ minOccurs = 0 Multivalued database attributes ⇒ maxOccurs = “unbounded” Each element is given a name and type via the attributes name and type of xsd:element.
13.3.2 XML Schema Composite (compound) attributes These could have been directly embedded within their parent elements.
13. 4 Storing and Extracting XML Documents from Databases Approaches: • Using a file system or a DBMS to store the documents as text. • Work for the DBMS which has a special module for document processing and storing schemaless and document-centric XML documents. • 2. Using a DBMS to store the document contents as data elements. • Work for storing a collection of documents that follow a specific XML DTD or XML schema. • Use mapping algorithm to design a database schema and re-create the XML documents. • If all elements have IDs. • ⇒A table with attributes XDOC(Cid(child), Pid(parent), Etag(the name of the child), Val(the value)).
13.4 Storing and Extracting XML Documents from Databases Approaches: • 3. Designing a specialized system for storing native XML data. • Work for all types of XML documents. • Native XML DBMS. Based on the hierarchical (tree) model and include specialized indexing, querying and data compression techniques. • 4. Creating or publishing customized XML documents from preexisting relational databases. • Use a separate middleware software layer to handle the conversions needed between the relational data and the extracted XML documents. • More details in Section13.6
13.5 XML Languages 13.5.1 XPath: Specifying Path Expressions in XML • An XPath expression generally returns a sequence of items(leaf nodes’ values, elements, attributes) that satisfy a certain pattern as specified by the expression. • The names in the XPath expression = Node names in the XML document tree, possibly with additional qualifier condition. • Two main separators • /: Must appear as a direct child of the previous (parent) tag. • //:Can appear as a descendant of the previous tag at any level. • @:To refer to an attribute name.
13.5.1 XPath: Specifying Path Expressions in XML /company • Returns the company root node and all its descendant nodes. • ⇒The whole XML document. • The XPath query generally includes the file name. • ⇒Allows us to specify any local file name or even any path name that specifies a file on the Web. e.g. The COMPANY XML document is stored at www.company.com/info.XML /company ⇒ doc(www.company.com/info.XML)/company
13.5.1 XPath: Specifying Path Expressions in XML /company/department • Returns all department nodes and their descendant subtrees. • The XPath result that returns multiple nodes will do so in the same order in which the nodes are ordered in the document tree.
13.5.1 XPath: Specifying Path Expressions in XML //employee [employeeSalarygt 70000]/employeeName • Returns all employeeName nodes that are direct children of an employee node • Useful for schemaless XML documents or for documents with many nested levels of nodes.
13.5.1 XPath: Specifying Path Expressions in XML /company/employee [employeeSalarygt 70000]/employeeName • Return the same result as the third one. /company/project/projectWorker [hours ge 20.0] • Returns all projectWorker nodes and their descendant nodes. /company/* • @/*: attributes. • The result can be a sequence of different types of elements.
13.5.1 XPath: Specifying Path Expressions in XML The axes of an XPath expression. • It is possible to move in multiple directions from the current node in the path expression. • Include child of the current node (/), descendent or self at any level of the current node (//), attribute of the current node (@), parent, ancestor (at any level), previous sibling (a node at same level to the left), and next sibling (a node at the same level to the right).
13.5.2 XQuery: Specifying Queries in XML FLWOR expression • FOR <variable bindings to individual nodes (elements)> • zero or more • LET <variable bindings to collections of nodes (elements)> • zero or more • WHERE <qualifier conditions> • most once • ORDER BY <ordering specifications> • most once • RETURN <query result specification> • exactly once
13.5.2 XQuery: Specifying Queries in XML LET $d : = doc(www.company.com/info.xml)FOR $x IN $d/company/project[projectNumber = 5]/projectWorker, $y IN $d/company/employeeWHERE $x/hours gt 20.0 AND $y.ssn = $x.ssnORDER BY $x/hoursRETURN <res> $y/employeeName/firstName, $y/employeeName/lastName, $x/hours </res> • Variables are prefixed with the $ sign. • The FOR clause assigns a variable to range over each of the individual elements in a sequence. • The WHERE clause specifies additional conditions on the selection of items. • The ORDER BY clause specifies that the result elements will be ordered by the value of the hours per week they work on the project in ascending value of hours. • the RETURN clause specifies which elements or attributes should be retrieved from the items that satisfy the query conditions.
13.5.2 XQuery: Specifying Queries in XML The variable $x is bound to each employeeName element that is a child of an employee element, but only for employee elements that satisfy the qualifier that their employeeSalary value is greater than $70,000. It’s an alternative way of retrieving the same elements retrieved by the first query.
13.5.2 XQuery: Specifying Queries in XML Illustrates how a join operation can be performed by using more than one variable. $x variable is bound to each projectWorker. $y variable is bound to each Employee.
13.5.3 Other Languages and Protocols Related to XML • The Extensible Stylesheet Language (XSL) can be used to define how a document should be rendered for display by a Web browser. • The Extensible Stylesheet Language for Transformations (XSLT) can be used to transform one structure into a different structure. • The Web Services Description Language (WSDL) allows for the description of Web Services in XML. • The Simple Object Access Protocol (SOAP) is a platform-independent and programming language-independent protocol for messaging and remote procedure calls. • The Resource Description Framework (RDF) provides languages and tools for exchanging and processing of meta-data (schema) descriptions and specifications over the Web.
13.6 Extracting XML Documents from Relational Databases 13.6.1 Creating Hierarchical XML Views over Flat or Graph-Based Data UNIVERSITY ER schema Suppose that an application needs to extract XML documents for student, course, and grade information from the UNIVERSITY database. The subset of the database